Class Document (0.2.0a0)

Document(
    shards: List[google.cloud.documentai_v1.types.document.Document],
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix: Optional[str] = None,
)

Represents a wrapped Document.

This class hides the complexity of working with the Document protobuf response returned by the BatchProcessDocuments or ProcessDocument methods, and provides convenience methods for searching and extracting information within the Document.

Attributes

Name	Description
`gcs_bucket_name`	`Optional[str]` Optional. The name of the GCS bucket. Format: `gs://bucket/optional_folder/target_folder/` where gcs_bucket_name=`bucket`.
`gcs_prefix`	`Optional[str]` Optional. The prefix of the JSON files in the target_folder. Format: `gs://bucket/optional_folder/target_folder/` where gcs_prefix=`optional_folder/target_folder`. For more information, see https://cloud.google.com/storage/docs/json_api/v1/objects/list.
`pages`	`List[Page]` A list of Pages in the Document.
`entities`	`List[Entity]` A list of Entities in the Document.

Methods

entities_to_bigquery

entities_to_bigquery(
    dataset_name: str, table_name: str, project_id: Optional[str] = None
)

Adds extracted entities to a BigQuery table.

Parameters

Name	Description
`dataset_name`	`str` Required. Name of the BigQuery dataset.
`table_name`	`str` Required. Name of the BigQuery table.
`project_id`	`Optional[str]` Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment.

Returns

Type	Description
`bigquery.job.LoadJob`	The BigQuery LoadJob for adding the entities.

entities_to_dict

entities_to_dict()

Returns a dictionary of the entities in the document.

Returns

Type	Description
`Dict`	The Dict of the entities indexed by type.
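
To illustrate what "indexed by type" can look like, here is a minimal sketch that uses a stand-in `Entity` dataclass rather than the toolbox's own class. The field names `type_` and `mention_text`, the helper name `index_entities_by_type`, and the choice to collect every value into a list are assumptions made for illustration, not confirmed behavior of `entities_to_dict`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entity:
    """Hypothetical stand-in for an extracted entity (not the toolbox class)."""

    type_: str
    mention_text: str


def index_entities_by_type(entities: List[Entity]) -> Dict[str, List[str]]:
    """Group entity text under each entity type, preserving input order."""
    indexed: Dict[str, List[str]] = {}
    for entity in entities:
        # setdefault creates the list the first time a type is seen.
        indexed.setdefault(entity.type_, []).append(entity.mention_text)
    return indexed


sample = [
    Entity("invoice_id", "INV-001"),
    Entity("line_item", "Widget"),
    Entity("line_item", "Gadget"),
]
indexed_sample = index_entities_by_type(sample)
```

Keeping every value in a list means repeated types such as line items need no special case; the actual return shape of `entities_to_dict` may differ.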

from_document_path

from_document_path(document_path: str)

Loads Document from local document_path.

Parameter

Name	Description
`document_path`	`str` Required. The path to the document.json file.

Returns

Type	Description
`Document`	A document from local document_path.

from_documentai_document

from_documentai_document(
    documentai_document: google.cloud.documentai_v1.types.document.Document,
)

Loads Document from local documentai_document.

Parameter

Name	Description
`documentai_document`	`documentai.Document` Required. The Document.proto response.

Returns

Type	Description
`Document`	A document from local documentai_document.

from_gcs

from_gcs(gcs_bucket_name: str, gcs_prefix: str)

Loads Document from Cloud Storage.

Parameters

Name	Description
`gcs_bucket_name`	`str` Required. The gcs bucket. Format: Given `gs://{bucket_name}/{optional_folder}/{operation_id}/` where gcs_bucket_name=`{bucket_name}`.
`gcs_prefix`	`str` Required. The prefix to the location of the target folder. Format: Given `gs://{bucket_name}/{optional_folder}/{target_folder}` where gcs_prefix=`{optional_folder}/{target_folder}`.

Returns

Type	Description
`Document`	A Document loaded from Cloud Storage.
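
Because `from_gcs` takes the bucket name and the prefix as separate arguments, a full `gs://` URI has to be split first. The sketch below shows one way to do that with the standard library only; `split_gcs_uri` is a hypothetical helper and is not part of the toolbox.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket_name, prefix). Hypothetical helper."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with {scheme!r}: {gcs_uri!r}")
    # Everything before the first slash is the bucket; the rest is the prefix.
    bucket_name, _, prefix = gcs_uri[len(scheme):].partition("/")
    if not bucket_name:
        raise ValueError(f"No bucket name found in {gcs_uri!r}")
    # Drop the trailing slash so the prefix matches the documented form.
    return bucket_name, prefix.rstrip("/")


bucket, prefix = split_gcs_uri("gs://bucket/optional_folder/target_folder/")
```

The resulting pair lines up with the parameters described above and could be passed as `from_gcs(gcs_bucket_name=bucket, gcs_prefix=prefix)`.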

get_entity_by_type

get_entity_by_type(target_type: str)

Returns the list of Entities of target_type.

Parameter

Name	Description
`target_type`	`str` Required. The entity type to match.

Returns

Type	Description
`List[Entity]`	A list of Entity matching target_type.

get_form_field_by_name

get_form_field_by_name(target_field: str)

Returns the list of FormFields named target_field.

Parameter

Name	Description
`target_field`	`str` Required. Target field name.

Returns

Type	Description
`List[FormField]`	A list of FormField matching target_field.
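
The lookup can be pictured as a scan over the form fields of every page. The following is a minimal sketch under stated assumptions: `FormField`, `Page`, and `find_form_fields` are hypothetical stand-ins, and exact name equality is assumed, which may not be how the toolbox compares names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormField:
    """Hypothetical stand-in for a form field (not the toolbox class)."""

    field_name: str
    field_value: str


@dataclass
class Page:
    """Hypothetical stand-in for a page holding its own form fields."""

    form_fields: List[FormField] = field(default_factory=list)


def find_form_fields(pages: List[Page], target_field: str) -> List[FormField]:
    """Collect fields from every page whose name equals target_field."""
    return [
        form_field
        for page in pages
        for form_field in page.form_fields
        if form_field.field_name == target_field  # assumption: exact match
    ]


pages = [
    Page([FormField("Name", "Ada"), FormField("Date", "2023-01-05")]),
    Page([FormField("Name", "Grace")]),
]
name_fields = find_form_fields(pages, "Name")
```

Returning a list rather than a single field reflects that the same field name can appear on more than one page.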

search_pages

search_pages(target_string: Optional[str] = None, pattern: Optional[str] = None)

Returns the list of Pages containing target_string or text matching pattern.

Parameters

Name	Description
`target_string`	`Optional[str]` Optional. The string to search for in the page text.
`pattern`	`Optional[str]` Optional. A regular expression to match against the page text.

Returns

Type	Description
`List[Page]`	A list of Pages.
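
The two search modes differ in that `target_string` is a literal substring while `pattern` is a regular expression. The sketch below illustrates that difference on plain strings standing in for page text. `search_page_texts` is a hypothetical helper, and the rule that exactly one of the two arguments must be given is an assumption of this sketch, not documented behavior.

```python
import re
from typing import List, Optional


def search_page_texts(
    page_texts: List[str],
    target_string: Optional[str] = None,
    pattern: Optional[str] = None,
) -> List[int]:
    """Return indexes of pages containing target_string or matching pattern."""
    # Assumption of this sketch: exactly one search criterion per call.
    if (target_string is None) == (pattern is None):
        raise ValueError("Provide exactly one of target_string or pattern.")
    if target_string is not None:
        # Literal substring search.
        return [i for i, text in enumerate(page_texts) if target_string in text]
    # Regular-expression search anywhere in the page text.
    regex = re.compile(pattern)
    return [i for i, text in enumerate(page_texts) if regex.search(text)]


texts = ["Invoice 2023-01-05", "Terms and conditions", "Due 2023-02-01"]
by_string = search_page_texts(texts, target_string="Terms")
by_pattern = search_page_texts(texts, pattern=r"\d{4}-\d{2}-\d{2}")
```

Here the sketch returns page indexes for brevity, whereas `search_pages` returns `Page` objects.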

split_pdf

split_pdf(pdf_path: str, output_path: str)

Splits local PDF file into multiple PDF files based on output from a Splitter/Classifier processor.

Parameters

Name	Description
`pdf_path`	`str` Required. The path to the PDF file.
`output_path`	`str` Required. The path to the output directory.

Returns

Type	Description
`List[str]`	A list of the output PDF files.
