Document(
shards: typing.List[google.cloud.documentai_v1.types.document.Document],
gcs_bucket_name: typing.Optional[str] = None,
gcs_prefix: typing.Optional[str] = None,
gcs_uri: typing.Optional[str] = None,
gcs_input_uri: typing.Optional[str] = None,
)
Represents a wrapped Document.
This class hides the complexities of using the Document protobuf response returned by the BatchProcessDocuments or ProcessDocument methods, and implements convenient methods for searching and extracting information within the Document.
Attributes | |
---|---|
Name | Description |
shards |
List[google.cloud.documentai.Document]
Required. A list of documentai.Document shards of the same Document. Each shard consists of a number of pages in the Document. |
gcs_bucket_name |
Optional[str]
Optional. The name of the Cloud Storage bucket. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_bucket_name={bucket_name}. |
gcs_prefix |
Optional[str]
Optional. The prefix of the JSON files in the target folder. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_prefix={optional_folder}/{target_folder}. For more information, refer to https://cloud.google.com/storage/docs/json_api/v1/objects/list |
gcs_input_uri |
Optional[str]
Optional. The Cloud Storage URI of the original input file. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/{file_name}.pdf |
pages |
List[Page]
A list of Pages in the Document. |
entities |
List[Entity]
A list of un-nested Entities in the Document. |
chunks |
Iterator[documentai.Document.ChunkedDocument.Chunk]
An iterator of document chunks extracted from a Layout Parser. |
document_layout_blocks |
Iterator[documentai.Document.DocumentLayout.DocumentLayoutBlock]
An iterator of document layout blocks extracted from a Layout Parser. |
text |
str
The full text of the Document. |
Methods
convert_document_to_annotate_file_json_response
convert_document_to_annotate_file_json_response() -> str
Converts OCR data from Document.proto to a JSON string of AnnotateFileResponse for the Vision API.
Returns | |
---|---|
Type | Description |
str |
JSON string of TextAnnotations . |
convert_document_to_annotate_file_response
convert_document_to_annotate_file_response() -> (
google.cloud.vision_v1.types.image_annotator.AnnotateFileResponse
)
Converts OCR data from Document.proto to AnnotateFileResponse.proto for the Vision API.
Returns | |
---|---|
Type | Description |
AnnotateFileResponse |
Proto with TextAnnotations . |
entities_to_bigquery
entities_to_bigquery(
dataset_name: str, table_name: str, project_id: typing.Optional[str] = None
) -> google.cloud.bigquery.job.load.LoadJob
Adds extracted entities to a BigQuery table.
Parameters | |
---|---|
Name | Description |
dataset_name |
str
Required. Name of the BigQuery dataset. |
table_name |
str
Required. Name of the BigQuery table. |
project_id |
Optional[str]
Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment. |
Returns | |
---|---|
Type | Description |
bigquery.job.LoadJob |
The BigQuery LoadJob for adding the entities. |
entities_to_dict
entities_to_dict() -> typing.Dict[str, typing.Union[str, typing.List[str]]]
Returns a dictionary of the entities in the Document.
Returns | |
---|---|
Type | Description |
Dict |
The Dict of the entities indexed by type. |
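The return type, Dict[str, Union[str, List[str]]], suggests that a type occurring once maps to a single string while a repeated type maps to a list. The sketch below shows one plausible way such a dictionary could be built from plain (type, text) pairs. The helper `group_by_type` and the sample values are hypothetical; this is not the library's implementation.

```python
from typing import Dict, Iterable, List, Tuple, Union


def group_by_type(
    pairs: Iterable[Tuple[str, str]],
) -> Dict[str, Union[str, List[str]]]:
    """Hypothetical helper: index (type, text) pairs by type.

    A type seen once maps to a str; a type seen again is promoted to a list.
    """
    result: Dict[str, Union[str, List[str]]] = {}
    for entity_type, text in pairs:
        existing = result.get(entity_type)
        if existing is None:
            # First occurrence: store the bare string.
            result[entity_type] = text
        elif isinstance(existing, list):
            # Already a list: append the new value.
            existing.append(text)
        else:
            # Second occurrence: promote the string to a list.
            result[entity_type] = [existing, text]
    return result


grouped = group_by_type(
    [("invoice_id", "A-17"), ("line_item", "Pens"), ("line_item", "Paper")]
)
print(grouped)
```

Callers that consume a dictionary of this shape need to handle both the str and the List[str] case for each key.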
export_hocr_str
export_hocr_str(title: str) -> str
Exports the Document as an hOCR string.
The id of each object has the following format:
object_{page_index}_...
For example, words have the following id format:
word_{page_index}_{block_index}_{paragraph_index}_{line_index}_{word_index}
Parameter | |
---|---|
Name | Description |
title |
str
Required. The title for hocr_page and head. |
Returns | |
---|---|
Type | Description |
str |
A string hOCR version of the Document |
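The word id format described above can be illustrated with a small sketch that builds and parses such ids. The helpers `hocr_word_id` and `parse_hocr_id` are hypothetical and are not part of the toolbox; the index values are arbitrary.

```python
from typing import List, Tuple


def hocr_word_id(
    page_index: int,
    block_index: int,
    paragraph_index: int,
    line_index: int,
    word_index: int,
) -> str:
    """Hypothetical helper: build a word id in the documented format."""
    return (
        f"word_{page_index}_{block_index}_{paragraph_index}"
        f"_{line_index}_{word_index}"
    )


def parse_hocr_id(element_id: str) -> Tuple[str, List[int]]:
    """Hypothetical helper: split an id into its object kind and indices."""
    kind, *indices = element_id.split("_")
    return kind, [int(index) for index in indices]


word_id = hocr_word_id(0, 1, 2, 3, 4)
print(word_id)
print(parse_hocr_id(word_id))
```

Parsing ids this way can help map an hOCR element back to its position in the page hierarchy, assuming the object kind itself contains no underscore.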
export_images
export_images(
output_path: str, output_file_prefix: str, output_file_extension: str
) -> typing.List[str]
Exports images from Document.entities to files. Only exports Portrait entities.
Parameters | |
---|---|
Name | Description |
output_path |
str
Required. The path to the output directory. |
output_file_prefix |
str
Required. The output file name prefix. |
output_file_extension |
str
Required. The output file extension. Format: |
Returns | |
---|---|
Type | Description |
List[str] |
A list of output image file names. Format: {output_path}/{output_file_prefix}_{index}_{Entity.type_}.{output_file_extension} |
form_fields_to_bigquery
form_fields_to_bigquery(
dataset_name: str, table_name: str, project_id: typing.Optional[str] = None
) -> google.cloud.bigquery.job.load.LoadJob
Adds extracted form fields to a BigQuery table.
Parameters | |
---|---|
Name | Description |
dataset_name |
str
Required. Name of the BigQuery dataset. |
table_name |
str
Required. Name of the BigQuery table. |
project_id |
Optional[str]
Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment. |
Returns | |
---|---|
Type | Description |
bigquery.job.LoadJob |
The BigQuery LoadJob for adding the form fields. |
form_fields_to_dict
form_fields_to_dict() -> typing.Dict[str, typing.Union[str, typing.List[str]]]
Returns a dictionary of the form fields in the Document.
Returns | |
---|---|
Type | Description |
Dict[str, Union[str, List[str]]] |
The Dict of the form fields indexed by type. |
from_batch_process_metadata
from_batch_process_metadata(
metadata: google.cloud.documentai_v1.types.document_processor_service.BatchProcessMetadata,
) -> typing.List[google.cloud.documentai_toolbox.wrappers.document.Document]
Loads Documents from Cloud Storage, using the output from BatchProcessMetadata.
.. code-block:: python

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client`, `request`, and `timeout` are assumed to be defined elsewhere.
    operation = client.batch_process_documents(request)
    operation.result(timeout=timeout)
    metadata = documentai.BatchProcessMetadata(operation.metadata)
    wrapped_document = document.Document.from_batch_process_metadata(metadata)
Parameter | |
---|---|
Name | Description |
metadata |
documentai.BatchProcessMetadata
Required. The operation metadata after a |
Returns | |
---|---|
Type | Description |
List[Document] |
A list of wrapped documents from gcs. Each document corresponds to an input file. |
from_batch_process_operation
from_batch_process_operation(
location: str, operation_name: str, timeout: typing.Optional[float] = None
) -> typing.List[google.cloud.documentai_toolbox.wrappers.document.Document]
Loads Documents from Cloud Storage, using the operation name returned from batch_process_documents().
.. code-block:: python

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client` and `request` are assumed to be defined elsewhere.
    operation = client.batch_process_documents(request)
    operation_name = operation.operation.name
    # "us" is an example location; use the location of your processor.
    wrapped_document = document.Document.from_batch_process_operation(
        location="us", operation_name=operation_name
    )
Parameters | |
---|---|
Name | Description |
location |
str
Optional. The location of the processor used for |
operation_name |
str
Required. The fully qualified operation name for a |
timeout |
Optional[float]
Optional. Default None. Time in seconds to wait for operation to complete. If None, will wait indefinitely. |
Returns | |
---|---|
Type | Description |
List[Document] |
A list of wrapped documents from gcs. Each document corresponds to an input file. |
from_document_path
from_document_path(
document_path: str,
) -> google.cloud.documentai_toolbox.wrappers.document.Document
Loads a Document from a local document_path.
.. code-block:: python

    from google.cloud.documentai_toolbox import document

    document_path = "/path/to/local/file.json"
    wrapped_document = document.Document.from_document_path(document_path)
Parameter | |
---|---|
Name | Description |
document_path |
str
Required. The path to the |
Returns | |
---|---|
Type | Description |
Document |
A document from local document_path . |
from_documentai_document
from_documentai_document(
documentai_document: google.cloud.documentai_v1.types.document.Document,
) -> google.cloud.documentai_toolbox.wrappers.document.Document
Loads a Document from a local documentai_document.
.. code-block:: python

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client` and `request` are assumed to be defined elsewhere.
    documentai_document = client.process_document(request).document
    wrapped_document = document.Document.from_documentai_document(documentai_document)
Parameter | |
---|---|
Name | Description |
documentai_document |
documentai.Document
Required. The |
Returns | |
---|---|
Type | Description |
Document |
A document from local documentai_document . |
from_gcs
from_gcs(
gcs_bucket_name: str, gcs_prefix: str, gcs_input_uri: typing.Optional[str] = None
) -> google.cloud.documentai_toolbox.wrappers.document.Document
Loads a Document from a Cloud Storage directory.
Parameters | |
---|---|
Name | Description |
gcs_bucket_name |
str
Required. The gcs bucket. Format: Given |
gcs_prefix |
str
Required. The prefix to the location of the target folder. Format: Given |
gcs_input_uri |
Optional[str]
Optional. The gcs uri to the original input file. Format: |
Returns | |
---|---|
Type | Description |
Document |
A document from gcs. |
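Because from_gcs takes the bucket name and the prefix as separate arguments, a full gs:// URI has to be split first. The sketch below shows one way to do that with the standard library only. The helper `split_bucket_and_prefix` and the sample URI are hypothetical, and stripping the trailing slash is an assumption based on the gcs_prefix={optional_folder}/{target_folder} format given for the class attributes.

```python
from typing import Tuple


def split_bucket_and_prefix(gcs_uri: str) -> Tuple[str, str]:
    """Hypothetical helper: split gs://bucket/prefix/ into (bucket, prefix)."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with {scheme!r}: {gcs_uri!r}")
    # Everything before the first slash is the bucket; the rest is the prefix.
    bucket_name, _, prefix = gcs_uri[len(scheme):].partition("/")
    # Assumption: the prefix is passed without a trailing slash.
    return bucket_name, prefix.rstrip("/")


bucket_name, prefix = split_bucket_and_prefix("gs://my-bucket/output/123/0/")
print(bucket_name, prefix)
# The two values could then be passed on, for example:
# document.Document.from_gcs(gcs_bucket_name=bucket_name, gcs_prefix=prefix)
```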
from_gcs_uri
from_gcs_uri(
gcs_uri: str, gcs_input_uri: typing.Optional[str] = None
) -> google.cloud.documentai_toolbox.wrappers.document.Document
Loads a Document from a Cloud Storage uri.
Parameters | |
---|---|
Name | Description |
gcs_uri |
str
Required. The full GCS uri to a Document JSON file. Example: |
gcs_input_uri |
Optional[str]
Optional. The gcs uri to the original input file. Format: |
Returns | |
---|---|
Type | Description |
Document |
A document from gcs. |
get_entity_by_type
get_entity_by_type(
target_type: str,
) -> typing.List[google.cloud.documentai_toolbox.wrappers.entity.Entity]
Returns the list of Entities of target_type.
Parameter | |
---|---|
Name | Description |
target_type |
str
Required. Target entity type. |
Returns | |
---|---|
Type | Description |
List[Entity] |
A list of Entity matching target_type . |
get_form_field_by_name
get_form_field_by_name(
target_field: str,
) -> typing.List[google.cloud.documentai_toolbox.wrappers.page.FormField]
Returns the list of FormFields named target_field.
Parameter | |
---|---|
Name | Description |
target_field |
str
Required. Target field name. |
Returns | |
---|---|
Type | Description |
List[FormField] |
A list of FormField matching target_field . |
search_pages
search_pages(
target_string: typing.Optional[str] = None, pattern: typing.Optional[str] = None
) -> typing.List[google.cloud.documentai_toolbox.wrappers.page.Page]
Returns the list of Pages containing target_string or text matching pattern.
Parameters | |
---|---|
Name | Description |
target_string |
Optional[str]
Optional. The string to search for. |
pattern |
Optional[str]
Optional. The regular expression to match against the page text. |
Returns | |
---|---|
Type | Description |
List[Page] |
A list of Pages. |
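The two search modes can be illustrated on plain strings. The sketch below is not the library's implementation: `search_page_texts` is a hypothetical helper that works on a list of page texts rather than Page objects, and it assumes that exactly one of target_string and pattern is given and that matching is case-sensitive.

```python
import re
from typing import List, Optional


def search_page_texts(
    page_texts: List[str],
    target_string: Optional[str] = None,
    pattern: Optional[str] = None,
) -> List[int]:
    """Hypothetical helper: return indices of pages that match.

    Assumption: exactly one of target_string and pattern must be given.
    """
    if (target_string is None) == (pattern is None):
        raise ValueError("Specify exactly one of target_string or pattern.")
    if target_string is not None:
        # Plain substring search.
        return [i for i, text in enumerate(page_texts) if target_string in text]
    # Regular expression search anywhere in the page text.
    regex = re.compile(pattern)
    return [i for i, text in enumerate(page_texts) if regex.search(text)]


pages = ["Invoice 2023-01-05", "Terms and conditions", "Total due: $40.00"]
print(search_page_texts(pages, target_string="Total"))
print(search_page_texts(pages, pattern=r"\d{4}-\d{2}-\d{2}"))
```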
split_pdf
split_pdf(pdf_path: str, output_path: str) -> typing.List[str]
Splits local PDF file into multiple PDF files based on output from a Splitter processor.
Parameters | |
---|---|
Name | Description |
pdf_path |
str
Required. The path to the PDF file. |
output_path |
str
Required. The path to the output directory. |
Returns | |
---|---|
Type | Description |
List[str] |
A list of output PDF files. |
to_merged_documentai_document
to_merged_documentai_document() -> (
google.cloud.documentai_v1.types.document.Document
)
Exports a documentai.Document from the wrapped document with shards merged.
Returns | |
---|---|
Type | Description |
documentai.Document |
Document with all shards merged and text offsets applied. |
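To make "text offsets applied" concrete, the sketch below merges shards in a simplified model: each shard carries its own text plus spans indexed into that text, and merging shifts every span by the length of the text that precedes it. The `Shard` dataclass and `merge_shards` function are hypothetical stand-ins, not the documentai.Document type or the library's merge logic.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Shard:
    """Hypothetical stand-in for one shard of a Document."""

    text: str
    spans: List[Tuple[int, int]]  # (start, end) offsets into this shard's text


def merge_shards(shards: List[Shard]) -> Tuple[str, List[Tuple[int, int]]]:
    """Concatenate shard texts and shift spans into the merged text."""
    merged_text = ""
    merged_spans: List[Tuple[int, int]] = []
    for shard in shards:
        # Offset of this shard's text within the merged text.
        offset = len(merged_text)
        merged_spans.extend(
            (start + offset, end + offset) for start, end in shard.spans
        )
        merged_text += shard.text
    return merged_text, merged_spans


text, spans = merge_shards(
    [Shard("Hello ", [(0, 5)]), Shard("world", [(0, 5)])]
)
print(text)
print(spans)
```

After merging, each span indexes into the combined text, so `text[start:end]` still yields the original segment.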