Module document (0.5.0a0)

Wrappers for Document AI Document type.

Classes

Document

Document(
    shards: List[google.cloud.documentai_v1.types.document.Document],
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix: Optional[str] = None,
    gcs_input_uri: Optional[str] = None,
)

Represents a wrapped Document.

This class hides the complexity of working with the Document protobuf response returned by the BatchProcessDocuments or ProcessDocument methods, and provides convenience methods for searching and extracting information within the Document.

gcs_bucket_name (Optional[str]): Optional. The name of the GCS bucket. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_bucket_name={bucket_name}.

entities (List[Entity]): A list of Entities in the Document.

Module Functions

_convert_to_vision_annotate_file_response

_convert_to_vision_annotate_file_response(
    text: str, pages: List[google.cloud.documentai_toolbox.wrappers.page.Page]
)

Convert OCR data from Document.proto to AnnotateFileResponse.proto for Vision API.

Parameters

text (str): Required. Contents of document.

pages (List[Page]): Required. A list of pages.

Returns

AnnotateFileResponse: Proto with TextAnnotations.
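In the Document proto, layout elements point into the document text through start and end offsets instead of carrying their own copy of the text, which is presumably why this conversion takes the full text alongside the pages. A minimal sketch of that offset lookup follows. The `Segment` tuple and `text_from_segments` are hypothetical stand-ins for illustration, not part of the toolbox API.

```python
from typing import List, Tuple

# Hypothetical stand-in: each segment is a (start_index, end_index) pair
# of offsets into the full document text, with end_index exclusive.
Segment = Tuple[int, int]


def text_from_segments(text: str, segments: List[Segment]) -> str:
    """Join the substrings of `text` addressed by each segment, in order."""
    return "".join(text[start:end] for start, end in segments)


document_text = "Invoice 42\nTotal: 10 USD\n"

# One segment covering the first line, newline included.
first_line = text_from_segments(document_text, [(0, 11)])

# Two adjacent segments that together cover the amount.
total_value = text_from_segments(document_text, [(18, 20), (20, 24)])
```

Keeping offsets rather than substrings means the same text can be re-sliced at page, line, or token level without duplicating it.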

_entities_from_shards

_entities_from_shards(
    shards: List[google.cloud.documentai_v1.types.document.Document],
)

Returns a list of Entities from a list of documentai.Document shards.

Parameter

shards (List[google.cloud.documentai.Document]): Required. List of document shards.

Returns

List[Entity]: A list of Entities.
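The idea of collecting entities across shards can be sketched as a simple flattening step. This is an illustration under stated assumptions, not the toolbox implementation: `FakeEntity`, `FakeShard`, and `entities_from_shards` are hypothetical stand-ins, and the real function wraps each entity in the toolbox's own Entity class.

```python
from dataclasses import dataclass, field
from itertools import chain
from typing import List


@dataclass
class FakeEntity:
    """Hypothetical stand-in for an entity; not the toolbox Entity class."""

    type_: str
    mention_text: str


@dataclass
class FakeShard:
    """Hypothetical stand-in for one documentai.Document shard."""

    entities: List[FakeEntity] = field(default_factory=list)


def entities_from_shards(shards: List[FakeShard]) -> List[FakeEntity]:
    """Concatenate the entities of every shard, preserving shard order."""
    return list(chain.from_iterable(shard.entities for shard in shards))


shards = [
    FakeShard([FakeEntity("invoice_id", "42")]),
    FakeShard([]),  # a shard may contribute no entities
    FakeShard([FakeEntity("total_amount", "10 USD")]),
]
all_entities = entities_from_shards(shards)
```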

_get_batch_process_metadata

_get_batch_process_metadata(location: str, operation_name: str)

Get BatchProcessMetadata from a batch_process_documents() long-running operation.

Parameters

location (str): Required. The location of the processor used for batch_process_documents().

operation_name (str): Required. The fully qualified operation name for a batch_process_documents() operation.

Returns

documentai.BatchProcessMetadata: Metadata from the batch process.
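To make "fully qualified operation name" concrete, the sketch below validates a name and derives a regional endpoint from its location. Both the name pattern `projects/{project}/locations/{location}/operations/{operation_id}` and the endpoint form `{location}-documentai.googleapis.com` are assumptions based on common Google Cloud conventions. The helpers `parse_operation_name` and `regional_endpoint` are hypothetical and not part of the toolbox.

```python
import re
from typing import Dict

# Assumed shape of a fully qualified long-running operation name.
_OPERATION_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation_id>[^/]+)$"
)


def parse_operation_name(operation_name: str) -> Dict[str, str]:
    """Split an operation name into project, location, and operation id."""
    match = _OPERATION_NAME.match(operation_name)
    if match is None:
        raise ValueError(
            f"Not a fully qualified operation name: {operation_name!r}"
        )
    return match.groupdict()


def regional_endpoint(location: str) -> str:
    """Build the assumed regional API endpoint for a processor location."""
    return f"{location}-documentai.googleapis.com"


parts = parse_operation_name(
    "projects/my-project/locations/us/operations/1234567890"
)
endpoint = regional_endpoint(parts["location"])
```

Checking the name up front gives a clear error for a bare operation id, which would otherwise only fail once the request is sent.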

_get_shards

_get_shards(gcs_bucket_name: str, gcs_prefix: str)

Returns a list of documentai.Document shards from a Cloud Storage folder.

Parameters

gcs_bucket_name (str): Required. The name of the GCS bucket. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_bucket_name={bucket_name}.

gcs_prefix (str): Required. The prefix of the JSON files in the target_folder. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_prefix={optional_folder}/{target_folder}.

Returns

List[google.cloud.documentai.Document]: A list of documentai.Documents.
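Because the two parameters are described as parts of a single gs:// path, a small helper that splits a full URI into them may be useful. The following is a minimal sketch. `split_gcs_uri` is a hypothetical helper, not a toolbox function, and the bucket and folder names are made up for illustration.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split gs://{bucket_name}/{prefix}/ into (bucket_name, prefix)."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Expected a gs:// URI, got {gcs_uri!r}")
    # Everything up to the first slash is the bucket; the rest is the prefix.
    bucket_name, _, prefix = gcs_uri[len(scheme):].partition("/")
    if not bucket_name:
        raise ValueError(f"Missing bucket name in {gcs_uri!r}")
    # Drop the trailing slash so the prefix matches the documented format.
    return bucket_name, prefix.strip("/")


bucket_name, prefix = split_gcs_uri(
    "gs://my-bucket/optional_folder/target_folder/"
)
```

Here `bucket_name` would be passed as gcs_bucket_name and `prefix` as gcs_prefix.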

_pages_from_shards

_pages_from_shards(
    shards: List[google.cloud.documentai_v1.types.document.Document],
)

Returns a list of Pages from a list of documentai.Document shards.

Parameter

shards (List[google.cloud.documentai.Document]): Required. List of document shards.

Returns

List[Page]: A list of Pages.

_text_from_shards

_text_from_shards(shards: List[google.cloud.documentai_v1.types.document.Document])

Gets text from shards.

Parameter

shards (List[google.cloud.documentai.Document]): Required. List of document shards.

Returns

str: Text in all shards.
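One way to picture this step is as an ordered concatenation. The sketch below assumes each shard carries only its own slice of the text plus a shard index that gives its position in the document. `FakeShard` and `text_from_shards` are hypothetical stand-ins, and the toolbox's actual logic may differ, for example in how it treats shards whose text overlaps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeShard:
    """Hypothetical stand-in for one documentai.Document shard."""

    shard_index: int  # assumed position of this shard in the document
    text: str  # assumed to hold only this shard's slice of the text


def text_from_shards(shards: List[FakeShard]) -> str:
    """Join shard texts in shard order, regardless of input order."""
    ordered = sorted(shards, key=lambda shard: shard.shard_index)
    return "".join(shard.text for shard in ordered)


# Shards listed out of order, as they might be when read from storage.
shards = [
    FakeShard(shard_index=1, text="Total: 10 USD\n"),
    FakeShard(shard_index=0, text="Invoice 42\n"),
]
full_text = text_from_shards(shards)
```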