This page describes how to process a document that contains a table you want to parse.
Document AI can detect and parse text from PDF, TIFF, and GIF files stored in Cloud Storage, including text that contains unstructured data in the form of tables.
To request table detection for a small file (<=5 pages), use the process method. For larger files (files with a large number of pages), use the batchProcess method. You can check the status of batch (asynchronous) requests using the operations resource.
Output from a batch request is written to a JSON file created in
the specified Cloud Storage bucket.
Small file online processing
Synchronous ("online") requests target a small document (<=5 pages, < 20MB) stored in Cloud Storage. A synchronous request returns its response inline immediately.
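The limits above can be turned into a small pre-flight check that decides which method to call. This is a minimal sketch, not part of the Document AI API: the function name `choose_method` is hypothetical, and reading "20MB" as 20 * 1024 * 1024 bytes is an assumption.

```python
# Limits quoted in the text for synchronous ("online") requests.
ONLINE_MAX_PAGES = 5
# Assumption: "20MB" is interpreted as 20 * 1024 * 1024 bytes.
ONLINE_MAX_BYTES = 20 * 1024 * 1024


def choose_method(page_count: int, size_bytes: int) -> str:
    """Return "process" for small documents, otherwise "batchProcess"."""
    if page_count < 1 or size_bytes < 0:
        raise ValueError("page_count must be >= 1 and size_bytes must be >= 0")
    # Online processing requires <=5 pages AND a size strictly below 20MB.
    if page_count <= ONLINE_MAX_PAGES and size_bytes < ONLINE_MAX_BYTES:
        return "process"
    return "batchProcess"
```

A document that fails either condition is routed to batchProcess, which is the conservative choice when only one of the two limits is exceeded.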
The following code samples show you how to process a document with a table.
v1beta2
REST & CMD LINE
This sample shows how to use the process method to request small document processing (<=5 pages, < 20MB). The example uses the access token for a service account set up for the project using the Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see Before you begin.
The sample request body contains required fields (inputConfig) and optional fields, some for table-specific processing (tableExtractionParams).
Note that the default behavior enables table extraction and automatic table location detection, even if tableExtractionParams is not specified.
Before using any of the request data below, make the following replacements:
- LOCATION: one of the following regional processing options:
  - us - United States
  - eu - European Union
- PROJECT_ID: Your GCP project ID.
- STORAGE_URI: The URI of the document you want to process stored in a Cloud Storage bucket, including the gs:// prefix. You must have at least read privileges to the file. Example: gs://cloud-samples-data/documentai/table_parsing.pdf
- BOUNDING_POLY (optional): A bounding box hint for a table on the page. This field is intended for complex cases when the model may have difficulty locating the table. The values must be normalized [0,1]. Object format: {"x": X_MIN,"y": Y_MIN}, {"x": X_MAX,"y": Y_MIN}, {"x": X_MAX,"y": Y_MAX}, {"x": X_MIN,"y": Y_MAX}.
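The BOUNDING_POLY format above lists the four corners in the order top-left, top-right, bottom-right, bottom-left. A minimal sketch of a helper that produces those vertices and rejects values outside [0,1] follows. The function name `bounding_poly` is hypothetical, and the requirement that the minimum be strictly smaller than the maximum is an assumption added here as a sanity check.

```python
from typing import Dict, List


def bounding_poly(x_min: float, y_min: float,
                  x_max: float, y_max: float) -> List[Dict[str, float]]:
    """Build the four normalized vertices in the order shown in the text."""
    for name, value in (("x_min", x_min), ("y_min", y_min),
                        ("x_max", x_max), ("y_max", y_max)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is not normalized to [0, 1]")
    # Assumption: a degenerate or inverted box is treated as a caller error.
    if x_min >= x_max or y_min >= y_max:
        raise ValueError("expected x_min < x_max and y_min < y_max")
    return [
        {"x": x_min, "y": y_min},
        {"x": x_max, "y": y_min},
        {"x": x_max, "y": y_max},
        {"x": x_min, "y": y_max},
    ]
```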
HTTP method and URL:
POST https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:process
Request JSON body:
{
  "inputConfig": {
    "gcsSource": {
      "uri": "STORAGE_URI"
    },
    "mimeType": "application/pdf"
  },
  "documentType": "general",
  "tableExtractionParams": {
    "enabled": true,
    "tableBoundHints": [
      {
        "boundingBox": {
          "normalizedVertices": [
            BOUNDING_POLY
          ]
        }
      }
    ],
    "modelVersion": "builtin/stable"
  }
}
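If you prefer to generate the request body rather than edit it by hand, a sketch along these lines may help. It mirrors the field names in the JSON body above; the function name `build_process_request` and its parameters are hypothetical. The table hint is only included when vertices are supplied, relying on the default behavior described earlier.

```python
import json
from typing import Any, Dict, List, Optional


def build_process_request(
    storage_uri: str,
    vertices: Optional[List[Dict[str, float]]] = None,
    mime_type: str = "application/pdf",
) -> Dict[str, Any]:
    """Build a request body for the process method as a plain dict."""
    if not storage_uri.startswith("gs://"):
        raise ValueError("storage_uri must include the gs:// prefix")
    body: Dict[str, Any] = {
        "inputConfig": {
            "gcsSource": {"uri": storage_uri},
            "mimeType": mime_type,
        },
        "documentType": "general",
    }
    # Optional table hint; omitted entirely when no bounding box is given.
    if vertices is not None:
        body["tableExtractionParams"] = {
            "enabled": True,
            "tableBoundHints": [
                {"boundingBox": {"normalizedVertices": vertices}}
            ],
            "modelVersion": "builtin/stable",
        }
    return body


def to_json(body: Dict[str, Any]) -> str:
    """Serialize the body so it can be saved as request.json."""
    return json.dumps(body, indent=2)
```

Writing `to_json(build_process_request("gs://cloud-samples-data/documentai/table_parsing.pdf"))` to a file named request.json yields a body suitable for the commands that follow.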
To send your request, choose one of these options:
curl
Save the request body in a file called request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:process
PowerShell
Save the request body in a file called request.json, and execute the following command:
$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:process" | Select-Object -Expand Content
If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. The response body contains an instance of Document in its standard format.
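Once you have the JSON response, the table cells can be read back out of the Document. The sketch below is illustrative only: this page does not show the response layout, so the field names used here (text, pages, tables, headerRows, bodyRows, cells, layout.textAnchor.textSegments, startIndex, endIndex) are assumptions about the Document JSON that you should check against an actual response. The function names are hypothetical.

```python
from typing import Any, Dict, List


def layout_text(layout: Dict[str, Any], full_text: str) -> str:
    """Resolve a layout's text segments against the document text."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    parts = []
    for segment in segments:
        # Assumption: indices may arrive as strings, and a zero start
        # index may be omitted from the JSON.
        start = int(segment.get("startIndex", 0))
        end = int(segment.get("endIndex", 0))
        parts.append(full_text[start:end])
    return "".join(parts).strip()


def tables_as_rows(document: Dict[str, Any]) -> List[List[List[str]]]:
    """Return every table as a list of rows, each row a list of cell texts."""
    full_text = document.get("text", "")
    tables = []
    for page in document.get("pages", []):
        for table in page.get("tables", []):
            rows = []
            # Header rows come first, followed by body rows.
            for row in table.get("headerRows", []) + table.get("bodyRows", []):
                rows.append([
                    layout_text(cell.get("layout", {}), full_text)
                    for cell in row.get("cells", [])
                ])
            tables.append(rows)
    return tables
```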
Large file offline processing
Asynchronous ("offline") requests target longer documents and allow you to set the number of pages in the output files. This request starts a long-running operation. When the operation finishes, it stores the output as a JSON file in a specified Cloud Storage bucket.
Document AI asynchronous processing accepts PDF, TIFF, and GIF files of up to 2000 pages. Attempting to process larger files returns an error. Additionally, the maximum number of files you can send in a batch process request is 100. The maximum file size is 1GB.
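Because exceeding these limits returns an error, it can be convenient to check a batch locally before sending it. The following is a minimal sketch under stated assumptions: the function name `batch_limit_errors` is hypothetical, the caller is assumed to already know each file's page count and size, and "1GB" is read as 1024 ** 3 bytes.

```python
from typing import List, Tuple

# Limits quoted in the text for asynchronous ("offline") requests.
BATCH_MAX_PAGES = 2000
BATCH_MAX_FILES = 100
# Assumption: "1GB" is interpreted as 1024 ** 3 bytes.
BATCH_MAX_BYTES = 1024 ** 3


def batch_limit_errors(files: List[Tuple[str, int, int]]) -> List[str]:
    """Check (uri, page_count, size_bytes) tuples against the batch limits.

    Returns a list of human-readable problems; an empty list means the
    batch is within the documented limits.
    """
    errors = []
    if len(files) > BATCH_MAX_FILES:
        errors.append(
            f"{len(files)} files exceeds the limit of {BATCH_MAX_FILES}"
        )
    for uri, page_count, size_bytes in files:
        if page_count > BATCH_MAX_PAGES:
            errors.append(
                f"{uri}: {page_count} pages exceeds {BATCH_MAX_PAGES}"
            )
        if size_bytes > BATCH_MAX_BYTES:
            errors.append(
                f"{uri}: {size_bytes} bytes exceeds {BATCH_MAX_BYTES}"
            )
    return errors
```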
The following code samples show you how to process a document containing a table.
v1beta2
REST & CMD LINE
This sample shows how to send a POST request to the batchProcess method for large document asynchronous processing. The example uses the access token for a service account set up for the project using the Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see Before you begin.
The sample request body contains required fields (inputConfig, outputConfig) and optional fields, some for table-specific processing (tableExtractionParams).
Note that the default behavior enables table extraction and automatic table location detection, even if tableExtractionParams is not specified.
A batchProcess request starts a long-running operation and stores results in a Cloud Storage bucket. This sample also shows you how to get the status of this long-running operation after it has started.
Send the process request
Before using any of the request data below, make the following replacements:
- LOCATION: one of the following regional processing options:
  - us - United States
  - eu - European Union
- PROJECT_ID: Your GCP project ID.
- STORAGE_URI: The URI of the document you want to process stored in a Cloud Storage bucket, including the gs:// prefix. You must have at least read privileges to the file. Example: gs://cloud-samples-data/documentai/table_parsing.pdf
- OUTPUT_BUCKET: A Cloud Storage bucket/directory to save output files to, expressed in the following form: gs://bucket/directory/
- BOUNDING_POLY (optional): A bounding box hint for a table on the page. This field is intended for complex cases when the model may have difficulty locating the table. The values must be normalized [0,1]. Object format: {"x": X_MIN,"y": Y_MIN}, {"x": X_MAX,"y": Y_MIN}, {"x": X_MAX,"y": Y_MAX}, {"x": X_MIN,"y": Y_MAX}.
HTTP method and URL:
POST https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:batchProcess
Request JSON body:
{
  "requests": [
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "STORAGE_URI"
        },
        "mimeType": "application/pdf"
      },
      "outputConfig": {
        "pagesPerShard": 1,
        "gcsDestination": {
          "uri": "OUTPUT_BUCKET"
        }
      },
      "documentType": "general",
      "tableExtractionParams": {
        "enabled": true,
        "tableBoundHints": [
          {
            "boundingBox": {
              "normalizedVertices": [
                BOUNDING_POLY
              ]
            }
          }
        ],
        "modelVersion": "builtin/stable"
      }
    }
  ]
}
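For a batch with several input files, the body can be generated with one entry in requests per document. This is a rough sketch that follows the field names in the JSON body above; the function name `build_batch_request` and its parameters are hypothetical, and tableExtractionParams is left out on the basis of the default behavior described earlier.

```python
from typing import Any, Dict, List


def build_batch_request(
    storage_uris: List[str],
    output_bucket: str,
    pages_per_shard: int = 1,
    mime_type: str = "application/pdf",
) -> Dict[str, Any]:
    """Build a batchProcess request body with one request per input file."""
    if not output_bucket.startswith("gs://"):
        raise ValueError("output_bucket must include the gs:// prefix")
    if pages_per_shard < 1:
        raise ValueError("pages_per_shard must be at least 1")
    requests = []
    for uri in storage_uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"{uri} must include the gs:// prefix")
        requests.append({
            "inputConfig": {
                "gcsSource": {"uri": uri},
                "mimeType": mime_type,
            },
            "outputConfig": {
                "pagesPerShard": pages_per_shard,
                "gcsDestination": {"uri": output_bucket},
            },
            "documentType": "general",
        })
    return {"requests": requests}
```

Serializing the returned dict with `json.dumps` and saving it as request.json gives a body that can be sent with the commands that follow.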
To send your request, choose one of these options:
curl
Save the request body in a file called request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:batchProcess
PowerShell
Save the request body in a file called request.json, and execute the following command:
$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:batchProcess" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/operations/operation-id" }
If the request is successful, Document AI returns the name of your operation.
Get the results
To get the results of your request, you must send a GET request to the operations resource. The following shows how to send such a request.
Before using any of the request data below, make the following replacements:
- LOCATION: one of the following regional processing options:
  - us - United States
  - eu - European Union
- PROJECT_ID: Your GCP project ID.
- OPERATION_ID: The ID of your operation. The ID is the last element of the name of your operation. For example:
  - operation name: projects/PROJECT_ID/locations/LOCATION/operations/bc4e1d412863e626
  - operation id: bc4e1d412863e626
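Since the operation ID is the last element of the operation name, it can be extracted with a short helper. The sketch below also fills in the URL pattern used for the GET request; the function names `operation_id` and `operation_url` are hypothetical, and the check that the second-to-last element is "operations" is an added assumption to catch malformed names.

```python
def operation_id(operation_name: str) -> str:
    """Return the last path element of an operation name."""
    parts = operation_name.strip("/").split("/")
    # Assumption: a well-formed name ends in ".../operations/<id>".
    if len(parts) < 2 or parts[-2] != "operations" or not parts[-1]:
        raise ValueError(f"unexpected operation name: {operation_name!r}")
    return parts[-1]


def operation_url(location: str, project_id: str, op_id: str) -> str:
    """Fill in the v1beta2 operations URL pattern for a given operation."""
    return (
        f"https://{location}-documentai.googleapis.com/v1beta2/"
        f"projects/{project_id}/locations/{location}/operations/{op_id}"
    )
```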
HTTP method and URL:
GET https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID
To send your request, choose one of these options:
curl
Execute the following command:
curl -X GET \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID
PowerShell
Execute the following command:
$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1beta2.OperationMetadata",
    "state": "SUCCEEDED",
    "createTime": "2019-11-19T00:36:37.310474834Z",
    "updateTime": "2019-11-19T00:37:10.682615795Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1beta2.BatchProcessDocumentsResponse",
    "responses": [
      {
        "inputConfig": {
          "gcsSource": {
            "uri": "gs://INPUT_FILE"
          },
          "mimeType": "application/pdf"
        },
        "outputConfig": {
          "gcsDestination": {
            "uri": "gs://OUTPUT_BUCKET/"
          }
        }
      }
    ]
  }
}
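A batch request is usually polled until the operation reports "done": true. The following is a minimal polling sketch, not an official client: `wait_for_operation` and `output_uris` are hypothetical names, and `fetch` stands for any caller-supplied function that performs the GET request and returns the parsed JSON. The "error" field checked here comes from the general long-running operation format and does not appear in the sample response.

```python
import time
from typing import Any, Callable, Dict, List


def wait_for_operation(
    fetch: Callable[[], Dict[str, Any]],
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until the operation is done, then return it."""
    for _ in range(max_attempts):
        operation = fetch()
        if operation.get("done", False):
            # A finished operation carries either a response or an error.
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation
        sleep(interval_s)
    raise TimeoutError("operation did not finish within the allowed attempts")


def output_uris(operation: Dict[str, Any]) -> List[str]:
    """Collect the Cloud Storage destinations from a finished operation."""
    responses = operation.get("response", {}).get("responses", [])
    return [
        r["outputConfig"]["gcsDestination"]["uri"]
        for r in responses
        if "outputConfig" in r
    ]
```

Passing `fetch` and `sleep` as parameters keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.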
Processing output should look similar to the following example. The response body contains an instance of Document in its standard format with any information relevant to batch processing (shardInfo).
This output is for a publicly accessible PDF file (gs://cloud-samples-data/documentai/table_parsing.pdf), with one page per shard. This file is stored in the output Cloud Storage bucket specified in the request body.
output-page-1-to-1.json: