Parsing documents containing tables

This page describes how to process a document that contains a table you want to parse.

Document AI can detect and parse text from PDF, TIFF, and GIF files stored in Cloud Storage, including text that contains unstructured data in the form of tables.

You request table detection for smaller files (<=5 pages) using the process method; for files with a larger number of pages, use the batchProcess method. You can check the status of batch (asynchronous) requests using the operations resource. Output from a batch request is written to a JSON file created in the specified Cloud Storage bucket.

Small file online processing

Synchronous ("online") requests target a document with a small number of pages and size (<=5 pages, < 20MB) stored in Cloud Storage. Synchronous requests immediately return a response inline.

The following code samples show you how to process a document with a table.

v1beta2

Select the tab below for your language or environment:

REST & CMD LINE

This sample shows how to use the process method to request small document processing (<=5 pages, < 20MB). The example uses the access token for a service account set up for the project using the Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see Before you begin.

The sample request body contains required fields (inputConfig) and optional fields, some for table-specific processing (tableExtractionParams). Note that default behavior enables table extraction and automatic table location detection, even if tableExtractionParams are not specified.
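
For example, a minimal request body relying on those defaults could contain only the required input configuration (an illustrative sketch; the full body below adds the optional table-specific fields):

{
  "inputConfig": {
    "gcsSource": {
      "uri": "STORAGE_URI"
    },
    "mimeType": "application/pdf"
  }
}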

Before using any of the request data below, make the following replacements:

  • LOCATION: one of the following regional processing options:
    • us - United States
    • eu - European Union
  • PROJECT_ID: Your GCP project ID.
  • STORAGE_URI: The URI of the document you want to process stored in a Cloud Storage bucket, including the gs:// prefix. You must have at least read access to the file. Example:
    • gs://cloud-samples-data/documentai/table_parsing.pdf
  • BOUNDING_POLY (optional): A bounding box hint for a table on the page. This field is intended for complex cases when the model may have difficulty locating the table. The values must be normalized [0,1]. Object format: {"x": X_MIN,"y": Y_MIN}, {"x": X_MAX,"y": Y_MIN},{"x": X_MAX,"y": Y_MAX},{"x": X_MIN,"y": Y_MAX}.
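
For example, a hint for a table occupying roughly the lower half of a page (illustrative coordinates, not taken from the sample document) would replace BOUNDING_POLY with:

{"x": 0.1,"y": 0.5}, {"x": 0.9,"y": 0.5},{"x": 0.9,"y": 0.95},{"x": 0.1,"y": 0.95}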

HTTP method and URL:

POST https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:process

Request JSON body:

{
  "inputConfig": {
    "gcsSource": {
      "uri": "STORAGE_URI"
    },
    "mimeType": "application/pdf"
  },
  "documentType": "general",
  "tableExtractionParams": {
    "enabled": true,
    "tableBoundHints": [
      {
        "boundingBox": {
          "normalizedVertices": [
            BOUNDING_POLY
          ]
        }
      }
    ],
    "modelVersion": "builtin/stable"
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:process

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:process" | Select-Object -Expand Content

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. The response body contains an instance of Document in its standard format.
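
The returned Document can be inspected with ordinary JSON tooling. The following minimal Python sketch is one way to print the header cells of the first table; the response.json file name and the defensive defaults are illustrative assumptions, not part of the API:

import json

# Hypothetical path; save the curl response to this file first.
with open("response.json") as f:
    document = json.load(f)

text = document.get("text", "")

def get_text(layout):
    """Resolve a layout's textAnchor offsets against the full document text."""
    snippet = ""
    for segment in layout.get("textAnchor", {}).get("textSegments", []):
        # int64 offsets are serialized as strings in proto JSON, and the
        # first segment of a document may omit startIndex.
        start = int(segment.get("startIndex", 0))
        end = int(segment["endIndex"])
        snippet += text[start:end]
    return snippet

pages = document.get("pages", [])
if pages and pages[0].get("tables"):
    first_table = pages[0]["tables"][0]
    for row in first_table.get("headerRows", []):
        print("\t".join(get_text(cell["layout"]) for cell in row.get("cells", [])))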

Java


import com.google.cloud.documentai.v1beta2.BoundingPoly;
import com.google.cloud.documentai.v1beta2.Document;
import com.google.cloud.documentai.v1beta2.DocumentUnderstandingServiceClient;
import com.google.cloud.documentai.v1beta2.GcsSource;
import com.google.cloud.documentai.v1beta2.InputConfig;
import com.google.cloud.documentai.v1beta2.NormalizedVertex;
import com.google.cloud.documentai.v1beta2.ProcessDocumentRequest;
import com.google.cloud.documentai.v1beta2.TableBoundHint;
import com.google.cloud.documentai.v1beta2.TableExtractionParams;
import java.io.IOException;
import java.util.List;

public class ParseTableBeta {

  public static void parseTable() throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String inputGcsUri = "gs://your-gcs-bucket/path/to/input/file.pdf";
    parseTable(projectId, location, inputGcsUri);
  }

  public static void parseTable(String projectId, String location, String inputGcsUri)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DocumentUnderstandingServiceClient client = DocumentUnderstandingServiceClient.create()) {
      // Configure the request for processing the PDF
      String parent = String.format("projects/%s/locations/%s", projectId, location);

      TableBoundHint tableBoundHints =
          TableBoundHint.newBuilder()
              .setBoundingBox(
                  // Define a polygon around tables to detect
                  // Each vertex coordinate must be a number between 0 and 1
                  BoundingPoly.newBuilder()
                      // top left
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(0).setY(0).build())
                      // top right
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(1).setY(0).build())
                      // bottom right
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(1).setY(1).build())
                      // bottom left
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(0).setY(1).build())
                      .build())
              .setPageNumber(1)
              .build();

      TableExtractionParams params =
          TableExtractionParams.newBuilder()
              .setEnabled(true)
              .addTableBoundHints(tableBoundHints)
              .build();

      GcsSource uri = GcsSource.newBuilder().setUri(inputGcsUri).build();

      // mime_type can be application/pdf, image/tiff,
      // image/gif, or application/json
      InputConfig config =
          InputConfig.newBuilder().setGcsSource(uri).setMimeType("application/pdf").build();

      ProcessDocumentRequest request =
          ProcessDocumentRequest.newBuilder()
              .setParent(parent)
              .setTableExtractionParams(params)
              .setInputConfig(config)
              .build();

      // Recognizes text entities in the PDF document
      Document response = client.processDocument(request);

      // Get all of the document text as one big string
      String text = response.getText();

      // Get the first table in the document
      if (response.getPagesCount() > 0) {
        Document.Page page1 = response.getPages(0);
        if (page1.getTablesCount() > 0) {
          Document.Page.Table table = page1.getTables(0);

          System.out.println("Results from first table processed:");
          List<Document.Page.DetectedLanguage> detectedLangs = page1.getDetectedLanguagesList();
          String langCode =
              detectedLangs.size() > 0 ? detectedLangs.get(0).getLanguageCode() : "NOT_FOUND";
          System.out.printf("First detected language: : %s", langCode);

          if (table.getHeaderRowsCount() > 0) {
            Document.Page.Table.TableRow headerRow = table.getHeaderRows(0);
            System.out.println("Header row:");

            for (Document.Page.Table.TableCell tableCell : headerRow.getCellsList()) {
              if (!tableCell.getLayout().getTextAnchor().getTextSegmentsList().isEmpty()) {
                // Extract shards from the text field
                // First shard in document doesn't have startIndex property
                System.out.printf("\t%s", getText(tableCell.getLayout(), text));
              }
            }
          }
        }
      }
    }
  }

  // Extract shards from the text field
  private static String getText(Document.Page.Layout layout, String text) {
    Document.TextAnchor textAnchor = layout.getTextAnchor();
    if (textAnchor.getTextSegmentsList().size() > 0) {
      int startIdx = (int) textAnchor.getTextSegments(0).getStartIndex();
      int endIdx = (int) textAnchor.getTextSegments(0).getEndIndex();
      return text.substring(startIdx, endIdx);
    }
    return "[NO TEXT]";
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const gcsInputUri = 'YOUR_SOURCE_PDF';

const {
  DocumentUnderstandingServiceClient,
} = require('@google-cloud/documentai').v1beta2;
const client = new DocumentUnderstandingServiceClient();

async function parseTable() {
  // Configure the request for processing the PDF
  const parent = `projects/${projectId}/locations/${location}`;
  const request = {
    parent,
    inputConfig: {
      gcsSource: {
        uri: gcsInputUri,
      },
      mimeType: 'application/pdf',
    },
    tableExtractionParams: {
      enabled: true,
      tableBoundHints: [
        {
          boundingBox: {
            normalizedVertices: [
              {x: 0, y: 0},
              {x: 1, y: 0},
              {x: 1, y: 1},
              {x: 0, y: 1},
            ],
          },
        },
      ],
    },
  };

  // Recognizes text entities in the PDF document
  const [result] = await client.processDocument(request);

  // Get all of the document text as one big string
  const {text} = result;

  // Extract shards from the text field
  function getText(textAnchor) {
    // Text anchor has no text segments if cell is empty
    if (textAnchor.textSegments.length > 0) {
      // First shard in document doesn't have startIndex property
      const startIndex = textAnchor.textSegments[0].startIndex || 0;
      const endIndex = textAnchor.textSegments[0].endIndex;

      return text.substring(startIndex, endIndex);
    }
    return '[NO TEXT]';
  }

  // Get the first table in the document
  const [page1] = result.pages;
  const [table] = page1.tables;
  const [headerRow] = table.headerRows;

  console.log('Header row:');
  for (const tableCell of headerRow.cells) {
    if (tableCell.layout.textAnchor.textSegments) {
      // Extract shards from the text field
      // First shard in document doesn't have startIndex property
      const textAnchor = tableCell.layout.textAnchor;

      console.log(`\t${getText(textAnchor)}`);
    }
  }
}

Python

from google.cloud import documentai_v1beta2 as documentai


def parse_table(
    project_id="YOUR_PROJECT_ID",
    input_uri="gs://cloud-samples-data/documentai/invoice.pdf",
):
    """Parse a form"""

    client = documentai.DocumentUnderstandingServiceClient()

    gcs_source = documentai.types.GcsSource(uri=input_uri)

    # mime_type can be application/pdf, image/tiff,
    # image/gif, or application/json
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type="application/pdf"
    )

    # Improve table parsing results by providing bounding box hints
    # specifying where tables appear in the document (optional)
    table_bound_hints = [
        documentai.types.TableBoundHint(
            page_number=1,
            bounding_box=documentai.types.BoundingPoly(
                # Define a polygon around tables to detect
                # Each vertex coordinate must be a number between 0 and 1
                normalized_vertices=[
                    # Top left
                    documentai.types.geometry.NormalizedVertex(x=0, y=0),
                    # Top right
                    documentai.types.geometry.NormalizedVertex(x=1, y=0),
                    # Bottom right
                    documentai.types.geometry.NormalizedVertex(x=1, y=1),
                    # Bottom left
                    documentai.types.geometry.NormalizedVertex(x=0, y=1),
                ]
            ),
        )
    ]

    # Setting enabled=True enables table extraction
    table_extraction_params = documentai.types.TableExtractionParams(
        enabled=True, table_bound_hints=table_bound_hints
    )

    # Location can be 'us' or 'eu'
    parent = "projects/{}/locations/us".format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config,
        table_extraction_params=table_extraction_params,
    )

    document = client.process_document(request=request)

    def _get_text(el):
        """Convert text offset indexes into text snippets."""
        response = ""
        # If a text segment spans several lines, it will
        # be stored in different text segments.
        for segment in el.text_anchor.text_segments:
            start_index = segment.start_index
            end_index = segment.end_index
            response += document.text[start_index:end_index]
        return response

    for page in document.pages:
        print("Page number: {}".format(page.page_number))
        for table_num, table in enumerate(page.tables):
            print("Table {}: ".format(table_num))
            for row_num, row in enumerate(table.header_rows):
                cells = "\t".join([_get_text(cell.layout) for cell in row.cells])
                print("Header Row {}: {}".format(row_num, cells))
            for row_num, row in enumerate(table.body_rows):
                cells = "\t".join([_get_text(cell.layout) for cell in row.cells])
                print("Row {}: {}".format(row_num, cells))

Large file offline processing

Asynchronous ("offline") requests targets longer documents and allows you to set the number of pages in the output files. This request starts a long-running operation. When this operation finishes it stores output as a JSON file in a specified Cloud Storage bucket.

Document AI asynchronous processing accepts PDF, TIFF, and GIF files of up to 2,000 pages. Attempting to process larger files returns an error. Additionally, you can send at most 100 files in a single batch process request, and the maximum file size is 1GB.

The following code samples show you how to process a document containing a table.

v1beta2

Select the tab below for your language or environment:

REST & CMD LINE

This sample shows how to send a POST request to the batchProcess method for large document asynchronous processing. The example uses the access token for a service account set up for the project using the Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see Before you begin.

The sample request body contains required fields (inputConfig, outputConfig) and optional fields, some for table-specific processing (tableExtractionParams). Note that default behavior enables table extraction and automatic table location detection, even if tableExtractionParams are not specified.

A batchProcess request starts a long-running operation and stores results in a Cloud Storage bucket. This sample also shows you how to get the status of this long-running operation after it has started.

Send the process request

Before using any of the request data below, make the following replacements:

  • LOCATION: one of the following regional processing options:
    • us - United States
    • eu - European Union
  • PROJECT_ID: Your GCP project ID.
  • STORAGE_URI: The URI of the document you want to process stored in a Cloud Storage bucket, including the gs:// prefix. You must have at least read access to the file. Example:
    • gs://cloud-samples-data/documentai/table_parsing.pdf
  • OUTPUT_BUCKET: A Cloud Storage bucket/directory to save output files to, expressed in the following form:
    • gs://bucket/directory/
    The requesting user must have write permission to the bucket.
  • BOUNDING_POLY (optional): A bounding box hint for a table on the page. This field is intended for complex cases when the model may have difficulty locating the table. The values must be normalized [0,1]. Object format: {"x": X_MIN,"y": Y_MIN}, {"x": X_MAX,"y": Y_MIN},{"x": X_MAX,"y": Y_MAX},{"x": X_MIN,"y": Y_MAX}.

HTTP method and URL:

POST https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:batchProcess

Request JSON body:

{
  "requests": [
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "STORAGE_URI"
        },
        "mimeType": "application/pdf"
      },
      "outputConfig": {
        "pagesPerShard": 1,
        "gcsDestination": {
          "uri": "OUTPUT_BUCKET"
        }
      },
      "documentType": "general",
      "tableExtractionParams": {
        "enabled": true,
        "tableBoundHints": [
          {
            "boundingBox": {
              "normalizedVertices": [
                BOUNDING_POLY
              ]
            }
          }
        ],
        "modelVersion": "builtin/stable"
      }
    }
  ]
}

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:batchProcess

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/documents:batchProcess" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/operations/operation-id"
}

If the request is successful, Document AI returns the name of your operation.

Get the results

To get the results of your request, you must send a GET request to the operations resource. The following shows how to send such a request.

Before using any of the request data below, make the following replacements:

  • LOCATION: one of the following regional processing options:
    • us - United States
    • eu - European Union
  • PROJECT_ID: Your GCP project ID.
  • OPERATION_ID: The ID of your operation. The ID is the last element of the name of your operation. For example:
    • operation name: projects/PROJECT_ID/locations/LOCATION/operations/bc4e1d412863e626
    • operation id: bc4e1d412863e626
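
If you capture the operation name programmatically, the ID is simply the final path segment. For example, in Python (operation_name here is assumed to hold the name string returned by the batchProcess call):

operation_id = operation_name.split("/")[-1]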

HTTP method and URL:

GET https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID

PowerShell

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-documentai.googleapis.com/v1beta2/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/BUCKET_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1beta2.OperationMetadata",
    "state": "SUCCEEDED",
    "createTime": "2019-11-19T00:36:37.310474834Z",
    "updateTime": "2019-11-19T00:37:10.682615795Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1beta2.BatchProcessDocumentsResponse",
    "responses": [
      {
        "inputConfig": {
          "gcsSource": {
            "uri": "gs://INPUT_FILE"
          },
          "mimeType": "application/pdf"
        },
        "outputConfig": {
          "gcsDestination": {
            "uri": "gs://OUTPUT_BUCKET/"
          }
        }
      }
    ]
  }
}

Processing output is an instance of Document in its standard format, with any information relevant to batch processing (shardInfo) added.

For example, processing the publicly accessible PDF file gs://cloud-samples-data/documentai/table_parsing.pdf with one page per shard writes a series of JSON files, such as output-page-1-to-1.json, to the output Cloud Storage bucket specified in the request body.

Java


import com.google.api.gax.longrunning.OperationFuture;
import com.google.api.gax.paging.Page;
import com.google.cloud.documentai.v1beta2.BatchProcessDocumentsRequest;
import com.google.cloud.documentai.v1beta2.BatchProcessDocumentsResponse;
import com.google.cloud.documentai.v1beta2.BoundingPoly;
import com.google.cloud.documentai.v1beta2.Document;
import com.google.cloud.documentai.v1beta2.DocumentUnderstandingServiceClient;
import com.google.cloud.documentai.v1beta2.GcsDestination;
import com.google.cloud.documentai.v1beta2.GcsSource;
import com.google.cloud.documentai.v1beta2.InputConfig;
import com.google.cloud.documentai.v1beta2.NormalizedVertex;
import com.google.cloud.documentai.v1beta2.OperationMetadata;
import com.google.cloud.documentai.v1beta2.OutputConfig;
import com.google.cloud.documentai.v1beta2.ProcessDocumentRequest;
import com.google.cloud.documentai.v1beta2.TableBoundHint;
import com.google.cloud.documentai.v1beta2.TableExtractionParams;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import com.google.protobuf.util.JsonFormat;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BatchParseTableBeta {

  public static void batchParseTableGcs()
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String outputGcsBucketName = "your-gcs-bucket-name";
    String outputGcsPrefix = "PREFIX";
    String inputGcsUri = "gs://your-gcs-bucket/path/to/input/file.pdf";
    batchParseTableGcs(projectId, location, outputGcsBucketName, outputGcsPrefix, inputGcsUri);
  }

  public static void batchParseTableGcs(
      String projectId,
      String location,
      String outputGcsBucketName,
      String outputGcsPrefix,
      String inputGcsUri)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DocumentUnderstandingServiceClient client = DocumentUnderstandingServiceClient.create()) {

      // Configure the request for processing the PDF
      String parent = String.format("projects/%s/locations/%s", projectId, location);

      TableBoundHint tableBoundHints =
          TableBoundHint.newBuilder()
              .setBoundingBox(
                  // Define a polygon around tables to detect
                  // Each vertex coordinate must be a number between 0 and 1
                  BoundingPoly.newBuilder()
                      // top left
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(0).setY(0).build())
                      // top right
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(1).setY(0).build())
                      // bottom right
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(1).setY(1).build())
                      // bottom left
                      .addNormalizedVertices(NormalizedVertex.newBuilder().setX(0).setY(1).build())
                      .build())
              .setPageNumber(1)
              .build();

      TableExtractionParams params =
          TableExtractionParams.newBuilder()
              .setEnabled(true)
              .addTableBoundHints(tableBoundHints)
              .build();

      GcsSource inputUri = GcsSource.newBuilder().setUri(inputGcsUri).build();

      // mime_type can be application/pdf, image/tiff,
      // image/gif, or application/json
      InputConfig config =
          InputConfig.newBuilder().setGcsSource(inputUri).setMimeType("application/pdf").build();

      GcsDestination gcsDestination =
          GcsDestination.newBuilder()
              .setUri(String.format("gs://%s/%s", outputGcsBucketName, outputGcsPrefix))
              .build();

      OutputConfig outputConfig =
          OutputConfig.newBuilder().setGcsDestination(gcsDestination).setPagesPerShard(1).build();

      ProcessDocumentRequest request =
          ProcessDocumentRequest.newBuilder()
              .setTableExtractionParams(params)
              .setInputConfig(config)
              .setOutputConfig(outputConfig)
              .build();

      BatchProcessDocumentsRequest requests =
          BatchProcessDocumentsRequest.newBuilder().addRequests(request).setParent(parent).build();

      // Batch process document using a long-running operation.
      OperationFuture<BatchProcessDocumentsResponse, OperationMetadata> future =
          client.batchProcessDocumentsAsync(requests);

      // Wait for operation to complete.
      System.out.println("Waiting for operation to complete...");
      future.get(300, TimeUnit.SECONDS);

      System.out.println("Document processing complete.");

      Storage storage = StorageOptions.newBuilder().setProjectId(projectId).build().getService();
      Bucket bucket = storage.get(outputGcsBucketName);

      // List all of the files in the Storage bucket.
      Page<Blob> blobs =
          bucket.list(
              Storage.BlobListOption.currentDirectory(),
              Storage.BlobListOption.prefix(outputGcsPrefix));

      int idx = 0;
      for (Blob blob : blobs.iterateAll()) {
        if (!blob.isDirectory()) {
          System.out.printf("Fetched file #%d\n", ++idx);
          // Read the results

          // Download and store json data in a temp file.
          File tempFile = File.createTempFile("file", ".json");
          Blob fileInfo = storage.get(BlobId.of(outputGcsBucketName, blob.getName()));
          fileInfo.downloadTo(tempFile.toPath());

          // Parse json file into Document.
          FileReader reader = new FileReader(tempFile);
          Document.Builder builder = Document.newBuilder();
          JsonFormat.parser().merge(reader, builder);
          Document document = builder.build();

          // Get all of the document text as one big string.
          String text = document.getText();

          // Process the output.
          if (document.getPagesCount() > 0) {
            Document.Page page1 = document.getPages(0);
            if (page1.getTablesCount() > 0) {
              Document.Page.Table table = page1.getTables(0);

              System.out.println("Results from first table processed:");
              System.out.println("Header row:");

              if (table.getHeaderRowsCount() > 0) {
                Document.Page.Table.TableRow headerRow = table.getHeaderRows(0);

                for (Document.Page.Table.TableCell tableCell : headerRow.getCellsList()) {
                  if (!tableCell.getLayout().getTextAnchor().getTextSegmentsList().isEmpty()) {
                    // Extract shards from the text field
                    // First shard in document doesn't have startIndex property
                    List<Document.TextAnchor.TextSegment> textSegments =
                        tableCell.getLayout().getTextAnchor().getTextSegmentsList();
                    int startIdx =
                        textSegments.size() > 0 ? (int) textSegments.get(0).getStartIndex() : 0;
                    int endIdx = (int) textSegments.get(0).getEndIndex();
                    System.out.printf("\t%s", text.substring(startIdx, endIdx));
                  }
                }
              }
            }
          }

          // Clean up temp file.
          tempFile.deleteOnExit();
        }
      }
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const gcsOutputUri = 'YOUR_STORAGE_BUCKET';
// const gcsOutputUriPrefix = 'YOUR_STORAGE_PREFIX';
// const gcsInputUri = 'YOUR_SOURCE_PDF';

// Imports the Google Cloud client library
const {
  DocumentUnderstandingServiceClient,
} = require('@google-cloud/documentai').v1beta2;
const {Storage} = require('@google-cloud/storage');

const client = new DocumentUnderstandingServiceClient();
const storage = new Storage();

async function parseTableGCS(inputUri, outputUri, outputUriPrefix) {
  const parent = `projects/${projectId}/locations/${location}`;

  // Configure the batch process request.
  const request = {
    inputConfig: {
      gcsSource: {
        uri: inputUri,
      },
      mimeType: 'application/pdf',
    },
    outputConfig: {
      gcsDestination: {
        uri: `${outputUri}/${outputUriPrefix}/`,
      },
      pagesPerShard: 1,
    },
    tableExtractionParams: {
      enabled: true,
      tableBoundHints: [
        {
          boundingBox: {
            normalizedVertices: [
              {x: 0, y: 0},
              {x: 1, y: 0},
              {x: 1, y: 1},
              {x: 0, y: 1},
            ],
          },
        },
      ],
    },
  };

  // Configure the request for batch process
  const requests = {
    parent,
    requests: [request],
  };

  // Batch process document using a long-running operation.
  // You can wait for now, or get results later.
  // Note: first request to the service takes longer than subsequent
  // requests.
  const [operation] = await client.batchProcessDocuments(requests);

  // Wait for operation to complete.
  await operation.promise();

  console.log('Document processing complete.');

  // Query Storage bucket for the results file(s).
  const query = {
    prefix: outputUriPrefix,
  };

  console.log('Fetching results ...');

  // List all of the files in the Storage bucket
  const [files] = await storage.bucket(outputUri).getFiles(query);

  files.forEach(async (fileInfo, index) => {
    // Get the file as a buffer
    const [file] = await fileInfo.download();

    console.log(`Fetched file #${index + 1}:`);

    // Read the results
    const results = JSON.parse(file.toString());

    // Get all of the document text as one big string
    const text = results.text;

    // Get the first table in the document
    const [page1] = results.pages;
    const [table] = page1.tables;
    const [headerRow] = table.headerRows;

    console.log('Results from first table processed:');
    console.log(
      `First detected language: ${page1.detectedLanguages[0].languageCode}`
    );

    console.log('Header row:');
    for (const tableCell of headerRow.cells) {
      if (tableCell.layout.textAnchor.textSegments) {
        // Extract shards from the text field
        // First shard in document doesn't have startIndex property
        const startIndex =
          tableCell.layout.textAnchor.textSegments[0].startIndex || 0;
        const endIndex = tableCell.layout.textAnchor.textSegments[0].endIndex;

        console.log(`\t${text.substring(startIndex, endIndex)}`);
      }
    }
  });
}

Python

import re

from google.cloud import documentai_v1beta2 as documentai
from google.cloud import storage


def batch_parse_table(
    project_id="YOUR_PROJECT_ID",
    input_uri="gs://cloud-samples-data/documentai/form.pdf",
    destination_uri="gs://your-bucket-id/path/to/save/results/",
    timeout=90
):
    """Parse a form"""

    client = documentai.DocumentUnderstandingServiceClient()

    gcs_source = documentai.types.GcsSource(uri=input_uri)

    # mime_type can be application/pdf, image/tiff,
    # image/gif, or application/json
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type="application/pdf"
    )

    # where to write results
    output_config = documentai.types.OutputConfig(
        gcs_destination=documentai.types.GcsDestination(uri=destination_uri),
        pages_per_shard=1,  # Map one doc page to one output page
    )

    # Improve table parsing results by providing bounding box hints
    # specifying where tables appear in the document (optional)
    table_bound_hints = [
        documentai.types.TableBoundHint(
            page_number=1,
            bounding_box=documentai.types.BoundingPoly(
                # Define a polygon around tables to detect
                # Each vertex coordinate must be a number between 0 and 1
                normalized_vertices=[
                    # Top left
                    documentai.types.geometry.NormalizedVertex(x=0, y=0),
                    # Top right
                    documentai.types.geometry.NormalizedVertex(x=1, y=0),
                    # Bottom right
                    documentai.types.geometry.NormalizedVertex(x=1, y=1),
                    # Bottom left
                    documentai.types.geometry.NormalizedVertex(x=0, y=1),
                ]
            ),
        )
    ]

    # Setting enabled=True enables table extraction
    table_extraction_params = documentai.types.TableExtractionParams(
        enabled=True, table_bound_hints=table_bound_hints
    )

    # Location can be 'us' or 'eu'
    parent = "projects/{}/locations/us".format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        input_config=input_config,
        output_config=output_config,
        table_extraction_params=table_extraction_params,
    )

    requests = []
    requests.append(request)

    batch_request = documentai.types.BatchProcessDocumentsRequest(
        parent=parent, requests=requests
    )

    operation = client.batch_process_documents(batch_request)

    # Wait for the operation to finish
    operation.result(timeout)

    # Results are written to GCS. Use a regex to find
    # output files
    match = re.match(r"gs://([^/]+)/(.+)", destination_uri)
    output_bucket = match.group(1)
    prefix = match.group(2)

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(output_bucket)
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print("Output files:")
    for blob in blob_list:
        print(blob.name)
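
The sample above stops at listing the output files. As a possible next step, the following sketch downloads the first shard and prints its tables, mirroring the _get_text logic from the synchronous sample; it assumes a recent google-cloud-storage client (older releases return the same bytes from download_as_string) and continues from the blob_list variable above:

import json

if blob_list:
    # Download the first shard and parse the Document JSON.
    shard = json.loads(blob_list[0].download_as_bytes())
    text = shard.get("text", "")

    def _get_text(layout):
        snippet = ""
        for seg in layout.get("textAnchor", {}).get("textSegments", []):
            # int64 offsets are strings in proto JSON; the first
            # segment may omit startIndex.
            snippet += text[int(seg.get("startIndex", 0)):int(seg["endIndex"])]
        return snippet

    for page in shard.get("pages", []):
        for table in page.get("tables", []):
            for row in table.get("headerRows", []):
                cells = "\t".join(_get_text(c["layout"]) for c in row.get("cells", []))
                print("Header row: {}".format(cells))
            for row in table.get("bodyRows", []):
                cells = "\t".join(_get_text(c["layout"]) for c in row.get("cells", []))
                print("Row: {}".format(cells))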