Parsing documents containing forms

This page describes how to process a document that contains a form that you want to parse.

Document AI can detect and parse text from PDF, TIFF, and GIF files stored in Cloud Storage, including unstructured data found in form documents.

For a small file (<=5 pages), request processing with the process method. For a larger file (one with many pages), use the batchProcess method. You can check the status of a batch (asynchronous) request through the operations resource.
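The choice between the two methods can be sketched as a small helper. This is only an illustration: the function name choose_method is hypothetical and not part of any Document AI client library, the page limits are the ones stated on this page (5 pages for online requests, 2000 pages for batch requests), and the sketch ignores the file-size limit.

```python
ONLINE_MAX_PAGES = 5     # limit for the process method, as stated on this page
BATCH_MAX_PAGES = 2000   # limit for the batchProcess method, as stated on this page


def choose_method(page_count: int) -> str:
    """Return the name of the REST method suited to a document of this length.

    Hypothetical helper for illustration; it does not call Document AI.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count <= ONLINE_MAX_PAGES:
        return "process"
    if page_count <= BATCH_MAX_PAGES:
        return "batchProcess"
    raise ValueError("documents longer than 2000 pages are rejected")


print(choose_method(3))
print(choose_method(40))
```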

Small file online processing

Synchronous ("online") requests target a small document (<=5 pages, < 20MB) stored in Cloud Storage. A synchronous request returns its response inline, immediately.

The following code samples show you how to process a form document with key/value pairs synchronously.

REST & CMD LINE

This sample shows how to use the process method to request small document processing (<=5 pages, < 20MB). The example uses the access token for a service account set up for the project using the Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see Before you begin.

The sample request body contains required fields (inputConfig) and optional fields, some for form-specific processing (formExtractionParams).

Before using any of the request data below, make the following replacements:

  • project-id: Your GCP project ID.
  • input-storage-file: The Cloud Storage URI of the document you want to process, including the gs:// prefix. You must have at least read access to the file. Example:
    • gs://cloud-samples-data/documentai/loan_form.pdf

HTTP method and URL:

POST https://us-documentai.googleapis.com/v1beta2/projects/project-id/locations/us/documents:process

Request JSON body:

{
   "inputConfig":{
      "gcsSource":{
         "uri":"input-storage-file"
      },
      "mimeType":"application/pdf"
   },
   "documentType":"general",
   "formExtractionParams":{
      "enabled":true
   }
}
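If you prefer to generate the request body rather than write it by hand, the following sketch builds the same JSON and optionally adds key-value pair hints. The helper name build_process_request is hypothetical; the keyValuePairHints, key, and valueTypes field names are taken from the client-library samples on this page, and the hint values are examples only.

```python
import json


def build_process_request(input_uri, hints=None, mime_type="application/pdf"):
    """Build a documents:process request body as a plain dict.

    `hints` maps a form field name to a list of value types; an empty
    list means no value type is suggested for that key.
    """
    params = {"enabled": True}
    if hints:
        params["keyValuePairHints"] = [
            {"key": key, "valueTypes": list(types)} if types else {"key": key}
            for key, types in hints.items()
        ]
    return {
        "inputConfig": {
            "gcsSource": {"uri": input_uri},
            "mimeType": mime_type,
        },
        "documentType": "general",
        "formExtractionParams": params,
    }


body = build_process_request(
    "gs://cloud-samples-data/documentai/loan_form.pdf",
    hints={"Phone": ["PHONE_NUMBER"], "Contact": ["EMAIL", "NAME"]},
)
# Print the body; redirect the output to request.json to use it with curl.
print(json.dumps(body, indent=2))
```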

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://us-documentai.googleapis.com/v1beta2/projects/project-id/locations/us/documents:process

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://us-documentai.googleapis.com/v1beta2/projects/project-id/locations/us/documents:process" | Select-Object -Expand Content

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. The response body contains an instance of Document in its standard format.
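Form fields in the returned Document do not carry their text directly; each field name and value points at offsets in the document's text field. The sketch below reads those offsets from a parsed JSON response. It assumes the JSON field names mirror the ones used in the client-library samples on this page (pages, formFields, fieldName, fieldValue, textAnchor, textSegments), that a zero startIndex may be omitted, and that 64-bit offsets may arrive as strings. The sample dictionary is hand-written for illustration and is not real service output.

```python
def anchor_text(document, layout):
    """Return the text a layout element points at inside document["text"]."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    pieces = []
    for segment in segments:
        # startIndex may be missing when it is 0; int() also accepts the
        # string form that JSON can use for 64-bit integers.
        start = int(segment.get("startIndex", 0))
        end = int(segment.get("endIndex", 0))
        pieces.append(document["text"][start:end])
    return "".join(pieces)


def form_fields(document):
    """Yield (page_number, name, value) for every form field in the document."""
    for page in document.get("pages", []):
        for field in page.get("formFields", []):
            yield (
                page.get("pageNumber"),
                anchor_text(document, field.get("fieldName", {})).strip(),
                anchor_text(document, field.get("fieldValue", {})).strip(),
            )


# Hand-written stand-in for a parsed response.
sample = {
    "text": "Name: Ada Lovelace\nPhone: 555-0100\n",
    "pages": [{
        "pageNumber": 1,
        "formFields": [
            {
                "fieldName": {"textAnchor": {"textSegments": [
                    {"endIndex": "5"}]}},
                "fieldValue": {"textAnchor": {"textSegments": [
                    {"startIndex": "6", "endIndex": "18"}]}},
            },
            {
                "fieldName": {"textAnchor": {"textSegments": [
                    {"startIndex": "19", "endIndex": "25"}]}},
                "fieldValue": {"textAnchor": {"textSegments": [
                    {"startIndex": "26", "endIndex": "34"}]}},
            },
        ],
    }],
}

for page_number, name, value in form_fields(sample):
    print(page_number, name, value)
```

Unlike a helper that reads only the first text segment, this version joins every segment, so a field that wraps across lines is returned whole.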

Java


import com.google.cloud.documentai.v1beta2.Document;
import com.google.cloud.documentai.v1beta2.DocumentUnderstandingServiceClient;
import com.google.cloud.documentai.v1beta2.FormExtractionParams;
import com.google.cloud.documentai.v1beta2.GcsSource;
import com.google.cloud.documentai.v1beta2.InputConfig;
import com.google.cloud.documentai.v1beta2.KeyValuePairHint;
import com.google.cloud.documentai.v1beta2.ProcessDocumentRequest;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class ParseFormBeta {
  public static void parseForm() throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String inputGcsUri = "gs://your-gcs-bucket/path/to/input/file.pdf";
    parseForm(projectId, location, inputGcsUri);
  }

  public static void parseForm(String projectId, String location, String inputGcsUri)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DocumentUnderstandingServiceClient client = DocumentUnderstandingServiceClient.create()) {
      // Configure the request for processing the PDF
      String parent = String.format("projects/%s/locations/%s", projectId, location);

      // Improve form parsing results by providing key-value pair hints.
      // For each key hint, key is text that is likely to appear in the
      // document as a form field name (e.g. "DOB").
      // Value types are optional, but can be one or more of:
      // ADDRESS, LOCATION, ORGANIZATION, PERSON, PHONE_NUMBER, ID,
      // NUMBER, EMAIL, PRICE, TERMS, DATE, NAME
      KeyValuePairHint keyValuePairHint =
          KeyValuePairHint.newBuilder().setKey("Phone").addValueTypes("PHONE_NUMBER").build();
      KeyValuePairHint keyValuePairHint2 =
          KeyValuePairHint.newBuilder()
              .setKey("Contact")
              .addValueTypes("EMAIL")
              .addValueTypes("NAME")
              .build();

      // Setting enabled=True enables form extraction
      FormExtractionParams params =
          FormExtractionParams.newBuilder()
              .setEnabled(true)
              .addKeyValuePairHints(keyValuePairHint)
              .addKeyValuePairHints(keyValuePairHint2)
              .build();

      GcsSource uri = GcsSource.newBuilder().setUri(inputGcsUri).build();

      // mime_type can be application/pdf, image/tiff,
      // image/gif, or application/json
      InputConfig config =
          InputConfig.newBuilder().setGcsSource(uri).setMimeType("application/pdf").build();

      ProcessDocumentRequest request =
          ProcessDocumentRequest.newBuilder()
              .setParent(parent)
              .setFormExtractionParams(params)
              .setInputConfig(config)
              .build();

      // Recognizes text entities in the PDF document
      Document response = client.processDocument(request);

      // Get all of the document text as one big string
      String text = response.getText();

      // Process the output
      Document.Page page1 = response.getPages(0);
      for (Document.Page.FormField field : page1.getFormFieldsList()) {
        String fieldName = getText(field.getFieldName(), text);
        String fieldValue = getText(field.getFieldValue(), text);

        System.out.println("Extracted form fields pair:");
        System.out.printf("\t(%s, %s)%n", fieldName, fieldValue);
      }
    }
  }

  private static String getText(Document.Page.Layout layout, String text) {
    Document.TextAnchor textAnchor = layout.getTextAnchor();
    if (textAnchor.getTextSegmentsList().size() > 0) {
      int startIdx = (int) textAnchor.getTextSegments(0).getStartIndex();
      int endIdx = (int) textAnchor.getTextSegments(0).getEndIndex();
      return text.substring(startIdx, endIdx);
    }
    return "[NO TEXT]";
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const gcsInputUri = 'YOUR_SOURCE_PDF';

const {
  DocumentUnderstandingServiceClient,
} = require('@google-cloud/documentai');
const client = new DocumentUnderstandingServiceClient();

async function parseForm() {
  // Configure the request for processing the PDF
  const parent = `projects/${projectId}/locations/${location}`;
  const request = {
    parent,
    inputConfig: {
      gcsSource: {
        uri: gcsInputUri,
      },
      mimeType: 'application/pdf',
    },
    formExtractionParams: {
      enabled: true,
      keyValuePairHints: [
        {
          key: 'Phone',
          valueTypes: ['PHONE_NUMBER'],
        },
        {
          key: 'Contact',
          valueTypes: ['EMAIL', 'NAME'],
        },
      ],
    },
  };

  // Recognizes text entities in the PDF document
  const [result] = await client.processDocument(request);

  // Get all of the document text as one big string
  const {text} = result;

  // Extract shards from the text field
  const getText = textAnchor => {
    // First shard in document doesn't have startIndex property
    const startIndex = textAnchor.textSegments[0].startIndex || 0;
    const endIndex = textAnchor.textSegments[0].endIndex;

    return text.substring(startIndex, endIndex);
  };

  // Process the output
  const [page1] = result.pages;
  const {formFields} = page1;

  for (const field of formFields) {
    const fieldName = getText(field.fieldName.textAnchor);
    const fieldValue = getText(field.fieldValue.textAnchor);

    console.log('Extracted key value pair:');
    console.log(`\t(${fieldName}, ${fieldValue})`);
  }
}

Python

from google.cloud import documentai_v1beta2 as documentai


def parse_form(project_id='YOUR_PROJECT_ID',
               input_uri='gs://cloud-samples-data/documentai/form.pdf'):
    """Parse a form"""

    client = documentai.DocumentUnderstandingServiceClient()

    gcs_source = documentai.types.GcsSource(uri=input_uri)

    # mime_type can be application/pdf, image/tiff,
    # image/gif, or application/json
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')

    # Improve form parsing results by providing key-value pair hints.
    # For each key hint, key is text that is likely to appear in the
    # document as a form field name (e.g. "DOB").
    # Value types are optional, but can be one or more of:
    # ADDRESS, LOCATION, ORGANIZATION, PERSON, PHONE_NUMBER, ID,
    # NUMBER, EMAIL, PRICE, TERMS, DATE, NAME
    key_value_pair_hints = [
        documentai.types.KeyValuePairHint(key='Emergency Contact',
                                          value_types=['NAME']),
        documentai.types.KeyValuePairHint(
            key='Referred By')
    ]

    # Setting enabled=True enables form extraction
    form_extraction_params = documentai.types.FormExtractionParams(
        enabled=True, key_value_pair_hints=key_value_pair_hints)

    # Location can be 'us' or 'eu'
    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config,
        form_extraction_params=form_extraction_params)

    document = client.process_document(request=request)

    def _get_text(el):
        """Doc AI identifies form fields by their offsets
        in document text. This function converts offsets
        to text snippets.
        """
        response = ''
        # If a text segment spans several lines, it will
        # be stored in different text segments.
        for segment in el.text_anchor.text_segments:
            start_index = segment.start_index
            end_index = segment.end_index
            response += document.text[start_index:end_index]
        return response

    for page in document.pages:
        print('Page number: {}'.format(page.page_number))
        for form_field in page.form_fields:
            print('Field Name: {}\tConfidence: {}'.format(
                _get_text(form_field.field_name),
                form_field.field_name.confidence))
            print('Field Value: {}\tConfidence: {}'.format(
                _get_text(form_field.field_value),
                form_field.field_value.confidence))

Large file offline processing

Asynchronous ("offline") requests target longer documents and let you set the number of pages in each output file. An asynchronous request starts a long-running operation. When the operation finishes, it stores its output as JSON files in the Cloud Storage bucket that you specify.

Document AI asynchronous processing accepts PDF, TIFF, and GIF files of up to 2000 pages. Attempting to process larger files returns an error.
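To see how the pages-per-shard setting divides a document across output files, the sketch below computes the page range each file would cover. The helper name shard_page_ranges is hypothetical, and the file-name pattern in the loop is inferred from the single example name on this page (output-page-1-to-1.json), so treat it as an assumption for other shard sizes.

```python
def shard_page_ranges(total_pages, pages_per_shard):
    """Split pages 1..total_pages into consecutive (first, last) ranges."""
    if total_pages < 1 or pages_per_shard < 1:
        raise ValueError("both arguments must be positive")
    return [
        (first, min(first + pages_per_shard - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_shard)
    ]


for first, last in shard_page_ranges(total_pages=5, pages_per_shard=2):
    # File-name pattern assumed from the one example shown on this page.
    print(f"output-page-{first}-to-{last}.json")
```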

The following code samples show you how to process a document containing a form.

REST & CMD LINE

This sample shows how to send a POST request to the batchProcess method for large document asynchronous processing. The example uses the access token for a service account set up for the project using the Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see Before you begin.

The sample request body contains required fields (inputConfig, outputConfig) and optional fields, some for form-specific processing (formExtractionParams).

A batchProcess request starts a long-running operation and stores results in a Cloud Storage bucket. This sample also shows how to get the status of this long-running operation after it has started.

Send the process request

Before using any of the request data below, make the following replacements:

  • project-id: Your GCP project ID.
  • input-storage-file: The Cloud Storage URI of the document you want to process, including the gs:// prefix. You must have at least read access to the file. Example:
    • gs://cloud-samples-data/documentai/loan_form.pdf
  • output-storage-bucket: A Cloud Storage bucket/directory to save output files to, expressed in the following form:
    • gs://bucket/directory/
    The requesting user must have write permission to the bucket.
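Before sending a request, it can help to check that the input and output locations are well-formed Cloud Storage URIs. The following sketch splits a gs:// URI into its bucket and object path using only the Python standard library. The function name split_gcs_uri is hypothetical, and the check is purely syntactic: it does not confirm that the bucket exists or that you have permission to use it.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split gs://bucket/path into (bucket, path); path may be empty."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_gcs_uri("gs://bucket/directory/")
print(bucket, prefix)
```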

HTTP method and URL:

POST https://us-documentai.googleapis.com/v1beta2/projects/project-id/locations/us/documents:batchProcess

Request JSON body:

{
  "requests": [
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "input-storage-file"
        },
        "mimeType": "application/pdf"
      },
      "outputConfig": {
        "pagesPerShard": 1,
        "gcsDestination": {
          "uri": "output-storage-bucket"
        }
      },
      "documentType": "general",
      "formExtractionParams": {
        "enabled": true
      }
    }
  ]
}

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://us-documentai.googleapis.com/v1beta2/projects/project-id/locations/us/documents:batchProcess

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://us-documentai.googleapis.com/v1beta2/projects/project-id/locations/us/documents:batchProcess" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/project-id/operations/operation-id"
}

If the request is successful, Document AI returns the name of your operation.

Get the results

To get the results of your request, you must send a GET request to the operations resource. The following shows how to send such a request.

Before using any of the request data below, make the following replacements:

  • project-id: Your GCP project ID.
  • operation-id: ID of the operation returned from Document AI.

HTTP method and URL:

GET https://us-documentai.googleapis.com/v1beta2/projects/project-id/operations/operation-id

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
https://us-documentai.googleapis.com/v1beta2/projects/project-id/operations/operation-id

PowerShell

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://us-documentai.googleapis.com/v1beta2/projects/project-id/operations/operation-id" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/bucket-id/operations/4e2b314779b999b5",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1beta2.OperationMetadata",
    "state": "SUCCEEDED",
    "createTime": "2019-11-19T00:36:37.310474834Z",
    "updateTime": "2019-11-19T00:37:10.682615795Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1beta2.BatchProcessDocumentsResponse",
    "responses": [
      {
        "inputConfig": {
          "gcsSource": {
            "uri": "gs://input-file"
          },
          "mimeType": "application/pdf"
        },
        "outputConfig": {
          "gcsDestination": {
            "uri": "gs://output-bucket/"
          }
        }
      }
    ]
  }
}
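Because the operation may still be running when you first query it, you typically repeat the GET request until the response contains "done": true. The loop below is a minimal sketch of that pattern. The function wait_for_operation and its parameters are hypothetical; the fetch argument stands for whatever code performs the GET request and returns the parsed JSON, and it is injected here so the loop can be tried without network access.

```python
import time


def wait_for_operation(fetch, interval=5.0, timeout=300.0, sleep=time.sleep):
    """Call fetch() until the operation reports done, then return it.

    fetch is any callable that returns the parsed JSON of the
    operations GET request as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        operation = fetch()
        if operation.get("done"):
            # A finished long-running operation carries either a
            # response or an error.
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError("operation did not finish in time")
        sleep(interval)


# Stand-in for the GET request: two pending replies, then the finished one.
name = "projects/project-id/operations/operation-id"
replies = iter([
    {"name": name},
    {"name": name},
    {"name": name, "metadata": {"state": "SUCCEEDED"}, "done": True,
     "response": {}},
])
result = wait_for_operation(lambda: next(replies), sleep=lambda seconds: None)
print(result["metadata"]["state"])
```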

Processing output should look similar to the following example. The response body contains an instance of Document in its standard format with any information relevant to batch processing (shardInfo).

This output is for a publicly accessible PDF file (gs://cloud-samples-data/documentai/loan_form.pdf), with one page per shard. The output file is stored in the specified Cloud Storage bucket.

output-page-1-to-1.json:

Java


import com.google.api.gax.longrunning.OperationFuture;
import com.google.api.gax.paging.Page;
import com.google.cloud.documentai.v1beta2.BatchProcessDocumentsRequest;
import com.google.cloud.documentai.v1beta2.BatchProcessDocumentsResponse;
import com.google.cloud.documentai.v1beta2.Document;
import com.google.cloud.documentai.v1beta2.DocumentUnderstandingServiceClient;
import com.google.cloud.documentai.v1beta2.DocumentUnderstandingServiceSettings;
import com.google.cloud.documentai.v1beta2.FormExtractionParams;
import com.google.cloud.documentai.v1beta2.GcsDestination;
import com.google.cloud.documentai.v1beta2.GcsSource;
import com.google.cloud.documentai.v1beta2.InputConfig;
import com.google.cloud.documentai.v1beta2.KeyValuePairHint;
import com.google.cloud.documentai.v1beta2.OperationMetadata;
import com.google.cloud.documentai.v1beta2.OutputConfig;
import com.google.cloud.documentai.v1beta2.ProcessDocumentRequest;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import com.google.protobuf.util.JsonFormat;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BatchParseFormBeta {

  public static void batchParseFormGcs()
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String outputGcsBucketName = "your-gcs-bucket-name";
    String outputGcsPrefix = "PREFIX";
    String inputGcsUri = "gs://your-gcs-bucket/path/to/input/file.pdf";
    batchParseFormGcs(projectId, location, outputGcsBucketName, outputGcsPrefix, inputGcsUri);
  }

  public static void batchParseFormGcs(
      String projectId,
      String location,
      String outputGcsBucketName,
      String outputGcsPrefix,
      String inputGcsUri)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DocumentUnderstandingServiceClient client =
        DocumentUnderstandingServiceClient.create()) {

      // Configure the request for processing the PDF
      String parent = String.format("projects/%s/locations/%s", projectId, location);

      // Improve form parsing results by providing key-value pair hints.
      // For each key hint, key is text that is likely to appear in the
      // document as a form field name (e.g. "DOB").
      // Value types are optional, but can be one or more of:
      // ADDRESS, LOCATION, ORGANIZATION, PERSON, PHONE_NUMBER, ID,
      // NUMBER, EMAIL, PRICE, TERMS, DATE, NAME
      KeyValuePairHint keyValuePairHint =
          KeyValuePairHint.newBuilder().setKey("Phone").addValueTypes("PHONE_NUMBER").build();

      KeyValuePairHint keyValuePairHint2 =
          KeyValuePairHint.newBuilder()
              .setKey("Contact")
              .addValueTypes("EMAIL")
              .addValueTypes("NAME")
              .build();

      // Setting enabled=True enables form extraction
      FormExtractionParams params =
          FormExtractionParams.newBuilder()
              .setEnabled(true)
              .addKeyValuePairHints(keyValuePairHint)
              .addKeyValuePairHints(keyValuePairHint2)
              .build();

      GcsSource inputUri = GcsSource.newBuilder().setUri(inputGcsUri).build();

      // mime_type can be application/pdf, image/tiff,
      // image/gif, or application/json
      InputConfig config =
          InputConfig.newBuilder().setGcsSource(inputUri)
                  .setMimeType("application/pdf").build();

      GcsDestination gcsDestination = GcsDestination.newBuilder()
              .setUri(String.format("gs://%s/%s", outputGcsBucketName, outputGcsPrefix)).build();

      OutputConfig outputConfig =  OutputConfig.newBuilder()
              .setGcsDestination(gcsDestination)
              .setPagesPerShard(1)
              .build();

      ProcessDocumentRequest request =
          ProcessDocumentRequest.newBuilder()
              .setFormExtractionParams(params)
              .setInputConfig(config)
              .setOutputConfig(outputConfig)
              .build();

      BatchProcessDocumentsRequest requests =
          BatchProcessDocumentsRequest.newBuilder().addRequests(request).setParent(parent).build();

      // Batch process document using a long-running operation.
      OperationFuture<BatchProcessDocumentsResponse, OperationMetadata> future =
          client.batchProcessDocumentsAsync(requests);

      // Wait for operation to complete.
      System.out.println("Waiting for operation to complete...");
      future.get(300, TimeUnit.SECONDS);

      System.out.println("Document processing complete.");

      Storage storage = StorageOptions.newBuilder().setProjectId(projectId).build().getService();
      Bucket bucket = storage.get(outputGcsBucketName);

      // List all of the files in the Storage bucket.
      Page<Blob> blobs =
          bucket.list(
              Storage.BlobListOption.currentDirectory(),
              Storage.BlobListOption.prefix(outputGcsPrefix));

      int idx = 0;
      for (Blob blob : blobs.iterateAll()) {
        if (!blob.isDirectory()) {
          System.out.printf("Fetched file #%d\n", ++idx);
          // Read the results

          // Download and store json data in a temp file.
          File tempFile = File.createTempFile("file", ".json");
          Blob fileInfo = storage.get(BlobId.of(outputGcsBucketName, blob.getName()));
          fileInfo.downloadTo(tempFile.toPath());

          // Parse json file into Document.
          // Parse the JSON file into a Document, closing the reader afterwards.
          Document.Builder builder = Document.newBuilder();
          try (FileReader reader = new FileReader(tempFile)) {
            JsonFormat.parser().merge(reader, builder);
          }

          Document document = builder.build();

          // Get all of the document text as one big string.
          String text = document.getText();

          // Process the output.
          Document.Page page1 = document.getPages(0);
          for (Document.Page.FormField field : page1.getFormFieldsList()) {
            String fieldName = getText(field.getFieldName(), text);
            String fieldValue = getText(field.getFieldValue(), text);

            System.out.println("Extracted form fields pair:");
            System.out.printf("\t(%s, %s)%n", fieldName, fieldValue);
          }

          // Clean up temp file.
          tempFile.deleteOnExit();
        }
      }
    }
  }

  private static String getText(Document.Page.Layout layout, String text) {
    Document.TextAnchor textAnchor = layout.getTextAnchor();
    if (textAnchor.getTextSegmentsList().size() > 0) {
      int startIdx = (int) textAnchor.getTextSegments(0).getStartIndex();
      int endIdx = (int) textAnchor.getTextSegments(0).getEndIndex();
      return text.substring(startIdx, endIdx);
    }
    return "[NO TEXT]";
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const gcsOutputUri = 'YOUR_STORAGE_BUCKET';
// const gcsOutputUriPrefix = 'YOUR_STORAGE_PREFIX';
// const gcsInputUri = 'GCS URI of the PDF to process';

// Imports the Google Cloud client library
const {
  DocumentUnderstandingServiceClient,
} = require('@google-cloud/documentai');
const {Storage} = require('@google-cloud/storage');

const client = new DocumentUnderstandingServiceClient();
const storage = new Storage();

async function parseFormGCS(inputUri, outputUri, outputUriPrefix) {
  const parent = `projects/${projectId}/locations/${location}`;

  // Configure the batch process request.
  const request = {
    inputConfig: {
      gcsSource: {
        uri: inputUri,
      },
      mimeType: 'application/pdf',
    },
    outputConfig: {
      gcsDestination: {
        uri: `${outputUri}/${outputUriPrefix}/`,
      },
      pagesPerShard: 1,
    },
    formExtractionParams: {
      enabled: true,
      keyValuePairHints: [
        {
          key: 'Phone',
          valueTypes: ['PHONE_NUMBER'],
        },
        {
          key: 'Contact',
          valueTypes: ['EMAIL', 'NAME'],
        },
      ],
    },
  };

  // Configure the request for batch process
  const requests = {
    parent,
    requests: [request],
  };

  // Batch process document using a long-running operation.
  // You can wait for now, or get results later.
  const [operation] = await client.batchProcessDocuments(requests);

  // Wait for operation to complete.
  await operation.promise();

  console.log('Document processing complete.');

  // Query Storage bucket for the results file(s).
  const query = {
    prefix: outputUriPrefix,
  };

  console.log('Fetching results ...');

  // List all of the files in the Storage bucket
  const [files] = await storage.bucket(outputUri).getFiles(query);

  files.forEach(async (fileInfo, index) => {
    // Get the file as a buffer
    const [file] = await fileInfo.download();

    console.log(`Fetched file #${index + 1}:`);

    // Read the results
    const results = JSON.parse(file.toString());

    // Get all of the document text as one big string.
    const {text} = results;

    // Utility to extract text anchors from text field.
    const getText = textAnchor => {
      const startIndex = textAnchor.textSegments[0].startIndex || 0;
      const endIndex = textAnchor.textSegments[0].endIndex;

      return `\t${text.substring(startIndex, endIndex)}`;
    };

    // Process the output
    const [page1] = results.pages;
    const formFields = page1.formFields;

    for (const field of formFields) {
      const fieldName = getText(field.fieldName.textAnchor);
      const fieldValue = getText(field.fieldValue.textAnchor);

      console.log('Extracted key value pair:');
      console.log(`\t(${fieldName}, ${fieldValue})`);
    }
  });
}

Python

import re

from google.cloud import documentai_v1beta2 as documentai
from google.cloud import storage


def batch_parse_form(
        project_id='YOUR_PROJECT_ID',
        input_uri='gs://cloud-samples-data/documentai/form.pdf',
        destination_uri='gs://your-bucket-id/path/to/save/results/'):
    """Parse a form"""

    client = documentai.DocumentUnderstandingServiceClient()

    gcs_source = documentai.types.GcsSource(uri=input_uri)

    # mime_type can be application/pdf, image/tiff,
    # image/gif, or application/json
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')

    # where to write results
    output_config = documentai.types.OutputConfig(
        gcs_destination=documentai.types.GcsDestination(
            uri=destination_uri),
        pages_per_shard=1  # Map one doc page to one output page
    )

    # Improve form parsing results by providing key-value pair hints.
    # For each key hint, key is text that is likely to appear in the
    # document as a form field name (e.g. "DOB").
    # Value types are optional, but can be one or more of:
    # ADDRESS, LOCATION, ORGANIZATION, PERSON, PHONE_NUMBER, ID,
    # NUMBER, EMAIL, PRICE, TERMS, DATE, NAME
    key_value_pair_hints = [
        documentai.types.KeyValuePairHint(
            key='Emergency Contact',
            value_types=['NAME']),
        documentai.types.KeyValuePairHint(
            key='Referred By')
    ]

    # Setting enabled=True enables form extraction
    form_extraction_params = documentai.types.FormExtractionParams(
        enabled=True, key_value_pair_hints=key_value_pair_hints)

    # Location can be 'us' or 'eu'
    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        input_config=input_config,
        output_config=output_config,
        form_extraction_params=form_extraction_params)

    # Add each ProcessDocumentRequest to the batch request
    requests = []
    requests.append(request)

    batch_request = documentai.types.BatchProcessDocumentsRequest(
        parent=parent, requests=requests
    )

    operation = client.batch_process_documents(batch_request)

    # Wait for the operation to finish
    operation.result()

    # Results are written to GCS. Use a regex to find
    # output files
    match = re.match(r'gs://([^/]+)/(.+)', destination_uri)
    output_bucket = match.group(1)
    prefix = match.group(2)

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(output_bucket)
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)