PDF/TIFF Document Text Detection

The Vision API can detect and transcribe text from PDF and TIFF files stored in Google Cloud Storage.

Document text detection from PDF and TIFF must be requested using the asyncBatchAnnotate function, which performs an asynchronous request and provides its status using the operations resources.

Output from a PDF/TIFF request is written to a JSON file created in the specified Google Cloud Storage bucket.

Pricing for PDF/TIFF document text detection is at the DOCUMENT_TEXT_DETECTION rate. Please see the Pricing page for details.

Limitations

The Vision API accepts PDF/TIFF files up to 2000 pages. Larger files will return an error.

Authentication

API keys are not supported for asyncBatchAnnotate requests. See Using a service account for instructions on authenticating with a service account.

The account used for authentication must have access to the Cloud Storage bucket that you specify for the output (roles/editor or roles/storage.objectCreator or above).

You can use an API key to query the status of the operation; see Using an API key for instructions.

Sample code

Protocol

To perform PDF/TIFF document text detection, make a POST request and provide the appropriate request body:

POST https://vision.googleapis.com/v1p2beta1/files:asyncBatchAnnotate
Authorization: Bearer ACCESS_TOKEN

{
  "requests":[
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "gs://bucket-name-123/vision-api.pdf"
        },
        "mimeType": "application/pdf"
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ],
      "outputConfig": {
        "gcsDestination": {
          "uri": "gs://your-bucket-name/folder/"
        },
        "batchSize": 2
      }
    }
  ]
}

Where:

  • inputConfig replaces the image field used in other Vision API requests. It contains two child fields: gcsSource.uri is the Google Cloud Storage URI of the PDF or TIFF file (accessible to the user or service account making the request); mimeType is one of application/pdf or application/tiff.

  • outputConfig specifies the Cloud Storage bucket/folder to write the output to. It must contain a Google Cloud Storage URI as the value of the gcsDestination.uri field. The bucket must be writeable by the user or service account making the request. The filename will be output-x-to-y, where x and y represent the PDF/TIFF page numbers included in that output file. If the file exists, its contents will be overwritten. The batchSize field specifies how many pages of output should be included in each output JSON file.

Response

A successful asyncBatchAnnotate request returns a response with a single name field:

{
  "name": "operations/1b0390d223f4a5af"
}

This name represents a long-running operation, which can be queried using the v1.operations API.

To retrieve your Vision annotation response, send a GET request to the v1.operations endpoint, passing the value of name in the URL.

GET https://vision.googleapis.com/v1/operations/1b0390d223f4a5af?key=YOUR_API_KEY

If the operation is in progress:

{
  "name": "operations/1b0390d223f4ab8a",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.vision.v1.OperationMetadata",
    "state": "RUNNING",
    "updateTime": "2017-12-07T00:51:22.225232809Z"
  }
}

Once the operation has completed, the state shows as DONE and your results are written to the Google Cloud Storage file you specified:

{
  "name": "operations/1b0390d223f4ab8a",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.vision.v1p2beta1.OperationMetadata",
    "state": "DONE",
    "updateTime": "2017-12-07T01:36:08.757594684Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.vision.v1p2beta1.AsyncBatchAnnotateFilesResponse",
    "outputConfig": {
      "gcsDestination": {
        "uri": "gs://your-bucket-name/folder/"
      }
    }
  }
}

The JSON in your output file is similar to that of an image's document text detection request, with the addition of a context field showing the location of the PDF or TIFF that was specified and the number of pages in the file:

...
"context": {
  "uri": "gs://bucket-name-123/vision-api.pdf",
  "pageNumber": 1
}
...

Java

For more on installing and creating a Vision API client, refer to Vision API Client Libraries.

/**
 * Performs document text OCR with PDF/TIFF as source files on Google Cloud Storage.
 *
 * @param gcsSourcePath The path to the remote file on Google Cloud Storage to detect document
 *                      text on.
 * @param gcsDestinationPath The path to the remote file on Google Cloud Storage to store the
 *                           results on.
 * @throws Exception on errors while closing the client.
 */
public static void detectDocumentsGcs(String gcsSourcePath, String gcsDestinationPath) throws
    Exception {
  try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
    List<AsyncAnnotateFileRequest> requests = new ArrayList<>();

    // Set the GCS source path for the remote file.
    GcsSource gcsSource = GcsSource.newBuilder()
        .setUri(gcsSourcePath)
        .build();

    // Create the configuration with the specified MIME (Multipurpose Internet Mail Extensions)
    // types
    InputConfig inputConfig = InputConfig.newBuilder()
        .setMimeType("application/pdf") // Supported MimeTypes: "application/pdf", "image/tiff"
        .setGcsSource(gcsSource)
        .build();

    // Set the GCS destination path for where to save the results.
    GcsDestination gcsDestination = GcsDestination.newBuilder()
        .setUri(gcsDestinationPath)
        .build();

    // Create the configuration for the output with the batch size.
    // The batch size sets how many pages should be grouped into each json output file.
    OutputConfig outputConfig = OutputConfig.newBuilder()
        .setBatchSize(2)
        .setGcsDestination(gcsDestination)
        .build();

    // Select the Feature required by the vision API
    Feature feature = Feature.newBuilder().setType(Feature.Type.DOCUMENT_TEXT_DETECTION).build();

    // Build the OCR request
    AsyncAnnotateFileRequest request = AsyncAnnotateFileRequest.newBuilder()
        .addFeatures(feature)
        .setInputConfig(inputConfig)
        .setOutputConfig(outputConfig)
        .build();

    requests.add(request);

    // Perform the OCR request
    OperationFuture<AsyncBatchAnnotateFilesResponse, OperationMetadata> response =
        client.asyncBatchAnnotateFilesAsync(requests);

    System.out.println("Waiting for the operation to finish.");

    // Wait for the request to finish. (The result is not used, since the API saves the result to
    // the specified location on GCS.)
    List<AsyncAnnotateFileResponse> result = response.get(180, TimeUnit.SECONDS)
        .getResponsesList();

    // Once the request has completed and the output has been
    // written to GCS, we can list all the output files.
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Get the destination location from the gcsDestinationPath
    Pattern pattern = Pattern.compile("gs://([^/]+)/(.+)");
    Matcher matcher = pattern.matcher(gcsDestinationPath);

    if (matcher.find()) {
      String bucketName = matcher.group(1);
      String prefix = matcher.group(2);

      // Get the list of objects with the given prefix from the GCS bucket
      Bucket bucket = storage.get(bucketName);
      com.google.api.gax.paging.Page<Blob> pageList = bucket.list(BlobListOption.prefix(prefix));

      Blob firstOutputFile = null;

      // List objects with the given prefix.
      System.out.println("Output files:");
      for (Blob blob : pageList.iterateAll()) {
        System.out.println(blob.getName());

        // Process the first output file from GCS.
        // Since we specified batch size = 2, the first response contains
        // the first two pages of the input file.
        if (firstOutputFile == null) {
          firstOutputFile = blob;
        }
      }

      // Get the contents of the file and convert the JSON contents to an AnnotateFileResponse
      // object. If the Blob is small read all its content in one request
      // (Note: the file is a .json file)
      // Storage guide: https://cloud.google.com/storage/docs/downloading-objects
      String jsonContents = new String(firstOutputFile.getContent());
      Builder builder = AnnotateFileResponse.newBuilder();
      JsonFormat.parser().merge(jsonContents, builder);

      // Build the AnnotateFileResponse object
      AnnotateFileResponse annotateFileResponse = builder.build();

      // Parse through the object to get the actual response for the first page of the input file.
      AnnotateImageResponse annotateImageResponse = annotateFileResponse.getResponses(0);

      // Here we print the full text from the first page.
      // The response contains more information:
      // annotation/pages/blocks/paragraphs/words/symbols
      // including confidence score and bounding boxes
      System.out.format("\nText: %s\n", annotateImageResponse.getFullTextAnnotation().getText());
    } else {
      System.out.println("No MATCH");
    }
  }
}

Node.js

For more on installing and creating a Vision API client, refer to Vision API Client Libraries.

// Imports the Google Cloud client libraries
const vision = require('@google-cloud/vision').v1p2beta1;

// Creates a client
const client = new vision.ImageAnnotatorClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// Bucket where the file resides
// const bucketName = 'my-bucket';
// Path to PDF file within bucket
// const fileName = 'path/to/document.pdf';

const gcsSourceUri = `gs://${bucketName}/${fileName}`;
const gcsDestinationUri = `gs://${bucketName}/${fileName}.json`;

const inputConfig = {
  // Supported mime_types are: 'application/pdf' and 'image/tiff'
  mimeType: 'application/pdf',
  gcsSource: {
    uri: gcsSourceUri,
  },
};
const outputConfig = {
  gcsDestination: {
    uri: gcsDestinationUri,
  },
};
const features = [{type: 'DOCUMENT_TEXT_DETECTION'}];
const request = {
  requests: [
    {
      inputConfig: inputConfig,
      features: features,
      outputConfig: outputConfig,
    },
  ],
};

client
  .asyncBatchAnnotateFiles(request)
  .then(results => {
    const operation = results[0];
    // Get a Promise representation of the final result of the job
    operation
      .promise()
      .then(filesResponse => {
        let destinationUri =
          filesResponse[0].responses[0].outputConfig.gcsDestination.uri;
        console.log('Json saved to: ' + destinationUri);
      })
      .catch(function(error) {
        console.log(error);
      });
  })
  .catch(function(error) {
    console.log(error);
  });

Python

For more on installing and creating a Vision API client, refer to Vision API Client Libraries.

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    # With a file of 5 pages
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=90)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name=bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json_format.Parse(
        json_string, vision.types.AnnotateFileResponse())

    # The actual response for the first page of the input file.
    first_page_response = response.responses[0]
    annotation = first_page_response.full_text_annotation

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print(u'Full text:\n{}'.format(
        annotation.text))

Was this page helpful? Let us know how we did:

Send feedback about...

Cloud Vision API Documentation