Detect text in files (PDF/TIFF)

The Vision API can detect and transcribe text from PDF and TIFF files stored in Google Cloud Storage.

Document text detection from PDF and TIFF must be requested using the files:asyncBatchAnnotate method, which performs an offline (asynchronous) request and reports its status through the operations resource.

Output from a PDF/TIFF request is written to a JSON file created in the specified Google Cloud Storage bucket.

Limitations

The Vision API accepts PDF/TIFF files up to 2000 pages. Larger files will return an error.

Authentication

API keys are not supported for files:asyncBatchAnnotate requests. See Using a service account for instructions on authenticating with a service account.

The account used for authentication must have access to the Cloud Storage bucket that you specify for the output (roles/editor, roles/storage.objectCreator, or above).

You can use an API key to query the status of the operation; see Using an API key for instructions.
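
For example, if you use one of the client libraries shown later on this page, you can authenticate explicitly with a downloaded service account key file instead of relying on application default credentials. The following Python sketch is a minimal illustration; the key file path is a placeholder:

from google.cloud import vision
from google.oauth2 import service_account

# Placeholder path to a downloaded service account key file. The service
# account must be able to write to the output Cloud Storage bucket
# (for example, roles/storage.objectCreator or above).
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)

# The client sends the files:asyncBatchAnnotate request with these credentials.
client = vision.ImageAnnotatorClient(credentials=credentials)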

Document text detection requests

Currently PDF/TIFF document detection is only available for files stored in Google Cloud Storage buckets. Response JSON files are similarly saved to a Google Cloud Storage bucket.

Sample file: the 2010 US census PDF page at gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf (source: United States Census Bureau).

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • cloud-storage-bucket: a Google Cloud Storage bucket/directory to save output files to, expressed in the following form:
    • gs://bucket/directory
    The requesting user must have write permission to the bucket.
  • cloud-storage-image-uri: the path to a valid PDF or TIFF file in a Google Cloud Storage bucket. You must have at least read privileges to the file. Example:
    • gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf

Field-specific considerations:

  • inputConfig - replaces the image field used in other Vision API requests. It contains two child fields:
    • gcsSource.uri - the Google Cloud Storage URI of the PDF or TIFF file (accessible to the user or service account making the request).
    • mimeType - one of the accepted file types: application/pdf or image/tiff.
  • outputConfig - specifies output details. It contains two child fields:
    • gcsDestination.uri - a valid Google Cloud Storage URI. The bucket must be writable by the user or service account making the request. The filename will be output-x-to-y, where x and y represent the PDF/TIFF page numbers included in that output file. If the file exists, its contents will be overwritten.
    • batchSize - specifies how many pages of output should be included in each output JSON file. For example, a five-page PDF processed with a batchSize of 2 produces three output files: output-1-to-2, output-3-to-4, and output-5-to-5.

HTTP method and URL:

POST https://vision.googleapis.com/v1/files:asyncBatchAnnotate

Request JSON body:

{
  "requests":[
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "cloud-storage-image-uri"
        },
        "mimeType": "application/pdf"
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ],
      "outputConfig": {
        "gcsDestination": {
          "uri": "cloud-storage-bucket"
        },
        "batchSize": 1
      }
    }
  ]
}

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://vision.googleapis.com/v1/files:asyncBatchAnnotate

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ Authorization = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://vision.googleapis.com/v1/files:asyncBatchAnnotate" | Select-Object -Expand Content

Response:

A successful asyncBatchAnnotate request returns a response with a single name field:

{
  "name": "projects/usable-auth-library/operations/1efec2285bd442df"
}

This name represents a long-running operation with an associated ID (for example, 1efec2285bd442df), which can be queried using the v1.operations API.

To retrieve your Vision annotation response, send a GET request to the v1.operations endpoint, passing the operation ID in the URL:

GET https://vision.googleapis.com/v1/operations/operation-id

For example:

curl -X GET -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json" \
https://vision.googleapis.com/v1/operations/1efec2285bd442df

If the operation is in progress:

{
  "name": "operations/1efec2285bd442df",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.vision.v1.OperationMetadata",
    "state": "RUNNING",
    "createTime": "2019-05-15T21:10:08.401917049Z",
    "updateTime": "2019-05-15T21:10:33.700763554Z"
  }
}

Once the operation has completed, the state shows as DONE and your results are written to the Google Cloud Storage file you specified:

{
  "name": "operations/1efec2285bd442df",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.vision.v1.OperationMetadata",
    "state": "DONE",
    "createTime": "2019-05-15T20:56:30.622473785Z",
    "updateTime": "2019-05-15T20:56:41.666379749Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.vision.v1.AsyncBatchAnnotateFilesResponse",
    "responses": [
      {
        "outputConfig": {
          "gcsDestination": {
            "uri": "gs://your-bucket-name/folder/"
          },
          "batchSize": 1
        }
      }
    ]
  }
}

The JSON in your output file is similar to that of an image's [document text detection request](/vision/docs/ocr), with the addition of a context field showing the location of the PDF or TIFF that was specified and the number of pages in the file:

Example output file: output-1-to-1.json
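
As a rough illustration (not part of the official samples), the following Python sketch downloads an output file such as the one named above from Cloud Storage and prints the context and detected text for each page. The bucket and object names are placeholders:

import json

from google.cloud import storage

# Placeholders: use the bucket and output file written by your own request.
bucket_name = "your-bucket-name"
object_name = "folder/output-1-to-1.json"

storage_client = storage.Client()
blob = storage_client.bucket(bucket_name).blob(object_name)
output = json.loads(blob.download_as_bytes())

# Each entry in "responses" holds the annotation for one page of the input file.
for page_response in output["responses"]:
    context = page_response.get("context", {})
    print("Source file: {}".format(context.get("uri")))
    print("Page number: {}".format(context.get("pageNumber")))
    print(page_response.get("fullTextAnnotation", {}).get("text", ""))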

C#

Before trying this sample, follow the C# setup instructions in the Vision API Quickstart Using Client Libraries. For more information, see the Vision API C# API reference documentation.

private static object DetectDocument(string gcsSourceUri,
    string gcsDestinationBucketName, string gcsDestinationPrefixName)
{
    var client = ImageAnnotatorClient.Create();

    var asyncRequest = new AsyncAnnotateFileRequest
    {
        InputConfig = new InputConfig
        {
            GcsSource = new GcsSource
            {
                Uri = gcsSourceUri
            },
            // Supported mime_types are: 'application/pdf' and 'image/tiff'
            MimeType = "application/pdf"
        },
        OutputConfig = new OutputConfig
        {
            // How many pages should be grouped into each json output file.
            BatchSize = 2,
            GcsDestination = new GcsDestination
            {
                Uri = $"gs://{gcsDestinationBucketName}/{gcsDestinationPrefixName}"
            }
        }
    };

    asyncRequest.Features.Add(new Feature
    {
        Type = Feature.Types.Type.DocumentTextDetection
    });

    List<AsyncAnnotateFileRequest> requests =
        new List<AsyncAnnotateFileRequest>();
    requests.Add(asyncRequest);

    var operation = client.AsyncBatchAnnotateFiles(requests);

    Console.WriteLine("Waiting for the operation to finish");

    operation.PollUntilCompleted();

    // Once the request has completed and the output has been
    // written to GCS, we can list all the output files.
    var storageClient = StorageClient.Create();

    // List objects with the given prefix.
    var blobList = storageClient.ListObjects(gcsDestinationBucketName,
        gcsDestinationPrefixName);
    Console.WriteLine("Output files:");
    foreach (var blob in blobList)
    {
        Console.WriteLine(blob.Name);
    }

    // Process the first output file from GCS.
    // Select the first JSON file from the objects in the list.
    var output = blobList.Where(x => x.Name.Contains(".json")).First();

    var jsonString = "";
    using (var stream = new MemoryStream())
    {
        storageClient.DownloadObject(output, stream);
        jsonString = System.Text.Encoding.UTF8.GetString(stream.ToArray());
    }

    var response = JsonParser.Default
                .Parse<AnnotateFileResponse>(jsonString);

    // The actual response for the first page of the input file.
    var firstPageResponses = response.Responses[0];
    var annotation = firstPageResponses.FullTextAnnotation;

    // Here we print the full text from the first page.
    // The response contains more information:
    // annotation/pages/blocks/paragraphs/words/symbols
    // including confidence scores and bounding boxes
    Console.WriteLine($"Full text: \n {annotation.Text}");

    return 0;
}

Go

Before trying this sample, follow the Go setup instructions in the Vision API Quickstart Using Client Libraries. For more information, see the Vision API Go API reference documentation.


// detectAsyncDocumentURI performs Optical Character Recognition (OCR) on a
// PDF file stored in GCS.
func detectAsyncDocumentURI(w io.Writer, gcsSourceURI, gcsDestinationURI string) error {
	ctx := context.Background()

	client, err := vision.NewImageAnnotatorClient(ctx)
	if err != nil {
		return err
	}

	request := &visionpb.AsyncBatchAnnotateFilesRequest{
		Requests: []*visionpb.AsyncAnnotateFileRequest{
			{
				Features: []*visionpb.Feature{
					{
						Type: visionpb.Feature_DOCUMENT_TEXT_DETECTION,
					},
				},
				InputConfig: &visionpb.InputConfig{
					GcsSource: &visionpb.GcsSource{Uri: gcsSourceURI},
					// Supported MimeTypes are: "application/pdf" and "image/tiff".
					MimeType: "application/pdf",
				},
				OutputConfig: &visionpb.OutputConfig{
					GcsDestination: &visionpb.GcsDestination{Uri: gcsDestinationURI},
					// How many pages should be grouped into each json output file.
					BatchSize: 2,
				},
			},
		},
	}

	operation, err := client.AsyncBatchAnnotateFiles(ctx, request)
	if err != nil {
		return err
	}

	fmt.Fprintf(w, "Waiting for the operation to finish.")

	resp, err := operation.Wait(ctx)
	if err != nil {
		return err
	}

	fmt.Fprintf(w, "%v", resp)

	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the Vision API Quickstart Using Client Libraries. For more information, see the Vision API Java API reference documentation.

/**
 * Performs document text OCR with PDF/TIFF as source files on Google Cloud Storage.
 *
 * @param gcsSourcePath The path to the remote file on Google Cloud Storage to detect document
 *                      text on.
 * @param gcsDestinationPath The path to the remote file on Google Cloud Storage to store the
 *                           results on.
 * @throws Exception on errors while closing the client.
 */
public static void detectDocumentsGcs(String gcsSourcePath, String gcsDestinationPath) throws
    Exception {
  try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
    List<AsyncAnnotateFileRequest> requests = new ArrayList<>();

    // Set the GCS source path for the remote file.
    GcsSource gcsSource = GcsSource.newBuilder()
        .setUri(gcsSourcePath)
        .build();

    // Create the configuration with the specified MIME (Multipurpose Internet Mail Extensions)
    // types
    InputConfig inputConfig = InputConfig.newBuilder()
        .setMimeType("application/pdf") // Supported MimeTypes: "application/pdf", "image/tiff"
        .setGcsSource(gcsSource)
        .build();

    // Set the GCS destination path for where to save the results.
    GcsDestination gcsDestination = GcsDestination.newBuilder()
        .setUri(gcsDestinationPath)
        .build();

    // Create the configuration for the output with the batch size.
    // The batch size sets how many pages should be grouped into each json output file.
    OutputConfig outputConfig = OutputConfig.newBuilder()
        .setBatchSize(2)
        .setGcsDestination(gcsDestination)
        .build();

    // Select the Feature required by the vision API
    Feature feature = Feature.newBuilder().setType(Feature.Type.DOCUMENT_TEXT_DETECTION).build();

    // Build the OCR request
    AsyncAnnotateFileRequest request = AsyncAnnotateFileRequest.newBuilder()
        .addFeatures(feature)
        .setInputConfig(inputConfig)
        .setOutputConfig(outputConfig)
        .build();

    requests.add(request);

    // Perform the OCR request
    OperationFuture<AsyncBatchAnnotateFilesResponse, OperationMetadata> response =
        client.asyncBatchAnnotateFilesAsync(requests);

    System.out.println("Waiting for the operation to finish.");

    // Wait for the request to finish. (The result is not used, since the API saves the result to
    // the specified location on GCS.)
    List<AsyncAnnotateFileResponse> result = response.get(180, TimeUnit.SECONDS)
        .getResponsesList();

    // Once the request has completed and the output has been
    // written to GCS, we can list all the output files.
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Get the destination location from the gcsDestinationPath
    Pattern pattern = Pattern.compile("gs://([^/]+)/(.+)");
    Matcher matcher = pattern.matcher(gcsDestinationPath);

    if (matcher.find()) {
      String bucketName = matcher.group(1);
      String prefix = matcher.group(2);

      // Get the list of objects with the given prefix from the GCS bucket
      Bucket bucket = storage.get(bucketName);
      com.google.api.gax.paging.Page<Blob> pageList = bucket.list(BlobListOption.prefix(prefix));

      Blob firstOutputFile = null;

      // List objects with the given prefix.
      System.out.println("Output files:");
      for (Blob blob : pageList.iterateAll()) {
        System.out.println(blob.getName());

        // Process the first output file from GCS.
        // Since we specified batch size = 2, the first response contains
        // the first two pages of the input file.
        if (firstOutputFile == null) {
          firstOutputFile = blob;
        }
      }

      // Get the contents of the file and convert the JSON contents to an AnnotateFileResponse
      // object. If the Blob is small read all its content in one request
      // (Note: the file is a .json file)
      // Storage guide: https://cloud.google.com/storage/docs/downloading-objects
      String jsonContents = new String(firstOutputFile.getContent());
      Builder builder = AnnotateFileResponse.newBuilder();
      JsonFormat.parser().merge(jsonContents, builder);

      // Build the AnnotateFileResponse object
      AnnotateFileResponse annotateFileResponse = builder.build();

      // Parse through the object to get the actual response for the first page of the input file.
      AnnotateImageResponse annotateImageResponse = annotateFileResponse.getResponses(0);

      // Here we print the full text from the first page.
      // The response contains more information:
      // annotation/pages/blocks/paragraphs/words/symbols
      // including confidence score and bounding boxes
      System.out.format("\nText: %s\n", annotateImageResponse.getFullTextAnnotation().getText());
    } else {
      System.out.println("No MATCH");
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vision API Quickstart Using Client Libraries. For more information, see the Vision API Node.js API reference documentation.


// Imports the Google Cloud client libraries
const vision = require('@google-cloud/vision').v1;

// Creates a client
const client = new vision.ImageAnnotatorClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// Bucket where the file resides
// const bucketName = 'my-bucket';
// Path to PDF file within bucket
// const fileName = 'path/to/document.pdf';
// The folder to store the results
// const outputPrefix = 'results'

const gcsSourceUri = `gs://${bucketName}/${fileName}`;
const gcsDestinationUri = `gs://${bucketName}/${outputPrefix}/`;

const inputConfig = {
  // Supported mime_types are: 'application/pdf' and 'image/tiff'
  mimeType: 'application/pdf',
  gcsSource: {
    uri: gcsSourceUri,
  },
};
const outputConfig = {
  gcsDestination: {
    uri: gcsDestinationUri,
  },
};
const features = [{type: 'DOCUMENT_TEXT_DETECTION'}];
const request = {
  requests: [
    {
      inputConfig: inputConfig,
      features: features,
      outputConfig: outputConfig,
    },
  ],
};

const [operation] = await client.asyncBatchAnnotateFiles(request);
const [filesResponse] = await operation.promise();
const destinationUri =
  filesResponse.responses[0].outputConfig.gcsDestination.uri;
console.log('Json saved to: ' + destinationUri);

PHP

Before trying this sample, follow the PHP setup instructions in the Vision API Quickstart Using Client Libraries. For more information, see the Vision API PHP API reference documentation.

namespace Google\Cloud\Samples\Vision;

use Google\Cloud\Storage\StorageClient;
use Google\Cloud\Vision\V1\AnnotateFileResponse;
use Google\Cloud\Vision\V1\AsyncAnnotateFileRequest;
use Google\Cloud\Vision\V1\Feature;
use Google\Cloud\Vision\V1\Feature\Type;
use Google\Cloud\Vision\V1\GcsDestination;
use Google\Cloud\Vision\V1\GcsSource;
use Google\Cloud\Vision\V1\ImageAnnotatorClient;
use Google\Cloud\Vision\V1\InputConfig;
use Google\Cloud\Vision\V1\OutputConfig;

// $path = 'gs://path/to/your/document.pdf'
// $output = 'gs://path/to/store/results/'

function detect_pdf_gcs($path, $output)
{
    # select ocr feature
    $feature = (new Feature())
        ->setType(Type::DOCUMENT_TEXT_DETECTION);

    # set $path (file to OCR) as source
    $gcsSource = (new GcsSource())
        ->setUri($path);
    # supported mime_types are: 'application/pdf' and 'image/tiff'
    $mimeType = 'application/pdf';
    $inputConfig = (new InputConfig())
        ->setGcsSource($gcsSource)
        ->setMimeType($mimeType);

    # set $output as destination
    $gcsDestination = (new GcsDestination())
        ->setUri($output);
    # how many pages should be grouped into each json output file.
    $batchSize = 2;
    $outputConfig = (new OutputConfig())
        ->setGcsDestination($gcsDestination)
        ->setBatchSize($batchSize);

    # prepare request using configs set above
    $request = (new AsyncAnnotateFileRequest())
        ->setFeatures([$feature])
        ->setInputConfig($inputConfig)
        ->setOutputConfig($outputConfig);
    $requests = [$request];

    # make request
    $imageAnnotator = new ImageAnnotatorClient();
    $operation = $imageAnnotator->asyncBatchAnnotateFiles($requests);
    print('Waiting for operation to finish.' . PHP_EOL);
    $operation->pollUntilComplete();

    # once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    preg_match('/^gs:\/\/([a-zA-Z0-9\._\-]+)\/?(\S+)?$/', $output, $match);
    $bucketName = $match[1];
    $prefix = isset($match[2]) ? $match[2] : '';

    $storage = new StorageClient();
    $bucket = $storage->bucket($bucketName);
    $options = ['prefix' => $prefix];
    $objects = $bucket->objects($options);

    # save first object for sample below
    $objects->next();
    $firstObject = $objects->current();

    # list objects with the given prefix.
    print('Output files:' . PHP_EOL);
    foreach ($objects as $object) {
        print($object->name() . PHP_EOL);
    }

    # process the first output file from GCS.
    # since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    $jsonString = $firstObject->downloadAsString();
    $firstBatch = new AnnotateFileResponse();
    $firstBatch->mergeFromJsonString($jsonString);

    # get annotation and print text
    foreach ($firstBatch->getResponses() as $response) {
        $annotation = $response->getFullTextAnnotation();
        print($annotation->getText());
    }

    $imageAnnotator->close();
}

Python

Before trying this sample, follow the Python setup instructions in the Vision API Quickstart Using Client Libraries. For more information, see the Vision API Python API reference documentation.

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    import re
    from google.cloud import vision
    from google.cloud import storage
    from google.protobuf import json_format
    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=180)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json_format.Parse(
        json_string, vision.types.AnnotateFileResponse())

    # The actual response for the first page of the input file.
    first_page_response = response.responses[0]
    annotation = first_page_response.full_text_annotation

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print(u'Full text:\n{}'.format(
        annotation.text))

Ruby

Before trying this sample, follow the Ruby setup instructions in the Vision API Quickstart Using Client Libraries. For more information, see the Vision API Ruby API reference documentation.

# gcs_source_uri = "Google Cloud Storage URI, eg. 'gs://my-bucket/example.pdf'"
# gcs_destination_uri = "Google Cloud Storage URI, eg. 'gs://my-bucket/prefix_'"

require "google/cloud/vision"
require "google/cloud/storage"

image_annotator = Google::Cloud::Vision::ImageAnnotator.new

operation = image_annotator.document_text_detection(
  image:       gcs_source_uri,
  mime_type:   "application/pdf",
  batch_size:  2,
  destination: gcs_destination_uri,
  async:       true
)

puts "Waiting for the operation to finish."
operation.wait_until_done!

# Once the request has completed and the output has been
# written to GCS, we can list all the output files.
storage = Google::Cloud::Storage.new

bucket_name, prefix = gcs_destination_uri.match("gs://([^/]+)/(.+)").captures
bucket              = storage.bucket bucket_name

# List objects with the given prefix.
puts "Output files:"
blob_list = bucket.files prefix: prefix
blob_list.each do |file|
  puts file.name
end

# Process the first output file from GCS.
# Since we specified a batch_size of 2, the first response contains
# the first two pages of the input file.
output      = blob_list[0]
json_string = output.download
response    = JSON.parse json_string.string

# The actual response for the first page of the input file.
first_page_response = response["responses"][0]
annotation          = first_page_response["fullTextAnnotation"]

# Here we print the full text from the first page.
# The response contains more information:
# annotation/pages/blocks/paragraphs/words/symbols
# including confidence scores and bounding boxes
puts "Full text:\n#{annotation['text']}"

GCLOUD COMMAND

The gcloud command you use depends on the file type.

  • To perform PDF text detection, use the gcloud ml vision detect-text-pdf command as shown in the following example:

    gcloud ml vision detect-text-pdf gs://my_bucket/input_file  gs://my_bucket/out_put_prefix
    
  • To perform TIFF text detection, use the gcloud ml vision detect-text-tiff command as shown in the following example:

    gcloud ml vision detect-text-tiff gs://my_bucket/input_file  gs://my_bucket/out_put_prefix
    
