Try translating formatted documents

Vertex AI Translation and Optical Character Recognition (OCR) services combine to provide a document processing feature called Document Vision Service (DVS).

DVS directly translates formatted documents such as PDF files. Compared to plain text translations, the feature preserves the original formatting and layout in your translated documents, helping you retain much of the original context, like paragraph breaks.

DVS supports document translations inline, from storage buckets, and in batch.

This page guides you through an interactive experience using the document processing feature on Google Distributed Cloud (GDC) air-gapped to translate documents while preserving their format.

Supported formats

DVS supports the following input file types and their associated output file types:

Input  Document MIME type                                                         Output
PDF    application/pdf                                                            PDF, DOCX
DOC    application/msword                                                         DOC, DOCX
DOCX   application/vnd.openxmlformats-officedocument.wordprocessingml.document    DOCX
PPT    application/vnd.ms-powerpoint                                              PPT, PPTX
PPTX   application/vnd.openxmlformats-officedocument.presentationml.presentation  PPTX
XLS    application/vnd.ms-excel                                                   XLS, XLSX
XLSX   application/vnd.openxmlformats-officedocument.spreadsheetml.sheet          XLSX
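
The mapping in the table above can be sketched as a small lookup. The following Python snippet is an illustration only: the names SUPPORTED_FORMATS, mime_type_for, and supports_output are hypothetical and are not part of DVS or its client libraries.

```python
import os

# Input extension -> (MIME type, allowed output extensions), copied from the table above.
SUPPORTED_FORMATS = {
    "pdf": ("application/pdf", ("pdf", "docx")),
    "doc": ("application/msword", ("doc", "docx")),
    "docx": ("application/vnd.openxmlformats-officedocument.wordprocessingml.document", ("docx",)),
    "ppt": ("application/vnd.ms-powerpoint", ("ppt", "pptx")),
    "pptx": ("application/vnd.openxmlformats-officedocument.presentationml.presentation", ("pptx",)),
    "xls": ("application/vnd.ms-excel", ("xls", "xlsx")),
    "xlsx": ("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ("xlsx",)),
}


def _extension(path):
    """Return the lowercase file extension without the leading dot."""
    return os.path.splitext(path)[1].lstrip(".").lower()


def mime_type_for(path):
    """Return the MIME type for a supported input file, or raise ValueError."""
    ext = _extension(path)
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported input file type: {path!r}")
    return SUPPORTED_FORMATS[ext][0]


def supports_output(input_path, output_ext):
    """Return True if the table lists output_ext as an output for this input."""
    ext = _extension(input_path)
    if ext not in SUPPORTED_FORMATS:
        return False
    return output_ext.lower() in SUPPORTED_FORMATS[ext][1]
```

For example, supports_output("slides.ppt", "pptx") is true, while supports_output("sheet.xlsx", "xls") is false because the table lists only XLSX as an output for XLSX inputs.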

Original and scanned PDF document translations

DVS supports original and scanned PDF files, including translations to or from right-to-left languages. DVS also preserves hyperlinks, font size, and font color from the source files.

Before you begin

Before you can use the document processing feature, you must have a project named dvs-project. The project's custom resource must look like the following example:

apiVersion: resourcemanager.gdc.goog/v1
kind: Project
metadata:
  labels:
    atat.config.google.com/clin-number: CLIN_NUMBER
    atat.config.google.com/task-order-number: TASK_ORDER_NUMBER
  name: dvs-project
  namespace: platform

Furthermore, you must enable both the Vertex AI Translation and OCR pre-trained APIs and have the appropriate credentials. Consider installing the Vertex AI Translation and OCR client libraries to facilitate API calls. For more information about prerequisites, see Set up a translation project.

Translate a document from a storage bucket

To translate a document that is stored in a bucket, you use the Vertex AI Translation API.

This section describes how to translate a document from a bucket and store the result in an output bucket path. The response also returns a byte stream. You can specify the MIME type; if you don't, DVS determines it from the input file's extension.

DVS supports language auto-detection for documents stored in buckets. If you don't specify a source language code, DVS detects the language for you. The detected language is included in the output in the detectedLanguageCode field.
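
As a minimal sketch of the two optional fields described above, the following Python helper builds a request body and leaves out mime_type and source_language_code when you don't supply them. The function name build_translate_request and the sample values are hypothetical; the field names follow the request.json example in the steps that follow. The sketch also leaves out document_output_config and enable_rotation_correction, which that example sets, so add them if your request needs them.

```python
import json


def build_translate_request(project_id, input_uri, target_language,
                            source_language=None, mime_type=None):
    """Build a translateDocument request body for a document stored in a bucket."""
    input_config = {"s3_source": {"input_uri": input_uri}}
    if mime_type is not None:
        # Optional: when omitted, DVS uses the input file's extension.
        input_config["mime_type"] = mime_type

    body = {
        "parent": f"projects/{project_id}",
        "target_language_code": target_language,
        "document_input_config": input_config,
    }
    if source_language is not None:
        # Optional: when omitted, DVS detects the source language.
        body["source_language_code"] = source_language
    return body


# Hypothetical values, for illustration only.
print(json.dumps(
    build_translate_request("my-project", "s3://my-bucket/report.pdf", "es"),
    indent=2,
))
```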

Follow these steps to translate a document from a storage bucket:

  1. Configure the gdcloud CLI for object storage.
  2. Create a storage bucket in the dvs-project namespace. Use a Standard storage class.

    You can create the storage bucket by deploying a Bucket resource in the dvs-project namespace:

      apiVersion: object.gdc.goog/v1
      kind: Bucket
      metadata:
        name: dvs-bucket
        namespace: dvs-project
      spec:
        description: bucket for document vision service
        storageClass: Standard
        bucketPolicy:
          lockingPolicy:
            defaultObjectRetentionDays: 90
    
  3. Grant read and write permissions on the bucket to the service account (ai-translation-system-sa) used by the Vertex AI Translation service.

    You can follow these steps to create the role and role binding using custom resources:

    1. Create the role by deploying a Role resource in the dvs-project namespace:

        apiVersion: rbac.authorization.k8s.io/v1
        kind: Role
        metadata:
          name: dvs-reader-writer
          namespace: dvs-project
        rules:
          -
            apiGroups:
              - object.gdc.goog
            resources:
              - buckets
            verbs:
              - read-object
              - write-object
      
    2. Create the role binding by deploying a RoleBinding resource in the dvs-project namespace:

        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
          name: dvs-reader-writer-rolebinding
          namespace: dvs-project
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: Role
          name: dvs-reader-writer
        subjects:
          -
            kind: ServiceAccount
            name: ai-translation-system-sa
            namespace: ai-translation-system
      
  4. Upload your document to the storage bucket you created. For more information, see Upload and download storage objects in projects.

  5. Make a request to the Vertex AI Translation pre-trained API:

    curl

    Follow these steps to make a curl request:

    1. Run the following command to create the request.json file:

      cat <<- EOF > request.json
      {
        "parent": "projects/PROJECT_ID",
        "source_language_code": "SOURCE_LANGUAGE",
        "target_language_code": "TARGET_LANGUAGE",
        "document_input_config": {
          "mime_type": "application/pdf",
          "s3_source": {
            "input_uri": "s3://INPUT_FILE_PATH"
          }
        },
        "document_output_config": {
          "mime_type": "application/pdf"
        },
        "enable_rotation_correction": "true"
      }
      EOF
      

      Replace the following:

      • PROJECT_ID: your project ID.
      • SOURCE_LANGUAGE: the language in which your document is written. See the list of supported languages and their respective language codes.
      • TARGET_LANGUAGE: the language into which you want to translate your document. See the list of supported languages and their respective language codes.
      • INPUT_FILE_PATH: the path of your document file in the storage bucket.

      Modify the mime_type value according to your document.

    2. Get an authentication token.

    3. Make the request:

      curl -vv --data-binary @- -H "Content-Type: application/json" -H "Authorization: Bearer TOKEN" https://ENDPOINT:443/v3/projects/PROJECT_ID:translateDocument < request.json
      

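      Replace the following:

      • TOKEN: the authentication token you obtained.
      • ENDPOINT: the Vertex AI Translation endpoint that you use for your organization. For more information, view service status and endpoints.
      • PROJECT_ID: your project ID.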

Translate a document inline

This section describes how to send a document inline as part of the API request. You must include the MIME type for inline document translations.

DVS supports language auto-detection for inline document translations. If you don't specify a source language code, DVS detects the language for you. The detected language is included in the output in the detectedLanguageCode field.

Make a request to the Vertex AI Translation pre-trained API:

curl

Follow these steps to make a curl request:

  1. Get an authentication token.

  2. Make the request:

echo '{"parent": "projects/PROJECT_ID","source_language_code": "SOURCE_LANGUAGE", "target_language_code": "TARGET_LANGUAGE", "document_input_config": { "mime_type": "application/pdf", "content": "'$(base64 -w 0 INPUT_FILE_PATH)'" }, "document_output_config": { "mime_type": "application/pdf" }, "enable_rotation_correction": "true"}' | curl --data-binary @- -H "Content-Type: application/json" -H "Authorization: Bearer TOKEN" https://ENDPOINT/v3/projects/PROJECT_ID:translateDocument

Replace the following:

  • PROJECT_ID: your project ID.
  • SOURCE_LANGUAGE: the language in which your document is written. See the list of supported languages and their respective language codes.
  • TARGET_LANGUAGE: the language into which you want to translate your document. See the list of supported languages and their respective language codes.
  • INPUT_FILE_PATH: the local path of your document file.
  • TOKEN: the authentication token you obtained.
  • ENDPOINT: the Vertex AI Translation endpoint that you use for your organization. For more information, view service status and endpoints.
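
The same inline request body can be assembled in Python instead of with echo and base64. This is a sketch under assumptions: build_inline_request is a hypothetical helper, and it reuses the input MIME type for the output, as the curl example does for PDF. It only builds the body and doesn't send it.

```python
import base64


def build_inline_request(project_id, document_bytes, mime_type,
                         target_language, source_language=None):
    """Build a translateDocument body that carries the document inline."""
    if not mime_type:
        # Inline translations must state the MIME type explicitly.
        raise ValueError("mime_type is required for inline document translation")

    body = {
        "parent": f"projects/{project_id}",
        "target_language_code": target_language,
        "document_input_config": {
            "mime_type": mime_type,
            # Same encoding as `base64 -w 0` in the curl example: one unwrapped line.
            "content": base64.b64encode(document_bytes).decode("ascii"),
        },
        "document_output_config": {"mime_type": mime_type},
    }
    if source_language:
        # Optional: when omitted, DVS detects the source language.
        body["source_language_code"] = source_language
    return body
```

To use it, read the file in binary mode, for example with open(path, "rb"), and pass the resulting bytes as document_bytes.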

Translate documents in batch

Batch translation lets you translate multiple files into multiple languages in a single request. For each request, you can send up to 100 files with a total content size of up to 1 GB or 100 million Unicode codepoints, whichever limit is hit first. You can specify a particular translation model for each language.

For more information, see batchTranslateDocument.
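
Before you submit a batch, you might check it against the limits stated above. The following Python sketch checks only the file count and the total size in bytes; it doesn't count Unicode codepoints. The helper name check_batch_limits is hypothetical, and reading 1 GB as 2**30 bytes is an assumption.

```python
MAX_FILES = 100
MAX_TOTAL_BYTES = 1024 ** 3  # Assumption: "1 GB" is read here as 2**30 bytes.


def check_batch_limits(file_sizes):
    """Return a list of limit violations for the given file sizes in bytes.

    An empty list means the batch is within the two limits checked here.
    """
    problems = []
    if len(file_sizes) > MAX_FILES:
        problems.append(f"{len(file_sizes)} files exceeds the limit of {MAX_FILES}")
    total = sum(file_sizes)
    if total > MAX_TOTAL_BYTES:
        problems.append(f"{total} bytes exceeds the limit of {MAX_TOTAL_BYTES}")
    return problems
```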

Translate multiple documents

The following example includes multiple input configurations. Each input configuration is a pointer to a file in a storage bucket.

Make a request to the Vertex AI Translation pre-trained API:

curl

Follow these steps to make a curl request:

  1. Save the following request body in a file named request.json:

    {
      "source_language_code": "SOURCE_LANGUAGE",
      "target_language_codes": ["TARGET_LANGUAGE", ...],
      "input_configs": [
        {
          "s3_source": {
            "input_uri": "s3://INPUT_FILE_PATH_1"
          }
        },
        {
          "s3_source": {
            "input_uri": "s3://INPUT_FILE_PATH_2"
          }
        },
        ...
      ],
      "output_config": {
        "s3_destination": {
          "output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
        }
      }
    }
    

    Replace the following:

    • SOURCE_LANGUAGE: the language code of the input documents. See the list of supported languages and their respective language codes.
    • TARGET_LANGUAGE: the target language or languages to translate the input documents to. See the list of supported languages and their respective language codes.
    • INPUT_FILE_PATH_1, INPUT_FILE_PATH_2: the storage bucket location and filename of each input document.
    • OUTPUT_FILE_PREFIX: the storage bucket location where all output documents are stored.

  2. Get an authentication token.

  3. Make the request:

    curl -X POST \
      -H "Authorization: Bearer TOKEN" \
      -H "Content-Type: application/json; charset=utf-8" \
      -d @request.json \
      "https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"
    

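    Replace the following:

    • TOKEN: the authentication token you obtained.
    • ENDPOINT: the Vertex AI Translation endpoint that you use for your organization. For more information, view service status and endpoints.
    • PROJECT_ID: your project ID.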

The response contains the ID for a long-running operation:

{
  "name": "projects/PROJECT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
    "state": "RUNNING"
  }
}
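
A client typically needs the operation ID and the state from this response. The following Python sketch extracts both from a response shaped like the example above. The helper parse_operation is hypothetical, and the sample project and operation IDs are made up for illustration.

```python
import json


def parse_operation(response_text):
    """Return (operation_id, state) from a batchTranslateDocument response."""
    operation = json.loads(response_text)
    # The operation ID is the last path segment of the "name" field.
    operation_id = operation["name"].rsplit("/", 1)[-1]
    # "state" is read defensively in case the metadata block is absent.
    state = operation.get("metadata", {}).get("state")
    return operation_id, state


# Hypothetical response, shaped like the example above.
sample = """
{
  "name": "projects/my-project/operations/12345",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
    "state": "RUNNING"
  }
}
"""
print(parse_operation(sample))
```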

Translate and convert an original PDF file

The following example translates and converts an original PDF file to a DOCX file. You can specify multiple inputs of various file types; they don't all have to be original PDF files. However, you can't include scanned PDF files in a request that includes a conversion; the request is rejected and no translations are done. Only original PDF files are translated and converted to DOCX files. Other file types keep their format; for example, if you include PPTX files, they are translated and returned as PPTX files.

If you regularly translate a mix of scanned and original PDF files, we recommend that you organize them into separate buckets. That way, when you request a batch translation and conversion, you can exclude the bucket that contains scanned PDF files instead of having to exclude individual files.
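
The recommendation above can be sketched as a filter on the bucket prefix. The following Python helper builds a batch request body with a PDF-to-DOCX conversion and skips every input under the prefix you reserve for scanned PDF files. The helper name, the bucket names, and the prefix convention are assumptions; the field names follow the request.json example in the steps that follow. The sketch can't tell whether a PDF is scanned; it relies entirely on how you organized your buckets.

```python
PDF = "application/pdf"
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_conversion_request(input_uris, scanned_prefix, source_language,
                             target_languages, output_prefix):
    """Build a batchTranslateDocument body that converts original PDFs to DOCX."""
    # Skip anything stored under the prefix reserved for scanned PDF files.
    kept = [uri for uri in input_uris if not uri.startswith(scanned_prefix)]
    return {
        "source_language_code": source_language,
        "target_language_codes": list(target_languages),
        "input_configs": [{"s3_source": {"input_uri": uri}} for uri in kept],
        "output_config": {
            "s3_destination": {"output_uri_prefix": output_prefix}
        },
        "format_conversions": {PDF: DOCX},
    }
```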

Make a request to the Vertex AI Translation pre-trained API:

curl

Follow these steps to make a curl request:

  1. Save the following request body in a file named request.json:

    {
      "source_language_code": "SOURCE_LANGUAGE",
      "target_language_codes": ["TARGET_LANGUAGE", ...],
      "input_configs": [
        {
          "s3_source": {
            "input_uri": "s3://INPUT_FILE_PATH_1"
          }
        },
        {
          "s3_source": {
            "input_uri": "s3://INPUT_FILE_PATH_2"
          }
        },
        ...
      ],
      "output_config": {
        "s3_destination": {
          "output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
        }
      },
      "format_conversions": {
        "application/pdf": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      }
    }
    

    Replace the following:

    • SOURCE_LANGUAGE: the language code of the input documents. See the list of supported languages and their respective language codes.
    • TARGET_LANGUAGE: the target language or languages to translate the input documents to. See the list of supported languages and their respective language codes.
    • INPUT_FILE_PATH_1, INPUT_FILE_PATH_2: the storage bucket location and filename of each input document.
    • OUTPUT_FILE_PREFIX: the storage bucket location where all output documents are stored.

  2. Get an authentication token.

  3. Make the request:

    curl -X POST \
      -H "Authorization: Bearer TOKEN" \
      -H "Content-Type: application/json; charset=utf-8" \
      -d @request.json \
      "https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"
    

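    Replace the following:

    • TOKEN: the authentication token you obtained.
    • ENDPOINT: the Vertex AI Translation endpoint that you use for your organization. For more information, view service status and endpoints.
    • PROJECT_ID: your project ID.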

The response contains the ID for a long-running operation:

{
  "name": "projects/PROJECT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
    "state": "RUNNING"
  }
}

Use a glossary

You can include a glossary to handle domain-specific terminology. If you specify a glossary, you must specify the source language. You can specify up to 10 target languages, each with its own glossary. The following example uses a glossary.

If you specify a glossary for some target languages, the system doesn't use any glossary for the unspecified languages.
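
The glossary rules above can be sketched as a small builder for the glossaries field. In this Python snippet, build_glossaries and targets_without_glossary are hypothetical helpers, and glossary names are treated as opaque strings that you supply.

```python
MAX_GLOSSARY_TARGETS = 10


def build_glossaries(source_language, glossary_by_target):
    """Build the "glossaries" map, enforcing the rules stated above."""
    if not source_language:
        raise ValueError("a source language is required when you use a glossary")
    if len(glossary_by_target) > MAX_GLOSSARY_TARGETS:
        raise ValueError(f"at most {MAX_GLOSSARY_TARGETS} target languages can have a glossary")
    return {
        language: {"glossary": name}
        for language, name in glossary_by_target.items()
    }


def targets_without_glossary(target_languages, glossaries):
    """List the target languages that are translated without any glossary."""
    return [language for language in target_languages if language not in glossaries]
```

For example, with target languages es and fr and a glossary for es only, targets_without_glossary returns ["fr"], which is the language translated without a glossary.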

Make a request to the Vertex AI Translation pre-trained API:

curl

Follow these steps to make a curl request:

  1. Save the following request body in a file named request.json:

    {
      "source_language_code": "SOURCE_LANGUAGE",
      "target_language_codes": ["TARGET_LANGUAGE", ...],
      "input_configs": [
        {
          "s3_source": {
            "input_uri": "s3://INPUT_FILE_PATH"
          }
        }
      ],
      "output_config": {
        "s3_destination": {
          "output_uri_prefix": "s3://OUTPUT_FILE_PREFIX"
        }
      },
      "glossaries": {
        "TARGET_LANGUAGE": {
          "glossary": "projects/GLOSSARY_PROJECT_ID"
        },
        ...
      }
    }
    

    Replace the following:

    • SOURCE_LANGUAGE: the language code of the input documents. See the list of supported languages and their respective language codes.
    • TARGET_LANGUAGE: the target language or languages to translate the input documents to. See the list of supported languages and their respective language codes.
    • INPUT_FILE_PATH: the storage bucket location and filename of one or more input documents.
    • OUTPUT_FILE_PREFIX: the storage bucket location where all output documents are stored.
    • GLOSSARY_PROJECT_ID: the project ID where the glossary is located.

  2. Get an authentication token.

  3. Make the request:

    curl -X POST \
      -H "Authorization: Bearer TOKEN" \
      -H "Content-Type: application/json; charset=utf-8" \
      -d @request.json \
      "https://ENDPOINT:443/v3/projects/PROJECT_ID:batchTranslateDocument"
    

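    Replace the following:

    • TOKEN: the authentication token you obtained.
    • ENDPOINT: the Vertex AI Translation endpoint that you use for your organization. For more information, view service status and endpoints.
    • PROJECT_ID: your project ID.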

The response contains the ID for a long-running operation:

{
  "name": "projects/PROJECT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateDocumentMetadata",
    "state": "RUNNING"
  }
}