Translating and speaking text from a photo with glossaries (Advanced)

This page shows how to detect text in an image, how to personalize translations, and how to generate synthetic speech from text. This tutorial uses Cloud Vision to detect text in an image file. Then, this tutorial shows how to use Cloud Translation to provide a custom translation of the detected text. Finally, this tutorial uses Text-to-Speech to provide machine dictation of the translated text.


  1. Pass text recognized by the Cloud Vision API to the Cloud Translation API.

  2. Create and use Cloud Translation glossaries to personalize Cloud Translation API translations.

  3. Create an audio representation of translated text using the Text-to-Speech API.


Each Google Cloud API uses a separate pricing structure.

For pricing details, refer to the Cloud Vision pricing guide, the Cloud Translation pricing guide, and the Text-to-Speech pricing guide.

Before you begin

Make sure that you have:

  • A project in the Google Cloud Console with the Vision API, the Cloud Translation API, and the Text-to-Speech API enabled
  • A basic familiarity with Python programming

Downloading the code samples

This tutorial uses code in the samples/snippets/hybrid_glossaries directory of the Google Cloud Python client library.

To download and navigate to the code for this tutorial, run the following commands from the terminal.

git clone
cd samples/snippets/hybrid_glossaries/

Setting up client libraries

This tutorial uses Vision, Translation, and Text-to-Speech client libraries.

To install the relevant client libraries, run the following commands from the terminal.

pip install --upgrade google-cloud-vision
pip install --upgrade google-cloud-translate
pip install --upgrade google-cloud-texttospeech

Setting up permissions for glossary creation

Creating Translation glossaries requires using a service account key with "Cloud Translation API Editor" permissions.

To set up a service account key with "Cloud Translation API Editor" permissions:

  1. On the Google Cloud Service accounts page, select a project. Then, select Create Service Account. Enter a Service account name and create the service account.

  2. Under Service account permissions, click Select a role. Scroll to Cloud Translation and select Cloud Translation API Editor. Select Continue.

  3. Click Create Key, select JSON, and click Create.

  4. From the hybrid_glossaries folder in the terminal, set the GOOGLE_APPLICATION_CREDENTIALS variable using the following command. Replace path_to_key with the path to the downloaded JSON file containing your new service account key.

    Linux or macOS

    export GOOGLE_APPLICATION_CREDENTIALS=path_to_key

    Windows

    set GOOGLE_APPLICATION_CREDENTIALS=path_to_key
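
Before running the samples, it can help to confirm that the variable points at a readable key file. The following is a minimal sketch and is not part of the tutorial's sample code. The helper name describe_credentials is hypothetical, and the sketch assumes the key is a standard service account JSON file containing "type" and "client_email" fields.

```python
import json
import os


def describe_credentials(environ=os.environ):
    """Return the service account email from the key file named by
    GOOGLE_APPLICATION_CREDENTIALS, or raise a descriptive error."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("Key file not found: " + path)

    # Service account keys are plain JSON documents
    with open(path, "r", encoding="utf-8") as key_file:
        key = json.load(key_file)

    # Assumption: a service account key declares its type explicitly
    if key.get("type") != "service_account":
        raise RuntimeError("Key file is not a service account key")
    return key.get("client_email", "")
```

Calling describe_credentials() with no arguments reads the real environment. Passing a dictionary instead makes the check easy to try without changing your shell.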




Importing libraries

This tutorial uses the following system imports and client library imports.

import html
import io
import os

# Imports the Google Cloud client libraries
from google.api_core.exceptions import AlreadyExists
from google.cloud import texttospeech
from google.cloud import translate_v3beta1 as translate
from google.cloud import vision

Setting your project ID

You must associate a Google Cloud project with each request to a Google Cloud API. Designate your Google Cloud project by setting the GCLOUD_PROJECT environment variable from the terminal.

In the following command, replace project-number-or-id with your Google Cloud project number or ID. Run the following command from the terminal.

Linux or macOS

export GCLOUD_PROJECT=project-number-or-id

Windows

set GCLOUD_PROJECT=project-number-or-id

This tutorial uses the following global project ID variable.

# extract GCP project id
PROJECT_ID = os.environ["GCLOUD_PROJECT"]
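
If the environment variable is missing, a bare lookup fails with a KeyError that does not say what to do next. As a hedged alternative, the sketch below wraps the lookup in a hypothetical helper named get_project_id that raises a more descriptive error. It is an illustration, not part of the tutorial's sample code.

```python
import os


def get_project_id(environ=os.environ):
    """Return the project ID stored in GCLOUD_PROJECT.

    environ: mapping to read from; defaults to the process environment
    """
    project_id = environ.get("GCLOUD_PROJECT", "").strip()
    if not project_id:
        raise RuntimeError(
            "GCLOUD_PROJECT is not set. Set it to your Google Cloud "
            "project number or ID before running this tutorial."
        )
    return project_id
```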

Using Vision to detect text from an image

Use the Vision API to detect and extract text from an image. The Vision API uses Optical Character Recognition (OCR) to support two text-detection features: detection of dense text, or DOCUMENT_TEXT_DETECTION, and sparse text detection, or TEXT_DETECTION.

The following code shows how to use the Vision API DOCUMENT_TEXT_DETECTION feature to detect text in a photo with dense text.

def pic_to_text(infile):
    """Detects text in an image file

    Args:
    infile: path to image file

    Returns:
    String of text detected in image
    """

    # Instantiates a client
    client = vision.ImageAnnotatorClient()

    # Opens the input image file
    with io.open(infile, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    # For dense text, use document_text_detection
    # For less dense text, use text_detection
    response = client.document_text_detection(image=image)
    text = response.full_text_annotation.text
    print("Detected text: {}".format(text))

    return text
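
For photos with sparse text, such as a street sign, the TEXT_DETECTION feature mentioned above may be a better fit. The following is a rough sketch of that variant, not part of the tutorial's sample code. The function name sparse_pic_to_text is hypothetical. It assumes that client is a vision.ImageAnnotatorClient, that image is a vision.Image, and that the first entry of text_annotations carries the full detected text.

```python
def sparse_pic_to_text(client, image):
    """Detects sparse text (for example, a sign) in an image.

    client: object with a text_detection method, assumed to be a
        vision.ImageAnnotatorClient
    image: the image to annotate, assumed to be a vision.Image
    """
    response = client.text_detection(image=image)

    # The API reports per-image failures in response.error
    if response.error.message:
        raise RuntimeError(response.error.message)

    # The first annotation, when present, holds the full detected text;
    # the remaining annotations describe individual words
    annotations = response.text_annotations
    return annotations[0].description if annotations else ""
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object, without credentials.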

Using Translation with glossaries

After extracting text from an image, use Translation glossaries to personalize the translation of the extracted text. Glossaries provide pre-defined translations that override the Cloud Translation API translations of designated terms.

Glossary use cases include:

  • Product names: For example, 'Google Home' must translate to 'Google Home'.

  • Ambiguous words: For example, the word 'bat' can mean a piece of sports equipment or an animal. If you know that you are translating words about sports, you might want to use a glossary to feed the Cloud Translation API the sports translation of 'bat', not the animal translation.

  • Borrowed words: For example, 'bouillabaisse' in French translates to 'bouillabaisse' in English; the English language borrowed the word 'bouillabaisse' from the French language. An English speaker lacking French cultural context might not know that bouillabaisse is a French fish stew dish. Glossaries can override a translation so that 'bouillabaisse' in French translates to 'fish stew' in English.

Making a glossary file

The Cloud Translation API accepts TSV, CSV, or TMX glossary files. This tutorial uses a CSV file uploaded to Cloud Storage to define sets of equivalent terms.

To make a glossary CSV file:

  1. Designate the language of a column using either ISO-639-1 or BCP-47 language codes in the first row of the CSV file. The glossary in this tutorial pairs French with English.

    fr,en,

  2. List pairs of equivalent terms in each row of the CSV file. Separate terms with commas. The following example defines the English translation for several culinary French words.

    chèvre,goat cheese,
    crème brulée,crème brulée,
    bouillabaisse,fish stew,
    steak frites,steak with french fries,

  3. Define variants of a word. The Cloud Translation API is case-sensitive and sensitive to special characters such as accented letters. Ensure that your glossary handles variations on a word by explicitly defining each spelling of the word.

    chevre,goat cheese,
    Chevre,Goat cheese,
    chèvre,goat cheese,
    Chèvre,Goat cheese,
    crème brulée,crème brulée,
    Crème brulée,Crème brulée,
    Crème Brulée,Crème Brulée,
    bouillabaisse,fish stew,
    Bouillabaisse,Fish stew,
    steak frites,steak with french fries,
    Steak frites,Steak with french fries,
    Steak Frites,Steak with French Fries,

  4. Upload the glossary to a Cloud Storage bucket. For the purposes of this tutorial, you do not need to create a Cloud Storage bucket or upload a glossary file. Instead, use the publicly available glossary file created for this tutorial, which avoids incurring any Cloud Storage costs. The URI of that file is gs://cloud-samples-data/translation/bistro_glossary.csv. To create a glossary resource, you send the URI of a glossary file in Cloud Storage to the Cloud Translation API. To download the glossary, click on the above URI link, but do not open it in a new tab.
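
Writing every capitalization and accent variant by hand is error-prone. The steps above can be sketched as a small generator script. This is an illustration only: the helper names are hypothetical, the sketch covers just two kinds of variant (a capitalized first letter and stripped accents), and it assumes the trailing comma shown in the rows above is wanted.

```python
import csv
import io
import unicodedata


def strip_accents(term):
    """Return term with combining accent marks removed."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def capitalize_first(term):
    """Upper-case only the first character, leaving the rest untouched."""
    return term[:1].upper() + term[1:]


def glossary_rows(pairs):
    """Expand (source, target) pairs into accent and capitalization variants."""
    rows = []
    for source, target in pairs:
        for src in (strip_accents(source), source):
            variants = (
                (src, target),
                (capitalize_first(src), capitalize_first(target)),
            )
            for row in variants:
                if row not in rows:  # skip duplicates such as unaccented terms
                    rows.append(row)
    return rows


def write_glossary_csv(pairs, languages=("fr", "en")):
    """Return CSV text: a language-code header row, then one row per variant."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The empty last column reproduces the trailing comma used in this tutorial
    writer.writerow([*languages, ""])
    for source, target in glossary_rows(pairs):
        writer.writerow([source, target, ""])
    return buffer.getvalue()


if __name__ == "__main__":
    print(write_glossary_csv([("chèvre", "goat cheese"), ("bouillabaisse", "fish stew")]))
```

Variants that capitalize every word, such as "Steak Frites", are not generated here and would still need to be listed explicitly.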

Creating a glossary resource

In order to use a glossary, you must create a glossary resource with the Cloud Translation API. To create a glossary resource, send the URI of a glossary file in Cloud Storage to the Cloud Translation API.

Make sure that you are using a service account key with "Cloud Translation API Editor" permissions and make sure that you have set your project ID from the terminal.

The following function creates a glossary resource. With this glossary resource, you can personalize the translation request in the next step of this tutorial.

def create_glossary(languages, project_id, glossary_name, glossary_uri):
    """Creates a GCP glossary resource
    Assumes you've already manually uploaded a glossary to Cloud Storage

    Args:
    languages: list of languages in the glossary
    project_id: GCP project id
    glossary_name: name you want to give this glossary resource
    glossary_uri: the uri of the glossary you uploaded to Cloud Storage
    """

    # Instantiates a client
    client = translate.TranslationServiceClient()

    # Designates the data center location that you want to use
    location = "us-central1"

    # Set glossary resource name
    name = client.glossary_path(project_id, location, glossary_name)

    # Set language codes
    language_codes_set = translate.types.Glossary.LanguageCodesSet(
        language_codes=languages
    )

    gcs_source = translate.types.GcsSource(input_uri=glossary_uri)

    input_config = translate.types.GlossaryInputConfig(gcs_source=gcs_source)

    # Set glossary resource information
    glossary = translate.types.Glossary(
        name=name, language_codes_set=language_codes_set, input_config=input_config
    )

    parent = f"projects/{project_id}/locations/{location}"

    # Create glossary resource
    # Handle exception for case in which a glossary
    #  with glossary_name already exists
    try:
        operation = client.create_glossary(parent=parent, glossary=glossary)
        # Wait for the long-running operation to finish
        operation.result(timeout=90)
        print("Created glossary " + glossary_name + ".")
    except AlreadyExists:
        print(
            "The glossary "
            + glossary_name
            + " already exists. No new glossary was created."
        )
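
The parent and name values in create_glossary are plain resource-name strings. The sketch below spells them out with two hypothetical helpers. It assumes glossary names follow the projects/.../locations/.../glossaries/... pattern that client.glossary_path produces, and it is meant only to show what those strings look like.

```python
def location_path(project_id, location):
    """Return the parent resource name for a project and location."""
    return f"projects/{project_id}/locations/{location}"


def glossary_resource_name(project_id, location, glossary_name):
    """Return the full resource name of a glossary (assumed pattern)."""
    return f"{location_path(project_id, location)}/glossaries/{glossary_name}"


if __name__ == "__main__":
    # "my-project" is a placeholder project ID
    print(glossary_resource_name("my-project", "us-central1", "bistro-glossary"))
```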

Translating with glossaries

Once you create a glossary resource, you can use the glossary resource to personalize translations of text that you send to the Cloud Translation API.

The following function uses your previously-created glossary resource to personalize the translation of text.

def translate_text(
    text, source_language_code, target_language_code, project_id, glossary_name
):
    """Translates text to a given language using a glossary

    Args:
    text: String of text to translate
    source_language_code: language of input text
    target_language_code: language of output text
    project_id: GCP project id
    glossary_name: name you gave your project's glossary
        resource when you created it

    Returns:
    String of translated text
    """

    # Instantiates a client
    client = translate.TranslationServiceClient()

    # Designates the data center location that you want to use
    location = "us-central1"

    glossary = client.glossary_path(project_id, location, glossary_name)

    glossary_config = translate.types.TranslateTextGlossaryConfig(glossary=glossary)

    parent = f"projects/{project_id}/locations/{location}"

    result = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",  # mime types: text/plain, text/html
            "source_language_code": source_language_code,
            "target_language_code": target_language_code,
            "glossary_config": glossary_config,
        }
    )

    # Extract translated text from API response
    return result.glossary_translations[0].translated_text
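
To see what the glossary changed, it can be useful to compare the glossary translation with the default one. The following sketch assumes the response also carries the default machine translation in a translations field alongside glossary_translations. The helper name compare_translations is hypothetical and is not part of the tutorial's sample code.

```python
def compare_translations(result):
    """Pair each default translation with its glossary translation.

    result: a translate_text response, assumed to expose both
        translations and glossary_translations
    Returns a list of (default_text, glossary_text) tuples.
    """
    return [
        (plain.translated_text, with_glossary.translated_text)
        for plain, with_glossary in zip(
            result.translations, result.glossary_translations
        )
    ]
```

A pair whose two entries differ marks a spot where a glossary term overrode the default translation.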

Using Text-to-Speech with Speech Synthesis Markup Language

Now that you have personalized a translation of image-detected text, you are ready to use the Text-to-Speech API. The Text-to-Speech API can create synthetic audio of your translated text.

The Text-to-Speech API generates synthetic audio from either a string of plain text or a string of text marked up with Speech Synthesis Markup Language (SSML). SSML is a markup language which supports annotating text with SSML tags. You can use SSML tags to influence how the Text-to-Speech API formats synthetic speech creation.

The following function converts a string of plain text to SSML and then synthesizes an MP3 file of speech from that SSML.

def text_to_speech(text, outfile):
    """Converts plaintext to SSML and
    generates synthetic audio from SSML

    Args:
    text: text to synthesize
    outfile: filename to use to store synthetic audio
    """

    # Replace special characters with HTML Ampersand Character Codes
    # These Codes prevent the API from confusing text with
    # SSML commands
    # For example, '<' --> '&lt;' and '&' --> '&amp;'
    escaped_lines = html.escape(text)

    # Convert plaintext to SSML in order to wait two seconds
    #   between each line in synthetic speech
    ssml = "<speak>{}</speak>".format(
        escaped_lines.replace("\n", '\n<break time="2s"/>')
    )

    # Instantiates a client
    client = texttospeech.TextToSpeechClient()

    # Sets the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(ssml=ssml)

    # Builds the voice request, selects the language code ("en-US") and
    # the SSML voice gender ("MALE")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )

    # Selects the type of audio file to return
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Performs the text-to-speech request on the text input with the selected
    # voice parameters and audio file type

    request = texttospeech.SynthesizeSpeechRequest(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    response = client.synthesize_speech(request=request)

    # Writes the synthetic audio to the output file.
    with open(outfile, "wb") as out:
        out.write(response.audio_content)
        print("Audio content written to file " + outfile)
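
The plaintext-to-SSML step in text_to_speech needs no API call, so you can inspect it on its own. The following sketch isolates that step in a hypothetical helper named text_to_ssml. The pause parameter is an addition for illustration and defaults to the two-second break used above.

```python
import html


def text_to_ssml(text, pause="2s"):
    """Escape plain text and insert an SSML break after each newline.

    text: plain text, one phrase per line
    pause: break duration written into each <break> tag
    """
    # Escape characters such as '<' and '&' so they are read as text,
    # not as SSML markup
    escaped = html.escape(text)

    # Add a pause after every line break
    with_breaks = escaped.replace("\n", '\n<break time="{}"/>'.format(pause))
    return "<speak>{}</speak>".format(with_breaks)


if __name__ == "__main__":
    print(text_to_ssml("fish & chips\ngoat cheese"))
```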

Putting it all together

In the previous steps, you defined functions that use Vision, Translation, and Text-to-Speech. Now, you are ready to use these functions to generate synthetic speech of translated text from the following photo.

The following code calls the functions defined in the previous steps to:

  • create a Cloud Translation API glossary resource

  • use the Vision API to detect text in the above image

  • perform a Cloud Translation API glossary translation of the detected text

  • generate Text-to-Speech synthetic speech of the translated text

def main():

    # Photo from which to extract text
    infile = "resources/example.png"
    # Name of file that will hold synthetic speech
    outfile = "resources/example.mp3"

    # Defines the languages in the glossary
    # This list must match the languages in the glossary
    #   Here, the glossary includes French and English
    glossary_langs = ["fr", "en"]
    # Name that will be assigned to your project's glossary resource
    glossary_name = "bistro-glossary"
    # uri of .csv file uploaded to Cloud Storage
    glossary_uri = "gs://cloud-samples-data/translation/bistro_glossary.csv"

    create_glossary(glossary_langs, PROJECT_ID, glossary_name, glossary_uri)

    # photo -> detected text
    text_to_translate = pic_to_text(infile)
    # detected text -> translated text
    text_to_speak = translate_text(
        text_to_translate, "fr", "en", PROJECT_ID, glossary_name
    )
    # translated text -> synthetic audio
    text_to_speech(text_to_speak, outfile)

Running the code

To run the code, enter the following command in terminal in your cloned hybrid_glossaries directory:


The following output appears:

Created glossary bistro-glossary.
Audio content written to file resources/example.mp3

After running, navigate into the resources directory from the hybrid_glossaries directory. Check the resources directory for an example.mp3 file.

Listen to the following audio clip to check that your example.mp3 file sounds the same.

Troubleshooting error messages

Cleaning up

Use the Google Cloud Console to delete your project if you do not need it. Deleting your project prevents incurring additional charges to your Google Cloud account for the resources used in this tutorial.

Deleting your project

  1. In the Cloud Console, go to the Projects page.
  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog box, type the project ID, and click Shut down to delete the project.

What's next

Congratulations! You just used Vision OCR to detect text in an image. Then, you created a Translation glossary and performed a translation with that glossary. Afterwards, you used Text-to-Speech to generate synthetic audio of the translated text.

To build on your knowledge of Vision, Cloud Translation, and Text-to-Speech: