Create long-form audio

This document walks you through the process of synthesizing long-form audio. Long Audio Synthesis asynchronously synthesizes up to 1 million bytes of input. To learn more about the fundamental concepts in Text-to-Speech, read Text-to-Speech Basics.

Before you begin

Before you can send a request to the Text-to-Speech API, you must complete the setup steps described on the before you begin page.

Synthesize long audio from text using the command line

You can convert long-form text to audio by making an HTTP POST request to the https://texttospeech.googleapis.com/v1beta1/projects/{$project_number}/locations/global:synthesizeLongAudio endpoint. In the body of your POST command, specify the following fields.

voice: The type of voice to synthesize.

input.text: The text to synthesize.

audioConfig: The type of audio to create.

output_gcs_uri: The Cloud Storage (GCS) output file path, in the form "gs://bucket_name/file_name.wav".

parent: The parent resource, in the form "projects/{YOUR PROJECT NUMBER}/locations/{YOUR PROJECT LOCATION}".

The input can contain up to 1 MB of text. The exact limit can vary from one input to another.
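The fields listed above can be assembled programmatically before the request is sent. The following is a minimal sketch, not an official helper: the function name `build_long_audio_body` and the constant `MAX_INPUT_BYTES` are hypothetical, and the 1,000,000-byte ceiling is only an assumed pre-flight check based on the figure quoted in this document, since the service's exact limit can vary.

```python
import json

# Assumed ceiling taken from the "up to 1 million bytes" figure in this guide.
# The service's exact limit can differ, so treat this as a pre-flight check only.
MAX_INPUT_BYTES = 1_000_000


def build_long_audio_body(
    project_number: str,
    text: str,
    output_gcs_uri: str,
    location: str = "global",
) -> dict:
    """Return a JSON-serializable body for a synthesizeLongAudio request."""
    size = len(text.encode("utf-8"))  # the limit is expressed in bytes, not characters
    if size > MAX_INPUT_BYTES:
        raise ValueError(f"input is {size} bytes; limit is about {MAX_INPUT_BYTES}")
    if not output_gcs_uri.startswith("gs://"):
        raise ValueError("output_gcs_uri must look like gs://bucket_name/file_name.wav")
    return {
        "parent": f"projects/{project_number}/locations/{location}",
        "audio_config": {"audio_encoding": "LINEAR16"},
        "input": {"text": text},
        "voice": {"language_code": "en-us", "name": "en-us-Standard-A"},
        "output_gcs_uri": output_gcs_uri,
    }


body = build_long_audio_body("12345", "hello", "gs://bucket_name/file_name.wav")
print(json.dumps(body, indent=2))
```

Checking the encoded byte length rather than the character count matters for non-ASCII text, where one character can occupy several bytes.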

  1. Create a Google Cloud Storage bucket under the project that is used to run the synthesis. Make sure the service account used to run the synthesis has read/write access to the output GCS bucket.

  2. Execute the REST request below at the command line to synthesize audio from text using Text-to-Speech. The command uses the gcloud auth application-default print-access-token command to retrieve an authorization token for the request.

    Make sure that the service account sending the POST request has the Text-to-Speech Editor role.

    HTTP method and URL:

    POST https://texttospeech.googleapis.com/v1beta1/projects/12345/locations/global:synthesizeLongAudio

    Request JSON body:

    {
      "parent": "projects/12345/locations/global",
      "audio_config":{
          "audio_encoding":"LINEAR16"
      },
      "input":{
          "text":"hello"
      },
      "voice":{
          "language_code":"en-us",
          "name":"en-us-Standard-A"
      },
      "output_gcs_uri": "gs://bucket_name/file_name.wav"
    }
    

    You should receive a JSON response similar to the following:

    {
      "name": "23456",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.texttospeech.v1beta1.SynthesizeLongAudioMetadata",
        "progressPercentage": 0,
        "startTime": "2022-12-20T00:46:56.296191037Z",
        "lastUpdateTime": "2022-12-20T00:46:56.296191037Z"
      },
      "done": false
    }
    

  3. The JSON output of the REST command contains the name of the long-running operation in the name field. Execute the REST request below at the command line to query the state of the long-running operation.

    Make sure that the service account running the GET operation is from the same project as the one used for synthesis.

    HTTP method and URL:

    GET https://texttospeech.googleapis.com/v1beta1/projects/12345/locations/global/operations/23456

    You should receive a JSON response similar to the following:

    {
      "name": "projects/12345/locations/global/operations/23456",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.texttospeech.v1beta1.SynthesizeLongAudioMetadata",
        "progressPercentage": 100
      },
      "done": true
    }
    

  4. To list all operations running under a given project, execute the REST request below.

    Make sure that the service account running the LIST operation is from the same project as the one used for synthesis.

    HTTP method and URL:

    GET https://texttospeech.googleapis.com/v1beta1/projects/12345/locations/global/operations

    You should receive a JSON response similar to the following:

    {
      "operations": [
        {
          "name": "12345",
          "done": false
        },
        {
          "name": "23456",
          "done": false
        }
      ],
      "nextPageToken": ""
    }
    

  5. Once the long-running operation completes successfully, find the output audio file at the Cloud Storage URI that you specified in the output_gcs_uri field. If the operation did not complete successfully, find the error by querying the operation with the GET REST command, correct the error, and issue the request again.
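The polling described in steps 3 and 5 can be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than an official client: the helper names (`operation_url`, `describe_operation`, `fetch_operation`, `wait_for_operation`) are hypothetical, the access token is assumed to come from `gcloud auth application-default print-access-token`, and a failed operation is assumed to carry an `error` object with a `message` field, as long-running operations generally do.

```python
import json
import time
import urllib.request

API_ROOT = "https://texttospeech.googleapis.com/v1beta1"


def operation_url(project_number: str, operation_name: str, location: str = "global") -> str:
    """Build the GET URL from either a bare operation ID or a full resource name."""
    operation_id = operation_name.rsplit("/", 1)[-1]
    return f"{API_ROOT}/projects/{project_number}/locations/{location}/operations/{operation_id}"


def describe_operation(operation: dict) -> str:
    """Summarize an operation response as running, succeeded, or failed."""
    if not operation.get("done", False):
        percent = operation.get("metadata", {}).get("progressPercentage", 0)
        return f"running ({percent}%)"
    if "error" in operation:  # assumption: failures are reported in an "error" object
        return f"failed: {operation['error'].get('message', 'unknown error')}"
    return "succeeded"


def fetch_operation(url: str, access_token: str) -> dict:
    """Send the authenticated GET request and decode the JSON response."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_operation(url, access_token, fetch=fetch_operation,
                       interval_seconds=10.0, max_polls=30):
    """Poll until the operation reports done, or give up after max_polls attempts."""
    for _ in range(max_polls):
        operation = fetch(url, access_token)
        if operation.get("done", False):
            return operation
        time.sleep(interval_seconds)
    raise TimeoutError(f"operation at {url} did not finish after {max_polls} polls")


# Offline demonstration using the sample values from the steps above.
print(operation_url("12345", "23456"))
print(describe_operation({"done": False, "metadata": {"progressPercentage": 0}}))
print(describe_operation({"done": True, "metadata": {"progressPercentage": 100}}))
```

To poll a real operation, pass the URL from `operation_url` and a valid token to `wait_for_operation`, then inspect the result with `describe_operation`. The `fetch` parameter exists so the loop can be exercised without network access.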

Synthesize long audio from text using client libraries

Install the client library

Python

Before installing the library, make sure you've prepared your environment for Python development.

pip install --upgrade google-cloud-texttospeech

Create audio data

You can use Text-to-Speech to create a long audio file of synthetic human speech. Use the following code to create a long audio file in your GCS bucket.

Python

Before running the example, make sure you've prepared your environment for Python development.

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from google.cloud import texttospeech


def synthesize_long_audio(project_id: str, output_gcs_uri: str) -> None:
    """
    Synthesizes long input, writing the resulting audio to `output_gcs_uri`.

    Args:
        project_id: ID or number of the Google Cloud project you want to use.
        output_gcs_uri: Specifies a Cloud Storage URI for the synthesis results.
            Must be specified in the format:
            ``gs://bucket_name/object_name``, and the bucket must
            already exist.
    """

    client = texttospeech.TextToSpeechLongAudioSynthesizeClient()

    # Named synthesis_input to avoid shadowing Python's built-in input().
    synthesis_input = texttospeech.SynthesisInput(
        text="Test input. Replace this with any text you want to synthesize, up to 1 million bytes long!"
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16
    )

    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Standard-A"
    )

    parent = f"projects/{project_id}/locations/us-central1"

    request = texttospeech.SynthesizeLongAudioRequest(
        parent=parent,
        input=synthesis_input,
        audio_config=audio_config,
        voice=voice,
        output_gcs_uri=output_gcs_uri,
    )

    operation = client.synthesize_long_audio(request=request)
    # Set a deadline for your LRO to finish. 300 seconds is reasonable, but can be adjusted depending on the length of the input.
    # If the operation times out, that likely means there was an error. In that case, inspect the error, and try again.
    result = operation.result(timeout=300)
    print(
        "\nFinished processing, check your GCS bucket to find your audio file! Printing what should be an empty result: ",
        result,
    )
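
Because `output_gcs_uri` must follow the `gs://bucket_name/object_name` format, it can be useful to validate the value locally before starting a long-running operation. The helper below is a hypothetical sketch, not part of the client library. It only checks the shape of the URI and does not confirm that the bucket exists.

```python
def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split gs://bucket_name/object_name into (bucket_name, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"expected gs://bucket_name/object_name, got {uri!r}")
    return bucket, object_name


bucket, object_name = split_gcs_uri("gs://bucket_name/file_name.wav")
print(bucket, object_name)  # prints: bucket_name file_name.wav
```

After the check passes, the sample above could be invoked as `synthesize_long_audio("12345", "gs://bucket_name/file_name.wav")`, where both arguments are placeholder values.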

Clean up

To avoid unnecessary Google Cloud Platform charges, use the Google Cloud console to delete your project if you do not need it.

What's next

  • Learn more about Cloud Text-to-Speech by reading the basics.
  • Review the list of available voices you can use for synthetic speech.