Transcribing Long Audio File

This page demonstrates how to transcribe long audio files (longer than 1 minute) to text using asynchronous speech recognition.

Asynchronous speech recognition starts a long running audio processing operation. Use asynchronous speech recognition to recognize audio that is longer than a minute and stored in Google Cloud Storage. For shorter audio, including audio stored locally (inline), Synchronous Speech Recognition is faster and simpler.

You can retrieve the results of the operation via the google.longrunning.Operations interface. Audio content can be sent directly to Cloud Speech-to-Text or it can process audio content that already resides in Google Cloud Storage. See also the audio limits for asynchronous speech recognition requests.

The Speech-to-Text v1 is officially released and is generally available from the https://speech.googleapis.com/v1/speech endpoint. The Client Libraries are released as Alpha and will likely be changed in backward-incompatible ways. The client libraries are currently not recommended for production use.

These samples require that you have set up gcloud and have created and activated a service account. For information about setting up gcloud, and also creating and activating a service account, see Quickstart.

Protocol

Refer to the speech:longrunningrecognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud Platform Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the Quickstart.

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {
    'language_code': 'en-US'
  },
  'audio':{
    'uri':'gs://gcs-test-data/vr.flac'
  }
}" "https://speech.googleapis.com/v1/speech:longrunningrecognize"

See the RecognitionConfig and RecognitionAudio reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "name": "7612202767953098924"
}

where name is the name of the long running operation created for the request.

Wait approximately 30 seconds for processing to complete. To retrieve the result of the operation, make a GET request to the https://speech.googleapis.com/v1/operations/ endpoint. Replace your-operation-name with the name received from your longrunningrecognize request.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     "https://speech.googleapis.com/v1/operations/your-operation-name"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "okay so what am I doing here...(etc)...",
            "confidence": 0.96096134,
          }
        ]
      },
      {
        "alternatives": [
          {
            ...
          }
        ]
      }
    ]
  }
}

If the operation has not completed, you can poll the endpoint by repeatedly making the GET request until the done property of the response is true.

GCLOUD COMMAND

Refer to the recognize-long-running command for complete details.

To perform asynchronous speech recognition, use the gcloud command line tool, providing the path of a local file or a Google Cloud Storage URL.

gcloud ml speech recognize-long-running \
    'gs://cloud-samples-tests/speech/brooklyn.flac' \
     --language-code='en-US' --async

If the request is successful, the server returns the ID of the long-running operation in JSON format.

{
  "name": OPERATION_ID
}

You can then get information about the operation by running the following command.

gcloud ml speech operations describe OPERATION_ID

You can also poll the operation until it completes by running the following command.

gcloud ml speech operations wait OPERATION_ID

Once the operation completes, the operation returns the audio in JSON-format.

{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}

C#

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

static object AsyncRecognizeGcs(string storageUri)
{
    var speech = SpeechClient.Create();
    var longOperation = speech.LongRunningRecognize(new RecognitionConfig()
    {
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
        SampleRateHertz = 16000,
        LanguageCode = "en",
    }, RecognitionAudio.FromStorageUri(storageUri));
    longOperation = longOperation.PollUntilCompleted();
    var response = longOperation.Result;
    foreach (var result in response.Results)
    {
        foreach (var alternative in result.Alternatives)
        {
            Console.WriteLine($"Transcript: { alternative.Transcript}");
        }
    }
    return 0;
}

Go

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

func sendGCS(w io.Writer, client *speech.Client, gcsURI string) error {
	ctx := context.Background()

	// Send the contents of the audio file with the encoding and
	// and sample rate information to be transcripted.
	req := &speechpb.LongRunningRecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
		},
	}

	op, err := client.LongRunningRecognize(ctx, req)
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
		}
	}
	return nil
}

Java

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

/**
 * Performs non-blocking speech recognition on remote FLAC file and prints the transcription.
 *
 * @param gcsUri the path to the remote LINEAR16 audio file to transcribe.
 */
public static void asyncRecognizeGcs(String gcsUri) throws Exception {
  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure remote file request for Linear16
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);
    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s\n", alternative.getTranscript());
    }
  }
}

Node.js

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Eencoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file. This creates a recognition job that you
// can wait for now, or get its result later.
client
  .longRunningRecognize(request)
  .then(data => {
    const operation = data[0];
    // Get a Promise representation of the final result of the job
    return operation.promise();
  })
  .then(data => {
    const response = data[0];
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(`Transcription: ${transcription}`);
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

PHP

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

use Google\Cloud\Speech\SpeechClient;
use Google\Cloud\Storage\StorageClient;
use Google\Cloud\Core\ExponentialBackoff;

/**
 * Transcribe an audio file using Google Cloud Speech API
 * Example:
 * ```
 * transcribe_async_gcs('your-bucket-name', 'audiofile.wav');
 * ```.
 *
 * @param string $bucketName The Cloud Storage bucket name.
 * @param string $objectName The Cloud Storage object name.
 * @param string $languageCode The Cloud Storage
 *     be recognized. Accepts BCP-47 (e.g., `"en-US"`, `"es-ES"`).
 * @param array $options configuration options.
 *
 * @return string the text transcription
 */
function transcribe_async_gcs($bucketName, $objectName, $languageCode = 'en-US', $options = [])
{
    // Create the speech client
    $speech = new SpeechClient([
        'languageCode' => $languageCode,
    ]);

    // Fetch the storage object
    $storage = new StorageClient();
    $object = $storage->bucket($bucketName)->object($objectName);

    // Create the asyncronous recognize operation
    $operation = $speech->beginRecognizeOperation(
        $object,
        $options
    );

    // Wait for the operation to complete
    $backoff = new ExponentialBackoff(10);
    $backoff->execute(function () use ($operation) {
        print('Waiting for operation to complete' . PHP_EOL);
        $operation->reload();
        if (!$operation->isComplete()) {
            throw new Exception('Job has not yet completed', 500);
        }
    });

    // Print the results
    if ($operation->isComplete()) {
        $results = $operation->results();
        foreach ($results as $result) {
            $alternative = $result->alternatives()[0];
            printf('Transcript: %s' . PHP_EOL, $alternative['transcript']);
            printf('Confidence: %s' . PHP_EOL, $alternative['confidence']);
        }
    }
}

Python

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US')

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u'Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))

Ruby

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

# storage_path = "Path to file in Cloud Storage, eg. gs://bucket/audio.raw"

require "google/cloud/speech"

speech = Google::Cloud::Speech.new

config     = { encoding:          :LINEAR16,
               sample_rate_hertz: 16000,
               language_code:     "en-US"   }
audio  = { uri: storage_path }

operation = speech.long_running_recognize config, audio

puts "Operation started"

operation.wait_until_done!

raise operation.results.message if operation.error?

results = operation.response.results

alternatives = results.first.alternatives
alternatives.each do |alternative|
  puts "Transcription: #{alternative.transcript}"
end

Was this page helpful? Let us know how we did:

Send feedback about...

Cloud Speech API Documentation