Transcribing long audio files

This page demonstrates how to transcribe long audio files (longer than 1 minute) to text using asynchronous speech recognition.

Asynchronous speech recognition starts a long running audio processing operation. Use asynchronous speech recognition to recognize audio that is longer than a minute. For shorter audio, Synchronous Speech Recognition is faster and simpler.

You can retrieve the results of the operation via the google.longrunning.Operations interface. Audio content can be sent directly to Cloud Speech-to-Text or it can process audio content that already resides in Google Cloud Storage. See also the audio limits for asynchronous speech recognition requests.

The Speech-to-Text v1 is officially released and is generally available from the endpoint. The Client Libraries are released as Alpha and will likely be changed in backward-incompatible ways. The client libraries are currently not recommended for production use.

These samples require that you have set up gcloud and have created and activated a service account. For information about setting up gcloud, and also creating and activating a service account, see Quickstart.


Refer to the speech:longrunningrecognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud Platform Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the Quickstart.

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {
    'language_code': 'en-US'
}" ""

See the RecognitionConfig and RecognitionAudio reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

  "name": "7612202767953098924"

where name is the name of the long running operation created for the request.

Wait approximately 30 seconds for processing to complete. To retrieve the result of the operation, make a GET request to the endpoint. Replace your-operation-name with the name received from your longrunningrecognize request.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

  "name": "7612202767953098924",
  "metadata": {
    "@type": "",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  "done": true,
  "response": {
    "@type": "",
    "results": [
        "alternatives": [
            "transcript": "okay so what am I doing here...(etc)...",
            "confidence": 0.96096134,
        "alternatives": [

If the operation has not completed, you can poll the endpoint by repeatedly making the GET request until the done property of the response is true.


Refer to the recognize-long-running command for complete details.

To perform asynchronous speech recognition, use the gcloud command line tool, providing the path of a local file or a Google Cloud Storage URL.

gcloud ml speech recognize-long-running \
    'gs://cloud-samples-tests/speech/brooklyn.flac' \
     --language-code='en-US' --async

If the request is successful, the server returns the ID of the long-running operation in JSON format.

  "name": OPERATION_ID

You can then get information about the operation by running the following command.

gcloud ml speech operations describe OPERATION_ID

You can also poll the operation until it completes by running the following command.

gcloud ml speech operations wait OPERATION_ID

Once the operation completes, the operation returns the audio in JSON-format.

  "@type": "",
  "results": [
      "alternatives": [
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"


static object AsyncRecognizeGcs(string storageUri)
    var speech = SpeechClient.Create();
    var longOperation = speech.LongRunningRecognize(new RecognitionConfig()
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
        SampleRateHertz = 16000,
        LanguageCode = "en",
    }, RecognitionAudio.FromStorageUri(storageUri));
    longOperation = longOperation.PollUntilCompleted();
    var response = longOperation.Result;
    foreach (var result in response.Results)
        foreach (var alternative in result.Alternatives)
            Console.WriteLine($"Transcript: { alternative.Transcript}");
    return 0;


func sendGCS(w io.Writer, client *speech.Client, gcsURI string) error {
	ctx := context.Background()

	// Send the contents of the audio file with the encoding and
	// and sample rate information to be transcripted.
	req := &speechpb.LongRunningRecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},

	op, err := client.LongRunningRecognize(ctx, req)
	if err != nil {
		return err
	resp, err := op.Wait(ctx)
	if err != nil {
		return err

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
	return nil


 * Performs non-blocking speech recognition on remote FLAC file and prints the transcription.
 * @param gcsUri the path to the remote LINEAR16 audio file to transcribe.
public static void asyncRecognizeGcs(String gcsUri) throws Exception {
  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure remote file request for Linear16
    RecognitionConfig config =
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);
    while (!response.isDone()) {
      System.out.println("Waiting for response...");

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s\n", alternative.getTranscript());


// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

 * TODO(developer): Uncomment the following lines before running the sample.
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,

const audio = {
  uri: gcsUri,

const request = {
  config: config,
  audio: audio,

// Detects speech in the audio file. This creates a recognition job that you
// can wait for now, or get its result later.
const [operation] = await client.longRunningRecognize(request);
// Get a Promise representation of the final result of the job
const [response] = await operation.promise();
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
console.log(`Transcription: ${transcription}`);


use Google\Cloud\Speech\V1\SpeechClient;
use Google\Cloud\Speech\V1\RecognitionAudio;
use Google\Cloud\Speech\V1\RecognitionConfig;
use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

 * Transcribe an audio file using Google Cloud Speech API
 * Example:
 * ```
 * transcribe_async_gcs('your-bucket-name', 'audiofile.wav');
 * ```.
 * @param string $bucketName The Cloud Storage bucket name.
 * @param string $objectName The Cloud Storage object name.
 * @param string $languageCode The Cloud Storage
 *     be recognized. Accepts BCP-47 (e.g., `"en-US"`, `"es-ES"`).
 * @param array $options configuration options.
 * @return string the text transcription
function transcribe_async_gcs($audioFile)
    // change these variables
    $encoding = AudioEncoding::LINEAR16;
    $sampleRateHertz = 32000;
    $languageCode = 'en-US';

    // set string as audio content
    $audio = (new RecognitionAudio())

    // set config
    $config = (new RecognitionConfig())

    // create the speech client
    $client = new SpeechClient();

    // create the asyncronous recognize operation
    $operation = $client->longRunningRecognize($config, $audio);

    if ($operation->operationSucceeded()) {
        $response = $operation->getResult();

        // each result is for a consecutive portion of the audio. iterate
        // through them to get the transcripts for the entire audio file.
        foreach ($response->getResults() as $result) {
            $alternatives = $result->getAlternatives();
            $mostLikely = $alternatives[0];
            $transcript = $mostLikely->getTranscript();
            $confidence = $mostLikely->getConfidence();
            printf('Transcript: %s' . PHP_EOL, $transcript);
            printf('Confidence: %s' . PHP_EOL, $confidence);
    } else {



def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from import speech
    from import enums
    from import types
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u'Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))


# storage_path = "Path to file in Cloud Storage, eg. gs://bucket/audio.raw"

require "google/cloud/speech"

speech =

config     = { encoding:          :LINEAR16,
               sample_rate_hertz: 16000,
               language_code:     "en-US"   }
audio  = { uri: storage_path }

operation = speech.long_running_recognize config, audio

puts "Operation started"


raise operation.results.message if operation.error?

results = operation.response.results

alternatives = results.first.alternatives
alternatives.each do |alternative|
  puts "Transcription: #{alternative.transcript}"

Hai trovato utile questa pagina? Facci sapere cosa ne pensi:

Invia feedback per...

Cloud Speech-to-Text Documentation
Hai bisogno di assistenza? Visita la nostra pagina di assistenza.