This page demonstrates how to transcribe long audio files (longer than 1 minute) to text using the Speech-to-Text API and asynchronous speech recognition.
About asynchronous speech recognition
Asynchronous speech recognition starts a long-running audio processing operation. Use asynchronous speech recognition to transcribe audio that is longer than 60 seconds. For shorter audio, synchronous speech recognition is faster and simpler. The upper limit for asynchronous speech recognition is 480 minutes.
Speech-to-Text and asynchronous processing
Audio content can be sent directly to Speech-to-Text from a local file for asynchronous processing. However, the audio time limit for local files is 60 seconds. Attempting to transcribe local audio files that are longer than 60 seconds will result in an error. To use asynchronous speech recognition to transcribe audio longer than 60 seconds, you must have your data saved in a Google Cloud Storage bucket.
You can retrieve the results of the operation using the google.longrunning.Operations interface. Results remain available for retrieval for 5 days (120 hours). You can also have your results uploaded directly to a Google Cloud Storage bucket.
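The limits described above can be expressed as a small pre-flight check. The following Python sketch is illustrative only: the function name `check_async_request` and its parameters are hypothetical and not part of any Speech-to-Text library, and it assumes you already know the duration of your audio.

```python
# Limits taken from the text above; the helper itself is hypothetical.
LOCAL_LIMIT_SECONDS = 60
ASYNC_LIMIT_SECONDS = 480 * 60


def check_async_request(duration_seconds: float, source: str) -> str:
    """Return "ok" if the audio can be sent for asynchronous recognition,
    otherwise a short reason why it cannot."""
    if duration_seconds > ASYNC_LIMIT_SECONDS:
        return "audio exceeds the 480-minute limit"
    # Treat anything that is not a gs:// URI as a local file.
    is_gcs = source.startswith("gs://")
    if not is_gcs and duration_seconds > LOCAL_LIMIT_SECONDS:
        return "local audio longer than 60 seconds must be uploaded to Cloud Storage"
    return "ok"
```

For example, `check_async_request(120, "gs://bucket/audio.flac")` returns `"ok"`, while the same duration with a local path returns the Cloud Storage reason.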
Transcribe long audio files using a Google Cloud Storage file
These samples use a Cloud Storage bucket to store the raw audio input for the long-running transcription process. For an example of a typical longrunningrecognize operation response, see the reference documentation.
Protocol
Refer to the speech:longrunningrecognize API endpoint for complete details.
To perform asynchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.
curl -X POST \
  -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'config': {
      'language_code': 'en-US'
    },
    'audio': {
      'uri': 'gs://cloud-samples-tests/speech/brooklyn.flac'
    }
  }" "https://speech.googleapis.com/v1/speech:longrunningrecognize"
See the RecognitionConfig and RecognitionAudio reference documentation for more information on configuring the request body.
If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:
{ "name": "7612202767953098924" }
where name is the name of the long-running operation created for the request.
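If you script this step rather than using curl, the request body and the returned operation name can be handled with the standard library alone. This is a minimal sketch: `build_request_body` and `operation_name` are hypothetical helper names, and sending the request (with whatever HTTP client and access token you use) is left out.

```python
import json


def build_request_body(gcs_uri: str, language_code: str = "en-US") -> str:
    """Serialize a request body with the same fields as the curl example."""
    body = {
        "config": {"language_code": language_code},
        "audio": {"uri": gcs_uri},
    }
    return json.dumps(body)


def operation_name(response_text: str) -> str:
    """Extract the operation name from the JSON response text."""
    return json.loads(response_text)["name"]
```

Building the body with `json.dumps` produces strictly valid JSON, which avoids the quoting issues that come with hand-written shell strings.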
Wait for processing to complete. Processing time differs depending on your source audio. In most cases, results are available after about half the duration of the source audio.
You can get the status of your long-running operation by making a GET request to the https://speech.googleapis.com/v1/operations/your-operation-name endpoint. Replace your-operation-name with the name returned from your longrunningrecognize request.
curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
  -H "Content-Type: application/json; charset=utf-8" \
  "https://speech.googleapis.com/v1/operations/your-operation-name"
If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:
{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "how old is the Brooklyn Bridge",
            "confidence": 0.96096134
          }
        ]
      },
      {
        "alternatives": [
          { ... }
        ]
      }
    ]
  }
}
If the operation has not completed, you can poll the endpoint by repeatedly making the GET request until the done property of the response is true.
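The polling described above can be sketched as a simple loop. This example is an illustration under assumptions, not an official client: `wait_for_operation` and `transcripts` are hypothetical names, the interval and attempt count are arbitrary defaults, and `get_operation` stands in for whatever code performs the authenticated GET request and returns the parsed JSON.

```python
import time
from typing import Callable


def wait_for_operation(
    get_operation: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
) -> dict:
    """Call get_operation() until the returned dict has a true "done" property."""
    for attempt in range(max_attempts):
        operation = get_operation()
        if operation.get("done"):
            return operation
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError("operation did not finish within the polling budget")


def transcripts(operation: dict) -> list:
    """Collect the first alternative's transcript from each result."""
    results = operation.get("response", {}).get("results", [])
    return [
        r["alternatives"][0].get("transcript", "")
        for r in results
        if r.get("alternatives")
    ]
```

Passing the fetch step in as a callable keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.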
gcloud
Refer to the recognize-long-running command for complete details.
To perform asynchronous speech recognition, use the Google Cloud CLI, providing the path of a local file or a Google Cloud Storage URL.
gcloud ml speech recognize-long-running \
  'gs://cloud-samples-tests/speech/brooklyn.flac' \
  --language-code='en-US' --async
If the request is successful, the server returns the ID of the long-running operation in JSON format.
{ "name": OPERATION_ID }
You can then get information about the operation by running the following command.
gcloud ml speech operations describe OPERATION_ID
You can also poll the operation until it completes by running the following command.
gcloud ml speech operations wait OPERATION_ID
After the operation completes, it returns a transcript of the audio in JSON format.
{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}
Go
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
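As a rough illustration of the flow with the Python client library, the sketch below starts a long-running recognition on a Cloud Storage file and waits for the result. It assumes the `google-cloud-speech` package is installed and Application Default Credentials are configured. The function names `transcribe_gcs` and `top_transcripts` and the 90-second timeout are choices made for this example, so treat it as a starting point rather than the official sample.

```python
def top_transcripts(results) -> list:
    """Return the first alternative's transcript for each result."""
    return [r.alternatives[0].transcript for r in results if r.alternatives]


def transcribe_gcs(gcs_uri: str, timeout_seconds: float = 90) -> list:
    """Start asynchronous recognition on a Cloud Storage file and wait for it."""
    # Imported here so the helper above stays usable without the library installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    # Mirrors the curl example: only the language code is set.
    config = speech.RecognitionConfig(language_code="en-US")

    operation = client.long_running_recognize(config=config, audio=audio)
    # Blocks until the operation finishes or the timeout elapses.
    response = operation.result(timeout=timeout_seconds)
    return top_transcripts(response.results)
```

A call such as `transcribe_gcs("gs://cloud-samples-tests/speech/brooklyn.flac")` would return a list with one transcript string per result. For long recordings, raise the timeout or poll the operation instead of blocking.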
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.
Upload your transcription results to a Cloud Storage bucket
Speech-to-Text supports uploading your long-running recognition results directly to a Cloud Storage bucket. If you combine this feature with Cloud Storage Triggers, each upload can trigger a notification that calls Cloud Functions, which removes the need to poll Speech-to-Text for recognition results.
To have your results uploaded to a Cloud Storage bucket, provide the optional TranscriptOutputConfig output configuration in your long-running recognition request.
message TranscriptOutputConfig {
  oneof output_type {
    // Specifies a Cloud Storage URI for the recognition results. Must be
    // specified in the format: `gs://bucket_name/object_name`
    string gcs_uri = 1;
  }
}
Protocol
Refer to the longrunningrecognize API endpoint for complete details.
The following example shows how to send a POST request using curl, where the body of the request specifies the path to a Cloud Storage bucket. The results are uploaded to this location as a JSON file that stores SpeechRecognitionResult.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'config': {...},
    'output_config': {
      'gcs_uri': 'gs://bucket/result-output-path.json'
    },
    'audio': {
      'uri': 'gs://bucket/audio-path'
    }
  }" "https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize"
The LongRunningRecognizeResponse
includes the path to the Cloud Storage bucket where the upload was attempted. If
the upload was unsuccessful, an output error will be returned. If a file with
the same name already exists, the upload writes the results to a new file with a
timestamp as the suffix.
{
  ...
  "metadata": {
    ...
    "outputConfig": {...}
  },
  ...
  "response": {
    ...
    "results": [...],
    "outputConfig": {
      "gcs_uri": "gs://bucket/result-output-path"
    },
    "outputError": {...}
  }
}
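The request and response handling for this feature might look like the following Python sketch. The helper names `build_request_with_output` and `upload_location` are hypothetical. The `config` contents are an assumption borrowed from the earlier transcription example, because the request above elides them. The code accepts both `gcs_uri` (as shown in the response above) and `gcsUri` (the usual camel-case JSON spelling), since the exact key may vary.

```python
import json


def build_request_with_output(
    audio_uri: str, output_uri: str, language_code: str = "en-US"
) -> str:
    """Serialize a request body that asks for results to be written to output_uri."""
    if not output_uri.startswith("gs://"):
        raise ValueError("output_uri must look like gs://bucket_name/object_name")
    body = {
        "config": {"language_code": language_code},  # assumed minimal config
        "output_config": {"gcs_uri": output_uri},
        "audio": {"uri": audio_uri},
    }
    return json.dumps(body)


def upload_location(operation: dict) -> str:
    """Return the Cloud Storage URI reported by a finished operation.

    Raises RuntimeError if the response carries a non-empty outputError.
    """
    response = operation.get("response", {})
    error = response.get("outputError")
    if error:
        raise RuntimeError(f"upload failed: {error}")
    output_config = response.get("outputConfig", {})
    return output_config.get("gcs_uri") or output_config.get("gcsUri", "")
```

Checking `outputError` before reading the output location reflects the behavior described above: the response reports where the upload was attempted even when it did not succeed.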