Learn how to create real-time WebVTT captions for audio or video clips by using the Speech-to-Text API in a Dataflow pipeline.
This tutorial shows you how to use the Dataflow flex template provided by the Automatic WebVTT Caption From Streaming Speech-to-Text API By Using Dataflow reference implementation to address several common challenges in creating WebVTT captions in real time. These challenges are:
- Minimizing latency when creating the caption.
- Constructing start and end time offsets for each caption.
- Specifying how long to make each caption.
This tutorial is intended for developers, and assumes that you have basic knowledge of Dataflow pipelines.
Architecture
This tutorial is based on the Automatic WebVTT Caption From Streaming Speech-to-Text API By Using Dataflow reference implementation on GitHub. The reference implementation works as follows:
- A media clip is uploaded to a Cloud Storage bucket. The addition of the object to the bucket causes a message to be sent to a Pub/Sub topic.
- The Dataflow pipeline has a subscription to that Pub/Sub topic. The arrival of the message triggers the pipeline to retrieve and process the media clip.
- The pipeline processes the media clip. It first calls the Speech-to-Text API to get a text transcript of the audio from the media file, and then constructs captions from this transcript.
- The pipeline outputs the captions in WebVTT format and publishes them to a Pub/Sub topic.
This reference implementation works best with smaller media files. We recommend using it with files that are 5 MB or smaller.
The following diagram illustrates the architecture of the reference implementation:
Data processing design decisions
When you run the Dataflow pipeline, you use a publicly hosted flex template that calls Java code from the reference implementation. That code constructs the captions from the results returned by the Speech-to-Text API. The key design decisions in this code are as follows:
- The code uses the Streaming Recognition method of the Speech-to-Text API to process the clip. This method provides interim results in addition to finalized results, with the interim results being of lower quality but arriving faster. The code in the reference implementation processes the interim results, which reduces the latency between receiving the clip and producing the caption.
- Transcripts returned in the interim results must meet a stability bar in order to be processed. Stability for interim results can range from 0.0, which indicates complete instability, to 1.0, which indicates complete stability. The stability bar is set to 0.8 by default, but you can change it by using the --stabilityThreshold parameter when you run the pipeline. Because the interim results aren't fully processed, transcripts can contain low-stability and overlapping phrases. To get an idea of what this looks like, see the example for StreamingRecognizeResponse.
- To handle this, the code breaks the results into captions based on word count, which you specify with the --wordCount parameter. It then uses the Apache Beam Timer API to compare each caption to the previous one and skip the words already displayed. See @ProcessElement to review the code that implements this logic. Lower --wordCount values improve pipeline performance by letting the code evaluate and compare smaller captions. --wordCount is set to 10 by default.
- The start and end time offsets for each caption are constructed by using the result_end_time field of the StreamingRecognitionResult object. Starting from a time of 00:00, the code keeps track of the result_end_time of each caption processed and uses that as the start time for the following caption.
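The word-count splitting and offset-tracking logic described above can be sketched in Python. This is an illustrative simplification under stated assumptions: the actual reference implementation is Java and uses the Apache Beam Timer API, and the tuple shape used here stands in for StreamingRecognitionResult.

```python
# Illustrative sketch only: the reference implementation is Java and uses the
# Apache Beam Timer API. The tuple shape (stability, transcript, end_time)
# is an assumption standing in for StreamingRecognitionResult.

def to_vtt_timestamp(seconds):
    """Format a second offset as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    millis = int(round((seconds - int(seconds)) * 1000))
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def build_captions(results, stability_threshold=0.8, word_count=10):
    """Turn interim streaming results into (start, end, text) captions.

    Results below the stability bar are skipped, words already emitted in a
    previous caption are not repeated, and each caption's end offset becomes
    the start offset of the next caption.
    """
    captions = []
    emitted_words = 0   # words already shown in earlier captions
    start_time = 0.0    # the first caption starts at 00:00
    for stability, transcript, end_time in results:
        if stability < stability_threshold:
            continue  # below the stability bar: wait for a steadier result
        # Interim results overlap, so drop the words already displayed.
        new_words = transcript.split()[emitted_words:]
        if len(new_words) >= word_count:
            captions.append((to_vtt_timestamp(start_time),
                             to_vtt_timestamp(end_time),
                             " ".join(new_words[:word_count])))
            emitted_words += word_count
            start_time = end_time  # end offset seeds the next caption's start
    return captions
```

Lowering word_count here, as with the --wordCount pipeline parameter, means each comparison operates on a smaller caption.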
Objectives
- Create a Cloud Storage bucket, a Pub/Sub topic, and a Pub/Sub notification for Cloud Storage to link them. These resources together create the mechanism to feed media files to the pipeline for processing.
- Create and run a Dataflow pipeline that consumes media clips, processes them into text transcripts with the Speech-to-Text API, and then creates captions in WebVTT format from those transcripts.
- Publish the WebVTT captions to a Pub/Sub topic.
Costs
This tutorial uses billable components of Google Cloud, including:
- Cloud Storage
- Dataflow
- Pub/Sub
- Speech-to-Text API
Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Cloud Storage, Dataflow, Pub/Sub, and Speech-to-Text APIs.
Set environment variables
Set environment variables for the project and region you are using for this tutorial.
- Activate Cloud Shell
In Cloud Shell, run the following commands to set environment variables for the project and region. Replace myProject with the ID of the project you are using to complete this tutorial:
export PROJECT=myProject
export REGION=us-central1
Create a Cloud Storage bucket
Create a bucket to receive media files for captioning.
In Cloud Shell, run the following command to create a bucket:
export MEDIA_CLIPS_BUCKET=media_$PROJECT
gsutil mb gs://$MEDIA_CLIPS_BUCKET
Create the Pub/Sub topics and subscriptions
Create two Pub/Sub topics and two Pub/Sub subscriptions. The first topic/subscription pair is used to notify the Dataflow pipeline that media files are available for processing. The second topic/subscription pair is used to output the caption files from the Dataflow pipeline so that they are available for reintegration with the media files.
In Cloud Shell, run the following commands to create the topics and subscriptions:
export MEDIA_TOPIC="media-clips"
export MEDIA_SUBSCRIPTION="media-clips"
export CAPTION_TOPIC="captions"
export CAPTION_SUBSCRIPTION="captions"
gcloud pubsub topics create $MEDIA_TOPIC
gcloud pubsub subscriptions create $MEDIA_SUBSCRIPTION --topic=$MEDIA_TOPIC
gcloud pubsub topics create $CAPTION_TOPIC
gcloud pubsub subscriptions create $CAPTION_SUBSCRIPTION --topic=$CAPTION_TOPIC
Create the notification from Cloud Storage to Pub/Sub
Create a notification that sends a message to the media-clips
topic when you
upload a media file to the Cloud Storage bucket. This triggers the
Dataflow pipeline to process the media file.
Get the Cloud Storage service account email
- Click Settings.
- On the Project Access tab, copy the service account email address from the Cloud Storage Service Account section.
Update the Cloud Storage service account permissions
- Open the Pub/Sub topics page.
- For the media-clips topic, click More, then click View permissions.
- Click Add Member. In the pane that appears:
  - For New members, paste in the Cloud Storage service account email address.
  - For Select a role, choose Pub/Sub, and then choose Pub/Sub Publisher.
  - Click Save.
Create the notification
- Activate Cloud Shell
In Cloud Shell, run the following command to create the Pub/Sub notification:
gsutil notification create -t media-clips -f json gs://media_$PROJECT
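For reference, each notification arrives at the pipeline as a Pub/Sub message whose attributes identify the event and the uploaded object. For an upload, the key attributes look roughly like this (values here are illustrative):

```
eventType: OBJECT_FINALIZE
payloadFormat: JSON_API_V1
bucketId: media_myProject
objectId: wav_mono_kellogs.wav
```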
Start the Dataflow pipeline
In Cloud Shell, run the following command to run the pipeline:
gcloud beta dataflow flex-template run "create-captions" \
  --project=$PROJECT \
  --region=us-central1 \
  --template-file-gcs-location=gs://dataflow-stt-audio-clips/dynamic_template_stt_analytics.json \
  --parameters=^~^streaming=true~enableStreamingEngine=true~numWorkers=1~maxNumWorkers=3~runner=DataflowRunner~autoscalingAlgorithm=THROUGHPUT_BASED~workerMachineType=n1-standard-4~outputTopic=projects/$PROJECT/topics/$CAPTION_TOPIC~inputNotificationSubscription=projects/$PROJECT/subscriptions/$MEDIA_SUBSCRIPTION~wordCount=10
Wait until the create-captions pipeline shows a status of Running. This may take a few minutes.
Test the pipeline
Copy an audio file to the Cloud Storage bucket to trigger the pipeline to
process it, then check the messages in the captions
subscription to see the
output captions.
- Activate Cloud Shell
In Cloud Shell, run the following command to copy the audio file:
gsutil cp gs://dataflow-stt-audio-clips/wav_mono_kellogs.wav gs://$MEDIA_CLIPS_BUCKET
- Click captions in the subscriptions list.
- Click View Messages.
- On the Messages pane, click Pull.
- Click Expand for any message to see the contents. You should see results similar to the following:
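The messages contain WebVTT captions. The general shape of the output is shown below; the caption text and timings here are illustrative, not the actual transcript of the sample clip:

```
WEBVTT

00:00:00.000 --> 00:00:03.200
the first ten words of the transcript appear here

00:00:03.200 --> 00:00:06.400
the next ten words of the transcript appear here
```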
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete just those resources. The following sections describe how to delete them.
Delete the project
The easiest way to eliminate billing is to delete the project you created for the tutorial.
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the components
If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.
Stop the Dataflow job
- Open the Dataflow Jobs page
- In the jobs list, click create-captions.
- On the job details page, click Stop.
- Select Cancel.
- Click Stop job.
Delete the Cloud Storage buckets
- Open the Cloud Storage browser
- Select the checkboxes of the media_<myProject> and dataflow-staging-us-central1-<projectNumber> buckets.
- Click Delete.
- In the overlay window that appears, type DELETE, and then click Confirm.
Delete the Pub/Sub topics and subscriptions
- Open the Pub/Sub Subscriptions page
- Select the checkboxes of the media-clips and captions subscriptions.
- Click Delete.
- In the overlay window that appears, confirm you want to delete the subscription and its contents by clicking Delete.
- Click Topics.
- Select the checkboxes of the media-clips and captions topics.
- Click Delete.
- In the overlay window that appears, type delete, and then click Delete.
What's next
- Review the Automatic WebVTT Caption From Streaming Speech-to-Text API By Using Dataflow reference implementation.
- Learn about other smart analytics solutions.
- Learn more about Dataflow.