Captioning media clips in real time by using Dataflow, Pub/Sub, and the Speech-to-Text API

Learn how to create real-time WebVTT captions for audio or video clips by using the Speech-to-Text API in a Dataflow pipeline.

This tutorial shows you how to use the Dataflow flex template provided by the Automatic WebVTT Caption From Streaming Speech-to-Text API By Using Dataflow reference implementation to address several common challenges of creating WebVTT captions in real time. These challenges are:

  • Minimizing latency when creating the caption.
  • Constructing start and end time offsets for each caption.
  • Specifying how long to make each caption.

This tutorial is intended for developers, and assumes that you have basic knowledge of Dataflow pipelines.

Architecture

This tutorial is based on the Automatic WebVTT Caption From Streaming Speech-to-Text API By Using Dataflow reference implementation on GitHub. The reference implementation works as follows:

  1. A media clip is uploaded to a Cloud Storage bucket. The addition of the object to the bucket causes a message to be sent to a Pub/Sub topic.
  2. The Dataflow pipeline has a subscription to that Pub/Sub topic. The arrival of the message triggers the pipeline to retrieve and process the media clip.
  3. The pipeline processes the media clip. It first calls the Speech-to-Text API to get a text transcript of the audio from the media file, and then constructs captions from this transcript.
  4. The pipeline outputs the captions in WebVTT format and publishes them to a Pub/Sub topic.

This reference implementation works best with smaller media files. We recommend using it with files that are 5 MB or smaller.

The following diagram illustrates the architecture of the reference implementation:

Diagram showing the architecture of the media transcription solution.

Data processing design decisions

When you run the Dataflow pipeline, you use a publicly hosted flex template that calls Java code from the reference implementation. That code constructs the captions from the results returned by the Speech-to-Text API. The key design decisions in this code are as follows:

  • The code uses the Streaming Recognition method of the Speech-to-Text API to process the clip. This method provides interim results in addition to finalized results; interim results are lower quality but arrive faster. The code in the reference implementation processes the interim results, which reduces the latency between receiving the clip and producing the caption.
  • Transcripts returned in the interim results must meet a stability threshold in order to be processed. Stability for interim results ranges from 0.0, which indicates complete instability, to 1.0, which indicates complete stability. The threshold is set to 0.8 by default, but you can change it by using the --stabilityThreshold parameter when you run the pipeline.
  • Because the interim results aren't fully processed, transcripts can contain low-stability and overlapping phrases. To get an idea of what this looks like, see the example for StreamingRecognizeResponse.

    To handle this, the code breaks the results into captions based on word count, which you specify with the --wordCount parameter. It then uses the Apache Beam Timer API to compare each caption to the previous one and skip the words already displayed. See @ProcessElement to review the code that implements this logic.

    Lower --wordCount values improve pipeline performance by letting the code evaluate and compare smaller captions. --wordCount is set to 10 by default.

  • The start and end time offsets for each caption are constructed using the result_end_time field of the StreamingRecognitionResult object. Starting from a time of 00:00, the code keeps track of the result_end_time of each caption processed and uses that as the start time for the following caption.
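The chunking and offset handling described above can be sketched in a few lines of Python. This is a much-simplified illustration, not the reference implementation (which is Java and uses the Apache Beam Timer API); `build_captions`, its tuple inputs, and the per-result offset bookkeeping are stand-ins invented for this sketch:

```python
def to_vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    whole = int(seconds)
    millis = int(round((seconds - whole) * 1000))
    hours, rem = divmod(whole, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def build_captions(results, word_count=10, stability_threshold=0.8):
    """Turn streaming transcript results into caption tuples.

    `results` stands in for StreamingRecognitionResult messages as
    (transcript, stability, result_end_time_seconds) tuples.
    """
    captions = []
    start = 0.0   # the first caption starts at 00:00
    emitted = 0   # words already shown, so overlapping phrases are skipped
    for transcript, stability, end_time in results:
        if stability < stability_threshold:
            continue  # drop low-stability interim results
        words = transcript.split()
        new_words = words[emitted:]  # skip words from earlier captions
        for i in range(0, len(new_words), word_count):
            chunk = " ".join(new_words[i:i + word_count])
            captions.append(
                (to_vtt_timestamp(start), to_vtt_timestamp(end_time), chunk))
            # each result's end time becomes the next caption's start time
            start = end_time
        emitted = len(words)
    return captions

# Interim results overlap and grow as recognition stabilizes
# (illustrative transcript text).
interim = [
    ("the best to", 0.5, 1.2),    # below the threshold: skipped
    ("the best to you each", 0.9, 2.4),
    ("the best to you each morning is our wish", 0.9, 4.8),
]
for cue in build_captions(interim, word_count=5):
    print(cue)
```

The sketch keeps a count of words already emitted so that each overlapping interim transcript contributes only its new tail, mirroring the compare-and-skip behavior the reference implementation gets from the Timer API.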

Objectives

  • Create a Cloud Storage bucket, a Pub/Sub topic, and a Pub/Sub notification for Cloud Storage that links them. Together, these resources feed media files to the pipeline for processing.
  • Create and run a Dataflow pipeline that consumes media clips, processes them into text transcripts with the Speech-to-Text API, and then creates captions in WebVTT format from those transcripts.
  • Publish the WebVTT captions to a Pub/Sub topic.

Costs

This tutorial uses billable components of Google Cloud, including:

  • Cloud Storage
  • Dataflow
  • Pub/Sub
  • Speech-to-Text API

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Cloud Storage, Dataflow, Pub/Sub, and Speech-to-Text APIs.

    Enable the APIs

Set environment variables

Set environment variables for the project and region you are using for this tutorial.

  1. Activate Cloud Shell
  2. In Cloud Shell, run the following commands to set environment variables for the project and region. Replace myProject with the ID of the project you are using to complete this tutorial:

    export PROJECT=myProject
    export REGION=us-central1
    

Create a Cloud Storage bucket

Create a bucket to receive media files for captioning.

In Cloud Shell, run the following command to create a bucket:

export MEDIA_CLIPS_BUCKET=media_$PROJECT
gsutil mb gs://$MEDIA_CLIPS_BUCKET

Create the Pub/Sub topics and subscriptions

Create two Pub/Sub topics and two Pub/Sub subscriptions. The first topic/subscription pair is used to notify the Dataflow pipeline that media files are available for processing. The second topic/subscription pair is used to output the caption files from the Dataflow pipeline so that they are available for reintegration with the media files.

  1. In Cloud Shell, run the following commands to create the topics and subscriptions:

    export MEDIA_TOPIC="media-clips"
    export MEDIA_SUBSCRIPTION="media-clips"
    export CAPTION_TOPIC="captions"
    export CAPTION_SUBSCRIPTION="captions"
    gcloud pubsub topics create $MEDIA_TOPIC
    gcloud pubsub subscriptions create $MEDIA_SUBSCRIPTION --topic=$MEDIA_TOPIC
    gcloud pubsub topics create $CAPTION_TOPIC
    gcloud pubsub subscriptions create $CAPTION_SUBSCRIPTION --topic=$CAPTION_TOPIC
    

Create the notification from Cloud Storage to Pub/Sub

Create a notification that sends a message to the media-clips topic when you upload a media file to the Cloud Storage bucket. This triggers the Dataflow pipeline to process the media file.
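The Pub/Sub message that such a notification produces carries the uploaded object's metadata as a JSON payload. The following sketch shows how the clip's gs:// URI can be derived from that payload; the payload is abbreviated, and `media_uri_from_notification` is a hypothetical helper, not part of the reference implementation:

```python
import json

# Abbreviated OBJECT_FINALIZE notification payload; real messages
# carry additional object metadata fields.
sample_payload = json.dumps({
    "kind": "storage#object",
    "bucket": "media_my-project",
    "name": "wav_mono_kellogs.wav",
    "contentType": "audio/x-wav",
})

def media_uri_from_notification(payload: str) -> str:
    """Build the gs:// URI of the uploaded clip from a notification payload."""
    obj = json.loads(payload)
    return f"gs://{obj['bucket']}/{obj['name']}"

print(media_uri_from_notification(sample_payload))
```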

Get the Cloud Storage service account email

  1. Open the Cloud Storage browser

  2. Click Settings.

  3. Copy the service account email address from the Cloud Storage Service Account section on the Project Access tab.

Update the Cloud Storage service account permissions

  1. Open the Pub/Sub topics page
  2. For the media-clips topic, click More, then click View permissions.
  3. Click Add Member. In the pane that appears:
    1. For New members, paste in the Cloud Storage service account email address.
    2. For Select a role, choose Pub/Sub and then choose Pub/Sub Publisher.
    3. Click Save.

Create the notification

  1. Activate Cloud Shell
  2. In Cloud Shell, run the following command to create the Pub/Sub notification:

    gsutil notification create -t media-clips -f json gs://media_$PROJECT
    

Start the Dataflow pipeline

  1. In Cloud Shell, run the following command to run the pipeline. The leading ^~^ in the --parameters value tells gcloud to use ~ instead of a comma as the delimiter between parameters:

    gcloud beta dataflow flex-template run "create-captions" \
    --project=$PROJECT \
    --region=$REGION \
    --template-file-gcs-location=gs://dataflow-stt-audio-clips/dynamic_template_stt_analytics.json \
    --parameters=^~^streaming=true~enableStreamingEngine=true~numWorkers=1~maxNumWorkers=3~runner=DataflowRunner~autoscalingAlgorithm=THROUGHPUT_BASED~workerMachineType=n1-standard-4~outputTopic=projects/$PROJECT/topics/$CAPTION_TOPIC~inputNotificationSubscription=projects/$PROJECT/subscriptions/$MEDIA_SUBSCRIPTION~wordCount=10
    
  2. Open the Dataflow Jobs page

  3. Wait until the create-captions pipeline shows a status of Running. This may take a few minutes.

Test the pipeline

Copy an audio file to the Cloud Storage bucket to trigger the pipeline to process it, then check the messages in the captions subscription to see the output captions.

  1. Activate Cloud Shell
  2. In Cloud Shell, run the following command to copy the audio file:

    gsutil cp gs://dataflow-stt-audio-clips/wav_mono_kellogs.wav gs://$MEDIA_CLIPS_BUCKET
    
  3. Open the Pub/Sub subscriptions page

  4. Click captions in the subscriptions list.

  5. Click View Messages.

  6. On the Messages pane, click Pull.

  7. Click Expand for any message to see the contents. You should see results similar to the following:

    Messages that contain transcriptions of an audio file
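    The caption text in these messages uses the WebVTT cue format, whose general shape is shown below. The cue text and timings here are illustrative, not the actual transcript of the sample clip:

    ```
    WEBVTT

    00:00:00.000 --> 00:00:02.400
    the best to you each

    00:00:02.400 --> 00:00:04.800
    morning is our wish
    ```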

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete just those resources. The following sections describe both options.

Delete the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the components

If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.

Stop the Dataflow job

  1. Open the Dataflow Jobs page
  2. In the jobs list, click create-captions.
  3. On the job details page, click Stop.
  4. Select Cancel.
  5. Click Stop job.

Delete the Cloud Storage buckets

  1. Open the Cloud Storage browser
  2. Select the checkboxes of the media-<myProject> and dataflow-staging-us-central1-<projectNumber> buckets.
  3. Click Delete.
  4. In the overlay window that appears, type DELETE and then click Confirm.

Delete the Pub/Sub topics and subscriptions

  1. Open the Pub/Sub Subscriptions page
  2. Select the checkboxes of the media-clips and captions subscriptions.
  3. Click Delete.
  4. In the overlay window that appears, confirm you want to delete the subscription and its contents by clicking Delete.
  5. Click Topics.
  6. Select the checkboxes of the media-clips and captions topics.
  7. Click Delete.
  8. In the overlay window that appears, type delete and then click Delete.

What's next