
Using Google Cloud Speech-to-Text to transcribe your Twilio calls in real-time

August 28, 2019
Mark Shalda

Technical Program Manager & ML Partner Engineering Lead

Developers have asked us how they can use Google Cloud’s Speech-to-Text to transcribe speech (especially phone audio) coming from Twilio, a leading cloud communications PaaS. We’re pleased to announce that it’s now easier than ever to integrate live call data with Google Cloud’s Speech-to-Text using Twilio’s Media Streams.

The new TwiML <Stream> verb streams call audio to a websocket server. This makes it simple to move call audio from your business phone system into an AI platform that can transcribe it in real time, for example to assist contact center agents and admins, and store it for later analysis.

When you combine this new functionality with Google Cloud’s Speech-to-Text abilities and other infrastructure and analytics tools like BigQuery, you can create an extremely scalable, reliable and accurate way of getting more value from your audio.


The overall architecture for this flow is as follows. Twilio creates and manages the inbound phone number. Its new <Stream> verb takes the audio from an incoming phone call and sends it to a configured websocket server, which runs in an App Engine flexible environment. From there, the server forwards the audio to Cloud Speech-to-Text as it arrives. Once a transcript is created, it’s stored in BigQuery, where real-time analysis can be performed.
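As a rough sketch of the Speech-to-Text leg of this flow, the helper below builds a streaming recognition request suited to phone audio. It assumes the audio arrives as 8 kHz mu-law, which is what Twilio Media Streams send. The function name buildStreamingRequest and the default language code are illustrative, not taken from Twilio’s sample.

```javascript
// Hypothetical helper: returns a streaming recognition request for Twilio phone audio.
// Assumption: the incoming audio is 8 kHz mu-law, the format Twilio Media Streams use.
function buildStreamingRequest(languageCode = "en-US") {
  return {
    config: {
      encoding: "MULAW",
      sampleRateHertz: 8000,
      languageCode,
    },
    // Ask for intermediate hypotheses as well as final utterances.
    interimResults: true,
  };
}

// With the @google-cloud/speech client (not imported here), the request would be
// passed to streamingRecognize, roughly like this:
//   const stream = new speech.SpeechClient().streamingRecognize(buildStreamingRequest());
//   stream.on("data", (response) => { /* handle transcripts */ });
console.log(JSON.stringify(buildStreamingRequest(), null, 2));
```

Keeping the request in one small function makes it easy to change the language or turn off interim results without touching the websocket code.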


Configuring your phone number

Once you’ve bought a number in Twilio, you’ll need to configure it to respond with TwiML, which stands for Twilio Markup Language. TwiML is a tag-based language much like HTML. When a call comes in, Twilio passes control to a webhook, which responds with TwiML that you provide.

Next, navigate to your list of phone numbers and choose your new number. On the number settings screen, scroll down to the Voice section. There is a field labelled “A Call Comes In”. Here, choose TwiML Bin from the drop down and press the plus button next to the field to create a new TwiML Bin.

Creating a TwiML Bin

TwiML Bins are a serverless way to host TwiML instructions. Using a TwiML Bin saves you from having to set up a webhook handler in your own web-hosted environment.

Give your TwiML Bin a Friendly Name that you can remember later. In the Body field, enter the following code, replacing the url attribute of the <Stream> tag and the phone number contained in the body of the <Dial> tag.
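As a minimal sketch of what such a TwiML body might look like, the JavaScript helper below assembles one from two placeholder values. It assumes the <Stream> tag is wrapped in a <Start> verb, which is how Twilio documents asynchronous streams. The function name buildTwiml, the websocket URL, and the phone number are all hypothetical. The printed XML is what you would paste into the Body field.

```javascript
// Hypothetical helper: builds a TwiML body that starts a media stream and then dials a number.
// Assumption: inputs are trusted placeholders, so no XML escaping is performed.
function buildTwiml(streamUrl, dialNumber) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Start>",
    `    <Stream url="${streamUrl}" />`,
    "  </Start>",
    `  <Dial>${dialNumber}</Dial>`,
    "</Response>",
  ].join("\n");
}

// Placeholder values only; substitute your own websocket URL and phone number.
console.log(buildTwiml("wss://your-project.example.com/", "+15555550100"));
```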


The <Stream> tag starts the audio stream asynchronously, and control then moves on to the <Dial> verb, which calls the number you specified. The audio stream ends when the call is completed.

Save your TwiML Bin and make sure that you see your Friendly Name in the “A Call Comes In” drop down next to TwiML Bin. Make sure to Save your phone number.

Setup in Google Cloud

This setup can either be done in an existing Google Cloud project or a new project. To set up a new project, follow the instructions here. Once you have the project selected that you want to work in, you’ll need to set up a few key things before getting started:

  • Enable APIs for Google Speech-to-Text. You can do that by following the instructions here and searching for “Cloud Speech-to-Text API”.

  • Create a service account for your App Engine flexible environment to utilize when accessing other Google Cloud services. You’ll need to download the private key as a JSON file as well.

  • Add firewall rules to allow your App Engine flexible environment to accept incoming connections for the websocket. A command like the following should work from a gcloud enabled terminal:

    • gcloud compute firewall-rules create default-allow-websockets-8080 --allow tcp:8080 --target-tags websocket --description "Allow websocket traffic on port 8080"

App Engine flexible environment setup

For the App Engine application, we will take the sample code from Twilio’s repository to create a simple Node.js websocket server. You can find the GitHub page here, with instructions on environment setup. Once the code is in your project folder, you’ll need to do a few more things to deploy your application:

  • Take the service account JSON key you downloaded earlier, rename it to “google_creds.json”, and put it in the same directory as the Node.js code.

  • Create an app.yaml file that looks like the following:

      runtime: nodejs
      env: flex
      manual_scaling:
        instances: 1
      network:
        instance_tag: websocket


Once these two items are in order, you will be able to deploy your application with the command:

gcloud app deploy

Once deployed, you can tail the console logs with the command:

gcloud app logs tail -s default

Verifying your stream is working

Call your Twilio number, and you should immediately be connected with the number specified in your TwiML. You should see a websocket connection request made to the url specified in the <Stream>. Your websocket should immediately start receiving messages. If you are tailing the logs in the console, the application will log the intermediate messages as well as any final utterances detected by Google Cloud’s Speech-to-Text API.
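To make the message flow concrete, here is a minimal sketch of how a websocket handler might process what Twilio sends. It assumes each message is a JSON string with an event field ("connected", "start", "media", or "stop", among others) and that "media" messages carry base64-encoded audio in media.payload. The function name handleTwilioMessage and the sample message are illustrative, not taken from Twilio’s sample code.

```javascript
// Hypothetical handler for one websocket message from Twilio Media Streams.
// onAudio is called with a Buffer of raw audio bytes for each "media" event.
function handleTwilioMessage(rawMessage, onAudio) {
  const msg = JSON.parse(rawMessage);
  switch (msg.event) {
    case "connected":
    case "start":
    case "stop":
      // Lifecycle events carry no audio; report which one arrived.
      return msg.event;
    case "media":
      // The audio payload is base64 text; decode it before forwarding.
      onAudio(Buffer.from(msg.media.payload, "base64"));
      return "media";
    default:
      return "unknown";
  }
}

// Usage with a made-up message; in a real server, onAudio would write
// the decoded bytes to the Speech-to-Text stream.
const sampleMessage = JSON.stringify({
  event: "media",
  media: { payload: Buffer.from([0xff, 0x7f]).toString("base64") },
});
handleTwilioMessage(sampleMessage, (chunk) => console.log("audio bytes:", chunk.length));
```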

Writing transcriptions to BigQuery

In order to analyze the transcripts later, we can create a BigQuery table and modify the sample code from Twilio to write to that table. Instructions for creating a new BigQuery table can be found here. Given the way Google Speech-to-Text creates transcription results, a potential schema for the table might look like the following.


Once a table like this exists, you can modify the Twilio sample code to also stream data to the BigQuery table using sample code found here.
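One way to prepare data for that step is to flatten each Speech-to-Text streaming response into a single row object. The sketch below is only an illustration. The column names (call_sid, transcript, confidence, is_final, created_at) are hypothetical and should be changed to match whatever schema you actually create. The function name toTranscriptRow is also an assumption.

```javascript
// Hypothetical mapper from a Speech-to-Text streaming response to a BigQuery row.
// Assumption: column names are illustrative and must match your own table schema.
function toTranscriptRow(callSid, response, now = new Date()) {
  const result = response.results && response.results[0];
  if (!result || !result.alternatives || result.alternatives.length === 0) {
    return null; // Nothing to store for this response.
  }
  const best = result.alternatives[0];
  return {
    call_sid: callSid,
    transcript: best.transcript,
    confidence: best.confidence,
    is_final: Boolean(result.isFinal),
    created_at: now.toISOString(),
  };
}

// With the @google-cloud/bigquery client (not imported here), a row could be
// streamed with something like:
//   await bigquery.dataset(datasetId).table(tableId).insert([row]);
const exampleRow = toTranscriptRow("CA-example", {
  results: [{ isFinal: true, alternatives: [{ transcript: "hello world", confidence: 0.9 }] }],
});
console.log(exampleRow);
```

Writing only rows where is_final is true keeps interim hypotheses out of the table, at the cost of losing the intermediate history.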


Twilio’s new Stream function allows users to quickly make use of the real-time audio that is moving through their phone systems. Paired with Google Cloud, that data can be transcribed in real time and passed on to numerous other applications. High-quality, real-time transcription can benefit businesses in several ways, from helping contact center agents document and understand phone calls to analyzing data from the transcripts of those calls.

To learn more about Cloud Speech-to-Text, visit our website.
