Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

ML Explorer: talking and listening with Google Cloud using Cloud Speech and Text-to-Speech

Wednesday, June 20, 2018

By Barrett Williams, Cloud Big Data and Machine Learning Editor

It’s far easier to explain the how than the what of Google Cloud (or any cloud provider, really). In honor of “show, don’t tell,” I’d like to share with you what makes Google’s Cloud special. I’ll show you around some tools that my colleagues built to make our machine learning-focused offerings both approachable and tangible.

It’s easy to “talk” to Google Cloud Platform in several ways. You may be familiar with Google Assistant, thanks to our consumer hardware-focused colleagues who built the venerable Google Home ecosystem, or maybe you’ve just tested it out on your Android smartphone. Smart assistants require both speech recognition and speech synthesis to function, so here you’ll get to try out both, independently.

Google Cloud Platform’s speech services are among the simplest and most exciting APIs to start with for speech recognition and synthesis, so I’ll show a couple of simple examples—one on a recorded file, and another on a live stream from your microphone. I’ll also point out where our two speech APIs, Speech-to-Text (from now on referred to simply as the Speech API) and Text-to-Speech, were recently updated with newer and more advanced capabilities.

Getting started:

Let’s begin by creating a Google Cloud Project, just in case you haven’t already. (You’ll only have to do this once, and I won’t cover this again in future ML Explorer posts.) You’ll encounter a new dashboard hub you’ll become quite familiar with, the Google Cloud Console.

You’ll be prompted to log in with a Google account; feel free to use your work login if you can safely consider this a work endeavor and your organization runs on G Suite. If you’re a hobbyist, you can log in with your Gmail account or create a new account (under “More options”).

An example of the Cloud Console showing usage metrics on APIs and Compute Engine

Note that some Google Cloud Platform (GCP) services are free within certain monthly usage limits, but you’ll need to attach a credit card to your account in order to try out some of the more advanced parts of the ML Explorer series. (We’ll explore a couple of advanced beta features of the Speech API that require logging to be enabled, which requires a valid credit card, even if you have remaining free trial credits. Don’t worry, you’ll still be able to use those up first.)

Projects are the fundamental configuration entities for the Compute Engine instances and APIs that you’ll provision, configure, access, and call when you deploy on GCP. You can set up per-project billing accounts, so if you want to set up different payment methods for work versus personal projects, you should consider doing that now.

You’ll also need the Cloud SDK, which is available here for Mac, Windows, and Linux. The SDK allows you to install Python dependencies for the open-source code samples that show you how to use our APIs. It’s also a great way to automate configuration changes you might otherwise make by clicking through the menus in the Cloud Console. If you’re a visual thinker, and like Material Design dashboards, then Console is right for you (it even has its own built-in, web-based terminal called Cloud Shell!), but if you like tmux, iTerm, or a super-slick tiling window manager, then gcloud, the SDK’s command-line tool, is going to become a close friend. The SDK allows you to automate and scale operations in GCP, once you’ve successfully tested them out in the UI.
After you download the binary archive for the SDK, navigate in your Terminal to the path of the downloaded archive and run:
gunzip -c google-cloud-sdk*.tar.gz | tar xopf -

You can manually add the gcloud, bq, and gsutil tools to your path, or run the following:

./google-cloud-sdk/install.sh

(If you’d prefer to install from a package manager within your distribution of choice, that is also an option.)

Be sure to restart your terminal so that your shell will find the relevant executables. Next, run gcloud init to ensure that everything is configured properly.
(If you’re curious about what was installed as part of the SDK, run gcloud components list to see which commands and tools were installed by default. gcloud components update should get you in sync with all of the latest versions of these tools. I’m running SDK version 200, for example.)

I’ll be up front and mention that I’ll be testing my instructions on my macOS developer environment and a Linux virtual machine. Some of the Speech API scenarios call for a microphone and speaker, which may not work inside my VM. However, I’ll describe the steps necessary to test out these APIs natively on my Mac development machine, and those instructions should translate to many Debian-based flavors of Linux, if a mostly open-source stack floats your boat. For the most part, I’ll be sticking with Python, primarily for its readability. If you prefer another language, source code samples are available for most APIs; we call our prioritized languages “silver languages,” and you can find them here.

Let’s hack speech:

You may be familiar with the concept of Hello, World!. The equivalent for Google Cloud’s Speech API is delightfully simple: it takes a reference file from Cloud Storage and returns a JSON response with the transcribed text. Meanwhile, within the field of machine learning and artificial intelligence, speech recognition is a very challenging problem to solve. Although the Google Brain team does shed some light on how it’s built a state-of-the-art neural net-based recognition system, the combination of phoneme recognition, language modeling, and training on a very large (typically non-public) voice dataset is a laborious, intensive, and fickle process. Luckily, GCP’s Speech API lets you write simple queries, without the need to worry about training, serving, or even the classification process. (For other types of machine learning, you will be able to train your model much more easily, as you’ll see in future installments of ML Explorer.)

For a first step, you’ll need to enable the Speech API on the project you’ve created. My project ID is speech-voice-1337 but your project ID will differ. Keep in mind, you are able to select the speech-voice prefix in the name of your project, but you’re usually not able to change the trailing hyphen and (by default, 6) digits that are appended to create a unique identifier for your project. You will need this full project ID when selecting your project from the `gcloud` command-line tool that is part of the Cloud SDK.

Select the project on which you’d like to access the Speech API from the dropdown menu in between the Google Cloud Platform sidebar header and the search box at the top of your screen:

Select your project, then enable the Speech API via the “hamburger” menu at top left > APIs & Services > Library.

Once in the API Library, search for “Speech,” then click on “Google Cloud Speech API.” If prompted to enter/attach billing info, now is the time to do so. Keep in mind, if you’re only running speech queries a few seconds at a time, you are unlikely to incur a large bill. If you’re starting out with a fresh quota on your free trial, you’re unlikely to be billed at all. However, as with any cloud-based paid resource, be sure to shut down instances you aren’t using. APIs are billed on a per-query basis, so if you don’t query the API, you shouldn’t be billed. (But keep your credentials secure, so that others cannot incur charges on your account.)

Next, download your credentials in JSON format. Under APIs & Services, select Credentials.

Be sure to download the credentials for the “Service account key.” Click on the checkbox next to the account you’ll be using, then click the blue button for Create credentials. You’ll want a Service Account key; then select the Compute Engine default service account. Make sure JSON is selected, then click Create. The credentials should download to your browser’s default download location. (For production uses, you may want to create a “user” account key with fewer privileges, but in this case, let’s assume this is a casual test project isolated from live user production data.)

Now, you’re ready to use the Speech-to-Text API. Let’s take a look at some of the more common use cases. (The Speech API can recognize 120 languages and variants, so if you’d like to modify the code samples to transcribe another language, feel free to do so.)

Run a few Speech-to-Text examples:

  1. Transcribe a file from a public bucket on Google Cloud Storage using a JSON request via HTTP POST (curl):

First off, set an environment variable for your downloaded account credentials:

export GOOGLE_APPLICATION_CREDENTIALS="/Users/$(whoami)/Downloads/service-account-file.json"

(On Linux, replace “Users” with “home”; using Tab to auto-complete can help you set this environment variable, if you don’t remember the exact path or filename.)
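
The Python client libraries used later in this post pick up that environment variable automatically. If you’d rather point a client at the key file explicitly (say, because you juggle several projects), here’s a minimal sketch using the google-auth library; the path is just the hypothetical download location from above:

from google.cloud import speech
from google.oauth2 import service_account

# Hypothetical path: substitute wherever you saved the service account key.
creds = service_account.Credentials.from_service_account_file(
    '/Users/your-username/Downloads/service-account-file.json')
client = speech.SpeechClient(credentials=creds)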

Now, save the following API request to a new JSON file, as sync-request.json:

{
  "config": {
      "encoding": "FLAC",
      "sampleRateHertz": 16000,
      "languageCode": "en-US",
      "enableWordTimeOffsets": false,
      "enableAutomaticPunctuation": true
  },
  "audio": {
      "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}

If you’d like to play this file locally, to hear what you’re uploading to the API, try:

gsutil cp gs://cloud-samples-tests/speech/brooklyn.flac .  # copies the file from the cloud to your local working directory
open brooklyn.flac

(On Linux, replace “open” with “vlc” if you have that media player installed.)

The config block provides parameters about the file, as well as the language requested for transcription. You can set the enableWordTimeOffsets flag to true if you need to assess transcription latency, or just want to see how much time has elapsed since the system started processing the first frame of audio.

Then, back in the terminal, run:

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-request.json

Your output from the original API call should provide the transcript of the audio file, as well as a confidence score. If you’ve also set the enableWordTimeOffsets flag, you’ll receive each individual word with a start and end time rounded to the nearest 100ms. Here was my output:

user-machine:speech-to-text user$ curl -s -H "Content-Type: application/json" \
     -H "Authorization: Bearer "$(gcloud auth print-access-token) \
     https://speech.googleapis.com/v1/speech:recognize -d @sync-request-local.json
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "welcome to ml Explorer this is a transcription test what do you think so far",
          "confidence": 0.96804965,
          "words": [
            {
              "startTime": "0s",
              "endTime": "0.500s",
              "word": "welcome"
            },
            {
              "startTime": "0.500s",
              "endTime": "0.800s",
              "word": "to"
            },
            {
              "startTime": "0.800s",
              "endTime": "1.300s",
              "word": "ml"
            },
            {
              "startTime": "1.300s",
              "endTime": "1.900s",
              "word": "Explorer"
            },
            ...
            {
              "startTime": "5.600s",
              "endTime": "5.900s",
              "word": "far"
            }
          ]
        }
      ]
    }
  ]
}
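
If you did enable word time offsets, here’s a rough sketch of how you might post-process a saved response like the one above to compute per-word durations. The filename is hypothetical; it’s just the curl output redirected to disk:

import json

# Hypothetical filename: the JSON response from the curl call, saved to disk.
with open('sync-response.json') as f:
    response = json.load(f)

for result in response['results']:
    for word_info in result['alternatives'][0].get('words', []):
        start = float(word_info['startTime'].rstrip('s'))
        end = float(word_info['endTime'].rstrip('s'))
        print('{:>10}  {:.1f}s -> {:.1f}s ({:.1f}s)'.format(
            word_info['word'], start, end, end - start))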
  2. Next, we’ll use the gcloud command-line tool to transcribe a file stored in Cloud Storage. Most API calls will result in the same response format, but you can execute gcloud and Python requests (as in the next, live transcription section) to achieve the same end goal.

    Send the API a request via `gcloud`:

gcloud auth activate-service-account --key-file=[PATH]
gcloud ml speech recognize 'gs://cloud-samples-tests/speech/brooklyn.flac' \
    --language-code='en-US'

Via the Python samples available here (this snippet only deals with a local file, rather than a file hosted in Cloud Storage):

import io
import os

# Imports the Google Cloud client library
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

# Instantiates a client
client = speech.SpeechClient()

# The name of the audio file to transcribe
file_name = os.path.join(
    os.path.dirname(__file__),
    'resources',
    'audio.raw')

# Loads the audio into memory
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
    audio = types.RecognitionAudio(content=content)

config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US')

# Detects speech in the audio file
response = client.recognize(config, audio)

for result in response.results:
    print('Transcript: {}'.format(result.alternatives[0].transcript))

If you’re eager to record your own audio, FLAC and WAV are your best formats for doing so, preferably at 16,000 samples per second with a single (mono) channel. The API does not require you to specify the format or sample rate for these formats, which is convenient, but it’s safer to do so explicitly in production.
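
For instance, if you record your own 16 kHz mono WAV file, a request that spells everything out explicitly might look like the following sketch; the filename is a placeholder, and the snippet uses the same client library as above:

import io

from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

client = speech.SpeechClient()

# 'my-recording.wav' is a placeholder; substitute your own 16 kHz mono recording.
with io.open('my-recording.wav', 'rb') as audio_file:
    audio = types.RecognitionAudio(content=audio_file.read())

config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,  # uncompressed WAV/PCM
    sample_rate_hertz=16000,                                  # match your recording
    language_code='en-US')

response = client.recognize(config, audio)
for result in response.results:
    print('Transcript: {}'.format(result.alternatives[0].transcript))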

  3. Transcribe long-form audio (not available for local files; you’ll need to upload to Cloud Storage first)

If, like me, you’d prefer to test out transcription on longer form audio (for a Q&A-style interview, perhaps), clone the Python samples for the Speech API on GitHub into your local development environment:

git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
cd python-docs-samples/speech/cloud-client

Or, if you’d prefer to work just from a web browser, you can run the same commands in Cloud Shell, the temporary interactive terminal built into the Cloud Console.

First, in order to install dependencies for the examples, run:

virtualenv env  # isolates this work from any system-wide packages you may have installed
source env/bin/activate
pip install -r requirements.txt

With the dependencies in place, the long-running transcription sample boils down to a function like this:
def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types 
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US')

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through    
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u'Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))
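
Because long-running recognition only accepts Cloud Storage URIs, you’ll first need to copy your recording into a bucket you own. Here’s a short sketch of one way to do that from Python and then call the function above; the bucket and file names are placeholders, and the google-cloud-storage package is a separate pip install:

from google.cloud import storage  # pip install google-cloud-storage

# Placeholder names: substitute your own bucket and recording.
bucket = storage.Client().bucket('my-speech-demo-bucket')
bucket.blob('interview.flac').upload_from_filename('interview.flac')

# For recordings much longer than a minute or two, raise the timeout in transcribe_gcs.
transcribe_gcs('gs://my-speech-demo-bucket/interview.flac')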
  4. And now for another example, we’ll use a Python client library to transcribe live streaming audio. The example I used was transcribe_streaming_mic.py, which depends on the awesome pyaudio package from MIT. If you have Python and pip installed, run:

pip install --user pyaudio

(I suspect this will not work on most virtual machines without active audio interfaces, but it should work on a Linux workstation.)

The example transcribed the Declaration of Independence as I read it aloud, and I was elated to see that pyaudio successfully picked up my voice through my Bluetooth headset. Try out this stream-from-microphone example here.

One interesting note is that the transcription will end if you speak the words “exit” or “quit.” You can set other keywords/hotwords to stop your transcription on line 160.
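
If you just want to see the shape of the streaming call without reading the full sample, here’s a stripped-down sketch (not the sample itself) that pushes raw microphone chunks from pyaudio into streaming_recognize and prints each final transcript as it arrives:

import pyaudio

from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

RATE = 16000   # 16 kHz mono
CHUNK = 1600   # roughly 100 ms of audio per request

client = speech.SpeechClient()
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=RATE,
    language_code='en-US')
streaming_config = types.StreamingRecognitionConfig(config=config)

mic = pyaudio.PyAudio().open(
    format=pyaudio.paInt16, channels=1, rate=RATE,
    input=True, frames_per_buffer=CHUNK)

def request_stream():
    # Keep yielding small audio chunks; Ctrl+C to stop.
    while True:
        yield types.StreamingRecognizeRequest(audio_content=mic.read(CHUNK))

for response in client.streaming_recognize(streaming_config, request_stream()):
    for result in response.results:
        if result.is_final:
            print(result.alternatives[0].transcript)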

A simpler set of options—the Text-to-Speech API:

The first thing to learn about Text-to-Speech is that all queries are sent as text via JSON, and the response is a string that can be decoded into MP3 audio. To play the resulting files, use `afplay` on macOS, or a player such as VLC on Linux. First we’ll explore an example that involves a simple JSON POST to the API, and then we’ll demonstrate a couple of Python examples that send local text snippets (both raw text and a speech markup language called SSML) to the API, which returns the MP3 locally, at which point you can play it back.

For the RESTful API, there are only two methods: GET $apiver/voices, which returns a list of available voice synthesis profiles, and POST $apiver/text:synthesize, which accepts your text request and returns an audio file for playback.
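
The voices list is worth browsing before you pick a profile. As a quick illustration, here’s a minimal sketch using the Python client library to print each voice’s name, supported language codes, and gender:

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
for voice in client.list_voices().voices:
    # Each voice entry carries a name, one or more language codes, and an SSML gender.
    print('{:24} {:12} {}'.format(
        voice.name,
        ','.join(voice.language_codes),
        texttospeech.enums.SsmlVoiceGender(voice.ssml_gender).name))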

Accordingly, here is an example of a curl/HTTP API call for a higher pitched Australian voice, with no text annotation:

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token)
     -H "Content-Type: application/json; charset=utf-8"
     --data "{  'input':{    'text':'When in the course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth'  },
     'voice':{    'languageCode':'en-au',    'name':'en-AU-Standard-C',    'ssmlGender':'FEMALE'  },  'audioConfig':{    'audioEncoding':'MP3'  }}"
   "https://texttospeech.googleapis.com/v1beta1/text:synthesize" > synthesize-output.txt

The output file contains the API’s JSON response, in which the synthesized audio is encoded as a base64 string; it can be decoded with the following:

base64 synthesize-output-base64.txt --decode > synthesized-audio.mp3 && afplay synthesized-audio.mp3

Through a clever pipeline assemblage, you can request the synthesis and play back the output all in one command. (A little long for a line of code, though.)
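
If you’d prefer to do that decoding step in Python, pulling out just the audioContent field of the JSON response rather than piping the whole file through base64, here’s a small sketch; the filenames match the curl example above:

import base64
import json

# Reads the JSON written by the curl call above and decodes its audioContent field to MP3.
with open('synthesize-output-base64.txt') as f:
    audio_bytes = base64.b64decode(json.load(f)['audioContent'])

with open('synthesized-audio.mp3', 'wb') as out:
    out.write(audio_bytes)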

  1. Python API call with a lower-pitched British voice, using SSML annotation:

From the client samples provided on GitHub, you can request a response from the API and write it out as an MP3 file, all in one method:

def synthesize_ssml(ssml):
    from google.cloud import texttospeech
    client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.types.SynthesisInput(ssml=ssml)
    voice = texttospeech.types.VoiceSelectionParams(
        language_code='en-GB',
        ssml_gender=texttospeech.enums.SsmlVoiceGender.MALE)
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.MP3)
    response = client.synthesize_speech(input_text, voice, audio_config)
    with open('output.mp3', 'wb') as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

Try the following SSML (Speech Synthesis Markup Language) snippet as input:

<speak><prosody rate="slow" pitch="-2st">Can you hear me now?</prosody><emphasis level="moderate">Good, thank you for listening!</emphasis>
  Step <say-as interpret-as="cardinal">1</say-as> take a deep breath. <break time="200ms"/>
  The <say-as interpret-as="ordinal">2</say-as> step is to exhale.
  Step 3, take a deep breath again. <break strength="weak"/>
  Step 4, exhale. Spell it out with me: <say-as interpret-as="characters">exhale</say-as>
</speak>

In the Python examples, you can save the above block as sample.ssml and run:

python synthesize_file.py --ssml sample.ssml

SSML permits you to encode integer values in your textual input so they come back as ordinals or plurality-matched nouns, depending on whether your result is one or many, and to have dates and times spoken properly, which can be particularly useful.
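
As a small, hedged illustration (the values are arbitrary, and the exact attributes supported are listed in the SSML reference), you could feed markup like this to the synthesize_ssml function above and hear the count and date read back naturally:

# Arbitrary example values, just to exercise the say-as hints.
synthesize_ssml("""<speak>
  You have <say-as interpret-as="cardinal">3</say-as> new messages,
  the most recent from <say-as interpret-as="date" format="yyyymmdd" detail="1">20180620</say-as>.
</speak>""")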

  2. Comparison of “standard” and WaveNet voices:

Here we’ll play back a more conventional synthesis model followed by the more advanced WaveNet model, which generates the audio waveform sample by sample (typically 24,000 samples per second) to produce a noticeably more human-like voice:

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

text_old = 'This is how I used to sound, but'
input_text_old = texttospeech.types.SynthesisInput(text=text_old)
text_new = 'now I sound like this!'
input_text_new = texttospeech.types.SynthesisInput(text=text_new)

voice = texttospeech.types.VoiceSelectionParams(
    language_code='en-US',
    ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE,
    name='en-US-Standard-C')

audio_config = texttospeech.types.AudioConfig(
    audio_encoding=texttospeech.enums.AudioEncoding.MP3)

response_old = client.synthesize_speech(input_text_old, voice, audio_config)  # API call 1 (standard)

voice = texttospeech.types.VoiceSelectionParams(
    language_code='en-US',
    ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE,
    name='en-US-Wavenet-C')

response_new = client.synthesize_speech(input_text_new, voice, audio_config)  # API call 2 (advanced)

# The response's audio_content is binary, and needs to be written to disk for playback
with open('output_old.mp3', 'wb') as out:
    out.write(response_old.audio_content)
with open('output_new.mp3', 'wb') as out:
    out.write(response_new.audio_content)

import subprocess
sound_program = "/usr/bin/afplay"  # for Mac. Choose /usr/bin/vlc for Linux
sound_file = "output_old.mp3"
subprocess.call([sound_program, sound_file])  # may need to wait to avoid overlap if you've run this before
sound_file = "output_new.mp3"
subprocess.call([sound_program, sound_file])

For the moment, WaveNet voices are only available in US English, but work is underway to expand the range of languages and accents available.

There you have it: a fly-by tour of Google Cloud’s Speech APIs and the different ways you can use them to transcribe speech and synthesize voices. Stay tuned for upcoming installments of the ML Explorer series, where I’ll show you how to try out the Assistant API and other chat-oriented APIs that Google Cloud offers.

What to try next:

  • Build a simple app for Android that calls the Speech API
  • Explore Natural Language Processing to use in conjunction with the Speech API (will be explained in greater detail in a future post)
  • Follow me on Twitter to hear about future posts and updates
