Chirp: Universal speech model

Chirp is the next generation of Speech-to-Text models on Google Distributed Cloud (GDC) air-gapped. Representing a version of a Universal Speech Model, Chirp has over 2B parameters and can transcribe many languages in a single model.

By enabling the Chirp component, you can transcribe audio in supported languages that Speech-to-Text doesn't otherwise cover.

Chirp achieves state-of-the-art Word Error Rate (WER) on a variety of public test sets and languages, and offers multi-language support on Distributed Cloud. It was trained with a different architecture than current speech models, using a universal encoder and data in many different languages. The model is then fine-tuned to offer transcription for specific languages. A single model unifies data from multiple languages; however, users still specify the language in which the model should recognize speech.

Chirp processes speech in much larger chunks than other models do. Results are only available after an entire utterance has finished. This means it might not be suitable for true, real-time use.

Chirp is available in the Speech-to-Text pre-trained API. The model identifier for Chirp is chirp. Therefore, in the Distributed Cloud implementation of Speech-to-Text, you can set the value chirp in the model field of the RecognitionConfig message in your request.
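For example, a minimal RecognitionConfig selecting Chirp could look like the following sketch; the encoding, sample rate, and language code shown are placeholder assumptions for a 16 kHz LINEAR16 mono file:

from google.cloud import speech_v1p1beta1

# Select the Chirp model in the recognition configuration.
config = speech_v1p1beta1.RecognitionConfig(
    encoding=speech_v1p1beta1.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="chirp",
)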

Available API methods

Chirp supports both Speech.Recognize and Speech.StreamingRecognize API methods.

The difference between the two methods is that StreamingRecognize only returns results after each complete utterance. For this reason, it has a latency on the order of seconds, rather than milliseconds, after speech starts, compared to the Recognize method. However, StreamingRecognize has very low latency after an utterance finishes, for example, when a sentence is followed by a pause.
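For illustration, the following is a minimal streaming sketch, not a definitive implementation. It assumes a tc SpeechClient and a config RecognitionConfig built as in the examples later on this page, plus a hypothetical audio_chunks iterable of raw audio bytes; the first request carries only the streaming configuration, and later requests carry audio content:

from google.cloud import speech_v1p1beta1

def streaming_transcribe(tc, config, audio_chunks):
    # Wrap the recognition config (with model="chirp") for streaming use.
    streaming_config = speech_v1p1beta1.StreamingRecognitionConfig(config=config)

    def requests():
        # The first request carries only the streaming configuration.
        yield speech_v1p1beta1.StreamingRecognizeRequest(
            streaming_config=streaming_config
        )
        # Subsequent requests carry the raw audio bytes.
        for chunk in audio_chunks:
            yield speech_v1p1beta1.StreamingRecognizeRequest(audio_content=chunk)

    # With Chirp, results arrive only after each complete utterance.
    for response in tc.streaming_recognize(requests=requests()):
        for result in response.results:
            print(result.alternatives[0].transcript)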

Before you begin

Before using Chirp on Distributed Cloud, follow these steps:

  1. Ask your Project IAM Admin to grant you the AI Speech Developer (ai-speech-developer) role in your project namespace.

  2. Enable the pre-trained APIs before using the client library.

Authenticate the request

You must get a token to authenticate the requests to the Speech-to-Text pre-trained service. Follow these steps:

gdcloud CLI

Export the identity token for the specified account to an environment variable:

export TOKEN="$($HOME/gdcloud auth print-identity-token --audiences=https://ENDPOINT)"

Replace ENDPOINT with the Speech-to-Text endpoint. For more information, see View service statuses and endpoints.

Python

  1. Install the google-auth client library.

    pip install google-auth
    
  2. Save the following code to a Python script, and update the ENDPOINT to the Speech-to-Text endpoint. For more information, see View service statuses and endpoints.

    import google.auth
    from google.auth.transport import requests
    
    # The audience is the Speech-to-Text endpoint.
    audience = "https://ENDPOINT:443"
    
    # Build default credentials and scope them to the Distributed Cloud audience.
    creds, project_id = google.auth.default()
    creds = creds.with_gdch_audience(audience)
    
    def test_get_token():
      # Refresh the credentials and print the resulting identity token.
      req = requests.Request()
      creds.refresh(req)
      print(creds.token)
    
    if __name__ == "__main__":
      test_get_token()
    
  3. Run the script to fetch the token.

How to use Chirp

Work through the following steps to use Chirp as a model with the Speech-to-Text client library. These steps use Python:

Python

  1. Open a notebook as your coding environment. If you don't have an existing notebook, create a notebook.
  2. Use Python code in your notebook to install the Speech-to-Text client library from a tar file.
  3. Import the Speech-to-Text client library and transcribe an audio file to generate a transcription.

    import base64
    
    # Import the client library.
    import google.auth
    from google.cloud import speech_v1p1beta1
    from google.cloud.speech_v1p1beta1.services.speech import client
    from google.api_core.client_options import ClientOptions
    
    api_endpoint = "ENDPOINT:443"
    
    # Build credentials scoped to the Distributed Cloud endpoint, as shown in
    # the authentication section.
    creds, project_id = google.auth.default()
    creds = creds.with_gdch_audience("https://" + api_endpoint)
    
    def get_client(creds):
      opts = ClientOptions(api_endpoint=api_endpoint)
      return client.SpeechClient(credentials=creds, client_options=opts)
    
    # Specify the audio to transcribe.
    tc = get_client(creds)
    content = "BASE64_ENCODED_AUDIO"
    
    audio = speech_v1p1beta1.RecognitionAudio()
    audio.content = base64.standard_b64decode(content)
    
    # Configure the request to use the Chirp model.
    config = speech_v1p1beta1.RecognitionConfig(
        encoding=speech_v1p1beta1.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        audio_channel_count=1,
        language_code="LANGUAGE_CODE",
        model="chirp"
    )
    
    # Detect speech in the audio file.
    metadata = (("x-goog-user-project", "projects/PROJECT_ID"),)
    response = tc.recognize(config=config, audio=audio, metadata=metadata)
    
    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))

Replace the following:

  • ENDPOINT: the Speech-to-Text endpoint. For more information, see View service statuses and endpoints.
  • BASE64_ENCODED_AUDIO: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv. For one way to produce this string, see the sketch after this list.
  • LANGUAGE_CODE: a supported language code.
  • PROJECT_ID: your project ID.
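As an illustration, the following sketch shows one way to produce the Base64 string from a local audio file; the sample.wav path is a placeholder assumption:

import base64

# Read a local audio file and print its Base64-encoded contents.
# "sample.wav" is a placeholder path; use your own LINEAR16 audio file.
with open("sample.wav", "rb") as f:
    print(base64.standard_b64encode(f.read()).decode("utf-8"))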

Sample of the Speech-to-Text client library

To transcribe an audio file using the Chirp model on the Speech-to-Text API, first view the statuses and endpoints of the pre-trained models to identify your endpoint. Then, use the following sample code:

from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1.services.speech import client
from google.api_core.client_options import ClientOptions
import io

def transcribe(local_file_path, api_endpoint):
  # Point the client at the Distributed Cloud endpoint. You can also pass
  # credentials=creds, as shown in the authentication section.
  opts = ClientOptions(api_endpoint=api_endpoint)
  tc = client.SpeechClient(client_options=opts)

  # Configure the request to use the Chirp model.
  config = {
    "encoding": speech_v1p1beta1.RecognitionConfig.AudioEncoding.LINEAR16,
    "language_code": "LANGUAGE_CODE",
    "sample_rate_hertz": 16000,
    "audio_channel_count": 1,
    "model": "chirp",
  }

  metadata = (("x-goog-user-project", "projects/PROJECT_ID"),)

  # Read the local audio file and send it for recognition.
  with io.open(local_file_path, "rb") as f:
    content = f.read()
    audio = {"content": content}
    response = tc.recognize(config=config, audio=audio, metadata=metadata)

  return response

Replace LANGUAGE_CODE with a supported language code, and PROJECT_ID with your project ID.
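For example, a possible call site, assuming a local 16 kHz LINEAR16 file at the placeholder path audio/sample.wav and your Speech-to-Text endpoint:

# Hypothetical usage; replace the path and endpoint with your own values.
response = transcribe("audio/sample.wav", "ENDPOINT:443")
for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))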

Supported languages

The following languages are supported by Chirp:

Language                   Language code
English (United States)    en-US
Indonesian (Indonesia)     id-ID
Malay (Malaysia)           ms-MY