Chirp: Universal speech model

Chirp is the next generation of Speech-to-Text models on Google Distributed Cloud (GDC) air-gapped. As a version of a Universal Speech Model, Chirp has over 2B parameters and can transcribe many languages with a single model.

By enabling the Chirp component, you can transcribe audio in additional languages that Speech-to-Text doesn't otherwise support.

Chirp achieves state-of-the-art Word Error Rate (WER) on a variety of public test sets and languages, offering multi-language support on Distributed Cloud. It uses a universal encoder with a different architecture than current speech models, trained on data in many different languages. The model is then fine-tuned to offer transcription for specific languages. Although a single model unifies data from multiple languages, you still specify the language in which the model should recognize speech.

Chirp processes speech in much larger chunks than other models do, so results become available only after an entire utterance has finished. As a result, Chirp might not be suitable for true real-time use.

Chirp is available in the Speech-to-Text pre-trained API. The model identifier for Chirp is chirp. Therefore, in the Distributed Cloud implementation of Speech-to-Text, you can set the value chirp in the model field of the RecognitionConfig message in your request.
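
For example, the following is a minimal sketch of selecting Chirp in Python, assuming the same client library that the examples later on this page use:

# Select the Chirp model by setting the model field of RecognitionConfig.
from google.cloud import speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",  # Chirp still requires an explicit language code.
    model="chirp",
)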

Available API methods

Chirp supports both Speech.Recognize and Speech.StreamingRecognize API methods.

The difference between the two methods is that StreamingRecognize returns results only after each utterance completes. For this reason, its latency after speech starts is on the order of seconds rather than milliseconds, compared to the Recognize method. However, StreamingRecognize returns results with very low latency after an utterance finishes, for example, in a sentence followed by a pause.
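
As an illustration, the following is a minimal sketch of calling Speech.StreamingRecognize from Python, assuming the same client library as the Recognize examples below; audio.raw is a placeholder for a local LINEAR16 audio file:

# Stream audio to Speech.StreamingRecognize with the Chirp model.
from google.cloud import speech

client = speech.SpeechClient()

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        audio_channel_count=1,
        language_code="LANGUAGE_CODE",
        model="chirp",
    )
)

def audio_chunks(path, chunk_size=4096):
    # Read a local LINEAR16 audio file in small chunks.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

requests = (
    speech.StreamingRecognizeRequest(audio_content=chunk)
    for chunk in audio_chunks("audio.raw")
)
responses = client.streaming_recognize(config=streaming_config, requests=requests)
for response in responses:
    # With Chirp, results arrive only after each utterance completes.
    for result in response.results:
        print(result.alternatives[0].transcript)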

Before you begin

Before using Chirp on Distributed Cloud, follow these steps:

  1. Ask your Project IAM Admin to grant you the AI Speech Developer (ai-speech-developer) role in your project namespace.

  2. Enable the pre-trained APIs before using the client library.

How to use Chirp

Work through the following steps to use Chirp as a model with the Speech-to-Text client library. You can use grpcurl commands or Python:

curl

  1. Open a notebook as your coding environment. If you don't have an existing notebook, create a notebook.
  2. Write your code using the grpcurl tool and the API reference documentation to process data.
  3. Send a request using the grpcurl tool to generate a Speech-to-Text transcription:
# Send the request with grpcurl (where ${GOPATH} is your Go workspace path)
"${GOPATH}/bin/grpcurl" -plaintext -d @ localhost:10000 \
  google.cloud.speech.v1.Speech.Recognize < recognize_request.json

# Sample request in recognize_request.json
# Audio content field should be populated with audio encoded in base64
{
  "config": {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 16000,
    "audio_channel_count": 1,
    "language_code": "LANGUAGE_CODE",
    "model": "chirp"
  },
  "audio": {
    "content": "BASE64_ENCODED_AUDIO_FILE"
  }
}

Replace LANGUAGE_CODE with a supported language code.
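
As a sketch of how you might populate the audio content field, the following Python snippet base64-encodes a local audio file and writes recognize_request.json; the file name audio.raw is a placeholder:

# Build recognize_request.json with base64-encoded audio content.
import base64
import json

with open("audio.raw", "rb") as f:
    encoded_audio = base64.b64encode(f.read()).decode("utf-8")

request = {
    "config": {
        "encoding": "LINEAR16",
        "sample_rate_hertz": 16000,
        "audio_channel_count": 1,
        "language_code": "LANGUAGE_CODE",  # replace with a supported language code
        "model": "chirp",
    },
    "audio": {"content": encoded_audio},
}

with open("recognize_request.json", "w") as f:
    json.dump(request, f)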

Python

  1. Open a notebook as your coding environment. If you don't have an existing notebook, create a notebook.
  2. Use Python to install the Speech-to-Text client library from a tar file and get a transcription.
  3. Import the Speech-to-Text client library and transcribe an audio file to generate a Speech-to-Text transcription.
# Import the Speech-to-Text client library.
from google.cloud import speech

# Instantiate a client.
client = speech.SpeechClient()

# Specify the audio file to transcribe.
audio_uri = "YOUR_AUDIO_TO_TRANSCRIBE"

audio = speech.RecognitionAudio(uri=audio_uri)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    audio_channel_count=1,
    language_code="LANGUAGE_CODE",
    model="chirp"
)

# Detect speech in the audio file.
response = client.recognize(config=config, audio=audio)

for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))

Replace LANGUAGE_CODE with a supported language code.
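
If your audio is a local file rather than an object at a storage URI, you can pass the raw bytes instead. A minimal sketch, assuming a placeholder file path:

# Alternative: supply local audio bytes instead of a storage URI.
with open("path/to/audio.raw", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())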

Sample of the Speech-to-Text client library

To transcribe an audio file using the Chirp model on the Speech-to-Text API, first view the statuses and endpoints of the pre-trained models to identify your endpoint. Then, follow the sample code:

# api_endpoint = '0.0.0.0:10000'
# local_file_path = '../resources/two_channel_16k.raw'
from google.cloud import speech_v1
import grpc
import io

def transcribe(local_file_path, api_endpoint):
  # Connect to the Speech-to-Text pre-trained API over an insecure gRPC channel.
  transport = speech_v1.services.speech.transports.SpeechGrpcTransport(
      channel=grpc.insecure_channel(target=api_endpoint)
  )
  client = speech_v1.SpeechClient(transport=transport)
  config = {
      "encoding": speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
      "language_code": "LANGUAGE_CODE",
      "sample_rate_hertz": 16000,
      "audio_channel_count": 1,
      "model": "chirp",
  }
  # Read the local audio file and send its bytes in the request.
  with io.open(local_file_path, "rb") as f:
    content = f.read()
    audio = {"content": content}
    response = client.recognize(request={"config": config, "audio": audio})
  # Print each transcription result.
  for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))
Replace LANGUAGE_CODE with a supported language code.

Supported languages

The following languages are supported by Chirp:

Language                 Language code
English (United States)  en-US
Indonesian (Indonesia)   id-ID
Malay (Malaysia)         ms-MY