Speech-to-Text enables easy integration of Google speech recognition technologies into your solution. Speech-to-Text is a machine learning (ML) technology that gives you full control over your infrastructure and your protected speech data to meet data residency and compliance requirements.
The following table describes the key capabilities of Speech-to-Text.
Key capabilities | |
---|---|
Transcription | Applies advanced deep learning neural network algorithms from Google to automatic speech recognition. |
Models | Deploys models that are less than 1 GB in size and consume minimal resources. |
API compatible | Uses an API that is fully compatible with the Speech-to-Text API and its client libraries. |
Supported audio encodings for Speech-to-Text
The Speech-to-Text API supports several audio encodings. The following table lists the supported audio codecs:
Codec | Name | Lossless | Usage notes |
---|---|---|---|
FLAC | Free Lossless Audio Codec | Yes | 16-bit or 24-bit required for streams |
LINEAR16 | Linear PCM | Yes | 16-bit linear pulse-code modulation (PCM) encoding. The header must contain the sample rate. |
MULAW | μ-law | No | 8-bit PCM encoding |
OGG_OPUS | Opus encoded audio frames in an Ogg container | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz |
FLAC is both an audio codec and an audio file format. To transcribe audio files using FLAC encoding, you must provide them in the .FLAC file format, which includes a header containing metadata.
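As a quick sanity check before sending a file, you can confirm that it starts with a FLAC header. The following is a minimal sketch, not part of the Speech-to-Text API: `looks_like_flac` is a hypothetical helper name, and the check assumes only that a native FLAC stream begins with the four-byte marker `fLaC`.

```python
def looks_like_flac(path):
    """Return True if the file begins with the FLAC stream marker."""
    with open(path, "rb") as f:
        # A native FLAC stream starts with the ASCII bytes "fLaC".
        return f.read(4) == b"fLaC"
```

A file that has other data before the marker, or that holds FLAC audio inside a different container, returns False with this check.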
Speech-to-Text supports WAV files with LINEAR16 or MULAW encoded audio.
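Because the recognition request needs the sample rate and channel count of the audio, it can help to read them from the WAV header rather than guess. The following sketch uses Python's standard `wave` module; `describe_wav` is a hypothetical helper name, and the dictionary keys are chosen here to mirror the `sample_rate_hertz` and `audio_channel_count` configuration fields.

```python
import wave


def describe_wav(path):
    """Read channel count, bit depth, and sample rate from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "audio_channel_count": wav.getnchannels(),
            # getsampwidth() returns bytes per sample, so 2 means 16-bit.
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate_hertz": wav.getframerate(),
        }
```

The `wave` module reads uncompressed PCM only, so this check suits LINEAR16 files; a WAV file that contains MULAW audio raises `wave.Error` instead of returning a description.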
For more information on Speech-to-Text audio codecs, consult the AudioEncoding reference documentation.
If you have a choice when encoding the source material, use a lossless encoding such as FLAC or LINEAR16 for better speech recognition.
Before you begin
To get the permissions you need to use the Vertex AI Speech-to-Text pre-trained API, ask your Project IAM Admin to grant you the AI Speech Developer (ai-speech-developer) role in your project namespace.
How to use the Speech-to-Text client library
Follow these steps to use the Speech-to-Text client library with Python:
- Open a notebook as your coding environment. If you don't have an existing notebook, create a notebook.
- Write your code using Python to install the Speech-to-Text library from a tar file and get a transcription. The following code sample shows how to import the Speech-to-Text client library and transcribe an audio file.
- Run your code to generate a Speech-to-Text transcription.
# Import the Speech-to-Text client library.
from google.cloud import speech
# Instantiate a client.
client = speech.SpeechClient()
# Specify the audio file to transcribe.
audio_uri = "YOUR_AUDIO_TO_TRANSCRIBE"
audio = speech.RecognitionAudio(uri=audio_uri)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    audio_channel_count=1,
    language_code="LANGUAGE_CODE",
)
metadata = [("x-goog-user-project", "projects/PROJECT_ID")]
# Detect speech in the audio file.
response = client.recognize(config=config, audio=audio, metadata=metadata)
for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))
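Replace YOUR_AUDIO_TO_TRANSCRIBE with the URI of the audio file to transcribe, LANGUAGE_CODE with a supported language code, and PROJECT_ID with your project ID.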
Sample of the Speech-to-Text client library
To transcribe an audio file using the Speech-to-Text API, first view the statuses and endpoints of the pre-trained models to identify your endpoint. Then, follow the sample code:
from google.api_core.client_options import ClientOptions
from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1.services.speech import client
def transcribe(local_file_path, api_endpoint, creds):
    # creds: the credentials object for the endpoint, supplied by the caller.
    opts = ClientOptions(api_endpoint=api_endpoint)
    tc = client.SpeechClient(credentials=creds, client_options=opts)
    # Describe the audio: 16-bit linear PCM, 16 kHz, single channel.
    config = {
        "encoding": speech_v1p1beta1.RecognitionConfig.AudioEncoding.LINEAR16,
        "language_code": "LANGUAGE_CODE",
        "sample_rate_hertz": 16000,
        "audio_channel_count": 1,
    }
    metadata = (("x-goog-user-project", "projects/PROJECT_ID"),)
    # Read the local file and send its bytes inline with the request.
    with open(local_file_path, "rb") as f:
        content = f.read()
    audio = {"content": content}
    response = tc.recognize(request={"config": config, "audio": audio}, metadata=metadata)
    # Return the response so that the caller can read the transcripts.
    return response
Replace LANGUAGE_CODE with a supported language code and PROJECT_ID with your project ID.
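The preceding sample does not print anything. As in the first sample, the recognized text is in `response.results`, where each result lists its alternatives with the top-ranked one first. The following sketch gathers those transcripts into a list; `collect_transcripts` is a hypothetical helper name, and it assumes only that the response exposes `results`, `alternatives`, and `transcript` as used in the first sample.

```python
def collect_transcripts(response):
    """Return the top-ranked transcript of each result, in order."""
    transcripts = []
    for result in response.results:
        # Skip any result that carries no alternatives.
        if result.alternatives:
            transcripts.append(result.alternatives[0].transcript)
    return transcripts
```

Joining the returned list with spaces gives one string for the whole audio file.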
Supported languages
The following languages are supported by Speech-to-Text:
Language | Language code |
---|---|
Arabic (Egypt) | ar-EG |
Arabic (Levantine) | ar-x-levant |
Arabic (Maghrebi) | ar-x-maghrebi |
Arabic (Peninsular Gulf) | ar-x-gulf |
Chinese, Mandarin (Simplified, China) | cmn-hans-cn |
English (United States) | en-US |
French (France) | fr-FR |
German (Germany) | de-DE |
Korean (South Korea) | ko-KR |
Portuguese (Brazil) | pt-BR |
Russian (Russia) | ru-RU |
Spanish (United States) | es-US |
Ukrainian (Ukraine) | uk-UA |
Urdu (Pakistan) | ur-PK |
Persian (Iran) | fa-IR |
Swahili | sw |
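To catch a mistyped language code before making a request, you can compare it with the codes in the preceding table. The following is a minimal sketch, not part of the client library: `SUPPORTED_LANGUAGE_CODES` and `canonical_language_code` are hypothetical names. The comparison ignores letter case, and because this page does not say whether the service accepts other casing, the helper returns the spelling used in the table.

```python
# Language codes copied from the supported languages table.
SUPPORTED_LANGUAGE_CODES = [
    "ar-EG", "ar-x-levant", "ar-x-maghrebi", "ar-x-gulf", "cmn-hans-cn",
    "en-US", "fr-FR", "de-DE", "ko-KR", "pt-BR", "ru-RU", "es-US",
    "uk-UA", "ur-PK", "fa-IR", "sw",
]


def canonical_language_code(code):
    """Return the table's spelling of a supported code, or raise ValueError."""
    lookup = {c.lower(): c for c in SUPPORTED_LANGUAGE_CODES}
    key = code.strip().lower()
    if key not in lookup:
        raise ValueError(f"Unsupported language code: {code!r}")
    return lookup[key]
```

For example, pass the return value of `canonical_language_code("en-us")` as `language_code` in the recognition configuration.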