Transcribe audio

Speech-to-Text enables easy integration of Google speech recognition technologies into your solution. Speech-to-Text is a machine learning (ML) technology that gives you full control over your infrastructure and your protected speech data to meet data residency and compliance requirements.

The following table describes the key capabilities of Speech-to-Text.

Key capabilities
Capability | Description
Transcription | Applies advanced deep learning neural network algorithms from Google to automatic speech recognition.
Models | Deploys models that are less than 1 GB in size and consume minimal resources.
API compatible | Uses an API that is fully compatible with the Speech-to-Text API and its client libraries.

Supported audio encodings for Speech-to-Text

The Speech-to-Text API supports several audio encodings. The following table lists the supported audio codecs:

Codec | Name | Lossless | Usage notes
FLAC | Free Lossless Audio Codec | Yes | 16-bit or 24-bit required for streams
LINEAR16 | Linear PCM | Yes | 16-bit linear pulse-code modulation (PCM) encoding. The header must contain the sample rate.
MULAW | μ-law | No | 8-bit PCM encoding
OGG_OPUS | Opus encoded audio frames in an Ogg container | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz
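
The rules in the codec table can be checked locally before a request is sent. The following is a minimal sketch under stated assumptions: `LOSSLESS`, `OPUS_SAMPLE_RATES`, and `check_audio_settings` are hypothetical names that are not part of the Speech-to-Text client library, and the rules are transcribed from the table only.

```python
# Hypothetical helper: the values below are copied from the codec table,
# not read from the Speech-to-Text API.
LOSSLESS = {"FLAC": True, "LINEAR16": True, "MULAW": False, "OGG_OPUS": False}
OPUS_SAMPLE_RATES = {8000, 12000, 16000, 24000, 48000}


def check_audio_settings(encoding, sample_rate_hertz):
    """Return a list of problems; an empty list means no table rule was violated."""
    problems = []
    if encoding not in LOSSLESS:
        problems.append(f"encoding not listed in the table: {encoding}")
    elif encoding == "OGG_OPUS" and sample_rate_hertz not in OPUS_SAMPLE_RATES:
        problems.append(f"OGG_OPUS does not allow {sample_rate_hertz} Hz")
    return problems
```

A check such as `check_audio_settings("OGG_OPUS", 44100)` reports the sample rate problem without contacting the service.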

FLAC is both an audio codec and an audio file format. To transcribe audio files using FLAC encoding, you must provide them in the .FLAC file format, which includes a header containing metadata.

Speech-to-Text supports WAV files with LINEAR16 or MULAW encoded audio.
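
Because a LINEAR16 file carries its sample rate in the header, you can read the header to find the values to pass in the recognition configuration. The following sketch uses the Python standard library `wave` module, which reads PCM WAV files, so it is not suitable for MULAW files. The helper name `describe_wav` and the returned dictionary keys are assumptions chosen for illustration.

```python
import wave


def describe_wav(path):
    """Read a PCM WAV header and return the values relevant to LINEAR16."""
    with wave.open(path, "rb") as wav:
        bits_per_sample = wav.getsampwidth() * 8  # sample width is in bytes
        info = {
            "sample_rate_hertz": wav.getframerate(),
            "audio_channel_count": wav.getnchannels(),
            "bits_per_sample": bits_per_sample,
        }
    # LINEAR16 means 16-bit linear PCM, so flag anything else.
    info["is_linear16"] = bits_per_sample == 16
    return info
```

The `sample_rate_hertz` and `audio_channel_count` values returned here correspond to the configuration fields of the same names.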

For more information on Speech-to-Text audio codecs, consult the AudioEncoding reference documentation.

If you have a choice when encoding the source material, use a lossless encoding such as FLAC or LINEAR16 for better speech recognition.

Before you begin

To get the permissions you need to use the Vertex AI Speech-to-Text pre-trained API, ask your Project IAM Admin to grant you the AI Speech Developer (ai-speech-developer) role in your project namespace.

How to use the Speech-to-Text client library

Follow these steps to use the Speech-to-Text client library with Python:

  1. Open a notebook as your coding environment. If you don't have an existing notebook, create a notebook.
  2. Write Python code that installs the Speech-to-Text library from a tar file and requests a transcription. The code sample that follows these steps shows how to import the Speech-to-Text client library and transcribe an audio file.
  3. Run your code to generate a Speech-to-Text transcription.

# Import the Speech-to-Text client library.
from google.cloud import speech

# Instantiate a client.
client = speech.SpeechClient()

# Specify the audio file to transcribe.
audio_uri = "YOUR_AUDIO_TO_TRANSCRIBE"

audio = speech.RecognitionAudio(uri=audio_uri)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    audio_channel_count=1,
    language_code="LANGUAGE_CODE",
)

metadata = [("x-goog-user-project", "projects/PROJECT_ID")]

# Detect speech in the audio file.
response = client.recognize(config=config, audio=audio, metadata=metadata)

for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))

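Replace YOUR_AUDIO_TO_TRANSCRIBE with the URI of the audio file to transcribe, LANGUAGE_CODE with a supported language code, and PROJECT_ID with your project ID.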
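
The sample prints each result on its own line. When a single string is more convenient, the top alternative of every result can be joined. The following sketch relies only on the `results`, `alternatives`, and `transcript` fields that the sample already reads. The helper name `full_transcript` is hypothetical, and the `SimpleNamespace` objects are stand-ins for a real response so that the example runs without calling the service.

```python
from types import SimpleNamespace


def full_transcript(response):
    """Join the first alternative of each result into one string."""
    parts = []
    for result in response.results:
        if result.alternatives:  # skip results that carry no alternatives
            parts.append(result.alternatives[0].transcript.strip())
    return " ".join(parts)


# Stand-in for a real response, shaped like the fields the sample reads.
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello world")]),
    SimpleNamespace(alternatives=[SimpleNamespace(transcript=" how are you")]),
])
print(full_transcript(fake_response))
```

With a real response, pass the object returned by `client.recognize` instead of `fake_response`.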

Sample of the Speech-to-Text client library

To transcribe an audio file using the Speech-to-Text API, first view the statuses and endpoints of the pre-trained models to identify your endpoint. Then use the following sample code:

from google.api_core.client_options import ClientOptions
from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1.services.speech import client


def transcribe(local_file_path, api_endpoint, creds):
    # Point the client at the endpoint of your pre-trained model.
    opts = ClientOptions(api_endpoint=api_endpoint)

    # creds is assumed to be a credentials object that you create for your
    # environment. This sample does not show how to build it.
    tc = client.SpeechClient(credentials=creds, client_options=opts)

    config = {
        "encoding": speech_v1p1beta1.RecognitionConfig.AudioEncoding.LINEAR16,
        "language_code": "LANGUAGE_CODE",
        "sample_rate_hertz": 16000,
        "audio_channel_count": 1,
    }

    metadata = (("x-goog-user-project", "projects/PROJECT_ID"),)

    # Read the local audio file and send its bytes inline with the request.
    with open(local_file_path, "rb") as f:
        content = f.read()

    audio = {"content": content}
    response = tc.recognize(request={"config": config, "audio": audio}, metadata=metadata)

    # Return the response so that the caller can read response.results.
    return response

Replace LANGUAGE_CODE with a supported language code and PROJECT_ID with your project ID.

Supported languages

The following languages are supported by Speech-to-Text:

Language | Language code
Arabic (Egypt) | ar-EG
Arabic (Levantine) | ar-x-levant
Arabic (Maghrebi) | ar-x-maghrebi
Arabic (Peninsular Gulf) | ar-x-gulf
Chinese, Mandarin (Simplified, China) | cmn-hans-cn
English (United States) | en-US
French (France) | fr-FR
German (Germany) | de-DE
Korean (South Korea) | ko-KR
Portuguese (Brazil) | pt-BR
Russian (Russia) | ru-RU
Spanish (United States) | es-US
Ukrainian (Ukraine) | uk-UA
Urdu (Pakistan) | ur-PK
Persian (Iran) | fa-IR
Swahili | sw
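
A language code can be checked against this list before it is placed in the recognition configuration. The following sketch is illustrative only: `SUPPORTED_LANGUAGE_CODES` and `normalize_language_code` are hypothetical names, the codes are copied from the table above, and matching without regard to letter case is an assumption of this helper, not documented behavior of the API.

```python
# Codes copied from the supported languages table.
SUPPORTED_LANGUAGE_CODES = {
    "ar-EG", "ar-x-levant", "ar-x-maghrebi", "ar-x-gulf", "cmn-hans-cn",
    "en-US", "fr-FR", "de-DE", "ko-KR", "pt-BR", "ru-RU", "es-US",
    "uk-UA", "ur-PK", "fa-IR", "sw",
}

# Map a lowercase form of each code to the spelling used in the table.
_LOOKUP = {code.lower(): code for code in SUPPORTED_LANGUAGE_CODES}


def normalize_language_code(code):
    """Return the table's spelling of code, or raise ValueError if it is not listed."""
    try:
        return _LOOKUP[code.strip().lower()]
    except KeyError:
        raise ValueError(f"language code not in the supported list: {code}") from None
```

Calling `normalize_language_code("en-us")` returns `"en-US"`, which can then be used as the `language_code` value.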