Transcribe audio

Speech-to-Text enables easy integration of Google speech recognition technologies into your solution. Speech-to-Text is a machine learning (ML) technology that gives you full control over your infrastructure and your protected speech data to meet data residency and compliance requirements.

The following table describes the key capabilities of Speech-to-Text.

Key capabilities
Transcription   Applies advanced deep learning neural network algorithms from Google to automatic speech recognition.
Models          Deploys models that are less than 1 GB in size and consume minimal resources.
API compatible  Uses an API that is fully compatible with the Speech-to-Text API and its client libraries.

For a list of supported audio encoding formats, see AudioEncoding of Speech-to-Text.

Before you begin

To get the permissions you need to use the Vertex AI Speech-to-Text pre-trained API, ask your Project IAM Admin to grant you the AI Speech Developer (ai-speech-developer) role in your project namespace.

How to use the Speech-to-Text client library

Work through the following steps to use the Speech-to-Text client library with either the grpcurl command-line tool or Python:

grpcurl

  1. Open a notebook as your coding environment. If you don't have an existing notebook, create a notebook.
  2. Write your code using the grpcurl tool and the API reference documentation to process data. The following code sample shows how to send a request using the grpcurl tool.
  3. Run your code to generate a Speech-to-Text transcription.
# Send the request with grpcurl (where ${GOPATH} is your Go workspace path)
"${GOPATH}/bin/grpcurl" -plaintext -d @ localhost:10000 \
  google.cloud.speech.v1.Speech.Recognize < recognize_request.json

# Sample request in recognize_request.json
# Populate the audio content field with your audio, encoded in base64
{
  "config": {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 16000,
    "audio_channel_count": 1,
    "language_code": "LANGUAGE_CODE"
  },
  "audio": {
    "content": "BASE64_ENCODED_AUDIO_FILE"
  }
}

Replace LANGUAGE_CODE with a supported language code and BASE64_ENCODED_AUDIO_FILE with the base64-encoded content of your audio file.
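The sample request expects the audio as base64 text, so it can help to generate recognize_request.json from a script rather than by hand. The following is a minimal sketch that uses only the Python standard library. The function names build_recognize_request and write_recognize_request are hypothetical helpers, not part of the Speech-to-Text client library, and the default configuration values are copied from the sample request.

```python
import base64
import json
from pathlib import Path


def build_recognize_request(audio_bytes, language_code,
                            sample_rate_hertz=16000, audio_channel_count=1):
    """Return a dict shaped like the sample recognize_request.json.

    Hypothetical helper: the field names mirror the sample request.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sample_rate_hertz": sample_rate_hertz,
            "audio_channel_count": audio_channel_count,
            "language_code": language_code,
        },
        "audio": {
            # JSON cannot carry raw bytes, so encode the audio as base64 text.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


def write_recognize_request(audio_path, language_code,
                            out_path="recognize_request.json"):
    """Read a local audio file and write the request body to out_path."""
    request = build_recognize_request(Path(audio_path).read_bytes(), language_code)
    Path(out_path).write_text(json.dumps(request, indent=2))
```

For example, calling write_recognize_request("audio.raw", "en-US") would write a request file that the grpcurl command can read from standard input. The file name audio.raw is a placeholder.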

Python

  1. Open a notebook as your coding environment. If you don't have an existing notebook, create a notebook.
  2. Write Python code that installs the Speech-to-Text client library from a tar file and gets a transcription. The following code sample shows how to import the Speech-to-Text client library and transcribe an audio file.
  3. Run your code to generate a Speech-to-Text transcription.
# Import the Speech-to-Text client library.
from google.cloud import speech

# Instantiate a client.
client = speech.SpeechClient()

# Specify the audio file to transcribe.
audio_uri = "YOUR_AUDIO_TO_TRANSCRIBE"

audio = speech.RecognitionAudio(uri=audio_uri)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    audio_channel_count=1,
    language_code="LANGUAGE_CODE",
)

# Detect speech in the audio file.
response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(f"Transcript: {result.alternatives[0].transcript}")

Replace LANGUAGE_CODE with a supported language code and YOUR_AUDIO_TO_TRANSCRIBE with the URI of your audio file.
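The configuration values in the sample (LINEAR16, 16000 Hz, one channel) should describe the audio you actually send. As a rough sketch, and assuming your audio is stored in a WAV container, the standard-library wave module can read those values from the file header. The function name linear16_config_from_wav is a hypothetical helper, not part of the client library.

```python
import wave


def linear16_config_from_wav(wav_path, language_code):
    """Build recognition config values from a WAV file header.

    Hypothetical helper. It assumes a PCM WAV file and rejects anything
    that is not 16-bit, because LINEAR16 means 16-bit linear PCM samples.
    """
    with wave.open(wav_path, "rb") as wav:
        sample_width = wav.getsampwidth()  # bytes per sample
        if sample_width != 2:
            raise ValueError(
                f"expected 16-bit samples for LINEAR16, got {sample_width * 8}-bit"
            )
        return {
            "encoding": "LINEAR16",
            "sample_rate_hertz": wav.getframerate(),
            "audio_channel_count": wav.getnchannels(),
            "language_code": language_code,
        }
```

The returned dictionary uses the same field names as the configuration in the sample, so its values can be copied into a RecognitionConfig or a JSON request.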

Sample of the Speech-to-Text client library

To transcribe an audio file using the Speech-to-Text API, first view the statuses and endpoints of the pre-trained models to identify your endpoint. Then, follow the sample code:

# Example values:
# api_endpoint = '0.0.0.0:10000'
# local_file_path = '../resources/two_channel_16k.raw'
from google.cloud import speech_v1
import grpc


def transcribe(local_file_path, api_endpoint):
    # Connect to the endpoint over an insecure (plaintext) gRPC channel.
    transport = speech_v1.services.speech.transports.SpeechGrpcTransport(
        channel=grpc.insecure_channel(target=api_endpoint)
    )
    client = speech_v1.SpeechClient(transport=transport)

    config = {
        "encoding": speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
        "language_code": "LANGUAGE_CODE",
        "sample_rate_hertz": 16000,
        "audio_channel_count": 1,
    }

    # Read the raw audio bytes from the local file.
    with open(local_file_path, "rb") as f:
        content = f.read()
    audio = {"content": content}

    # Send the recognition request and print each transcript.
    response = client.recognize(request={"config": config, "audio": audio})
    for result in response.results:
        print(f"Transcript: {result.alternatives[0].transcript}")
    return response

Replace LANGUAGE_CODE with a supported language code.

Supported languages

The following languages are supported by Speech-to-Text:

Language                                Language code
Arabic (Egypt)                          ar-EG
Arabic (Levantine)                      ar-x-levant
Arabic (Maghrebi)                       ar-x-maghrebi
Arabic (Peninsular Gulf)                ar-x-gulf
Chinese, Mandarin (Simplified, China)   cmn-hans-cn
English (United States)                 en-US
French (France)                         fr-FR
German (Germany)                        de-DE
Korean (South Korea)                    ko-KR
Portuguese (Brazil)                     pt-BR
Russian (Russia)                        ru-RU
Spanish (United States)                 es-US
Ukrainian (Ukraine)                     uk-UA
Urdu (Pakistan)                         ur-PK
Persian (Iran)                          fa-IR
Swahili                                 sw
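
Because each sample needs LANGUAGE_CODE replaced with one of the codes listed here, a small client-side check can catch typos before a request is sent. The following is a sketch only: the tuple copies the codes from the table, and normalize_language_code is a hypothetical helper. Matching without regard to letter case is a convenience chosen for this sketch, not documented service behavior.

```python
# Language codes copied from the supported languages table.
SUPPORTED_LANGUAGE_CODES = (
    "ar-EG", "ar-x-levant", "ar-x-maghrebi", "ar-x-gulf", "cmn-hans-cn",
    "en-US", "fr-FR", "de-DE", "ko-KR", "pt-BR", "ru-RU", "es-US",
    "uk-UA", "ur-PK", "fa-IR", "sw",
)


def normalize_language_code(code):
    """Return the code spelled as in the table, or raise ValueError.

    Hypothetical helper: comparison ignores case and surrounding spaces.
    """
    lookup = {c.lower(): c for c in SUPPORTED_LANGUAGE_CODES}
    key = code.strip().lower()
    if key not in lookup:
        raise ValueError(f"unsupported language code: {code!r}")
    return lookup[key]
```

For example, normalize_language_code("en-us") returns "en-US", and a code that is not in the table raises ValueError.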