This page shows you how to transcribe audio files into text using the Speech-to-Text API on Google Distributed Cloud (GDC) air-gapped appliance.
The Speech-to-Text service of Vertex AI on GDC air-gapped appliance recognizes speech from audio files. Speech-to-Text converts the detected speech into text transcriptions using its pre-trained recognition model.
Before you begin
Before you can start using the Speech-to-Text API, you must have a project with the Speech-to-Text API enabled and have the appropriate credentials. You can also install client libraries to help you make calls to the API. For more information, see Set up a speech recognition project.
Transcribe audio with the default model
Speech-to-Text performs speech recognition. You send the audio file from which you want to recognize speech directly as content in the API request. The system returns the resulting transcribed text in the API response.
You must provide a RecognitionConfig configuration object when making a speech recognition request. This object tells the API how to process your audio data and what kind of output you expect. If a model is not explicitly specified in this configuration object, Speech-to-Text selects a default model. Speech-to-Text on GDC air-gapped appliance supports only the default model.
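In the REST form of the API, the recognition request body pairs the `RecognitionConfig` with the audio content. A minimal sketch follows; because no `model` field is set, the default model is used, and the field values shown are illustrative only:

```json
{
  "config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "languageCode": "en-US",
    "audioChannelCount": 1
  },
  "audio": {
    "content": "BASE64_ENCODED_AUDIO"
  }
}
```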
Follow these steps to use the Speech-to-Text service from a Python script to transcribe speech from an audio file:
Install the latest version of the Speech-to-Text client library.
Add the following code to the Python script you created:
```python
import base64

from google.cloud import speech_v1p1beta1
import google.auth
from google.auth.transport import requests
from google.api_core.client_options import ClientOptions

audience = "https://ENDPOINT:443"
api_endpoint = "ENDPOINT:443"


def get_client(creds):
    opts = ClientOptions(api_endpoint=api_endpoint)
    return speech_v1p1beta1.SpeechClient(credentials=creds, client_options=opts)


def main():
    creds = None
    try:
        creds, project_id = google.auth.default()
        creds = creds.with_gdch_audience(audience)
        req = requests.Request()
        creds.refresh(req)
        print("Got token: ")
        print(creds.token)
    except Exception as e:
        print("Caught exception " + str(e))
        raise e
    return creds


def speech_func(creds):
    tc = get_client(creds)

    content = "BASE64_ENCODED_AUDIO"

    audio = speech_v1p1beta1.RecognitionAudio()
    audio.content = base64.standard_b64decode(content)

    config = speech_v1p1beta1.RecognitionConfig()
    config.encoding = speech_v1p1beta1.RecognitionConfig.AudioEncoding.ENCODING
    config.sample_rate_hertz = RATE_HERTZ
    config.language_code = "LANGUAGE_CODE"
    config.audio_channel_count = CHANNEL_COUNT

    metadata = [("x-goog-user-project", "projects/PROJECT_ID")]
    resp = tc.recognize(config=config, audio=audio, metadata=metadata)
    print(resp)


if __name__ == "__main__":
    creds = main()
    speech_func(creds)
```
Replace the following:

- `ENDPOINT`: the Speech-to-Text endpoint that you use for your organization. For more information, view service status and endpoints.
- `PROJECT_ID`: your project ID.
- `BASE64_ENCODED_AUDIO`: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to `ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv`.
- `ENCODING`: the encoding of the audio data sent in the request, such as `LINEAR16`.
- `RATE_HERTZ`: the sample rate in Hertz of the audio data sent in the request, such as `16000`.
- `LANGUAGE_CODE`: the language of the supplied audio as a BCP-47 language tag. See the list of supported languages and their respective language codes.
- `CHANNEL_COUNT`: the number of channels in the input audio data, such as `1`.
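You can produce the `BASE64_ENCODED_AUDIO` string from a local recording with Python's standard `base64` module. A minimal sketch follows; the file name `audio.flac` and its placeholder contents are illustrative only:

```python
import base64

# Illustrative only: write the first bytes of a FLAC stream marker so the
# example is self-contained. In practice, use your own audio recording.
with open("audio.flac", "wb") as f:
    f.write(b"fLaC\x00\x00\x00\x22")

# Read the raw audio bytes and Base64-encode them for the request body.
with open("audio.flac", "rb") as f:
    audio_bytes = f.read()
content = base64.standard_b64encode(audio_bytes).decode("utf-8")

print(content)  # a FLAC file yields a string starting with "ZkxhQw"
```

Decoding the string with `base64.standard_b64decode`, as the script above does, recovers the original audio bytes.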
Save the Python script.
Run the Python script to transcribe audio:
```
python SCRIPT_NAME
```

Replace `SCRIPT_NAME` with the name you gave to your Python script, for example, `speech.py`.