Transcribe audio

This page shows you how to transcribe audio files into text using the Speech-to-Text API on Google Distributed Cloud (GDC) air-gapped appliance.

The Speech-to-Text service of Vertex AI on GDC air-gapped appliance recognizes speech in audio files and converts it into text transcriptions using its pre-trained API.

Before you begin

Before you can start using the Speech-to-Text API, you must have a project with the Speech-to-Text API enabled and have the appropriate credentials. You can also install client libraries to help you make calls to the API. For more information, see Set up a speech recognition project.

Transcribe audio with the default model

To recognize speech, you send the audio data directly as content in the API request. The system returns the transcribed text in the API response.
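Because the audio travels inline in the request, the script later on this page expects the file's bytes as a Base64 string. The following is a minimal sketch of one way to produce that string from a local file, using only the Python standard library. The function names encode_audio_file and decode_audio_content are illustrative and are not part of the Speech-to-Text client library.

```python
import base64
from pathlib import Path


def encode_audio_file(path):
    """Return the raw bytes of an audio file as a Base64-encoded ASCII string."""
    raw_bytes = Path(path).read_bytes()
    return base64.standard_b64encode(raw_bytes).decode("ascii")


def decode_audio_content(encoded):
    """Reverse the encoding to recover the raw audio bytes."""
    return base64.standard_b64decode(encoded)
```

The string returned by encode_audio_file is the kind of value that would stand in for BASE64_ENCODED_AUDIO. A FLAC file, for example, starts with the bytes fLaC, so its encoded form starts with Zkxh.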

You must provide a RecognitionConfig configuration object when making a speech recognition request. This object tells the API how to process your audio data and what kind of output you expect. If you don't explicitly specify a model in this configuration object, Speech-to-Text selects a default model. Speech-to-Text on GDC air-gapped appliance supports only the default model.
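The values in the configuration object have to describe the audio that you actually send. For uncompressed WAV files, a minimal sketch like the following reads those values from the file header with Python's standard wave module. The function name describe_wav and the sample_width_bytes key are illustrative assumptions; the other two keys mirror the sample_rate_hertz and audio_channel_count fields used in the script on this page.

```python
import wave


def describe_wav(path):
    """Read the sample rate, channel count, and sample width from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        return {
            "sample_rate_hertz": wav_file.getframerate(),
            "audio_channel_count": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
        }
```

A sample width of 2 bytes indicates 16-bit PCM samples, which is the format that the LINEAR16 encoding describes. This sketch applies to PCM WAV files only; other containers, such as FLAC, need a different way to inspect the header.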

Follow these steps to use the Speech-to-Text service from a Python script to transcribe speech from an audio file:

  1. Install the latest version of the Speech-to-Text client library.

  2. Set the required environment variables in a Python script.

  3. Authenticate your API request.

  4. Add the following code to the Python script you created:

    import base64
    
    from google.cloud import speech_v1p1beta1
    import google.auth
    from google.auth.transport import requests
    from google.api_core.client_options import ClientOptions
    
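    # Replace ENDPOINT with the Speech-to-Text endpoint for your organization.
    audience = "https://ENDPOINT:443"
    api_endpoint = "ENDPOINT:443"
    
    def get_client(creds):
      # Point the client at the GDC endpoint instead of the library's default endpoint.
      opts = ClientOptions(api_endpoint=api_endpoint)
      return speech_v1p1beta1.SpeechClient(credentials=creds, client_options=opts)
    
    def main():
      # Load the default credentials, set the audience, and fetch a token.
      creds = None
      try:
        creds, project_id = google.auth.default()
        creds = creds.with_gdch_audience(audience)
        req = requests.Request()
        creds.refresh(req)
        # Debug output only; avoid printing tokens outside of testing.
        print("Got token: ")
        print(creds.token)
      except Exception as e:
        print("Caught exception: " + str(e))
        raise
      return creds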
    
    def speech_func(creds):
      tc = get_client(creds)
    
      content = "BASE64_ENCODED_AUDIO"
    
      # The audio is sent inline, so decode the Base64 string back to raw bytes.
      audio = speech_v1p1beta1.RecognitionAudio()
      audio.content = base64.standard_b64decode(content)
    
      # Describe the audio so that the API knows how to process it.
      config = speech_v1p1beta1.RecognitionConfig()
      config.encoding = speech_v1p1beta1.RecognitionConfig.AudioEncoding.ENCODING
      config.sample_rate_hertz = RATE_HERTZ
      config.language_code = "LANGUAGE_CODE"
      config.audio_channel_count = CHANNEL_COUNT
    
      # Identify the project in the request metadata.
      metadata = [("x-goog-user-project", "projects/PROJECT_ID")]
      resp = tc.recognize(config=config, audio=audio, metadata=metadata)
      print(resp)
    
    if __name__=="__main__":
      creds = main()
      speech_func(creds)
    

    Replace the following:

    • ENDPOINT: the Speech-to-Text endpoint that you use for your organization. For more information, view service status and endpoints.
    • PROJECT_ID: your project ID.
    • BASE64_ENCODED_AUDIO: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv.
    • ENCODING: the encoding of the audio data sent in the request, such as LINEAR16.
    • RATE_HERTZ: the sample rate, in hertz, of the audio data sent in the request, such as 16000.
    • LANGUAGE_CODE: the language of the supplied audio as a BCP-47 language tag. See the list of supported languages and their respective language codes.
    • CHANNEL_COUNT: the number of channels in the input audio data, such as 1.

  5. Save the Python script.

  6. Run the Python script to transcribe audio:

    python SCRIPT_NAME
    

    Replace SCRIPT_NAME with the name you gave to your Python script, for example, speech.py.
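
The script prints the whole response object. If you only need the text, a minimal sketch like the following collects the top-ranked transcript from each result. It assumes the response has the shape used by the public Speech-to-Text client library: a results list in which each result has an alternatives list whose items carry a transcript field. The function name collect_transcripts is illustrative, and the stand-in response object exists only so that the sketch runs without calling the API.

```python
from types import SimpleNamespace


def collect_transcripts(response):
    """Return the top-ranked transcript of each result, in order."""
    transcripts = []
    for result in response.results:
        # Skip any result that carries no alternatives.
        if result.alternatives:
            transcripts.append(result.alternatives[0].transcript)
    return transcripts


# Hypothetical stand-in for a recognize() response, for illustration only.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello world")]),
        SimpleNamespace(alternatives=[]),
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="good morning")]),
    ]
)

if __name__ == "__main__":
    print(" ".join(collect_transcripts(fake_response)))
```

In the script on this page, you could call collect_transcripts(resp) after the recognize call instead of printing resp directly.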