This page shows you how to transcribe audio files into text using the Speech-to-Text API on Google Distributed Cloud (GDC) air-gapped appliance.
The Speech-to-Text service of Vertex AI on GDC air-gapped appliance recognizes speech from audio files. Speech-to-Text converts the detected speech into text transcriptions using its pre-trained recognition model.
Before you begin
Before you can start using the Speech-to-Text API, you must have a project with the Speech-to-Text API enabled and have the appropriate credentials. You can also install client libraries to help you make calls to the API. For more information, see Set up a speech recognition project.
Transcribe audio with the default model
Speech-to-Text performs speech recognition. You send the audio file from which you want to recognize speech directly as content in the API request. The system returns the resulting transcribed text in the API response.
You must provide a RecognitionConfig configuration object when making a speech recognition request. This object tells the API how to process your audio data and what kind of output you expect. If a model is not explicitly specified in this configuration object, Speech-to-Text selects a default model. Speech-to-Text on GDC air-gapped appliance supports only the default model.
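In the REST form of the API, the recognition request body pairs the `RecognitionConfig` with the audio content. A minimal sketch follows; because no `model` field is set, the default model is used, and the field values shown are illustrative only:

```json
{
  "config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "languageCode": "en-US",
    "audioChannelCount": 1
  },
  "audio": {
    "content": "BASE64_ENCODED_AUDIO"
  }
}
```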
Follow these steps to use the Speech-to-Text service from a Python script to transcribe speech from an audio file:
Install the latest version of the Speech-to-Text client library.
Add the following code to the Python script you created:
```python
import base64

from google.cloud import speech_v1p1beta1
import google.auth
from google.auth.transport import requests
from google.api_core.client_options import ClientOptions

audience = "https://ENDPOINT:443"
api_endpoint = "ENDPOINT:443"


def get_client(creds):
    opts = ClientOptions(api_endpoint=api_endpoint)
    return speech_v1p1beta1.SpeechClient(credentials=creds, client_options=opts)


def main():
    creds = None
    try:
        creds, project_id = google.auth.default()
        creds = creds.with_gdch_audience(audience)
        req = requests.Request()
        creds.refresh(req)
        print("Got token: ")
        print(creds.token)
    except Exception as e:
        print("Caught exception " + str(e))
        raise e
    return creds


def speech_func(creds):
    tc = get_client(creds)

    content = "BASE64_ENCODED_AUDIO"

    audio = speech_v1p1beta1.RecognitionAudio()
    audio.content = base64.standard_b64decode(content)

    config = speech_v1p1beta1.RecognitionConfig()
    config.encoding = speech_v1p1beta1.RecognitionConfig.AudioEncoding.ENCODING
    config.sample_rate_hertz = RATE_HERTZ
    config.language_code = "LANGUAGE_CODE"
    config.audio_channel_count = CHANNEL_COUNT

    metadata = [("x-goog-user-project", "projects/PROJECT_ID")]
    resp = tc.recognize(config=config, audio=audio, metadata=metadata)
    print(resp)


if __name__ == "__main__":
    creds = main()
    speech_func(creds)
```
Replace the following:

- `ENDPOINT`: the Speech-to-Text endpoint that you use for your organization. For more information, view service status and endpoints.
- `PROJECT_ID`: your project ID.
- `BASE64_ENCODED_AUDIO`: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to `ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv`.
- `ENCODING`: the encoding of the audio data sent in the request, such as `LINEAR16`.
- `RATE_HERTZ`: the sample rate in Hertz of the audio data sent in the request, such as `16000`.
- `LANGUAGE_CODE`: the language of the supplied audio as a BCP-47 language tag. See the list of supported languages and their respective language codes.
- `CHANNEL_COUNT`: the number of channels in the input audio data, such as `1`.
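You can produce the `BASE64_ENCODED_AUDIO` string from a local recording with Python's standard `base64` module. A minimal sketch follows; the file name `audio.flac` and its placeholder contents are illustrative only:

```python
import base64

# Illustrative only: write the first bytes of a FLAC stream marker so the
# example is self-contained. In practice, use your own audio recording.
with open("audio.flac", "wb") as f:
    f.write(b"fLaC\x00\x00\x00\x22")

# Read the raw audio bytes and Base64-encode them for the request body.
with open("audio.flac", "rb") as f:
    audio_bytes = f.read()
content = base64.standard_b64encode(audio_bytes).decode("utf-8")

print(content)  # a FLAC file yields a string starting with "ZkxhQw"
```

Decoding the string with `base64.standard_b64decode`, as the script above does, recovers the original audio bytes.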
Save the Python script.
Run the Python script to transcribe audio:
```
python SCRIPT_NAME
```

Replace `SCRIPT_NAME` with the name you gave to your Python script, for example, `speech.py`.