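Transcribe long audio files into text

This page demonstrates how to transcribe long audio files (longer than one minute) to text using the Speech-to-Text API and asynchronous speech recognition.

About asynchronous speech recognition

Batch speech recognition starts a long-running audio processing operation. Use asynchronous speech recognition to transcribe audio that is longer than 60 seconds. For shorter audio, synchronous speech recognition is faster and simpler. The upper limit for asynchronous speech recognition is 480 minutes (8 hours).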
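
The duration thresholds above can be captured in a small pre-flight check. The following is a minimal sketch, not part of the Speech-to-Text client library: the function name `choose_recognition_method` and the constants are hypothetical, and the limits are simply the values quoted in the paragraph above.

```python
from datetime import timedelta

# Limits quoted in the text above; documentation values, not API constants.
SYNC_MAX = timedelta(seconds=60)
BATCH_MAX = timedelta(minutes=480)


def choose_recognition_method(duration: timedelta) -> str:
    """Return "sync" or "batch" for an audio clip of the given duration."""
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    if duration > BATCH_MAX:
        raise ValueError("audio exceeds the 480-minute limit; split it first")
    # Audio of 60 seconds or less can use synchronous recognition.
    return "sync" if duration <= SYNC_MAX else "batch"
```

For example, `choose_recognition_method(timedelta(minutes=5))` selects batch recognition, while a 45-second clip stays on the synchronous path.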

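Batch speech recognition can transcribe only audio that is stored in Cloud Storage. The transcription output can be returned inline in the response (for single-file batch recognition requests) or written to Cloud Storage.

A batch recognition request returns an Operation that contains information about the recognition processing that is in progress. You can poll the operation to learn when it completes and when the transcripts are available.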
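
Polling can be sketched as a simple loop. This is an illustrative helper rather than a library function: `wait_for_operation` and its parameters are hypothetical names, and it only assumes that the operation object exposes `done()` and `result()`, as the long-running operation returned by the Python client's `batch_recognize` does.

```python
import time
from typing import Any, Callable


def wait_for_operation(
    operation: Any,
    poll_interval: float = 5.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll operation.done() until it is True, then return operation.result()."""
    waited = 0.0
    while not operation.done():
        if waited >= max_wait:
            raise TimeoutError(f"operation not finished after {max_wait} seconds")
        # Wait before asking the service again.
        sleep(poll_interval)
        waited += poll_interval
    return operation.result()
```

The samples on this page call `operation.result(timeout=120)` instead, which blocks and polls internally. An explicit loop like this one is mainly useful when you want to log progress or choose a longer waiting budget for very long audio.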

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Speech-to-Text APIs.

    Enable the APIs

  5. Make sure that you have the following role or roles on the project: Cloud Speech Administrator

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

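      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.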
    8. Install the Google Cloud CLI.
    9. To initialize the gcloud CLI, run the following command:

      gcloud init
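      10. Client libraries can use Application Default Credentials to easily authenticate with Google APIs and send requests to those APIs. With Application Default Credentials, you can test your application locally and deploy it without changing the underlying code. For more information, see Authenticate for using client libraries.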

      11. If you're using a local shell, then create local authentication credentials for your user account:

        gcloud auth application-default login

        You don't need to do this if you're using Cloud Shell.

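      Also ensure that you have installed the client library.

      Enable access to Cloud Storage

      Speech-to-Text uses a service account to access your files in Cloud Storage. By default, the service account has access to Cloud Storage files in the same project.

      The service account email address has the following form: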

      service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com
      

      To transcribe Cloud Storage files in another project, you can grant this service account the Speech-to-Text Service Agent role in the other project:

      gcloud projects add-iam-policy-binding PROJECT_ID \
          --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
          --role=roles/speech.serviceAgent
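
A common slip with the command above is passing a project ID where the project number is expected. The following sketch builds the member string and the argument list in Python so that the check happens before anything is run. The helper names are hypothetical. The email pattern, role, and gcloud subcommand are the ones shown above.

```python
from typing import List


def speech_service_agent_member(project_number: str) -> str:
    """Build the IAM member string for the Speech-to-Text service account."""
    if not str(project_number).isdigit():
        raise ValueError("expected a numeric project number, not a project ID")
    email = f"service-{project_number}@gcp-sa-speech.iam.gserviceaccount.com"
    return f"serviceAccount:{email}"


def project_binding_command(project_id: str, project_number: str) -> List[str]:
    """Return the gcloud argument list that mirrors the command shown above."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={speech_service_agent_member(project_number)}",
        "--role=roles/speech.serviceAgent",
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a machine where the gcloud CLI is installed and authenticated.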

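      For more information about project IAM policies, see Manage access to projects, folders, and organizations.

      You can also give the service account more granular access by granting it permission on a specific Cloud Storage bucket: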

      gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
          --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
          --role=roles/storage.admin

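      For more information about managing access to Cloud Storage, see Create and manage access control lists in the Cloud Storage documentation.

      Perform batch recognition with inline results

      The following example performs batch speech recognition on an audio file in Cloud Storage and reads the transcription results inline from the response: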

      Python

      import os
      
      from google.cloud.speech_v2 import SpeechClient
      from google.cloud.speech_v2.types import cloud_speech
      
      PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
      
      
      def transcribe_batch_gcs_input_inline_output_v2(
          audio_uri: str,
      ) -> cloud_speech.BatchRecognizeResults:
          """Transcribes audio from a Google Cloud Storage URI using the Google Cloud Speech-to-Text API.
          The transcription results are returned inline in the response.
          Args:
              audio_uri (str): The Google Cloud Storage URI of the input audio file.
                  E.g., gs://[BUCKET]/[FILE]
          Returns:
              cloud_speech.BatchRecognizeResults: The response containing the transcription results.
          """
          # Instantiates a client
          client = SpeechClient()
      
          config = cloud_speech.RecognitionConfig(
              auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
              language_codes=["en-US"],
              model="long",
          )
      
          file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=audio_uri)
      
          request = cloud_speech.BatchRecognizeRequest(
              recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
              config=config,
              files=[file_metadata],
              recognition_output_config=cloud_speech.RecognitionOutputConfig(
                  inline_response_config=cloud_speech.InlineOutputConfig(),
              ),
          )
      
          # Transcribes the audio into text
          operation = client.batch_recognize(request=request)
      
          print("Waiting for operation to complete...")
          response = operation.result(timeout=120)
      
          for result in response.results[audio_uri].transcript.results:
              print(f"Transcript: {result.alternatives[0].transcript}")
      
          return response.results[audio_uri].transcript
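
The sample above indexes `result.alternatives[0]` directly. As a defensive variant, the sketch below joins the top alternative of each result into one string and skips any result whose `alternatives` list is empty. That a result can arrive without alternatives is an assumption here, not something this page states. `join_transcripts` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for the real result messages.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def join_transcripts(results: Iterable[Any]) -> str:
    """Concatenate the top alternative of each result, skipping empty results."""
    parts = []
    for result in results:
        if not result.alternatives:
            # Nothing recognized for this segment; avoid an IndexError.
            continue
        parts.append(result.alternatives[0].transcript.strip())
    return " ".join(part for part in parts if part)


# Stand-in objects shaped like the results iterated in the sample above.
fake_results = [
    SimpleNamespace(alternatives=[SimpleNamespace(transcript=" hello world ")]),
    SimpleNamespace(alternatives=[]),
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="second segment")]),
]
print(join_transcripts(fake_results))  # prints "hello world second segment"
```

With the real client, the same helper could be called on `response.results[audio_uri].transcript.results`.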
      
      

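      Perform batch recognition and write results to Cloud Storage

      The following example performs batch speech recognition on an audio file in Cloud Storage and reads the transcription results from the output file in Cloud Storage. Note that the file written to Cloud Storage is a BatchRecognizeResults message in JSON format: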

      Python

      import os
      import re
      
      from google.cloud import storage
      from google.cloud.speech_v2 import SpeechClient
      from google.cloud.speech_v2.types import cloud_speech
      
      PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
      
      
      def transcribe_batch_gcs_input_gcs_output_v2(
          audio_uri: str,
          gcs_output_path: str,
      ) -> cloud_speech.BatchRecognizeResults:
          """Transcribes audio from a Google Cloud Storage URI using the Google Cloud Speech-to-Text API.
          The transcription results are stored in another Google Cloud Storage bucket.
          Args:
              audio_uri (str): The Google Cloud Storage URI of the input audio file.
                  E.g., gs://[BUCKET]/[FILE]
              gcs_output_path (str): The Google Cloud Storage bucket URI where the output transcript will be stored.
                  E.g., gs://[BUCKET]
          Returns:
              cloud_speech.BatchRecognizeResults: The transcription results read from the output file in Cloud Storage.
          """
          # Instantiates a client
          client = SpeechClient()
      
          config = cloud_speech.RecognitionConfig(
              auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
              language_codes=["en-US"],
              model="long",
          )
      
          file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=audio_uri)
      
          request = cloud_speech.BatchRecognizeRequest(
              recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
              config=config,
              files=[file_metadata],
              recognition_output_config=cloud_speech.RecognitionOutputConfig(
                  gcs_output_config=cloud_speech.GcsOutputConfig(
                      uri=gcs_output_path,
                  ),
              ),
          )
      
          # Transcribes the audio into text
          operation = client.batch_recognize(request=request)
      
          print("Waiting for operation to complete...")
          response = operation.result(timeout=120)
      
          file_results = response.results[audio_uri]
      
          print(f"Operation finished. Fetching results from {file_results.uri}...")
          output_bucket, output_object = re.match(
              r"gs://([^/]+)/(.*)", file_results.uri
          ).group(1, 2)
      
          # Instantiates a Cloud Storage client
          storage_client = storage.Client()
      
          # Fetch results from Cloud Storage
          bucket = storage_client.bucket(output_bucket)
          blob = bucket.blob(output_object)
          results_bytes = blob.download_as_bytes()
          batch_recognize_results = cloud_speech.BatchRecognizeResults.from_json(
              results_bytes, ignore_unknown_fields=True
          )
      
          for result in batch_recognize_results.results:
              print(f"Transcript: {result.alternatives[0].transcript}")
      
          return batch_recognize_results
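
The sample above splits the output URI with `re.match(...).group(1, 2)`, which raises an `AttributeError` that is hard to interpret if the URI does not match. A minimal alternative, assuming you prefer an explicit error, is a small helper such as the hypothetical `split_gcs_uri` below.

```python
import re
from typing import Tuple

# gs://BUCKET/OBJECT, where OBJECT may itself contain slashes.
_GCS_URI = re.compile(r"gs://([^/]+)/(.+)")


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, object name) or raise ValueError."""
    match = _GCS_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a gs://BUCKET/OBJECT URI: {uri!r}")
    return match.group(1), match.group(2)
```

For example, `split_gcs_uri("gs://my-bucket/out/audio.json")` returns `("my-bucket", "out/audio.json")`. The bucket and object names here are placeholders.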
      
      

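      Perform batch recognition on multiple files

      The following example performs batch speech recognition on multiple audio files in Cloud Storage and reads the transcription results from the output files in Cloud Storage: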

      Python

      import os
      import re
      from typing import List
      
      from google.cloud import storage
      from google.cloud.speech_v2 import SpeechClient
      from google.cloud.speech_v2.types import cloud_speech
      
      PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
      
      
      def transcribe_batch_multiple_files_v2(
          audio_uris: List[str],
          gcs_output_path: str,
      ) -> cloud_speech.BatchRecognizeResponse:
          """Transcribes audio from multiple Google Cloud Storage URIs using the Google Cloud Speech-to-Text API.
          The transcription results are stored in another Google Cloud Storage bucket.
          Args:
              audio_uris (List[str]): The list of Google Cloud Storage URIs of the input audio files.
                  E.g., ["gs://[BUCKET]/[FILE]", "gs://[BUCKET]/[FILE]"]
              gcs_output_path (str): The Google Cloud Storage bucket URI where the output transcript will be stored.
                  E.g., gs://[BUCKET]
          Returns:
              cloud_speech.BatchRecognizeResponse: The response containing the URIs of the transcription results.
          """
          # Instantiates a client
          client = SpeechClient()
      
          config = cloud_speech.RecognitionConfig(
              auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
              language_codes=["en-US"],
              model="long",
          )
      
          files = [cloud_speech.BatchRecognizeFileMetadata(uri=uri) for uri in audio_uris]
      
          request = cloud_speech.BatchRecognizeRequest(
              recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
              config=config,
              files=files,
              recognition_output_config=cloud_speech.RecognitionOutputConfig(
                  gcs_output_config=cloud_speech.GcsOutputConfig(
                      uri=gcs_output_path,
                  ),
              ),
          )
      
          # Transcribes the audio into text
          operation = client.batch_recognize(request=request)
      
          print("Waiting for operation to complete...")
          response = operation.result(timeout=120)
      
          # Instantiates a Cloud Storage client once and reuses it for every file
          storage_client = storage.Client()
      
          print("Operation finished. Fetching results from:")
          for uri in audio_uris:
              file_results = response.results[uri]
              print(f"  {file_results.uri}...")
              output_bucket, output_object = re.match(
                  r"gs://([^/]+)/(.*)", file_results.uri
              ).group(1, 2)
      
              # Fetch results from Cloud Storage
              bucket = storage_client.bucket(output_bucket)
              blob = bucket.blob(output_object)
              results_bytes = blob.download_as_bytes()
              batch_recognize_results = cloud_speech.BatchRecognizeResults.from_json(
                  results_bytes, ignore_unknown_fields=True
              )
      
              for result in batch_recognize_results.results:
                  print(f"     Transcript: {result.alternatives[0].transcript}")
      
          return response
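
Because each output file is a JSON document, it can also be inspected without the Speech-to-Text types. The sketch below uses only the standard library and assumes the file follows the usual JSON mapping of the BatchRecognizeResults message, with `results`, `alternatives`, and `transcript` keys. Check a real output file before relying on these key names. The function name and the sample payload are illustrative.

```python
import json
from typing import List


def transcripts_from_json(payload: bytes) -> List[str]:
    """Extract the top transcript of each result from a JSON results file."""
    document = json.loads(payload)
    transcripts = []
    for result in document.get("results", []):
        alternatives = result.get("alternatives", [])
        # Skip results that carry no alternatives or an empty transcript.
        if alternatives and alternatives[0].get("transcript"):
            transcripts.append(alternatives[0]["transcript"])
    return transcripts


# Hand-written payload for illustration only; not real service output.
sample = (
    b'{"results": ['
    b'{"alternatives": [{"transcript": "first segment"}]},'
    b'{"alternatives": []},'
    b'{"alternatives": [{"transcript": "second segment"}]}]}'
)
print(transcripts_from_json(sample))  # prints ['first segment', 'second segment']
```

In the sample above, the bytes returned by `blob.download_as_bytes()` could be passed to this function in place of the `from_json` call when only the transcript text is needed.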
      
      

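      Enable dynamic batching on batch recognition

      Dynamic batching lowers the cost of transcription in exchange for higher latency. This feature is available only for batch recognition.

      The following example performs batch recognition on an audio file in Cloud Storage with dynamic batching enabled: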

      Python

      import os
      
      from google.cloud.speech_v2 import SpeechClient
      from google.cloud.speech_v2.types import cloud_speech
      
      PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
      
      
      def transcribe_batch_dynamic_batching_v2(
          audio_uri: str,
      ) -> cloud_speech.BatchRecognizeResults:
          """Transcribes audio from a Google Cloud Storage URI using dynamic batching.
          Args:
              audio_uri (str): The Cloud Storage URI of the input audio.
                  E.g., gs://[BUCKET]/[FILE]
          Returns:
              cloud_speech.BatchRecognizeResults: The response containing the transcription results.
          """
          # Instantiates a client
          client = SpeechClient()
      
          config = cloud_speech.RecognitionConfig(
              auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
              language_codes=["en-US"],
              model="long",
          )
      
          file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=audio_uri)
      
          request = cloud_speech.BatchRecognizeRequest(
              recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
              config=config,
              files=[file_metadata],
              recognition_output_config=cloud_speech.RecognitionOutputConfig(
                  inline_response_config=cloud_speech.InlineOutputConfig(),
              ),
              processing_strategy=cloud_speech.BatchRecognizeRequest.ProcessingStrategy.DYNAMIC_BATCHING,
          )
      
          # Transcribes the audio into text
          operation = client.batch_recognize(request=request)
      
          print("Waiting for operation to complete...")
          response = operation.result(timeout=120)
      
          for result in response.results[audio_uri].transcript.results:
              print(f"Transcript: {result.alternatives[0].transcript}")
      
          return response.results[audio_uri].transcript
      
      

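      Override recognition features for each file

      By default, batch recognition uses the same recognition configuration for every file in the batch recognition request. If different files require different configurations or features, you can override the configuration per file by using the config field in the BatchRecognizeFileMetadata message. For an example of overriding recognition features, see the recognizers documentation.

      Clean up

      To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.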

      1. Optional: Revoke the authentication credentials that you created, and delete the local credential file.

        gcloud auth application-default revoke
      2. Optional: Revoke credentials from the gcloud CLI.

        gcloud auth revoke

      Console

      1. In the Google Cloud console, go to the Manage resources page.

        Go to Manage resources

      2. In the project list, select the project that you want to delete, and then click Delete.
      3. In the dialog, type the project ID, and then click Shut down to delete the project.

      gcloud

      Delete a Google Cloud project:

        gcloud projects delete PROJECT_ID

      What's next