Add recognition metadata

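This page describes how to add further details about the source audio included in a speech recognition request to Speech-to-Text.

Speech-to-Text has several machine learning models for converting recorded audio into text. Each model is trained on specific characteristics of the audio input, including the audio file type, the original recording device, the distance between the speaker and the recording device, the number of speakers in the audio file, and other factors.

When you send a transcription request to Speech-to-Text, you can include these additional details about the audio data as recognition metadata. Speech-to-Text can use these details to transcribe your audio data more accurately.

Collecting this metadata also lets Google analyze and aggregate the most common use cases for Speech-to-Text, so that Google can prioritize improvements for those use cases.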

Available metadata fields

You can provide any of the fields in the following list in the metadata of a transcription request.

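Field	Type	Description
`interactionType`	`ENUM`	The use case of the audio.
`industryNaicsCodeOfAudio`	Number	The industry vertical of the audio file, as a 6-digit NAICS code.
`microphoneDistance`	`ENUM`	The distance between the microphone and the speaker.
`originalMediaType`	`ENUM`	The original media of the audio, either audio or video.
`recordingDeviceType`	`ENUM`	The kind of device used to capture the audio, including smartphones, PC microphones, vehicles, and so on.
`recordingDeviceName`	String	The device used to make the recording. This is an arbitrary string that can include names such as "Pixel XL", "VoIP", "Cardioid Microphone", or other values.
`originalMimeType`	String	The MIME type of the original audio file. Examples include audio/m4a, audio/x-alaw-basic, audio/mp3, audio/3gpp, or other audio file MIME types.
`obfuscatedId`	String	A privacy-protected ID for the user, used to determine the number of unique users of the service.
`audioTopic`	String	An arbitrary description of the subject matter discussed in the audio file, such as "Guided tour of New York City", "court trial hearing", or "live interview between two people".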

For more information about these fields, see the RecognitionMetadata reference documentation.
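The field list above can be sketched as a small client-side check that runs before a request is sent. This is an illustrative helper only, not part of any Speech-to-Text client library: the names `FIELD_TYPES` and `build_metadata` are hypothetical, and the check covers field names and basic JSON types only. It does not validate enum values, which are defined in the RecognitionMetadata reference documentation.

```python
# Hypothetical helper (not part of the Speech-to-Text client library).
# Maps each metadata field from the table to the Python type used for its
# JSON representation. Enum fields are sent as strings in JSON.
FIELD_TYPES = {
    "interactionType": str,
    "industryNaicsCodeOfAudio": int,
    "microphoneDistance": str,
    "originalMediaType": str,
    "recordingDeviceType": str,
    "recordingDeviceName": str,
    "originalMimeType": str,
    "obfuscatedId": str,
    "audioTopic": str,
}


def build_metadata(**fields):
    """Return a metadata dict, rejecting unknown fields and wrong types."""
    metadata = {}
    for name, value in fields.items():
        expected = FIELD_TYPES.get(name)
        if expected is None:
            raise ValueError(f"Unknown metadata field: {name}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
        metadata[name] = value
    return metadata


metadata = build_metadata(
    interactionType="VOICE_SEARCH",
    microphoneDistance="NEARFIELD",
    recordingDeviceName="Pixel XL",
)
print(metadata)
```

Catching a misspelled field name locally is cheaper than waiting for the API to reject the request, but the service remains the authority on which values are accepted.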

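Enable recognition metadata

To add recognition metadata to a speech recognition request to the Speech-to-Text API, set the metadata field in the request's config to a RecognitionMetadata object. The Speech-to-Text API supports recognition metadata for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming. For more information about the types of metadata that you can include in a request, see the RecognitionMetadata reference documentation.

The following code demonstrates how to specify additional metadata fields in a transcription request.

Protocol

For complete details, see the speech:recognize API endpoint.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following example shows a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud CLI. For instructions on installing the gcloud CLI, setting up a project with a service account, and obtaining an access token, see the quickstart.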

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    https://speech.googleapis.com/v1p1beta1/speech:recognize \
    --data '{
    "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "enableWordTimeOffsets":  false,
        "metadata": {
            "interactionType": "VOICE_SEARCH",
            "industryNaicsCodeOfAudio": 23810,
            "microphoneDistance": "NEARFIELD",
            "originalMediaType": "AUDIO",
            "recordingDeviceType": "OTHER_INDOOR_DEVICE",
            "recordingDeviceName": "Polycom SoundStation IP 6000",
            "originalMimeType": "audio/mp3",
            "obfuscatedId": "11235813",
            "audioTopic": "questions about landmarks in NYC"
        }
    },
    "audio": {
        "uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
    }
}'

For more information about configuring the request body, see the RecognitionConfig reference documentation.
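The same request body can also be assembled programmatically before it is passed to an HTTP client. The following is a minimal sketch using only the Python standard library. The function name `build_recognize_request` is hypothetical, and the encoding, sample rate, and language code are copied from the curl example rather than derived from the audio file.

```python
import json


def build_recognize_request(gcs_uri, metadata):
    """Return a JSON request body for speech:recognize (illustrative only)."""
    body = {
        "config": {
            # These values mirror the curl example; adjust them to your audio.
            "encoding": "FLAC",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            # Recognition metadata is nested inside the config object.
            "metadata": metadata,
        },
        "audio": {"uri": gcs_uri},
    }
    return json.dumps(body, indent=2)


request_body = build_recognize_request(
    "gs://cloud-samples-tests/speech/brooklyn.flac",
    {
        "interactionType": "VOICE_SEARCH",
        "microphoneDistance": "NEARFIELD",
        "originalMediaType": "AUDIO",
    },
)
print(request_body)
```

Building the body as a dictionary and serializing it with `json.dumps` avoids the shell quoting issues that come with hand-written JSON on the command line.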

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98360395
        }
      ]
    }
  ]
}
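A response of this shape can be reduced to its top transcripts with a few lines of standard-library Python. This is a sketch under the assumption that the response follows the structure shown above. The helper name `top_transcripts` is hypothetical, and the sample text is the response from this page.

```python
import json

# Sample response copied from the example above.
response_text = """
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98360395
        }
      ]
    }
  ]
}
"""


def top_transcripts(response):
    """Return (transcript, confidence) for the first alternative of each result."""
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue  # Skip results that carry no alternatives.
        best = alternatives[0]
        pairs.append((best.get("transcript", ""), best.get("confidence")))
    return pairs


transcripts = top_transcripts(json.loads(response_text))
for transcript, confidence in transcripts:
    print(f"{transcript} (confidence: {confidence})")
```

Using `.get` with defaults keeps the helper from raising on an empty response, for example when no speech is detected in the audio.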

Node.js

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

async function syncRecognizeWithMetaData() {
  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
  // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
  // const sampleRateHertz = 16000;
  // const languageCode = 'BCP-47 language code, e.g. en-US';

  const recognitionMetadata = {
    interactionType: 'DISCUSSION',
    microphoneDistance: 'NEARFIELD',
    recordingDeviceType: 'SMARTPHONE',
    recordingDeviceName: 'Pixel 2 XL',
    industryNaicsCodeOfAudio: 519190,
  };

  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
    metadata: recognitionMetadata,
  };

  const audio = {
    content: fs.readFileSync(filename).toString('base64'),
  };

  const request = {
    config: config,
    audio: audio,
  };

  // Detects speech in the audio file
  const [response] = await client.recognize(request);
  response.results.forEach(result => {
    const alternative = result.alternatives[0];
    console.log(alternative.transcript);
  });
}

// Runs the sample and reports any error
syncRecognizeWithMetaData().catch(console.error);

Python

import io

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

speech_file = "resources/commercial_mono.wav"

with io.open(speech_file, "rb") as audio_file:
    content = audio_file.read()

# Here we construct a recognition metadata object.
# Most metadata fields are specified as enums that can be found
# in speech.RecognitionMetadata
metadata = speech.RecognitionMetadata()
metadata.interaction_type = speech.RecognitionMetadata.InteractionType.DISCUSSION
metadata.microphone_distance = (
    speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD
)
metadata.recording_device_type = (
    speech.RecognitionMetadata.RecordingDeviceType.SMARTPHONE
)

# Some metadata fields are free form strings
metadata.recording_device_name = "Pixel 2 XL"
# And some are integers, for instance the 6 digit NAICS code
# https://www.naics.com/search/
metadata.industry_naics_code_of_audio = 519190

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    # Add this in the request to send metadata.
    metadata=metadata,
)

response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(u"First alternative of result {}".format(i))
    print(u"Transcript: {}".format(alternative.transcript))