This page documents Generally Available updates to Speech-to-Text. Check this page periodically for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.
You can see the latest product updates for all of Google Cloud on the Google Cloud release notes page.
To get the latest product updates, add the URL of this page to your feed reader, or add the feed URL directly: https://cloud.google.com/feeds/speech-release-notes.xml
October 07, 2024
Speech-to-Text has updated the Generally Available Chirp 2 model, further enhancing its ASR accuracy and multilingual capabilities. Under the existing chirp_2 model flag, you can experience significant improvements in accuracy and speed, as well as support for word-level timestamps, model adaptation, and speech translation. Finally, Chirp 2 now supports Streaming Recognizer requests, in addition to the already supported Sync and Batch Recognition requests, allowing its use in real-time applications.
Explore the new chirp_2 model's capabilities and learn how to leverage its full potential by visiting our updated documentation and tutorials.
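As a sketch of what a request against the updated model might look like, the following v2-style request body selects the chirp_2 model and enables word-level timestamps. Field names follow the Speech-to-Text v2 REST API as I understand it; the bucket path is a placeholder.

```python
# Hypothetical Speech-to-Text v2 recognize request body using chirp_2.
# The bucket and object names are illustrative placeholders.
recognize_body = {
    "config": {
        "model": "chirp_2",
        "languageCodes": ["en-US"],
        "autoDecodingConfig": {},  # let the service detect the audio encoding
        "features": {
            "enableWordTimeOffsets": True,  # word-level timestamps
        },
    },
    "uri": "gs://my-bucket/meeting.wav",
}
```

The same config shape would be reused for a Streaming Recognizer session; only the transport differs.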
January 09, 2024
Model adaptation is now available for latest_long models in 13 languages. In addition, adaptation quality has been substantially improved for latest_short models. To determine whether this feature is available for your language, see Language support.
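A minimal sketch of how model adaptation might be paired with the latest_long model in a v1-style request body; the phrase value, boost, and bucket path are illustrative placeholders, not values from this release note.

```python
# Hypothetical v1 recognize request combining model adaptation
# with the latest_long model. Phrase and bucket values are placeholders.
request = {
    "config": {
        "languageCode": "en-US",
        "model": "latest_long",
        "adaptation": {
            "phraseSets": [
                # Inline phrase set biasing recognition toward a product name.
                {"phrases": [{"value": "Speech-to-Text", "boost": 10.0}]}
            ]
        },
    },
    "audio": {"uri": "gs://my-bucket/recording.flac"},
}
```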
January 08, 2024
Speech-to-Text has launched a new model, named chirp_telephony, to bring the accuracy gains of our chirp model to telephony-specific use cases. The new model is a fine-tuned version of our very successful chirp model, based on the Universal Speech Model (USM) architecture, trained on audio that originated from phone calls, typically recorded at an 8 kHz sampling rate. For more information, see Speech-to-Text supported languages.
November 06, 2023
Speech-to-Text has launched two models, named telephony and telephony_short. The two models are customized to recognize audio that originates from a phone call, and they correspond to the most recent versions of the existing phone_call model. For more information, see Speech-to-Text supported languages.
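Selecting either new model is a matter of setting the model field in the recognition config; a minimal sketch (the language choice is illustrative):

```python
# Minimal v1 config sketches selecting the new telephony models.
# telephony_short is intended for brief utterances, e.g. IVR commands.
long_call_config = {"languageCode": "en-US", "model": "telephony"}
short_utterance_config = {"languageCode": "en-US", "model": "telephony_short"}
```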
February 07, 2023
We are removing the SpeechContext.strength field within the next 4 weeks, because it has been deprecated and unused for more than a year. The documentation no longer references this field, and clients should not use it.
November 11, 2022
Speech-to-Text has updated its pricing policy. Enhanced models are no longer priced differently than standard models. Usage of all models will be reported to and priced like standard models. Also, all Cloud Speech-to-Text requests will now be rounded up to the nearest 1 second, with no minimum audio length (requests were previously rounded up to the nearest 15 seconds). See the Pricing page for details.
October 03, 2022
Speaker Diarization is now available for "Latest" models in en-US. This feature recognizes multiple speakers in the same audio clip. The "Latest" models use a newer diarization model than previous models did. For more information, see Speaker Diarization.
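A hedged sketch of a v1-style config enabling speaker diarization with a "Latest" model; the speaker-count bounds are illustrative, not prescribed by this note.

```python
# Hypothetical v1 config enabling speaker diarization on latest_long.
# min/max speaker counts are illustrative placeholders.
config = {
    "languageCode": "en-US",
    "model": "latest_long",
    "diarizationConfig": {
        "enableSpeakerDiarization": True,
        "minSpeakerCount": 2,
        "maxSpeakerCount": 6,
    },
}
```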
April 21, 2022
"Latest" models are available in more than 20 languages. These models employ new end-to-end machine learning techniques and can improve the accuracy of your recognized speech. For more information see Latest models.
November 08, 2021
Speech-to-Text has launched two new medical speech models, which are tailored for recognition of words that are common in medical settings. See the medical models documentation for more details.
July 21, 2021
Speech-to-Text has launched a GA version of the Spoken Emoji and Spoken Punctuation features. See the documentation for details.
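In the v1 REST API these features are, to my understanding, controlled by two boolean config fields; a minimal sketch:

```python
# Sketch of a v1 config turning on the GA Spoken Punctuation and
# Spoken Emoji features.
config = {
    "languageCode": "en-US",
    "enableSpokenPunctuation": True,  # spoken "period" becomes "."
    "enableSpokenEmojis": True,       # spoken "smiley face" becomes an emoji
}
```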
June 28, 2021
Speech-to-Text now supports multi-region endpoints as a GA feature. See the multi-region endpoints documentation for more information.
May 24, 2021
Speech-to-Text now supports Spoken Punctuation and Spoken Emoji as Preview features. See the documentation for details.
May 07, 2021
The Speech-to-Text model adaptation feature is now a GA feature. See the model adaptation concepts page for more information about using this feature.
March 23, 2021
Speech-to-Text now allows you to upload your longrunning transcription results directly into a Cloud Storage bucket. See the asynchronous speech recognition documentation for more details.
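A sketch of a longrunningrecognize request body that directs the transcription result to a Cloud Storage bucket via an output config; the bucket and object names are placeholders.

```python
# Hypothetical v1 longrunningrecognize request writing results to GCS.
# Bucket and object names are illustrative placeholders.
request = {
    "config": {"languageCode": "en-US"},
    "audio": {"uri": "gs://my-bucket/long-audio.flac"},
    "outputConfig": {"gcsUri": "gs://my-bucket/results/transcript.json"},
}
```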
March 15, 2021
Speech-to-Text has launched the Model Adaptation feature. You can now create custom classes and build phrase sets to improve your transcription results.
January 26, 2021
Speech-to-Text now supports regional EU and US endpoints. See the multi-region endpoints documentation for more information.
August 25, 2020
Speech-to-Text has launched the new On-Prem API. Speech-to-Text On-Prem enables easy integration of Google speech recognition technologies into your on-premises solution.
March 05, 2020
Cloud Speech-to-Text now supports seven new languages: Burmese, Estonian, Uzbek, Punjabi, Albanian, Macedonian, and Mongolian.
The speaker diarization, automatic punctuation, speech adaptation boost, and enhanced telephony model features are now available for new languages. See the supported languages page for a complete list.
Class tokens are now available for general use. You can use class tokens with speech adaptation to help the model recognize concepts in your recorded audio data.
November 26, 2019
Automatic punctuation is now available for general use. Cloud Speech-to-Text can insert punctuation into transcription results, including commas, periods, and question marks.
July 23, 2019
Cloud Speech-to-Text has several endless streaming tutorials that demonstrate how to transcribe an infinite audio stream.
You can now use speech adaptation to provide 'hints' to Cloud Speech-to-Text when it performs speech recognition. This feature is now in beta.
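The hints are supplied as phrases in the speechContexts field of the request config; a minimal sketch (the phrase values are illustrative):

```python
# Sketch of a v1-style config providing speech adaptation "hints"
# via speechContexts. The phrase values are placeholders.
config = {
    "languageCode": "en-US",
    "speechContexts": [
        {"phrases": ["Cloud Speech-to-Text", "recognizer"]}
    ],
}
```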
June 18, 2019
Cloud Speech-to-Text has expanded its streaming recognition limit to 5 minutes. To use streaming recognition with the 5-minute limit, you must use the v1p1beta1 API version.
Cloud Speech-to-Text now supports transcription of MP3-encoded audio data. Because this feature is in beta, you must use the v1p1beta1 API version.
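Since MP3 decoding was beta-only at this point, a request would target the v1p1beta1 endpoint; a sketch (the endpoint URL shape, sample rate, and file name are illustrative):

```python
# Hypothetical v1p1beta1 request transcribing MP3 audio.
# Endpoint path, sample rate, and bucket object are placeholders.
endpoint = "https://speech.googleapis.com/v1p1beta1/speech:recognize"
request = {
    "config": {
        "encoding": "MP3",
        "sampleRateHertz": 44100,
        "languageCode": "en-US",
    },
    "audio": {"uri": "gs://my-bucket/podcast.mp3"},
}
```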
April 04, 2019
The v1beta1 version of the service is no longer available for use. You must migrate your solutions to either the v1 or the v1p1beta1 version of the API.
February 20, 2019
Data logging is now available for general use. When you enable data logging, you can reduce the cost of using Cloud Speech-to-Text by allowing Google to log your data in order to improve the service.
Enhanced models are now available for general use. Using enhanced models can improve audio transcription results.
Using enhanced models no longer requires you to opt in to data logging. Enhanced models are available for use by any transcription request, at a different price than standard models.
Selecting a transcription model is now available for general use. You can select different speech recognition models when you send a request to Cloud Speech-to-Text, including a model optimized for transcribing audio data from video files.
Cloud Speech-to-Text can transcribe audio data that includes multiple channels. This feature is now available for general use.
You can now include more details about your audio source files in transcription requests to Cloud Speech-to-Text in the form of recognition metadata, which can improve the results of the speech recognition. This feature is now available for general use.
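One config sketch can illustrate the three GA features above together: model selection, multi-channel recognition, and recognition metadata. Field names follow the v1 REST API as I understand it; the values are placeholders.

```python
# Hypothetical v1 config combining model selection, multi-channel
# recognition, and recognition metadata. Values are placeholders.
config = {
    "languageCode": "en-US",
    "model": "video",  # model optimized for audio from video files
    "audioChannelCount": 2,
    "enableSeparateRecognitionPerChannel": True,  # one result per channel
    "metadata": {
        "interactionType": "DISCUSSION",
        "recordingDeviceType": "SMARTPHONE",
    },
}
```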
July 24, 2018
Cloud Speech-to-Text now provides word-level confidence. Developers can use this feature to get the degree of confidence on a word-by-word level. This feature is in Beta.
Cloud Speech-to-Text can automatically detect the language used in an audio file. To use this feature, developers must specify alternative languages in their transcription request. This feature is in Beta.
Cloud Speech-to-Text can identify different speakers present in an audio file. This feature is in Beta.
Cloud Speech-to-Text can transcribe audio data that includes multiple channels. This feature is in Beta.
April 09, 2018
Cloud Speech-to-Text now provides data logging and enhanced models. Developers that want to take advantage of the enhanced speech recognition models can opt-in for data logging. This feature is in Beta.
Cloud Speech-to-Text can insert punctuation into transcription results, including commas, periods, and question marks. This feature is in Beta.
You can now select different speech recognition models when you send a request to Cloud Speech-to-Text, including a model optimized for transcribing audio from video files. This feature is in Beta.
You can now include more details about your audio source files in transcription requests to Cloud Speech-to-Text in the form of recognition metadata, which can improve the results of the speech recognition. This feature is in Beta.
January 16, 2018
Support for the OGG_OPUS audio encoding has been expanded to include 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, and 48000 Hz sample rates.
August 10, 2017
Time offsets (timestamps) are now available. Set the enableWordTimeOffsets parameter to true in your request configuration, and Cloud Speech-to-Text will include time offset values for the beginning and end of each spoken word that is recognized in the audio for your request. For more information, see Time offsets (timestamps).
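A minimal sketch of a request body with the parameter set; the bucket path is a placeholder.

```python
# Sketch of a v1 recognize request with word time offsets enabled.
# The audio URI is an illustrative placeholder.
request = {
    "config": {
        "languageCode": "en-US",
        "enableWordTimeOffsets": True,
    },
    "audio": {"uri": "gs://my-bucket/audio.raw"},
}
```

Each recognized word in the response then carries start and end time offset values.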
Cloud Speech-to-Text has added recognition support for 30 new languages. For a complete list of all supported languages, see the language support reference.
The limit on the length of audio that you can send with an asynchronous recognition request has been increased from ~80 to ~180 minutes. For information on Cloud Speech-to-Text limits, see the quotas & limits. For more information on asynchronous recognition requests, see the transcribing long audio files guide.
April 18, 2017
Release of Cloud Speech-to-Text v1.
The v1beta1 release of Cloud Speech-to-Text has been deprecated. The v1beta1 endpoint continues to be available for a period of time as defined in the terms of service. To avoid being impacted when v1beta1 is discontinued, replace references to v1beta1 in your code with v1, and update your code with valid v1 API names and values.
A language_code is now required with requests to Cloud Speech-to-Text. Requests with a missing or invalid language_code will return an error. (Pre-release versions of the API used en-US if the language_code was omitted from the request.)
SyncRecognize is renamed to Recognize, and v1beta1/speech:syncrecognize is renamed to v1/speech:recognize. The behavior is unchanged.
The sample_rate field has been renamed to sample_rate_hertz. The behavior is unchanged.
The EndpointerType enum has been renamed to SpeechEventType.
The following SpeechEventType enums have been removed: START_OF_SPEECH, END_OF_SPEECH, and END_OF_AUDIO.
The END_OF_UTTERANCE enum has been renamed to END_OF_SINGLE_UTTERANCE. The behavior is unchanged.
The result_index field has been removed.
The speech_context field has been replaced by the speech_contexts field, which is a repeated field. However, you can specify at most one speech context. The behavior is unchanged.
The SPEEX_WITH_HEADER_BYTE and OGG_OPUS codecs have been added to support audio encoder implementations for legacy applications. We do not recommend using lossy codecs, as they result in a lower-quality speech transcription. If you must use a low-bitrate encoder, OGG_OPUS is preferred.
You are no longer required to specify the encoding and sample rate for WAV or FLAC files. If omitted, Cloud Speech-to-Text automatically determines the encoding and sample rate for WAV or FLAC files based on the file header. If you specify an encoding or sample rate value that does not match the value in the file header, then Cloud Speech-to-Text will return an error. This change is backwards-compatible and will not invalidate any currently valid requests.
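Because WAV and FLAC file headers already carry the encoding and sample rate, a request config can now omit both fields, as this sketch shows (the bucket path is a placeholder):

```python
# Sketch of a v1 request for a FLAC file: encoding and sampleRateHertz
# are omitted and read from the file header. The URI is a placeholder.
request = {
    "config": {"languageCode": "en-US"},  # no encoding / sampleRateHertz
    "audio": {"uri": "gs://my-bucket/audio.flac"},
}
```

Supplying a value that contradicts the file header would now produce an error, so omitting both fields is the safer choice for these formats.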