Custom Speech-to-Text models help you fine-tune speech recognition models to your specific needs. The service is designed to enhance the accuracy and relevance of speech recognition in diverse environments and use cases, using your domain-specific audio and text data.
Accessible through both the Google Cloud console and the API, Custom Speech-to-Text models let you train, evaluate, and deploy a dedicated speech model in a no-code integrated environment. For training, you can provide audio-only data that is representative of your audio conditions, without reference transcriptions. However, you must provide both audio data and the corresponding reference transcriptions as part of your evaluation set.
Creating and using a custom Speech-to-Text model involves the following steps:
- Prepare and upload training data in a Cloud Storage bucket.
- Train a new custom model.
- Deploy and manage your custom model using endpoints.
- Use and evaluate your custom model in your application.
How it works
You can use Custom Speech-to-Text models to augment a base transcription model and improve recognition quality. Some audio conditions, such as sirens, music, and excessive background noise, can pose acoustic challenges; certain accents or unusual vocabulary, such as product names, can as well.
Every Custom Speech-to-Text model uses a pre-trained, Conformer-based architecture as a base model, trained with proprietary data of commonly spoken language. During the training process, the base model is fine-tuned by adapting a significant percentage of the original weights to improve recognition of domain-specific vocabulary and of the audio conditions specific to your application.
For the effective training of a Custom Speech-to-Text model, you must provide:
- A minimum of 100 audio-hours of training data, either audio-only or audio with the corresponding text transcript as ground truth. This data is crucial for the initial training phase, in which the model learns the nuances of your speech patterns and vocabulary. For details, see Create a ground-truth dataset.
- A separate dataset of at least 10 audio-hours of validation data, with the corresponding text transcripts as ground truth. You can learn more about the expected format and ground-truth conventions in our data preparation instructions.
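As a quick sanity check before uploading, you can total the duration of your audio files against these minimums. The sketch below is a minimal example assuming PCM WAV files readable by Python's standard `wave` module; real datasets may use other encodings, and the helper names are illustrative, not part of any Google Cloud SDK:

```python
import wave
from pathlib import Path

MIN_TRAINING_HOURS = 100   # minimum audio-hours for the training set
MIN_VALIDATION_HOURS = 10  # minimum audio-hours for the validation set

def total_audio_hours(directory: str) -> float:
    """Sum the durations of all .wav files under a directory, in hours."""
    total_seconds = 0.0
    for path in Path(directory).rglob("*.wav"):
        with wave.open(str(path), "rb") as wf:
            # Duration of one file = frame count / sampling rate.
            total_seconds += wf.getnframes() / wf.getframerate()
    return total_seconds / 3600.0

def dataset_meets_minimum(directory: str, minimum_hours: float) -> bool:
    """Check a dataset directory against a required audio-hour minimum."""
    return total_audio_hours(directory) >= minimum_hours
```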
Following a successful training run, you can deploy a Custom Speech-to-Text model to an endpoint with one click, and use it directly through the Cloud Speech-to-Text V2 API for inference and benchmarking.
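A deployed model is reached through the standard V2 `recognize` method. The sketch below assembles the URL and JSON body for a REST call, assuming a recognizer configured to use your deployed custom model; the project, region, and recognizer IDs are placeholders, and the endpoint and field names should be verified against the V2 API reference:

```python
import base64

def build_recognize_request(project_id: str, region: str,
                            recognizer_id: str, audio_bytes: bytes):
    """Assemble the URL and JSON body for a Speech-to-Text V2 recognize call.

    Assumes the named recognizer already points at your deployed custom
    model; verify the endpoint and field names against the V2 reference.
    """
    url = (f"https://{region}-speech.googleapis.com/v2/"
           f"projects/{project_id}/locations/{region}/"
           f"recognizers/{recognizer_id}:recognize")
    body = {
        # Let the service detect the audio encoding automatically.
        "config": {"autoDecodingConfig": {}},
        # Inline audio content must be base64-encoded for JSON transport.
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return url, body
```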
Supported models, languages, and regions
Custom Speech-to-Text models support the following combinations of base models, languages, and locales for training:
Language | BCP-47 | Base model |
---|---|---|
German (Germany) | de-DE | |
English (Australia) | en-AU | |
English (United Kingdom) | en-GB | |
English (India) | en-IN | |
English (United States) | en-US | |
Spanish (United States) | es-US | |
Spanish (Spain) | es-ES | |
French (Canada) | fr-CA | |
French (France) | fr-FR | |
Hindi (India) | hi-IN | |
Italian (Italy) | it-IT | |
Japanese (Japan) | ja-JP | |
Korean (South Korea) | ko-KR | |
Dutch (Netherlands) | nl-NL | |
Portuguese (Brazil) | pt-BR | |
Portuguese (Portugal) | pt-PT | |
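When wiring language selection into an application, it can help to validate requested locales against this table up front. A minimal sketch; the `SUPPORTED_LOCALES` set simply mirrors the table above and must be kept in sync with the official documentation:

```python
# Supported BCP-47 locales, mirroring the table above.
SUPPORTED_LOCALES = {
    "de-DE", "en-AU", "en-GB", "en-IN", "en-US", "es-US", "es-ES",
    "fr-CA", "fr-FR", "hi-IN", "it-IT", "ja-JP", "ko-KR", "nl-NL",
    "pt-BR", "pt-PT",
}

def is_supported_locale(bcp47: str) -> bool:
    """Check a BCP-47 tag against the supported set, case-insensitively."""
    lang, _, region = bcp47.partition("-")
    return f"{lang.lower()}-{region.upper()}" in SUPPORTED_LOCALES
```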
Additionally, to comply with your data residency requirements, we offer training and deployment hardware in different regions. Dedicated hardware is supported in the following combinations of base models and regions:
Base model | Google Cloud Region | Supported tasks |
---|---|---|
| | Training and Deployment |
| | Training and Deployment |
Quota
For Custom Speech-to-Text model training, each Google Cloud project has enough default quota to run multiple training jobs concurrently, which is intended to meet the needs of most projects without adjustment. However, if you need to run a higher number of concurrent training jobs, or require more extensive labeling or compute resources, request additional quota.
For a Custom Speech-to-Text model deployed to an endpoint, each endpoint has a limit of 20 queries per second (QPS). If you require higher throughput, request additional serving quota.
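To stay under an endpoint's QPS cap on the client side, you can space outgoing requests with a simple limiter. A minimal sketch using monotonic timestamps; a production client would typically also handle retries and backoff:

```python
import threading
import time

class QpsLimiter:
    """Client-side limiter that spaces calls to stay under a QPS cap."""

    def __init__(self, max_qps: float = 20.0):
        self._interval = 1.0 / max_qps  # minimum spacing between calls
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def acquire(self) -> None:
        """Block until the next call is allowed under the cap."""
        with self._lock:
            now = time.monotonic()
            wait = self._next_allowed - now
            # Reserve the next slot whether or not we need to sleep.
            self._next_allowed = max(now, self._next_allowed) + self._interval
        if wait > 0:
            time.sleep(wait)
```

Call `limiter.acquire()` immediately before each request; because the slot is reserved under a lock, the limiter is safe to share across threads.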
Pricing
Creating and using a Custom Speech-to-Text model incurs costs that are primarily based on the resources used during training and subsequent deployment of the model. Specifically, a Custom Speech-to-Text model incurs the following costs in a typical model lifecycle:
- Training: You will be charged for the number of model-training hours. Training time is proportional to the number of audio-hours in the training dataset; as a rule of thumb, training takes about one tenth of the number of audio-hours in the dataset.
- Deployment: You will be charged for each hour that a model is deployed on an endpoint.
- Inference: You will be charged for the number of streamed seconds of audio for transcription, in alignment with the general Speech-to-Text billing.
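The three billed phases above can be combined into a rough cost estimate. A sketch under the stated rule of thumb (training time ≈ one tenth of the training audio-hours); the rate parameters are placeholders, not actual prices, and current rates should be taken from the Cloud Speech-to-Text pricing page:

```python
def estimate_lifecycle_cost(training_audio_hours: float,
                            deployment_hours: float,
                            inference_audio_seconds: float,
                            training_rate_per_hour: float,
                            deployment_rate_per_hour: float,
                            inference_rate_per_second: float) -> float:
    """Rough total cost across the three billed phases of a model lifecycle.

    All rates are placeholder inputs; look up current prices on the
    Cloud Speech-to-Text pricing page.
    """
    # Rule of thumb from the docs: training hours ~ audio-hours / 10.
    training_hours = training_audio_hours / 10.0
    return (training_hours * training_rate_per_hour
            + deployment_hours * deployment_rate_per_hour
            + inference_audio_seconds * inference_rate_per_second)
```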
Understanding these costs is crucial for effective budgeting and resource allocation. For more information, see the Custom Speech-to-Text models section of Cloud Speech-to-Text pricing.
What's next
Use the following resources to take advantage of custom speech models in your application:
- Prepare your training data
- Train and manage your custom models
- Deploy and manage model endpoints
- Use your custom models
- Evaluate your custom models