[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-07-24。"],[],[],null,["# Prepare training data\n\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nLearn to prepare your audio and text data for fine-tuning a Custom Speech-to-Text model in Google Cloud Speech console. The quality of your training data impacts the effectiveness of the models you create. You will need to compose a diverse dataset that contains representative audio and text context that's directly relevant to what the model will respond to during inference time in production, including noise and unusual vocabulary.\n\nFor the effective training of a Custom Speech-to-Text model, you need:\n\n- Minimum 100 audio-hours of training data, either audio-only or with the corresponding text transcript as ground-truth. This data is crucial for the initial training phase, so the model learns the nuances of the speech patterns and vocabulary. For details, see [Create a ground-truth dataset](/speech-to-text/v2/docs/custom-speech-models/use-model#create_a_ground-truth_dataset.)\n- A separate dataset of at least 10 audio-hours of validation data, with the corresponding text transcript as ground-truth.\n\nBefore you begin\n----------------\n\nEnsure you have signed up for a Google Cloud account, created a Google Cloud project, and enabled the Speech-to-Text API:\n\n1. Navigate to Cloud Storage.\n2. Create a bucket, if you don't already have one.\n\nCreate a dataset\n----------------\n\nTo create a dataset, you will need to create two subdirectories in the Cloud Storage bucket of your choice. Follow simple naming conventions:\n\n1. Create a **training_dataset** subdirectory to store all your training files.\n2. Create a **validation_dataset** subdirectory to store all your training files.\n3. Upload your audio and text files in the directories by following the [Ground-truth annotation guidelines](/speech-to-text/v2/docs/custom-speech-models/prepare-data#ground-truth_annotation_guidelines).\n\nDataset guidelines\n------------------\n\n- For both training and validation, supported file formats are `.wav` for audio files in LINEAR16 encoding and `.txt` for text files, if available. Avoid non-ASCII characters in the filenames.\n- Audio files in the same directory should be provided in a separate TXT file, each with the same name as the corresponding WAV file, for example, my_file_1.wav, my_file_1.txt. There should be only one transcription file per audio file.\n\n### Training data\n\n- All files for training must be provided under the same directory, without any nested folders.\n- Optional: If available, provide transcriptions to the audio files. No timestamps are required.\n- Ensure that the cumulative audio length of your audio files is longer than 100 hours. 
Here is an example of how the directory structure should look after the files are uploaded as a training dataset:

```
├── training_dataset
│ ├── example_1.wav
│ ├── example_1.txt
│ ├── example_2.wav
│ ├── example_2.txt
│ ├── example_3.wav (Note: audio-only instance, without corresponding text)
│ └── example_4.wav (Note: audio-only instance, without corresponding text)
```

### Validation data

- All files for validation are provided in the same directory, named **validation_dataset**, without any nested folders.
- Validation audio files shouldn't be longer than 30 seconds each.
- Provide ground-truth transcriptions for each of the audio files in the same directory, in a separate TXT file.

Here is an example of how the directory structure should look after the files are uploaded as a validation dataset:

```
├── validation_dataset
│ ├── example_1.wav
│ ├── example_1.txt
│ ├── example_2.wav
│ └── example_2.txt
```

Ground-truth annotation guidelines
----------------------------------

Refer to the following formatting instructions.

### Numbers

Cardinals and ordinals should be transcribed in digits only.

- **Audio**: "A deck of cards has fifty two cards, thirteen ranks of the four suits, clubs, diamonds, hearts, and spades"
- **Ground-truth text**: "A deck of cards has 52 cards, 13 ranks of the four suits, clubs, diamonds, hearts, and spades"

### Currency and units

Transcribe them as they're commonly written in the transcription locale. Abbreviate all units that follow numeric values. If it's clear from context that a number or number sequence refers to currency or time, format it as such.

### Date and time

Transcribe dates and times in the form commonly used in the transcription language. Write times in `hh:mm` format when possible.

### Addresses

Transcribe locations, roads, and states with their full names; use abbreviations only when they are explicitly spoken. Separate entities and locations with a comma.

### Proper names and accents

Transcribe using the official spelling and punctuation. If a personal name could have multiple spellings and context doesn't help, use the most frequent spelling.

### Brand, product names, and media titles

Transcribe them as they're officially formatted and most commonly written.

### Interjections

Laughter and other non-speech vocalizations should be transcribed using up to three syllables. Laughter that occurs within speech should be ignored completely.
Example:

- Audio: "ha ha ha ha ha"
- Ground-truth text: "hahaha"

### Multiple speakers

Don't separate speakers with speaker tags, because diarization is generally not supported.

What's next
-----------

Follow these resources to take advantage of custom speech models in your application:

- [Train and manage your custom models](/speech-to-text/v2/docs/custom-speech-models/train-model)
- [Deploy and manage model endpoints](/speech-to-text/v2/docs/custom-speech-models/deploy-model)
- [Use your custom models](/speech-to-text/v2/docs/custom-speech-models/use-model)
- [Evaluate your custom models](/speech-to-text/v2/docs/custom-speech-models/evaluate-model)

Last updated (UTC): 2025-07-24.