# Prepare training data

| **Preview**
|
| This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section
| of the [Service Specific Terms](/terms/service-terms#1).
|
| Pre-GA features are available "as is" and might have limited support.
|
| For more information, see the
| [launch stage descriptions](/products#product-launch-stages).

Learn to prepare your audio and text data for fine-tuning a Custom Speech-to-Text model in the Google Cloud Speech console. The quality of your training data directly affects the effectiveness of the models you create. Compose a diverse dataset that contains representative audio and text context relevant to what the model will encounter at inference time in production, including noise and unusual vocabulary.
For the effective training of a Custom Speech-to-Text model, you need:
- A minimum of 100 audio-hours of training data, either audio-only or audio with the corresponding text transcript as ground truth. This data is crucial for the initial training phase, so that the model learns the nuances of speech patterns and vocabulary. For details, see [Create a ground-truth dataset](/speech-to-text/v2/docs/custom-speech-models/use-model#create_a_ground-truth_dataset).
- A separate dataset of at least 10 audio-hours of validation data, with the corresponding text transcript as ground truth.
Before you begin
----------------
Ensure that you have signed up for a Google Cloud account, created a Google Cloud project, and enabled the Speech-to-Text API:
1. Navigate to Cloud Storage.
2. Create a bucket, if you don't already have one, as sketched after these steps.
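If you prefer to script bucket creation instead of using the console, the following is a minimal sketch using the `google-cloud-storage` Python client. The project ID, bucket name, and location are hypothetical placeholders, and the snippet assumes Application Default Credentials are already configured.

```python
from google.cloud import storage

# Hypothetical project ID; assumes Application Default Credentials are set up
# (for example, via `gcloud auth application-default login`).
client = storage.Client(project="my-project-id")

bucket_name = "my-custom-stt-data"  # placeholder; bucket names must be globally unique
if client.lookup_bucket(bucket_name) is None:
    bucket = client.create_bucket(bucket_name, location="us-central1")
    print(f"Created bucket gs://{bucket.name}")
else:
    print(f"Bucket gs://{bucket_name} already exists")
```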
Create a dataset
----------------
To create a dataset, create two subdirectories in the Cloud Storage bucket of your choice. Follow these simple naming conventions:

1. Create a **training_dataset** subdirectory to store all your training files.
2. Create a **validation_dataset** subdirectory to store all your validation files.
3. Upload your audio and text files into these directories by following the [Ground-truth annotation guidelines](/speech-to-text/v2/docs/custom-speech-models/prepare-data#ground-truth_annotation_guidelines). For an illustrative upload script, see the sketch after these steps.
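As an illustration of step 3, here is a rough upload sketch using the `google-cloud-storage` Python client. The bucket name and local directory names are hypothetical; Cloud Storage has no real folders, so `training_dataset/` and `validation_dataset/` are simply prefixes in the object names.

```python
import pathlib

from google.cloud import storage

# Hypothetical bucket and local directories; adjust to your own layout.
BUCKET_NAME = "my-custom-stt-data"
LOCAL_DIRS = {
    "local_training_data": "training_dataset",      # destination prefix for training files
    "local_validation_data": "validation_dataset",  # destination prefix for validation files
}

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for local_dir, prefix in LOCAL_DIRS.items():
    for path in sorted(pathlib.Path(local_dir).glob("*")):
        if path.suffix.lower() not in (".wav", ".txt"):
            continue  # only LINEAR16 .wav audio and .txt transcripts are expected
        blob = bucket.blob(f"{prefix}/{path.name}")
        blob.upload_from_filename(str(path))
        print(f"Uploaded gs://{BUCKET_NAME}/{prefix}/{path.name}")
```

Keeping each WAV/TXT pair under the same base name, as required by the dataset guidelines below, makes it easy to spot missing transcripts later.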
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-07-24(UTC)"],[],[],null,["# Prepare training data\n\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nLearn to prepare your audio and text data for fine-tuning a Custom Speech-to-Text model in Google Cloud Speech console. The quality of your training data impacts the effectiveness of the models you create. You will need to compose a diverse dataset that contains representative audio and text context that's directly relevant to what the model will respond to during inference time in production, including noise and unusual vocabulary.\n\nFor the effective training of a Custom Speech-to-Text model, you need:\n\n- Minimum 100 audio-hours of training data, either audio-only or with the corresponding text transcript as ground-truth. This data is crucial for the initial training phase, so the model learns the nuances of the speech patterns and vocabulary. For details, see [Create a ground-truth dataset](/speech-to-text/v2/docs/custom-speech-models/use-model#create_a_ground-truth_dataset.)\n- A separate dataset of at least 10 audio-hours of validation data, with the corresponding text transcript as ground-truth.\n\nBefore you begin\n----------------\n\nEnsure you have signed up for a Google Cloud account, created a Google Cloud project, and enabled the Speech-to-Text API:\n\n1. Navigate to Cloud Storage.\n2. Create a bucket, if you don't already have one.\n\nCreate a dataset\n----------------\n\nTo create a dataset, you will need to create two subdirectories in the Cloud Storage bucket of your choice. Follow simple naming conventions:\n\n1. Create a **training_dataset** subdirectory to store all your training files.\n2. Create a **validation_dataset** subdirectory to store all your training files.\n3. Upload your audio and text files in the directories by following the [Ground-truth annotation guidelines](/speech-to-text/v2/docs/custom-speech-models/prepare-data#ground-truth_annotation_guidelines).\n\nDataset guidelines\n------------------\n\n- For both training and validation, supported file formats are `.wav` for audio files in LINEAR16 encoding and `.txt` for text files, if available. Avoid non-ASCII characters in the filenames.\n- Audio files in the same directory should be provided in a separate TXT file, each with the same name as the corresponding WAV file, for example, my_file_1.wav, my_file_1.txt. There should be only one transcription file per audio file.\n\n### Training data\n\n- All files for training must be provided under the same directory, without any nested folders.\n- Optional: If available, provide transcriptions to the audio files. No timestamps are required.\n- Ensure that the cumulative audio length of your audio files is longer than 100 hours. 
Here is an example of how the directory structure should look after the files are uploaded as a training dataset:

```
├── training_dataset
│   ├── example_1.wav
│   ├── example_1.txt
│   ├── example_2.wav
│   ├── example_2.txt
│   ├── example_3.wav (Note: audio-only instance, without corresponding text)
│   └── example_4.wav (Note: audio-only instance, without corresponding text)
```

### Validation data

- All files for validation must be provided in the same directory, named **validation_dataset**, without any nested folders.
- Validation audio files shouldn't be longer than 30 seconds each.
- Provide a ground-truth transcription for each audio file in the same directory, in a separate TXT file.

Here is an example of how the directory structure should look after the files are uploaded as a validation dataset:

```
├── validation_dataset
│   ├── example_1.wav
│   ├── example_1.txt
│   ├── example_2.wav
│   └── example_2.txt
```

Ground-truth annotation guidelines
----------------------------------

Refer to the following formatting instructions.

### Numbers

Transcribe cardinals and ordinals only in digits.

- **Audio**: "A deck of cards has fifty two cards, thirteen ranks of the four suits: diamonds, clubs, hearts, and spades"
- **Ground-truth text**: "A deck of cards has 52 cards, 13 ranks of the four suits: diamonds, clubs, hearts, and spades"

### Currency and units

Transcribe currency and units as they're commonly written in the transcription locale. Abbreviate all units that follow numeric values. If it's clear from context that a number or number sequence refers to currency or time, format it as such.

### Date and time

Transcribe dates and times in the form commonly used in the transcription language. Write times in `hh:mm` format, when possible.

### Addresses

Transcribe locations, roads, and states with their full names; use abbreviations only when they are explicitly spoken. Separate entities and locations with a comma.

### Proper names and accents

Transcribe proper names using the official spelling and punctuation. If a personal name could have multiple spellings and context doesn't help, use the most frequent spelling.

### Brand, product names, and media titles

Transcribe brand names, product names, and media titles as they're officially formatted and most commonly written.

### Interjections

Transcribe laughter or other non-speech vocalizations using up to three syllables. Ignore laughter that occurs within speech completely.
Example:

- **Audio**: "ha ha ha ha ha"
- **Ground-truth text**: "hahaha"

### Multiple speakers

Don't separate speakers with speaker tags, because diarization is generally not supported.

What's next
-----------

Use the following resources to take advantage of custom speech models in your application:

- [Train and manage your custom models](/speech-to-text/v2/docs/custom-speech-models/train-model)
- [Deploy and manage model endpoints](/speech-to-text/v2/docs/custom-speech-models/deploy-model)
- [Use your custom models](/speech-to-text/v2/docs/custom-speech-models/use-model)
- [Evaluate your custom models](/speech-to-text/v2/docs/custom-speech-models/evaluate-model)