Training data requirements

Training a custom voice can be an exciting experience. To ensure the resulting model adheres to your vision, follow these instructions and consider working with a voice partner or director.

While some stylistic variability helps bring a voice to life, performance consistency is important throughout your recordings. Any recordings with significant changes in energy, persona, projection level, or vocal fry (for example, due to fatigue) should be re-taken, possibly after a short break for the voice actor. Match reference files should be regularly played for the actor and director to ensure consistency across all recorded lines.

Scripting

If you build your own script, it should follow this pattern:

500 individual recordings (the total duration of all recording files should be around 20 to 30 minutes)
Roughly one recording per line
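As a rough sanity check on these numbers, 500 recordings totaling 20 to 30 minutes works out to an average of about 2.4 to 3.6 seconds per recording. The sketch below shows that arithmetic and a small summary helper. The function names and the idea of checking a list of durations are illustrative assumptions, not part of any official tooling.

```python
def average_seconds_per_recording(total_minutes, recordings=500):
    """Average clip length implied by a total duration and a clip count."""
    return total_minutes * 60 / recordings


def summarize_durations(durations_s, min_minutes=20, max_minutes=30):
    """Summarize clip durations (in seconds) against the 20 to 30 minute guideline."""
    total_minutes = sum(durations_s) / 60
    return {
        "count": len(durations_s),
        "total_minutes": total_minutes,
        "within_guideline": min_minutes <= total_minutes <= max_minutes,
    }


# Average clip length at the low and high end of the guideline.
print(average_seconds_per_recording(20))
print(average_seconds_per_recording(30))
```

Because the guideline says "around" 20 to 30 minutes, treat the within_guideline flag as a hint rather than a hard pass or fail.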

Data formatting

Provide a CSV file that aligns the audio to your script.

Each recording should include only one line from the script, saved as a WAV file. Name your first file 0001.wav, name your second file 0002.wav, and so on.
Column 1: No header. The lines of script in the audio file.
Column 2: The Cloud Storage URI of the WAV audio file. For example: gs://YOUR_BUCKET_NAME/0001.wav.
Align the CSV to the audio exactly so that there are corresponding audio files for each transcript line and there are no blank lines.
Tip: Only include what is spoken in the transcription.
- Don't add line numbers ("5. Where are the rainbows?") or unverbalized codes ("The zip code is 08654" should be written as "The zip code is zero eight six five four.").
- The final spoken words often differ from the initial script. For the best quality, adjust the CSV to match the final spoken words instead of copying and pasting the script itself.
- If the script contains a sequence of characters separated by spaces, pronounce each character individually. For example, pronounce each letter in "o p t i m i z e" individually.
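The layout described above can be sketched with Python's standard csv module. This is a minimal illustration, not an official tool: the function names are hypothetical, YOUR_BUCKET_NAME is the placeholder from the example URI, and the digit check is only a heuristic for spotting line numbers and unverbalized codes.

```python
import csv
import io
import re


def build_rows(transcripts, bucket):
    """Pair each transcript line with gs://<bucket>/0001.wav, 0002.wav, ..."""
    return [
        (text.strip(), f"gs://{bucket}/{index:04d}.wav")
        for index, text in enumerate(transcripts, start=1)
    ]


def find_problems(rows):
    """Return a list of issues: blank lines, digits, or misnumbered files."""
    problems = []
    for n, (text, uri) in enumerate(rows, start=1):
        if not text:
            problems.append(f"row {n}: blank transcript")
        elif re.match(r"\d+[.)]\s", text):
            problems.append(f"row {n}: looks like it starts with a line number")
        elif re.search(r"\d", text):
            problems.append(f"row {n}: contains digits; write them as spoken")
        if not uri.endswith(f"/{n:04d}.wav"):
            problems.append(f"row {n}: expected file {n:04d}.wav, got {uri}")
    return problems


def to_csv(rows):
    """Serialize rows as CSV text with no header row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = build_rows(
    ["Where are the rainbows?", "The zip code is zero eight six five four."],
    "YOUR_BUCKET_NAME",
)
print(find_problems(rows))
print(to_csv(rows), end="")
```

Running find_problems before uploading is a cheap way to catch rows where the transcript and audio file have drifted out of step.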

Recording recommendations

These are the ideal recording requirements. While a model can still be trained without having met these requirements, we cannot guarantee the quality of the model. The most important, and commonly overlooked, requirements are:

Standard audio file format (48kHz/24bit, WAV). Audio can be recorded at a higher sampling rate and downsampled to 48kHz/24bit. Do not up-sample audio from lower rates.
Target average volume is -23 LUFS ±2 (ITU-R BS.1770-3).
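The file-format requirement can be checked before upload with Python's standard wave module. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and some Python versions reject WAV files whose header uses an extended format tag, which the sketch reports as a problem instead of crashing. It does not measure loudness; checking the LUFS target needs a dedicated BS.1770 meter.

```python
import wave


def check_wav_format(path, expected_rate=48000, expected_bytes_per_sample=3):
    """Return a list of format problems (empty if the header looks right)."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()  # bytes per sample; 3 bytes = 24-bit
    except wave.Error as error:
        return [f"could not read as PCM WAV: {error}"]
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if width != expected_bytes_per_sample:
        problems.append(f"bit depth is {width * 8}-bit, expected {expected_bytes_per_sample * 8}-bit")
    return problems
```

A header check like this cannot tell whether audio was up-sampled from a lower rate, so it complements the recording guidance and does not replace it.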

Recording specifications

Standard audio file format (48kHz/24bit, WAV). Audio can be recorded at a higher sampling rate and downsampled to 48kHz/24bit. Do not up-sample audio from lower rates.
The audio should be recorded without lossy compression. Linear PCM (LPCM) format with a WAV header is required. Provide mono audio.
High-quality professional recording studio with a low reverberation time (RT), also called decay time or room sound.
- Apply acoustic treatment foam to any reflective surfaces until the RT is as low as possible.
Professional large diaphragm condenser microphone (U87, TLM 193, or comparable).
High signal-to-noise ratio (SNR), with proper gain staging and microphone placement.
Audio files should have short silences at the beginning and end (>100 ms and <500 ms). Do not append digital silence (that is, sequences of zeros).
Audio should be recorded flat with no equalization, compression or other DSP.
Make sure the recording is clean, with no obvious background or channel noise.
Specific linguistic artifacts to avoid: vocal fry or creak, breathy delivery, stuttering, and improper pauses in the middle of a sentence.
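The silence requirement above (more than 100 ms and less than 500 ms at each end, with no digital silence) can be approximated in code. This is a rough sketch, not a definitive check: the -60 dBFS quietness threshold and the 10 ms run of exact zeros used to flag digital silence are assumptions chosen for illustration, and the function names are hypothetical. It expects mono 24-bit PCM, as the specifications require.

```python
import wave


def read_mono_24bit(path):
    """Read a mono 24-bit PCM WAV file into a list of signed integers."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 3 or wav.getnchannels() != 1:
            raise ValueError("expected mono 24-bit PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = [
        int.from_bytes(raw[i:i + 3], "little", signed=True)
        for i in range(0, len(raw), 3)
    ]
    return samples, rate


def _leading_run(samples, predicate):
    """Count how many samples at the start satisfy the predicate."""
    count = 0
    for sample in samples:
        if not predicate(sample):
            break
        count += 1
    return count


def check_edges(samples, rate, threshold_dbfs=-60.0, min_ms=100, max_ms=500, zero_run_ms=10):
    """Report silence length and suspected digital silence at both ends."""
    # Assumed quietness threshold, converted from dBFS to a 24-bit amplitude.
    threshold = int((2 ** 23) * 10 ** (threshold_dbfs / 20))
    report = {}
    for name, seq in (("start", samples), ("end", samples[::-1])):
        silence_ms = 1000 * _leading_run(seq, lambda s: abs(s) <= threshold) / rate
        zeros_ms = 1000 * _leading_run(seq, lambda s: s == 0) / rate
        report[name] = {
            "silence_ms": silence_ms,
            "length_ok": min_ms < silence_ms < max_ms,
            # Real room tone is almost never exactly zero for long.
            "digital_silence": zeros_ms >= zero_run_ms,
        }
    return report
```

For example, samples, rate = read_mono_24bit("0001.wav") followed by check_edges(samples, rate) returns one entry for the start of the file and one for the end. A noisy room may need a higher threshold than the assumed default.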

Match reference files

Reference recordings, or match files, are captured at the beginning of a recording project. They are used for the entirety of the project and should not change. They represent the hallmark characteristics of the performance in terms of persona, volume, energy, cadence, articulation, intonation, and spectral properties. The match file serves as the reference for all subsequent recordings: it is used throughout a recording session to calibrate signal capture and to keep the performance consistent.

Create a match reference file

Match files are recorded in collaboration with the director (who indicates the type of performance they are seeking) and the recording engineer (who makes sure that the match file is captured at the proper audio spec level). All recorded audio should conform to the match file's characteristics. Use these files to ensure consistency of the following parameters throughout the recording:

Persona and style continuity
Root pitch or tone of the performance
Rate of speech
Volume

What's next

Now that the data is ready, you can create your custom voice model.