Model training

We suggest that you find and work with a voice actor who represents the custom voice you're aiming for. Record about 10 seconds of audio with your voice actor to use as training data, and also record the voice actor's consent statement. Training and serving the cloned model takes only a few minutes. There is no SLA support for critical bugs in pre-GA features.

Step 1: Create training data for cloning

Record the consent statement: To comply with legal and ethical guidelines for voice cloning, record the required consent statement, in the appropriate language, as a mono WAV file with LINEAR16 encoding and a 24 kHz sampling rate. The statement reads: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."
Record initial audio: Use your computer microphone to record 10 seconds of audio as a LINEAR16-encoded, mono WAV file at a 24 kHz sampling rate. Ensure there is no background noise during the recording.
Store audio files: Save the recorded audio files in a designated Cloud Storage location.
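Before uploading, it can help to confirm that each recording matches the format described above. The following is a minimal sketch that uses only Python's standard `wave` module; the function name `check_training_wav` is hypothetical and is not part of any Text-to-Speech tooling. It assumes that "LINEAR16" corresponds to 16-bit PCM samples, which is what a 2-byte sample width in a PCM WAV file represents.

```python
import wave

# Format requirements taken from the steps above.
REQUIRED_CHANNELS = 1            # mono
REQUIRED_SAMPLE_WIDTH_BYTES = 2  # LINEAR16: 16-bit PCM samples (assumption)
REQUIRED_SAMPLE_RATE_HZ = 24000  # 24 kHz


def check_training_wav(path):
    """Return (problems, duration_seconds) for a PCM WAV file.

    `problems` is a list of human-readable strings; an empty list means
    the file is mono, 16-bit, and sampled at 24 kHz.
    """
    problems = []
    # wave.open raises wave.Error if the file is not a PCM WAV file.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        sample_rate = wav.getframerate()
        duration_seconds = wav.getnframes() / sample_rate

    if channels != REQUIRED_CHANNELS:
        problems.append(f"expected mono audio, found {channels} channels")
    if sample_width != REQUIRED_SAMPLE_WIDTH_BYTES:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if sample_rate != REQUIRED_SAMPLE_RATE_HZ:
        problems.append(f"expected 24000 Hz, found {sample_rate} Hz")
    return problems, duration_seconds
```

The duration is returned rather than enforced, so you can compare it against the roughly 10 seconds of training audio yourself. The check cannot detect background noise; that still requires listening to the recording.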

Step 2: Create a cloned model

You can create a cloning project through the Text-to-Speech console.

Navigate to the Synthesize page within the Text-to-Speech console.
Enter the text that will be synthesized into speech, and select the target language code (only en-US is supported).
Select the Custom voice checkbox, and click Generate key.
Complete all the required fields in the subtask that opens.
A voice cloning key should now appear in the synthesize form:
- Save this key to skip the "Generate key" step in the future.
- Note: We don't retain your key. Anyone with access to your Cloud project can use it to generate synthetic speech with your cloned voice, so be sure to keep it secure.
Expand the Advanced settings section, enter 24000 in the Sample rate (Hertz) field, and then click Synthesize.

Cloned voices can currently be synthesized only at 24 kHz.

You can download or play the audio right away to hear how it sounds.
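If you later want to script the same settings instead of using the console, the sketch below collects them into a request payload. It is an illustration under stated assumptions, not a confirmed API reference: the JSON field names (`voiceClone`, `voiceCloningKey`, `sampleRateHertz`, and so on), the helper names, and the `VOICE_CLONING_KEY` environment variable are all assumptions that this page does not specify. The sketch makes no network call. It reads the key from the environment rather than from source code, because the key is not retained for you and should be kept secure.

```python
import os

SUPPORTED_SAMPLE_RATE_HZ = 24000  # cloned voices are synthesized only at 24 kHz


def load_key_from_env(name="VOICE_CLONING_KEY"):
    """Read the voice cloning key from an environment variable.

    The variable name is hypothetical; choose whatever fits your setup.
    """
    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"environment variable {name} is not set")
    return key


def build_synthesis_request(text, voice_cloning_key,
                            language_code="en-US",
                            sample_rate_hertz=SUPPORTED_SAMPLE_RATE_HZ):
    """Build a request body mirroring the console form fields.

    The field names are assumptions for illustration only.
    """
    if not voice_cloning_key:
        raise ValueError("a voice cloning key is required")
    if sample_rate_hertz != SUPPORTED_SAMPLE_RATE_HZ:
        raise ValueError("cloned voices are synthesized only at 24000 Hz")
    return {
        "input": {"text": text},
        "voice": {
            "languageCode": language_code,
            "voiceClone": {"voiceCloningKey": voice_cloning_key},
        },
        "audioConfig": {
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hertz,
        },
    }
```

A caller would combine the two helpers, for example `build_synthesis_request("Hello", load_key_from_env())`, and then serialize the result with `json.dumps` for whatever client is used to send it.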