Measure and improve accuracy

In this quickstart, learn how to measure and improve the accuracy of Google Cloud Speech-to-Text for your audio data. Also explore the various models and options available from the API to enhance transcription accuracy, and learn how to use the Speech-to-Text UI in the Google Cloud console with a ground truth file to measure accuracy and gain insights into the Speech-to-Text system.

Machine Learning (ML) systems are inherently subject to inaccuracies, and Automatic Speech Recognition (ASR) systems, also known as Speech-to-Text systems, are no exception. How accuracy should be measured depends heavily on the specific use case and the system being evaluated, because differences in audio recording quality and acoustic conditions can significantly affect accuracy. As a result, a single accuracy score for all customers and use cases is impractical. To ensure reliable performance of ASR systems in critical production-facing systems, measure accuracy on audio that represents your own use case. It is also essential to understand how Speech-to-Text performs within the broader context of your system.

For the purposes of this quickstart guide, use the industry-standard method for comparison, Word Error Rate (WER). For more information on how WER is calculated and interpreted, see Measure and improve speech accuracy. Let's start.
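
As a quick reminder of what the metric counts, WER is conventionally defined as the number of word-level substitutions (S), deletions (D), and insertions (I) needed to turn the transcript into the ground truth, divided by the number of words in the ground truth (N). The worked value below is hypothetical: it assumes a transcript that gets exactly one word wrong against the six-word reference "How old is the Brooklyn Bridge" used later in this guide.

```latex
\mathrm{WER} = \frac{S + D + I}{N}
\qquad\text{for example}\qquad
\mathrm{WER} = \frac{1 + 0 + 0}{6} \approx 16.7\%
```

Because insertions are counted against the length of the ground truth, WER can exceed 100% when a transcript contains many extra words.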

Getting started with Speech-to-Text Console

Ensure you have signed up for a Google Cloud account and created a project.

  1. Go to Speech in the Google Cloud console, and navigate to the Speech-to-Text UI.
  2. Using an audio file that is acoustically representative of your use case and of how you plan to use the ASR system, follow the quickstart instructions for making your first transcription using the Speech-to-Text UI.

Calculating Transcription Accuracy

  1. After you have successfully transcribed your audio file, go to the Transcription Accuracy section. This section remains empty until accuracy is calculated for your transcription.
  2. To begin calculating accuracy, click the Upload Ground Truth button at the top of the section.
    Screenshot of the Speech-to-Text transcription details page, showing transcription accuracy section and the upload ground truth button

Specifying ground truth

  1. To calculate the accuracy of the transcription, provide a ground truth file. This is a .txt or .csv file, usually a human-generated transcription, that contains the correct or expected transcriptions for comparison.
  2. For example, for gs://cloud-samples-data/speech/brooklyn_bridge.wav, the ground truth file contains: How old is the Brooklyn Bridge. If you don't have a ground truth file available, a recommended approach is to download the transcription in a text format, edit the transcription file as needed, and upload it as the ground truth file.
  3. Specify the ground truth file, either with Upload or by selecting an existing Cloud Storage file, and click Save.
    Screenshot of the Speech-to-Text transcription creation page, showing selection or upload for a ground truth file.
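
If you prefer to prepare the ground truth file locally before uploading it, the example above suggests that a plain-text file holding the expected transcript is what the .txt case needs. The following is a minimal sketch under assumptions: UTF-8 encoding, a hypothetical output file name, and an illustrative helper named write_ground_truth. The .csv layout isn't covered because this guide doesn't describe it.

```python
from pathlib import Path


def write_ground_truth(path: Path, transcript: str) -> Path:
    """Write the expected transcript to a plain-text file and return its path.

    Assumption: the file is UTF-8 text containing only the transcript.
    """
    # Strip stray leading/trailing whitespace so it is not compared as content.
    path.write_text(transcript.strip(), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical file name; the transcript is the sample from this guide.
    out = write_ground_truth(
        Path("brooklyn_bridge_ground_truth.txt"),
        "How old is the Brooklyn Bridge",
    )
    print(out.read_text(encoding="utf-8"))
```

The resulting file can then be uploaded in the step above, or copied to Cloud Storage and selected from there.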

Confirming ground truth

  1. After clicking Save, a prompt displays to confirm that the specified ground truth file is correct. Verify that the ground truth file accurately represents the correct transcriptions, as it directly affects the accuracy metrics.
  2. Click Confirm to proceed.
    Screenshot of the Speech-to-Text transcription page, showing the contents of the uploaded ground truth file.

Review evaluation results

  1. Depending on the size of the input data, the evaluation process might take some time, and the results are displayed upon completion.
  2. Once the evaluation is complete, the following sections are displayed:
    • The Transcription Accuracy table, with the accuracy metrics and a link to the ground truth file that was used in the process.
    • The Transcription, with a toggle for comparing it to the ground truth file, along with a breakdown of accuracy metrics and highlights.
  3. Review and interpret the accuracy results to understand the performance of the Speech-to-Text recognizer and to identify areas for improvement. The results vary depending on the inputs and transcription used. The following examples show indicative accuracy results, which provide valuable insights for optimizing the Google Cloud Speech-to-Text system.
    • An example of 0% WER:
      Screenshot of the Speech-to-Text transcription accuracy page, showing computed evaluation results for the given transcript with 0% word error rate.
    • An example of 40% WER:
      Screenshot of the Speech-to-Text transcription accuracy page, showing computed evaluation results for the given transcript with 40% word error rate.
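
The breakdown of accuracy metrics described above can be approximated offline to sanity-check a result. The following is a minimal sketch, not the console's implementation: it assumes whitespace tokenization and lowercasing (the console may normalize text differently), and the names ErrorBreakdown and error_breakdown and the sample sentences are hypothetical. The sample pair is chosen so that one substitution and one deletion against a five-word reference give a 40% WER.

```python
from typing import NamedTuple


class ErrorBreakdown(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words


def error_breakdown(reference: str, hypothesis: str) -> ErrorBreakdown:
    """Count word-level edits between a ground truth and a transcript."""
    # Assumption: lowercase and split on whitespace; punctuation is kept as-is.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right corner to classify each edit.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and ref[i - 1] != hyp[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            subs += cost
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ErrorBreakdown(subs, dels, ins, len(ref))


if __name__ == "__main__":
    result = error_breakdown("please call me back tomorrow", "please call be back")
    print(result, f"WER = {result.wer:.0%}")
```

When several alignments have the same total cost, the split between substitutions, deletions, and insertions can differ from what another tool reports, but the total error count and the WER stay the same.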

Optional: updating ground truth

You can test a different ground truth against the existing transcription by attaching a different file and then repeating steps three and four with the updated ground truth file.

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how Speech-to-Text performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
