Calculate Provisioned Throughput requirements

Provisioned Throughput is calculated and priced using generative AI scale units (GSUs) and burndown rates. This section explains both concepts and shows how to estimate the amount of throughput that you need.

GSU and burndown rate

A generative AI scale unit (GSU) is a measure of throughput for your prompts and responses, and it specifies how much throughput to provision for a model.

A burndown rate is a ratio that converts input and output units (such as characters, images, video, and audio) into equivalent input characters. This ratio produces a standard throughput unit across models.

Different models use different amounts of throughput. For information about the minimum GSU purchase amount and increments for each model, see Supported models and burndown rates in this document.

This equation demonstrates how throughput is calculated:

inputs_per_query = inputs_across_modalities_converted_using_burndown_rates
outputs_per_query = outputs_across_modalities_converted_using_burndown_rates

throughput_per_second = (inputs_per_query + outputs_per_query) * queries_per_second

The calculated throughput per second determines how many GSUs that you need for your use case.
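
As an illustration, the following Python sketch applies this equation. The function name and sample values are assumptions for this example only and aren't part of any Vertex AI API; the sample values match the gemini-1.5-flash estimation example later in this document.

def throughput_per_second(inputs_per_query, outputs_per_query, queries_per_second):
    # inputs_per_query and outputs_per_query must already be converted to
    # input characters using the model's burndown rates.
    return (inputs_per_query + outputs_per_query) * queries_per_second

# 4,134 converted input chars + 1,200 converted output chars per query at 10 QPS.
print(throughput_per_second(4134, 1200, 10))  # 53340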

Important considerations

To help you plan for your Provisioned Throughput needs, review the following important considerations:

  • Requests are prioritized.

    Provisioned Throughput customers are prioritized and serviced first before on-demand requests.

  • Throughput doesn't accumulate.

    Unused throughput doesn't accumulate or carry over to the next month.

  • Provisioned Throughput is measured in characters or tokens per second.

    Provisioned Throughput is measured in characters or tokens per second, not in queries per minute (QPM). As a result, your Provisioned Throughput needs depend on your use case's query size, response size, and QPM.

  • Provisioned Throughput checks your quota.

    Your Provisioned Throughput quota is checked each time you make a request within your quota window. For gemini-2.0-flash-001, gemini-1.5-flash-002, and gemini-1.5-pro-002 models, the quota window can range up to 30 seconds and is subject to change. This means that you might temporarily experience prioritized traffic that exceeds your quota amount on a per-second basis in some cases, but you shouldn't exceed your quota on a 30-second basis. The quota window for other models can range up to one minute. The quota windows are based on the Vertex AI clock time and are independent of when requests are made.

    For example, if you purchase 1 GSU of gemini-1.5-pro-002, then you should expect 800 characters per second of always-on throughput. On average, you shouldn't be able to exceed 24,000 characters on a 30-second basis, which is calculated using this formula:

    800 characters per second * 30 seconds = 24,000 characters

    If you submit only a single request in a day and it consumes 1,600 characters in one second, it might still be processed as a Provisioned Throughput request even though it exceeds your 800 characters per second limit at the time of the request. For a sketch of this rolling-window behavior, see the example after this list.

  • Provisioned Throughput is specific to a project, region, model, and version.

    Provisioned Throughput is assigned to a specific project-region-model-version combination. The same model called from a different region won't count against your Provisioned Throughput quota and won't be prioritized over on-demand requests.
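
To make the quota-window behavior concrete, here is a minimal Python sketch of the rolling check described above. The constants come from the gemini-1.5-pro-002 example (1 GSU = 800 characters per second, a window of up to 30 seconds); the function names are illustrative and aren't part of any Vertex AI API.

QUOTA_WINDOW_SECONDS = 30          # window of up to 30 seconds for gemini-1.5-pro-002
CHARS_PER_SECOND_PER_GSU = 800     # throughput per GSU for gemini-1.5-pro-002

def window_limit(gsus):
    # Maximum characters that fit within one quota window.
    return gsus * CHARS_PER_SECOND_PER_GSU * QUOTA_WINDOW_SECONDS

def within_quota(chars_in_window, gsus):
    # True if the traffic observed in the current window stays within quota.
    return chars_in_window <= window_limit(gsus)

print(window_limit(1))         # 24000
print(within_quota(1600, 1))   # True: a 1,600-char burst in one second still fits
print(within_quota(25000, 1))  # False: exceeds the 30-second window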

Example of estimating your Provisioned Throughput needs

To estimate your Provisioned Throughput needs, use the estimation tool in the Google Cloud console. The following example illustrates the process of estimating the amount of Provisioned Throughput for your model. The region isn't considered in the estimation calculations.

This table provides the throughput per GSU and burndown rates for gemini-1.5-flash that you can use to follow the example.

Model: Gemini 1.5 Flash

Throughput per GSU (chars/sec):
  Less than or equal to 128,000-token context window: 54,000
  Greater than 128,000-token context window: 27,000

Minimum GSU purchase increment: 1

Burndown rates:
  Less than or equal to 128,000-token context window:
    1 input char = 1 char
    1 output char = 4 chars
    1 image = 1,067 chars
    1 video per second = 1,067 chars
    1 audio per second = 107 chars

  Greater than 128,000-token context window:
    1 input char = 2 chars
    1 output char = 8 chars
    1 image = 2,134 chars
    1 video per second = 2,134 chars
    1 audio per second = 214 chars
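
For convenience when following the example below, the burndown rates for a context window of 128,000 tokens or less can be encoded as a small lookup, as in this Python sketch. The dictionary layout and helper function are assumptions for this sketch and aren't part of any SDK.

# Burndown rates for gemini-1.5-flash, context window of 128,000 tokens or less.
GEMINI_1_5_FLASH_BURNDOWN = {
    "input_char": 1,       # converted chars per input character
    "output_char": 4,      # converted chars per output character
    "image": 1067,         # converted chars per image
    "video_second": 1067,  # converted chars per second of video
    "audio_second": 107,   # converted chars per second of audio
}

def converted_input_chars(counts, rates=GEMINI_1_5_FLASH_BURNDOWN):
    # Convert a mixed-modality query into equivalent input characters.
    return sum(count * rates[unit] for unit, count in counts.items())

# Two images convert to 2,134 input characters.
print(converted_input_chars({"image": 2}))  # 2134
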
  1. Gather your requirements.

    1. In this example, your requirement is to ensure that you can send 2,000 characters with 2 images and receive 300 characters of output for 10 queries per second using gemini-1.5-flash.

      This step means that you understand your use case: you have identified the size of your inputs and outputs, the number of queries per second (QPS), and your model.

    2. To estimate your throughput, specify your model. In this example, your model is gemini-1.5-flash.

    3. Specify the type of input, and use the burndown rates table to identify the burndown rate for your input type.

      An image's burndown rate for the gemini-1.5-flash model is 1,067 characters.

  2. Calculate your throughput.

    1. Multiply the number of images by the burndown rate for the input type for your specific model.

      2 images * 1,067 input characters per image = 2,134 input characters

    2. Your total output is 300 characters. Return to the burndown rates table, and find the burndown rate for output characters (four characters per output character) for your specific model (gemini-1.5-flash).

      300 output characters * 4 characters per output character = 1,200 converted input characters

    3. Add your totals together.

      2,000 input characters + 2,134 converted input characters for the images + 1,200 converted input characters for the output = 5,334 converted input characters per query

    4. Multiply the characters per query by your expected queries per second to get the total throughput per second.

      5,334 converted input characters per query * 10 QPS = 53,340 total converted input characters per second

  3. Calculate your GSUs.

    1. Divide the total throughput per second by the throughput per GSU from the burndown rates table to get the number of GSUs.

      53,340 total converted input chars per second ÷ 54,000 throughput per GSU = 0.988 GSUs

    2. Because the minimum GSU purchase increment for gemini-1.5-flash is 1, purchase 1 GSU to meet your requirement. A Python sketch that reproduces this calculation follows these steps.
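
The following Python sketch reproduces the arithmetic from steps 1 through 3 end to end. The constants come from the requirements and the burndown rates table above; rounding up to a whole GSU reflects the 1-GSU minimum purchase increment, and the variable names are illustrative only.

import math

INPUT_CHARS = 2000           # prompt text per query
IMAGES = 2                   # images per query
OUTPUT_CHARS = 300           # expected output per query
QPS = 10                     # queries per second

IMAGE_RATE = 1067            # burndown: converted chars per image
OUTPUT_RATE = 4              # burndown: converted chars per output character
THROUGHPUT_PER_GSU = 54000   # chars/sec per GSU, 128,000-token context window or less

chars_per_query = INPUT_CHARS + IMAGES * IMAGE_RATE + OUTPUT_CHARS * OUTPUT_RATE
throughput = chars_per_query * QPS
gsus_needed = throughput / THROUGHPUT_PER_GSU
gsus_to_buy = max(1, math.ceil(gsus_needed))   # round up to the 1-GSU increment

print(chars_per_query)         # 5334
print(throughput)              # 53340
print(round(gsus_needed, 3))   # 0.988
print(gsus_to_buy)             # 1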

What's next