
Your guide to Provisioned Throughput (PT) on Vertex AI

February 18, 2026
Raiyaan Serang

Senior Product Manager, Vertex AI


When AI agents make thousands of decisions a day, consistent performance isn't just a technical detail — it's a business requirement. 

Provisioned Throughput (PT) solves this by giving you reserved resources that guarantee capacity and predictable performance. To help you scale, we are updating PT on Vertex AI with three key improvements:

  • Model diversity: Run the right model for the right job.

  • Multimodal innovation: Process text, images, and video seamlessly.

  • Operational flexibility: Adapt your resources as your agents grow.

In this post, we’ll share the resources available to you today on Vertex AI, and how you can get started. 

Expanding support for a diverse model portfolio

A mature AI strategy requires selecting the right model for the specific task. Vertex AI Model Garden, our curated set of 200+ first-party, third-party, and open-source models, makes it easy to choose the best model for your business needs. 

We standardized the PT experience across this portfolio to ensure your capacity strategy remains consistent regardless of the model you deploy.

  • Anthropic integration (private preview): You can now purchase and manage PT for Anthropic models directly from the Vertex AI console, bringing one of the industry's leading third-party providers into your primary capacity workflow.

  • Open model ecosystem: We have extended PT support to the most popular open-source models, including Llama 4, Qwen3, GLM-4.7, and DeepSeek-OCR, all from the same console experience.

  • Unified governance: Because PT now covers all types of models under a single framework, engineering teams no longer need to design separate reservation or procurement strategies for different model providers.

Powering multimodal innovation

The next wave of AI agents is seeing, hearing, and acting in real time. This movement toward native audio, high-definition video, and complex reasoning creates a massive, non-negotiable demand for reliable compute. 

We are ensuring that PT supports these advanced modalities as soon as they reach your production environment.

  • Gemini 3 and Nano Banana: You can now secure dedicated PT for our most capable Gemini 3 models and Nano Banana, our state-of-the-art model for high-fidelity image generation and editing.

  • Gemini Live API: By using PT with Gemini Live API, you get the guaranteed throughput required for high-bandwidth multimodal streams – whether your agents are processing live video feeds or providing real-time audio responses.

  • Veo 3 and 3.1: For video workloads, PT GSU (Generative AI Scale Unit) minimums and incremental limits have been removed for Veo 3 and Veo 3.1. This allows you to purchase the exact amount of capacity you need, making it easier to scale video generation without being forced into high entry-level commitments.

Increasing operational flexibility

Scaling for global production shouldn't mean sacrificing agility. We provide levers to treat AI compute as a dynamic resource that aligns with actual business cycles.

  • Flexible term lengths: We now offer 1-week PT terms for select models. This allows you to secure guaranteed capacity for high-impact, short-term windows – like a holiday traffic spike or a product launch – without a monthly or yearly commitment.

  • Proactive capacity planning: You can now schedule change orders for your PT requests up to two weeks in advance for select models. This enables your team to automate the ramp-up of resources for known peak events, shifting your strategy from reactive scaling to proactive planning.

  • Maximizing token value: For agentic workloads with long, repetitive contexts, PT now integrates with explicit caching for select models. This delivers reserved performance alongside the significant input cost reductions of caching, ensuring the price of your reservation aligns with actual business value.
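To make the caching economics concrete, the sketch below computes the blended input cost of a request whose long shared prefix is served from an explicit cache. The per-token price and the cached-token discount are hypothetical placeholders for illustration, not published Vertex AI pricing.

```python
# Hypothetical illustration of how explicit caching lowers effective input
# cost for agentic workloads that reuse a long shared context. The price
# and discount below are placeholders, not published Vertex AI rates.

def blended_input_cost(cached_tokens: int,
                       fresh_tokens: int,
                       price_per_1k: float,
                       cache_discount: float) -> float:
    """Cost of one request: cached prefix billed at a discount, the rest at full price."""
    cached_cost = cached_tokens / 1000 * price_per_1k * (1 - cache_discount)
    fresh_cost = fresh_tokens / 1000 * price_per_1k
    return cached_cost + fresh_cost

# Example: a 100k-token shared context plus 2k tokens of new input per request,
# assuming $0.002 per 1k input tokens and a 75% discount on cached tokens.
with_cache = blended_input_cost(100_000, 2_000, 0.002, 0.75)
without_cache = blended_input_cost(0, 102_000, 0.002, 0.0)
print(f"with cache: ${with_cache:.3f}, without: ${without_cache:.3f}")
```

The longer and more repetitive the shared context relative to the new input, the larger the share of each request's input cost that the cache discount applies to.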

How customers are scaling with confidence on Vertex AI


"Reve leverages Provisioned Throughput on Vertex AI to power the computational intelligence behind our creative tools. Vertex AI has proven to be the fastest, lowest-latency platform for the foundation models our users depend on – making our most critical interactive features over twice as fast. We’ve been impressed by the performance, availability, and flexibility Vertex AI provides our engineering teams." Jon Watte, CTO, Reve AI


"At Knowunity, we rely on Provisioned Throughput on Vertex AI to help 20 million students study smarter. We see massive traffic spikes in the afternoon when students return from school – processing over 1 million tokens per second at peak. Before PT, we frequently hit capacity constraints during these hours; now, with the premium performance and the flexibility to change models as needed, we have the guaranteed scale to support our global users with confidence." Lucas Hild, co-founder and CTO, Knowunity


"At Palo Alto Networks, we are integrating Gemini models across our ecosystem – from Strata Copilot to autonomous operations, and AI Canvas to our internal tools. Moving from pay-as-you-go to Provisioned Throughput on Vertex AI was a turning point for us; it provided the guaranteed latency we need for production and the ability to isolate reservations per use case. This ensures we can serve various applications with the specific performance guarantees each one requires, delivering AI-driven security at a global scale." Rajesh Bhagwat, VP of Engineering, Palo Alto Networks


"Provisioned Throughput on Vertex AI made it easy for us to take open models from experimentation into real production traffic. Because we use LLMs for search, we need extremely high throughput, high concurrency, and predictable latency at scale. Vertex AI was the only platform that could reliably support those requirements for us with its multi-tenant system and flexible commitment model." Ishan Gupta, co-founder, Juicebox


"Provisioned Throughput on Vertex AI has given us a predictable and controllable way to scale our Gen AI workloads globally, which is critical for both operational stability and cost-management. It provides the solid foundation we need to plan capacity ahead of demand – especially during peak periods like Black Friday and Cyber Monday – allowing us to balance high-speed performance with cost efficiency and grow our platform with confidence.” Francisco Castro Barea, VP of Engineering, Freepik

Four steps to secure your 2026 production capacity

According to our AI Agent Trends 2026 Report, 2026 will be the year every employee goes from "guessing" to "knowing" – provided organizations invest in the right infrastructure. 

By aligning your token requirements with reserved capacity, you ensure your agents are always ready to act.

  1. Calculate: Use the Vertex AI Generative AI Scale Unit Estimator to determine the GSUs required for your mission-critical baseload.
  2. Reserve: Visit the Provisioned Throughput dashboard in the Vertex AI console to purchase capacity for your choice of models.
  3. Request access: For customers looking to manage Anthropic models within their primary capacity workflow, complete this form to request access to the Private Preview.
  4. Implement: Contact your Google Cloud account team to discuss a 2026 capacity plan that layers PT with our broader consumption portfolio for maximum resilience.
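Step 1 above amounts to sizing your baseload in GSUs, which the estimator does for you. As a back-of-the-envelope sketch, it reduces to dividing your peak token rate by the per-GSU throughput of your chosen model and rounding up; the per-GSU rate below is an assumed placeholder, since the real figure varies by model and is shown in the estimator.

```python
import math

def required_gsus(peak_tokens_per_sec: float, tokens_per_sec_per_gsu: float) -> int:
    """Round up: reserved capacity must cover the peak, not the average."""
    return math.ceil(peak_tokens_per_sec / tokens_per_sec_per_gsu)

# Example: a hypothetical baseload peaking at 50,000 tokens/sec on a model
# where one GSU is assumed to sustain 3,360 tokens/sec.
print(required_gsus(50_000, 3_360))  # rounds 14.88... up to 15
```

Sizing against the mission-critical baseload rather than the absolute peak lets you cover predictable traffic with PT and let pay-as-you-go absorb the overflow.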