Kimi models

Kimi models on Vertex AI offer fully managed and serverless models as APIs. To use a Kimi model on Vertex AI, send a request directly to the Vertex AI API endpoint. Because Kimi models use a managed API, there's no need to provision or manage infrastructure.

You can stream your responses to reduce the end-user latency perception. A streamed response uses server-sent events (SSE) to incrementally stream the response.

Available Kimi models

The following models are available from Kimi to use in Vertex AI. To access a Kimi model, go to its Model Garden model card.

Kimi K2 Thinking

Kimi K2 Thinking is a thinking model from Kimi that excels at complex problem-solving and deep reasoning.

Go to the Kimi K2 Thinking model card

Use Kimi models

You can use curl commands to send requests to the Vertex AI endpoint using the following model names:

For Kimi K2 Thinking, use kimi-k2-thinking-maas

To learn how to make streaming and non-streaming calls to Kimi models, see Call open model APIs.

Kimi model region availability and quotas

For Kimi models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM).

Model	Region	Quotas	Context length
Kimi K2 Thinking
Kimi K2 Thinking	`global`		262144

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see the Cloud Quotas overview.

What's next

Learn how to Call open model APIs.