Use a context cache

You can use REST APIs or the Python SDK to reference content stored in a context cache in a generative AI application. Before it can be used, you must first create the context cache.

The context cache object you use in your code includes the following properties:

  • name - The context cache resource name. Its format is projects/PROJECT_NUMBER/locations/LOCATION/cachedContents/CACHE_ID. When you create a context cache, you can find its resource name is in the response. The project number is a unique identifier for your project. The cache ID is an ID for your cache. When you specify a context cache in your code, you must use the full context cache resource name. The following is an example that shows how you specify a cached content resource name in a request body:

    "cached_content": "projects/123456789012/locations/us-central1/123456789012345678"
    
  • model - The resource name for the model used to create the cache. Its format is projects/PROJECT_NUMBER/locations/LOCATION/publishers/PUBLISHER_NAME/models/MODEL_ID.

  • createTime - A Timestamp that specifies the create time of the context cache.

  • updateTime - A Timestamp that specifies the most recent update time of a context cache. After a context cache is created, and before it's updated, its createTime and updateTime are the same.

  • expireTime - A Timestamp that specifies when a context cache expires. The default expireTime is 60 minutes after the createTime. You can update the cache with a new expiration time. For more information, see Update the context cache.
    After a cache expires, it's marked for deletion and you shouldn't assume that it can be used or updated. If you need to use a context cache that expired, you need to recreate it with an appropriate expiration time.

Context cache use restrictions

The following features can be specified when you create a context cache. You shouldn't specify them again in your request:

  • The GenerativeModel.system_instructions property. This property is used to specify instructions to the model before the model receives instructions from a user. For more information, see System instructions.

  • The GenerativeModel.tool_config property. The tool_config property is used to specify tools used by the Gemini model, such as a tool used by the function calling feature.

  • The GenerativeModel.tools property. The GenerativeModel.tools property is used to specify functions to create a function calling application. For more information, see Function calling.

Use a context cache sample

The following shows how to use a context cache. When you use a context cache, you can't specify the following properties:

  • GenerativeModel.system_instructions
  • GenerativeModel.tool_config
  • GenerativeModel.tools

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

Streaming and non-streaming responses

You can choose whether the model generates streaming responses or non-streaming responses. For streaming responses, you receive each response as soon as its output token is generated. For non-streaming responses, you receive all responses after all of the output tokens are generated.

For a streaming response, use the stream parameter in generate_content.

  response = model.generate_content(contents=[...], stream = True)
  

For a non-streaming response, remove the parameter, or set the parameter to False.

Sample code

import vertexai

from vertexai.preview.generative_models import GenerativeModel
from vertexai.preview import caching

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
# cache_id = "your-cache-id"

vertexai.init(project=PROJECT_ID, location="us-central1")

cached_content = caching.CachedContent(cached_content_name=cache_id)

model = GenerativeModel.from_cached_content(cached_content=cached_content)

response = model.generate_content("What are the papers about?")

print(response.text)
# Example response:
# The provided text is about a new family of multimodal models called Gemini, developed by Google.
# ...

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart. For more information, see the Vertex AI Go SDK for Gemini reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates streaming responses or non-streaming responses. For streaming responses, you receive each response as soon as its output token is generated. For non-streaming responses, you receive all responses after all of the output tokens are generated.

For a streaming response, use the GenerateContentStream method.

  iter := model.GenerateContentStream(ctx, genai.Text("Tell me a story about a lumberjack and his giant ox. Keep it very short."))
  

For a non-streaming response, use the GenerateContent method.

  resp, err := model.GenerateContent(ctx, genai.Text("What is the average size of a swallow?"))
  

Sample code

import (
	"context"
	"errors"
	"fmt"
	"io"

	"cloud.google.com/go/vertexai/genai"
)

// useContextCache shows how to use an existing cached content, when prompting the model
// contentName is the ID of the cached content
func useContextCache(w io.Writer, contentName string, projectID, location, modelName string) error {
	// location := "us-central1"
	// modelName := "gemini-1.5-pro-001"
	ctx := context.Background()

	client, err := genai.NewClient(ctx, projectID, location)
	if err != nil {
		return fmt.Errorf("unable to create client: %w", err)
	}
	defer client.Close()

	model := client.GenerativeModel(modelName)
	model.CachedContentName = contentName
	prompt := genai.Text("What are the papers about?")

	res, err := model.GenerateContent(ctx, prompt)
	if err != nil {
		return fmt.Errorf("error generating content: %w", err)
	}

	if len(res.Candidates) == 0 ||
		len(res.Candidates[0].Content.Parts) == 0 {
		return errors.New("empty response from model")
	}

	fmt.Fprintf(w, "generated response: %s\n", res.Candidates[0].Content.Parts[0])
	return nil
}

REST

You can use REST to use a context cache with a prompt by using the Vertex AI API to send a POST request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-1.5-pro-002:generateContent

Request JSON body:

{
  "cachedContent": "projects/PROJECT_NUMBER/locations/LOCATION/cachedContents/CACHE_ID",
  "contents": [
      {"role":"user","parts":[{"text":"PROMPT_TEXT"}]}
  ],
  "generationConfig": {
      "maxOutputTokens": 8192,
      "temperature": 1,
      "topP": 0.95,
  },
  "safetySettings": [
      {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
          "category": "HARM_CATEGORY_HARASSMENT",
          "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      }
  ],
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-1.5-pro-002:generateContent"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-1.5-pro-002:generateContent" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Example curl command

LOCATION="us-central1"
MODEL_ID="gemini-1.5-pro-002"
PROJECT_ID="test-project"

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}:generateContent" -d \
'{
  "cachedContent": "projects/${PROJECT_NUMBER}/locations/${LOCATION}/cachedContents/${CACHE_ID}",
  "contents": [
      {"role":"user","parts":[{"text":"What are the benefits of exercise?"}]}
  ],
  "generationConfig": {
      "maxOutputTokens": 8192,
      "temperature": 1,
      "topP": 0.95,
  },
  "safetySettings": [
    {
      "category": "HARM_CATEGORY_HATE_SPEECH",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
      "category": "HARM_CATEGORY_HARASSMENT",
      "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    }
  ],
}'