Service Extensions lets you configure extensions on supported Application Load Balancers that use callouts to Google services. This page shows you how to configure such extensions.
For an overview, see Integration with Google services.
Configure a traffic extension to call the Model Armor service
You can configure a traffic extension to call Model Armor to uniformly enforce security policies on generative AI inference traffic to application load balancers, including GKE Inference Gateway.
A traffic extension groups related extension services into one or more chains. You can configure both plugins and callouts in the same extension chain. Each extension chain selects the traffic to act on by using Common Expression Language (CEL) match conditions. The load balancer evaluates a request against each chain's match condition sequentially. When a request matches the conditions defined by a chain, all extensions in the chain act on the request. Only one chain matches a given request.
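For example, the match condition used later on this page scopes a chain to chat completion requests. A broader condition can combine several request attributes; this is a sketch, so verify attribute availability against the CEL matcher language reference:

request.path == "/v1/chat/completions"
request.path.startsWith("/v1/") && request.method == "POST"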
The extension attaches to a load balancer forwarding rule that is created using the Inference Gateway. After you configure the resource, matching requests are sent to the Model Armor service.
Before you begin
Identify a suitable project where you have either a project Owner or Editor role, or the following Compute Engine IAM roles:
- To create instances: Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)
- To create Cloud Load Balancing components: Compute Network Admin (roles/compute.networkAdmin)
Enable the required APIs.
Console
In the Google Cloud console, go to the Enable access to APIs page.
Follow the instructions to enable the required APIs, which include the Compute Engine API, the Model Armor API, and the Network Services API.
gcloud
Use the gcloud services enable command:

gcloud services enable compute.googleapis.com modelarmor.googleapis.com networkservices.googleapis.com
Create the required Model Armor templates.
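The templates define which Model Armor filters apply to prompts and responses. As a minimal sketch—assuming the Model Armor gcloud surface is enabled in your project and using a placeholder template ID—a template with Sensitive Data Protection basic screening turned on might be created as follows; see the Model Armor documentation for the full set of filter flags:

# Placeholder template ID; adjust the location and filter flags to your needs.
gcloud model-armor templates create my-prompt-template \
    --location=us-central1 \
    --basic-config-filter-enforcement=enabled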
Set up your Google Kubernetes Engine infrastructure by deploying an Inference Gateway. Test it by sending an inference request.
Subject to a few limitations, the following OpenAI API endpoints are supported: Assistants, Chat Completions, Completions (legacy), Messages, and Threads.
Limitations when configuring an OpenAI API endpoint
When configuring an OpenAI API endpoint for your GKE infrastructure, consider the following limitations pertaining to sanitizing prompts and responses:
Streaming API responses aren't supported for any API. If you use a mix of streaming and non-streaming APIs, then, when you configure the traffic extension, set failOpen to true. Model Armor sanitizes the non-streaming responses and ignores the streaming responses.

When sanitizing prompts and responses, only the following operations are supported:
- Assistants API: Create, Delete, List, Modify, and Retrieve
- Chat Completions API: Create, Delete, Get Chat Completion, Get Chat Message, List, and Update
- Completions (legacy) API: Create
- Messages API: Create, Delete, List, Modify, and Retrieve
- Threads API: Create, Delete, Modify, and Retrieve
For API calls that return multiple choices in the response (such as POST https://api.openai.com/v1/chat/completions), only the first item in the list of choices is sanitized.
Configure the traffic extension
Check the behavior before the extension is configured by sending an inference request to the load balancer and specifying the load balancer's exposed IP address:
curl -v http://${IP}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ { "role": "user", "content": "Can you remember my ITIN: 123-45-6789" } ], "max_tokens": 250, "temperature": 0.1}'
The request generates an HTTP 200 OK status code although sensitive data has been sent to the LLM.

So that Model Armor blocks prompts that contain sensitive data, configure a traffic extension.
Console
In the Google Cloud console, go to the Service Extensions page.
Click Create extension. A wizard opens to guide you through some initial steps.
For the product, select Load Balancing. Then, click Continue. A list of supported Application Load Balancers appears.
Select a load balancer type.
Specify the region as us-central1. Click Continue.

For the extension type, select Traffic extensions, and then click Continue.
To open the Create extension form, click Continue. In the Create extension form, notice that the preceding selections aren't editable.
In the Basics section, do the following:
Specify a unique name for the extension.
The name must start with a lowercase letter followed by up to 62 lowercase letters, numbers, or hyphens and must not end with a hyphen.
Optional: Enter a brief description of the extension, using up to 1,024 characters.
Optional: In the Labels section, click Add label. Then, in the row that appears, do the following:
- For Key, enter a key name.
- For Value, enter a value for the key.
To add more key-value pairs, click Add label. You can add a maximum of 64 key-value pairs.
For more information about labels, see Create and update labels for projects.
For Forwarding rules, select one or more forwarding rules to associate with the extension. Choose a forwarding rule that is generated as part of deploying the Inference Gateway. Forwarding rules that are already associated with another extension can't be selected and appear unavailable.
For Extension chains, add one or more extension chains to execute for a matching request.
To add an extension chain, do the following, and then click Done:
For New extension chain name, specify a unique name.
The name must conform with RFC-1034, use only lowercase letters, numbers, and hyphens, and have a maximum length of 63 characters. Additionally, the first character must be a letter and the last character must be a letter or a number.
To match requests for which the extension chain is executed, for Match condition, specify a Common Expression Language (CEL) expression—for example, request.path == "/v1/chat/completions".

For more information about CEL expressions, click Get syntax help or see CEL matcher language reference.
Add one or more extensions to execute for a matching request.
For each extension, under Extensions, do the following, and then click Done:
For Extension name, specify a unique name.
The name must conform with RFC-1034, use only lowercase letters, numbers, and hyphens, and have a maximum length of 63 characters. Additionally, the first character must be a letter and the last character must be a letter or a number.
For Programmability type, select Google services, and then select a Model Armor service endpoint—for example, modelarmor.us-central1.rep.googleapis.com.

For Timeout, specify a value between 10 and 1000 milliseconds after which a message on the stream times out. Consider that Model Armor has a latency of approximately 250 milliseconds.
For Events, select all HTTP event types.
For Forward headers, click Add header, and then add HTTP headers to forward to the extension (from the client or the backend). If a header isn't specified, all headers are sent.
For Fail open, select Enabled. If the call to the extension fails or times out, request or response processing continues without error. Any subsequent extensions in the extension chain are also run.
By default, the Fail open field isn't selected. In this case, if response headers haven't been delivered to the downstream client, a generic 500 status code is returned to the client. If response headers have been delivered, the HTTP stream to the downstream client is reset.

For Metadata, click Add metadata to specify the Model Armor templates to be used to screen prompts and responses corresponding to specific models.
For Key, specify model_armor_settings. For Value, specify the templates as a JSON string, such as the following:

[{
    "model": "MODEL_NAME",
    "model_response_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/RESPONSE_TEMPLATE",
    "user_prompt_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/PROMPT_TEMPLATE"
}]
Replace the following:
- MODEL_NAME: the name of the model as configured with the InferenceModel resource—for example, meta-llama/Llama-3.1-8B-Instruct
- PROJECT_ID: the project ID
- LOCATION: the location of the Model Armor template—for example, us-central1
- RESPONSE_TEMPLATE: the response template for the model to use
- PROMPT_TEMPLATE: the prompt template for the model to use
A default template can additionally be specified for use when a request doesn't exactly match a model. To configure a default template, specify MODEL_NAME as default, as shown in the sketch that follows.

If you don't want to screen prompt or response traffic, create and include an empty filter template.
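For example, a value that routes one model to dedicated templates and falls back to a default entry for all other requests might look like the following; the project, location, and template names are placeholders:

[{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "user_prompt_template_id": "projects/my-project/locations/us-central1/templates/llama-prompts",
    "model_response_template_id": "projects/my-project/locations/us-central1/templates/llama-responses"
}, {
    "model": "default",
    "user_prompt_template_id": "projects/my-project/locations/us-central1/templates/default-prompts",
    "model_response_template_id": "projects/my-project/locations/us-central1/templates/default-responses"
}]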
The total size of metadata must be less than 1 KiB. The total number of keys in the metadata must be less than 16. The length of each key must be less than 64 characters. The length of each value must be less than 1,024 characters. All values must be strings.

When a request is blocked, Model Armor returns a standard 403 Forbidden status code. You can override the status by defining custom response settings (including a custom status code and message) in the security policy of the Model Armor template. For details, see TemplateMetadata.
Click Create extension.
gcloud
Define the callout in a YAML file and associate it with the forwarding rule that is generated when deploying the Inference Gateway. Use the sample values provided.
cat >traffic_callout_service.yaml <<EOF
name: traffic-ext
forwardingRules:
- https://www.googleapis.com/compute/v1/projects/PROJECT_ID/regions/us-central1/forwardingRules/FORWARDING_RULE
loadBalancingScheme: INTERNAL_MANAGED
extensionChains:
- name: "chain1-model-armor"
  matchCondition:
    celExpression: 'request.path == "/v1/chat/completions"'
  extensions:
  - name: extension-chain-1-model-armor
    service: modelarmor.us-central1.rep.googleapis.com
    failOpen: true
    supportedEvents:
    - REQUEST_HEADERS
    - REQUEST_BODY
    - REQUEST_TRAILERS
    - RESPONSE_HEADERS
    - RESPONSE_BODY
    - RESPONSE_TRAILERS
    timeout: 1s
    metadata:
      model_armor_settings: '[
        {
          "model": "MODEL_NAME",
          "model_response_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/RESPONSE_TEMPLATE",
          "user_prompt_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/PROMPT_TEMPLATE"
        }
      ]'
EOF
Replace the following:
- PROJECT_ID: the project ID
- FORWARDING_RULE: one or more forwarding rules to associate with the extension. Choose a forwarding rule that is generated as part of deploying the Inference Gateway. A forwarding rule that is already associated with another extension can't be used.
- MODEL_NAME: the name of the model as configured with the InferenceModel resource—for example, meta-llama/Llama-3.1-8B-Instruct
- LOCATION: the location of the Model Armor template—for example, us-central1
- RESPONSE_TEMPLATE: the response template for the model to use
- PROMPT_TEMPLATE: the prompt template for the model to use
In the metadata field, specify the Model Armor settings and templates to be used while screening prompts and responses corresponding to specific models.

A default template can additionally be specified for use when a request doesn't exactly match a model. To configure a default template, specify MODEL_NAME as default.

If you don't want to screen prompt or response traffic, create and include an empty filter template.

The total size of metadata must be less than 1 KiB. The total number of keys in the metadata must be less than 16. The length of each key must be less than 64 characters. The length of each value must be less than 1,024 characters. All values must be strings.

When a request is blocked, Model Armor returns a standard 403 Forbidden status code. You can override the status by defining custom response settings (including a custom status code and message) in the security policy of the Model Armor template. For details, see TemplateMetadata.

Import the traffic extension. Use the gcloud service-extensions lb-traffic-extensions import command with the following sample values:

gcloud service-extensions lb-traffic-extensions import traffic-ext \
    --source=traffic_callout_service.yaml \
    --location=us-central1
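To confirm the import, you can describe the new extension resource; this uses the same gcloud surface as the import command:

gcloud service-extensions lb-traffic-extensions describe traffic-ext \
    --location=us-central1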
kubectl
If you are using a GKE version earlier than v1.32.2-gke.1182001, install the traffic extension CRD:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/gke-gateway-api/refs/heads/main/config/crd/networking.gke.io_gcptrafficextensions.yaml
Define the extension in a YAML file. This custom resource links your Inference Gateway to the Model Armor service. Use the sample values provided.
cat >traffic_callout_service.yaml <<EOF
apiVersion: networking.gke.io/v1
kind: GCPTrafficExtension
metadata:
  name: traffic-ext
spec:
  targetRefs:
  - group: "gateway.networking.k8s.io"
    kind: Gateway
    name: inference-gateway
  extensionChains:
  - name: "chain1-model-armor"
    matchCondition:
      celExpressions:
      - celMatcher: 'request.path == "/v1/chat/completions"'
    extensions:
    - name: extension-chain-1-model-armor
      googleAPIServiceName: modelarmor.us-central1.rep.googleapis.com
      failOpen: true
      supportedEvents:
      - RequestHeaders
      - RequestBody
      - RequestTrailers
      - ResponseHeaders
      - ResponseBody
      - ResponseTrailers
      timeout: 1s
      metadata:
        model_armor_settings: '[
          {
            "model": "MODEL_NAME",
            "model_response_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/RESPONSE_TEMPLATE",
            "user_prompt_template_id": "projects/PROJECT_ID/locations/LOCATION/templates/PROMPT_TEMPLATE"
          }
        ]'
EOF
Replace the following:
- MODEL_NAME: the name of the model as configured with the InferenceModel resource—for example, meta-llama/Llama-3.1-8B-Instruct
- PROJECT_ID: the project ID
- LOCATION: the location of the Model Armor template—for example, us-central1
- RESPONSE_TEMPLATE: the response template for the model to use
- PROMPT_TEMPLATE: the prompt template for the model to use
In the metadata field, specify the Model Armor settings and templates to be used while screening prompts and responses corresponding to specific models.

A default template can additionally be specified for use when a request doesn't exactly match a model. To configure a default template, specify MODEL_NAME as default.

If you don't want to screen prompt or response traffic, create and include an empty filter template.

The total size of metadata must be less than 1 KiB. The total number of keys in the metadata must be less than 16. The length of each key must be less than 64 characters. The length of each value must be less than 1,024 characters. All values must be strings.

When a request is blocked, Model Armor returns a standard 403 Forbidden status code. You can override the status by defining custom response settings (including a custom status code and message) in the security policy of the Model Armor template. For details, see TemplateMetadata.

Apply the configuration defined in the traffic_callout_service.yaml file to your GKE cluster. This command creates the GCPTrafficExtension resource, which links your Inference Gateway to the Model Armor service.

kubectl apply -f traffic_callout_service.yaml
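To confirm that the cluster accepted the custom resource, you can inspect it with standard kubectl commands; the resource name matches the manifest above:

kubectl get gcptrafficextension traffic-ext -o yaml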
Grant the required roles to the Service Extensions service account. Use the gcloud projects add-iam-policy-binding command:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
    --role=roles/container.admin

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
    --role=roles/modelarmor.calloutUser

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
    --role=roles/serviceusage.serviceUsageConsumer

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com \
    --role=roles/modelarmor.user
Replace the following:
- PROJECT_ID: the ID of the project
- PROJECT_NUMBER: the project number
These values are listed in the Project info panel in the Google Cloud console for your project.
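To verify the grants, you can query the project's IAM policy for the service agent—a standard IAM filter, using the same placeholders:

gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:service-PROJECT_NUMBER@gcp-sa-dep.iam.gserviceaccount.com" \
    --format="table(bindings.role)"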
To verify that the traffic extension works as expected, run the same curl command:

curl -v http://${IP}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ { "role": "user", "content": "Can you remember my ITIN: 123-45-6789" } ], "max_tokens": 250, "temperature": 0.1}'
With the service extension configured, a request with sensitive data generates an HTTP 403 Forbidden status code, logs an error message as configured in the template, and closes the connection.

When the request is safe, it generates an HTTP 200 OK status code and returns the LLM response to the user.
To monitor the behavior of the extension, use the Logs Explorer. In the query pane, depending on your Inference Gateway configuration, filter by the appropriate load balancer resource type.
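For example, if your Inference Gateway is fronted by a regional internal Application Load Balancer, a Logs Explorer query along the following lines surfaces blocked requests; the resource type here is an assumption that depends on your deployment:

resource.type="internal_http_lb_rule"
httpRequest.status=403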
Application Load Balancer log entries contain information that helps you debug your HTTP or HTTPS traffic.
To perform a more detailed analysis of security assessments, enable Model Armor audit logging.