This tutorial demonstrates how to deploy and manage containerized agentic AI/ML applications by using Google Kubernetes Engine (GKE). By combining the Google Agent Development Kit (ADK) with a self-hosted large language model (LLM) like Llama 3.1 served by vLLM, you can operationalize AI agents efficiently and at scale while maintaining full control over the model stack. This tutorial walks you through the end-to-end process of taking a Python-based agent from development to production deployment on a GKE Autopilot cluster with GPU acceleration.
This tutorial is intended for machine learning (ML) engineers, developers, and cloud architects who are interested in using Kubernetes container orchestration capabilities for serving agentic AI/ML applications. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin, ensure you are familiar with the following:
Background
This section describes the key technologies used in this tutorial.
Agent Development Kit (ADK)
Agent Development Kit (ADK) is a flexible and modular framework for developing and deploying AI agents. Although it's optimized for Gemini and the Google ecosystem, ADK doesn't require you to use a specific model or deployment, and is built for compatibility with other frameworks. ADK was designed to make agent development feel more like software development, to make it easier for developers to create, deploy, and orchestrate agentic architectures that range from basic tasks to complex workflows.
For more information, see the ADK documentation.
GKE managed Kubernetes service
Google Cloud offers a range of services, including GKE, which is well-suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure—including scalable resources, distributed computing, and efficient networking—to handle the computational demands of LLMs.
For more information about key Kubernetes concepts, see Start learning about Kubernetes. For more information about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.
vLLM
vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:
- Optimized transformer implementation with PagedAttention.
- Continuous batching to improve the overall serving throughput.
- Tensor parallelism and distributed serving on multiple GPUs.
For more information, see the vLLM documentation.
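For illustration only, the following sketch shows how vLLM's OpenAI-compatible server can be started from the command line with tensor parallelism across two GPUs. You don't run this command in this tutorial; the deploy-llm.yaml manifest configures an equivalent server inside the cluster for you.
# Illustrative sketch: start an OpenAI-compatible vLLM server on port 8000,
# splitting the model across 2 GPUs with tensor parallelism.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --port 8000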
Prepare the environment
This tutorial uses Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl, terraform, and the Google Cloud CLI.
To set up your environment with Cloud Shell, follow these steps:
- In the Google Cloud console, launch a Cloud Shell session and click Activate Cloud Shell. This action launches a session in a Google Cloud console pane.
Set the default environment variables:
gcloud config set project PROJECT_ID
export GOOGLE_CLOUD_REGION=REGION
export PROJECT_ID=PROJECT_ID
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: the Google Cloud region (for example, us-east4) to provision your GKE cluster, Artifact Registry, and other regional resources. Make sure to specify a region that supports L4 GPUs and G2 machine type instances. To check for region availability, see GPU regions and zones in the Compute Engine documentation.
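If you prefer to check availability from the command line, the following sketch uses standard gcloud list filtering to show the zones that offer the nvidia-l4 accelerator type:
# List the zones where the nvidia-l4 accelerator type is available.
gcloud compute accelerator-types list --filter="name=nvidia-l4"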
Clone the sample project
From your Cloud Shell terminal, clone the tutorial's sample code repository:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
Navigate to the tutorial directory:
cd kubernetes-engine-samples/ai-ml/adk-vllm
Create and configure Google Cloud resources
To deploy your agent, you first need to provision the necessary Google Cloud resources. You can create the GKE cluster and Artifact Registry repository using either the gcloud CLI or Terraform.
gcloud
This section provides gcloud CLI commands to set up your GKE cluster and Artifact Registry.
Create a GKE cluster: You can deploy your containerized agentic application in a GKE Autopilot or Standard cluster. Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see About GKE modes of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION
Replace CLUSTER_NAME with the name of your GKE cluster.
With Autopilot, GKE automatically provisions nodes based on your workload's resource requests. The GPU required for the LLM is requested in the deploy-llm.yaml manifest by using a nodeSelector.
To add a nodeSelector to request the nvidia-l4 GPU, follow these steps:
- Open kubernetes-engine-samples/ai-ml/adk-vllm/deploy-llm/deploy-llm.yaml in an editor.
- Add the following nodeSelector under spec.template.spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
Standard
In Cloud Shell, create a Standard cluster by running the following command:
gcloud container clusters create CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION
Replace CLUSTER_NAME with the name of your GKE cluster.
Create a GPU-enabled node pool for your cluster by running the following command:
gcloud container node-pools create gpu-node-pool \
    --cluster=CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1 \
    --enable-gvnic
The deploy-llm.yaml file specifies an nvidia-l4 GPU, which is available in the G2 machine series. For more information about this machine type, see GPU machine types in the Compute Engine documentation.
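Optionally, confirm that the node pool was created with the expected machine type. This is a quick check; the node pool name matches the command above:
# List the node pools in the cluster, including their machine types.
gcloud container node-pools list \
    --cluster=CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION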
Create an Artifact Registry repository: Create an Artifact Registry repository to securely store and manage your agent's Docker container image.
gcloud artifacts repositories create REPO_NAME \
    --repository-format=docker \
    --location=$GOOGLE_CLOUD_REGION
Replace REPO_NAME with the name of the Artifact Registry repository you want to use (for example, adk-repo).
Get the repository URL: To verify the full path to your repository, run this command. You'll use this format to tag your Docker image when you build the agent image.
gcloud artifacts repositories describe REPO_NAME \
    --location $GOOGLE_CLOUD_REGION
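Images in a Docker-format Artifact Registry repository are addressed as LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE:TAG. For example, the agent image that you build later in this tutorial uses the following path, shown here only to illustrate the format:
# Illustrates the image path format used later when building the agent image.
echo "${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME/adk-agent:latest"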
Terraform
This section describes how to use the Terraform configuration included in the sample repository to provision your Google Cloud resources automatically.
Navigate to the Terraform directory: The /terraform directory contains all the necessary configuration files to create the GKE cluster and other required resources.
cd terraform
Create a Terraform variables file: Copy the provided example variables file (example_vars.tfvars) to create your own vars.tfvars file.
cp example_vars.tfvars vars.tfvars
Open the vars.tfvars file in an editor and replace the placeholder values with your specific configuration. At a minimum, you must replace PROJECT_ID with your Google Cloud project ID and CLUSTER_NAME with the name of your GKE cluster.
Initialize Terraform: To download the necessary provider plugins for Google Cloud, run the following command:
terraform init
Review the execution plan: This command shows the infrastructure changes Terraform will make.
terraform plan -var-file=vars.tfvars
Apply the configuration: To create the resources in your Google Cloud project, execute the Terraform plan. Confirm with yes when prompted.
terraform apply -var-file=vars.tfvars
After you run these commands, Terraform provisions your GKE cluster and Artifact Registry repository, and configures the necessary IAM roles and service accounts, including Workload Identity Federation for GKE.
To learn more about using Terraform, see Provision GKE resources with Terraform.
Configure kubectl to communicate with your cluster
To configure kubectl to communicate with your cluster, run the following command:
gcloud container clusters get-credentials CLUSTER_NAME \
--location=${GOOGLE_CLOUD_REGION}
Replace CLUSTER_NAME with the name of your GKE cluster.
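As a quick check, confirm that kubectl can reach the cluster's control plane:
# Displays the control plane endpoint if the credentials were configured correctly.
kubectl cluster-info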
Build the agent image
After you create the infrastructure by using either the gcloud CLI or Terraform, follow these steps to build your agent application.
Grant the required IAM role for Cloud Build: The Cloud Build service requires permissions to push the agent's container image to Artifact Registry. Grant the roles/artifactregistry.writer role to the Compute Engine default service account, which is used by Cloud Build.
Construct the email address for the Compute Engine default service account:
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
export COMPUTE_SA_EMAIL=${PROJECT_NUMBER}-compute@developer.gserviceaccount.com
Grant the roles/artifactregistry.writer role to the service account:
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:${COMPUTE_SA_EMAIL} \
    --role=roles/artifactregistry.writer
Build and push the agent container image: From the project root directory (adk/llama/vllm), build your Docker image and push it to your Artifact Registry by running these commands:
export IMAGE_URL="${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME/adk-agent:latest"
gcloud builds submit --tag $IMAGE_URL
Verify that the image has been pushed: After the build process completes successfully, verify that your agent's container image was pushed to Artifact Registry by listing the images in your repository.
gcloud artifacts docker images list ${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME
You should see output that lists the image you just pushed and tagged as latest.
Deploy the model
After setting up your GKE cluster and building the agent image, the next step is to deploy the self-hosted Llama 3.1 model to your cluster. To do this, deploy a pre-configured vLLM inference server that pulls the model from Hugging Face and serves it internally within the cluster.
Create a Kubernetes secret for Hugging Face credentials: To allow the GKE cluster to download the gated Llama 3.1 model, you must provide your Hugging Face token as a Kubernetes secret. The deploy-llm.yaml manifest is configured to use this secret for authentication.
kubectl create secret generic hf-secret \
    --from-literal=hf-token-secret=HUGGING_FACE_TOKEN
Replace HUGGING_FACE_TOKEN with your Hugging Face access token.
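Optionally, verify that the secret exists before you deploy the model. The token value itself is not displayed:
# Confirms that the hf-secret object was created in the current namespace.
kubectl get secret hf-secret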
View the manifest: From your project root directory (adk/llama/vllm), navigate to the /deploy-llm directory that contains the model Deployment manifest.
cd deploy-llm
Apply the manifest: Run the following command to apply the deploy-llm.yaml manifest to your cluster:
kubectl apply -f deploy-llm.yaml
The command creates three Kubernetes resources:
- A Deployment that runs the vLLM server, configured to use the meta-llama/Llama-3.1-8B-Instruct model.
- A Service named vllm-llama3-service that exposes the vLLM server on an internal cluster IP address, allowing the ADK agent to communicate with it.
- A ConfigMap containing a Jinja chat template required by the Llama 3.1 model.
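As a quick check, you can list the objects in the current namespace to confirm that all three resources exist. This is a generic listing; the exact resource names come from deploy-llm.yaml:
# List the Deployments, Services, and ConfigMaps created by the manifest.
kubectl get deployments,services,configmaps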
Verify the model Deployment: The vLLM server pulls the model files from Hugging Face. This process can take several minutes. You can monitor the status of the Pod to ensure its readiness.
Wait for the Deployment to become available.
kubectl wait --for=condition=available --timeout=600s deployment/vllm-llama3-deployment
View the logs from the running Pod to confirm that the server started successfully.
export LLM_POD=$(kubectl get pods -l app=vllm-llama3 -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $LLM_POD
The deployment is ready when you see log output similar to the following, indicating the LLM server has started and the API routes are available:
INFO 07-16 14:15:16 api_server.py:129] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Send a request directly to the model server to confirm that the LLM is ready. To do this, open a new Cloud Shell terminal and run the following command to forward the vllm-llama3-service to your local machine:
kubectl port-forward service/vllm-llama3-service 8000:8000
In another terminal, send a sample request to the model's API endpoint by using curl. For example:
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "Hello!",
      "max_tokens": 10
    }'
If the command returns a successful JSON response, your LLM is ready. You can now terminate the port-forward process by returning to its terminal window and pressing Ctrl+C, and then proceed to deploy the agent.
Deploy the agent application
The next step is to deploy the ADK-based agent application.
Navigate to the /deploy-agent directory: From your project root directory (adk/llama/vllm), navigate to the /deploy-agent directory that contains the agent's source code and deployment manifest.
cd ../deploy-agent
Update the agent deployment manifest: The sample deploy-agent.yaml manifest file contains a placeholder for your project ID in the container image URL. You must replace the placeholder with your Google Cloud project ID.
image: us-central1-docker.pkg.dev/PROJECT_ID/adk-repo/adk-agent:latest
To perform this substitution in place, you can run the following command:
sed -i "s/<PROJECT_ID>/$PROJECT_ID/g" deploy-agent.yaml
Make sure the readinessProbe path is set to / instead of /dev-ui. To perform this substitution in place, you can run the following command:
sed -i "s|path: /dev-ui/|path: /|g" deploy-agent.yaml
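As an optional sanity check before applying the manifest (a simple grep, not part of the original steps), confirm that the image URL and readiness probe path now look the way you expect:
# Show the image and readiness probe path lines after the substitutions.
grep -nE "image:|path:" deploy-agent.yaml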
Apply the manifest: Run the following command to apply the deploy-agent.yaml manifest to your cluster:
kubectl apply -f deploy-agent.yaml
This command creates two Kubernetes resources:
- A Deployment named adk-agent that runs your custom-built agent container image.
- A Service named adk-agent of type NodePort that exposes the agent application so it can be accessed for testing.
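Optionally, inspect the Service to see its type and the NodePort that GKE assigned:
# Shows the Service type, cluster IP, and port mappings for the agent.
kubectl get service adk-agent -o wide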
Verify the agent deployment: Check the status of the Pod to ensure it is running correctly.
Wait for the Deployment to become available:
kubectl wait --for=condition=available --timeout=300s deployment/adk-agent
View the logs from the running agent Pod:
export AGENT_POD=$(kubectl get pods -l app=adk-agent -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $AGENT_POD
The deployment is successful when you see log output similar to the following, indicating that the Uvicorn server is running and ready to accept requests:
INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
Test your deployed agent
After successfully deploying both the vLLM server and the agent application, you can test the end-to-end functionality by interacting with the agent's web UI.
Forward the agent's service to your local machine: The adk-agent service is of type NodePort, but the most direct way to access it from your Cloud Shell environment is to use the kubectl port-forward command. Create a secure tunnel to the agent's Pod by running this command:
kubectl port-forward $AGENT_POD 8001:8001
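Alternatively, if you prefer a tunnel that isn't tied to a specific Pod, you can forward through the Service instead. This sketch assumes the adk-agent Service exposes port 8001; check the Service definition in deploy-agent.yaml if the connection fails:
# Forward local port 8001 to port 8001 on the adk-agent Service (assumed port).
kubectl port-forward service/adk-agent 8001:8001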
Access the agent's web UI: In Cloud Shell, click the Web Preview button and select Preview on port 8001. A new browser tab opens, displaying the agent's chat interface.
Interact with the agent: Ask the agent a question that will invoke its get_weather tool. For example:
What's the weather like in Tokyo?
The agent will first call the LLM to understand the intent and identify the need to use the get_weather tool. Then, it will execute the tool with "Tokyo" as the parameter. Finally, it will use the tool's output to generate a response. You should see a response similar to the following:
The weather in Tokyo is 25°C and sunny.
(Optional) Verify the tool call in the logs: You can observe the agent's interaction with the LLM and the tool execution by viewing the logs of the respective Pods.
Agent Pod logs: In a new terminal, view the logs of the adk-agent Pod. You see the tool call and its result.
kubectl logs -f $AGENT_POD
The output shows the tool being called and the result being processed.
LLM Pod logs: View the logs of the vllm-llama3-deployment Pod to see the incoming request from the agent.
kubectl logs -f $LLM_POD
The logs show the full prompt sent by the agent to the LLM, including the system message, your query, and the definition of the get_weather tool.
After you finish testing, you can terminate the port-forward process by returning to its terminal window and pressing Ctrl+C.