This tutorial demonstrates how to deploy and manage containerized agentic AI/ML applications by using Google Kubernetes Engine (GKE). By combining the Google Agent Development Kit (ADK) with a self-hosted large language model (LLM) like Llama 3.1 served by vLLM, you can operationalize AI agents efficiently and at scale while maintaining full control over the model stack. This tutorial walks you through the end-to-end process of taking a Python-based agent from development to production deployment on a GKE Autopilot cluster with GPU acceleration.
This tutorial is intended for Machine learning (ML) engineers, Developers, and Cloud architects who are interested in using Kubernetes container orchestration capabilities for serving agentic AI/ML applications. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin, ensure you are familiar with the following:
Background
This section describes the key technologies used in this tutorial.
Agent Development Kit (ADK)
Agent Development Kit (ADK) is a flexible and modular framework for developing and deploying AI agents. Although it's optimized for Gemini and the Google ecosystem, ADK doesn't require you to use a specific model or deployment, and is built for compatibility with other frameworks. ADK was designed to make agent development feel more like software development, to make it easier for developers to create, deploy, and orchestrate agentic architectures that range from basic tasks to complex workflows.
For more information, see the ADK documentation.
GKE managed Kubernetes service
Google Cloud offers a range of services, including GKE, which is well-suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure—including scalable resources, distributed computing, and efficient networking—to handle the computational demands of LLMs.
For more information about key Kubernetes concepts, see Start learning about Kubernetes. For more information about GKE and how it helps you scale, automate, and manage Kubernetes, see the GKE overview.
vLLM
vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:
- Optimized transformer implementation with PagedAttention.
- Continuous batching to improve the overall serving throughput.
- Tensor parallelism and distributed serving on multiple GPUs.
For more information, see the vLLM documentation.
Objectives
This tutorial shows you how to do the following:
- Set up your Google Cloud environment.
- Provision a GPU-enabled GKE cluster.
- Deploy a Llama 3.1 model by using the vLLM inference server.
- Build a container image for your ADK-based agent.
- Deploy the agent to the GKE cluster and connect it to the self-hosted LLM.
- Test your deployed agent.
Architecture
This tutorial presents a scalable architecture for deploying agentic AI applications on GKE. The ADK agent application runs on a standard CPU node pool and the self-hosted LLM (Llama 3.1 on vLLM) runs on a GPU-enabled node pool, both within the same GKE cluster. This architecture separates the agent's application logic from the LLM inference workload, which allows each component to be scaled and managed independently.
The architecture has two core components, each running in its own GKE Deployment:
- ADK agent application: your agent's custom-built business logic and tools (like get_weather) are packaged in a container image. The image runs on a standard CPU node pool and communicates with the LLM through an internal Kubernetes Service.
- Self-hosted LLM (Llama 3.1 on vLLM): the Llama 3.1 model runs on a dedicated vLLM server on a GPU-enabled node pool. This Deployment uses a public container image (vllm/vllm-openai:v0.8.5) that's configured to download and serve the specified model from Hugging Face when the container starts. The agent communicates with this server through a REST API exposed by the vllm-llama3-service Kubernetes Service.
Both the ADK agent and the vLLM deployments run on the same GKE cluster. This colocation within a single cluster simplifies networking, management, and deployment while still allowing for specialized hardware assignment for components of the application.
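For example, because the two components share a cluster, the agent can reach the model server through the LLM Service's cluster-internal DNS name. The following fragment is purely illustrative (the environment variable name LLM_BASE_URL is hypothetical and not taken from the sample manifests); it shows one way such an endpoint could be passed to the agent container:

# Illustrative fragment of an agent container spec; LLM_BASE_URL is a
# hypothetical variable name, not one used by the sample repository.
env:
  - name: LLM_BASE_URL
    value: "http://vllm-llama3-service:8000/v1"   # cluster-internal DNS name of the vLLM Service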
Costs
This tutorial uses the following billable components of Google Cloud:
Review the pricing for each service to understand potential costs.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the required APIs.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/artifactregistry.admin, roles/cloudbuild.builds.editor, roles/resourcemanager.projectIamAdmin
  Check for the roles
  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
  - For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
  Grant the roles
  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - Click Grant access.
  - In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
  - In the Select a role list, select a role.
  - To grant additional roles, click Add another role and add each additional role.
  - Click Save.
- Obtain a read-access token from Hugging Face to download the Llama model. You also need to request access to the Llama 3.1 model.
Prepare the environment
This tutorial uses Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl, terraform, and the Google Cloud CLI.
To set up your environment with Cloud Shell, follow these steps:
- In the Google Cloud console, launch a Cloud Shell session and click Activate Cloud Shell. This action launches a session in a Google Cloud console pane.
- Set the default environment variables:
  gcloud config set project PROJECT_ID
  export GOOGLE_CLOUD_REGION=REGION
  export PROJECT_ID=PROJECT_ID
  Replace the following values:
  - PROJECT_ID: your Google Cloud project ID.
  - REGION: the Google Cloud region (for example, us-east4) where you provision your GKE cluster, Artifact Registry repository, and other regional resources. Make sure to specify a region that supports L4 GPUs and G2 machine type instances. To check for region availability, see GPU regions and zones in the Compute Engine documentation.
Clone the sample project
From your Cloud Shell terminal, clone the tutorial's sample code repository:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
Navigate to the tutorial directory:
cd kubernetes-engine-samples/ai-ml/adk-vllm
Create and configure Google Cloud resources
To deploy your agent, you first need to provision the necessary Google Cloud resources. You can create the GKE cluster and Artifact Registry repository using either the gcloud CLI or Terraform.
gcloud
This section provides gcloud CLI commands to set up your GKE cluster and Artifact Registry.
Create a GKE cluster: You can deploy your containerized agentic application in a GKE Autopilot or Standard cluster. Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see About GKE modes of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION
Replace CLUSTER_NAME with the name of your GKE cluster.
With Autopilot, GKE automatically provisions nodes based on your workload's resource requests. The GPU required for the LLM is requested in the deploy-llm.yaml manifest by using a nodeSelector.
To add a nodeSelector that requests the nvidia-l4 GPU, follow these steps:
- Open kubernetes-engine-samples/ai-ml/adk-vllm/deploy-llm/deploy-llm.yaml in an editor.
- Add the following nodeSelector under spec.template.spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
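If you're unsure where the selector belongs, the following simplified sketch shows its placement under spec.template.spec. Only the nodeSelector lines, the image, and the Deployment and label names come from this tutorial; the actual deploy-llm.yaml contains additional configuration (model arguments, the Hugging Face token secret, the chat-template ConfigMap, probes, and so on), so treat this as an illustration rather than a replacement for the file:

# Simplified, illustrative Deployment fragment; not the repository's manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # the selector added in this step
      containers:
        - name: vllm-server                           # placeholder container name
          image: vllm/vllm-openai:v0.8.5
          resources:
            limits:
              nvidia.com/gpu: "1"                     # Autopilot provisions an L4 GPU node for this request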
Standard
In Cloud Shell, create a Standard cluster by running the following command:
gcloud container clusters create CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION
Replace CLUSTER_NAME with the name of your GKE cluster.
Create a GPU-enabled node pool for your cluster by running the following command:
gcloud container node-pools create gpu-node-pool \
    --cluster=CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1 \
    --enable-gvnic
The deploy-llm.yaml file specifies an nvidia-l4 GPU, which is available in the G2 machine series. For more information about this machine type, see GPU machine types in the Compute Engine documentation.
Create an Artifact Registry repository: Create an Artifact Registry repository to securely store and manage your agent's Docker container image.
gcloud artifacts repositories create REPO_NAME \
    --repository-format=docker \
    --location=$GOOGLE_CLOUD_REGION
Replace REPO_NAME with the name of the Artifact Registry repository you want to use (for example, adk-repo).
Get the repository URL: To verify the full path to your repository, run this command. You'll use this format to tag your Docker image when you build the agent image.
gcloud artifacts repositories describe REPO_NAME \
    --location $GOOGLE_CLOUD_REGION
Terraform
This section describes how to use the Terraform configuration included in the sample repository to provision your Google Cloud resources automatically.
Navigate to the Terraform directory: The /terraform directory contains all the necessary configuration files to create the GKE cluster and other required resources.
cd terraform
Create a Terraform variables file: Copy the provided example variables file (example_vars.tfvars) to create your own vars.tfvars file.
cp example_vars.tfvars vars.tfvars
Open the vars.tfvars file in an editor and replace the placeholder values with your specific configuration. At a minimum, you must replace PROJECT_ID with your Google Cloud project ID and CLUSTER_NAME with the name of your GKE cluster.
Initialize Terraform: To download the necessary provider plugins for Google Cloud, run this command.
terraform init
Review the execution plan: This command shows the infrastructure changes Terraform will make.
terraform plan -var-file=vars.tfvars
Apply the configuration: To create the resources in your Google Cloud project, execute the Terraform plan. Confirm with yes when prompted.
terraform apply -var-file=vars.tfvars
After you run these commands, Terraform provisions your GKE cluster and Artifact Registry repository, and configures the necessary IAM roles and service accounts, including Workload Identity Federation for GKE.
To learn more about using Terraform, see Provision GKE resources with Terraform.
Configure kubectl to communicate with your cluster
To configure kubectl to communicate with your cluster, run the following command:
gcloud container clusters get-credentials CLUSTER_NAME \
--location=${GOOGLE_CLOUD_REGION}
Replace CLUSTER_NAME with the name of your GKE cluster.
Build the agent image
After you create the infrastructure using either the gcloud CLI or Terraform, follow these steps to build your agent application.
Grant the required IAM role for Cloud Build: The Cloud Build service requires permissions to push the agent's container image to Artifact Registry. Grant the roles/artifactregistry.writer role to the Compute Engine default service account, which is used by Cloud Build.
- Construct the email for the Compute Engine default service account:
  export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
  export COMPUTE_SA_EMAIL=${PROJECT_NUMBER}-compute@developer.gserviceaccount.com
- Grant the roles/artifactregistry.writer role to the service account:
  gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member=serviceAccount:${COMPUTE_SA_EMAIL} \
      --role=roles/artifactregistry.writer
Build and push the agent container image: From the project root directory (adk/llama/vllm), build your Docker image and push it to your Artifact Registry by running these commands.
export IMAGE_URL="${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME/adk-agent:latest"
gcloud builds submit --tag $IMAGE_URL
Verify that the image has been pushed: After the build process completes successfully, verify that your agent's container image was pushed to Artifact Registry by listing the images in your repository.
gcloud artifacts docker images list ${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME
You should see output that lists the image you just pushed, tagged as latest.
Deploy the model
After setting up your GKE cluster and building the agent image, the next step is to deploy the self-hosted Llama 3.1 model to your cluster. To do this, deploy a pre-configured vLLM inference server that pulls the model from Hugging Face and serves it internally within the cluster.
Create a Kubernetes secret for Hugging Face credentials: To allow the GKE cluster to download the gated Llama 3.1 model, you must provide your Hugging Face token as a Kubernetes secret. The deploy-llm.yaml manifest is configured to use this secret for authentication.
kubectl create secret generic hf-secret \
    --from-literal=hf-token-secret=HUGGING_FACE_TOKEN
Replace HUGGING_FACE_TOKEN with your token.
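If you prefer to manage the credential declaratively, the same secret can be expressed as a manifest. The following sketch is equivalent to the command above; note that stringData stores the token unencrypted in the file, so don't commit it to source control:

# Declarative equivalent of the kubectl create secret command above.
# The secret name (hf-secret) and key (hf-token-secret) match what deploy-llm.yaml expects.
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  hf-token-secret: HUGGING_FACE_TOKEN   # replace with your Hugging Face token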
View the manifest: From your project root directory (adk/llama/vllm), navigate to the /deploy-llm directory that contains the model Deployment manifest.
cd deploy-llm
Apply the manifest: Run the following command to apply the deploy-llm.yaml manifest to your cluster.
kubectl apply -f deploy-llm.yaml
The command creates three Kubernetes resources:
- A Deployment that runs the vLLM server, configured to use the meta-llama/Llama-3.1-8B-Instruct model.
- A Service named vllm-llama3-service that exposes the vLLM server on an internal cluster IP address, allowing the ADK agent to communicate with it.
- A ConfigMap containing a Jinja chat template required by the Llama 3.1 model.
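For context, the Service that the agent calls has roughly the following shape. This is a simplified sketch, not the exact definition in deploy-llm.yaml; the port is inferred from the port-forward command and server logs shown later in this tutorial:

# Simplified illustration of the internal Service for the vLLM server.
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-service
spec:
  type: ClusterIP             # internal only; the agent reaches it at http://vllm-llama3-service:8000
  selector:
    app: vllm-llama3          # matches the labels on the vLLM Deployment's Pods
  ports:
    - port: 8000
      targetPort: 8000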
Verify the model Deployment: The vLLM server pulls the model files from Hugging Face. This process can take several minutes. You can monitor the status of the Pod to ensure its readiness.
Wait for the Deployment to become available.
kubectl wait --for=condition=available --timeout=600s deployment/vllm-llama3-deployment
View the logs from the running Pod to confirm that the server started successfully.
export LLM_POD=$(kubectl get pods -l app=vllm-llama3 -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $LLM_POD
The deployment is ready when you see log output similar to the following, indicating the LLM server has started and the API routes are available:
INFO 07-16 14:15:16 api_server.py:129] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Send a request directly to the model server to confirm that the LLM is ready. To do this, open a new Cloud Shell terminal and run the following command to forward the vllm-llama3-service to your local machine:
kubectl port-forward service/vllm-llama3-service 8000:8000
In another terminal, send a sample request to the model's API endpoint by using curl. For example:
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "Hello!",
      "max_tokens": 10
    }'
If the command returns a successful JSON response, your LLM is ready. You can now terminate the port-forward process by returning to its terminal window and pressing Ctrl+C, then proceed to deploy the agent.
Deploy the agent application
The next step is to deploy the ADK-based agent application.
Navigate to the /deploy-agent directory: From your project root directory (adk/llama/vllm), navigate to the /deploy-agent directory that contains the agent's source code and deployment manifest.
cd ../deploy-agent
Update the agent deployment manifest:
The sample deploy-agent.yaml manifest file contains a placeholder for your project ID in the container image URL. You must replace the placeholder with your Google Cloud project ID.
image: us-central1-docker.pkg.dev/PROJECT_ID/adk-repo/adk-agent:latest
To perform this substitution in place, you can run the following command:
sed -i "s/<PROJECT_ID>/$PROJECT_ID/g" deploy-agent.yaml
Make sure the readinessProbe path is set to / instead of /dev-ui. To perform this substitution in place, you can run the following command:
sed -i "s|path: /dev-ui/|path: /|g" deploy-agent.yaml
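After the substitution, the probe section of the manifest should look similar to the following sketch. The port matches the Uvicorn port shown later in this tutorial, while the timing values are placeholders that might differ from the repository's manifest:

readinessProbe:
  httpGet:
    path: /                  # the sed command changes /dev-ui/ to /
    port: 8001               # the agent's Uvicorn server listens on port 8001
  initialDelaySeconds: 10    # placeholder value
  periodSeconds: 10          # placeholder value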
Apply the manifest: Run the following command to apply the deploy-agent.yaml manifest to your cluster.
kubectl apply -f deploy-agent.yaml
This command creates two Kubernetes resources:
- A Deployment named adk-agent that runs your custom-built agent container image.
- A Service named adk-agent of type NodePort that exposes the agent application so it can be accessed for testing.
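As an illustration of the second resource, a NodePort Service for the agent looks roughly like the following sketch (see deploy-agent.yaml in the repository for the actual definition):

# Simplified illustration of the agent's Service; values are inferred from this tutorial.
apiVersion: v1
kind: Service
metadata:
  name: adk-agent
spec:
  type: NodePort              # exposes the agent outside the Pod network for testing
  selector:
    app: adk-agent            # matches the labels on the agent Deployment's Pods
  ports:
    - port: 8001
      targetPort: 8001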
Verify the agent deployment: Check the status of the Pod to ensure it is running correctly.
Wait for the Deployment to become available:
kubectl wait --for=condition=available --timeout=300s deployment/adk-agent
View the logs from the running agent Pod:
export AGENT_POD=$(kubectl get pods -l app=adk-agent -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $AGENT_POD
The deployment is successful when you see log output similar to the following, indicating that the Uvicorn server is running and ready to accept requests:
INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
Test your deployed agent
After successfully deploying both the vLLM server and the agent application, you can test the end-to-end functionality by interacting with the agent's web UI.
Forward the agent's service to your local machine: The adk-agent service is of type NodePort, but the most direct way to access it from your Cloud Shell environment is to use the kubectl port-forward command. Create a secure tunnel to the agent's Pod by running this command.
kubectl port-forward $AGENT_POD 8001:8001
Access the agent's web UI: In Cloud Shell, click the Web Preview button and select Preview on port 8001. A new browser tab opens, displaying the agent's chat interface.
Interact with the agent: Ask the agent a question that will invoke its get_weather tool. For example:
What's the weather like in Tokyo?
The agent will first call the LLM to understand the intent and identify the need to use the get_weather tool. Then, it will execute the tool with "Tokyo" as the parameter. Finally, it will use the tool's output to generate a response. You should see a response similar to the following:
The weather in Tokyo is 25°C and sunny.
(Optional) Verify the tool call in the logs: You can observe the agent's interaction with the LLM and the tool execution by viewing the logs of the respective Pods.
Agent Pod logs: In a new terminal, view the logs of the adk-agent Pod. You see the tool call and its result.
kubectl logs -f $AGENT_POD
The output shows the tool being called and the result being processed.
LLM Pod logs: View the logs of the vllm-llama3-deployment Pod to see the incoming request from the agent.
kubectl logs -f $LLM_POD
The logs show the full prompt sent by the agent to the LLM, including the system message, your query, and the definition of the get_weather tool.
After you finish testing, you can terminate the port-forward process by returning to its terminal window and pressing Ctrl+C.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this tutorial, run the following commands:
gcloud
If you used the gcloud CLI to create your resources, run the following commands to delete the GKE cluster and the Artifact Registry repository, and return the service account's permissions to their original state.
gcloud container clusters delete CLUSTER_NAME \
--location=$GOOGLE_CLOUD_REGION
gcloud artifacts repositories delete REPO_NAME \
--location=$GOOGLE_CLOUD_REGION
gcloud projects remove-iam-policy-binding $PROJECT_ID \
--member=serviceAccount:${COMPUTE_SA_EMAIL} \
--role=roles/artifactregistry.writer
Terraform
If you used Terraform to provision your infrastructure, you can destroy all the
resources by running a single command from within the /terraform
directory.
From your project root directory (adk/llama/vllm), navigate to the /terraform directory:
cd terraform
Run this command to remove all resources defined in your Terraform configuration files:
terraform destroy
What's next
- Learn how to configure Horizontal Pod Autoscaler (HPA) to automatically adjust your agent's resources on demand (a minimal example follows this list).
- Learn how to configure Identity-Aware Proxy (IAP) for your web applications running on Google Cloud, providing centralized authorization for access to your agent's UI.
- Learn how to use Cloud Logging and Cloud Monitoring to gain insights into your agent's performance and health within your GKE cluster.
- Explore experimental samples in GKE AI Labs that can help you use GKE to accelerate your agentic AI initiatives.
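To illustrate the first item in this list, a minimal HorizontalPodAutoscaler for the agent Deployment could look like the following sketch. The replica counts and CPU target are placeholder values, and the HPA can only compute CPU utilization if the adk-agent Deployment sets CPU resource requests.

# Illustrative HPA sketch for the adk-agent Deployment; thresholds are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: adk-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: adk-agent
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70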