This tutorial demonstrates how to deploy and manage containerized agentic AI/ML applications by using Google Kubernetes Engine (GKE). By combining the Google Agent Development Kit (ADK) with a self-hosted large language model (LLM) like Llama 3.1 served by vLLM, you can operationalize AI agents efficiently and at scale while maintaining full control over the model stack. This tutorial walks you through the end-to-end process of taking a Python-based agent from development to production deployment on a GKE Autopilot cluster with GPU acceleration.
This tutorial is intended for machine learning (ML) engineers, developers, and cloud architects who are interested in using Kubernetes container orchestration capabilities for serving agentic AI/ML applications. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin, ensure you are familiar with the following:
Background
This section describes the key technologies used in this tutorial.
Agent Development Kit (ADK)
Agent Development Kit (ADK) is a flexible and modular framework for developing and deploying AI agents. Although it's optimized for Gemini and the Google ecosystem, ADK doesn't require you to use a specific model or deployment, and is built for compatibility with other frameworks. ADK was designed to make agent development feel more like software development, to make it easier for developers to create, deploy, and orchestrate agentic architectures that range from basic tasks to complex workflows.
For more information, see the ADK documentation.
GKE managed Kubernetes service
Google Cloud offers a range of services, including GKE, which is well-suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure—including scalable resources, distributed computing, and efficient networking—to handle the computational demands of LLMs.
For more information about key Kubernetes concepts, see Start learning about Kubernetes. For more information about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.
vLLM
vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:
- Optimized transformer implementation with PagedAttention.
- Continuous batching to improve the overall serving throughput.
- Tensor parallelism and distributed serving on multiple GPUs.
For more information, see the vLLM documentation.
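For illustration only, the following sketch shows how vLLM's OpenAI-compatible server can be started from the command line with tensor parallelism across two GPUs. You don't run this command in this tutorial; the deploy-llm.yaml manifest configures an equivalent server inside the cluster for you.
# Illustrative sketch: start an OpenAI-compatible vLLM server on port 8000,
# splitting the model across 2 GPUs with tensor parallelism.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --port 8000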
Prepare the environment
This tutorial uses Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl, terraform, and the Google Cloud CLI.
To set up your environment with Cloud Shell, follow these steps:
- In the Google Cloud console, launch a Cloud Shell session and click Activate Cloud Shell. This action launches a session in a Google Cloud console pane.
Set the default environment variables:
gcloud config set project PROJECT_ID
export GOOGLE_CLOUD_REGION=REGION
export PROJECT_ID=PROJECT_ID
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: the Google Cloud region (for example, us-east4) to provision your GKE cluster, Artifact Registry, and other regional resources. Make sure to specify a region that supports L4 GPUs and G2 machine type instances. To check for region availability, see GPU regions and zones in the Compute Engine documentation.
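If you prefer to check availability from the command line, the following sketch uses standard gcloud list filtering to show the zones that offer the nvidia-l4 accelerator type:
# List the zones where the nvidia-l4 accelerator type is available.
gcloud compute accelerator-types list --filter="name=nvidia-l4"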
Clone the sample project
From your Cloud Shell terminal, clone the tutorial's sample code repository:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
Navigate to the tutorial directory:
cd kubernetes-engine-samples/ai-ml/adk-vllm
Create and configure Google Cloud resources
To deploy your agent, you first need to provision the necessary Google Cloud resources. You can create the GKE cluster and Artifact Registry repository using either the gcloud CLI or Terraform.
gcloud
This section provides gcloud CLI commands to set up your GKE cluster and Artifact Registry.
Create a GKE cluster: You can deploy your containerized agentic application in a GKE Autopilot or Standard cluster. Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see About GKE modes of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION
Replace CLUSTER_NAME with the name of your GKE cluster.
With Autopilot, GKE automatically provisions nodes based on your workload's resource requests. The GPU required for the LLM is requested in the deploy-llm.yaml manifest by using a nodeSelector.
To add a nodeSelector to request the nvidia-l4 GPU, follow these steps:
- Open kubernetes-engine-samples/ai-ml/adk-vllm/deploy-llm/deploy-llm.yaml in an editor.
- Add the following nodeSelector under spec.template.spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
Standard
In Cloud Shell, create a Standard cluster by running the following command:
gcloud container clusters create CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION
Replace CLUSTER_NAME with the name of your GKE cluster.
Create a GPU-enabled node pool for your cluster by running the following command:
gcloud container node-pools create gpu-node-pool \
    --cluster=CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1 \
    --enable-gvnic
The deploy-llm.yaml file specifies an nvidia-l4 GPU, which is available in the G2 machine series. For more information about this machine type, see GPU machine types in the Compute Engine documentation.
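Optionally, confirm that the node pool was created with the expected machine type. This is a quick check; the node pool name matches the command above:
# List the node pools in the cluster, including their machine types.
gcloud container node-pools list \
    --cluster=CLUSTER_NAME \
    --location=$GOOGLE_CLOUD_REGION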
Create an Artifact Registry repository: Create an Artifact Registry repository to securely store and manage your agent's Docker container image.
gcloud artifacts repositories create REPO_NAME \
    --repository-format=docker \
    --location=$GOOGLE_CLOUD_REGION
Replace REPO_NAME with the name of the Artifact Registry repository you want to use (for example, adk-repo).
Get the repository URL: To verify the full path to your repository, run this command. You'll use this format to tag your Docker image when you build the agent image.
gcloud artifacts repositories describe REPO_NAME \
    --location $GOOGLE_CLOUD_REGION
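Images in a Docker-format Artifact Registry repository are addressed as LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE:TAG. For example, the agent image that you build later in this tutorial uses the following path, shown here only to illustrate the format:
# Illustrates the image path format used later when building the agent image.
echo "${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME/adk-agent:latest"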
Terraform
This section describes how to use the Terraform configuration included in the sample repository to provision your Google Cloud resources automatically.
Navigate to the Terraform directory: The /terraform directory contains all the necessary configuration files to create the GKE cluster and other required resources.
cd terraform
Create a Terraform variables file: Copy the provided example variables file (example_vars.tfvars) to create your own vars.tfvars file.
cp example_vars.tfvars vars.tfvars
Open the vars.tfvars file in an editor and replace the placeholder values with your specific configuration. At a minimum, you must replace PROJECT_ID with your Google Cloud project ID and CLUSTER_NAME with the name of your GKE cluster.
Initialize Terraform: To download the necessary provider plugins for Google Cloud, run the following command:
terraform init
Review the execution plan: This command shows the infrastructure changes Terraform will make.
terraform plan -var-file=vars.tfvars
Apply the configuration: To create the resources in your Google Cloud project, execute the Terraform plan. Confirm with yes when prompted.
terraform apply -var-file=vars.tfvars
After you run these commands, Terraform provisions your GKE cluster and Artifact Registry repository, and configures the necessary IAM roles and service accounts, including Workload Identity Federation for GKE.
To learn more about using Terraform, see Provision GKE resources with Terraform.
Configure kubectl to communicate with your cluster
To configure kubectl to communicate with your cluster, run the following command:
gcloud container clusters get-credentials CLUSTER_NAME \
--location=${GOOGLE_CLOUD_REGION}
Replace CLUSTER_NAME with the name of your GKE cluster.
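As a quick check, confirm that kubectl can reach the cluster's control plane:
# Displays the control plane endpoint if the credentials were configured correctly.
kubectl cluster-info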
Build the agent image
After you create the infrastructure by using either the gcloud CLI or Terraform, follow these steps to build your agent application.
Grant the required IAM role for Cloud Build: The Cloud Build service requires permissions to push the agent's container image to Artifact Registry. Grant the roles/artifactregistry.writer role to the Compute Engine default service account, which is used by Cloud Build.
Construct the email address for the Compute Engine default service account:
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
export COMPUTE_SA_EMAIL=${PROJECT_NUMBER}-compute@developer.gserviceaccount.com
Grant the roles/artifactregistry.writer role to the service account:
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:${COMPUTE_SA_EMAIL} \
    --role=roles/artifactregistry.writer
Build and push the agent container image: From the project root directory (adk/llama/vllm), build your Docker image and push it to your Artifact Registry by running these commands:
export IMAGE_URL="${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME/adk-agent:latest"
gcloud builds submit --tag $IMAGE_URL
Verify that the image has been pushed: After the build process completes successfully, verify that your agent's container image was pushed to Artifact Registry by listing the images in your repository.
gcloud artifacts docker images list ${GOOGLE_CLOUD_REGION}-docker.pkg.dev/${PROJECT_ID}/REPO_NAME
You should see output that lists the image you just pushed and tagged as latest.
Deploy the model
After setting up your GKE cluster and building the agent image, the next step is to deploy the self-hosted Llama 3.1 model to your cluster. To do this, deploy a pre-configured vLLM inference server that pulls the model from Hugging Face and serves it internally within the cluster.
Create a Kubernetes secret for Hugging Face credentials: To allow the GKE cluster to download the gated Llama 3.1 model, you must provide your Hugging Face token as a Kubernetes secret. The deploy-llm.yaml manifest is configured to use this secret for authentication.
kubectl create secret generic hf-secret \
    --from-literal=hf-token-secret=HUGGING_FACE_TOKEN
Replace HUGGING_FACE_TOKEN with your Hugging Face access token.
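Optionally, verify that the secret exists before you deploy the model. The token value itself is not displayed:
# Confirms that the hf-secret object was created in the current namespace.
kubectl get secret hf-secret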
View the manifest: From your project root directory (adk/llama/vllm), navigate to the /deploy-llm directory that contains the model Deployment manifest.
cd deploy-llm
Apply the manifest: Run the following command to apply the deploy-llm.yaml manifest to your cluster:
kubectl apply -f deploy-llm.yaml
The command creates three Kubernetes resources:
- A Deployment that runs the vLLM server, configured to use the meta-llama/Llama-3.1-8B-Instruct model.
- A Service named vllm-llama3-service that exposes the vLLM server on an internal cluster IP address, allowing the ADK agent to communicate with it.
- A ConfigMap containing a Jinja chat template required by the Llama 3.1 model.
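As a quick check, you can list the objects in the current namespace to confirm that all three resources exist. This is a generic listing; the exact resource names come from deploy-llm.yaml:
# List the Deployments, Services, and ConfigMaps created by the manifest.
kubectl get deployments,services,configmaps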
Verify the model Deployment: The vLLM server pulls the model files from Hugging Face. This process can take several minutes. You can monitor the status of the Pod to ensure its readiness.
Wait for the Deployment to become available.
kubectl wait --for=condition=available --timeout=600s deployment/vllm-llama3-deployment
View the logs from the running Pod to confirm that the server started successfully.
export LLM_POD=$(kubectl get pods -l app=vllm-llama3 -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $LLM_POD
The deployment is ready when you see log output similar to the following, indicating the LLM server has started and the API routes are available:
INFO 07-16 14:15:16 api_server.py:129] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Send a request directly to the model server to confirm that the LLM is ready. To do this, open a new Cloud Shell terminal and run the following command to forward the vllm-llama3-service to your local machine:
kubectl port-forward service/vllm-llama3-service 8000:8000
In another terminal, send a sample request to the model's API endpoint by using curl. For example:
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "Hello!",
      "max_tokens": 10
    }'
If the command returns a successful JSON response, your LLM is ready. You can now terminate the port-forward process by returning to its terminal window and pressing Ctrl+C, and then proceed to deploy the agent.
Deploy the agent application
The next step is to deploy the ADK-based agent application.
Navigate to the /deploy-agent directory: From your project root directory (adk/llama/vllm), navigate to the /deploy-agent directory that contains the agent's source code and deployment manifest.
cd ../deploy-agent
Update the agent deployment manifest: The sample deploy-agent.yaml manifest file contains a placeholder for your project ID in the container image URL. You must replace the placeholder with your Google Cloud project ID.
image: us-central1-docker.pkg.dev/PROJECT_ID/adk-repo/adk-agent:latest
To perform this substitution in place, you can run the following command:
sed -i "s/<PROJECT_ID>/$PROJECT_ID/g" deploy-agent.yaml
Make sure the readinessProbe path is set to / instead of /dev-ui. To perform this substitution in place, you can run the following command:
sed -i "s|path: /dev-ui/|path: /|g" deploy-agent.yaml
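As an optional sanity check before applying the manifest (a simple grep, not part of the original steps), confirm that the image URL and readiness probe path now look the way you expect:
# Show the image and readiness probe path lines after the substitutions.
grep -nE "image:|path:" deploy-agent.yaml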
Apply the manifest: Run the following command to apply the deploy-agent.yaml manifest to your cluster:
kubectl apply -f deploy-agent.yaml
This command creates two Kubernetes resources:
- A Deployment named adk-agent that runs your custom-built agent container image.
- A Service named adk-agent of type NodePort that exposes the agent application so it can be accessed for testing.
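Optionally, inspect the Service to see its type and the NodePort that GKE assigned:
# Shows the Service type, cluster IP, and port mappings for the agent.
kubectl get service adk-agent -o wide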
Verify the agent deployment: Check the status of the Pod to ensure it is running correctly.
Wait for the Deployment to become available:
kubectl wait --for=condition=available --timeout=300s deployment/adk-agent
View the logs from the running agent Pod:
export AGENT_POD=$(kubectl get pods -l app=adk-agent -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $AGENT_POD
The deployment is successful when you see log output similar to the following, indicating that the Uvicorn server is running and ready to accept requests:
INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
Test your deployed agent
After successfully deploying both the vLLM server and the agent application, you can test the end-to-end functionality by interacting with the agent's web UI.
Forward the agent's service to your local machine: The adk-agent service is of type NodePort, but the most direct way to access it from your Cloud Shell environment is to use the kubectl port-forward command. Create a secure tunnel to the agent's Pod by running this command:
kubectl port-forward $AGENT_POD 8001:8001
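Alternatively, if you prefer a tunnel that isn't tied to a specific Pod, you can forward through the Service instead. This sketch assumes the adk-agent Service exposes port 8001; check the Service definition in deploy-agent.yaml if the connection fails:
# Forward local port 8001 to port 8001 on the adk-agent Service (assumed port).
kubectl port-forward service/adk-agent 8001:8001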
Access the agent's web UI: In Cloud Shell, click the Web Preview button and select Preview on port 8001. A new browser tab opens, displaying the agent's chat interface.
Interact with the agent: Ask the agent a question that will invoke its get_weather tool. For example:
What's the weather like in Tokyo?
The agent will first call the LLM to understand the intent and identify the need to use the get_weather tool. Then, it will execute the tool with "Tokyo" as the parameter. Finally, it will use the tool's output to generate a response. You should see a response similar to the following:
The weather in Tokyo is 25°C and sunny.
(Optional) Verify the tool call in the logs: You can observe the agent's interaction with the LLM and the tool execution by viewing the logs of the respective Pods.
Agent Pod logs: In a new terminal, view the logs of the adk-agent Pod. You see the tool call and its result.
kubectl logs -f $AGENT_POD
The output shows the tool being called and the result being processed.
LLM Pod logs: View the logs of the vllm-llama3-deployment Pod to see the incoming request from the agent.
kubectl logs -f $LLM_POD
The logs show the full prompt sent by the agent to the LLM, including the system message, your query, and the definition of the get_weather tool.
After you finish testing, you can terminate the port-forward process by returning to its terminal window and pressing Ctrl+C.