Serve Gemma open models using GPUs on GKE with vLLM

Autopilot Standard

This tutorial shows you how to serve a Gemma large language model (LLM) using graphical processing units (GPUs) on Google Kubernetes Engine (GKE) with the vLLM serving framework. In this tutorial, you download the 2B and 7B parameter instruction tuned and pre-trained Gemma models from Hugging Face and deploy them on a GKE Autopilot or Standard cluster using a container that runs vLLM.

This guide is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads. If you need a unified managed AI platform to rapidly build and serve ML models cost effectively, we recommend that you try our Vertex AI deployment solution.

Background

By serving Gemma using GPUs on GKE with vLLM, you can implement a robust, production-ready inference serving solution with all the benefits of managed Kubernetes, including efficient scalability and higher availability. This section describes the key technologies used in this guide.

Gemma

Gemma is a set of openly available, lightweight, generative artificial intelligence (AI) models released under an open license. These AI models are available to run in your applications, hardware, mobile devices, or hosted services. You can use the Gemma models for text generation, however you can also tune these models for specialized tasks.

To learn more, see the Gemma documentation.

GPUs

GPUs let you accelerate specific workloads running on your nodes such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

Before you use GPUs in GKE, we recommend that you complete the following learning path:

Learn about current GPU version availability
Learn about GPUs in GKE

vLLM

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as:

Optimized transformer implementation with PagedAttention
Continuous batching to improve the overall serving throughput
Tensor parallelism and distributed serving on multiple GPUs

To learn more, refer to the vLLM documentation.

Objectives

This guide is intended for Generative AI customers who use PyTorch, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving LLMs on H100, A100, and L4 GPU hardware.

By the end of this guide, you should be able to perform the following steps:

Prepare your environment with a GKE cluster in Autopilot or Standard mode.
Deploy a vLLM container to your cluster.
Use vLLM to serve the Gemma 2B or 7B model through curl and a web chat interface.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find the row that has your email address.
  
  If your email address isn't in that column, then you do not have any roles.
4. In the Role column for the row with your email address, check whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your email address.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Create a Hugging Face account, if you don't already have one.
Ensure your project has sufficient quota for GPUs. To learn more, see About GPUs and Allocation quotas.

Get access to the model

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement then generate a Hugging Face access token.

You must sign the consent agreement to use Gemma. Follow these instructions:

Access the model consent page on Kaggle.com.
Verify consent using your Hugging Face account.
Accept the model terms.

Generate an access token

To access the model through Hugging Face, you need a Hugging Face token.

Follow these steps to generate a new token if you don't have one already:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a Name of your choice and a Role of at least `Read.
Select Generate a token.
Copy the generated token to your clipboard.

Prepare your environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of Google Cloud console.
Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=REGION
export CLUSTER_NAME=vllm
export HF_TOKEN=HF_TOKEN
```
Replace the following values:
- PROJECT_ID: Your Google Cloud project ID.
- REGION: A region that supports the accelerator type you want to use, for example, us-central1 for L4 GPU.
- HF_TOKEN: The Hugging Face token you generated earlier.

Create and configure Google Cloud resources

Follow these instructions to create the required resources.

Create a GKE cluster and node pool

You can serve Gemma on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use a Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

In Cloud Shell, run the following command:

gcloud container clusters create-auto ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --release-channel=rapid \
  --cluster-version=1.28

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

Standard

In Cloud Shell, run the following command to create a Standard cluster:

gcloud container clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --release-channel=rapid \
  --num-nodes=1

The cluster creation might take several minutes.

Run the following command to create a node pool for your cluster:

gcloud container node-pools create gpupool \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --project=${PROJECT_ID} \
  --location=${REGION} \
  --node-locations=${REGION}-a \
  --cluster=${CLUSTER_NAME} \
  --machine-type=g2-standard-24 \
  --num-nodes=1

GKE creates a single node pool containing two L4 GPUs for each node.

Create a Kubernetes secret for Hugging Face credentials

In Cloud Shell, do the following:

Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
--from-literal=hf_api_token=$HF_TOKEN \
--dry-run=client -o yaml | kubectl apply -f -

Deploy vLLM

In this section, you deploy the vLLM container to serve the Gemma model you want to use. To learn about the instruction tuned and pre-trained models and which one to select for your use case, see Tuned models.

Gemma 2B-it

Follow these instructions to deploy the Gemma 2B instruction tuned model.

Create the following vllm-2b-it.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        resources:
          requests:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: google/gemma-2b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Apply the manifest:
```
kubectl apply -f vllm-2b-it.yaml
```

Gemma 7B-it

Follow these instructions to deploy the Gemma 7B instruction tuned model.

Create the following vllm-7b-it.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-7b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        resources:
          requests:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 2
          limits:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 2
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=2
        env:
        - name: MODEL_ID
          value: google/gemma-7b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Apply the manifest:
```
kubectl apply -f vllm-7b-it.yaml
```

Gemma 2B

Follow these instructions to deploy the Gemma 2B pre-trained model.

Create the following vllm-2b.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2b
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        resources:
          requests:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: google/gemma-2b
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Apply the manifest:
```
kubectl apply -f vllm-2b.yaml
```

Gemma 7B

Follow these instructions to deploy the Gemma 7B pre-trained model.

Create the following vllm-7b.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-7b
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        resources:
          requests:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 2
          limits:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 2
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=2
        env:
        - name: MODEL_ID
          value: google/gemma-7b
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Apply the manifest:
```
kubectl apply -f vllm-7b.yaml
```

A Pod in the cluster downloads the model weights from Hugging Face and starts the serving engine.

Wait for the Deployment to be available:

kubectl wait --for=condition=Available --timeout=700s deployment/vllm-gemma-deployment

View the logs from the running Deployment:

kubectl logs -f -l app=gemma-server

The Deployment resource downloads the model data. This process can take a few minutes. The output is similar to the following:

INFO 01-26 19:02:54 model_runner.py:689] Graph capturing finished in 4 secs.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Make sure the model is fully downloaded before proceeding to the next section.

Serve the model

In this section, you interact with the model.

Set up port forwarding

Run the following command to set up port forwarding to the model:

kubectl port-forward service/llm-service 8000:8000

The output is similar to the following:

Forwarding from 127.0.0.1:8000 -> 8000

Interact with the model using curl

This section shows how you can perform a basic smoke test to verify your deployed pre-trained or instruction tuned models. For simplicity, this section describes the testing approach only using the 2B pre-trained and instruction tuned models.

Pre-trained (2B)

In a new terminal session, use curl to chat with your model:

USER_PROMPT="Java is a"

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
    "prompt": "${USER_PROMPT}",
    "temperature": 0.90,
    "top_p": 1.0,
    "max_tokens": 128
}
EOF

The following output shows an example of the model response:

{"predictions":["Prompt:\nJava is a\nOutput:\n<strong>programming language</strong> that is primarily aimed at developers. It was originally created by creators of the Java Virtual Machine (JVM). Java is multi-paradigm, which means it supports object-oriented, procedural, and functional programming paradigms. Java is object-oriented, which means that it is designed to support classes and objects. Java is a dynamically typed language, which means that the type of variables are not determined at compile time. Java is also a multi-paradigm language, which means it supports more than one programming paradigm. Java is also a very lightweight language, which means that it is a very low level language compared to other popular"]}

Instruction tuned (2B-it)

In a new terminal session, use curl to chat with your model:

USER_PROMPT="I'm new to coding. If you could only recommend one programming language to start with, what would it be and why?"

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
    "prompt": "<start_of_turn>user\n${USER_PROMPT}<end_of_turn>\n",
    "temperature": 0.90,
    "top_p": 1.0,
    "max_tokens": 128
}
EOF

The following output shows an example of the model response:

{"predictions":["Prompt:\n<start_of_turn>user\nI'm new to coding. If you could only recommend one programming language to start with, what would it be and why?<end_of_turn>\nOutput:\n**Python** is an excellent choice for beginners due to the following reasons:\n\n* **Clear and simple syntax:** Python boasts a simple and straightforward syntax that makes it easy to learn the fundamentals of programming.\n* **Extensive libraries and modules:** Python comes with a vast collection of libraries and modules that address various programming tasks, including data manipulation, machine learning, and web development.\n* **Large and supportive community:** Python has a vibrant and active community that offers resources, tutorials, and support to help you along your journey.\n* **Cross-platform compatibility:** Python can be run on various platforms, including Windows, macOS, and"]}

(Optional) Interact with the model through a Gradio chat interface

In this section, you build a web chat application that lets you interact with your instruction tuned model. For simplicity, this section describes only the testing approach using the 2B-it model.

Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Deploy the chat interface

In Cloud Shell, save the following manifest as gradio.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.0
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service:8000"
        - name: LLM_ENGINE
          value: "vllm"
        - name: MODEL_ID
          value: "gemma"
        - name: USER_PROMPT
          value: "<start_of_turn>user\nprompt<end_of_turn>\n"
        - name: SYSTEM_PROMPT
          value: "<start_of_turn>model\nprompt<end_of_turn>\n"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio
spec:
  selector:
    app: gradio
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 7860
  type: ClusterIP

Apply the manifest:
```
kubectl apply -f gradio.yaml
```

Wait for the deployment to be available:

kubectl wait --for=condition=Available --timeout=300s deployment/gradio

Use the chat interface

In Cloud Shell, run the following command:
```
kubectl port-forward service/gradio 8080:8080
```
This creates a port forward from Cloud Shell to the Gradio service.
Click the Web Preview button which can be found on the top right of the Cloud Shell taskbar. Click Preview on Port 8080. A new tab opens in your browser.
Interact with Gemma using the Gradio chat interface. Add a prompt and click Submit.

Troubleshoot issues

If you get the message Empty reply from server, it's possible the container has not finished downloading the model data. Check the Pod's logs again for the Connected message which indicates that the model is ready to serve.
If you see Connection refused, verify that your port forwarding is active.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

gcloud container clusters delete ${CLUSTER_NAME} \
  --region=${REGION}

What's next

Learn more about GPUs in GKE.
Learn how to use Gemma with vLLM on other accelerators, including A100 and H100 GPUs, by viewing the sample code in GitHub.
Learn how to deploy GPU workloads in Autopilot.
Learn how to deploy GPU workloads in Standard.
Explore the vLLM GitHub repository and documentation.
Explore the Vertex AI Model Garden.
Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.

Serve Gemma open models using GPUs on GKE with vLLM

Background

Gemma

GPUs

vLLM

Objectives

Before you begin

Check for the roles

Grant the roles

Get access to the model

Sign the license consent agreement

Generate an access token

Prepare your environment

Create and configure Google Cloud resources

Create a GKE cluster and node pool

Autopilot

Standard

Create a Kubernetes secret for Hugging Face credentials

Deploy vLLM

Gemma 2B-it

Gemma 7B-it

Gemma 2B

Gemma 7B

Serve the model

Set up port forwarding

Interact with the model using curl

Pre-trained (2B)

Instruction tuned (2B-it)

(Optional) Interact with the model through a Gradio chat interface

Deploy the chat interface

Use the chat interface

Troubleshoot issues

Clean up

Delete the deployed resources

What's next