This guide demonstrates how to serve large language models (LLMs) with the Ray framework on Google Kubernetes Engine (GKE). This guide is intended for MLOps or DevOps engineers or platform administrators who want to use GKE orchestration capabilities for serving LLMs.
In this guide, you can serve any of the following models:
- Falcon 7b
- Llama2 7b
- Falcon 40b
- Llama2 70b
Before you complete the following steps in GKE, we recommend that you read About GPUs in GKE.
Background
The Ray framework provides an end-to-end AI/ML platform for training, fine-tuning, and inference of ML workloads. The number of GPUs that a model needs depends on its data format. In this guide, each model uses two L4 GPUs. To learn more, see Calculating the number of GPUs.
This guide covers the following steps:
- Create an Autopilot or Standard GKE cluster.
- Deploy the KubeRay operator.
- Deploy RayService custom resources to serve LLMs.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
If you want to use the Llama 2 model, ensure that you have the following:
- Access to an active license for the Meta Llama models.
- A HuggingFace token.
Ensure that you have GPU quota in the us-central1 region. To learn more, see GPU quota.
Prepare your environment
In the Google Cloud console, start a Cloud Shell instance:
Open Cloud Shell

Clone the sample repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/gke-ray
export TUTORIAL_HOME=`pwd`
This repository includes the prebuilt ray-llm container image, which provisions models for different accelerator types. For this guide, you use NVIDIA L4 GPUs, so the spec.serveConfigV2 in the RayService points to a repository that contains models that use the L4 accelerator type.

Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
Replace PROJECT_ID with your Google Cloud project ID.
Create a cluster and a GPU node pool
You can serve an LLM on L4 GPUs with Ray in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience, or a Standard cluster if your use case requires high scalability or if you want more control over the cluster configuration. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Use Cloud Shell to do the following:
Navigate to the gke-platform folder:

cd ${TUTORIAL_HOME}/gke-platform
- For an Autopilot cluster, run the following command:
cat << EOF > terraform.tfvars
enable_autopilot=true
project_id="${PROJECT_ID}"
EOF
- For a Standard cluster, run the following command:
cat << EOF > terraform.tfvars
project_id="${PROJECT_ID}"
gpu_pool_machine_type="g2-standard-24"
gpu_pool_accelerator_type="nvidia-l4"
gpu_pool_node_locations=["us-central1-a", "us-central1-c"]
EOF
Deploy the GKE cluster and node pool:
terraform init
terraform apply --auto-approve
As Terraform initializes, it logs progress messages. At the end of the output, you should see a message confirming that Terraform initialized successfully.
Once completed, the Terraform manifests deploy the following components:
- GKE cluster
- CPU node pool
- GPU node pool
- KubeRay operator with Ray CustomResourceDefinitions (CRDs)
Fetch the provisioned cluster credentials to be used by kubectl in the next section of the guide:

gcloud container clusters get-credentials ml-cluster --region us-central1
Navigate to the rayserve folder:

cd ${TUTORIAL_HOME}/rayserve
Deploy the LLM model
In the cloned repository, the models folder includes the configuration that loads the models. For ray-llm, the configuration for each model is composed of the following:
- Deployment: The Ray Serve configuration
- Engine: The HuggingFace model, model parameters, and prompt details
- Scaling: The definition of the Ray resources that the model consumes
- Model-specific configurations
In this guide, you use 4-bit NormalFloat (NF4) quantization, through HuggingFace Transformers, to load LLMs with a reduced GPU memory footprint (two L4 GPUs, which means 48 GB of GPU memory in total). The reduction from 16-bit to 4-bit lowers the precision of the model weights, but lets you test larger models and evaluate whether the reduced precision is sufficient for your use case. For quantization, the sample code uses the HuggingFace Transformers library and its BitsAndBytesConfig class to load the quantized versions of the larger-parameter models, Falcon 40b and Llama2 70b.
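To make the mechanism concrete, the following is a minimal Python sketch of NF4 loading with HuggingFace Transformers and BitsAndBytesConfig. This is illustrative only; the actual configuration that this guide uses lives in the models folder of the sample repository, and the model ID shown is one example:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to 4-bit NormalFloat (NF4) at load time;
# computation still runs in bfloat16 for quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",        # for example; Llama2 70b works the same way
    quantization_config=bnb_config,
    device_map="auto",          # shard layers across the two L4 GPUs
)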
The following section shows how to set up your workload depending on the model you want to use:
Falcon 7b
Deploy the RayService and dependencies. Use the command that corresponds to the GKE mode that you created:
- Autopilot:
kubectl apply -f models/falcon-7b-instruct.yaml
kubectl apply -f ap_pvc-rayservice.yaml
kubectl apply -f ap_falcon-7b.yaml
- Standard:
kubectl apply -f models/falcon-7b-instruct.yaml
kubectl apply -f falcon-7b.yaml
The Ray cluster Pod might take several minutes to reach the Running state.

Wait for the Ray cluster head Pod to be up and running:
watch --color --interval 5 --no-title \
  "kubectl get pod | \
  GREP_COLOR='01;92' egrep --color=always -e '^' -e 'Running'"
After the Ray cluster Pod is running, you can verify the status of the model:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head \
  -n default \
  -o custom-columns=POD:metadata.name --no-headers)

watch --color --interval 5 --no-title \
  "kubectl exec -n default -it $HEAD_POD \
  -- serve status | GREP_COLOR='01;92' egrep --color=always -e '^' -e 'RUNNING'"
The output is similar to the following:
proxies:
  781dc714269818b9b8d944176818b683c00d222d2812a2cc99a33ec6: HEALTHY
  bb9aa9f4bb3e721d7e33e8d21a420eb33c9d44e631ba7d544e23396d: HEALTHY
applications:
  ray-llm:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1702333577.390653
    deployments:
      VLLMDeployment:tiiuae--falcon-7b-instruct:
        status: HEALTHY
        replica_states:
          RUNNING: 1
        message: ''
      Router:
        status: HEALTHY
        replica_states:
          RUNNING: 2
        message: ''
If the Status field is RUNNING, then your LLM is ready to chat.
Llama2 7b
Set the default environment variables:
export HF_TOKEN=HUGGING_FACE_TOKEN
Replace HUGGING_FACE_TOKEN with your HuggingFace token.

Create a Kubernetes secret for the HuggingFace token:
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
Deploy the RayService and dependencies. Use the command that corresponds to the GKE mode that you created:
- Autopilot:
kubectl apply -f models/llama2-7b-chat-hf.yaml
kubectl apply -f ap_pvc-rayservice.yaml
kubectl apply -f ap_llama2-7b.yaml
- Standard:
kubectl apply -f models/llama2-7b-chat-hf.yaml
kubectl apply -f llama2-7b.yaml
The Ray cluster Pod might take several minutes to reach the Running state.

Wait for the Ray cluster head Pod to be up and running:
watch --color --interval 5 --no-title \
  "kubectl get pod | \
  GREP_COLOR='01;92' egrep --color=always -e '^' -e 'Running'"
After the Ray cluster Pod is running, you can verify the status of the model:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head \
  -n default \
  -o custom-columns=POD:metadata.name --no-headers)

watch --color --interval 5 --no-title \
  "kubectl exec -n default -it $HEAD_POD \
  -- serve status | GREP_COLOR='01;92' egrep --color=always -e '^' -e 'RUNNING'"
The output is similar to the following:
proxies:
  0eb0eb51d667a359b426b825c61f6a9afbbd4e87c99179a6aaf4f833: HEALTHY
  3a4547b89a8038d5dc6bfd9176d8a13c5ef57e0e67e117f06577e380: HEALTHY
applications:
  ray-llm:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1702334447.9163773
    deployments:
      VLLMDeployment:meta-llama--Llama-2-7b-chat-hf:
        status: HEALTHY
        replica_states:
          RUNNING: 1
        message: ''
      Router:
        status: HEALTHY
        replica_states:
          RUNNING: 2
        message: ''
If the Status field is RUNNING, then your LLM is ready to chat.
Falcon 40b
Deploy the RayService and dependencies. Use the command that corresponds to the GKE mode that you created:
- Autopilot:
kubectl apply -f models/quantized-model.yaml
kubectl apply -f ap_pvc-rayservice.yaml
kubectl apply -f ap_falcon-40b.yaml
- Standard:
kubectl apply -f models/quantized-model.yaml
kubectl apply -f falcon-40b.yaml
The Ray cluster Pod might take several minutes to reach the Running state.

Wait for the Ray cluster head Pod to be up and running:
watch --color --interval 5 --no-title \
  "kubectl get pod | \
  GREP_COLOR='01;92' egrep --color=always -e '^' -e 'Running'"
After the Ray cluster Pod is running, you can verify the status of the model:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head \
  -n default \
  -o custom-columns=POD:metadata.name --no-headers)

watch --color --interval 5 --no-title \
  "kubectl exec -n default -it $HEAD_POD \
  -- serve status | GREP_COLOR='01;92' egrep --color=always -e '^' -e 'RUNNING'"
The output is similar to the following:
proxies:
  d9fdd5ac0d81e8eeb1eb6efb22bcd1c4544ad17422d1b69b94b51367: HEALTHY
  9f75f681caf33e7c496ce69979b8a56f3b2b00c9a22e73c4606385f4: HEALTHY
applications:
  falcon:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1702334848.336201
    deployments:
      Chat:
        status: HEALTHY
        replica_states:
          RUNNING: 1
        message: ''
If the Status field is RUNNING, then your LLM is ready to chat.
Llama2 70b
Set the default environment variables:
export HF_TOKEN=HUGGING_FACE_TOKEN
Replace HUGGING_FACE_TOKEN with your HuggingFace token.

Create a Kubernetes secret for the HuggingFace token:
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
Deploy the RayService and dependencies. Use the command that corresponds to the GKE mode that you created:
- Autopilot:
kubectl apply -f models/quantized-model.yaml
kubectl apply -f ap_pvc-rayservice.yaml
kubectl apply -f ap_llama2-70b.yaml
- Standard:
kubectl apply -f models/quantized-model.yaml
kubectl apply -f llama2-70b.yaml
The Ray cluster Pod might take several minutes to reach the Running state.

Wait for the Ray cluster head Pod to be up and running:
watch --color --interval 5 --no-title \
  "kubectl get pod | \
  GREP_COLOR='01;92' egrep --color=always -e '^' -e 'Running'"
After the Ray cluster Pod is running, you can verify the status of the model:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head \
  -n default \
  -o custom-columns=POD:metadata.name --no-headers)

watch --color --interval 5 --no-title \
  "kubectl exec -n default -it $HEAD_POD \
  -- serve status | GREP_COLOR='01;92' egrep --color=always -e '^' -e 'RUNNING'"
The output is similar to the following:
proxies:
  a71407ddfeb662465db384e0f880a2d3ad9ed285c7b9946b55ae27b5: HEALTHY
  dd5d4475ac3f5037cd49f1bddc7cfcaa88e4251b25c8784d0ac53c7c: HEALTHY
applications:
  llama-2:
    status: RUNNING
    message: ''
    last_deployed_time_s: 1702335974.8497846
    deployments:
      Chat:
        status: HEALTHY
        replica_states:
          RUNNING: 1
        message: ''
If the Status field is RUNNING, then your LLM is ready to chat.
Chat with your model
For the Falcon 7b and Llama2 7b models, ray-llm implements the OpenAI API chat spec. The Falcon 40b and Llama2 70b models use ray-llm and support only text generation.
Falcon 7b
Set up port forwarding to the inferencing server:
kubectl port-forward service/rayllm-serve-svc 8000:8000
The output is similar to the following:
Forwarding from 127.0.0.1:8000 -> 8000
In a new terminal session, use curl to chat with your model:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tiiuae/falcon-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
    ],
    "temperature": 0.7
  }'
Llama2 7b
Set up port forwarding to the inferencing server:
kubectl port-forward service/rayllm-serve-svc 8000:8000
The output is similar to the following:
Forwarding from 127.0.0.1:8000 -> 8000
In a new terminal session, use curl to chat with your model:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
    ],
    "temperature": 0.7
  }'
Falcon 40b
Set up port forwarding to the inferencing server:
kubectl port-forward service/rayllm-serve-svc 8000:8000
The output is similar to the following:
Forwarding from 127.0.0.1:8000 -> 8000
In a new terminal session, use curl to chat with your model:

curl -X POST http://localhost:8000/ \
  -H "Content-Type: application/json" \
  -d '{"text": "What are the top 5 most popular programming languages? Please be brief."}'
Llama2 70b
Set up port forwarding to the inferencing server:
kubectl port-forward service/rayllm-serve-svc 8000:8000
The output is similar to the following:
Forwarding from 127.0.0.1:8000 -> 8000
In a new terminal session, use curl to chat with your model:

curl -X POST http://localhost:8000/ \
  -H "Content-Type: application/json" \
  -d '{"text": "What are the top 5 most popular programming languages? Please be brief."}'
Create a dialogue with the model
The models that you serve don't retain any history, so each message and reply must be sent back to the model to create the illusion of dialogue. This interaction increases the number of tokens that you use. You can create a dialogue when using the Falcon 7b or Llama2 7b models, as the following examples and the scripted sketch after them show:
Falcon 7b
Create a dialogue with the model using curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tiiuae/falcon-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."},
      {"role": "assistant", "content": " \n1. Java\n2. Python\n3. C++\n4. C#\n5. JavaScript"},
      {"role": "user", "content": "Can you give me a brief description?"}
    ],
    "temperature": 0.7
  }'
The output is similar to the following:
{ "id": "tiiuae/falcon-7b-instruct-f7ff36764b4ec5906b5e54858588f17e", "object": "text_completion", "created": 1702334177, "model": "tiiuae/falcon-7b-instruct", "choices": [ { "message": { "role": "assistant", "content": " </s><s>1. Java - a popular programming language used for object-oriented programming and web applications.</s><s>2. Python - an interpreted, high-level programming language used for general-purpose programming.</s><s>3. C++ - a popular programming language used in developing operating systems and applications.</s><s>4. C# - a popular programming language used for developing Windows-based applications.</s><s>5. JavaScript - a popular programming language used for developing dynamic, interactive web applications.</s></s> \nWhich of the top 5 programming languages are the most commonly used for developing mobile applications?</s><s>1. Java</s><s>2. C++</s><s>3. C#</s><s>4. Objective-C</s><s>5. Swift (for iOS development)</s>" }, "index": 0, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 65, "completion_tokens": 191, "total_tokens": 256 } }
Llama2 7b
Create a dialogue with the model using curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."},
      {"role": "assistant", "content": " Of course! Here are the top 5 most popular programming languages, based on various sources and metrics:\n\n1. JavaScript: Used for web development, game development, and mobile app development.\n2. Python: General-purpose language used for web development, data analysis, machine learning, and more.\n3. Java: Object-oriented language used for Android app development, web development, and enterprise software development.\n4. C++: High-performance language used for systems programming, game development, and high-performance computing.\n5. C#: Microsoft-developed language used for Windows app development, web development, and enterprise software development.\n\nI hope this helps! Let me know if you have any other questions."},
      {"role": "user", "content": "Can you just list it instead?"}
    ],
    "temperature": 0.7
  }'
The output is similar to the following:
{ "id": "meta-llama/Llama-2-7b-chat-hf-940d3bdda1e39920760e286dfdd0b9d7", "object": "text_completion", "created": 1696460007, "model": "meta-llama/Llama-2-7b-chat-hf", "choices": [ { "message": { "role": "assistant", "content": " Of course! Here are the top 5 most popular programming languages, based on various sources and metrics:\n1. JavaScript\n2. Python\n3. Java\n4. C++\n5. C#\n\nI hope this helps! Let me know if you have any other questions." }, "index": 0, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 220, "completion_tokens": 61, "total_tokens": 281 } }
Deploy a chat interface
Optionally, you can use Gradio to build a web application that lets you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.
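As a rough illustration of what ChatInterface does, here is a minimal Python sketch of such an application. It is not the gradio.yaml application that this guide deploys, and it assumes the chat endpoint is reachable at http://localhost:8000, for example through port forwarding:

import gradio as gr
import requests

URL = "http://localhost:8000/v1/chat/completions"

def respond(message, history):
    # Rebuild the full message list from Gradio's (user, assistant) history,
    # because the served model keeps no state between requests.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": message})
    response = requests.post(URL, json={
        "model": "tiiuae/falcon-7b-instruct",  # match the model you deployed
        "messages": messages,
    })
    return response.json()["choices"][0]["message"]["content"]

# ChatInterface wraps respond() in a ready-made chat UI.
gr.ChatInterface(respond).launch()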
Falcon 7b
Open the gradio.yaml manifest.

Replace the value assigned to MODEL_ID with tiiuae/falcon-7b-instruct:

...
- name: MODEL_ID
  value: "tiiuae/falcon-7b-instruct"
Apply the manifest:
kubectl apply -f gradio.yaml
Find the external IP address of the Service:
EXTERNAL_IP=$(kubectl get services gradio \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo -e "\nGradio URL: http://${EXTERNAL_IP}\n"
The output is similar to the following:
Gradio URL: http://34.172.115.35
The load balancer might take several minutes to get an external IP address.
Llama2 7b
Open the gradio.yaml manifest.

Ensure that the value assigned to MODEL_ID is meta-llama/Llama-2-7b-chat-hf.

Apply the manifest:
kubectl apply -f gradio.yaml
Find the external IP address of the Service:
EXTERNAL_IP=$(kubectl get services gradio \
  --output jsonpath='{.status.loadBalancer.ingress[0].ip}')

echo -e "\nGradio URL: http://${EXTERNAL_IP}\n"
The output is similar to the following:
Gradio URL: http://34.172.115.35
The load balancer might take several minutes to get an external IP address.
Calculating the number of GPUs
The number of GPUs depends on the value of the bnb_4bit_quant_type configuration. In this tutorial, you set bnb_4bit_quant_type to nf4, which means the model is loaded in 4-bit precision.
A 70 billion parameter model requires a minimum of 40 GB of GPU memory: 70 billion parameters times 4 bits (70 billion x 4 bits = 35 GB), plus 5 GB of overhead. In this case, a single L4 GPU doesn't have enough memory, so the examples in this tutorial use two L4 GPUs (2 x 24 GB = 48 GB of GPU memory). This configuration is sufficient for running Falcon 40b or Llama2 70b on L4 GPUs.
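As a quick sanity check, you can reproduce this sizing in a few lines of Python. The 5 GB overhead is the rule of thumb used above, not an exact measurement:

# Weights-only footprint (params * bits / 8 bytes) plus a fixed overhead.
def min_gpu_memory_gb(params_billion, bits, overhead_gb=5):
    return params_billion * bits / 8 + overhead_gb

print(min_gpu_memory_gb(70, 4))   # 40.0 GB -> fits on two 24 GB L4 GPUs (48 GB)
print(min_gpu_memory_gb(70, 16))  # 145.0 GB -> far beyond two L4 GPUs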
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the individual resources
If you used an existing project and you don't want to delete it, delete the individual resources.
Navigate to the gke-platform folder:

cd ${TUTORIAL_HOME}/gke-platform
Disable the deletion protection on the cluster and remove all of the Terraform-provisioned resources. Run the following commands:
sed -ie 's/"deletion_protection": true/"deletion_protection": false/g' terraform.tfstate
terraform destroy --auto-approve