This guide demonstrates how to serve large language models (LLMs) using Ray and the Ray Operator add-on with Google Kubernetes Engine (GKE).
In this guide, you can serve any of the following models:
- Gemma 2B IT
- Gemma 7B IT
- Llama 2 7B
- Llama 3 8B
- Mistral 7B
This guide also covers model serving techniques like model multiplexing and model composition that are supported by the Ray Serve framework.
Background
The Ray framework provides an end-to-end AI/ML platform for training, fine-tuning, and inference of machine learning workloads. Ray Serve is a framework in Ray that you can use to serve popular LLMs from Hugging Face.
The number of GPUs required depends on the model's data format. In this guide, each model uses one or two L4 GPUs.
This guide covers the following steps:
- Create an Autopilot or Standard GKE cluster with the Ray Operator add-on enabled.
- Deploy a RayService resource that downloads and serves a large language model (LLM) from Hugging Face.
- Deploy a chat interface and create a dialogue with the LLMs.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
- Create a Hugging Face account, if you don't already have one.
- Ensure that you have a Hugging Face token.
- Ensure that you have access to the Hugging Face model that you want to use. This is usually granted by signing an agreement and requesting access from the model owner on the Hugging Face model page.
- Ensure that you have GPU quota in the
us-central1
region. To learn more, see GPU quota.
Prepare your environment
In the Google Cloud console, start a Cloud Shell instance.
Clone the sample repository:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/llm
export TUTORIAL_HOME=`pwd`
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export COMPUTE_REGION=us-central1
export CLUSTER_VERSION=CLUSTER_VERSION
export HF_TOKEN=HUGGING_FACE_TOKEN
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
- HUGGING_FACE_TOKEN: your Hugging Face access token.
Create a cluster with a GPU node pool
You can serve an LLM on L4 GPUs with Ray in a GKE Autopilot or Standard cluster using the Ray Operator add-on. We generally recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. Choose a Standard cluster instead if your use case requires high scalability or if you want more control over cluster configuration. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Use Cloud Shell to create an Autopilot or Standard cluster:
Autopilot
Create an Autopilot cluster with the Ray Operator add-on enabled:
gcloud container clusters create-auto rayserve-cluster \
--enable-ray-operator \
--cluster-version=${CLUSTER_VERSION} \
--location=${COMPUTE_REGION}
Standard
Create a Standard cluster with the Ray Operator add-on enabled:
gcloud container clusters create rayserve-cluster \
--addons=RayOperator \
--cluster-version=${CLUSTER_VERSION} \
--machine-type=g2-standard-24 \
    --location=${COMPUTE_REGION} \
--num-nodes=2 \
--accelerator type=nvidia-l4,count=2,gpu-driver-version=latest
Create a Kubernetes Secret for Hugging Face credentials
In Cloud Shell, create a Kubernetes Secret by doing the following:
Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials rayserve-cluster --location=${COMPUTE_REGION}
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -
Deploy the LLM
The GitHub repository that you cloned has a directory for each model that includes a RayService configuration. The configuration for each model includes the following components:
- Ray Serve deployment: The Ray Serve deployment, which includes resource configuration and runtime dependencies (see the sketch after this list).
- Model: The Hugging Face model ID.
- Ray cluster: The underlying Ray cluster and the resources required for each component, including head and worker Pods.
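The exact Serve application code is referenced from each manifest in the sample repository. As a rough, illustrative sketch only (not the repository's actual code), a Ray Serve deployment that serves a Hugging Face model with vLLM might look like the following; the file name, resource values, and model ID are assumptions for illustration:

# serve_llm.py: illustrative sketch only, not the sample repository's actual code.
from starlette.requests import Request
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self, model_id: str):
        # Downloads the model from Hugging Face when the replica starts.
        self.llm = LLM(model=model_id)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 256))
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": [o.outputs[0].text for o in outputs]}

# The RayService manifest points Ray Serve at this bound application.
model = VLLMDeployment.bind(model_id="google/gemma-2b-it")

In the sample manifests, runtime details such as the Hugging Face token from the hf-secret Secret and the per-model GPU settings are supplied through the RayService configuration.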
Gemma 2B IT
Deploy the model:
kubectl apply -f gemma-2b-it/
Wait for the RayService resource to be ready:
kubectl get rayservice gemma-2b-it -o yaml
The output is similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
In this output, status: RUNNING indicates the RayService resource is ready.
Confirm that GKE created the Service for the Ray Serve application:
kubectl get service gemma-2b-it-serve-svc
The output is similar to the following:
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-2b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
Gemma 7B IT
Deploy the model:
kubectl apply -f gemma-7b-it/
Wait for the RayService resource to be ready:
kubectl get rayservice gemma-7b-it -o yaml
The output is similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
In this output, status: RUNNING indicates the RayService resource is ready.
Confirm that GKE created the Service for the Ray Serve application:
kubectl get service gemma-7b-it-serve-svc
The output is similar to the following:
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-7b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
Llama 2 7B
Deploy the model:
kubectl apply -f llama-2-7b/
Wait for the RayService resource to be ready:
kubectl get rayservice llama-2-7b -o yaml
The output is similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
In this output, status: RUNNING indicates the RayService resource is ready.
Confirm that GKE created the Service for the Ray Serve application:
kubectl get service llama-2-7b-serve-svc
The output is similar to the following:
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-2-7b-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
Llama 3 8B
Deploy the model:
kubectl apply -f llama-3-8b/
Wait for the RayService resource to be ready:
kubectl get rayservice llama-3-8b -o yaml
The output is similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
In this output, status: RUNNING indicates the RayService resource is ready.
Confirm that GKE created the Service for the Ray Serve application:
kubectl get service llama-3-8b-serve-svc
The output is similar to the following:
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-3-8b-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
Mistral 7B
Deploy the model:
kubectl apply -f mistral-7b/
Wait for the RayService resource to be ready:
kubectl get rayservice mistral-7b -o yaml
The output is similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
In this output, status: RUNNING indicates the RayService resource is ready.
Confirm that GKE created the Service for the Ray Serve application:
kubectl get service mistral-7b-serve-svc
The output is similar to the following:
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
mistral-7b-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
Serve the model
The Llama 2 7B and Llama 3 8B models use the OpenAI API chat specification. The other models support only text generation, which generates text based on a prompt.
Set up port-forwarding
Set up port forwarding to the inferencing server:
Gemma 2B IT
kubectl port-forward svc/gemma-2b-it-serve-svc 8000:8000
Gemma 7B IT
kubectl port-forward svc/gemma-7b-it-serve-svc 8000:8000
Llama 2 7B
kubectl port-forward svc/llama-2-7b-serve-svc 8000:8000
Llama 3 8B
kubectl port-forward svc/llama-3-8b-serve-svc 8000:8000
Mistral 7B
kubectl port-forward svc/mistral-7b-serve-svc 8000:8000
Interact with the model using curl
Use curl to chat with your model:
Gemma 2B IT
In a new terminal session:
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
Gemma 7B IT
In a new terminal session:
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
Llama 2 7B
In a new terminal session:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
],
"temperature": 0.7
}'
Llama 3 8B
In a new terminal session:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
],
"temperature": 0.7
}'
Mistral 7B
In a new terminal session:
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
Because the models that you served don't retain any history, each message and reply must be sent back to the model to create an interactive dialogue experience. The following example shows how you can create an interactive dialogue using the Llama 3 8B model:
Create a dialogue with the model using curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."},
{"role": "assistant", "content": " \n1. Java\n2. Python\n3. C++\n4. C#\n5. JavaScript"},
{"role": "user", "content": "Can you give me a brief description?"}
],
"temperature": 0.7
}'
The output is similar to the following:
{
"id": "cmpl-3cb18c16406644d291e93fff65d16e41",
"object": "chat.completion",
"created": 1719035491,
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a brief description of each:\n\n1. **Java**: A versatile language for building enterprise-level applications, Android apps, and web applications.\n2. **Python**: A popular language for data science, machine learning, web development, and scripting, known for its simplicity and ease of use.\n3. **C++**: A high-performance language for building operating systems, games, and other high-performance applications, with a focus on efficiency and control.\n4. **C#**: A modern, object-oriented language for building Windows desktop and mobile applications, as well as web applications using .NET.\n5. **JavaScript**: A versatile language for client-side scripting on the web, commonly used for creating interactive web pages, web applications, and mobile apps.\n\nNote: These descriptions are brief and don't do justice to the full capabilities and uses of each language."
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 73,
"total_tokens": 245,
"completion_tokens": 172
}
}
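If you prefer Python over curl, the same history-carrying pattern can be scripted. The following is a minimal sketch, assuming the Llama 3 8B port-forward from the previous section is still active on localhost:8000; the file name and loop structure are illustrative:

# chat_loop.py: minimal sketch that resends the full history on every turn.
import requests

URL = "http://localhost:8000/v1/chat/completions"
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    messages.append({"role": "user", "content": user_input})
    response = requests.post(URL, json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": messages,
        "temperature": 0.7,
    })
    reply = response.json()["choices"][0]["message"]["content"]
    # Append the reply so that the next request carries the whole dialogue.
    messages.append({"role": "assistant", "content": reply})
    print("Assistant:", reply)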
(Optional) Connect to the chat interface
You can use Gradio to build web applications that let you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots. For Llama 2 7B and Llama 3 8B, you installed Gradio when you deployed the model.
Set up port-forwarding to the gradio Service:
kubectl port-forward service/gradio 8080:8080 &
Open http://localhost:8080 in your browser to chat with the model.
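For reference, ChatInterface only needs a function that maps the latest message and the chat history to a reply. The following is a hypothetical sketch of such a front end against the OpenAI-style chat endpoint, not the exact Gradio application that the sample manifests deploy; it assumes Gradio's tuple-style history format and the Llama 3 8B model:

# gradio_app.py: hypothetical sketch, not the exact app deployed by the samples.
import gradio as gr
import requests

URL = "http://localhost:8000/v1/chat/completions"

def chat(message, history):
    # Rebuild the OpenAI-style message list from Gradio's (user, assistant) pairs.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    response = requests.post(URL, json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": messages,
        "temperature": 0.7,
    })
    return response.json()["choices"][0]["message"]["content"]

gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=8080)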
Serve multiple models with model multiplexing
Model multiplexing is a technique used to serve multiple models within the same Ray cluster. You can route traffic to specific models using request headers or by load balancing.
In this example, you create a multiplexed Ray Serve application consisting of two models: Gemma 7B IT and Llama 3 8B.
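Ray Serve exposes multiplexing through the serve.multiplexed decorator and the serve.get_multiplexed_model_id() helper, which reads the serve_multiplexed_model_id request header. The following is a rough sketch of a multiplexed deployment, with illustrative names and resource values rather than the sample's exact code:

# multiplexed_app.py: rough sketch of header-based model multiplexing in Ray Serve.
from starlette.requests import Request
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 2})
class MultiModelDeployment:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str) -> LLM:
        # Loaded lazily on the first request for this ID, then cached per replica.
        return LLM(model=model_id)

    async def __call__(self, request: Request) -> dict:
        # Populated by Ray Serve from the serve_multiplexed_model_id header.
        model_id = serve.get_multiplexed_model_id()
        llm = await self.get_model(model_id)
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 200))
        outputs = llm.generate([body["prompt"]], params)
        return {"text": [o.outputs[0].text for o in outputs]}

app = MultiModelDeployment.bind()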
Deploy the RayService resource:
kubectl apply -f model-multiplexing/
Wait for the RayService resource to be ready:
kubectl get rayservice model-multiplexing -o yaml
The output is similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING
In this output, status: RUNNING indicates the RayService resource is ready.
Confirm that GKE created the Kubernetes Service for the Ray Serve application:
kubectl get service model-multiplexing-serve-svc
The output is similar to the following:
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-multiplexing-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
Set up port-forwarding to the Ray Serve application:
kubectl port-forward svc/model-multiplexing-serve-svc 8000:8000
Send a request to the Gemma 7B IT model:
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: google/gemma-7b-it" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'
The output is similar to the following:
{"text": ["What are the top 5 most popular programming languages? Please be brief.\n\n1. JavaScript\n2. Java\n3. C++\n4. Python\n5. C#"]}
Send a request to the Llama 3 8B model:
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: meta-llama/Meta-Llama-3-8B-Instruct" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'
The output is similar to the following:
{"text": ["What are the top 5 most popular programming languages? Please be brief. Here are your top 5 most popular programming languages, based on the TIOBE Index, a widely used measure of the popularity of programming languages.\r\n\r\n1. **Java**: Used in Android app development, web development, and enterprise software development.\r\n2. **Python**: A versatile language used in data science, machine learning, web development, and automation.\r\n3. **C++**: A high-performance language used in game development, system programming, and high-performance computing.\r\n4. **C#**: Used in Windows and web application development, game development, and enterprise software development.\r\n5. **JavaScript**: Used in web development, mobile app development, and server-side programming with technologies like Node.js.\r\n\r\nSource: TIOBE Index (2022).\r\n\r\nThese rankings can vary depending on the source and methodology used, but this gives you a general idea of the most popular programming languages."]}
Send a request to a random model by excluding the serve_multiplexed_model_id header:
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'
The output is one of the outputs from the previous steps.
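You can also set the routing header from Python. The following is a minimal sketch using the requests library, assuming the port-forward to localhost:8000 is still active; the file name is illustrative:

# multiplexed_client.py: minimal sketch of header-based routing from Python.
import requests

response = requests.post(
    "http://localhost:8000/",
    headers={"serve_multiplexed_model_id": "google/gemma-7b-it"},
    json={
        "prompt": "What are the top 5 most popular programming languages? Please be brief.",
        "max_tokens": 200,
    },
)
print(response.json())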
Compose multiple models with model composition
Model composition is a technique for combining multiple models into a single application. Model composition lets you chain together inputs and outputs across multiple LLMs and scale your models as a single application.
In this example, you compose two models, Gemma 7B IT and Llama 3 8B, into a single application. The first model is the assistant model that answers questions provided in the prompt. The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is the summarized version of the response from the assistant model.
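In Ray Serve, this kind of chaining is expressed by passing one deployment's handle to another and awaiting calls on the handle. The following is a rough sketch of the assistant-to-summarizer chain, with illustrative class names, prompts, and wiring rather than the sample's exact code:

# composed_app.py: rough sketch of chaining two Ray Serve deployments.
from starlette.requests import Request
from ray import serve
from ray.serve.handle import DeploymentHandle
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self, model_id: str):
        self.llm = LLM(model=model_id)

    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        outputs = self.llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
        return outputs[0].outputs[0].text

@serve.deployment
class ModelComposition:
    def __init__(self, assistant: DeploymentHandle, summarizer: DeploymentHandle):
        self.assistant = assistant
        self.summarizer = summarizer

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        # The assistant's answer becomes the summarizer's prompt.
        answer = await self.assistant.generate.remote(body["prompt"])
        summary = await self.summarizer.generate.remote(
            "Summarize the following in a single sentence:\n" + answer)
        return {"text": [summary]}

# Per the description above, the first model (Gemma 7B IT) acts as the assistant
# and the second (Llama 3 8B) as the summarizer. Binding the same deployment
# class twice is why the status output shows VLLMDeployment and VLLMDeployment_1.
app = ModelComposition.bind(
    assistant=VLLMDeployment.bind(model_id="google/gemma-7b-it"),
    summarizer=VLLMDeployment.bind(model_id="meta-llama/Meta-Llama-3-8B-Instruct"),
)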
Deploy the RayService resource:
kubectl apply -f model-composition/
Wait for the RayService resource to be ready:
kubectl get rayservice model-composition -o yaml
The output is similar to the following:
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING
In this output, status: RUNNING indicates the RayService resource is ready.
Confirm that GKE created the Service for the Ray Serve application:
kubectl get service model-composition-serve-svc
The output is similar to the following:
NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-composition-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
Send a request to the model:
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'
The output is similar to the following:
{"text": ["\n\n**Sure, here is a summary in a single sentence:**\n\nThe most popular programming language for machine learning is Python due to its ease of use, extensive libraries, and growing community."]}
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the individual resources
If you used an existing project and you don't want to delete it, you can delete the individual resources.
Delete the cluster:
gcloud container clusters delete rayserve-cluster --location=${COMPUTE_REGION}
What's next
- Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Train a model with GPUs on GKE Standard mode
- Learn how to use Ray Serve on GKE by viewing the sample code in GitHub.