This page shows how to configure Google Kubernetes Engine (GKE) to collect logs and metrics for Ray clusters running on GKE, and how to view those logs and metrics in Cloud Logging and Cloud Monitoring.
For more information on Ray and KubeRay, see Ray on Google Kubernetes Engine (GKE) overview.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Requirements and limitations
- You must enable system and workload logging on an existing GKE cluster before you enable log collection for Ray clusters.
- If you enable log collection for Ray clusters on an existing GKE cluster, GKE only collects logs from newly created Ray Pods, not from existing Ray Pods.
- For Standard GKE clusters, you must enable Google Cloud Managed Service for Prometheus to enable metrics collection for Ray clusters. For Autopilot clusters, Google Cloud Managed Service for Prometheus is enabled by default.
- You must not specify a volume named ray-logs in any Ray container in the Ray cluster. Otherwise, GKE won't collect logs.
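The ray-logs restriction above can be checked before you apply a manifest. The following is a minimal sketch, not an official tool: the function name manifest_uses_ray_logs_volume is hypothetical, and the check is a plain text match on any "name: ray-logs" line in a YAML file, so it can also flag unrelated fields that happen to use that name.

```shell
# Hypothetical helper (not part of gcloud or KubeRay).
# Succeeds when the YAML file passed as $1 contains a line that sets
# "name: ray-logs", with or without a leading list dash or quotes.
# Assumption: a simple text match is good enough for a pre-flight check.
manifest_uses_ray_logs_volume() {
  grep -Eq '^[[:space:]]*-?[[:space:]]*name:[[:space:]]*"?ray-logs"?[[:space:]]*$' "$1"
}
```

For example, `manifest_uses_ray_logs_volume raycluster.yaml && echo "rename the ray-logs volume"` prints the warning only when a match is found. The file name raycluster.yaml is a placeholder.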
Enable log collection for a Ray cluster
You can enable log collection for Ray clusters with new or existing Autopilot or Standard GKE clusters. The Ray logs that GKE collects from Ray clusters are classified as container logs. This includes all logs produced by the Ray cluster head and worker nodes.
You can enable log collection for Ray clusters using the Google Cloud console or the gcloud CLI.
Console
Go to the Google Kubernetes Engine page in the Google Cloud console.
Click Create, then in the Standard or Autopilot section, click Configure.
From the navigation pane, under Cluster, click Features.
In the Operations section, ensure the System and Workloads checkbox is selected.
In the AI and Machine Learning section, select Enable Ray Operator and then select Enable log collection for Ray clusters.
Click Create.
For Standard clusters, you must also enable Google Cloud Managed Service for Prometheus.
gcloud
Create a cluster using the --addons=RayOperator option and the --enable-ray-cluster-logging option:
gcloud container clusters create CLUSTER_NAME \
--cluster-version=VERSION \
--addons=RayOperator \
--enable-ray-cluster-logging
Replace the following:
- CLUSTER_NAME: the name of the new cluster.
- VERSION: the GKE version, which must be 1.30.2-gke.1060005 or later. You can also use the --release-channel option to select a release channel. The release channel must have a default version of 1.30.2-gke.106000 or later.
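The version requirement can be checked from a script. The sketch below assumes GNU sort with the -V (version sort) option is available. The function name version_at_least is hypothetical, and the minimum version is passed as an argument rather than hard-coded.

```shell
# Hypothetical helper: succeeds when version $1 is equal to or newer than
# version $2 under GNU "sort -V" ordering.
# Assumption: GKE version strings such as 1.30.2-gke.1060005 order correctly
# under version sort.
version_at_least() {
  # The smaller of the two versions sorts first; $1 >= $2 exactly when
  # that smaller value is $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For an existing cluster, the version to test can be read with gcloud container clusters describe CLUSTER_NAME --format="value(currentMasterVersion)" and passed as the first argument, with the minimum version stated above as the second.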
You can enable log collection for Ray clusters on an existing cluster by using the gcloud container clusters update command with the --addons=RayOperator option and the --enable-ray-cluster-logging option.
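Because system and workload logging must already be enabled on an existing cluster, it can help to confirm that first. The following sketch interprets the output of a describe call. The function name logging_components_ok is hypothetical, and the sketch assumes that the cluster's loggingConfig.componentConfig.enableComponents field lists SYSTEM_COMPONENTS and WORKLOADS when both are enabled.

```shell
# Hypothetical helper: succeeds when the component list in $1 contains both
# SYSTEM_COMPONENTS and WORKLOADS.
# Intended input (assumption): the text printed by
#   gcloud container clusters describe CLUSTER_NAME --location=LOCATION \
#     --format="value(loggingConfig.componentConfig.enableComponents)"
logging_components_ok() {
  case "$1" in
    *SYSTEM_COMPONENTS*) ;;   # system logging present, keep checking
    *) return 1 ;;
  esac
  case "$1" in
    *WORKLOADS*) return 0 ;;  # workload logging present as well
    *) return 1 ;;
  esac
}
```

If the check fails, running gcloud container clusters update CLUSTER_NAME --location=LOCATION --logging=SYSTEM,WORKLOAD is one way to enable both before you turn on Ray log collection. LOCATION is a placeholder for the cluster's region or zone.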
View Ray logs
You can view logs collected from Ray clusters running on GKE using Logging.
Go to the Cloud Logging page in the Google Cloud console.
Open the query editor and paste your expression.
Click Run query.
You can use the following example queries in the Logs Explorer:
Query/filter name | Expression |
---|---|
All Ray logs | resource.type="k8s_container" labels."k8s-pod/ray_io/is-ray-node"="yes" |
All Ray head logs | resource.type="k8s_container" labels."k8s-pod/ray_io/node-type"="head" |
All logs in a Ray cluster | resource.type="k8s_container" labels."k8s-pod/ray_io/cluster"="RAY_CLUSTER_NAME" |
All logs from a Ray job | resource.type="k8s_container" jsonPayload.ray_submission_id="RAY_JOB_SUBMISSION_ID" |
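The same expressions can be used from the command line with gcloud logging read. The sketch below builds two of the filters from the table and passes one to the CLI. The function names are hypothetical, the two conditions are joined with an explicit AND, and the --limit value is an arbitrary choice.

```shell
# Hypothetical helper: prints the filter for all logs in one Ray cluster.
ray_cluster_log_filter() {
  printf 'resource.type="k8s_container" AND labels."k8s-pod/ray_io/cluster"="%s"\n' "$1"
}

# Hypothetical helper: prints the filter for all logs from one Ray job.
ray_job_log_filter() {
  printf 'resource.type="k8s_container" AND jsonPayload.ray_submission_id="%s"\n' "$1"
}

# Hypothetical helper: reads recent log entries for a Ray cluster as JSON.
# Requires an authenticated gcloud CLI with a default project set.
read_ray_cluster_logs() {
  gcloud logging read "$(ray_cluster_log_filter "$1")" --limit=20 --format=json
}
```

For example, `read_ray_cluster_logs RAY_CLUSTER_NAME` returns up to 20 matching entries from the project that is currently configured in the gcloud CLI.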
Enable metrics collection for a Ray cluster
You can enable metrics collection for Ray clusters with new or existing Autopilot or Standard GKE clusters.
After you enable metrics collection for Ray clusters, GKE collects metrics from existing Ray clusters and new Ray clusters. GKE collects all system metrics exported by Ray in Prometheus format.
You can enable metrics collection for Ray clusters using the Google Cloud console or the gcloud CLI.
Console
Go to the Google Kubernetes Engine page in the Google Cloud console.
Click Create, then in the Standard or Autopilot section, click Configure.
From the navigation pane, under Cluster, click Features.
In the Operations section, ensure the System and Workloads checkbox is selected.
In the AI and Machine Learning section, select Enable Ray Operator and then select Enable metrics collection for Ray clusters.
Click Create.
For Standard clusters, you must also enable Google Cloud Managed Service for Prometheus.
gcloud
Create a cluster using the --addons=RayOperator option and the --enable-ray-cluster-monitoring option:
gcloud container clusters create CLUSTER_NAME \
--cluster-version=VERSION \
--addons=RayOperator \
--enable-ray-cluster-monitoring
Replace the following:
- CLUSTER_NAME: the name of the new cluster.
- VERSION: the GKE version, which must be 1.30.2-gke.1060005 or later. You can also use the --release-channel option to select a release channel. The release channel must have a default version of 1.30.2-gke.106000 or later.
You can enable metrics collection for Ray clusters on an existing cluster by using the gcloud container clusters update command with the --addons=RayOperator option and the --enable-ray-cluster-monitoring option.
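On Standard clusters, metrics collection depends on Google Cloud Managed Service for Prometheus, so it can be useful to confirm that it is enabled. The sketch below interprets the output of a describe call. The function name managed_prometheus_enabled is hypothetical, and the sketch assumes the setting is exposed as monitoringConfig.managedPrometheusConfig.enabled and printed as a true/false value.

```shell
# Hypothetical helper: succeeds when $1 is "true" in any letter case.
# Intended input (assumption): the text printed by
#   gcloud container clusters describe CLUSTER_NAME --location=LOCATION \
#     --format="value(monitoringConfig.managedPrometheusConfig.enabled)"
managed_prometheus_enabled() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    true) return 0 ;;
    *) return 1 ;;   # empty or "false" both mean not enabled
  esac
}
```

If the check fails on a Standard cluster, gcloud container clusters update CLUSTER_NAME --location=LOCATION --enable-managed-prometheus is one way to enable the service. LOCATION is a placeholder for the cluster's region or zone.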
View Ray metrics
You can view metrics collected from Ray clusters running on GKE using Monitoring.
Go to the Metrics Explorer page in the Google Cloud console.
In the Select a metric drop-down menu, enter Prometheus Target.
In the Active Metric Categories section, select Ray.
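Metrics collected through Google Cloud Managed Service for Prometheus can also be queried with PromQL over HTTP. The following sketch assumes the Cloud Monitoring Prometheus query endpoint, an authenticated gcloud CLI, and curl. The function names are hypothetical, and the metric name in the usage note is an assumption about what your Ray version exports.

```shell
# Hypothetical helper: prints the Prometheus HTTP API query URL for a project.
prometheus_query_url() {
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query\n' "$1"
}

# Hypothetical helper: runs an instant PromQL query ($2) against project $1,
# authenticating with the access token of the active gcloud account.
query_ray_metric() {
  project_id="$1"
  promql="$2"
  curl -s "$(prometheus_query_url "$project_id")" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "query=${promql}"
}
```

For example, `query_ray_metric PROJECT_ID 'ray_node_cpu_utilization'` returns a JSON response if that metric is being collected. PROJECT_ID is a placeholder.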
What's next
- Learn about Ray on Kubernetes.
- Explore the KubeRay documentation.