Collect and view logs and metrics for Ray clusters on Google Kubernetes Engine (GKE)


This page shows how to configure Google Kubernetes Engine (GKE) to collect logs and metrics for Ray clusters, and how to view those logs and metrics in Cloud Logging and Cloud Monitoring.

For more information on Ray and KubeRay, see Ray on Google Kubernetes Engine (GKE) overview.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Requirements and limitations

  • You must enable system and workload logging on an existing GKE cluster before you enable log collection for Ray clusters.
  • If you enable log collection for Ray clusters on an existing GKE cluster, GKE only collects logs from newly created Ray Pods, not from existing Ray Pods.
  • For Standard GKE clusters, you must enable Google Cloud Managed Service for Prometheus to enable metrics collection for Ray clusters. For Autopilot clusters, Google Cloud Managed Service for Prometheus is enabled by default.
  • You must not specify a volume named ray-logs in any Ray container in the Ray cluster. Otherwise, GKE won't collect logs.
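
The first requirement can be checked from the command line before you enable log collection. The following is a minimal sketch, assuming that the cluster description reports enabled logging components in the loggingConfig.componentConfig.enableComponents field using the values SYSTEM_COMPONENTS and WORKLOADS. The helper name has_required_logging is hypothetical.

```shell
# Hypothetical helper: succeeds only if both system and workload logging
# appear in a list of enabled logging components.
# $1: component list, for example "SYSTEM_COMPONENTS;WORKLOADS"
# (assumed value names, taken from the GKE Cluster API)
has_required_logging() {
  case "$1" in
    *SYSTEM_COMPONENTS*) ;;   # system logging present, keep checking
    *) return 1 ;;
  esac
  case "$1" in
    *WORKLOADS*) ;;           # workload logging present
    *) return 1 ;;
  esac
  return 0
}
```

To check a real cluster, pass the helper the output of gcloud container clusters describe CLUSTER_NAME --format="value(loggingConfig.componentConfig.enableComponents)".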

Enable log collection for a Ray cluster

You can enable log collection for Ray clusters with new or existing Autopilot or Standard GKE clusters. The Ray logs that GKE collects from Ray clusters are classified as container logs. This includes all logs produced by the Ray cluster head and worker nodes.

You can enable log collection for Ray clusters using the Google Cloud console or the gcloud CLI.

Console

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

  2. Click Create, and then in the Standard or Autopilot section, click Configure.

  3. From the navigation pane, under Cluster, click Features.

  4. In the Operations section, ensure the System and Workloads checkbox is selected.

  5. In the AI and Machine Learning section, select Enable Ray Operator and then select Enable log collection for Ray clusters.

  6. Click Create.

For Standard clusters, you must also enable Google Cloud Managed Service for Prometheus.

gcloud

Create a cluster using the --addons=RayOperator option and the --enable-ray-cluster-logging option:

gcloud container clusters create CLUSTER_NAME \
    --cluster-version=VERSION \
    --addons=RayOperator \
    --enable-ray-cluster-logging

Replace the following:

  • CLUSTER_NAME: the name of the new cluster.
  • VERSION: the GKE version, which must be 1.30.2-gke.1060005 or later. You can also use the --release-channel option to select a release channel. The release channel must have a default version of 1.30.2-gke.106000 or later.
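
Before you create the cluster, it can help to confirm that a candidate version meets the minimum. The following is a minimal sketch, assuming a sort implementation that supports version ordering with -V (for example, GNU coreutils). The helper name gke_version_at_least is hypothetical.

```shell
# Hypothetical helper: succeeds if GKE version $1 is at least version $2.
# Both arguments use the form MAJOR.MINOR.PATCH-gke.BUILD.
# Assumes `sort -V` (version sort) is available.
gke_version_at_least() {
  # Sort the two versions in ascending version order and keep the lowest.
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  # $1 is new enough when the required minimum sorts first (or they are equal).
  [ "$lowest" = "$2" ]
}
```

For example, gke_version_at_least "VERSION" "MINIMUM_VERSION" succeeds when VERSION is the same as or newer than MINIMUM_VERSION, where MINIMUM_VERSION is the minimum stated in the list above.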

You can enable log collection for Ray clusters on an existing cluster by using the gcloud container clusters update command with the --addons=RayOperator option and the --enable-ray-cluster-logging option.

View Ray logs

You can view logs collected from Ray clusters running on GKE using Cloud Logging.

  1. Go to the Cloud Logging page in the Google Cloud console.

  2. Open the query editor and paste your expression.

  3. Click Run query.

You can use the following example queries in the Logs Explorer:

Query/filter name            Expression
All Ray logs                 resource.type="k8s_container"
                             labels."k8s-pod/ray_io/is-ray-node"="yes"
All Ray head logs            resource.type="k8s_container"
                             labels."k8s-pod/ray_io/node-type"="head"
All logs in a Ray cluster    resource.type="k8s_container"
                             labels."k8s-pod/ray_io/cluster"="RAY_CLUSTER_NAME"
All logs from a Ray job      resource.type="k8s_container"
                             jsonPayload.ray_submission_id="RAY_JOB_SUBMISSION_ID"
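
The expressions above can be combined to narrow a search, for example to the head node of one Ray cluster. The following is a minimal sketch that builds such a filter on a single line, joining the comparisons with an explicit AND. The helper name ray_cluster_log_filter is hypothetical.

```shell
# Hypothetical helper: prints a Cloud Logging filter for one Ray cluster,
# optionally narrowed to a node type.
# $1: Ray cluster name; $2: optional node type, for example "head"
ray_cluster_log_filter() {
  filter='resource.type="k8s_container"'
  filter="$filter AND labels.\"k8s-pod/ray_io/cluster\"=\"$1\""
  if [ -n "${2:-}" ]; then
    filter="$filter AND labels.\"k8s-pod/ray_io/node-type\"=\"$2\""
  fi
  printf '%s\n' "$filter"
}
```

You can paste the printed filter into the query editor, or pass it to the gcloud CLI, for example: gcloud logging read "$(ray_cluster_log_filter RAY_CLUSTER_NAME head)" --limit=20.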

Enable metrics collection for a Ray cluster

You can enable metrics collection for Ray clusters with new or existing Autopilot or Standard GKE clusters.

After you enable metrics collection for Ray clusters, GKE collects metrics from existing Ray clusters and new Ray clusters. GKE collects all system metrics exported by Ray in Prometheus format.

You can enable metrics collection for Ray clusters using the Google Cloud console or the gcloud CLI.

Console

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

  2. Click Create, and then in the Standard or Autopilot section, click Configure.

  3. From the navigation pane, under Cluster, click Features.

  4. In the Operations section, ensure the System and Workloads checkbox is selected.

  5. In the AI and Machine Learning section, select Enable Ray Operator and then select Enable metrics collection for Ray clusters.

  6. Click Create.

For Standard clusters, you must also enable Google Cloud Managed Service for Prometheus.

gcloud

Create a cluster using the --addons=RayOperator option and the --enable-ray-cluster-monitoring option:

gcloud container clusters create CLUSTER_NAME \
    --cluster-version=VERSION \
    --addons=RayOperator \
    --enable-ray-cluster-monitoring

Replace the following:

  • CLUSTER_NAME: the name of the new cluster.
  • VERSION: the GKE version, which must be 1.30.2-gke.1060005 or later. You can also use the --release-channel option to select a release channel. The release channel must have a default version of 1.30.2-gke.106000 or later.

You can enable metrics collection for Ray clusters on an existing cluster by using the gcloud container clusters update command with the --addons=RayOperator option and the --enable-ray-cluster-monitoring option.

View Ray metrics

You can view metrics collected from Ray clusters running on GKE using Cloud Monitoring.

  1. Go to the Metrics Explorer page in the Google Cloud console.

  2. In the Select a metric drop-down menu, enter Prometheus Target.

  3. In the Active Metric Categories section, select Ray.
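
Because the metrics are collected in Prometheus format, they can likely also be queried outside the console. The following is a minimal sketch, assuming the Prometheus-compatible query endpoint of Google Cloud Managed Service for Prometheus. The helper name prometheus_query_url is hypothetical, and the metric name used in the note after the block (ray_node_cpu_utilization) is an assumption about what Ray exports, not something this page states.

```shell
# Hypothetical helper: prints the Prometheus-compatible instant-query
# endpoint for a Google Cloud project (assumed endpoint layout).
# $1: project ID
prometheus_query_url() {
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query\n' "$1"
}
```

One way to use it: curl "$(prometheus_query_url PROJECT_ID)" -H "Authorization: Bearer $(gcloud auth print-access-token)" --data-urlencode "query=ray_node_cpu_utilization". Replace PROJECT_ID with your project ID and the query with any Ray metric that appears in Metrics Explorer.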

What's next