Containers & Kubernetes

How to secure Ray on Google Kubernetes Engine

March 22, 2024

Bill Bernsen

Security Engineer

Cynthia Thomas

Product Manager

When developers are innovating quickly, security can be an afterthought. That’s even true for AI/ML workloads, where the stakes are high for organizations trying to protect valuable models and data.

When you deploy an AI workload on Google Kubernetes Engine (GKE), you can benefit from the many security tools available in Google Cloud infrastructure. In this blog, we share security insights and hardening techniques for training AI/ML workloads on one framework in particular — Ray.

Ray needs security hardening

As a distributed compute framework for AI applications, Ray has grown in popularity in recent years, and deploying it on GKE is a popular choice that provides flexibility and configurable orchestration. You can read more on why we recommend GKE for Ray.

However, Ray lacks built-in authentication and authorization, which means that if you can successfully send a request to the Ray cluster head, it will execute arbitrary code on your behalf.

So how do you secure Ray? The authors state that security should be enforced outside of the Ray cluster, but how do you actually harden it? Running Ray on GKE can help you achieve a more secure, scalable, and reliable Ray deployment by taking advantage of existing global Google infrastructure components including Identity-Aware Proxy (IAP).

We’re also making strides in the Ray community to make safer defaults for running Ray with Kubernetes using KubeRay. One focus area has been improving Ray component compliance with the restricted Pod Security Standards profile and by adding security best practices, such as running the operator as non-root to help prevent privilege escalation.

Security separation supports multi-cluster operation

One key advantage of running Ray inside Kubernetes is the ability to run multiple Ray clusters, with diverse workloads, managed by multiple teams, inside a single Kubernetes cluster. This gives you better resource sharing and utilization because nodes with accelerators can be used by several teams, and spinning up Ray on an existing GKE cluster saves waiting on VM provisioning time before workloads can begin execution.

Security plays a supporting role in landing those multi-cluster advantages by using Kubernetes security features to help keep Ray clusters separate. The goal is to avoid accidental denial of service or accidental cross-tenant access. Note that the security separation here is not “hard” multitenancy — it is only sufficient for clusters running trusted code and teams that trust each other with their data. If further isolation is required, consider using separate GKE clusters.

The architecture is shown in the following diagram. Different Ray clusters are separated by namespaces within the GKE cluster, allowing authorized users to make calls to their assigned Ray cluster, without accessing others.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Hardening_Ray_on_GKE_-_blog_diagram.max-2200x2200.jpg

Diagram: Ray on GKE Architecture

How to secure Ray on GKE

At Google Cloud, we’ve been working on improving the security of KubeRay components, and making it easier to spin up a multi-team environment with the help of Terraform templates including sample security configurations that you can reuse.

Below, we’ve summarized fundamental security best practices included in our sample templates:

Namespaces: Separate Ray clusters into distinct namespaces by placing one Ray cluster per namespace to take advantage of Kubernetes policies based on the namespace boundary.
Role-based access control (RBAC): Practice least privilege by creating a Kubernetes Service Account (KSA) per Ray cluster namespace, avoid using the default KSA associated with each Ray cluster namespace, and minimizing permissions down to no RoleBindings until deemed necessary. Optionally, consider setting automountServiceAccountToken:false on the KSA to ensure the KSA’s token is not available to the Ray cluster Pods, since Ray jobs are not expected to call the Kubernetes API.
Resource quotas: Harden against denial of service due to resource exhaustion by setting limits for resource quotas (especially for CPUs, GPUs, TPUs, and memory) on your Ray cluster namespace.
NetworkPolicy: Protect the Ray API as a critical measure to Ray security, since there is no authentication or authorization for submitting jobs. Use Kubernetes NetworkPolicy with GKE Dataplane V2 to control which traffic reaches the Ray components.
Security context: Comply with Kubernetes Pod Security Standards by configuring Pods to run with hardened settings preventing privilege escalation, running as root, and restricting potentially dangerous syscalls.
Workload identity federation: If necessary, secure access from your Ray deployment Pods to other Google Cloud services with workload identity federation such as Cloud Storage, by leveraging your KSA in a Google Cloud IAM policy.

Additional security tools

The following tools and references can provide additional security for your Ray clusters on GKE:

Identity-Aware Policy (IAP): Control access to your Ray cluster with Google’s distributed global endpoint with IAP providing user and group authorization, with Ray deployed as an Kubernetes Ingress or Gateway service.
Pod Security Standards (PSS): Turn Pod Security Standards on for each of your namespaces in order to prevent common insecure misconfigurations such as HostVolume mounts. If you need more policy customization, you can also use Policy Controller.
GKE Sandbox: Leverage GKE Sandbox Pods based on gVisor to add a second security layer around Pods, further reducing the possibility of breakouts for your Ray clusters. Currently available for CPUs (also GPUs with some limitations).
Cluster hardening: By default, GKE Autopilot already applies a lot of cluster hardening best practices, but there are some additional ways to lock down the cluster. The Ray API can be further secured by removing access from the Internet by using private nodes.
Organization policies: Ensure your organization's clusters meet security and hardening standards by setting custom organization policies — for example, guarantee that all GKE clusters are Autopilot.

Google continues to contribute to the community through our efforts to ensure safe and scalable deployments. We look forward to continued collaboration to ensure Ray runs safely on Kubernetes clusters. Please drop us a line with any feedback at ray-on-gke@google.com or comment on our GitHub repo.

To learn more, check out the following resources: