Create a Ray cluster on Vertex AI

You can use the Google Cloud console or the Vertex AI SDK for Python to create a Ray cluster. A cluster can have up to 2,000 nodes. There is an upper limit of 1,000 nodes within one worker pool. There's no limit on the number of worker pools, but having a large number of worker pools, such as having 1,000 worker pools with one node each, can negatively affect cluster performance.
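
The limits above can be checked before you request a cluster. The following is a minimal sketch, not part of the Vertex AI SDK: the function name `check_worker_pools` is hypothetical, and it assumes that the 2,000-node limit is applied to the worker nodes you pass in (the text doesn't say whether the head node counts toward it).

```python
# Limits quoted in the paragraph above.
MAX_CLUSTER_NODES = 2000
MAX_NODES_PER_POOL = 1000


def check_worker_pools(pool_sizes):
    """Return a list of problems for the planned worker pool sizes.

    pool_sizes is a list with one node count per worker pool.
    An empty result means no limit is exceeded.
    """
    problems = []
    for index, size in enumerate(pool_sizes):
        if size < 1:
            problems.append(f"pool {index}: node count must be >= 1, got {size}")
        if size > MAX_NODES_PER_POOL:
            problems.append(
                f"pool {index}: {size} nodes exceeds the per-pool limit "
                f"of {MAX_NODES_PER_POOL}"
            )
    total = sum(pool_sizes)
    if total > MAX_CLUSTER_NODES:
        problems.append(
            f"cluster: {total} nodes exceeds the cluster limit of {MAX_CLUSTER_NODES}"
        )
    return problems


if __name__ == "__main__":
    # Two full pools fit; adding one more node does not.
    print(check_worker_pools([1000, 1000]))
    print(check_worker_pools([1000, 1000, 1]))
```

A check like this only catches the hard limits. It doesn't flag the performance concern about using very many small worker pools.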

Before you begin, make sure to read the Ray on Vertex AI overview and set up all the prerequisite tools you need.

A Ray cluster on Vertex AI may take 10-20 minutes to start up after you create it.

Console

In accordance with the OSS Ray best practice recommendation, setting the logical CPU count to 0 on the Ray head node is enforced in order to avoid running any workload on the head node.

  1. In the Google Cloud console, go to the Ray on Vertex AI page.

  2. Click Create Cluster to open the Create Cluster panel.

  3. For each step in the Create Cluster panel, review or replace the default cluster information. Click Continue to complete each step:

    1. For Name and region, specify a Name and choose a Location for your cluster.

    2. For Compute settings, specify the configuration of the head node of the Ray cluster on Vertex AI, including its machine type, accelerator type and count, disk type and size, and replica count. Optionally, add a custom image URI to specify a custom container image that adds Python dependencies not provided by the default container image. See Custom image.

      Under Advanced options, you can:

      • Specify your own encryption key.
      • Specify a custom service account.
      • Disable metrics collection, if you don't need to monitor the resource stats of your workload during training.
    3. (Optional) To set a private endpoint instead of a public endpoint for your cluster, specify a VPC network to use with Ray on Vertex AI. For more information, see Private and public connectivity.

      If you haven't set up a connection for your VPC network, click Set up connection. In the Create a private services access connection panel, complete and click Continue for each of the following steps:

      1. Enable the Service Networking API.

      2. For Allocate an IP range, you can select, create, or allow Google to automatically allocate an IP range.

      3. For Create a connection, review the Network and Allocated IP Range information.

      4. Click Create connection.

  4. Click Create.

Ray on Vertex AI SDK

In accordance with the OSS Ray best practice recommendation, setting the logical CPU count to 0 on the Ray head node is enforced in order to avoid running any workload on the head node.

From an interactive Python environment, use the following to create the Ray cluster on Vertex AI:

import ray
import vertex_ray
from google.cloud import aiplatform
from vertex_ray import Resources

# Option 1: Define a default CPU cluster (machine type n1-standard-16, 1 head node, 1 worker node).
head_node_type = Resources()
worker_node_types = [Resources()]

# Option 2: Define a GPU cluster instead. These assignments replace the defaults above.
head_node_type = Resources(
  machine_type="n1-standard-16",
  node_count=1,
  custom_image="us-docker.pkg.dev/my-project/ray-custom.2-9.py310:latest",  # Optional. When not specified, a prebuilt image is used.
)

worker_node_types = [Resources(
  machine_type="n1-standard-16",
  node_count=2,  # Must be >= 1
  accelerator_type="NVIDIA_TESLA_T4",
  accelerator_count=1,
  custom_image="us-docker.pkg.dev/my-project/ray-custom.2-9.py310:latest",  # When not specified, a prebuilt image is used.
)]

# Initialize Vertex AI to retrieve projects for downstream operations.
aiplatform.init()

# Create the Ray cluster on Vertex AI.
CLUSTER_RESOURCE_NAME = vertex_ray.create_ray_cluster(
  head_node_type=head_node_type,
  network=NETWORK,  # Optional
  worker_node_types=worker_node_types,
  python_version="3.10",  # Optional
  ray_version="2.33",  # Optional
  cluster_name=CLUSTER_NAME,  # Optional
  service_account=SERVICE_ACCOUNT,  # Optional
  enable_metrics_collection=True,  # Optional. Enable metrics collection for monitoring.
  labels=LABELS,  # Optional
)

Where:

  • CLUSTER_NAME: A name for the Ray cluster on Vertex AI that must be unique across your project.

  • NETWORK: (Optional) The full name of your VPC network, in the format of projects/PROJECT_ID/global/networks/VPC_NAME. To set a private endpoint instead of a public endpoint for your cluster, specify a VPC network to use with Ray on Vertex AI. For more information, see Private and public connectivity.

  • VPC_NAME: (Optional) The name of the VPC network in which the VM operates.

  • PROJECT_ID: Your Google Cloud project ID. You can find the project ID on the Google Cloud console welcome page.

  • SERVICE_ACCOUNT: (Optional) The service account used to run Ray applications on the cluster. Grant the required roles to this service account.

  • LABELS: (Optional) The labels with user-defined metadata used to organize Ray clusters. Label keys and values can be no longer than 64 characters (Unicode codepoints), and can only contain lowercase letters, numeric characters, underscores and dashes. International characters are allowed. See https://goo.gl/xmQnxf for more information and examples of labels.
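
The NETWORK format and the LABELS rules above can be sketched as two small helpers. These are hypothetical functions for illustration, not part of the Vertex AI SDK, and the character check is an approximation of the stated rule (lowercase or caseless letters, digits, underscores, and dashes); the service may apply additional constraints that aren't listed here.

```python
MAX_LABEL_LENGTH = 64  # Unicode code points, per the LABELS description above


def network_full_name(project_id, vpc_name):
    """Build the full VPC network name in the documented format."""
    return f"projects/{project_id}/global/networks/{vpc_name}"


def _is_allowed_char(ch):
    """Approximate the rule: lowercase letters, digits, underscores, dashes."""
    if ch in "-_":
        return True
    if ch.isdecimal():
        return True
    # Letters without an uppercase form (many international scripts) pass too.
    return ch.isalpha() and ch == ch.lower()


def label_problems(labels):
    """Return one message per label key or value that breaks a rule."""
    problems = []
    for key, value in labels.items():
        for kind, text in (("key", key), ("value", value)):
            if len(text) > MAX_LABEL_LENGTH:
                problems.append(f"{kind} {text!r} is longer than {MAX_LABEL_LENGTH}")
            elif not all(_is_allowed_char(ch) for ch in text):
                problems.append(f"{kind} {text!r} contains a disallowed character")
    return problems


if __name__ == "__main__":
    print(network_full_name("my-project", "my-vpc"))
    print(label_problems({"team": "ml-platform", "Env": "dev"}))
```

The returned network name can be passed as NETWORK, and a labels dictionary that produces no problems can be passed as LABELS.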

You should see the following output until the status changes to RUNNING:

[Ray on Vertex AI]: Cluster State = State.PROVISIONING
Waiting for cluster provisioning; attempt 1; sleeping for 0:02:30 seconds
...
[Ray on Vertex AI]: Cluster State = State.RUNNING

Note the following:

  • The first node is used as the Head node.

  • TPU machine types are not supported.

Lifecycle management

During the lifecycle of a Ray cluster on Vertex AI, each action is associated with a state. The following table summarizes the billing status and the management options for each state. The reference documentation provides a definition for each of these states.

Action | State | Billed? | Delete action available? | Cancel action available?
The user is creating a cluster | PROVISIONING | No | No | No
The user is manually scaling up or down | UPDATING | Yes, per the real-time size | Yes | No
The cluster is running | RUNNING | Yes | Yes | Not applicable - you can delete
The cluster is autoscaling up or down | UPDATING | Yes, per the real-time size | Yes | No
The user is deleting the cluster | STOPPING | No | No | Not applicable - already stopping
The cluster enters Error state | ERROR | No | Yes | Not applicable - you can delete
Not applicable | STATE_UNSPECIFIED | No | Yes | Not applicable
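
For scripting, the billing and delete columns of the table above can be encoded as a lookup. This is an illustrative sketch only: `StateInfo`, `CLUSTER_STATES`, and `describe_state` are hypothetical names, and the values are copied from the table rather than read from the service.

```python
from typing import NamedTuple


class StateInfo(NamedTuple):
    billed: bool      # "Billed?" column
    can_delete: bool  # "Delete action available?" column


# One entry per state. UPDATING covers both manual scaling and autoscaling
# and is billed per the real-time size of the cluster.
CLUSTER_STATES = {
    "PROVISIONING": StateInfo(billed=False, can_delete=False),
    "UPDATING": StateInfo(billed=True, can_delete=True),
    "RUNNING": StateInfo(billed=True, can_delete=True),
    "STOPPING": StateInfo(billed=False, can_delete=False),
    "ERROR": StateInfo(billed=False, can_delete=True),
    "STATE_UNSPECIFIED": StateInfo(billed=False, can_delete=True),
}


def describe_state(state):
    """Look up a state given either 'RUNNING' or the logged form 'State.RUNNING'."""
    name = str(state).split(".")[-1]
    return CLUSTER_STATES[name]


if __name__ == "__main__":
    info = describe_state("State.PROVISIONING")
    print(info.billed, info.can_delete)
```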

Custom Image (Optional)

Prebuilt images align with most use cases. If you want to build your own image, you're encouraged to use the Ray on Vertex prebuilt images as a base image. See the Docker documentation for how to build your images from a base image.

These base images include an installation of Python, Ubuntu, and Ray. They also include dependencies such as:

  • python-json-logger
  • google-cloud-resource-manager
  • ca-certificates-java
  • libatlas-base-dev
  • liblapack-dev
  • g++
  • libio-all-perl
  • libyaml-0-2
  • rsync

If you want to build your own image without our base image (advanced), be sure that your image includes:

  • Ray 2.33.0 or 2.9.3
  • Python 3.10
  • python-json-logger==2.0.7
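
The requirements above can be turned into a quick self-check for a custom image. The following is a sketch under stated assumptions: `image_problems` is a hypothetical helper, and it only compares version numbers against the list above; it doesn't verify that the image works with Ray on Vertex AI.

```python
REQUIRED_PYTHON = (3, 10)
SUPPORTED_RAY_VERSIONS = {"2.33.0", "2.9.3"}
REQUIRED_JSON_LOGGER = "2.0.7"


def image_problems(python_version, packages):
    """Compare an environment against the custom image requirements.

    python_version: (major, minor) tuple, for example sys.version_info[:2].
    packages: dict mapping distribution name to version string, for example
        built with importlib.metadata.version("ray") inside the image.
    """
    problems = []
    if tuple(python_version) != REQUIRED_PYTHON:
        problems.append(f"Python {python_version} found, 3.10 required")
    ray_version = packages.get("ray")
    if ray_version not in SUPPORTED_RAY_VERSIONS:
        problems.append(f"ray {ray_version} found, 2.33.0 or 2.9.3 required")
    logger_version = packages.get("python-json-logger")
    if logger_version != REQUIRED_JSON_LOGGER:
        problems.append(
            f"python-json-logger {logger_version} found, "
            f"{REQUIRED_JSON_LOGGER} required"
        )
    return problems


if __name__ == "__main__":
    print(image_problems((3, 10), {"ray": "2.33.0", "python-json-logger": "2.0.7"}))
```

Running a check like this as the last step of an image build surfaces a version mismatch before the image is used for a cluster.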

Private and public connectivity

By default, Ray on Vertex AI creates a public, secure endpoint for interactive development with the Ray Client on Ray clusters on Vertex AI. It's recommended that you use public connectivity for development or ephemeral use cases. This public endpoint is accessible through the internet. Only authorized users who have, at a minimum, Vertex AI user role permissions on the Ray cluster's user project can access the cluster.

If you require a private connection to your cluster or if you're using VPC Service Controls, VPC peering is supported for Ray clusters on Vertex AI. Clusters with a private endpoint are only accessible from a client within a VPC network that is peered with Vertex AI.

To set up private connectivity with VPC Peering for Ray on Vertex AI, select a VPC network when you create your cluster. The VPC network requires a private services connection between your VPC network and Vertex AI. If you're using Ray on Vertex AI in the console, you can set up your private services access connection when creating the cluster.

If you want to use VPC Service Controls and VPC peering with Ray clusters on Vertex AI, there's extra setup required to use the Ray dashboard and interactive shell. Follow the instructions covered in Ray Dashboard and Interactive Shell with VPC-SC + VPC Peering to configure the interactive shell setup with VPC-SC and VPC Peering in your user project.

After you create your Ray cluster on Vertex AI, you can connect to the head node using the Vertex AI SDK for Python. The connecting environment, such as a Compute Engine VM or Vertex AI Workbench instance, must be in the VPC network that is peered with Vertex AI. A private services connection has a limited number of IP addresses, which can lead to IP address exhaustion. For that reason, reserve private connections for long-running clusters and use public connectivity for development or ephemeral clusters.
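
To get a feel for how limited an allocated range is, you can compute its size with the Python standard library. This sketch only reports the total number of addresses in a CIDR range; the example ranges are hypothetical, and how many addresses each cluster consumes isn't stated here.

```python
import ipaddress


def range_capacity(cidr):
    """Return the total number of addresses in an allocated IP range."""
    return ipaddress.ip_network(cidr).num_addresses


if __name__ == "__main__":
    # Hypothetical allocated ranges for a private services connection.
    for cidr in ("10.8.0.0/24", "10.8.0.0/16"):
        print(cidr, range_capacity(cidr))
```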

Ray Dashboard and Interactive Shell with VPC-SC + VPC Peering

  1. Configure peered-dns-domains.

    {
      VPC_NAME=NETWORK_NAME
      REGION=LOCATION
      gcloud services peered-dns-domains create training-cloud \
      --network=$VPC_NAME \
      --dns-suffix=$REGION.aiplatform-training.cloud.google.com.
    
      # Verify
      gcloud beta services peered-dns-domains list --network=$VPC_NAME
    }
        
    • NETWORK_NAME: The name of the peered VPC network.

    • LOCATION: Desired location (for example, us-central1).

  2. Configure DNS managed zone.

    {
      PROJECT_ID=PROJECT_ID
      ZONE_NAME=$PROJECT_ID-aiplatform-training-cloud-google-com
      DNS_NAME=aiplatform-training.cloud.google.com
      DESCRIPTION=aiplatform-training.cloud.google.com
    
      gcloud dns managed-zones create $ZONE_NAME  \
      --visibility=private  \
      --networks=https://www.googleapis.com/compute/v1/projects/$PROJECT_ID/global/networks/$VPC_NAME  \
      --dns-name=$DNS_NAME  \
      --description="Training $DESCRIPTION"
    }
        
    • PROJECT_ID: Your project ID. You can find the project ID on the Google Cloud console welcome page.

  3. Record DNS transaction.

    {
      gcloud dns record-sets transaction start --zone=$ZONE_NAME
    
      gcloud dns record-sets transaction add \
      --name=$DNS_NAME. \
      --type=A 199.36.153.4 199.36.153.5 199.36.153.6 199.36.153.7 \
      --zone=$ZONE_NAME \
      --ttl=300
    
      gcloud dns record-sets transaction add \
      --name="*.$DNS_NAME." \
      --type=CNAME $DNS_NAME. \
      --zone=$ZONE_NAME \
      --ttl=300
    
      gcloud dns record-sets transaction execute --zone=$ZONE_NAME
    }
        
  4. Submit a training job with the interactive shell + VPC-SC + VPC Peering enabled.
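
The names used in steps 1 through 3 follow a fixed pattern, which the following sketch reproduces so that you can compare them against what you configured. The helper names are hypothetical, and the code doesn't call gcloud or any Google Cloud API; it only rebuilds the strings and checks a list of resolved addresses against the A records from step 3.

```python
import ipaddress

DNS_NAME = "aiplatform-training.cloud.google.com"
# A record values from step 3.
EXPECTED_A_RECORDS = ["199.36.153.4", "199.36.153.5", "199.36.153.6", "199.36.153.7"]


def dns_settings(project_id, region):
    """Rebuild the names that the shell snippets derive from their variables."""
    return {
        "peered_dns_suffix": f"{region}.{DNS_NAME}.",
        "zone_name": f"{project_id}-aiplatform-training-cloud-google-com",
        "a_record_name": f"{DNS_NAME}.",
        "cname_record_name": f"*.{DNS_NAME}.",
    }


def unexpected_addresses(resolved):
    """Return the resolved addresses that aren't among the expected A records."""
    expected = {ipaddress.ip_address(a) for a in EXPECTED_A_RECORDS}
    return [a for a in resolved if ipaddress.ip_address(a) not in expected]


if __name__ == "__main__":
    for key, value in dns_settings("my-project", "us-central1").items():
        print(f"{key}: {value}")
    print(unexpected_addresses(["199.36.153.5", "8.8.8.8"]))
```

From a VM inside the peered VPC network, the list passed to `unexpected_addresses` could come from `socket.gethostbyname_ex(hostname)[2]`, where the hostname is whichever name under the private zone you want to test.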

What's next