Scale Ray clusters on Vertex AI

There are two options for scaling Ray clusters on Vertex AI: autoscaling and manual scaling. Autoscaling lets the cluster automatically adjust the number of worker nodes based on the resources that Ray tasks and actors require. Autoscaling is recommended if you are running a heavy workload and are unsure of the resources needed. Manual scaling gives you more granular control of the nodes.

Autoscaling can reduce workload costs, but it adds node launch overhead and can be tricky to configure. If you are new to Ray, we recommend that you start with non-autoscaling clusters and use the manual scaling feature.

Autoscaling

You can enable a Ray cluster's autoscaling feature by specifying the minimum replica count (min_replica_count) and maximum replica count (max_replica_count) of a worker pool.

Note that:

  • You must configure the autoscaling spec of all worker pools.
  • The min_replica_count must be greater than or equal to 1.
  • Custom upscaling and downscaling speed is not supported. For default values, see Upscaling and downscaling speed in the Ray documentation.
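The constraints above can be checked locally before you create a cluster. The following is a minimal sketch: validate_autoscaling_bounds is a hypothetical helper, not part of the Vertex AI SDK, and the check that the maximum is not lower than the minimum is an assumption that the list above does not state.

```python
def validate_autoscaling_bounds(pools):
    """Check (min_replica_count, max_replica_count) pairs, one per worker pool.

    A value of None stands for a pool that has no autoscaling spec.
    Returns a list of problems; an empty list means none were found.
    """
    problems = []
    for index, bounds in enumerate(pools):
        if bounds is None:
            # Every worker pool needs an autoscaling spec.
            problems.append(f"pool {index}: autoscaling spec is missing")
            continue
        min_count, max_count = bounds
        if min_count < 1:
            problems.append(f"pool {index}: min_replica_count must be >= 1")
        if max_count < min_count:
            # Assumption: the maximum may not be lower than the minimum.
            problems.append(
                f"pool {index}: max_replica_count is below min_replica_count"
            )
    return problems


# Pool 0 is valid, pool 1 has a minimum of 0, pool 2 has no spec.
for problem in validate_autoscaling_bounds([(1, 3), (0, 2), None]):
    print(problem)
```

Running a check like this first surfaces configuration mistakes without waiting for a cluster creation request to fail.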

Set worker pool autoscaling spec

You can use the Google Cloud console or Vertex AI SDK for Python to enable a Ray cluster's autoscaling feature.

Ray on Vertex AI SDK

from google.cloud import aiplatform
import vertex_ray
from vertex_ray import AutoscalingSpec, Resources

# Let the worker pool scale between 1 and 3 replicas.
autoscaling_spec = AutoscalingSpec(
    min_replica_count=1,
    max_replica_count=3,
)

head_node_type = Resources(
    machine_type="n1-standard-16",
    node_count=1,
)

worker_node_types = [Resources(
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    autoscaling_spec=autoscaling_spec,
)]

# Create the Ray cluster on Vertex AI
CLUSTER_RESOURCE_NAME = vertex_ray.create_ray_cluster(
    head_node_type=head_node_type,
    worker_node_types=worker_node_types,
    ...
)

Console

Following the OSS Ray best practice recommendation, the logical CPU count on the Ray head node is set to 0 so that no workload runs on the head node.

  1. In the Google Cloud console, go to the Ray on Vertex AI page.

  2. Click Create cluster to open the Create cluster panel.

  3. For each step in the Create cluster panel, review or replace the default cluster information. Click Continue to complete each step:

    1. For Name and region, specify a Name and choose a location for your cluster.
    2. For Compute settings, specify the configuration of the Ray cluster on the head node, including its machine type, accelerator type and count, disk type and size, and replica count. Optionally, specify a custom image URI to use a custom container image that adds Python dependencies not provided by the default container image. See Custom image.

      Under Advanced options, you can:

      • Specify your own encryption key.
      • Specify a custom service account.
      • Disable metrics collection, if you don't need to monitor the resource stats of your workload during training.
    3. To create a cluster with an autoscaling worker pool, provide a value for the worker pool's maximum replica count.

  4. Click Create.

Manual scaling

As workloads on your Ray clusters on Vertex AI grow or shrink, you can manually scale the number of replicas to match demand. For example, if you have excess capacity, you can scale down your worker pools to save costs.

Limitations

When you scale clusters, you can change only the number of replicas in your existing worker pools. You can't, for example, add or remove worker pools from your cluster or change the machine type of your worker pools. Also, the number of replicas for your worker pools can't be lower than one.
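These limitations can be expressed as a small pre-flight check. The sketch below is illustrative only: Pool and check_scaling_request are hypothetical names that model a worker pool as a machine type plus a replica count, and they are not part of the Vertex AI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical stand-in for a worker pool."""
    machine_type: str
    replica_count: int


def check_scaling_request(current, requested):
    """Return the reasons why the requested pools violate the limitations."""
    if len(current) != len(requested):
        return ["worker pools can't be added or removed"]
    problems = []
    for index, (old, new) in enumerate(zip(current, requested)):
        if old.machine_type != new.machine_type:
            problems.append(f"pool {index}: machine type can't be changed")
        if new.replica_count < 1:
            problems.append(f"pool {index}: replica count can't be lower than one")
    return problems


current = [Pool("n1-standard-16", 2)]
# Changing only the replica count is allowed, so no problems are reported.
print(check_scaling_request(current, [Pool("n1-standard-16", 5)]))  # prints []
```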

If you use a VPC peering connection to connect to your clusters, the maximum number of nodes is limited and depends on the number of nodes the cluster had when it was created. For more information, see Maximum number of nodes calculation. This maximum includes your head node as well as your worker pools. If you use the default network configuration, the number of nodes can't exceed the upper limits described in the create clusters documentation.

Maximum number of nodes calculation

If you're using private services access (VPC peering) to connect to your nodes, use the following formulas to check that you don't exceed the maximum number of nodes (M), assuming f(x) = min(29, 32 - ceiling(log2(x))):

  • f(2 * M) = f(2 * N)
  • f(64 * M) = f(64 * N)
  • f(max(32, 16 + M)) = f(max(32, 16 + N))

The maximum total number of nodes in the Ray on Vertex AI cluster you can scale up to (M) depends on the initial total number of nodes you set up (N). After you create the Ray on Vertex AI cluster, you can scale the total number of nodes to any amount between P and M inclusive, where P is the number of pools in your cluster.
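The three equalities above can be evaluated directly. The following sketch implements f(x) as defined above and searches by brute force for the largest M that satisfies all three equalities for a given N. The function names same_block and max_nodes and the search_limit parameter are hypothetical, and the sketch does not model the lower bound P for scaling down.

```python
def f(x):
    """f(x) = min(29, 32 - ceiling(log2(x))) for a positive integer x."""
    ceil_log2 = (x - 1).bit_length()  # exact ceiling(log2(x)) for integers >= 1
    return min(29, 32 - ceil_log2)


def same_block(initial_nodes, target_nodes):
    """True if the target M satisfies all three equalities for the initial N."""
    n, m = initial_nodes, target_nodes
    return (
        f(2 * m) == f(2 * n)
        and f(64 * m) == f(64 * n)
        and f(max(32, 16 + m)) == f(max(32, 16 + n))
    )


def max_nodes(initial_nodes, search_limit=10_000):
    """Largest M up to search_limit that satisfies the equalities."""
    return max(
        m for m in range(1, search_limit + 1) if same_block(initial_nodes, m)
    )


# Example: a cluster created with 5 nodes in total (head node included).
print(max_nodes(5))
```

Under these formulas, a cluster created with 5 nodes can grow to at most 8 nodes, because 9 nodes changes the value of f(2 * M).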

The initial total number of nodes in the cluster and the scaling up target number must be in the same color block.

[Figure: number of nodes, grouped into color blocks]

Update replica count

You can use the Google Cloud console or Vertex AI SDK for Python to update your worker pool's replica count. If your cluster includes multiple worker pools, you can individually change each of their replica counts in a single request.

Ray on Vertex AI SDK

import vertexai
import vertex_ray

vertexai.init()
cluster = vertex_ray.get_ray_cluster("CLUSTER_NAME")

# Get the resource name.
cluster_resource_name = cluster.cluster_resource_name

# Set the new size on each existing worker pool.
# REPLICA_COUNT is a placeholder for the new worker pool size.
new_worker_node_types = []
for worker_node_type in cluster.worker_node_types:
    worker_node_type.node_count = REPLICA_COUNT
    new_worker_node_types.append(worker_node_type)

# Make update call
updated_cluster_resource_name = vertex_ray.update_ray_cluster(
    cluster_resource_name=cluster_resource_name,
    worker_node_types=new_worker_node_types,
)
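The sample above sets every worker pool to the same size. To give each pool its own replica count in a single request, a helper along the following lines could be used. This is a sketch under stated assumptions: set_replica_counts is a hypothetical helper, and the SimpleNamespace objects stand in for the entries of cluster.worker_node_types so that the example runs without a cluster.

```python
from types import SimpleNamespace


def set_replica_counts(worker_node_types, replica_counts):
    """Assign one replica count per worker pool, in order, and return the pools."""
    if len(worker_node_types) != len(replica_counts):
        raise ValueError("expected exactly one replica count per worker pool")
    for pool, count in zip(worker_node_types, replica_counts):
        if count < 1:
            raise ValueError("replica count can't be lower than one")
        pool.node_count = count
    return worker_node_types


# Stand-in objects; with a real cluster, pass cluster.worker_node_types and
# hand the returned list to the update call as worker_node_types.
pools = [SimpleNamespace(node_count=1), SimpleNamespace(node_count=1)]
set_replica_counts(pools, [2, 4])
print([pool.node_count for pool in pools])  # prints [2, 4]
```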

Console

  1. In the Google Cloud console, go to the Ray on Vertex AI page.

  2. From the list of clusters, click the cluster to modify.

  3. On the Cluster details page, click Edit cluster.

  4. In the Edit cluster pane, select the worker pool to update and then modify the replica count.

  5. Click Update.

    Wait a few minutes for your cluster to update. When the update is complete, you can see the updated replica count on the Cluster details page.
