Use a private IP for custom training

Using private IP to connect to your training jobs provides more network security and lower network latency than using public IP. To use private IP, you use Virtual Private Cloud (VPC) Network Peering to peer your network with any type of Vertex AI custom training job. This allows your training code to access private IP addresses inside your Google Cloud or on-premises networks.

This guide shows how to run custom training jobs in your network after you have already set up VPC Network Peering to peer your network with a Vertex AI CustomJob, HyperparameterTuningJob, or custom TrainingPipeline resource.

Note that you can't use private IP addresses for custom training if you are also using a TPU VM.

Overview

Before you submit a custom training job using private IP, you must configure private services access to create peering connections between your network and Vertex AI. If you have already set this up, you can use your existing peering connections.

This guide covers the following tasks:

  • Understand which IP ranges to reserve for custom training.
  • Verify the status of your existing peering connections.
  • Perform Vertex AI custom training on your network.
  • Check for active training on one network before you train on another network.
  • Test that your training code can access private IPs in your network.

Reserve IP ranges for custom training

When you reserve an IP range for service producers, the range can be used by Vertex AI and other services. The following table shows the maximum number of parallel training jobs that you can run with reserved ranges from /16 to /19, assuming the range is used almost exclusively by Vertex AI. If you connect to other service producers using the same range, allocate a larger range to accommodate them and avoid IP exhaustion.

Machine configuration: up to 8 nodes.
For example: 1 primary replica in the first worker pool, 6 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server).
  • /16 reserved range: up to 63 parallel jobs
  • /17 reserved range: up to 31 parallel jobs
  • /18 reserved range: up to 15 parallel jobs
  • /19 reserved range: up to 7 parallel jobs

Machine configuration: up to 16 nodes.
For example: 1 primary replica in the first worker pool, 14 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server).
  • /16 reserved range: up to 31 parallel jobs
  • /17 reserved range: up to 15 parallel jobs
  • /18 reserved range: up to 7 parallel jobs
  • /19 reserved range: up to 3 parallel jobs

Machine configuration: up to 32 nodes.
For example: 1 primary replica in the first worker pool, 30 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server).
  • /16 reserved range: up to 15 parallel jobs
  • /17 reserved range: up to 7 parallel jobs
  • /18 reserved range: up to 3 parallel jobs
  • /19 reserved range: up to 1 parallel job

Learn more about configuring worker pools for distributed training.
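
For reference, a reservation like the ones in the table is created as part of the private services access setup mentioned earlier. The following is a minimal sketch with the gcloud CLI, assuming a /16 range and an illustrative range name of google-managed-services-range:

# Allocate a /16 range for service producers to use.
gcloud compute addresses create google-managed-services-range \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=16 \
  --network=NETWORK_NAME

# Create the peering connection that consumes the allocated range.
gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --ranges=google-managed-services-range \
  --network=NETWORK_NAME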

Check the status of existing peering connections

If you have existing peering connections you use with Vertex AI, you can list them to check status:

gcloud compute networks peerings list --network NETWORK_NAME

Verify that the state of each peering connection is ACTIVE. Learn more about active peering connections.

Perform custom training

When you perform custom training, you must specify the name of the network that you want Vertex AI to have access to.

Depending on how you perform custom training, specify the network in the corresponding field of the CustomJob, HyperparameterTuningJob, or TrainingPipeline resource that you create.

If you do not specify a network name, then Vertex AI runs your custom training without a peering connection, and without access to private IPs in your project.

Example: Creating a CustomJob with the gcloud CLI

The following example shows how to specify a network when you use the gcloud CLI to run a CustomJob that uses a prebuilt container. If you perform custom training in a different way, add the network field as described for the type of custom training job you're using.

  1. Create a config.yaml file to specify the network. If you're using Shared VPC, use your VPC host project number.

    Make sure the network name is formatted correctly:

    # Set PROJECT_ID to your project ID before running (for Shared VPC, use the host project ID).
    PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
    
    cat <<EOF > config.yaml
    network: projects/${PROJECT_NUMBER}/global/networks/NETWORK_NAME
    EOF
    
  2. Create a training application to run on Vertex AI.

  3. Create the CustomJob, passing in your config.yaml file:

    gcloud ai custom-jobs create \
      --region=LOCATION \
      --display-name=JOB_NAME \
      --python-package-uris=PYTHON_PACKAGE_URIS \
      --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE \
      --config=config.yaml
    

To learn how to replace the placeholders in this command, read Creating custom training jobs.
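
For instance, a filled-in invocation might look like the following. The region, job name, bucket path, machine type, and module name are hypothetical; replace PYTHON_PACKAGE_EXECUTOR_IMAGE_URI with one of the prebuilt training container URIs:

gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=private-ip-test-job \
  --python-package-uris=gs://my-bucket/trainer-0.1.tar.gz \
  --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=trainer.task \
  --config=config.yaml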

Run jobs on different networks

You can't perform custom training on a new network while you are still performing custom training on another network. Before you switch to a different network, you must wait for all submitted CustomJob, HyperparameterTuningJob, and custom TrainingPipeline resources to finish, or you must cancel them.
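
To see whether any jobs are still running before you switch networks, you can list them and, if necessary, cancel them with the gcloud CLI. This is a minimal sketch; the region and filter expression are illustrative:

# List custom jobs that are pending or running in a region.
gcloud ai custom-jobs list \
  --region=us-central1 \
  --filter="state:JOB_STATE_PENDING OR state:JOB_STATE_RUNNING"

# Cancel a job that is still running.
gcloud ai custom-jobs cancel JOB_ID --region=us-central1

# Hyperparameter tuning jobs have equivalent list and cancel commands.
gcloud ai hp-tuning-jobs list --region=us-central1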

Test training job access

This section explains how to test that a custom training resource can access private IPs in your network.

  1. Create a Compute Engine instance in your VPC network.
  2. Check your firewall rules to make sure that they don't restrict ingress from the IP range you reserved for Vertex AI (and other service producers). If they do, add a rule that allows ingress from that range, as shown in the sketch after this list.
  3. Set up a local server on the VM instance in order to create an endpoint for a Vertex AI CustomJob to access.
  4. Create a Python training application to run on Vertex AI. Instead of model training code, create code that accesses the endpoint you set up in the previous step.
  5. Follow the previous example to create a CustomJob.
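
The following sketch shows one way to carry out steps 2 and 3 with the gcloud CLI. The rule name, reserved range (10.8.0.0/16), and port (8080) are illustrative; use the range you actually allocated:

# Step 2: allow ingress from the reserved service-producer range to instances in your network.
gcloud compute firewall-rules create allow-vertex-training-ingress \
  --network=NETWORK_NAME \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8080 \
  --source-ranges=10.8.0.0/16

# Step 3: on the Compute Engine VM, start a minimal HTTP server to act as the test endpoint.
python3 -m http.server 8080

In step 4, the training code only needs to send a request to the VM's internal IP address on that port (for example, with curl or a Python HTTP client) and report whether the request succeeded.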

Common problems

This section lists some common issues when configuring VPC Network Peering with Vertex AI.

  • When you configure Vertex AI to use your network, specify the full network name:

    "projects/YOUR_PROJECT_NUMBER/global/networks/YOUR_NETWORK_NAME"

  • Make sure that no custom training is still running on one network before you start custom training on a different network.

  • Make sure that you've allocated a sufficient IP range for all service producers your network connects to, including Vertex AI. You can list the currently allocated ranges as shown in the sketch below.
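
A minimal way to check the ranges currently allocated for service producers in your project:

# List the IP ranges allocated for private services access.
gcloud compute addresses list \
  --global \
  --filter="purpose=VPC_PEERING" \
  --format="table(name,network,prefixLength,address)"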

For additional troubleshooting information, refer to the VPC Network Peering troubleshooting guide.

What's next