Using private IP to connect to your training jobs provides more network security and lower network latency than using public IP. To use private IP, you use Virtual Private Cloud (VPC) Network Peering to peer your network with Vertex AI for any type of custom training job. This allows your training code to access private IP addresses inside your Google Cloud or on-premises networks.
This guide shows how to run custom training jobs in your network after you have already set up VPC Network Peering to peer your network with Vertex AI. It applies to CustomJob, HyperparameterTuningJob, and custom TrainingPipeline resources.
Overview
Before you submit a custom training job using private IP, you must configure private services access to create peering connections between your network and Vertex AI. If you have already set this up, you can use your existing peering connections.
This guide covers the following tasks:
- Understand which IP ranges to reserve for custom training.
- Verify the status of your existing peering connections.
- Perform Vertex AI custom training on your network.
- Check for active training on one network before training on another network.
- Test that your training code can access private IPs in your network.
Reserve IP ranges for custom training
When you reserve an IP range for service producers, the range can be used by Vertex AI and other services. The following table shows the maximum number of parallel training jobs that you can run with reserved ranges from /16 to /19, assuming the range is used almost exclusively by Vertex AI. If you connect to other service producers using the same range, allocate a larger range to accommodate them and avoid IP exhaustion.
| Machine configuration for training job | Reserved range | Maximum number of parallel jobs |
| --- | --- | --- |
| Up to 8 nodes. For example: 1 primary replica in the first worker pool, 6 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server) | /16 | 63 |
| | /17 | 31 |
| | /18 | 15 |
| | /19 | 7 |
| Up to 16 nodes. For example: 1 primary replica in the first worker pool, 14 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server) | /16 | 31 |
| | /17 | 15 |
| | /18 | 7 |
| | /19 | 3 |
| Up to 32 nodes. For example: 1 primary replica in the first worker pool, 30 replicas in the second worker pool, and 1 worker in the third worker pool (to act as a parameter server) | /16 | 15 |
| | /17 | 7 |
| | /18 | 3 |
| | /19 | 1 |
Learn more about configuring worker pools for distributed training.
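If you haven't reserved a range yet, the following is a minimal sketch of reserving a /16 range and connecting it for private services access. The range name google-managed-services-NETWORK_NAME is only an illustrative convention, not a requirement:

```
# Reserve a /16 block of private addresses for service producers.
gcloud compute addresses create google-managed-services-NETWORK_NAME \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=16 \
  --network=NETWORK_NAME

# Create (or update) the private services access connection with that range.
gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --ranges=google-managed-services-NETWORK_NAME \
  --network=NETWORK_NAME
```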
Check the status of existing peering connections
If you have existing peering connections you use with Vertex AI, you can list them to check status:
gcloud compute networks peerings list --network NETWORK_NAME
You should see that the state of your peering connections is ACTIVE.
Learn more about active peering connections.
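You can also inspect the private services access connection itself to see which reserved ranges it uses. A sketch, assuming the default servicenetworking.googleapis.com service:

```
# List the private services access connection and its reserved ranges.
gcloud services vpc-peerings list \
  --network=NETWORK_NAME \
  --service=servicenetworking.googleapis.com
```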
Perform custom training
When you perform custom training, you must specify the name of the network that you want Vertex AI to have access to.
Depending on how you perform custom training, specify the network in one of the following API fields:
- If you are creating a CustomJob, specify the CustomJob.jobSpec.network field. If you are using the Google Cloud CLI, then you can use the --config flag on the gcloud ai custom-jobs create command to specify the network field. Learn more about creating a CustomJob.
- If you are creating a HyperparameterTuningJob, specify the HyperparameterTuningJob.trialJobSpec.network field. If you are using the gcloud CLI, then you can use the --config flag on the gcloud ai hp-tuning-jobs create command to specify the network field. Learn more about creating a HyperparameterTuningJob.
- If you are creating a TrainingPipeline without hyperparameter tuning, specify the TrainingPipeline.trainingTaskInputs.network field. Learn more about creating a custom TrainingPipeline.
- If you are creating a TrainingPipeline with hyperparameter tuning, specify the TrainingPipeline.trainingTaskInputs.trialJobSpec.network field.
If you do not specify a network name, then Vertex AI runs your custom training without a peering connection, and without access to private IPs in your project.
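For reference, the following rough sketch shows where the network field sits if you call the REST API directly to create a CustomJob. It is a minimal illustration rather than a complete request; placeholders such as TRAINING_IMAGE_URI and the machine type are assumptions you would replace with your own values:

```
# Sketch only: a minimal CustomJob request with jobSpec.network set.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs" \
  -d '{
    "displayName": "JOB_NAME",
    "jobSpec": {
      "workerPoolSpecs": [{
        "machineSpec": {"machineType": "n1-standard-4"},
        "replicaCount": 1,
        "containerSpec": {"imageUri": "TRAINING_IMAGE_URI"}
      }],
      "network": "projects/PROJECT_NUMBER/global/networks/NETWORK_NAME"
    }
  }'
```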
Example: Creating a CustomJob with the gcloud CLI
The following example shows how to specify a network when you use the gcloud CLI to run a CustomJob that uses a prebuilt container. If you perform custom training in a different way, add the network field as described for the type of custom training job you're using.
Create a config.yaml file to specify the network. If you're using Shared VPC, use your VPC host project number. Make sure the network name is formatted correctly:

PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")

cat <<EOF > config.yaml
network: projects/$PROJECT_NUMBER/global/networks/NETWORK_NAME
EOF
Create a training application to run on Vertex AI.
Create the CustomJob, passing in your config.yaml file:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --python-package-uris=PYTHON_PACKAGE_URIS \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE \
  --config=config.yaml
To learn how to replace the placeholders in this command, read Creating custom training jobs.
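After the job is created, one way to confirm that the peered network was attached is to inspect the job resource. This sketch assumes JOB_ID is the numeric ID shown in the create command's output:

```
# Print the network attached to the custom job (empty if none was set).
gcloud ai custom-jobs describe JOB_ID \
  --region=LOCATION \
  --format="value(jobSpec.network)"
```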
Run jobs on different networks
You can't perform custom training on a new network while you are still performing custom training on another network. Before you switch to a different network, you must wait for all submitted CustomJob, HyperparameterTuningJob, and custom TrainingPipeline resources to finish, or you must cancel them.
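One way to check for jobs that are still active before switching networks is to list them with a state filter. A sketch, assuming the standard JobState values:

```
# Custom jobs that are still pending or running.
gcloud ai custom-jobs list \
  --region=LOCATION \
  --filter="state=JOB_STATE_PENDING OR state=JOB_STATE_RUNNING"

# Hyperparameter tuning jobs that are still pending or running.
gcloud ai hp-tuning-jobs list \
  --region=LOCATION \
  --filter="state=JOB_STATE_PENDING OR state=JOB_STATE_RUNNING"
```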
Test training job access
This section explains how to test that a custom training resource can access private IPs in your network.
- Create a Compute Engine instance in your VPC network.
- Check your firewall rules to make sure that they don't restrict ingress from the Vertex AI network. If they do, add a rule that allows ingress from the IP range you reserved for Vertex AI (and other service producers).
- Set up a local server on the VM instance to create an endpoint for a Vertex AI CustomJob to access (for example, a simple HTTP server, as sketched after this list).
to access. - Create a Python training application to run on Vertex AI. Instead of model training code, create code that accesses the endpoint you set up in the previous step.
- Follow the previous example to create a CustomJob.
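A minimal sketch of this test, assuming an arbitrary test port of 8000 and an illustrative firewall rule name of allow-vertex-training-test; RESERVED_RANGE is the CIDR block you reserved for service producers and VM_PRIVATE_IP is the VM's internal address:

```
# On the Compute Engine VM in your VPC network: serve a trivial endpoint.
python3 -m http.server 8000

# Allow ingress from the reserved range to the test port.
gcloud compute firewall-rules create allow-vertex-training-test \
  --network=NETWORK_NAME \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8000 \
  --source-ranges=RESERVED_RANGE

# From your training code (for example, via a shell call at startup):
curl http://VM_PRIVATE_IP:8000
```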
Common problems
This section lists some common issues for configuring VPC Network Peering with Vertex AI.
When you configure Vertex AI to use your network, specify the full network name:
"projects/YOUR_PROJECT_NUMBER/global/networks/YOUR_NETWORK_NAME"
Make sure that no custom training is running on one network before you start custom training on a different network.
Make sure that you've allocated a sufficient IP range for all service producers your network connects to, including Vertex AI.
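To see how much address space you've already allocated to service producers, you can list the global addresses reserved for VPC peering. A sketch; the output fields shown are assumptions about what you'll want to compare:

```
# List reserved ranges for service producers and their prefix lengths.
gcloud compute addresses list \
  --global \
  --filter="purpose=VPC_PEERING" \
  --format="table(name, network, prefixLength)"
```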
For additional troubleshooting information, refer to the VPC Network Peering troubleshooting guide.
What's next
- Learn more about VPC Network Peering.
- See reference architectures and best practices for VPC design.