Using VPC Network Peering with Training

You can configure AI Platform Training jobs to peer with Virtual Private Cloud (VPC). This allows your training jobs to access private IP addresses inside your Google Cloud or on-premises networks. Using private IP to connect to your training jobs provides more network security and lower network latency than using public IP.

This guide shows how to configure private IP with AI Platform Training by using VPC Network Peering to peer your network with AI Platform Training jobs. This guide is recommended for networking administrators who are already familiar with Google Cloud networking concepts.

Overview

This guide covers the following tasks:

  • Configure private services access for the VPC. This establishes a peering connection between your VPC and Google's shared VPC network.
  • Consider the IP range you need to reserve for AI Platform Training.
  • If applicable, export custom routes so that AI Platform Training can import them.
  • Verify the status of your peering connections.
  • Submit a training job on your network.
  • Check for active training jobs on one network before submitting jobs on another network.
  • Test that a training job can access private IPs in your network.

Before you begin

  • Select a VPC that you want to peer with AI Platform Training jobs.
  • Select or create a Google Cloud project to run training jobs with AI Platform Training.
  • Make sure that billing is enabled for your Google Cloud project.

  • Enable the Compute Engine, AI Platform Training & Prediction, and Service Networking APIs (see the gcloud example after this list).
  • Optionally, you can use Shared VPC. If you use Shared VPC, you usually run training jobs in a separate Google Cloud project from your VPC host project. Enable the Compute Engine and Service Networking APIs in both projects. Learn how to provision Shared VPC.
  • Install the gcloud CLI if you want to run the gcloud command-line examples in this guide.
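
If you use the gcloud CLI, you can enable the required APIs with a single command. The service IDs are compute.googleapis.com (Compute Engine), ml.googleapis.com (AI Platform Training & Prediction), and servicenetworking.googleapis.com (Service Networking). If you use Shared VPC, repeat the command with --project for your VPC host project.

# Enable the Compute Engine, AI Platform Training & Prediction, and
# Service Networking APIs in the current project.
gcloud services enable \
  compute.googleapis.com \
  ml.googleapis.com \
  servicenetworking.googleapis.com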

Permissions

If you are not a project owner or editor, make sure you have the Compute Network Admin role, which includes the permissions you need to manage networking resources.

To run jobs on AI Platform Training, you need the permissions included in the AI Platform Training Admin or AI Platform Training Developer roles. Learn more about the AI Platform Training IAM roles.
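
As a sketch, a project owner can grant these roles from the command line. Replace the placeholder project ID and email; roles/compute.networkAdmin and roles/ml.developer are the role IDs for Compute Network Admin and AI Platform Training Developer.

# Grant the Compute Network Admin and AI Platform Training Developer roles
# to a user (replace YOUR_PROJECT_ID and USER_EMAIL).
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/compute.networkAdmin"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/ml.developer"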

Peering with an on-premises network

For VPC Network Peering with an on-premises network, there are additional steps:

  1. Connect your on-premises network to your VPC. You can use a Cloud VPN tunnel or Cloud Interconnect.
  2. Set up custom routes from the VPC to your on-premises network (see the example after this list).
  3. Export your custom routes so that AI Platform Training can import them.
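
If your VPN tunnel uses static routing, step 2 can be done by creating a static route to your on-premises range. The following is a minimal sketch; the route name, tunnel name, region, and destination range are placeholders for your own values. If you use dynamic routing with Cloud Router, routes are exchanged over BGP instead.

# Hypothetical static route that sends traffic for an on-premises range
# through an existing Cloud VPN tunnel. Replace the placeholder values.
gcloud compute routes create on-prem-route \
  --network=YOUR_NETWORK_NAME \
  --destination-range=10.10.0.0/16 \
  --next-hop-vpn-tunnel=YOUR_VPN_TUNNEL \
  --next-hop-vpn-tunnel-region=us-central1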

Set up private services access for your VPC

When you set up private services access, you establish a private connection between your network and a network owned by Google or a third party (the service producer). In this case, AI Platform Training is a service producer. To set up private services access, you reserve an IP range for service producers and then create a peering connection with AI Platform Training.

If you already have a VPC with private services access configured, and you want to use that VPC to peer with your training job, move on to exporting custom routes.

  1. Set environment variables for your project ID, region name, the name of your reserved range, and the name of your network.
    • If you use Shared VPC, use the project ID of your VPC host project. Otherwise, use the project ID of the Google Cloud project that you use to run training jobs.
    • Select an eligible region to use with AI Platform Training.
  2. Enable the required APIs. If you use Shared VPC, make sure to enable the APIs in your VPC host project and the Google Cloud project you use to run training jobs.
  3. Reserve an IP range for service producers by using gcloud compute addresses create.
  4. Establish a peering connection between your VPC host project and Google's Service Networking, using gcloud services vpc-peerings connect.

    PROJECT_ID=YOUR_PROJECT_ID
    gcloud config set project $PROJECT_ID
    
    REGION=YOUR_REGION
    
    # This is for display only; you can name the range anything.
    PEERING_RANGE_NAME=google-reserved-range
    
    NETWORK=YOUR_NETWORK_NAME
    
    # NOTE: `prefix-length=16` means a CIDR block with mask /16 will be
    # reserved for use by Google services, such as AI Platform Training.
    gcloud compute addresses create $PEERING_RANGE_NAME \
      --global \
      --prefix-length=16 \
      --description="peering range for Google service" \
      --network=$NETWORK \
      --purpose=VPC_PEERING
    
    # Create the VPC connection.
    gcloud services vpc-peerings connect \
      --service=servicenetworking.googleapis.com \
      --network=$NETWORK \
      --ranges=$PEERING_RANGE_NAME \
      --project=$PROJECT_ID
    

Learn more about private services access.

Reserving IP ranges for AI Platform Training

When you reserve an IP range for service producers, that range can be used both by AI Platform Training and by other services. If you plan to connect other services to the same reserved range, make sure that the range is large enough to avoid IP exhaustion.

You can conservatively estimate the number of addresses that each training job uses as follows:

nextPow2(32 * NUMBER_OF_POOLS * max(POOL_SIZE))
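
For example, if you interpret NUMBER_OF_POOLS as the number of node pools in the job (such as workers, master, and parameter servers) and max(POOL_SIZE) as the size of the largest pool, a job with 6 workers, 1 master, and 1 parameter server uses about nextPow2(32 * 3 * 6) = nextPow2(576) = 1024 addresses. A /16 range contains 65,536 addresses, enough for roughly 64 such jobs, which lines up with the 63 parallel jobs shown for this configuration in the table below.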

The following table shows the maximum number of parallel training jobs that you can run with reserved ranges from /16 to /19, assuming the range is used almost exclusively by AI Platform Training.

Machine configuration for training job              Reserved range   Maximum number of parallel jobs
Up to 8 nodes                                        /16              63
(for example, 6 workers, 1 master,                   /17              31
and 1 parameter server)                              /18              15
                                                     /19              7
Up to 16 nodes                                       /16              31
(for example, 14 workers, 1 master,                  /17              15
and 1 parameter server)                              /18              7
                                                     /19              3
Up to 32 nodes                                       /16              15
(for example, 30 workers, 1 master,                  /17              7
and 1 parameter server)                              /18              3
                                                     /19              1

Learn more about specifying machine types for training jobs.
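
For example, a custom-tier configuration that matches the 8-node row in the table above (1 master, 6 workers, and 1 parameter server) might look like the following sketch. The machine type n1-standard-4 is only an illustration; choose machine types that fit your workload.

PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")

# Hypothetical config.yaml for an 8-node job: 1 master, 6 workers, and
# 1 parameter server, all using the same machine type.
cat << EOF > config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-4
  workerType: n1-standard-4
  workerCount: 6
  parameterServerType: n1-standard-4
  parameterServerCount: 1
  network: projects/$PROJECT_NUMBER/global/networks/$NETWORK
EOF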

Export custom routes

If you use custom routes, you need to export them so that AI Platform Training can import them. If you don't use custom routes, move on to submitting your training job.

To export custom routes, you update the peering connection in your VPC. Exporting custom routes sends all eligible static and dynamic routes that are in your VPC network, such as routes to your on-premises network, to service producers' networks (AI Platform Training in this case). This establishes the necessary connections and allows training jobs to send traffic back to your on-premises network.

Learn more about private connections with on-premises networks.

Console

  1. Go to the VPC Network Peering page in the Google Cloud console.
  2. Select the peering connection to update.
  3. Click Edit.
  4. Select Export custom routes.

gcloud

  1. Find the name of the peering connection to update. If you have multiple peering connections, omit the --format flag.

    gcloud services vpc-peerings list \
      --network=$NETWORK \
      --service=servicenetworking.googleapis.com \
      --project=$PROJECT_ID \
      --format "value(peering)"
    
  2. Update the peering connection to export custom routes.

    gcloud compute networks peerings update PEERING-NAME \
        --network=$NETWORK \
        --export-custom-routes \
        --project=$PROJECT_ID
    

Check the status of your peering connections

To verify that your peering connections are active, list them with the following command:

gcloud compute networks peerings list --network $NETWORK

You should see that the state of the peering you just created is ACTIVE. Learn more about active peering connections.

Submit the training job

When you submit your training job, you need to specify the name of the network that you want AI Platform Training to have access to.

If you submit a training job without a network name, the training job runs by default without a peering connection, and without access to private IPs in your project.

  1. Create a config.yaml to specify the network. If you're using Shared VPC, use your VPC host project number.

    Make sure the network name is formatted correctly:

    PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
    cat << EOF > config.yaml
    trainingInput:
      scaleTier: BASIC
      network: projects/$PROJECT_NUMBER/global/networks/$NETWORK
    EOF
    
  2. Create a training application to submit to AI Platform Training.

  3. Specify other parameters for your training job. Learn more about the parameters needed to submit a training job.

    TRAINER_PACKAGE_PATH="PATH_TO_YOUR_TRAINING_APPLICATION"
    now=$(date +"%Y%m%d_%H%M%S")
    JOB_NAME="YOUR_JOB_NAME_$now"
    MAIN_TRAINER_MODULE="trainer.task"
    JOB_DIR="gs://PATH_TO_OUTPUT_DIRECTORY"
    REGION="us-east1"
    RUNTIME_VERSION="2.11"
    PYTHON_VERSION="3.7"
    
  4. Submit the job, passing in your config.yaml file:

    gcloud ai-platform jobs submit training $JOB_NAME \
      --module-name $MAIN_TRAINER_MODULE \
      --job-dir $JOB_DIR \
      --region $REGION \
      --runtime-version $RUNTIME_VERSION \
      --python-version $PYTHON_VERSION \
      --package-path $TRAINER_PACKAGE_PATH \
      --config config.yaml
    

Running jobs on different networks

You can't submit training jobs to a new network while there are still active training jobs on another network. Before you switch to a different network, wait for all submitted training jobs to finish, or cancel them. For example, suppose you set up a network for testing and submitted training jobs on it. Before you submit training jobs on a different network for production, search for active jobs on the testing network and make sure they have completed or been cancelled.

List training jobs that are still active on a network:

PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
NETWORK_FULL_NAME="projects/$PROJECT_NUMBER/global/networks/$NETWORK"

gcloud ai-platform jobs list \
  --filter "(state=queued OR state=running) AND (trainingInput.network=$NETWORK_FULL_NAME)"

Your output might appear similar to this:

JOB_ID                                             STATUS     CREATED
job_20200502_151443                                QUEUED     2020-05-02T15:14:47

If any jobs are listed, you can wait for them to complete, or use gcloud ai-platform jobs cancel to cancel each one.
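
For example, to cancel the queued job from the sample output above:

gcloud ai-platform jobs cancel job_20200502_151443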

Test training job access

To test that your training job can access an endpoint in your network, you need to set up an endpoint in your network and then submit a training job that accesses it. Learn more by reading the guide to testing your peering connection.
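
As a minimal sketch of that setup, you could point a small trainer module at a private IP address you control in the peered network. The endpoint address, file names, and port below are hypothetical; replace them with an endpoint that actually exists in your network.

# Hypothetical trainer module that probes a private endpoint in your network.
mkdir -p trainer && touch trainer/__init__.py
cat << 'EOF' > trainer/task.py
import urllib.request

# Private IP of a test server that is reachable only through the peered network.
ENDPOINT = "http://10.128.0.5:8080"

def main():
    with urllib.request.urlopen(ENDPOINT, timeout=30) as response:
        print("Reached private endpoint, HTTP status:", response.status)

if __name__ == "__main__":
    main()
EOF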

Troubleshooting

This section lists some common issues for configuring VPC Network Peering with AI Platform Training.

  • When you submit your training job, use the full network name:

    "projects/YOUR_PROJECT_NUMBER/global/networks/YOUR_NETWORK_NAME"

  • Do not use TPUs in a training job peered with your network. TPUs are not supported with VPC Network Peering on AI Platform Training.

  • Make sure there are no active training jobs on a network before submitting training jobs on a different network.

  • Make sure that you've allocated a sufficient IP range for all service producers your network connects to, including AI Platform Training.

For additional troubleshooting information, refer to the VPC Network Peering troubleshooting guide.

What's next