Train an ML model with PyTorch

This tutorial describes how to run a training job that uses the PyTorch machine learning framework. It explains how configuring your job to use PyTorch differs slightly from using other ML frameworks supported by AI Platform Training, and then shows you how to run a training job using sample PyTorch code that trains a model based on data from the Chicago Taxi Trips dataset.

The tutorial also shows how to use PyTorch with GPUs and with hyperparameter tuning.

PyTorch containers

AI Platform Training's runtime versions do not include PyTorch as a dependency. Instead, to run a training job that uses PyTorch, specify a pre-built PyTorch container for AI Platform Training to use.

Configuring a pre-built container for training uses some of the same syntax as configuring a custom container. However, you do not have to create your own Docker container; instead, you specify the URI of a container image provided by AI Platform and supply a Python training package that you create.

AI Platform provides the following pre-built PyTorch containers:

Container image URI                                 PyTorch version   Supported processors
gcr.io/cloud-ml-public/training/pytorch-xla.1-11    1.11              CPU, TPU
gcr.io/cloud-ml-public/training/pytorch-gpu.1-11    1.11              GPU
gcr.io/cloud-ml-public/training/pytorch-xla.1-10    1.10              CPU, TPU
gcr.io/cloud-ml-public/training/pytorch-gpu.1-10    1.10              GPU
gcr.io/cloud-ml-public/training/pytorch-xla.1-9     1.9               CPU, TPU
gcr.io/cloud-ml-public/training/pytorch-gpu.1-9     1.9               GPU
gcr.io/cloud-ml-public/training/pytorch-xla.1-7     1.7               CPU, TPU
gcr.io/cloud-ml-public/training/pytorch-gpu.1-7     1.7               GPU
gcr.io/cloud-ml-public/training/pytorch-xla.1-6     1.6               CPU, TPU
gcr.io/cloud-ml-public/training/pytorch-gpu.1-6     1.6               GPU
gcr.io/cloud-ml-public/training/pytorch-cpu.1-4     1.4               CPU
gcr.io/cloud-ml-public/training/pytorch-gpu.1-4     1.4               GPU

These container images are derived from Deep Learning Containers and include the dependencies provided by Deep Learning Containers.

If you want to use a version of PyTorch not available in one of the pre-built containers, follow the guide to using a custom container.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the AI Platform Training & Prediction API.

    Enable the API

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init

Downloading sample code

Run the following commands to download the sample PyTorch training application and navigate to the directory with the training application:

git clone --depth=1 \
  https://github.com/GoogleCloudPlatform/ai-platform-samples.git

cd ai-platform-samples/training/pytorch/structured/python_package

Optionally, inspect the structure of the training code:

ls -pR

The trainer/ directory contains the PyTorch training application, and setup.py provides configuration details for packaging the training application.
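
For reference, a minimal setup.py for a Python training package generally looks like the following sketch. This is illustrative rather than the sample's exact file; the package name and dependency list here are assumptions:

from setuptools import find_packages, setup

setup(
    name='trainer',  # illustrative: matches the trainer/ directory
    version='0.1',
    packages=find_packages(),
    install_requires=[
        # Only dependencies that the pre-built PyTorch container does not
        # already provide need to be listed here, for example the
        # metric-reporting package used later for hyperparameter tuning:
        'cloudml-hypertune',
    ],
)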

Creating a Cloud Storage bucket

Create a Cloud Storage bucket to store your packaged training code and the model artifacts that your training job creates. Run the following command:

gsutil mb -l us-central1 gs://BUCKET_NAME

Replace BUCKET_NAME with a unique name that you choose for your bucket. Learn about requirements for bucket names.

Alternatively, you can use an existing Cloud Storage bucket in your Google Cloud project. For this tutorial, make sure to use a bucket in the us-central1 region.
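
To confirm an existing bucket's location, you can list its metadata; the following command prints the bucket's location constraint along with other details:

gsutil ls -L -b gs://BUCKET_NAME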

Training a PyTorch model

This tutorial shows several ways to train a PyTorch model on AI Platform Training:

  • On a virtual machine (VM) instance with a CPU processor
  • On a VM with a GPU processor
  • Using hyperparameter tuning (on a VM with a CPU processor)

Choose one of these ways now, and follow the instructions in the corresponding tabs for the rest of this tutorial. You can then repeat this section if you want to try training with one of the other configurations.

Preparing to create a training job

Before you create a training job, ensure that your training code is ready and specify some configuration options in your local environment.

CPU

Set several Bash variables to use when you create your training job:

BUCKET_NAME=BUCKET_NAME
JOB_NAME=getting_started_pytorch_cpu
JOB_DIR=gs://${BUCKET_NAME}/${JOB_NAME}/models

Replace BUCKET_NAME with the name of the Cloud Storage bucket that you created in a previous section.

GPU

  1. Ensure that your PyTorch training code is aware of the GPU on the VM that your training job uses, so that PyTorch moves tensors and modules to the GPU appropriately.

    If you use the provided sample code, you don't need to do anything, because the sample code contains logic to detect whether the machine running the code has a GPU:

    import torch

    # Use the GPU if CUDA is available; otherwise fall back to the CPU.
    cuda_availability = torch.cuda.is_available()
    if cuda_availability:
      device = torch.device('cuda:{}'.format(torch.cuda.current_device()))
    else:
      device = 'cpu'

    If you alter the training code, read the PyTorch guide to CUDA semantics to ensure that the GPU gets used. A minimal sketch of moving a model and tensors to the detected device appears after these steps.

  2. Set several Bash variables to use when you create your training job:

    BUCKET_NAME=BUCKET_NAME
    JOB_NAME=getting_started_pytorch_gpu
    JOB_DIR=gs://${BUCKET_NAME}/${JOB_NAME}/models
    

    Replace BUCKET_NAME with the name of the Cloud Storage bucket that you created in a previous section.
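
As noted in step 1, once a device has been selected, training code moves the model and each batch of tensors to that device. The following is a minimal self-contained sketch of that pattern, not the sample's actual training loop; the model, data, and hyperparameter values are placeholders:

import torch
import torch.nn as nn

# Select the GPU if one is available, mirroring the sample's detection logic.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 1).to(device)  # placeholder model, moved to the device
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# Placeholder batch; real code would move each DataLoader batch the same way.
features = torch.randn(100, 10).to(device)
targets = torch.randn(100, 1).to(device)

optimizer.zero_grad()
loss = loss_fn(model(features), targets)
loss.backward()
optimizer.step()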

Hyperparameter tuning

The sample code for this tutorial tunes the learning rate and batch size parameters in order to minimize test loss.

  1. Ensure that your training code is ready for hyperparameter tuning on AI Platform Training: it must accept each tuned hyperparameter as a command-line argument and report the metric that you want to optimize by using the cloudml-hypertune Python package. The provided sample code already does both; a sketch of the metric-reporting call appears after these steps.

  2. Run the following command to create a config.yaml file that specifies hyperparameter tuning options:

    cat > config.yaml <<END
    trainingInput:
      hyperparameters:
        goal: MINIMIZE
        hyperparameterMetricTag: test_loss
        maxTrials: 2
        maxParallelTrials: 2
        enableTrialEarlyStopping: True
        params:
        - parameterName: learning-rate
          type: DOUBLE
          minValue: 0.0001
          maxValue: 1
          scaleType: UNIT_LOG_SCALE
        - parameterName: batch-size
          type: INTEGER
          minValue: 1
          maxValue: 256
          scaleType: UNIT_LINEAR_SCALE
    END
    

    These options tune the --learning-rate and --batch-size hyperparameters in order to minimize the test_loss metric that the training code reports.

  3. Set several Bash variables to use when you create your training job:

    BUCKET_NAME=BUCKET_NAME
    JOB_NAME=getting_started_pytorch_hptuning
    JOB_DIR=gs://${BUCKET_NAME}/${JOB_NAME}/models
    

    Replace BUCKET_NAME with the name of the Cloud Storage bucket that you created in a previous section.
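
As mentioned in step 1, the training code reports the metric named by hyperparameterMetricTag using the cloudml-hypertune package. A minimal sketch of that reporting call follows; the metric value and step number are placeholders for values your training loop computes:

import hypertune

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='test_loss',  # must match config.yaml
    metric_value=0.42,                      # placeholder: the computed test loss
    global_step=10)                         # placeholder: for example, the epoch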

These Bash variables serve the following purposes:

  • JOB_NAME is an identifier for your AI Platform Training job. It must be unique among AI Platform Training jobs in your Google Cloud project.
  • JOB_DIR tells AI Platform Training where to upload your packaged training application. The training application also uses JOB_DIR to determine where to export model artifacts when it finishes training.
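
For illustration, exporting the artifact to the JOB_DIR location at the end of training typically looks something like the following sketch. This is not the sample's exact export code; it assumes the google-cloud-storage client library and a model saved with torch.save:

import torch
from google.cloud import storage

def save_model(model, job_dir):
    """Save the trained model locally, then upload it under job_dir."""
    torch.save(model.state_dict(), 'model.pth')

    # job_dir looks like gs://BUCKET_NAME/JOB_NAME/models
    bucket_name, _, prefix = job_dir[len('gs://'):].partition('/')
    blob = storage.Client().bucket(bucket_name).blob(prefix + '/model.pth')
    blob.upload_from_filename('model.pth')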

Creating a training job

Run the following command to create a training job:

CPU

gcloud ai-platform jobs submit training ${JOB_NAME} \
  --region=us-central1 \
  --master-image-uri=gcr.io/cloud-ml-public/training/pytorch-xla.1-10 \
  --scale-tier=BASIC \
  --job-dir=${JOB_DIR} \
  --package-path=./trainer \
  --module-name=trainer.task \
  -- \
  --train-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_train.csv \
  --eval-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_eval.csv \
  --num-epochs=10 \
  --batch-size=100 \
  --learning-rate=0.001

GPU

gcloud ai-platform jobs submit training ${JOB_NAME} \
  --region=us-central1 \
  --master-image-uri=gcr.io/cloud-ml-public/training/pytorch-gpu.1-10 \
  --scale-tier=CUSTOM \
  --master-machine-type=n1-standard-8 \
  --master-accelerator=type=nvidia-tesla-p100,count=1 \
  --job-dir=${JOB_DIR} \
  --package-path=./trainer \
  --module-name=trainer.task \
  -- \
  --train-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_train.csv \
  --eval-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_eval.csv \
  --num-epochs=10 \
  --batch-size=100 \
  --learning-rate=0.001

Hyperparameter tuning

gcloud ai-platform jobs submit training ${JOB_NAME} \
  --region=us-central1 \
  --master-image-uri=gcr.io/cloud-ml-public/training/pytorch-xla.1-10 \
  --scale-tier=BASIC \
  --job-dir=${JOB_DIR} \
  --package-path=./trainer \
  --module-name=trainer.task \
  --config=config.yaml \
  -- \
  --train-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_train.csv \
  --eval-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_eval.csv \
  --num-epochs=10
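
In each of these commands, the arguments after the bare -- separator are passed to your trainer.task module rather than interpreted by gcloud (AI Platform Training also forwards the --job-dir flag to your module). The module typically parses them with argparse, roughly like this sketch; the sample's actual parser may define additional options:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job-dir', type=str, help='GCS location for exported model artifacts')
parser.add_argument('--train-files', type=str, help='GCS path to the training data')
parser.add_argument('--eval-files', type=str, help='GCS path to the evaluation data')
parser.add_argument('--num-epochs', type=int, default=10)
parser.add_argument('--batch-size', type=int, default=100)
parser.add_argument('--learning-rate', type=float, default=0.001)
args = parser.parse_args()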

Read the guide to training jobs to learn about the configuration flags and how to use them to customize training.

The command returns a message similar to the following:

Job [JOB_NAME] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe JOB_NAME

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs JOB_NAME
jobId: JOB_NAME
state: QUEUED

You can monitor job status with the following command:

gcloud ai-platform jobs describe ${JOB_NAME}

You can stream your job's training logs with the following command:

gcloud ai-platform jobs stream-logs ${JOB_NAME}

When the training job finishes, it saves the trained ML model in a file called model.pth in a timestamped directory within the JOB_DIR Cloud Storage directory that you specified.
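
To inspect the exported model locally, you can copy it out of Cloud Storage and load it with PyTorch. The timestamped path varies per job, so the path below is a placeholder, and this assumes the artifact was written with torch.save:

gsutil cp gs://BUCKET_NAME/JOB_NAME/models/TIMESTAMP/model.pth .

Then, in Python:

import torch

# Depending on how the model was saved, this returns either a full module
# or a state_dict to load into a freshly constructed model.
artifact = torch.load('model.pth', map_location='cpu')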

What's next