Getting Started

Before using Cloud Machine Learning Engine, you should be familiar with machine learning and TensorFlow.

Once you're familiar with machine learning and TensorFlow, either through prior experience or by working through introductory resources, you're ready to get started with this walkthrough.

Overview

This document provides an introductory, end-to-end walkthrough of training and prediction on Cloud Machine Learning Engine. You will walk through a sample that uses a census dataset to:

  • Create a TensorFlow trainer and validate it locally.
  • Run your trainer on a single worker instance in the cloud.
  • Run your trainer as a distributed training job in the cloud.
  • Deploy a model to support prediction.
  • Request an online prediction and see the response.
  • Request a batch prediction.

What you will build

The sample builds a wide and deep model for predicting income category based on the United States Census Income Dataset.

Wide and deep models use deep neural nets (DNNs) to learn high level abstractions about complex features or interactions between such features. These models then combine the outputs from the DNN with a linear regression performed on simpler features. This provides a balance between power and speed that is effective on many structured data problems.

You can read more about wide and deep models in the Google Research Blog post named Wide & Deep Learning: Better Together with TensorFlow.

The sample defines the model using TensorFlow's prebuilt DNNLinearCombinedClassifier class. You need only define the data transformations particular to the dataset, and then assign the (potentially) transformed features to either the DNN or the linear portion of the model.
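
For illustration only, here is a minimal sketch of how a wide and deep estimator is assembled with this class. It assumes TensorFlow 1.4 or later, where the canned estimator and feature columns live under tf.estimator and tf.feature_column (the sample in this walkthrough targets an older runtime, so its module paths differ), and the column names below are examples rather than the exact columns the census trainer uses.

    import tensorflow as tf

    # "Wide" (linear) side: sparse categorical features.
    # Column names and vocabularies here are illustrative.
    occupation = tf.feature_column.categorical_column_with_hash_bucket(
        'occupation', hash_bucket_size=100)
    education = tf.feature_column.categorical_column_with_vocabulary_list(
        'education', ['Bachelors', 'HS-grad', 'Masters', 'Doctorate'])

    # "Deep" (DNN) side: dense features plus embeddings of the sparse ones.
    age = tf.feature_column.numeric_column('age')
    hours_per_week = tf.feature_column.numeric_column('hours_per_week')
    deep_columns = [
        age,
        hours_per_week,
        tf.feature_column.embedding_column(occupation, dimension=8),
        tf.feature_column.embedding_column(education, dimension=8),
    ]

    # Combine both sides into a single wide-and-deep classifier.
    estimator = tf.estimator.DNNLinearCombinedClassifier(
        model_dir='output',
        linear_feature_columns=[occupation, education],
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[100, 70, 50, 25])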

Before you begin

  1. Complete the Quickstart Using the Command-Line to ensure that you have:

    • A GCP account with the Cloud ML Engine and Cloud Storage APIs activated
    • Cloud SDK installed and initialized
    • TensorFlow installed
  2. Download the sample from the Git repository.

macOS

  1. Download and extract the Cloud ML Engine sample zip file.

  2. Open a terminal window and navigate to the directory that contains the extracted cloudml-samples-master directory.

  3. Navigate to the cloudml-samples-master > census > estimator directory. The commands in this walkthrough must be run from the estimator directory.

    cd cloudml-samples-master/census/estimator
    

Cloud Shell

  1. Download the Cloud ML Engine sample zip file.

    wget https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip
    
  2. Unzip the file to extract the cloudml-samples-master directory.

    unzip master.zip
    
  3. Navigate to the cloudml-samples-master > census > estimator directory. The commands in this walkthrough must be run from the estimator directory.

    cd cloudml-samples-master/census/estimator
    

Costs

This walkthrough uses billable components of Google Cloud Platform, including:

  • Cloud Machine Learning Engine for:
    • Training
    • Prediction
  • Google Cloud Storage for:
    • Storing input data for training
    • Staging the trainer package
    • Writing training artifacts
    • Storing input data files for batch prediction

Use the Pricing Calculator to generate a cost estimate based on your projected usage.

Develop and validate your trainer locally

Before you run your trainer in the cloud, get it running locally. Local environments provide an efficient development and validation workflow so that you can iterate quickly. You also won't incur charges for cloud resources when debugging your application.

Get your training data

The Census Income Data Set that this sample uses for training is hosted by the UC Irvine Machine Learning Repository.

The relevant data files, adult.data and adult.test, are hosted in a public Google Cloud Storage bucket. For purposes of this sample, use the versions on Cloud Storage, which have undergone some trivial cleaning, instead of the original source data. See About the data for more information.

The data is stored in comma-separated value format as shown by the following preview of the adult.data file:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
...

You can read the files directly from Cloud Storage or copy them to your local environment. For purposes of this sample, you will download them for local training, and later upload them to your own Cloud Storage bucket for cloud training.

  1. Download the data files to a local data directory.

    mkdir data
    gsutil -m cp gs://cloudml-public/census/data/* data/
    
  2. Set the TRAIN_DATA and EVAL_DATA variables to your local file paths. For example, the following commands set the variables to local paths.

    TRAIN_DATA=$(pwd)/data/adult.data.csv
    EVAL_DATA=$(pwd)/data/adult.test.csv
    

Run a local trainer

A local trainer loads your Python training program and starts a training process in an environment that's similar to that of a live Cloud ML Engine cloud training job.

  1. Specify an output directory and set a MODEL_DIR variable. The following command sets MODEL_DIR to a value of output.

    MODEL_DIR=output
    
  2. It's a good practice to delete the contents of the output directory in case data remains from a previous training run. The following command deletes all data in the output directory.

    rm -rf $MODEL_DIR
    
  3. To run your training locally, run the following command:

    gcloud ml-engine local train \
        --module-name trainer.task \
        --package-path trainer/ \
        -- \
        --train-files $TRAIN_DATA \
        --eval-files $EVAL_DATA \
        --train-steps 1000 \
        --job-dir $MODEL_DIR
    

By default, verbose logging is turned off. You can enable it by setting the --verbosity flag to DEBUG. You'll enable it in a later example command.

To see the evaluation results, you can use TensorBoard, which is available as part of the TensorFlow installation.

Inspect the summary logs using TensorBoard

You can inspect the behavior of your training by launching TensorBoard and pointing it at the summary logs produced during training — both during and after execution.

macOS

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=output
    
  2. Once you start running TensorBoard, you can access it in your browser at http://localhost:6006

Cloud Shell

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=output --port=8080
    
  2. Select "Preview on port 8080" from the Web Preview menu at the top of the command-line.

Click on Accuracy to see graphical representations of how accuracy changes as your job progresses.

TensorBoard accuracy graphs

You can shut down TensorBoard at any time by typing ctrl+c on the command-line.

Run a local trainer in distributed mode

You can also test whether your model works with Cloud ML Engine's distributed execution environment by running a local trainer with the --distributed flag.

  1. Specify an output directory and set the MODEL_DIR variable again. The following command sets MODEL_DIR to a value of output-dist.

    MODEL_DIR=output-dist
    
  2. Remember to delete the contents of the output directory in case data remains from a previous training run. The following command deletes all data in the output directory.

    rm -rf $MODEL_DIR
    
  3. Run the local train command using the --distributed option. Be sure to place the flag above the -- that separates the user arguments from the command-line arguments.

    gcloud ml-engine local train \
        --module-name trainer.task \
        --package-path trainer/ \
        --distributed \
        -- \
        --train-files $TRAIN_DATA \
        --eval-files $EVAL_DATA \
        --train-steps 1000 \
        --job-dir $MODEL_DIR
    

Inspect the output

Output files are written to the directory specified by --job-dir, which was set to output-dist:

ls -R output-dist/

You should see output similar to:

checkpoint
eval
events.out.tfevents.1488577094.<host-name>
export
graph.pbtxt
model.ckpt-1000.data-00000-of-00001
model.ckpt-1000.index
model.ckpt-1000.meta
model.ckpt-2.data-00000-of-00001
model.ckpt-2.index
model.ckpt-2.meta

output-dist//eval:
events.out.tfevents.1488577121.<host-name>
events.out.tfevents.1488577141.<host-name>
events.out.tfevents.1488577169.<host-name>
...

Inspect the logs

Inspect the summary logs using TensorBoard the same way that you did for the single-instance training job, except that you change the --logdir value to match the output directory name that you used for distributed mode.

macOS

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=$MODEL_DIR
    
  2. Once you start running TensorBoard, you can access it in your browser at http://localhost:6006

Cloud Shell

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=$MODEL_DIR --port=8080
    
  2. Select "Preview on port 8080" from the Web Preview menu at the top of the command-line.

Set up your Cloud Storage bucket

The Cloud ML Engine services need to access Cloud Storage locations to read and write data during model training and batch prediction. This section shows you how to create a new bucket. You can use an existing bucket, but if it is not part of the project you are using to run Cloud ML Engine, you must explicitly grant access to the Cloud ML Engine service accounts.
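
For example, one way to grant the service account access to an existing bucket is with the gsutil iam command. This is a sketch only: the service account address below is a placeholder for the actual Cloud ML Engine service account of your project, and gsutil iam requires a reasonably recent Cloud SDK.

    # Placeholder: look up the actual Cloud ML Engine service account for your project.
    SVC_ACCOUNT=service-account@your-project.iam.gserviceaccount.com

    # Grant the account permission to read and write objects in the existing bucket.
    gsutil iam ch serviceAccount:$SVC_ACCOUNT:objectAdmin gs://your-existing-bucket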

Create a Google Cloud Storage bucket for reading and writing data during model training and batch prediction:

  1. Set a name for your new bucket.

    If you want to use the project name with -mlengine appended, get your project name and append -mlengine:

    PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    BUCKET_NAME=${PROJECT_ID}-mlengine

    Otherwise, use whatever name you want, though it must be unique across all buckets in Cloud Storage:

    BUCKET_NAME="your_bucket_name"
  2. Check the bucket name that you set.

    echo $BUCKET_NAME
  3. Select a region for your bucket and set a REGION environment variable.

    Warning: You must specify a region (like us-central1) for your bucket, not a multi-region location (like us). Learn more in the development environment overview.

    For example, the following command sets REGION to us-central1.

    REGION=us-central1
  4. Create the new bucket:

    gsutil mb -l $REGION gs://$BUCKET_NAME

    Note: Use the same region where you plan on running Cloud ML Engine jobs. The example uses us-central1 because that is the region used in the quickstart instructions.

Next, upload the data files to your Cloud Storage bucket.

  1. Use gsutil to copy the two files to your Cloud Storage bucket.

    gsutil cp -r data gs://$BUCKET_NAME/data
    
  2. Set the TRAIN_DATA and EVAL_DATA variables to point to the files.

    TRAIN_DATA=gs://$BUCKET_NAME/data/adult.data.csv
    EVAL_DATA=gs://$BUCKET_NAME/data/adult.test.csv
    
  3. Use gsutil again to copy the JSON test file test.json to your Cloud Storage bucket.

    gsutil cp ../test.json gs://$BUCKET_NAME/data/test.json
    
  4. Set the TEST_JSON variable to point to that file.

    TEST_JSON=gs://$BUCKET_NAME/data/test.json
    

Run a single-instance trainer in the cloud

With a validated trainer that runs in both single-instance and distributed mode, you're now ready to run a trainer in the cloud. You'll start by requesting a single-instance training job.

Use the default BASIC scale tier to run a single-instance trainer. The initial job request can take a few minutes to start, but subsequent jobs run more quickly. This enables quick iteration as you develop and validate your training job.

  1. Select a name for the initial training run that distinguishes it from any subsequent training runs. For example, you can append a number to represent the iteration.

    JOB_NAME=census_single_1
    
  2. Specify a directory for the output generated by Cloud ML Engine by setting an OUTPUT_PATH variable; you'll include it when requesting training and prediction jobs. The OUTPUT_PATH represents the fully qualified Cloud Storage location for model checkpoints, summaries, and exports. You can use the BUCKET_NAME variable that you defined in a previous step.

    It's a good practice to use the job name as the output directory. For example, the following OUTPUT_PATH points to a directory named census_single_1.

    OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
    
  3. Run the following command to submit a training job in the cloud that uses a single process. This time, set the --verbosity flag to DEBUG so that you can inspect the full logging output and retrieve accuracy, loss, and other metrics. The output also contains a number of other warning messages that you can ignore for the purposes of this sample.

    gcloud ml-engine jobs submit training $JOB_NAME \
        --job-dir $OUTPUT_PATH \
        --runtime-version 1.0 \
        --module-name trainer.task \
        --package-path trainer/ \
        --region $REGION \
        -- \
        --train-files $TRAIN_DATA \
        --eval-files $EVAL_DATA \
        --train-steps 1000 \
        --verbosity DEBUG
    

You can monitor the progress of your trainer by watching the command-line output or in ML Engine > Jobs on Google Cloud Platform Console.

Inspect the output

When you train in the cloud, outputs are written to Google Cloud Storage. In this sample, outputs are saved to OUTPUT_PATH; to list them, run:

gsutil ls -r $OUTPUT_PATH

The outputs should be similar to the outputs from training locally (above).

Inspect the Stackdriver logs

Logs are a useful way to understand the behavior of your training code on the cloud. When Cloud ML Engine runs a training job, it captures all stdout and stderr streams and logging statements. These logs are stored in Stackdriver Logging; they are visible both during and after execution.

The easiest way to find the logs for your job is to select your job in ML Engine > Jobs on Cloud Platform Console, and then click "View logs".

If you leave "All logs" selected, you will see all logs from all workers. You can also select a specific task; master-replica-0 will give you an overview of the job's execution from the master's perspective.

Because you selected verbose logging, you can inspect the full logging output. Look for the term accuracy in the logs:

screenshot of Stackdriver logging console for ML Engine jobs
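
You can also stream a job's logs to your terminal while the job runs. A minimal sketch, using the job name you submitted earlier:

    gcloud ml-engine jobs stream-logs $JOB_NAME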

Inspect the summary logs using TensorBoard

You can inspect the behavior of your training by launching TensorBoard and pointing it at the summary logs produced during training — both during and after execution.

Because the training programs write summaries directly to a Cloud Storage location, TensorBoard can read them automatically without your having to copy event files manually.

macOS

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=$OUTPUT_PATH
    
  2. Once you start running TensorBoard, you can access it in your browser at http://localhost:6006

Cloud Shell

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=$OUTPUT_PATH --port=8080
    
  2. Select "Preview on port 8080" from the Web Preview menu at the top of the command-line.

Click on Accuracy to see graphical representations of how accuracy changes as your job progresses.

You can shut down TensorBoard at any time by typing ctrl+c on the command-line.

Run distributed training in the cloud

To take advantage of Google's scalable infrastructure when running training jobs, configure your trainer to run in distributed mode.

No code changes are necessary to run this model as a distributed process in Cloud ML Engine.

To run a distributed job, set --scale-tier to any tier above BASIC. For more information about scale tiers, see the scale tier documentation.

  1. Select a name for your distributed training run that distinguishes it from other training runs. For example, you could use dist to represent distributed and a number to represent the iteration.

    JOB_NAME=census_dist_1
    
  2. Specify OUTPUT_PATH to include the job name so that you don't inadvertently reuse checkpoints between jobs. You might have to redefine BUCKET_NAME if you've started a new command-line session since you last defined it. For example, the following OUTPUT_PATH points to a directory named census_dist_1.

    OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
    
  3. Run the following command to submit a training job in the cloud that uses multiple workers. Note that the job can take a few minutes to start.

    Place --scale-tier above the -- that separates the user arguments from the command-line arguments. For example, the following command uses a scale tier of STANDARD_1:

    gcloud ml-engine jobs submit training $JOB_NAME \
        --job-dir $OUTPUT_PATH \
        --runtime-version 1.0 \
        --module-name trainer.task \
        --package-path trainer/ \
        --region $REGION \
        --scale-tier STANDARD_1 \
        -- \
        --train-files $TRAIN_DATA \
        --eval-files $EVAL_DATA \
        --train-steps 1000 \
        --verbosity DEBUG
    

You can monitor the progress of your job by watching the command-line output or in ML Engine > Jobs on Google Cloud Platform Console.

Inspect the logs

Inspect the Stackdriver logs and summary logs the same way that you did for the single-instance training job.

For Stackdriver logs: select your job in ML Engine > Jobs on Cloud Platform Console, and then click View logs.

For TensorBoard:

macOS

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=$OUTPUT_PATH
    
  2. Once you start running TensorBoard, you can access it in your browser at http://localhost:6006

Cloud Shell

  1. Launch TensorBoard:

    python -m tensorflow.tensorboard --logdir=$OUTPUT_PATH --port=8080
    
  2. Select "Preview on port 8080" from the Web Preview menu at the top of the command-line.

GPUs and Hyperparameter Tuning

Cloud ML Engine offers machines with graphics processing units (GPUs) that you can request to help scale your training job. For more information about GPUs, see Using GPUs.

Cloud ML Engine also offers hyperparameter tuning to help you maximize your model's predictive accuracy. For more information about hyperparameter tuning, see Hyperparameter Tuning Overview.
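
As a rough sketch of how you might request either feature, you can describe the job in a config file and pass it to the training command with the --config flag. The values below are illustrative only; consult the linked documentation for the supported machine types and the full configuration schema.

    # config.yaml (illustrative): request GPU machines with a custom scale tier.
    # Machine types are examples; check the GPU documentation for supported options.
    # The same trainingInput section can also carry a hyperparameters block for tuning.
    trainingInput:
      scaleTier: CUSTOM
      masterType: standard_gpu
      workerType: standard_gpu
      workerCount: 2
      parameterServerType: standard
      parameterServerCount: 1

You would then add --config config.yaml to the gcloud ml-engine jobs submit training command shown earlier.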

Deploy a model to support prediction

  1. Choose a name for your model; this must start with a letter and contain only letters, numbers, and underscores. For example:

    MODEL_NAME=census
    
  2. Create a Cloud ML Engine model:

    gcloud ml-engine models create $MODEL_NAME --regions=$REGION
    
  3. Select the job output to use. The following sample uses the job named census_dist_1.

    OUTPUT_PATH=gs://$BUCKET_NAME/census_dist_1
    
  4. Look up the full path of your exported trained model binaries:

    gsutil ls -r $OUTPUT_PATH/export
    
  5. Find the directory named $OUTPUT_PATH/export/Servo/<timestamp>, copy this directory path (without the : at the end), and set the MODEL_BINARIES environment variable to its value. For example:

    MODEL_BINARIES=gs://$BUCKET_NAME/census_dist_1/export/Servo/1487877383942/

    Where $BUCKET_NAME is your Cloud Storage bucket name, and census_dist_1 is the output directory.

  6. Run the following command to create a version v1:

    gcloud ml-engine versions create v1 \
        --model $MODEL_NAME \
        --origin $MODEL_BINARIES \
        --runtime-version 1.0
    

You can get a list of your models using the models list command.

gcloud ml-engine models list

Send a prediction request to a deployed model

You can now send prediction requests to your model. For example, the following command sends a prediction request using a test.json file that you downloaded as part of the sample GitHub repository.

gcloud ml-engine predict \
    --model $MODEL_NAME \
    --version v1 \
    --json-instances ../test.json

The response includes the predicted labels of the examples.

CLASSES  LOGISTIC                LOGITS                PROBABILITIES
0        [0.003707568161189556]  [-5.593664646148682]  [0.9962924122810364, 0.003707568161189556]

Submit a batch prediction job

The batch prediction service is useful if you have large amounts of data and no low-latency requirements for receiving prediction results. It uses the same input format as online prediction, but requires that the data be stored in Cloud Storage.

  1. Set a name for the job.

    JOB_NAME=census_prediction_1
    
  2. Set the output path.

    OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
    
  3. Submit the prediction job.

    gcloud ml-engine jobs submit prediction $JOB_NAME \
        --model $MODEL_NAME \
        --version v1 \
        --data-format TEXT \
        --region $REGION \
        --input-paths $TEST_JSON \
        --output-path $OUTPUT_PATH/predictions

Unlike the previous commands, this one returns immediately. Check the progress of the job and wait for it to finish:

gcloud ml-engine jobs describe $JOB_NAME

You should see state: SUCCEEDED once the job completes; this may take several minutes.

Alternatively, you can check the progress in ML Engine > Jobs on Cloud Platform Console.

After the job succeeds, you can:

  • Read the output summary.

    gsutil cat $OUTPUT_PATH/predictions/prediction.results-00000-of-00001
    

    You should see output similar to the following.

    {"probabilities": [0.9962924122810364, 0.003707568161189556], "logits": [-5.593664646148682], "classes": 0, "logistic": [0.003707568161189556]}
    

  • List the other files that the job produced using the gsutil ls command.

    gsutil ls -r $OUTPUT_PATH
    

Compared to online prediction, batch prediction:

  • Is slower for this small number of instances (but is more suitable for large numbers of instances).
  • Could return output in a different order than the input. (The numeric index allows each output to be matched to its corresponding input instance; this is not necessary for online prediction, since the outputs are returned in the same order as the original input instances.)

After the predictions are available, the next step is usually to ingest these predictions into a database or data processing pipeline.

In this sample, you deployed the model before running the batch prediction, but it's possible to skip that step by specifying the model binaries URI when you submit the batch prediction job. One advantage of generating predictions from a model before deploying it is that you can evaluate the model's performance on different evaluation datasets to help you decide whether the model meets your criteria for deployment.
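
Here is a sketch of what that looks like, under the assumption that your gcloud version supports the --model-dir flag for batch prediction jobs; it points at the exported model binaries and replaces the --model and --version pair.

    # Job names must be unique within a project, so pick a new one for this run.
    JOB_NAME=census_prediction_from_binaries_1

    gcloud ml-engine jobs submit prediction $JOB_NAME \
        --model-dir $MODEL_BINARIES \
        --data-format TEXT \
        --region $REGION \
        --input-paths $TEST_JSON \
        --output-path gs://$BUCKET_NAME/$JOB_NAME/predictions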

Clean up

If you are done analyzing the output from your training and prediction runs, you can avoid incurring additional charges to your Google Cloud Platform account by deleting the Cloud Storage directories used in this guide:

  1. Open a terminal window (if not already open).

  2. Use the gsutil rm command with the -r flag to delete the directory that contains your most recent job:

    gsutil rm -r gs://$BUCKET_NAME/$JOB_NAME
    

If successful, the command returns a message similar to:

Removing gs://my-awesome-bucket/just-a-folder/cloud-storage.logo.png#1456530077282000...
Removing gs://my-awesome-bucket/...

Repeat the command for any other directories that you created for this sample.

Alternatively, if you have no other data stored in the bucket, you can run the gsutil rm -r command on the bucket itself.

What's next

You've now completed a walkthrough of a Cloud ML Engine sample that uses census data for training and prediction. You validated the trainer locally, ran it in the cloud in both single-instance and distributed mode, deployed the trained model, and used it to get online and batch predictions.

The following resources can help you continue learning about Cloud ML Engine.

About the data

Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://archive.ics.uci.edu/ml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
