Working with Data

You need two things to work with Cloud Machine Learning Engine: data, and a model that works on that data. The Cloud Machine Learning Engine API doesn't provide services for the many tasks involved in preparing data; there are many data-science tools you can use for that work. This topic describes how to take data that you have already prepared and get it to work with Cloud ML Engine.

Understanding Cloud ML Engine's data interactions

Cloud ML Engine helps you accomplish two machine learning tasks: training models and getting predictions from models. The kinds of data interactions you have with the services are different for each.

Data during training

The Cloud ML Engine training service doesn't directly interact with your data. The data involved in training is managed by your trainer and the TensorFlow objects you use to develop it. Even so, conventional design patterns assume that the following data is part of the process.

Training and evaluation data

You need a large number of existing data instances to train a model. This data:

  • Is representative of the data in your problem space.
  • Includes all of the features your model needs to make predictions as well as the target value you want to infer in new instances.
  • Is serialized in a format that TensorFlow can accept, generally CSV or TFRecords (see the sketch after this list).
  • Must be stored in a location that your Google Cloud Platform project can access, typically in a Google Cloud Storage location or in Google BigQuery.
  • Is split into two datasets: one for training the model, and one for evaluating the trained model's accuracy and generalizability.
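
For example, here is a minimal sketch of serializing training rows to TFRecords, assuming TensorFlow 1.x. The feature names, values, and output path are placeholders, not anything Cloud ML Engine requires.

    import tensorflow as tf

    def to_example(pixels, label):
        """Convert one training row into a tf.train.Example proto."""
        return tf.train.Example(features=tf.train.Features(feature={
            'pixels': tf.train.Feature(float_list=tf.train.FloatList(value=pixels)),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))

    rows = [([0.0, 0.5, 1.0], 1), ([0.2, 0.8, 0.1], 0)]  # stand-in training data

    # TFRecordWriter also accepts gs:// paths, so you can write directly to Cloud Storage.
    with tf.python_io.TFRecordWriter('gs://your-bucket/data/train.tfrecord') as writer:
        for pixels, label in rows:
            writer.write(to_example(pixels, label).SerializeToString())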

Trainer package and custom dependencies

To begin training with Cloud ML Engine, your training application must be made into a Python package and staged in a Cloud Storage location. You can also put the package files of any custom dependencies in Cloud Storage and have the training service install them on the training instances before running your trainer.

You can use the gcloud command-line tool to handle this step automatically when you start your training job.

For more information, see the packaging how-to page.
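
As a sketch, a trainer package usually needs nothing more than a standard setup.py at the root of your application directory. The package name and the extra dependency shown here are illustrative, not required by Cloud ML Engine.

    from setuptools import find_packages, setup

    setup(
        name='trainer',
        version='0.1',
        packages=find_packages(),
        # Hypothetical custom dependency to install on the training instances
        # along with the trainer package itself.
        install_requires=['some-custom-library'],
        description='Example Cloud ML Engine training application.',
    )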

Training output files

Most trainers save checkpoints during training and write the trained model to a TensorFlow SavedModel file at the end of the job. You need a Cloud Storage location to save these files to, and your project must have write access to it.

You can use the special job directory option when you start a training job. The training service automatically passes the path you set for the job directory to your trainer as a command-line argument named job_dir. You can parse it along with your application's other arguments and use it in your code. The advantage of using the job directory is that the training service validates it before starting your trainer.

Specify a job directory by including a value with the "jobDir" key in the TrainingInput object you use to configure your job. In the gcloud tool, you add the --job-dir flag to gcloud ml-engine jobs submit training.
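
For illustration, here is a minimal sketch of reading the job directory in trainer code, assuming the path arrives as a --job-dir flag (as it does when you submit with gcloud). The second argument name is made up.

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--job-dir', required=True,
                        help='Cloud Storage path for checkpoints and the exported model.')
    parser.add_argument('--train-files', nargs='+',
                        help='Hypothetical argument pointing at the training data.')
    args, _ = parser.parse_known_args()

    output_path = args.job_dir  # argparse exposes --job-dir as args.job_dir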

Hyperparameter tuning data

When you use hyperparameter tuning, your trainer and the training service interact more closely than at any other point in the training process. The details are described in the hyperparameter tuning concepts page. Here is a brief summary of the data involved (a trainer-side sketch follows the list):

  • The hyperparameters to tune must be defined as command-line arguments for your trainer.
  • The target variable that you are trying to optimize is defined in your trainer and included in the hyperparameter specification object.
  • The results of individual tuning trials are aggregated in the TF_CONFIG environment variable, where both the service and your application can access them.
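
As a rough sketch of the trainer side of this exchange, a tuned hyperparameter arrives as an ordinary command-line flag, and the trial identifier can be read from TF_CONFIG to keep each trial's output separate. The flag name and directory layout here are illustrative assumptions.

    import argparse
    import json
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument('--job-dir', required=True)
    # Hyperparameter that the tuning service varies between trials.
    parser.add_argument('--learning-rate', type=float, default=0.01)
    args, _ = parser.parse_known_args()

    # TF_CONFIG identifies the current trial; use it to write per-trial output.
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    trial = tf_config.get('task', {}).get('trial', '')
    output_dir = os.path.join(args.job_dir, trial) if trial else args.job_dir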

Training logs

You can add logging to your trainer with standard Python libraries (the logging module, for example). All messages sent to stderr are automatically captured in your job's entry in Stackdriver Logging.
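
A minimal sketch, with arbitrary messages:

    import logging

    # The logging module writes to stderr by default, so these messages are
    # captured in the job's Stackdriver Logging entry.
    logging.basicConfig(level=logging.INFO)
    logging.info('Starting training loop')
    logging.warning('Evaluation loss is not improving')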

Data during prediction

Unlike during the training process, Cloud ML Engine has strict requirements for the data you use for prediction. You can find explanations of how to format input instances in the prediction concepts page. Here is a summary of the data involved (a sketch of preparing input follows the list):

  • Input instances (the new data records that you want inferences for) must be formatted differently depending on the type of prediction:

    • Online prediction input is serialized in a JSON string that you send in your request message.
    • Batch prediction input data is stored in one or more files in accessible Cloud Storage locations.
  • The predictions themselves also differ by type:

    • Online prediction results are returned in the response message.
    • Batch prediction results are written in files in a Cloud Storage location that you specify.
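
As an illustration, here is a sketch of preparing both kinds of input; the feature names and file name are placeholders.

    import json

    instances = [
        {'pixels': [0.0, 0.5, 1.0]},
        {'pixels': [0.2, 0.8, 0.1]},
    ]

    # Online prediction: the request body wraps the instances in a JSON object.
    request_body = json.dumps({'instances': instances})

    # Batch prediction: newline-delimited JSON, one instance per line, in a file
    # that you then copy to a Cloud Storage location.
    with open('prediction_input.json', 'w') as f:
        for instance in instances:
            f.write(json.dumps(instance) + '\n')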

Working with Cloud Storage

Most of the data exchange between your model (and model trainer) and Cloud ML Engine happens via Cloud Storage locations that your project has access to.

Here are the places in the process where using Cloud Storage is either required or encouraged:

  • Staging your trainer package and custom dependencies.
  • Storing your training input data.
  • Storing your training output data.
  • Staging your SavedModel to make it into a model version.
  • Storing your batch prediction input files.
  • Storing your batch prediction output.

Setting up your Cloud Storage buckets

When you create a bucket to use with Cloud ML Engine, you should do the following (a sketch of bucket creation follows this list):

  • Assign it to a specific compute region, not to a multi-region location.
  • Use the same region where you run your training jobs.
  • Organize your folder structure to accommodate many iterations of your model.
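
As a sketch, you can create such a bucket with the google-cloud-storage client library (gsutil works just as well); the project ID, bucket name, and region here are placeholders, and the location argument assumes a recent version of the library.

    from google.cloud import storage

    client = storage.Client(project='your-project-id')
    # Create a regional bucket in the same region where you run training jobs.
    bucket = client.create_bucket('your-model-bucket', location='us-central1')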

Using a Cloud Storage bucket from a different project

Cloud ML Engine needs access to read your input files (training code, training data, and prediction data) and to write output files (trained models and batch-prediction results) on Google Cloud Storage. This section describes how to configure Cloud Storage buckets that belong to a different project so that your Cloud ML Engine jobs can access them.

Step 1: Get required information from your cloud project

The steps in this section gather information about your Google Cloud Platform project that you need in order to change access control for your project's Cloud ML Engine service account. Store the values in environment variables for use in later steps.

  1. Get your project identifier by using the gcloud command-line tool with your project selected:

    PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    
  2. Get the access token for your project by using gcloud:

    AUTH_TOKEN=$(gcloud auth print-access-token)
    
  3. Get the service account information by requesting project configuration from the REST service:

    SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
        -H "Authorization: Bearer $AUTH_TOKEN" \
        https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
        | python -c "import json; import sys; response = json.load(sys.stdin); \
        print(response['serviceAccount'])")
    

Step 2: Configure access to your Cloud Storage bucket

Now that you have your project and service account information, you need to update access permissions. These steps use the same variable names used in the previous section.

  1. Set the name of your bucket in an environment variable named BUCKET_NAME:

    BUCKET_NAME="your_bucket_name"
    
  2. Grant the service account read access to new objects added to your Cloud Storage bucket by updating its default object ACL:

    gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET_NAME
    
  3. If your bucket already contains objects that you need to access, you must grant read access to them explicitly:

    gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET_NAME
    
  4. Grant the service account write access to the bucket:

    gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET_NAME
    
