Working with Data

You need two things to work with Cloud Machine Learning Engine: data, and a model that works on that data. The Cloud Machine Learning Engine API doesn't provide services for the many tasks involved in preparing data; use whichever data science tools you prefer for that work. This topic describes how to get data that you have already prepared to work with Cloud ML Engine.

Understanding Cloud ML Engine's data interactions

Cloud ML Engine helps you accomplish two machine learning tasks: training models and getting predictions from models. The kinds of data interactions you have with the services are different for each.

Data during training

The Cloud ML Engine training service doesn't interact with your data directly. Your trainer, and the TensorFlow objects you use to build it, manage the data involved in training. Even so, conventional design patterns assume that the following data is part of the process.

Training and evaluation data

You need a large number of existing data instances to train a model. This data:

  • Is representative of the data in your problem space.
  • Includes all of the features your model needs to make predictions as well as the target value you want to infer in new instances.
  • Is serialized in a format that TensorFlow can accept, generally CSV or TFRecords.
  • Is stored in a location that your Google Cloud Platform project can access, typically Google Cloud Storage or Google BigQuery.
  • Is split into two datasets: one for training the model, and one for evaluating the trained model's accuracy and generalizability.
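
As a minimal sketch of the last point, the following splits a set of instances into training and evaluation datasets and serializes each as CSV. The row layout, the 80/20 ratio, and the helper names `split_rows` and `to_csv` are illustrative assumptions, not part of Cloud ML Engine.

```python
import csv
import io
import random


def split_rows(rows, eval_fraction=0.2, seed=42):
    """Shuffle rows deterministically and split them into (train, eval)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def to_csv(rows):
    """Serialize rows to CSV text, one instance per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical instances: two feature columns followed by the target value.
rows = [[i, i * 2, i % 2] for i in range(10)]
train_rows, eval_rows = split_rows(rows)
train_csv = to_csv(train_rows)
eval_csv = to_csv(eval_rows)
```

You would then upload the two CSV outputs to a location your project can access. A fixed seed keeps the split reproducible between runs.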

Trainer package and custom dependencies

To begin training with Cloud ML Engine, you must package your training application as a Python package and stage it in a Cloud Storage location. You can also put the package files of any custom dependencies on Cloud Storage and configure the job to install them on the training instances before running your trainer.

You can use the gcloud command-line tool to handle this step automatically when you start your training job.

For more information, see the packaging how-to page.

Training output files

Most trainers save checkpoints during training and write the trained model to a TensorFlow SavedModel file at the end of the job. You need a Cloud Storage location to save these files to, and your project must have write access to it.

You can use the special job directory option when you start a training job. The training service automatically passes the path you set for the job directory to your trainer as a command-line argument named job_dir. You can parse it along with your application's other arguments and use it in your code. The advantage of using the job directory is that the training service validates the directory before starting your trainer.
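
A minimal sketch of parsing that argument in a trainer is shown here. It assumes the argument arrives as `--job-dir` (mirroring the gcloud flag) and also accepts `--job_dir` as a precaution; `--num-epochs` and the bucket path are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the job directory along with the trainer's own arguments."""
    parser = argparse.ArgumentParser()
    # Assumption: the flag spelling may be --job-dir or --job_dir, so accept both.
    parser.add_argument(
        "--job-dir",
        "--job_dir",
        dest="job_dir",
        default=None,
        help="Cloud Storage path for checkpoints and the exported model",
    )
    # Hypothetical application-specific argument.
    parser.add_argument("--num-epochs", type=int, default=1)
    # Ignore any arguments this trainer does not recognize.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Example invocation with an explicit, made-up argument list.
args = parse_args(["--job-dir", "gs://my-bucket/my-job", "--num-epochs", "5"])
```

Using `parse_known_args` rather than `parse_args` is a design choice for this sketch: it keeps the trainer from failing on extra arguments it does not define.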

Specify a job directory by including a value with the "jobDir" key in the TrainingInput object you use to configure your job. In the gcloud tool, you add the --job-dir flag to gcloud ml-engine jobs submit training.
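
As a rough illustration, a job configuration that sets the job directory might be assembled like this. The `jobDir` key comes from the text above; the other field names reflect the TrainingInput reference as understood here, and every value (bucket, package, module, region) is a made-up placeholder.

```python
import json


def build_training_job(job_id, job_dir, package_uri, python_module, region):
    """Return a job description whose TrainingInput sets the job directory."""
    return {
        "jobId": job_id,
        "trainingInput": {
            "packageUris": [package_uri],
            "pythonModule": python_module,
            "region": region,
            # The training service validates this path before starting the trainer.
            "jobDir": job_dir,
        },
    }


# All values below are hypothetical placeholders.
job = build_training_job(
    job_id="my_job_001",
    job_dir="gs://my-bucket/my-job",
    package_uri="gs://my-bucket/packages/trainer-0.1.tar.gz",
    python_module="trainer.task",
    region="us-central1",
)
job_json = json.dumps(job, indent=2)
```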

Hyperparameter tuning data

When you use hyperparameter tuning, your trainer and the training service must interact more closely than anywhere else in the training process. The details are described in the hyperparameter tuning concepts page. Here is a brief summary of the data involved:

  • The hyperparameters to tune must be defined as command-line arguments for your trainer.
  • The target variable that you are trying to optimize is defined in your trainer, and included in the hyperparameter specification object.
  • The results of individual tuning trials are aggregated in the TF_CONFIG environment variable where both the service and your application can access them.
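
The first and last points can be sketched as follows: the trainer exposes its hyperparameters as command-line arguments and reads TF_CONFIG from the environment. The hyperparameter names are hypothetical, and the sketch assumes TF_CONFIG holds a JSON string; the `task`/`trial` lookup is an assumption about its layout rather than something this page specifies.

```python
import argparse
import json
import os


def parse_hyperparameters(argv=None):
    """Expose tunable hyperparameters as command-line arguments."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameters; use the names from your own tuning spec.
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--hidden-units", type=int, default=64)
    args, _unknown = parser.parse_known_args(argv)
    return args


def read_tf_config(environ=None):
    """Decode TF_CONFIG, returning an empty dict when it is unset or empty."""
    environ = os.environ if environ is None else environ
    return json.loads(environ.get("TF_CONFIG") or "{}")


hparams = parse_hyperparameters(["--learning-rate", "0.1"])
tf_config = read_tf_config()
# Assumption: the current trial, if any, is recorded under task.trial.
trial_id = tf_config.get("task", {}).get("trial")
```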

Training logs

You can put logging events in your trainer with standard Python libraries (logging, for example). All messages sent to stderr are automatically captured in your job's entry in Stackdriver Logging.
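
A minimal sketch using the standard logging module, with a handler attached explicitly to stderr so that messages follow the path described above. The logger name and message format are arbitrary choices.

```python
import logging
import sys


def get_logger(name="trainer"):
    """Return a logger that writes INFO and above to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        # Attach the handler only once, even if get_logger is called again.
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = get_logger()
logger.info("Starting training at step %d", 0)
```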

Data during prediction

Unlike during the training process, Cloud ML Engine has strict requirements about the data you use for prediction. You can find explanations of the data formatting of input instances in the prediction concepts page. Here is a summary of the data involved:

  • Input instances (the new data records you want inferences for) must be formatted differently depending on the type of prediction:
    • Online prediction input is serialized as a JSON string that you send in your request message.
    • Batch prediction input data is stored in one or more files in Cloud Storage locations that your project can access.
  • The predictions themselves are also delivered differently depending on the type:
    • Online prediction results are returned in the response message.
    • Batch prediction results are written to files in a Cloud Storage location that you specify.
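
The two input styles can be sketched as below. The sketch assumes that an online request wraps the instances in a JSON object under an `instances` key and that a batch input file holds one JSON instance per line; check both against the prediction concepts page. The feature names are hypothetical.

```python
import json


def online_request_body(instances):
    """Serialize instances into a single JSON string for an online request."""
    # Assumption: the request wraps the instances under an "instances" key.
    return json.dumps({"instances": instances})


def batch_input_text(instances):
    """Serialize instances as newline-delimited JSON for a batch input file."""
    # Assumption: batch input files hold one JSON instance per line.
    return "".join(json.dumps(instance) + "\n" for instance in instances)


# Hypothetical instances with made-up feature names.
instances = [
    {"age": 34, "hours_per_week": 40},
    {"age": 51, "hours_per_week": 20},
]
body = online_request_body(instances)
batch_text = batch_input_text(instances)
```

For batch prediction, you would write `batch_text` to a file and upload it to a Cloud Storage location that your project can read.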
