Machine Learning with Structured Data: Training the Model (Part 2)

In this tutorial, you create a wide and deep ML prediction model using TensorFlow's high-level Estimator API. You train the model on Cloud ML Engine using the CSV files that you created in Part 1 of this three-part series, Data Analysis and Preparation.

Architecture

This tutorial uses the components inside the dotted line in the following diagram:

Series architecture with components used in this tutorial highlighted.

Costs

This tutorial uses billable components of Cloud Platform, including:

  • Compute Engine
  • Persistent Disk
  • Cloud Storage
  • Cloud ML Engine

The estimated price to run this part of the tutorial, assuming you use every resource for an entire day, is approximately $1.57, based on the pricing calculator.

Before you begin

You must complete Part 1 of this series, Data Analysis and Preparation, before you begin this part.

Walking through the notebook

You can follow the instructions in the accompanying Cloud Datalab notebook that you downloaded in the previous tutorial to understand the end-to-end process for creating an ML model.

This section provides an overview and additional context for the second part of the notebook.

Creating a TensorFlow model using the Estimator API

In Part 1 of the tutorial you explored the original dataset and chose features relevant to a baby's weight. You also converted the dataset into CSV files using Cloud Dataflow, splitting them into the training set and the evaluation set.

Select an appropriate model

TensorFlow offers several levels of abstraction when building models. The low-level API offers considerable flexibility and power, and it's useful if you're an ML researcher developing new ML techniques. Estimating baby weight, however, is a straightforward, well-defined problem, and the high-level Estimator API is a great choice for implementing an ML solution.

The Estimator API:

  • Comes with prebuilt estimators, or classes, that you can instantiate to solve the most common ML problems.
  • Allows you to train the model on a cluster of machines, rather than limiting you to a single machine. The Estimator API provides a convenient way to implement distributed training.

For the Natality dataset, the notebook uses the tf.estimator.DNNLinearCombinedRegressor estimator. This estimator allows you to create a model that is both wide, using a linear model with sparse features, and deep, using a feed-forward neural network with an embedding layer and several hidden layers.

The following illustration outlines the three major model types: wide, deep, and wide and deep:

Three model types: wide, deep, and wide and deep

A wide model is generally useful for training based on categorical features. In the Natality dataset, the following features work well with a wide model:

  • is_male, plurality

To apply numeric features to the wide model, you must convert them to categorical features by splitting the continuous value into buckets of value ranges (that is, bucketizing them). You apply this technique to the following features in this case:

  • mother_age, gestation_weeks

Feature crossing generates a new feature by concatenating a pair of features, such as (is_male, plurality), and using the concatenation as an input so that the model can learn, for example, that twin boys tend to weigh more than twin girls. You can apply feature crossing directly to categorical features, and to numeric features once they have been discretized into buckets.

A deep model is most appropriate for numeric features. The following features in the baby weight prediction problem match this criterion:

  • mother_age, gestation_weeks

You can also use embedding layers in a deep model. An embedding layer transforms a high-dimensional categorical feature into a lower-dimensional, dense numeric representation.

The drawback is that feature crossing greatly increases the total number of features, which increases the risk of overfitting if supplied to a deep network. You can mitigate this risk by using an embedding layer.

In the Natality dataset, you take the numeric columns, discretize them into buckets, generate additional categorical features through feature crossing, and then apply an embedding layer. This embedding layer is an additional input to the deep model.

In summary, certain data features apply to a wide model, and other features work better with a deep model:

Model | Criteria | Features
Wide | Categorical features and bucketized numeric features | is_male, plurality, mother_age (bucketized), gestation_weeks (bucketized)
Deep | Numeric features | mother_age, gestation_weeks
Deep with embeddings | Categorical features generated through feature crossing | Crossed features from the wide model

The wide and deep model was developed to deal with the problem posed by different kinds of features. It is a combination of the two previous models. You can specify which features are used by which model when you configure the Estimator object.

The following illustration shows the wide and deep feature engineering structure:

A wide and deep feature engineering structure

A function to read the data

The Estimator API requires you to provide a function named input_fn, which returns a batch of examples from the training set with each invocation. You can use the read_dataset function in the notebook to create an input_fn function by specifying the filename pattern of your CSV files.

Internally, the input_fn function uses a filename queue, stored in a variable named filename_queue, as its queueing mechanism. It identifies the files that match the filename pattern and shuffles them before retrieving examples. After reading a batch of rows from a CSV file, tf.decode_csv converts the column values into a list of TensorFlow tensors. The DEFAULTS list identifies the type of each column.

For example, if the first element of DEFAULTS is [0.0], the values in the first column are treated as real numbers. These defaults also fill any empty cells in the CSV files, known as missing values. Finally, the list of tensors is converted to a Python dictionary using the CSV_COLUMNS values as its keys. The input_fn function returns the dictionary of features and the corresponding label values. In this case, the label is the baby's weight.
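
The following is a minimal sketch of what such a read_dataset function might look like, assuming TensorFlow 1.x queue-based input pipelines. The column names, defaults, and batch size shown here are illustrative; the notebook's actual implementation may differ.

import tensorflow as tf

CSV_COLUMNS = ['weight_pounds', 'is_male', 'mother_age',
               'plurality', 'gestation_weeks']
LABEL_COLUMN = 'weight_pounds'
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0]]

def read_dataset(filename_pattern, mode, batch_size=512):
    def input_fn():
        # Queue of matching filenames, shuffled during training.
        filename_queue = tf.train.string_input_producer(
            tf.gfile.Glob(filename_pattern),
            shuffle=(mode == tf.estimator.ModeKeys.TRAIN))
        # Read a batch of CSV rows and parse them using the DEFAULTS types.
        reader = tf.TextLineReader()
        _, rows = reader.read_up_to(filename_queue, num_records=batch_size)
        columns = tf.decode_csv(rows, record_defaults=DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        label = features.pop(LABEL_COLUMN)  # the label is the baby's weight
        return features, label
    return input_fn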

Transformations for values

After reading values from the CSV files you can apply additional transformations to them. For example, the is_male field contains the string values True and False. However, it doesn't make sense to use these strings as direct model inputs because the mathematical model doesn't interpret the meanings of literal expressions like a human would. A common practice is to use one-hot encoding for categorical features. In this case, True and False are mapped to two binary columns, [1, 0] and [0, 1], respectively.

You must also apply the techniques of bucketization, feature crossing, and embedding to the original inputs, as described in the previous section.

TensorFlow provides helper methods called feature columns to automate these transformations, and these methods can be used in conjunction with Estimator objects. For example, in the following code, tf.feature_column.categorical_column_with_vocabulary_list applies the one-hot encoding to is_male. You can use the return value as an input to the Estimator object.

tf.feature_column.categorical_column_with_vocabulary_list('is_male',
    ['True', 'False', 'Unknown'])

In the following code, you can apply tf.feature_column.bucketized_column to bucketize a numeric value column such as mother_age.

age_buckets = tf.feature_column.bucketized_column(mother_age,
    boundaries=np.arange(15,45,1).tolist())

The boundaries option specifies the bucket boundaries. In the preceding example, it creates a one-year bucket for each age from 15 through 43, plus a bucket for "less than 15" and a bucket for "44 and over".

The following code applies feature crossing to all of the wide columns and then applies an embedding layer to the crossed result.

crossed = tf.feature_column.crossed_column(wide, hash_bucket_size=20000)
embed = tf.feature_column.embedding_column(crossed, 3)

The second argument of tf.feature_column.embedding_column defines the dimension of the embedding layer, in this case 3.

Define wide and deep features

You must distinguish between features used for the wide and deep parts. After applying transformations, you define a list of features named wide as an input to the wide part, and a list of features named deep as an input to the deep part. The variables in the following code (is_male, plurality, etc.) are the return values of the feature columns.

wide = [is_male,
        plurality,
        age_buckets,
        gestation_buckets]

deep = [mother_age,
        gestation_weeks,
        embed]
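
For completeness, here is a minimal sketch of the remaining feature columns referenced above but not shown earlier (plurality, mother_age, gestation_weeks, and gestation_buckets). The vocabulary values and bucket boundaries are illustrative assumptions, not necessarily the notebook's exact definitions.

import numpy as np
import tensorflow as tf

# Categorical column for plurality; the vocabulary values are illustrative.
plurality = tf.feature_column.categorical_column_with_vocabulary_list(
    'plurality', ['Single(1)', 'Twins(2)', 'Triplets(3)',
                  'Quadruplets(4)', 'Quintuplets(5)', 'Multiple(2+)'])

# Numeric columns for the deep part of the model.
mother_age = tf.feature_column.numeric_column('mother_age')
gestation_weeks = tf.feature_column.numeric_column('gestation_weeks')

# Bucketized gestation weeks for the wide part; the boundaries are an assumption.
gestation_buckets = tf.feature_column.bucketized_column(
    gestation_weeks, boundaries=np.arange(17, 47, 1).tolist())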

Define input columns

You must define a serving_input_fn function for the Estimator API. This function defines the columns to use as input for API requests to the prediction service. The input columns are generally the same as those in the CSV files, but in some cases you want to accept input data in a different format than the one you used during model training.

In that case, you can use this function to transform the incoming data into the same form that was used during training. Your preprocessing steps for prediction must be identical to those used in training; otherwise, your model won't make accurate predictions.
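
A minimal sketch of such a serving_input_fn is shown below, assuming the prediction service accepts the same four raw input columns used during training and the same imports as the earlier sketches; the notebook's actual function may apply additional transformations.

def serving_input_fn():
    # Placeholders for the raw JSON inputs of an online prediction request.
    feature_placeholders = {
        'is_male': tf.placeholder(tf.string, [None]),
        'plurality': tf.placeholder(tf.string, [None]),
        'mother_age': tf.placeholder(tf.float32, [None]),
        'gestation_weeks': tf.placeholder(tf.float32, [None]),
    }
    # Transform the inputs into the same form used during training; here the
    # values are passed through with an added batch-compatible dimension.
    features = {key: tf.expand_dims(tensor, -1)
                for key, tensor in feature_placeholders.items()}
    return tf.estimator.export.ServingInputReceiver(
        features, feature_placeholders)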

Creating and training the model

The next step is to create a model and train it using the dataset.

Define the model

Defining a model is a simple process using the Estimator API. The following call builds a wide and deep model using wide and deep as inputs for the wide and deep parts, respectively. It returns an Estimator object containing the defined model.

estimator = tf.estimator.DNNLinearCombinedRegressor(
                         model_dir=output_dir,
                         linear_feature_columns=wide,
                         dnn_feature_columns=deep,
                         dnn_hidden_units=[64, 32])

You specify the storage location for various training outputs using the model_dir option. The dnn_hidden_units option defines the structure of the feed-forward neural network in the deep part of the model. In this case, two hidden layers are used, consisting of 64 and 32 nodes.

The number of layers and the number of nodes in each layer are tunable parameters that warrant some experimentation in order to adjust the model complexity to the data complexity. If the model is too complex, it generally suffers from overfitting, where the model learns the characteristics specific to the training set but fails to make good predictions for new data. For a small number of input features (3, in this case), two layers with 64 and 32 nodes is a good empirical starting point.

Run the training job on Cloud Datalab

You execute the training job by calling the tf.estimator.train_and_evaluate function and passing it the Estimator object. If the job runs in a distributed training environment such as Cloud ML Engine, this function assigns training tasks to multiple worker nodes and periodically saves checkpoints in the storage location specified by the model_dir option of the Estimator object. The function also invokes an evaluation loop at defined intervals to calculate metrics that evaluate the performance of the model. Finally, it exports the trained model in the SavedModel format.
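
As an illustration, the call might be wired up as in the following sketch. The file patterns, step counts, and exporter name are assumptions based on the surrounding text rather than the notebook's exact values, and read_dataset and serving_input_fn refer to the sketches shown earlier.

train_spec = tf.estimator.TrainSpec(
    input_fn=read_dataset('train.csv*', mode=tf.estimator.ModeKeys.TRAIN),
    max_steps=1000)

# Export the trained model in the SavedModel format using the serving input function.
exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)

eval_spec = tf.estimator.EvalSpec(
    input_fn=read_dataset('eval.csv*', mode=tf.estimator.ModeKeys.EVAL),
    steps=None,              # evaluate on the full evaluation set
    exporters=exporter,
    throttle_secs=300)       # evaluate at most once every 5 minutes

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)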

The tf.estimator.train_and_evaluate function can resume training from the latest checkpoint. For example, if you execute the training specifying train_steps=10000, the model's parameter values are stored in the checkpoint file after being trained with 10,000 batches. When you execute the training again with train_steps=20000, it restores the model from the checkpoint and starts training from step count 10,001.

If you want to restart training from scratch, you must remove the checkpoint files before starting the training. In the notebook, this is done by removing the babyweight_trained directory in advance.

Because this part of the notebook executes the training on the VM instance that hosts Cloud Datalab, not in a distributed training environment, you should use a small amount of data. The notebook uses a single CSV file for training and evaluation by specifying the file pattern as pattern = "00001-of-" and setting the train_steps value to 1,000.

Training on Cloud ML Engine

When you are confident that your model doesn't have any obvious problems and is ready to be trained with the full dataset, you make a Python package containing the code you developed on the notebook and execute it on Cloud ML Engine.

You use prepackaged code cloned from GitHub and transferred into your Cloud Datalab environment to accomplish this. You can explore the directory structure on GitHub.

Run locally

It is a good practice to test the code locally on your Cloud Datalab instance before submitting a training job to Cloud ML Engine. You can do this using the second cell of the Training on Cloud ML Engine section in the notebook. Because the prepackaged code is a Python package, you can run it just as you would standard Python code from the notebook. However, you must limit the amount of data used for the training by specifying the options --pattern="00001-of-" and --train_steps=1000.

Run on Cloud ML Engine

To train your model using Cloud ML Engine, you must submit a training job using the gcloud tool.

The checkpoint files and an exported model are stored in the Cloud Storage bucket specified by the --output_dir option. You must remove the old checkpoint files if you want to restart a training from scratch. To do this, uncomment the line "#gsutil -m rm -rf $OUTDIR" in the cell.

After you submit the training job to Cloud ML Engine, open the ML Engine page in the Google Cloud Platform Console to find the running job.


Here you can find logs from the training job. To see graphs of various metrics, start TensorBoard by executing the command in the fourth cell of the same section in the notebook. The following illustration shows the average_loss value, which corresponds to the RMSE on the evaluation set during training. Set the Smoothing slider to 0 in TensorBoard to see the actual values.

Chart of average loss.

If the model training is successful, RMSE values typically decay exponentially. The decay reflects the fact that the accuracy of the model improves very quickly in initial iterations, but requires many more iterations to achieve the best possible model performance.

When you finish using TensorBoard, stop it by following the instructions in the notebook.

Deploying the trained model

When the training job finishes successfully, the trained model is exported to the Cloud Storage bucket.

The directory path containing the model output looks like $OUTDIR/export/exporter/1492051542987/, where $OUTDIR is the storage path specified by the --output_dir option in the previous step. The last part is a timestamp and is different for each job. You can find the actual path from the training log located near the end of the job, as in this example:

SavedModel written to: gs://cloud-training-demos-ml/babyweight/trained_model/export/exporter/1492051542987/saved_model.pb

Alternatively, you can check the directory contents under $OUTDIR using the gsutil command as in the notebook. By specifying this directory path as an option for the gcloud command, you can deploy the model on Cloud ML Engine to provide a prediction API service.

Following the instructions in the notebook, you create a model resource for the prediction service and deploy the trained wide and deep model as a model version. The version name is arbitrary, and you can deploy multiple versions simultaneously. You can specify which version to use for predictions in API requests, which you do in the next section. You can also define a default version that is used when no version is specified in the request.

After you deploy a model to the prediction service, you can use the ML Engine page in the GCP Console to see a list of defined models and associated versions:


Using the model to generate predictions

In the notebook, you can use Google API Client Libraries for Python to send requests to the prediction API service you deployed in the previous section.

The request contains JSON data corresponding to the dictionary elements defined by the serving_input_fn function in the training code. If the JSON data contains multiple records, the API service returns predictions for each of them. You can follow the example in the notebook to understand more about how to use the client library.
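
For example, a request might be built as in the following sketch. The project, model, and version names, as well as the example instance values, are placeholders rather than values from the notebook.

from oauth2client.client import GoogleCredentials
from googleapiclient import discovery

# Build a client for the Cloud ML Engine v1 API using application default credentials.
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials)

# Each instance matches the columns defined by serving_input_fn.
request_data = {'instances': [
    {'is_male': 'True',
     'mother_age': 26.0,
     'plurality': 'Single(1)',
     'gestation_weeks': 39}
]}

parent = 'projects/{}/models/{}/versions/{}'.format(
    'your-project-id', 'babyweight', 'v1')  # placeholder names
response = api.projects().predict(body=request_data, name=parent).execute()
print(response)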

When you use the client library outside the project, such as on an external web server, you must authenticate using API keys or OAuth 2.0.

Cleaning up

If you plan to continue to Part 3 of this series, keep the resources that you created in this part. Otherwise, to avoid continued charges, go to the Projects page in the Google Cloud Platform Console, select the project you created for this tutorial, and delete it.

Next steps

  • Continue to Part 3 of this series.