In this tutorial, you create a wide and deep ML prediction model using TensorFlow's high-level Estimator API. You train the model on Cloud ML Engine using the CSV files that you created in Part 1 of this three-part series, Data Analysis and Preparation.
This tutorial uses the components inside the dotted line in the following diagram:
This tutorial uses billable components of Cloud Platform, including:
- Compute Engine
- Persistent Disk
- Cloud Storage
- Cloud ML Engine
The estimated price to run this part of the tutorial, assuming you use every resource for an entire day, is approximately $1.57, based on this pricing calculator.
Before you begin
You must complete Part 1 of this series, Data Analysis and Preparation, before you begin this part.
Walking through the notebook
You can follow the instructions in the accompanying Cloud Datalab notebook that you downloaded in the previous tutorial to understand the end-to-end process for creating an ML model.
This section provides an overview and additional context for the second part of the notebook.
Creating a TensorFlow model using the Estimator API
In Part 1 of the tutorial you explored the original dataset and chose features relevant to a baby's weight. You also converted the dataset into CSV files using Cloud Dataflow, splitting them into the training set and the evaluation set.
Select an appropriate model
TensorFlow offers several levels of abstraction when building models. The low-level API offers considerable flexibility and power, and it's useful if you're an ML researcher developing new ML techniques. Estimating baby weight, however, is a straightforward, well-defined problem, and the high-level Estimator API is a great choice for implementing an ML solution.
The Estimator API:
- Comes with prebuilt estimators, or classes, that you can instantiate to solve the most common ML problems.
- Allows you to train the model on a cluster of machines, rather than limiting you to a single machine. The Estimator API provides a convenient way to implement distributed training.
For the Natality dataset, the notebook uses the tf.estimator.DNNLinearCombinedRegressor regressor. This regressor allows you to create a model that is both wide, using a linear model that has sparse features, and deep, using a feed-forward neural network that has an embedding layer and several hidden layers.
The following illustration outlines the three major model types: wide, deep, and wide and deep:
A wide model is generally useful for training based on categorical features. In the Natality dataset, the following features work well with a wide model:
- is_male
- plurality
To apply numeric features to the wide model, you must convert them to categorical features by splitting the continuous value into buckets of value ranges (that is, bucketizing them). You apply this technique to the following features in this case:
- mother_age
- gestation_weeks
Feature crossing generates a new feature by concatenating the pair of features (is_male, plurality) and using this concatenation as an input so that the model learns that, for example, twin boys tend to have a higher weight than twin girls. You can apply feature crossing directly to categorical features, and you can apply it to numeric features when the numeric features are discretized into buckets.
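The idea of crossing two categorical values can be sketched in plain Python. This is only an illustration of the concept, not TensorFlow's implementation: the function name `cross_bucket`, the `_X_` separator, the CRC32 hash, and the sample plurality value are all assumptions made for this sketch.

```python
import zlib

def cross_bucket(is_male, plurality, hash_bucket_size=20000):
    """Map an (is_male, plurality) pair to one of hash_bucket_size bucket IDs."""
    # Concatenate the two categorical values into a single crossed key.
    crossed_key = "{}_X_{}".format(is_male, plurality)
    # CRC32 is a deterministic stand-in hash chosen for this sketch only.
    return zlib.crc32(crossed_key.encode("utf-8")) % hash_bucket_size

# Hypothetical category strings, used only to show the call.
twin_boy = cross_bucket("True", "Twins(2)")
twin_girl = cross_bucket("False", "Twins(2)")
print(twin_boy, twin_girl)
```

Each distinct pair gets its own key, so a linear model can learn a separate weight for each combination, subject to occasional hash collisions.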
A deep model is most appropriate for numeric features. The following features in the baby weight prediction problem match this criterion:
- mother_age
- gestation_weeks
You can implement embedding layers in a deep model, too. An embedding layer transforms a large number of categorical features into a lower-dimensional numeric feature.
The drawback is that feature crossing greatly increases the total number of features, which increases the risk of overfitting if supplied to a deep network. You can mitigate this risk by using an embedding layer.
In the Natality dataset, you take the numeric columns, discretize them into buckets, generate additional categorical features through feature crossing, and then apply an embedding layer. This embedding layer is an additional input to the deep model.
In summary, certain data features apply to a wide model, and other features work better with a deep model:
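|Model|Feature type|Features|
|---|---|---|
|Wide|Categorical features, and bucketized numeric features|is_male, plurality, and the bucketized mother_age and gestation_weeks|
|Deep|Numeric features|mother_age, gestation_weeks|
|Deep with embeddings|Categorical features generated through feature crossing|Crossed features in the Wide model|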
The wide and deep model was developed to deal with the problem posed by different kinds of features. It is a combination of the two previous models. You can specify which features are used by which model when you configure the estimator.
Here is a wide and deep engineering structure:
A function to read the data
The Estimator API requires you to provide a function named input_fn, which returns a batch of examples from the training set with each invocation. You can use the read_dataset function in the notebook to create an input_fn function by specifying the filename pattern of your CSV files.
The input_fn function uses a filename queue object, stored in filename_queue, as a queueing mechanism. It identifies files that match the filename pattern and shuffles them before retrieving examples. After reading a batch of rows from a CSV file, tf.decode_csv converts column values into a list of TensorFlow constant objects. The DEFAULTS list is used to identify the value types. For example, if the first element of DEFAULTS is [0.0], the values in the first column are treated as real numbers. The input_fn function also provides default values to fill in empty cells in CSV files; these empty cells are called missing values. Finally, the list is converted to a Python dictionary using the CSV_COLUMNS values as its keys. The input_fn function returns the dictionary of features and the corresponding label values. In this case, the label is a baby's weight.
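The parsing steps described above (typed defaults, filling missing values, and building a dictionary keyed by column name) can be sketched without TensorFlow. The column names, default values, and sample rows below are assumptions made up for illustration; the notebook's actual CSV_COLUMNS and DEFAULTS may differ.

```python
import csv
import io

# Assumed column layout for illustration; the label column comes first.
CSV_COLUMNS = ["weight_pounds", "is_male", "mother_age", "plurality", "gestation_weeks"]
LABEL_COLUMN = "weight_pounds"
# One default per column: its type sets the column type, its value fills empty cells.
DEFAULTS = [[0.0], ["null"], [0.0], ["null"], [0.0]]

def parse_row(row):
    """Convert one CSV row (a list of strings) into (features dict, label)."""
    values = []
    for cell, (default,) in zip(row, DEFAULTS):
        if cell == "":
            values.append(default)              # missing value: use the default
        else:
            values.append(type(default)(cell))  # cast to the default's type
    features = dict(zip(CSV_COLUMNS, values))
    label = features.pop(LABEL_COLUMN)
    return features, label

# Two made-up rows; the second has an empty mother_age cell.
sample = io.StringIO("7.5,True,28,Single(1),39\n6.1,False,,Twins(2),36\n")
for row in csv.reader(sample):
    print(parse_row(row))
```

The real input_fn does the same work on batches of tensors rather than on single Python rows.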
Transformations for values
After reading values from the CSV files, you can apply additional transformations to them. For example, the is_male field contains the string values True and False. However, it doesn't make sense to use these strings as direct model inputs because the mathematical model doesn't interpret the meanings of literal expressions like a human would. A common practice is to use one-hot encoding for categorical features. In this case, True and False are mapped to two binary vectors, [1, 0] and [0, 1], respectively.
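As a minimal sketch of one-hot encoding in plain Python (the helper name `one_hot` and the two-entry vocabulary are assumptions for this illustration):

```python
def one_hot(value, vocabulary):
    """Return a list with a single 1 at the position of value in vocabulary."""
    return [1 if value == entry else 0 for entry in vocabulary]

VOCABULARY = ["True", "False"]
print(one_hot("True", VOCABULARY))   # [1, 0]
print(one_hot("False", VOCABULARY))  # [0, 1]
```

A value that is not in the vocabulary produces an all-zero list in this sketch.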
You must also apply the techniques of bucketization, feature crossing, and embedding to the original inputs, as described in the previous section.
TensorFlow provides helper methods called feature columns to automate these transformations, and these methods can be used in conjunction with the Estimator API. For example, in the following code, tf.feature_column.categorical_column_with_vocabulary_list applies the one-hot encoding to is_male. You can use the return value as an input to the Estimator object.
tf.feature_column.categorical_column_with_vocabulary_list('is_male', ['True', 'False', 'Unknown'])
In the following code, you apply tf.feature_column.bucketized_column to bucketize a numeric value column such as mother_age.
age_buckets = tf.feature_column.bucketized_column(mother_age, boundaries=np.arange(15,45,1).tolist())
The boundaries option specifies the bucket boundaries. In the preceding example, it creates one bucket for each age from 15 to 43, in addition to "less than 15" and "44 and over".
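To see which bucket a given age falls into, the following sketch emulates bucketization with the standard library. It assumes each bucket includes its left boundary and excludes its right one, which is the documented behavior of bucketized_column; the helper name `bucket_index` is hypothetical.

```python
import bisect

# Same values as np.arange(15, 45, 1).tolist(): 15, 16, ..., 44.
boundaries = list(range(15, 45))

def bucket_index(value, boundaries):
    """Return the bucket ID for value; each bucket includes its left boundary."""
    return bisect.bisect_right(boundaries, value)

print(len(boundaries) + 1)            # total number of buckets
print(bucket_index(14, boundaries))   # below the first boundary
print(bucket_index(15, boundaries))   # first one-year bucket
print(bucket_index(44, boundaries))   # last bucket
print(bucket_index(50, boundaries))   # also the last bucket
```

With 30 boundaries there are 31 buckets, and under this rule every age at or above the last boundary shares the final bucket.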
The following code applies feature crossing to all of the wide columns, and then applies an embedding layer to the crossed feature.
crossed = tf.feature_column.crossed_column(wide, hash_bucket_size=20000)
embed = tf.feature_column.embedding_column(crossed, 3)
The second argument of tf.feature_column.embedding_column defines the dimension of the embedding layer, in this case 3.
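Conceptually, an embedding is a lookup table that maps each hash bucket ID to a short numeric vector. The sketch below is a plain-Python stand-in, not TensorFlow's implementation; the random initialization range, the seed, and the helper name `embed_bucket` are assumptions for illustration.

```python
import random

HASH_BUCKET_SIZE = 20000   # matches hash_bucket_size in the code above
EMBEDDING_DIMENSION = 3    # matches the second argument of embedding_column

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# One small vector per hash bucket; training would adjust these values.
embedding_table = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIMENSION)]
    for _ in range(HASH_BUCKET_SIZE)
]

def embed_bucket(bucket_id):
    """Look up the dense vector for a crossed-feature bucket ID."""
    return embedding_table[bucket_id]

print(embed_bucket(1234))
```

The deep part of the model then receives three numbers per example instead of a 20,000-wide sparse input.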
Define wide and deep features
You must distinguish between features used for the wide and deep parts. After applying transformations, you define a list of features named wide as an input to the wide part, and a list of features named deep as an input to the deep part. The variables in the following code (is_male, plurality, etc.) are the return values of the feature columns.
wide = [is_male, plurality, age_buckets, gestation_buckets]
deep = [mother_age, gestation_weeks, embed]
Define input columns
You must define a serving_input_fn function for the Estimator API. This function defines the columns to use as input for API requests to a prediction service. The input columns are generally the same as those in the CSV files, but in some cases, you want to accept input data that is in a different format than what you used during model training.
By using this method, you can transform the input data into the same form that was used during training. Your preprocessing steps for predictions must be identical to what you used in training, or your model will not make accurate predictions.
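One way to picture this requirement is a single conversion step that brings every request record into the form used during training. The sketch below is a plain-Python stand-in and not the serving_input_fn itself; the function name `to_training_form`, the field names, and the sample record are assumptions for illustration.

```python
def to_training_form(record):
    """Convert one prediction request record to the form used in training."""
    return {
        "is_male": str(record["is_male"]),                  # bool or str -> 'True'/'False'
        "mother_age": float(record["mother_age"]),          # int or str -> float
        "plurality": str(record["plurality"]),
        "gestation_weeks": float(record["gestation_weeks"]),
    }

# A hypothetical request whose types differ from the training CSV values.
request_record = {
    "is_male": True,
    "mother_age": 26,
    "plurality": "Single(1)",
    "gestation_weeks": "39",
}
print(to_training_form(request_record))
```

Keeping this conversion in one place makes it harder for the training and serving paths to drift apart.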
Creating and training the model
The next step is to create a model and train it using the dataset.
Define the model
Defining a model is a simple process using the Estimator API. The following single line of code builds a wide and deep model using wide and deep as inputs for the wide and deep parts, respectively. It returns an Estimator object containing the defined model.
estimator = tf.estimator.DNNLinearCombinedRegressor( model_dir=output_dir, linear_feature_columns=wide, dnn_feature_columns=deep, dnn_hidden_units=[64, 32])
You specify the storage location for various training outputs using the model_dir option. The dnn_hidden_units option defines the structure of the feed-forward neural network in the deep part of the model. In this case, two hidden layers are used, consisting of 64 and 32 nodes.
The number of layers and the number of nodes in each layer are tunable parameters that warrant some experimentation in order to adjust the model complexity to the data complexity. If the model is too complex, it generally suffers from overfitting, where the model learns the characteristics specific to the training set but fails to make good predictions for new data. For a small number of input features (3, in this case), two layers with 64 and 32 nodes is a good empirical starting point.
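To get a rough sense of how layer sizes affect model complexity, you can count the weights and biases in the fully connected layers. The sketch below covers only the deep part's dense layers, ignoring the embedding table and the wide part. It assumes an input width of 5 (one value each for mother_age and gestation_weeks, plus a 3-dimensional embedding); the helper name is hypothetical.

```python
def dense_parameter_count(input_width, hidden_units, output_width=1):
    """Count weights and biases in a fully connected feed-forward network."""
    total = 0
    previous = input_width
    for width in list(hidden_units) + [output_width]:
        total += previous * width + width  # weights plus one bias per node
        previous = width
    return total

# Assumed input width: 1 (mother_age) + 1 (gestation_weeks) + 3 (embedding).
print(dense_parameter_count(5, [64, 32]))    # 2497
print(dense_parameter_count(5, [128, 64]))   # 9089
```

Doubling both layer widths more than triples the parameter count, which is one reason to start small and grow the network only if the evaluation metric calls for it.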
Run the training job on Cloud Datalab
You execute the training job by calling the tf.estimator.train_and_evaluate function and specifying the Estimator object. If this job is running on a distributed training environment such as Cloud ML Engine, this function assigns training tasks to multiple worker nodes and periodically saves checkpoints in the storage location specified by the model_dir option of the Estimator object. The function also invokes an evaluation loop at defined intervals to calculate a metric to evaluate the performance of the model. Finally, the function exports the trained model in the SavedModel format.
The tf.estimator.train_and_evaluate function can resume training from the latest checkpoint. For example, if you execute the training specifying train_steps=10000, the model's parameter values are stored in the checkpoint file after being trained with 10,000 batches. When you execute the training again with train_steps=20000, it restores the model from the checkpoint and starts training from step count 10,001.
If you want to restart training from scratch, you must remove the checkpoint files before starting the training. In the notebook, this is done by removing the model output directory in advance.
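The resume behavior amounts to simple arithmetic if train_steps is treated as a total step count rather than an increment, which is how the example above describes it. The helper name `remaining_steps` is hypothetical.

```python
def remaining_steps(train_steps, checkpoint_step):
    """Steps still to run when train_steps is a total, not an increment."""
    return max(0, train_steps - checkpoint_step)

print(remaining_steps(20000, 10000))  # 10000 more steps after resuming
print(remaining_steps(10000, 10000))  # 0: nothing left to train
print(remaining_steps(10000, 0))      # 10000: fresh start, no checkpoint
```

The second case shows why a rerun with the same train_steps value appears to do nothing until the old checkpoints are removed.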
Because in this part of the notebook the training is executed on the VM instance hosting Cloud Datalab and not on the distributed training environment, you should train with only a small amount of data. The notebook uses a single CSV file for training and evaluation by specifying the file pattern as "00001-of-" and setting the train_steps value to 1,000.
Training on Cloud ML Engine
When you are confident that your model doesn't have any obvious problems and is ready to be trained with the full dataset, you make a Python package containing the code you developed on the notebook and execute it on Cloud ML Engine.
You use prepackaged code cloned from GitHub and transferred into your Cloud Datalab environment to accomplish this. You can explore the directory structure on GitHub.
It is a good practice to test the code locally on your Cloud Datalab instance before submitting a training job to Cloud ML Engine. You can do this using the second cell of the Training on Cloud ML Engine section in the notebook. Because the prepackaged code is a Python package, you can run it just as you would standard Python code from the notebook. However, you must limit the amount of data used for the training by specifying the corresponding options, as shown in the notebook.
Run on Cloud ML Engine
To train your model using Cloud ML Engine, you must submit a training job using the gcloud command. The checkpoint files and an exported model are stored in the Cloud Storage bucket specified by the --output_dir option. You must remove the old checkpoint files if you want to restart training from scratch. To do this, uncomment the line "#gsutil -m rm -rf $OUTDIR" in the cell.
After you submit the training job to Cloud ML Engine, open the ML Engine page in the Google Cloud Platform Console to find the running job.
Here you find logs from the training job. To see graphs for various metrics, you can start TensorBoard by executing the command in the fourth cell of the same section in the notebook.
The following illustration shows the value of average_loss, which corresponds to RMSE values generated by the evaluation set during training. Set the smoothing slider to 0 in TensorBoard to see the actual changes.
If the model training is successful, RMSE values typically decay exponentially. The decay reflects the fact that the accuracy of the model improves very quickly in initial iterations, but requires many more iterations to achieve the best possible model performance.
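For reference, RMSE (root mean squared error) can be computed from predictions and labels as in the following sketch. The sample weights are made-up values in pounds, used for illustration only.

```python
import math

def rmse(predictions, labels):
    """Root mean squared error between predicted and actual values."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predicted and actual baby weights, in pounds.
print(rmse([7.0, 6.5, 8.0], [7.5, 6.0, 8.0]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.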
When you finish using TensorBoard, stop it by following the instructions in the notebook.
Deploying the trained model
When the training job finishes successfully, the trained model is exported to the Cloud Storage bucket.
The model output is stored in a directory under $OUTDIR, where $OUTDIR is the storage path specified by the --output_dir option in the previous step. The last part of the directory path is a timestamp and is different for each job. You can find the actual path near the end of the training log, as in this example:
SavedModel written to: gs://cloud-training-demos-ml/babyweight/trained_model/export/exporter/1492051542987/saved_model.pb
Alternatively, you can check the directory contents under $OUTDIR using the gsutil command as in the notebook. By specifying this directory path as an option for the gcloud command, you can deploy the model on Cloud ML Engine to provide a prediction API service.
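If you take the path from the training log, the directory to deploy is the one that contains saved_model.pb. The following sketch extracts it from the log line shown above; the helper name `model_dir_from_log` is hypothetical, and the log format is assumed to match that example.

```python
import posixpath

log_line = ("SavedModel written to: gs://cloud-training-demos-ml/babyweight/"
            "trained_model/export/exporter/1492051542987/saved_model.pb")

def model_dir_from_log(line):
    """Return the directory that contains saved_model.pb."""
    path = line.split("SavedModel written to:", 1)[1].strip()
    return posixpath.dirname(path)

model_dir = model_dir_from_log(log_line)
print(model_dir)
print(posixpath.basename(model_dir))  # the timestamp part of the path
```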
Following the instructions in the notebook, you define a wide and deep model for the prediction service and deploy the model associated with a model version. The version name is arbitrary, and you can deploy multiple versions simultaneously. You can specify the version to use for predictions in API requests, which you will do in the next section. You can also define a default version to be used when the version is not specified in the request.
After you deploy a model to the prediction service, you can use the ML Engine page in the GCP Console to see a list of defined models and associated versions:
Using the model to generate predictions
In the notebook, you can use Google API Client Libraries for Python to send requests to the prediction API service you deployed in the previous section.
The request contains JSON data corresponding to the dictionary elements defined in the serving_input_fn function in the training code. If the JSON data contains multiple records, the API service returns predictions for each of them. You can follow the example in the notebook to understand more about how to use the client library.
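As a rough sketch of what such a request body might look like, the following code builds JSON with two records under an "instances" key, which is the top-level key used for online prediction requests. The field names and values are assumptions for illustration and must match whatever your serving_input_fn actually defines.

```python
import json

# Field names and values are assumptions for illustration.
instances = [
    {"is_male": "True", "mother_age": 26.0, "plurality": "Single(1)", "gestation_weeks": 39},
    {"is_male": "False", "mother_age": 29.0, "plurality": "Twins(2)", "gestation_weeks": 37},
]
request_body = json.dumps({"instances": instances}, indent=2)
print(request_body)
```

The service would return one prediction per record, in the same order as the instances list.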
When you use the client library outside the project, such as on an external web server, you must authenticate using API keys or OAuth 2.0.
If you plan to continue to Part 3 of this series, keep the resources you created in this step. Otherwise, to avoid continued charges, go to the Google Developers Console Project List, choose the project you created for this tutorial, and delete it.
- Continue to Part 3, Deploying a Web Application, to deploy a web application running on App Engine to make online predictions.