Deploying Models

You can host your trained models in the cloud with Cloud ML Engine and use them to serve predictions. This page explains how to deploy your model. For general background information on the prediction process, see the prediction overview page.

Whether you train your models in the cloud or train elsewhere, you can deploy TensorFlow models to Cloud ML Engine and use them to serve predictions.

Before you begin

After you have trained your model, make the following adjustments before deploying it to Cloud ML Engine for predictions:

  • Export your model as a SavedModel.

  • Ensure that the file size of your SavedModel is under the Cloud ML Engine default limit of 250 MB by exporting a graph specifically for prediction.

  • Store your SavedModel in Cloud Storage.

Understanding training graphs and serving graphs

When you have trained your model and exported it as a SavedModel, there are some important steps to take before you are ready to get predictions.

There are some key differences between a training graph and a serving graph. Training graphs contain features that are not appropriate for serving, such as:

  • file readers
  • input queues
  • dropout layers
  • loss functions
  • optimizers

Because the process of serving predictions has different needs than the process of training, it is a best practice to export a separate graph specifically for serving predictions.
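
For example, with the Estimator API a single model_fn can build either graph. The following sketch is illustrative only (the feature name, layer sizes, and n_classes parameter are assumptions, not taken from any sample); it adds dropout, the loss, and the optimizer only in the modes that need them, so the graph exported for prediction contains just the forward pass:

    import tensorflow as tf

    def model_fn(features, labels, mode, params):
        """Illustrative model_fn: training-only ops are added only when needed."""
        net = tf.layers.dense(features['x'], 64, activation=tf.nn.relu)

        # Dropout belongs in the training graph only.
        if mode == tf.estimator.ModeKeys.TRAIN:
            net = tf.layers.dropout(net, rate=0.4, training=True)
        logits = tf.layers.dense(net, params['n_classes'])

        if mode == tf.estimator.ModeKeys.PREDICT:
            # The serving graph needs only the forward pass and its outputs.
            return tf.estimator.EstimatorSpec(mode, predictions={'logits': logits})

        # The loss is needed for training and evaluation; the optimizer for training.
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        if mode == tf.estimator.ModeKeys.EVAL:
            return tf.estimator.EstimatorSpec(mode, loss=loss)
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)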

Understanding the SavedModel

A SavedModel is TensorFlow's recommended format for saving models, and it is the required format for deploying trained TensorFlow models on Cloud ML Engine. Exporting your trained model as a SavedModel saves your training graph with its assets, variables and metadata in a format that Cloud ML Engine can consume and restore for predictions.

After exporting a SavedModel, you have a SavedModel directory that contains the following:

  • your training graph(s), saved in SavedModel protocol buffers
  • external files, called assets
  • variables, which are saved as checkpoint files

When you deploy your SavedModel to Cloud ML Engine, you must include the entire SavedModel directory, not just the SavedModel protocol buffer file that contains your graph and its metadata. This file usually has an extension of either .pb or .pbtxt.
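
For reference, an exported SavedModel directory typically looks similar to the following (exact file names can vary by TensorFlow version):

    export_dir/
        saved_model.pb          # SavedModel protocol buffer: graph(s) and metadata
        assets/                 # external files, such as vocabulary files
        variables/
            variables.data-00000-of-00001
            variables.index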

The SavedModel allows you to save multiple versions of a graph that share the same assets and variables (or checkpoints). For example, you may want to develop two versions of the same graph: one to run on CPUs, and another to run on GPUs.

Learn more about the structure of a SavedModel directory.

Check and adjust model size

Your SavedModel must be 250 MB or smaller to deploy with Cloud ML Engine. This includes all the assets and variables in your SavedModel directory, not just the SavedModel protocol buffer file itself (that is, saved_model.pb or saved_model.pbtxt).

To check your model size during development, export a SavedModel and check the file size of the directory.
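
For example, the following minimal sketch sums the file sizes under a locally exported SavedModel directory (the export_dir path is an assumption):

    import os

    def saved_model_size_mb(export_dir):
        """Return the total size, in MB, of all files under a SavedModel directory."""
        total_bytes = 0
        for root, _, files in os.walk(export_dir):
            for name in files:
                total_bytes += os.path.getsize(os.path.join(root, name))
        return total_bytes / (1024.0 * 1024.0)

    print('SavedModel size: %.1f MB' % saved_model_size_mb('export_dir'))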

If your SavedModel exceeds the 250 MB limit, either reduce your model size as described below or request a quota increase.

Reduce your model size

Reducing your model size involves:

  • building a new graph using only what is needed for prediction
  • reducing the precision of weights and data

Following these steps can bring the SavedModel under the 250 MB limit and decrease the latency of predictions. The benefits include better performance and not having to request and wait for a quota increase.

Build an optimal prediction graph

Training produces multiple checkpoints that are not used for serving predictions. Be sure to upload a directory free from those artifacts, containing only the model to be deployed.

For example, if you export summaries during the training process for visualization in TensorBoard, you will want to be sure they are not included in your SavedModel. These TensorBoard summaries are not necessary for a prediction graph.

When using an Estimator for training and prediction, you can use the export_savedmodel function to get a SavedModel. Otherwise, you can use the SavedModelBuilder to export a model. See the MNIST example, which shows how to export a model using the SavedModelBuilder with the correct tags and signatures.

Reduce precision to decrease file size

Reducing the precision of variables and input data is a tradeoff: it reduces your model size significantly, at some cost in prediction accuracy. High-precision data is stored less efficiently than low-precision data. Although low-precision data is a source of noise, a neural network may "disregard" this noise and still produce fairly accurate predictions.

If using the following methods results in too large a loss of prediction accuracy for your use case, try requesting a quota increase instead.

  • Shrink the file size by reducing the size of weights, which default to floating-point numbers that are difficult to store efficiently. These inefficiently stored weights are the largest contributor to the overall file size of the model.

  • Quantize your continuous data in order to reduce the size of your model by up to 75% without sacrificing a significant amount of accuracy.

  • Use less precise variables. For example, change the data type (dtype) from int64 to int32.

  • Reduce the size of other input features in the assets folder of your SavedModel directory. For example, use smaller vocabulary sizes for text data.

Tools to inspect SavedModels and graphs

TensorFlow provides a command-line interface that you can use to sanity-check aspects of your SavedModel, such as input formatting and SignatureDefs. Learn more about the SavedModel CLI.
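
If you prefer to inspect a SavedModel from Python rather than from the command line, a small sketch such as the following (the export_dir path is an assumption) loads the serving graph and prints its signatures:

    import tensorflow as tf

    with tf.Session(graph=tf.Graph()) as sess:
        # Load the graph tagged for serving and list its SignatureDefs.
        meta_graph = tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.SERVING], 'export_dir')
        for name, signature in meta_graph.signature_def.items():
            print(name)
            print('  inputs: ', list(signature.inputs.keys()))
            print('  outputs:', list(signature.outputs.keys()))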

The Graph Transform Tool in TensorFlow can be used to optimize your model for deployment. Although the use of this tool is explained in the context of mobile deployment, it can also be used to optimize models for non-mobile deployment.

Learn more about using the Graph Transform tool.
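
As a rough sketch of how the tool can be applied from Python, the following quantizes the weights of a frozen GraphDef (the file names and node names are assumptions, and the tool operates on a GraphDef rather than on a SavedModel directory directly):

    import tensorflow as tf
    from tensorflow.tools.graph_transforms import TransformGraph

    # Read a frozen graph, quantize its weights, and write the smaller graph out.
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('frozen_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())

    transformed = TransformGraph(
        graph_def,
        ['input'],                                   # hypothetical input node names
        ['scores'],                                  # hypothetical output node names
        ['strip_unused_nodes', 'quantize_weights'])  # transforms to apply

    with tf.gfile.GFile('optimized_graph.pb', 'wb') as f:
        f.write(transformed.SerializeToString())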

Creating serving input functions

When you export a serving graph, one key step is to define a serving input function. You can do this at the following points in relation to the overall training process:

  • At the end of the training process.
  • As a separate process after training is completed.

See more information about serving input functions.

Create serving graph during training

This typically occurs at the end of the training process, but is still tied in with training.

  1. Define a serving input function. The following example code comes from our Census sample. In this case, the serving input function is defined in the same Python file as the training model.

    def json_serving_input_fn():
      """Build the serving inputs."""
      inputs = {}
      for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
    
      return tf.estimator.export.ServingInputReceiver(inputs, inputs)

  2. Export a SavedModel from your estimator using export_savedmodel, passing in the path to your model as the export_dir_base parameter and your serving input function as the serving_input_fn parameter. In the Census example, the type of Estimator used is tf.contrib.learn.DNNLinearCombinedClassifier.
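
    A minimal sketch of that call might look like the following (the estimator variable and export path are assumptions, not taken from the sample):

    # export_dir_base first, then the serving input function.
    estimator.export_savedmodel('gs://your_bucket/census/export',
                                json_serving_input_fn)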

Create serving graph separately from training

If you have already trained your model, you can get predictions without retraining. This process is very similar to creating a serving graph during training. The main difference is that you create the serving graph in a separate Python script that you run after training is over. The basic idea is to construct the Estimator with the same model_dir used in training, then to call export_savedmodel as described above.

  1. Define a serving input function in your Python script, similarly to how you define it in training:

    def json_serving_input_fn():
      """Build the serving inputs."""
      inputs = {}
      for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
    
      return tf.estimator.export.ServingInputReceiver(inputs, inputs)

  2. When creating your Estimator, make sure to set the model_dir parameter to be the same one used in training. This makes checkpoints from your previously saved model available to the Estimator.

  3. Finally, use your Estimator to call export_savedmodel, passing in the path to your model as the export_dir_base parameter and your serving input function as the serving_input_fn parameter.
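
Putting these steps together, a minimal sketch might look like the following (the model_fn, model_dir, and export path are assumptions for illustration):

    import tensorflow as tf

    # Use the same model_dir as in training so the Estimator finds the checkpoints.
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,                       # hypothetical model function
        model_dir='gs://your_bucket/model_dir')  # same directory used in training

    # Export the SavedModel using the serving input function defined above.
    estimator.export_savedmodel('gs://your_bucket/export', json_serving_input_fn)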

TensorFlow tags and signatures

When building a TensorFlow prediction graph, you must specify the correct values for your graph's tags and signatures. TensorFlow provides constants for these tag and signature values, used for the following purposes:

  • To select a graph in your SavedModel for serving predictions
  • To indicate that you are building a prediction signature for your prediction graph

Signatures define the inputs and outputs for your graph. When you build a signature for your prediction graph, you must specify a valid signature constant as the method_name parameter in build_signature_def. For prediction, the best choice is usually PREDICT_METHOD_NAME.

You must use a tag to specify which graph in your SavedModel is used to serve predictions. In add_meta_graph_and_variables, add tag_constants.SERVING to your tags list.

See an example of how to build a prediction graph using the correct constants for tags and signatures.
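
For instance, a minimal sketch using the SavedModelBuilder might look like the following (the tensors, shapes, and export path are assumptions for illustration):

    import tensorflow as tf

    export_dir = 'export_dir'  # hypothetical output directory

    with tf.Session(graph=tf.Graph()) as sess:
        # Hypothetical prediction graph: a placeholder input and an output tensor.
        x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
        weights = tf.Variable(tf.zeros([784, 10]), name='weights')
        scores = tf.matmul(x, weights, name='scores')
        sess.run(tf.global_variables_initializer())

        # Build a prediction signature with PREDICT_METHOD_NAME as the method_name.
        signature = tf.saved_model.signature_def_utils.build_signature_def(
            inputs={'x': tf.saved_model.utils.build_tensor_info(x)},
            outputs={'scores': tf.saved_model.utils.build_tensor_info(scores)},
            method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)

        # Tag the graph with SERVING so it can be selected for predictions.
        builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
        builder.add_meta_graph_and_variables(
            sess,
            [tf.saved_model.tag_constants.SERVING],
            signature_def_map={
                tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                    signature})
        builder.save()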

Creating a model version

Cloud ML Engine organizes your trained models using model and version resources. A Cloud ML Engine model is a container for the versions of your machine learning model.

You can find detailed information about the parameters that you need to deploy your model to Cloud ML Engine on the prediction concepts page.

In order to deploy your trained model on Cloud ML Engine, you must:

  • Upload your SavedModel directory to a Cloud Storage bucket before you start.
  • Create a Cloud ML Engine model resource.
  • Create a Cloud ML Engine version resource, specifying the Cloud Storage path to your SavedModel.
  • Ensure that your Cloud ML Engine service account has "list" access for the Cloud Storage bucket that contains your SavedModel, and "read" access for the SavedModel within the Cloud Storage bucket. Without the appropriate permissions, your request to create a version will fail. See more about granting permissions for storage.

console

  1. Open the Cloud ML Engine models page in the Google Cloud Platform Console.

  2. If needed, create the model to add your new version to:

    1. Click Create Model.

    2. Enter a name for your model in the Model name box.

    3. Click Create.

    4. Verify that you have returned to the Models page, and that your new model appears in the list.

  3. Select your model from the list.

  4. Click Create a version under Versions on the Model details page.

  5. Fill in the form on the Create version page:

    1. Enter a name for your version in the Name box.

    2. Enter the Cloud Storage path to your SavedModel in the Source box.

    3. Click Create.

    4. Verify that you have returned to the Model details page, and that your new version appears in the Versions list.

gcloud

  1. If needed, create the model that you are deploying a new version of:

    gcloud ml-engine models create "model_name"
    
  2. Optionally set an environment variable to store your Cloud Storage path, which might be cumbersome to type in the next command:

    DEPLOYMENT_SOURCE="bucket_path"
    
  3. Create the version:

    gcloud ml-engine versions create "version_name" \
        --model "model_name" --origin $DEPLOYMENT_SOURCE
    
  4. Get information about your new version:

    gcloud ml-engine versions describe "your_version_name" \
        --model "your_model_name"
    

    You should see output similar to this:

    createTime: '2016-09-29T16:30:45Z'
    deploymentUri: gs://your_bucket_path
    isDefault: true
    name: projects/project_name/models/model_name/versions/version_name
    

Python

  1. Import the packages required to get an auth token and to use the Cloud ML Engine APIs from the Google API Client:

    from oauth2client.client import GoogleCredentials
    from googleapiclient import discovery
    from googleapiclient import errors
    # Time is for waiting until the request finishes.
    import time
    
  2. Set variables for project and model, using the required format for the APIs (projects/project/models/model/versions/version). Also make a variable for the storage location where you put your SavedModel:

    projectID = 'projects/{}'.format('project_name')
    modelName = 'model_name'
    modelID = '{}/models/{}'.format(projectID, modelName)
    versionName = 'version_name'
    versionDescription = 'version_description'
    trainedModelLocation = 'gs://bucket_path'
    
  3. Get your application default credentials and build the Python representation of the Cloud ML Engine API:

    credentials = GoogleCredentials.get_application_default()
    ml = discovery.build('ml', 'v1', credentials=credentials)
    
  4. If needed, create the model to which this version belongs:

    # Create a dictionary with the fields from the request body.
    requestDict = {'name': modelName,
        'description': 'Another model for testing.'}
    
    # Create a request to call projects.models.create.
    request = ml.projects().models().create(parent=projectID,
                                body=requestDict)
    
    # Make the call.
    try:
        response = request.execute()
    
        # Any additional code on success goes here (logging, etc.)
    
    except errors.HttpError as err:
        # Something went wrong, print out some information.
        print('There was an error creating the model.' +
            ' Check the details:')
        print(err._get_reason())
    
        # Clear the response for next time.
        response = None
    
  5. Create a dictionary with entries for the version creation request:

    requestDict = {'name': versionName,
        'description': versionDescription,
        'deploymentUri': trainedModelLocation}
    
  6. Create the request and make the service call to create the version:

    # Create a request to call projects.models.versions.create
    request = ml.projects().models().versions().create(parent=modelID,
                  body=requestDict)
    
    # Make the call.
    try:
        response = request.execute()
    
        # Get the operation name.
        operationID = response['name']
    
        # Any additional code on success goes here (logging, etc.)
    
    except errors.HttpError as err:
        # Something went wrong, print out some information.
        print('There was an error creating the version.' +
              ' Check the details:')
        print(err._get_reason())
    
        # Handle the exception as makes sense for your application.
    
  7. Monitor the status of the create operation:

    done = False
    request = ml.projects().operations().get(name=operationID)
    
    while not done:
        response = None
    
        # Wait for 300 milliseconds.
        time.sleep(0.3)
    
        # Make the next call.
        try:
            response = request.execute()

            # Check for finish.
            done = response.get('done', False)

        except errors.HttpError as err:
            # Something went wrong, print out some information.
            print('There was an error getting the operation.' +
                  ' Check the details:')
            print(err._get_reason())
            done = True
    

    This example waits for version creation to finish, but you may not want to block processing in your application.
