Optimized TensorFlow runtime

Vertex AI Prediction features an optimized TensorFlow runtime that allows serving TensorFlow models at a lower cost and with lower latency than open source based pre-built TensorFlow Serving containers. The optimized TensorFlow runtime lets you take advantage of technologies and model optimization techniques that are used internally at Google.

The optimized TensorFlow runtime is backwards compatible with pre-built TensorFlow Serving containers. If you're running TensorFlow models with a pre-built container, you can switch to an optimized TensorFlow runtime container with minimal effort.

The optimized TensorFlow runtime has the following limitations.

  • The optimized TensorFlow runtime is not compatible with older NVIDIA GPUs such as Tesla K80, Tesla P4 and Tesla P100.
  • The optimized TensorFlow runtime does not support explainability at this time.
  • The optimized TensorFlow runtime does not support batch predictions at this time.

Using the optimized TensorFlow runtime generally improves model performance, but the size of the improvement varies by model type.

Optimized TensorFlow runtime overview

The optimized TensorFlow runtime uses model optimizations and new proprietary Google technologies to improve the speed and lower the cost of inference compared to open source based pre-built TensorFlow Serving containers. This section describes these improvements.

Model optimizations

The following three model optimizations are currently included in the optimized TensorFlow runtime.

Improved tabular model performance on GPUs

In most cases, TensorFlow tabular models are served on CPUs because they can't use accelerators effectively. The optimized TensorFlow runtime addresses this by running the computationally expensive parts of the model on GPUs and the rest of the model on CPUs, while minimizing communication between the host and the accelerator. This makes serving tabular models faster and less costly.

The optimized TensorFlow runtime optimizes serving for several common tabular model types.

Model precompilation

When a TensorFlow model runs, each operation runs individually, which adds a small amount of overhead. The optimized TensorFlow runtime can remove some of this overhead by precompiling the TensorFlow graph, or portions of the TensorFlow graph, into larger kernels.

Model precompilation is optional and disabled by default. To learn how to enable model precompilation during a deployment, see Enable model precompilation.

Optimizations that affect precision

The optimized TensorFlow runtime can run some models significantly faster, with a small impact on model precision, by enabling optimizations that affect precision. When these optimizations are enabled, the optimized TensorFlow runtime uses techniques such as quantization and weight pruning to run models faster.

The optimizations that affect precision feature is disabled by default. To learn how to enable optimizations that affect precision during a deployment, see Enable optimizations that affect precision.

Use of the TensorFlow runtime (TFRT)

The optimized TensorFlow runtime uses the TensorFlow runtime (TFRT) for CPU models. TFRT efficiently uses multithreaded host CPUs, supports asynchronous programming models, and is optimized for low-level efficiency.

Use of the Google runtime

Because the optimized TensorFlow runtime is built using Google's internal stack, it can take advantage of running on Google's proprietary runtime environment.

Optimized TensorFlow runtime container images

Vertex AI Prediction provides two types of optimized TensorFlow runtime container images: stable and nightly.

Stable container images

Stable optimized TensorFlow runtime containers are bound to a specific TensorFlow version, just like the open source based pre-built TensorFlow Serving containers. Optimized TensorFlow runtime containers bound to a specific version are maintained for the same duration as the open source build that is bound to the same version. The optimized TensorFlow runtime builds have the same properties as open source TensorFlow builds, except with faster inference.

Builds are backwards compatible. This means you should be able to run models trained on older TensorFlow versions using a more recent container. Recent containers should perform better than older ones. In rare exceptions, a model trained on an older TensorFlow version might not work with a more recent container.

Nightly container images

Nightly optimized TensorFlow runtime builds include the most recent improvements and optimizations, but might not be as reliable as stable builds. Before you serve a model on a nightly image, you must first redeploy the model. Nightly build names include the label nightly.

Available container images

The following nightly and stable optimized TensorFlow runtime Docker container images are available.

ML framework version: nightly; Use with GPUs: No; URIs (choose any):
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest

ML framework version: nightly; Use with GPUs: Yes; URIs (choose any):
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest

ML framework version: 2.9; Use with GPUs: No; URIs (choose any):
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest

ML framework version: 2.9; Use with GPUs: Yes; URIs (choose any):
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest

ML framework version: 2.8; Use with GPUs: No; URIs (choose any):
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest

ML framework version: 2.8; Use with GPUs: Yes; URIs (choose any):
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest

Use optimized TensorFlow runtime with a private endpoint

Using private endpoints to serve online predictions with Vertex AI provides a low-latency, secure connection to the Vertex AI online prediction service that is faster than using public endpoints. Because the optimized TensorFlow runtime typically serves latency-sensitive models, consider using it with private endpoints. For more information, see Use private endpoints for online prediction.
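
The following sketch shows one way to create a private endpoint and deploy a model to it with the Vertex AI SDK for Python, assuming the model was uploaded with an optimized TensorFlow runtime container image. The network path, model ID, and machine type are placeholder assumptions, not values from this topic.

    from google.cloud import aiplatform

    # Placeholder values; replace them with your own project, region,
    # VPC network, and model resource ID.
    aiplatform.init(project="PROJECT_ID", location="LOCATION")

    # Create a private endpoint that is peered with your VPC network. The
    # network must already be configured for private services access.
    private_endpoint = aiplatform.PrivateEndpoint.create(
        display_name="tf-opt-private-endpoint",
        network="projects/PROJECT_NUMBER/global/networks/NETWORK_NAME",
    )

    # Deploy a model that was uploaded with an optimized TensorFlow runtime
    # container image.
    model = aiplatform.Model("MODEL_ID")
    private_endpoint.deploy(
        model=model,
        machine_type="n1-standard-4",
    )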

Deploy a model using the optimized TensorFlow runtime

The process to deploy a model for prediction using the optimized TensorFlow runtime is almost the same as the process to deploy models using open source based pre-built TensorFlow Serving containers. The only differences are that you specify a container image that uses the optimized TensorFlow runtime when you create your model, and that you can enable the optimization flags described earlier in this topic. For example, if you deployed your model with the us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-7:latest container, you can serve the same model with the optimized TensorFlow runtime by using the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest container.

The following code sample shows you how to create a model with the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest optimized TensorFlow runtime container. To deploy this model, you use the same process that you use to deploy a model with other pre-built TensorFlow Serving containers.

For more information about the ModelServiceClient used in this sample, see Class ModelServiceClient. For more information about how to deploy models using Vertex AI, see Deploy a model using the Vertex AI API. For more information about the allow_precompilation and allow_precision_affecting_optimizations settings, see Model optimizations described earlier in this topic.

    from google.cloud.aiplatform import gapic as aip
    # Replace PROJECT_ID and LOCATION with your project ID and region.
    PROJECT_ID = "PROJECT_ID"
    REGION = "LOCATION"
    API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
    PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

    client_options = {"api_endpoint": API_ENDPOINT}
    model_service_client = aip.ModelServiceClient(client_options=client_options)
    tf_opt_model_dict = {
        "display_name": "DISPLAY_NAME",
        "metadata_schema_uri": "",
        "artifact_uri": "MODEL_URI",
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
            "args": [
                # The optimized TensorFlow runtime includes the following
                # options that can be set here.
                # "--allow_precompilation=true" - enable precompilation
                # "--allow_precision_affecting_optimizations=true" - enable
                #    optimizations that affect precision
            ],
        },
    }
    tf_opt_model = model_service_client.upload_model(
        parent=PARENT,
        model=tf_opt_model_dict).result(timeout=180).model
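
The following sketch continues from the previous example and deploys the uploaded model to an endpoint with the EndpointServiceClient from the same client library. The endpoint display name, machine type, and replica counts are illustrative assumptions.

    # Create an endpoint and deploy the uploaded model to it.
    endpoint_service_client = aip.EndpointServiceClient(client_options=client_options)

    endpoint = endpoint_service_client.create_endpoint(
        parent=PARENT,
        endpoint={"display_name": "DISPLAY_NAME_endpoint"},
    ).result(timeout=300)

    deployed_model = {
        "model": tf_opt_model,
        "display_name": "DISPLAY_NAME_deployed",
        "dedicated_resources": {
            "machine_spec": {"machine_type": "n1-standard-4"},
            "min_replica_count": 1,
            "max_replica_count": 1,
        },
    }

    # The "0" key in traffic_split routes all traffic to the newly
    # deployed model.
    endpoint_service_client.deploy_model(
        endpoint=endpoint.name,
        deployed_model=deployed_model,
        traffic_split={"0": 100},
    ).result(timeout=1800)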

Model optimization flags

When you deploy a model using the optimized TensorFlow runtime, you can enable two features that might further optimize serving TensorFlow models.

  1. Model precompilation
  2. Optimizations that affect precision

You can enable model precompilation and optimizations that affect precision at the same time. The following topics describe how to enable these options using flags during deployment.
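
For example, assuming you use the nightly CPU image shown earlier, a container_spec that turns on both features at the same time might look like the following sketch.

    # A container_spec fragment that enables both optimization flags.
    container_spec = {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
        "args": [
            "--allow_precompilation=true",
            "--allow_precision_affecting_optimizations=true",
        ],
    }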

Enable model precompilation

To configure the optimized TensorFlow runtime to precompile models, set the allow_precompilation flag to true. Model precompilation works for many kinds of models and in most cases improves performance. Precompilation provides the most benefit for large requests.

Model precompilation happens when the first request with a new batch size arrives. To make sure that the runtime is initialized before the first request, you can include a warmup requests file. For more information, see SavedModel warmup on the TensorFlow website.

Precompilation takes between several seconds and several minutes to complete, depending on the model complexity. If you use model precompilation, you should consider the following.

  • If you use a warmup file, include requests with batch sizes that are representative of the batch sizes you expect your model to receive. Providing a large number of requests in your warmup file slows down the startup of your model server.

  • If you expect your model to receive requests with different batch sizes, you might want to enable server-side batching with a set of fixed values for allow_batch_sizes. For more information about how to enable server-side batching, see Enable server-side request batching for TensorFlow on the TensorFlow website.

  • Because precompilation adds memory overhead, some large models might fail with an out-of-memory error on the GPU.

It's recommended that you test precompilation on your model before enabling this feature in production.
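
The following is a minimal sketch of writing a warmup requests file in the format that SavedModel warmup describes on the TensorFlow website. The model directory, the "examples" input name, the feature width, and the batch sizes are assumptions that you would replace to match your own model and expected traffic.

    import os

    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

    # Warmup requests are read from assets.extra/tf_serving_warmup_requests
    # inside the exported SavedModel directory.
    warmup_file = "MODEL_DIR/assets.extra/tf_serving_warmup_requests"
    os.makedirs(os.path.dirname(warmup_file), exist_ok=True)

    with tf.io.TFRecordWriter(warmup_file) as writer:
        # Write one warmup request per batch size you expect in production.
        for batch_size in [1, 8, 32]:
            request = predict_pb2.PredictRequest()
            request.model_spec.name = "default"
            request.model_spec.signature_name = "serving_default"
            request.inputs["examples"].CopyFrom(
                tf.make_tensor_proto([[0.0] * 16] * batch_size, dtype=tf.float32)
            )
            log = prediction_log_pb2.PredictionLog(
                predict_log=prediction_log_pb2.PredictLog(request=request)
            )
            writer.write(log.SerializeToString())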

Enable optimizations that affect precision

To configure the optimized TensorFlow runtime to use optimizations that affect precision, set the allow_precision_affecting_optimizations flag to true. Test how enabling this flag affects the precision of your model before you decide whether to enable it in production.
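
One way to test the impact is to deploy the same model to two endpoints, one with the flag enabled and one without, and compare their predictions on the same held-out instances. The following sketch uses the Vertex AI SDK for Python; the endpoint IDs and test instances are hypothetical placeholders.

    import numpy as np
    from google.cloud import aiplatform

    aiplatform.init(project="PROJECT_ID", location="LOCATION")

    # Endpoints serving the same model without and with
    # --allow_precision_affecting_optimizations; the IDs are placeholders.
    baseline_endpoint = aiplatform.Endpoint("BASELINE_ENDPOINT_ID")
    optimized_endpoint = aiplatform.Endpoint("OPTIMIZED_ENDPOINT_ID")

    # Placeholder test instances; use a representative held-out sample.
    instances = [[0.0] * 16 for _ in range(32)]

    baseline = np.array(baseline_endpoint.predict(instances=instances).predictions)
    optimized = np.array(optimized_endpoint.predict(instances=instances).predictions)

    # Report the largest absolute difference between the two prediction sets.
    print("Max absolute difference:", np.max(np.abs(baseline - optimized)))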

Pricing

Deploying models using the optimized TensorFlow runtime doesn't incur additional charges. Pricing is the same as for other prediction deployments: you're charged based on the number of VMs and accelerators that you use. For more information, see Vertex AI pricing.