Optimized TensorFlow runtime

The optimized TensorFlow runtime optimizes models for faster, lower-cost inference than the open source-based prebuilt TensorFlow Serving containers. It does this by using Google's proprietary and open source technologies.

The larger a machine learning (ML) model is, the more it can cost to serve it. With the optimized TensorFlow runtime, serving your ML model can cost less and run faster than with an open source-based TensorFlow runtime. To take advantage of the optimized TensorFlow runtime when you use Vertex AI Prediction, you don't need to modify your code. Instead, you choose a serving container image that uses it.

The optimized TensorFlow runtime is backward compatible with prebuilt TensorFlow Serving containers. If you're running TensorFlow models with a prebuilt container, you can switch to an optimized TensorFlow runtime container with minimal effort.

While the optimized TensorFlow runtime improves the performance of your model, expect the size of the improvement to vary for different types of models.

Optimized TensorFlow runtime overview

The optimized TensorFlow runtime uses model optimizations and new proprietary Google technologies to improve the speed and lower the cost of inference compared to the open source-based prebuilt TensorFlow Serving containers.

The optimization occurs when Vertex AI uploads a model, before it runs. After you deploy a model to an endpoint, the optimization log is added to the prediction log. You can use these logs to troubleshoot problems that might occur during optimization.

The following topics describe optimization improvements in the optimized TensorFlow runtime.

Model optimizations

The following three model optimizations are included in the optimized TensorFlow runtime.

Model XLA precompilation

When a TensorFlow model runs, all operations run individually. Running individual operations adds a small amount of overhead. The optimized TensorFlow runtime can remove some of this overhead by using XLA to precompile all or a portion of the TensorFlow graph into larger kernels.

Model XLA precompilation is optional and disabled by default. To learn how to enable model XLA precompilation during a deployment, see Enable model XLA precompilation.

Model compression optimizations

The optimized TensorFlow runtime can run some models faster, with a small impact on model precision, by enabling model compression optimization. When model compression optimization is enabled, the optimized TensorFlow runtime uses techniques such as quantization and weight pruning to run models faster.

The model compression optimization feature is disabled by default. To learn how to enable model compression optimization during a deployment, see Enable model compression optimization.

Improved tabular model performance on GPUs

TensorFlow tabular models are usually served on CPUs because they can't use accelerators effectively. The optimized TensorFlow runtime addresses this by running the computationally expensive parts of the model on GPUs and the rest of the model on CPUs, minimizing communication between the host and the accelerator. Running the expensive parts of the model on GPUs and the rest on CPUs makes serving tabular models faster and less expensive.

The optimized TensorFlow runtime optimizes serving the following tabular model types.

Automatic model optimization for Cloud TPU

The prebuilt optimized TensorFlow runtime containers that support Cloud TPUs can automatically partition and optimize your models to be run on TPUs. For more information, see Deploy to Cloud TPU.
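
As an illustration, the following sketch defines a model that uses the Cloud TPU nightly image listed later in this document; the upload and deployment flow mirrors the sample later in this document. DISPLAY_NAME and MODEL_URI are placeholders, and the Cloud TPU machine type mentioned in the comment is an assumption to adapt to your project.

    # A minimal sketch: use the Cloud TPU optimized TensorFlow runtime image.
    # When you deploy, the dedicated resources use a Cloud TPU machine type
    # (for example, "ct5lp-hightpu-1t"; this value is an assumption).
    tf_opt_tpu_model_dict = {
        "display_name": "DISPLAY_NAME",
        "metadata_schema_uri": "",
        "artifact_uri": "MODEL_URI",
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.nightly:latest",
        },
    }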

Use of the TensorFlow runtime (TFRT)

The optimized TensorFlow runtime can use the TensorFlow runtime (TFRT). TFRT efficiently uses multithreaded host CPUs, supports asynchronous programming models, and is optimized for low-level efficiency.

TFRT for CPU is enabled in all optimized TensorFlow runtime CPU container images except version 2.8. To disable TFRT for CPU, set the use_tfrt flag to false.

TFRT for GPU is available in nightly optimized TensorFlow runtime GPU container images and in stable optimized TensorFlow runtime GPU container images version 2.13 and later. To enable TFRT for GPU, set the use_tfrt and allow_precompilation flags to true. TFRT on a GPU container image minimizes data transfer overhead between the host CPU and the GPU. After you enable TFRT, it works together with XLA compilation. Because XLA precompilation is enabled, you might experience some side effects, such as increased latency on the first request. For more information, see Enable model XLA precompilation.
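
For example, here is a minimal sketch of a container_spec that enables TFRT on a GPU image. It assumes the flags are passed as container arguments in the same --flag=value form as the optimization flags shown later in this document, and the image tag is only an example.

    # A sketch, assuming TFRT is toggled through container args in the same
    # --flag=value form as allow_precompilation and allow_compression.
    tfrt_gpu_container_spec = {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest",
        "args": [
            "--use_tfrt=true",              # Enable TFRT on the GPU image.
            "--allow_precompilation=true",  # Required together with TFRT for GPU.
        ],
    }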

Use of the Google runtime

Because the optimized TensorFlow runtime is built using Google's internal stack, it can take advantage of running on Google's proprietary runtime environment.

Optimized TensorFlow runtime container images

Vertex AI Prediction provides two types of optimized TensorFlow runtime container images: stable and nightly.

Stable container images

Stable optimized TensorFlow runtime containers are bound to a specific TensorFlow version, just like the open source-based prebuilt TensorFlow Serving containers. Optimized TensorFlow runtime containers bound to a specific version are maintained for the same duration as the open source builds bound to that version. The optimized TensorFlow runtime builds have the same properties as open source TensorFlow builds, except that they provide faster inference.

Builds are backward compatible, which means you should be able to run models trained on older TensorFlow versions using a more recent container. Recent containers should perform better than older ones. In rare cases, a model trained on an older TensorFlow version might not work with a more recent container.

Nightly container images

Nightly optimized TensorFlow runtime builds include the most recent improvements and optimizations, but might not be as reliable as stable builds. They are primarily used for experimental purposes. Nightly build names include the label nightly. Unlike the stable container images, nightly containers are not covered by the Vertex AI Service Level Agreement (SLA).

Available container images

The following nightly and stable optimized TensorFlow runtime Docker container images are available.

ML framework version Supported accelerators (and CUDA version, if applicable) End of patch and support date End of availability Supported images
nightly CPU only Not applicable Not applicable
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest
nightly GPU (CUDA 12.x) Not applicable Not applicable
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest
nightly Cloud TPU Not applicable Not applicable
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.nightly:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.nightly:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.nightly:latest
2.15 CPU only Aug 15, 2024 Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-15:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-15:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-15:latest
2.15 GPU (CUDA 12.x) Aug 15, 2024 Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-15:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-15:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-15:latest
2.15 Cloud TPU Aug 15, 2024 Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest
2.14 CPU only Aug 15, 2024 Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-14:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-14:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-14:latest
2.14 GPU (CUDA 12.x) Aug 15, 2024 Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-14:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-14:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-14:latest
2.13 CPU only Aug 15, 2024 Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-13:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-13:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-13:latest
2.13 GPU (CUDA 11.x) Aug 15, 2024 Aug 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-13:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-13:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-13:latest
2.12 CPU only May 15, 2024 May 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-12:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-12:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-12:latest
2.12 GPU (CUDA 11.x) May 15, 2024 May 15, 2025
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest
2.11 CPU only Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-11:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-11:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-11:latest
2.11 GPU (CUDA 11.x) Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest
2.10 CPU only Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-10:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-10:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-10:latest
2.10 GPU (CUDA 11.x) Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-10:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-10:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-10:latest
2.9 CPU only Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest
2.9 GPU (CUDA 11.x) Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest
2.8 CPU only Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest
2.8 GPU (CUDA 11.x) Nov 15, 2023 Nov 15, 2024
  • us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest
  • europe-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest
  • asia-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest

Use the optimized TensorFlow runtime with a private endpoint

Using private endpoints to serve online predictions with Vertex AI provides a low-latency, secure connection to the Vertex AI online prediction service that is faster than using public endpoints. Because the optimized TensorFlow runtime is likely to serve latency-sensitive models, consider using it with private endpoints. For more information, see Use private endpoints for online prediction.
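
As a hedged sketch, one way to create such an endpoint with the same gapic client used later in this document is to set the endpoint's network field to a VPC network that is peered with Google services. The project, region, and network values below are placeholders, and the linked guide remains the authoritative procedure.

    from google.cloud.aiplatform import gapic as aip

    # Placeholders to replace with your own values.
    PROJECT_ID = "PROJECT_ID"
    PROJECT_NUMBER = "PROJECT_NUMBER"
    REGION = "LOCATION"
    NETWORK_NAME = "NETWORK_NAME"  # A VPC network peered with Google services.

    endpoint_client = aip.EndpointServiceClient(
        client_options={"api_endpoint": f"{REGION}-aiplatform.googleapis.com"})
    private_endpoint = endpoint_client.create_endpoint(
        parent=f"projects/{PROJECT_ID}/locations/{REGION}",
        endpoint={
            "display_name": "tf_opt_private_endpoint",
            # Setting the network field makes this a private endpoint.
            "network": f"projects/{PROJECT_NUMBER}/global/networks/{NETWORK_NAME}",
        },
    ).result(timeout=300)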

Deploy a model using the optimized TensorFlow runtime

The process to deploy a model for prediction using the optimized TensorFlow runtime is almost the same as the process to deploy models using the open source-based prebuilt TensorFlow Serving containers. The only differences are that you specify a container image that uses the optimized TensorFlow runtime when you create your model, and that you can enable the optimization flags described in this document. For example, if you deployed your model with the us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest container, you can serve the same model with the optimized TensorFlow runtime by using the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest container.

The following code sample shows you how to create a model with the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest optimized TensorFlow runtime container. To deploy this model, you use the same process that you use to deploy a model with other prebuilt TensorFlow Serving containers.

For more information about the ModelServiceClient used in this sample, see Class ModelServiceClient. For more information about how to deploy models using Vertex AI, see Deploy a model using the Vertex AI API. For more information about the allow_precompilation and allow_compression settings, see Model optimizations described earlier in this topic.

    from google.cloud.aiplatform import gapic as aip

    # Replace these placeholder values with your own.
    PROJECT_ID = "PROJECT_ID"  # Your Google Cloud project ID.
    REGION = "LOCATION"        # The region for your model, for example "us-central1".
    API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
    PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

    client_options = {"api_endpoint": API_ENDPOINT}
    model_service_client = aip.ModelServiceClient(client_options=client_options)
    tf_opt_model_dict = {
        "display_name": "DISPLAY_NAME",  # A display name for the model.
        "metadata_schema_uri": "",
        "artifact_uri": "MODEL_URI",  # Cloud Storage URI of the SavedModel directory.
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
            "args": [
                # The optimized TensorFlow runtime includes the following
                # options that can be set here.
                # "--allow_precompilation=true" - enable XLA precompilation
                # "--allow_compression=true" - enable
                #    model compression optimization
            ],
        },
    }
    tf_opt_model = model_service_client.upload_model(
        parent=PARENT,
        model=tf_opt_model_dict).result(timeout=180).model
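
Continuing the sample above, the following hedged sketch deploys the uploaded model to a new endpoint with the gapic EndpointServiceClient. The endpoint name, machine type, and replica counts are illustrative assumptions, not requirements of the optimized TensorFlow runtime.

    # Continues from the upload sample above (reuses aip, client_options,
    # PARENT, and tf_opt_model). Machine type and replica counts are assumptions.
    endpoint_service_client = aip.EndpointServiceClient(
        client_options=client_options)

    endpoint = endpoint_service_client.create_endpoint(
        parent=PARENT,
        endpoint={"display_name": "tf_opt_endpoint"},
    ).result(timeout=300)

    endpoint_service_client.deploy_model(
        endpoint=endpoint.name,
        deployed_model={
            "model": tf_opt_model,
            "display_name": "tf_opt_deployed_model",
            "dedicated_resources": {
                "machine_spec": {"machine_type": "n1-standard-4"},
                "min_replica_count": 1,
                "max_replica_count": 1,
            },
        },
        traffic_split={"0": 100},
    ).result(timeout=1800)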

Model optimization flags

When you deploy a model using the optimized TensorFlow runtime, you can enable two features that might further optimize serving TensorFlow models.

  1. Model XLA precompilation
  2. Model compression optimization

You can enable model XLA precompilation and model compression optimization at the same time. The following topics describe how to enable these options using flags during deployment.
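
As a preview of the pattern, both flags are passed as arguments to the serving container when you upload the model, as in this sketch (the image URI is only an example); the topics that follow describe each flag and its caveats.

    # A sketch of a container_spec with both optimization flags enabled.
    container_spec = {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
        "args": [
            "--allow_precompilation=true",  # Model XLA precompilation.
            "--allow_compression=true",     # Model compression optimization.
        ],
    }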

Enable model XLA precompilation

To configure the optimized TensorFlow runtime to precompile models, set the allow_precompilation flag to true. Model XLA precompilation works for different kinds of models, and in most cases improves performance. XLA precompilation works best for requests with large batch sizes.

Model XLA precompilation happens when the first request with the new batch size arrives. To ensure that the runtime is initialized before the first request, you can include a warmup requests file. For more information, see SavedModel warmup on the TensorFlow website.

XLA precompilation takes between several seconds and several minutes to complete, depending on the model complexity. If you use model XLA precompilation, you should consider the following.

  • If you use a warmup file, try to include requests with batch sizes that represent the batch sizes you expect your model to receive. Providing a large number of requests in your warmup file slows down the startup of your model server.

  • If you expect your model to receive requests with different batch sizes, you might want to enable server-side batching with a set of fixed values for allow_batch_sizes. For more information about how to enable server-side batching, see Enable server-side request batching for TensorFlow on the TensorFlow website.

  • Because XLA precompilation adds memory overhead, some large models might fail with an out of memory error on the GPU.

Test XLA precompilation on your model before you enable this feature in production.
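
If you use a warmup file, the following sketch shows one way to write a few representative requests into the SavedModel's assets.extra/tf_serving_warmup_requests file, following the SavedModel warmup format described in the TensorFlow documentation. The export path, model name, signature name, input name, input shape, and batch sizes are assumptions to adapt to your own model.

    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

    # Assumptions to adapt: SavedModel location, model/signature/input names,
    # input shape, and the batch sizes you expect in production.
    export_dir = "gs://YOUR_BUCKET/model"
    warmup_path = f"{export_dir}/assets.extra/tf_serving_warmup_requests"

    with tf.io.TFRecordWriter(warmup_path) as writer:
        for batch_size in [1, 8, 32]:
            request = predict_pb2.PredictRequest()
            request.model_spec.name = "default"
            request.model_spec.signature_name = "serving_default"
            request.inputs["input_1"].CopyFrom(
                tf.make_tensor_proto(
                    np.zeros((batch_size, 28, 28, 1), dtype=np.float32)))
            log = prediction_log_pb2.PredictionLog(
                predict_log=prediction_log_pb2.PredictLog(request=request))
            writer.write(log.SerializeToString())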

Enable model compression optimization

To configure the optimized TensorFlow runtime to use model compression optimization, set its allow_compression flag to true. Test how enabling this flag affects the precision of your model before you decide whether to enable it in production.

Disable optimizations

To configure the optimized TensorFlow runtime to run models without optimization, set its disable_optimizer flag to true.
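
For example, a sketch following the same container args pattern as the other flags in this document:

    container_spec = {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
        "args": ["--disable_optimizer=true"],  # Run the model without optimization.
    }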

Optimized TensorFlow runtime limits

The optimized TensorFlow runtime has the following limitations:

  • The optimized TensorFlow runtime is not compatible with older NVIDIA GPUs such as Tesla K80, Tesla P4, and Tesla P100.
  • The optimized TensorFlow runtime supports only sampled Shapley explainability at this time.

Pricing

Deploying models using the optimized TensorFlow runtime doesn't incur additional charges. The cost is the same as for other prediction deployments, where you're charged based on the number of VMs and accelerators used. For more information, see Vertex AI pricing.