Vertex AI Prediction features an optimized TensorFlow runtime that allows serving TensorFlow models at a lower cost and with lower latency than open source based pre-built TensorFlow Serving containers. The optimized TensorFlow runtime lets you take advantage of technologies and model optimization techniques that are used internally at Google.
The optimized TensorFlow runtime is backwards compatible with pre-built TensorFlow Serving containers. If you're running TensorFlow models with a pre-built container, you can switch to an optimized TensorFlow runtime container with minimal effort.
The optimized TensorFlow runtime has the following limitations.
- The optimized TensorFlow runtime is not compatible with older NVIDIA GPUs such as Tesla K80, Tesla P4, and Tesla P100.
- The optimized TensorFlow runtime does not support explainability at this time.
- The optimized TensorFlow runtime does not support batch predictions at this time.
While using the optimized TensorFlow runtime generally improves model performance, expect the size of the improvement to vary across different types of models.
Optimized TensorFlow runtime overview
The optimized TensorFlow runtime uses model optimizations and new proprietary Google technologies to improve the speed and lower the cost of inference compared to open source based pre-built TensorFlow Serving containers. This section describes these improvements.
Model optimizations
The following three model optimizations are currently included in the optimized TensorFlow runtime.
Improved tabular model performance on GPUs
In most cases, TensorFlow tabular models are served on CPUs because they can't use accelerators effectively. The optimized TensorFlow runtime addresses this by running the computationally expensive parts of the model on GPUs and the rest of the model on CPUs, while minimizing communication between the host and the accelerator. This makes serving tabular models faster and less expensive.
The optimized TensorFlow runtime optimizes serving the following tabular model types.
Model precompilation
When a TensorFlow model runs, all operations are run individually. There is a small amount of overhead with running individual operations. The optimized TensorFlow runtime can remove some of this overhead by precompiling the TensorFlow graph, or portions of the TensorFlow graph, into larger kernels.
Model precompilation is optional and disabled by default. To learn how to enable model precompilation during a deployment, see Enable model precompilation.
Optimizations that affect precision
The optimized TensorFlow runtime can run some models significantly faster, with a small impact on model precision, by enabling optimizations that affect precision. When these optimizations are enabled, the optimized TensorFlow runtime uses techniques such as quantization and weight pruning to run models faster.
The optimizations that affect precision feature is disabled by default. To learn how to enable optimizations that affect precision during a deployment, see Enable optimizations that affect precision.
Use of the TensorFlow runtime (TFRT)
The optimized TensorFlow runtime uses the TensorFlow runtime (TFRT) for CPU models. TFRT uses multithreaded host CPUs efficiently, supports asynchronous programming models, and is optimized for low-level efficiency.
Use of the Google runtime
Because the optimized TensorFlow runtime is built using Google's internal stack, it can take advantage of running on Google's proprietary runtime environment.
Optimized TensorFlow runtime container images
Vertex AI Prediction provides two types of optimized TensorFlow runtime container images: stable and nightly.
Stable container images
Stable optimized TensorFlow runtime containers are bound to a specific TensorFlow version, just like the open source based pre-built TensorFlow Serving containers. Optimized TensorFlow runtime containers bound to a specific version are maintained for the same duration as the open source build that is bound to the same version. The optimized TensorFlow runtime builds have the same properties as open source TensorFlow builds, except with faster inference.
Builds are backwards compatible, which means you should be able to run models trained on older TensorFlow versions using a more recent container. Recent containers should also perform better than older ones. In rare cases, a model trained on an older TensorFlow version might not work with a more recent container.
Nightly container images
Nightly optimized TensorFlow runtime builds include the most recent improvements and optimizations, but might not be as reliable as stable builds. Before you serve a model on a nightly image, you must first redeploy the model. Nightly build names include the label nightly.
Available container images
The following nightly and stable optimized TensorFlow runtime Docker container images are available.
ML framework version | Use with GPUs? | URIs (choose any) |
---|---|---|
nightly | No | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest |
nightly | Yes | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest |
2.11 | No | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-11:latest |
2.11 | Yes | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest |
2.10 | No | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-10:latest |
2.10 | Yes | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-10:latest |
2.9 | No | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-9:latest |
2.9 | Yes | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-9:latest |
2.8 | No | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest |
2.8 | Yes | us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest |
Use optimized TensorFlow runtime with a private endpoint
Using private endpoints to serve online predictions with Vertex AI provides a low-latency, secure connection to the Vertex AI online prediction service that is faster than using public endpoints. Because the optimized TensorFlow runtime is likely to serve latency-sensitive models, consider using it with private endpoints. For more information, see Use private endpoints for online prediction.
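As a minimal sketch, creating such an endpoint with the Vertex AI API might look like the following. It assumes that your project already has a VPC network peered with Vertex AI; the project ID, region, project number, network name, and display name are placeholders.

```python
# Sketch only: creates a private endpoint by setting the network field.
# The project ID, region, project number, and VPC network name are
# placeholders; your VPC network must already be peered with Vertex AI.
from google.cloud.aiplatform import gapic as aip

PROJECT_ID = "PROJECT_ID"
REGION = "LOCATION"
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

endpoint_service_client = aip.EndpointServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

private_endpoint = endpoint_service_client.create_endpoint(
    parent=PARENT,
    endpoint={
        "display_name": "tf-opt-private-endpoint",
        # Peered VPC network in the form projects/<number>/global/networks/<name>.
        "network": "projects/PROJECT_NUMBER/global/networks/VPC_NETWORK_NAME",
    },
).result(timeout=300)
```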
Deploy a model using the optimized TensorFlow runtime
The process to deploy a model for prediction using the optimized TensorFlow runtime is almost the same as the process to deploy models using open source based pre-built TensorFlow Serving containers. The only differences are that you specify a container image that uses the optimized TensorFlow runtime when you create your model, and you can enable the optimization flags described earlier in this topic. For example, if you deployed your model with the us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-7:latest container, you can serve the same model with the optimized TensorFlow runtime by using the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.2-8:latest container.
The following code sample shows you how to create a model with the us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest optimized TensorFlow runtime container. To deploy this model, you use the same process that you use to deploy a model with other pre-built TensorFlow Serving containers.

For more information about the ModelServiceClient used in this sample, see Class ModelServiceClient. For more information about how to deploy models using Vertex AI, see Deploy a model using the Vertex AI API. For more information about the allow_precompilation and allow_precision_affecting_optimizations settings, see Model optimizations described earlier in this topic.
```python
from google.cloud.aiplatform import gapic as aip

PROJECT_ID = "PROJECT_ID"  # replace with your project ID
REGION = "LOCATION"        # replace with your region, for example us-central1
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

client_options = {"api_endpoint": API_ENDPOINT}
model_service_client = aip.ModelServiceClient(client_options=client_options)

tf_opt_model_dict = {
    "display_name": "DISPLAY_NAME",
    "metadata_schema_uri": "",
    "artifact_uri": "MODEL_URI",  # Cloud Storage URI of your SavedModel directory
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
        "args": [
            # The optimized TensorFlow runtime includes the following
            # options that can be set here.
            # "--allow_precompilation=true" - enable precompilation
            # "--allow_precision_affecting_optimizations=true" - enable
            #   optimizations that affect precision
        ],
    },
}
tf_opt_model = (
    model_service_client.upload_model(parent=PARENT, model=tf_opt_model_dict)
    .result(timeout=180)
    .model
)
```
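After the model is uploaded, you deploy it by creating an endpoint and calling deploy_model, the same as with any other pre-built container. The following is a minimal sketch that reuses PARENT, client_options, and tf_opt_model from the sample above; the display names, machine type, and replica counts are illustrative placeholders that you should adjust for your model.

```python
# Sketch only: deploys the uploaded model to a new endpoint. Reuses
# PARENT, client_options, and tf_opt_model from the sample above; the
# display names, machine type, and replica counts are placeholders.
from google.cloud.aiplatform import gapic as aip

endpoint_service_client = aip.EndpointServiceClient(client_options=client_options)

endpoint = endpoint_service_client.create_endpoint(
    parent=PARENT,
    endpoint={"display_name": "tf-opt-endpoint"},
).result(timeout=300)

endpoint_service_client.deploy_model(
    endpoint=endpoint.name,
    deployed_model={
        "model": tf_opt_model,
        "display_name": "tf-opt-deployed-model",
        "dedicated_resources": {
            "machine_spec": {"machine_type": "n1-standard-4"},
            "min_replica_count": 1,
            "max_replica_count": 1,
        },
    },
    # Route all traffic to the newly deployed model.
    traffic_split={"0": 100},
).result(timeout=1800)
```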
Model optimization flags
When you deploy a model using the optimized TensorFlow runtime, you can enable two features that might further optimize serving TensorFlow models.
You can enable model precompilation and optimizations that affect precision at the same time. The following topics describe how to enable these options using flags during deployment.
Enable model precompilation
To configure the optimized TensorFlow runtime to precompile models, set the allow_precompilation flag to true. Model precompilation works for different kinds of models and in most cases improves performance. Precompilation works best for requests with large batch sizes.
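For example, if you upload your model with the Vertex AI API as shown earlier, the flag is passed through the container args of the model's container_spec. The snippet below shows only that portion of the model dictionary; the other values are placeholders.

```python
# Sketch only: the relevant portion of the model dictionary from the
# earlier upload sample, with precompilation enabled through the
# container args. Other values are placeholders.
tf_opt_model_dict = {
    "display_name": "DISPLAY_NAME",
    "metadata_schema_uri": "",
    "artifact_uri": "MODEL_URI",
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-cpu.nightly:latest",
        "args": ["--allow_precompilation=true"],
    },
}
```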
Model precompilation happens when the first request with a new batch size arrives. To ensure that the runtime is initialized before the first request, you can include a warmup requests file. For more information, see SavedModel warmup on the TensorFlow website.
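As an illustration, the following sketch writes a warmup file in the format TensorFlow Serving expects, a TFRecord of PredictionLog messages stored at assets.extra/tf_serving_warmup_requests inside the SavedModel directory, before you upload the SavedModel to Cloud Storage. The directory path, signature name, input name and shape, and batch sizes are placeholder assumptions; replace them with values that match your model.

```python
# Sketch only: writes warmup requests into the SavedModel's assets.extra
# directory before the SavedModel is uploaded to Cloud Storage. The
# directory path, signature name, input name and shape, and batch sizes
# are placeholder assumptions; use values that match your model.
import os

import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

SAVED_MODEL_DIR = "/path/to/saved_model"  # placeholder: local SavedModel directory
warmup_dir = os.path.join(SAVED_MODEL_DIR, "assets.extra")
os.makedirs(warmup_dir, exist_ok=True)

with tf.io.TFRecordWriter(
    os.path.join(warmup_dir, "tf_serving_warmup_requests")
) as writer:
    for batch_size in (1, 8, 32):  # batch sizes you expect in production
        request = predict_pb2.PredictRequest(
            model_spec=model_pb2.ModelSpec(signature_name="serving_default"),
            inputs={
                "input_1": tf.make_tensor_proto(
                    tf.zeros([batch_size, 16]).numpy()
                )
            },
        )
        log = prediction_log_pb2.PredictionLog(
            predict_log=prediction_log_pb2.PredictLog(request=request)
        )
        writer.write(log.SerializeToString())
```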
Precompilation takes between several seconds and several minutes to complete, depending on the model complexity. If you use model precompilation, consider the following:
- If you use a warmup file, try to include requests with batch sizes that represent the batch sizes you expect your model to receive. Note that providing a large number of requests in your warmup file slows down the startup of your model server.
- If you expect your model to receive requests with different batch sizes, you might want to enable server-side batching with a set of fixed values for allow_batch_sizes. For more information about how to enable server-side batching, see Enable server-side request batching for TensorFlow on the TensorFlow website.
- Because precompilation adds memory overhead, some large models might fail with an out-of-memory error on the GPU.

It's a good idea to test precompilation on your model before enabling this feature in production.
Enable optimizations that affect precision
To configure the optimized TensorFlow runtime to use optimizations that affect precision, set its allow_precision_affecting_optimizations flag to true. Test how enabling this flag affects the precision of your model before you decide whether to enable it in production.
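One way to run such a test, assuming you have deployed the same model to two endpoints (one with the flag enabled and one without), is to send identical instances to both endpoints and compare the predictions. The following sketch does that; the endpoint names, instance payloads, and output format are placeholder assumptions.

```python
# Sketch only: compares predictions from a baseline deployment and a
# deployment that has --allow_precision_affecting_optimizations=true.
# Endpoint names, instances, and the output structure are placeholder
# assumptions; adapt them to your model.
import numpy as np
from google.cloud.aiplatform import gapic as aip
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

API_ENDPOINT = "us-central1-aiplatform.googleapis.com"  # placeholder region
BASELINE_ENDPOINT = (
    "projects/PROJECT_ID/locations/us-central1/endpoints/BASELINE_ENDPOINT_ID"
)
OPTIMIZED_ENDPOINT = (
    "projects/PROJECT_ID/locations/us-central1/endpoints/OPTIMIZED_ENDPOINT_ID"
)

prediction_client = aip.PredictionServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

# Placeholder test instances; use a representative sample of real traffic.
instances = [
    json_format.ParseDict({"input_1": [0.0] * 16}, Value()) for _ in range(4)
]

def get_predictions(endpoint):
    response = prediction_client.predict(endpoint=endpoint, instances=instances)
    # Assumes each prediction is a flat list of numbers.
    return np.array([json_format.MessageToDict(p) for p in response.predictions])

baseline = get_predictions(BASELINE_ENDPOINT)
optimized = get_predictions(OPTIMIZED_ENDPOINT)
print("Max absolute difference:", np.max(np.abs(baseline - optimized)))
```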
Pricing
Deploying models using the optimized TensorFlow runtime doesn't incur additional charges. Pricing is the same as for other prediction deployments, where you're charged based on the number of VMs and accelerators used. For more information, see Vertex AI pricing.