Google Cloud provides access to custom-designed machine learning accelerators called Tensor Processing Units (TPUs). TPUs are optimized to accelerate the training and inference of machine learning models, making them ideal for a variety of applications, including natural language processing, computer vision, and speech recognition.
This page describes how to deploy your models to a single host Cloud TPU v5e for online prediction in Vertex AI.
Only Cloud TPU version v5e is supported. Other Cloud TPU generations are not supported.
Import your model
For deployment on Cloud TPUs, you must import your model to Vertex AI and configure it to use one of the following containers:
- the prebuilt optimized TensorFlow runtime container, either the nightly version or version 2.15 or later
- the prebuilt PyTorch TPU container, version 2.1 or later
- your own custom container that supports TPUs
Prebuilt optimized TensorFlow runtime container
To import and run a TensorFlow SavedModel on a Cloud TPU, the model must be TPU-optimized. If your TensorFlow SavedModel is not already TPU-optimized, there are three ways to optimize your model:
Manual model optimization - You use the Inference Converter to optimize your model and save it. Then, you must pass the --saved_model_tags='serve,tpu' and --disable_optimizer=true flags when you upload your model. For example:

model = aiplatform.Model.upload(
    display_name='Manually optimized model',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[
        "--saved_model_tags=serve,tpu",
        "--disable_optimizer=true",
    ]
)
Automatic model optimization with automatic partitioning - When you import a model, Vertex AI will attempt to optimize your unoptimized model using an automatic partitioning algorithm. This optimization does not work on all models. If optimization fails, you must either manually optimize your model or choose automatic model optimization with manual partitioning. For example:
model = aiplatform.Model.upload(
    display_name='TPU optimized model with automatic partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[]
)
Automatic model optimization with manual partitioning - Specify the --converter_options_string flag and adjust ConverterOptions.TpuFunction to fit your needs. For an example, see Converter Image. Note that only ConverterOptions.TpuFunction, which is all that is needed for manual partitioning, is supported. For example:

model = aiplatform.Model.upload(
    display_name='TPU optimized model with manual partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[
        "--converter_options_string='tpu_functions { function_alias: \"partitioning function name\" }'"
    ]
)
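The --converter_options_string example above references a function alias attached to the SavedModel. As a minimal sketch, one way to attach such an alias when saving a model is through the function_aliases field of tf.saved_model.SaveOptions; the model architecture, input shape, and alias name tpu_func below are hypothetical placeholders:

import tensorflow as tf

# Hypothetical model; replace with your own architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.build(input_shape=(None, 64))

# Wrap the forward pass in a tf.function so it can be aliased.
@tf.function(input_signature=[tf.TensorSpec([None, 64], tf.float32)])
def serve(inputs):
    return model(inputs)

# Attach the alias that --converter_options_string can refer to, for example
# tpu_functions { function_alias: "tpu_func" }.
tf.saved_model.save(
    model,
    "gs://model-artifact-uri",  # or a local path
    signatures={"serving_default": serve},
    options=tf.saved_model.SaveOptions(function_aliases={"tpu_func": serve}),
)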
For more information on importing models, see importing models to Vertex AI.
Prebuilt PyTorch container
The instructions to import and run a PyTorch model on a Cloud TPU are the same as the instructions to import and run any other PyTorch model.
For example, TorchServe for Cloud TPU v5e Inference demonstrates how to package the Densenet 161 model into model artifacts using Torch Model Archiver.
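Once Torch Model Archiver has produced a model archive, you can copy it into the Cloud Storage folder that you later pass as artifact_uri. The following is a minimal sketch using the Cloud Storage Python client; the bucket name, object path, and archive filename are hypothetical placeholders:

from google.cloud import storage

# Copy the packaged .mar archive into the Cloud Storage folder that will
# be passed as artifact_uri when uploading the model.
client = storage.Client()
bucket = client.bucket("my-bucket")  # hypothetical bucket
blob = bucket.blob("densenet/model.mar")
blob.upload_from_filename("densenet161.mar")  # archive produced by Torch Model Archiver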
Then, with the model artifacts in your Cloud Storage folder, upload your model as shown:
model = aiplatform.Model.upload(
display_name='DenseNet TPU model from SDK PyTorch 2.1',
artifact_uri="gs://model-artifact-uri",
serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-tpu.2-1:latest",
serving_container_args=[],
serving_container_predict_route="/predictions/model",
serving_container_health_route="/ping",
serving_container_ports=[8080]
)
For more information, see export model artifacts for PyTorch and the Jupyter Notebook for Serve a PyTorch model using a prebuilt container.
Custom container
For custom containers, your model does not need to be a TensorFlow model, but it must be TPU-optimized. For information on producing a TPU-optimized model, see the following guides for common ML frameworks:
For information on serving models trained with JAX, TensorFlow, or PyTorch on Cloud TPU v5e, see Cloud TPU v5e Inference.
Make sure your custom container meets the custom container requirements.
You must raise the locked memory limit so the driver can communicate with the TPU chips over direct memory access (DMA). For example:
Command line
ulimit -l 68719476736
Python
import resource
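# Note: RLIMIT_MEMLOCK is specified in bytes, whereas the ulimit -l value
# above is specified in kilobytes, which is why the numbers differ.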
resource.setrlimit(
resource.RLIMIT_MEMLOCK,
(
68_719_476_736_000, # soft limit
68_719_476_736_000, # hard limit
),
)
Then, see Use a custom container for prediction for information on importing a model with a custom container. If you want to implement pre- or post-processing logic, consider using Custom prediction routines.
Create an endpoint
The instructions for creating an endpoint for Cloud TPUs are the same as the instructions for creating any endpoint.
For example, the following command creates an endpoint resource:
endpoint = aiplatform.Endpoint.create(display_name='My endpoint')
The response contains the new endpoint's ID, which you use in subsequent steps.
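For example, one way to see the ID is to print the endpoint's resource name, which ends with the numeric endpoint ID:

# The resource name has the form
# projects/PROJECT_NUMBER/locations/REGION/endpoints/ENDPOINT_ID.
print(endpoint.resource_name)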
For more information on creating an endpoint, see deploy a model to an endpoint.
Deploy a model
The instructions for deploying a model to Cloud TPUs are the same as the instructions for deploying any model, except you specify one of the following supported Cloud TPU machine types:
Machine Type | Number of TPU chips
---|---
ct5lp-hightpu-1t | 1
ct5lp-hightpu-4t | 4
ct5lp-hightpu-8t | 8
TPU accelerators are built into the machine type. You don't need to specify an accelerator type or accelerator count.
For example, the following command deploys a model by calling deployModel:
machine_type = 'ct5lp-hightpu-1t'
deployed_model = model.deploy(
endpoint=endpoint,
deployed_model_display_name='My deployed model',
machine_type=machine_type,
traffic_percentage=100,
    min_replica_count=1,
sync=True,
)
For more information, see deploy a model to an endpoint.
Get online predictions
The instructions for getting online predictions from a Cloud TPU are the same as the instructions for getting online predictions from any other deployed model.
For example, the following command sends an online prediction request by calling predict:
deployed_model.predict(...)
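As a concrete illustration, the following is a minimal sketch of a request, assuming the deployed model accepts a list of numeric feature vectors; the instance format depends entirely on your model's signature:

# Send one instance with four hypothetical feature values.
response = deployed_model.predict(instances=[[0.1, 0.2, 0.3, 0.4]])
print(response.predictions)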
For custom containers, see the prediction request and response requirements for custom containers.
Securing capacity
By default, the quota for Custom model serving TPU v5e cores per region is 0.
To request an increase, see Request a higher quota limit.
Pricing
TPU machine types are billed per hour, just like all other machine types in Vertex Prediction. For more information, see Prediction pricing.
What's next
- Learn how to get an online prediction