Google Cloud provides access to custom-designed machine learning accelerators called Tensor Processing Units (TPUs). TPUs are optimized to accelerate the training and inference of machine learning models, making them ideal for a variety of applications, including natural language processing, computer vision, and speech recognition.
This page describes how to deploy your models to a single host Cloud TPU v5e for online prediction in Vertex AI.
Only Cloud TPU version v5e is supported. Other Cloud TPU generations are not supported.
Import your model
For deployment on Cloud TPUs, you must import your model to Vertex AI and configure it to use one of the following containers:
- the prebuilt optimized TensorFlow runtime container, either the nightly version or version 2.15 or later
- the prebuilt PyTorch TPU container, version 2.1 or later
- your own custom container that supports TPUs
Prebuilt optimized TensorFlow runtime container
To import and run a TensorFlow SavedModel on a Cloud TPU, the model must be TPU-optimized. If your TensorFlow SavedModel is not already TPU-optimized, there are three ways to optimize your model:
Manual model optimization - You use the Inference Converter to optimize your model and save it. Then, you must pass the --saved_model_tags='serve,tpu' and --disable_optimizer=true flags when you upload your model. For example:

model = aiplatform.Model.upload(
    display_name='Manually optimized model',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[
        "--saved_model_tags=serve,tpu",
        "--disable_optimizer=true",
    ]
)
Automatic model optimization with automatic partitioning - When you import a model, Vertex AI will attempt to optimize your unoptimized model using an automatic partitioning algorithm. This optimization does not work on all models. If optimization fails, you must either manually optimize your model or choose automatic model optimization with manual partitioning. For example:
model = aiplatform.Model.upload(
    display_name='TPU optimized model with automatic partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[]
)
Automatic model optimization with manual partitioning - Specify the --converter_options_string flag and adjust ConverterOptions.TpuFunction to fit your needs. For an example, see Converter Image. Note that only ConverterOptions.TpuFunction, which is all that is needed for manual partitioning, is supported. For example:

model = aiplatform.Model.upload(
    display_name='TPU optimized model with manual partitioning',
    artifact_uri="gs://model-artifact-uri",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-tpu.2-15:latest",
    serving_container_args=[
        "--converter_options_string='tpu_functions { function_alias: \"partitioning function name\" }'"
    ]
)
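The --converter_options_string example above references a function alias attached to the SavedModel. As a minimal sketch, one way to attach such an alias when saving a model is through the function_aliases field of tf.saved_model.SaveOptions; the model architecture, input shape, and alias name tpu_func below are hypothetical placeholders:

import tensorflow as tf

# Hypothetical model; replace with your own architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.build(input_shape=(None, 64))

# Wrap the forward pass in a tf.function so it can be aliased.
@tf.function(input_signature=[tf.TensorSpec([None, 64], tf.float32)])
def serve(inputs):
    return model(inputs)

# Attach the alias that --converter_options_string can refer to, for example
# tpu_functions { function_alias: "tpu_func" }.
tf.saved_model.save(
    model,
    "gs://model-artifact-uri",  # or a local path
    signatures={"serving_default": serve},
    options=tf.saved_model.SaveOptions(function_aliases={"tpu_func": serve}),
)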
For more information on importing models, see importing models to Vertex AI.
Prebuilt PyTorch container
The instructions to import and run a PyTorch model on a Cloud TPU are the same as the instructions to import and run any other PyTorch model.
For example, TorchServe for Cloud TPU v5e Inference demonstrates how to package the Densenet 161 model into model artifacts using Torch Model Archiver.
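Once Torch Model Archiver has produced a model archive, you can copy it into the Cloud Storage folder that you later pass as artifact_uri. The following is a minimal sketch using the Cloud Storage Python client; the bucket name, object path, and archive filename are hypothetical placeholders:

from google.cloud import storage

# Copy the packaged .mar archive into the Cloud Storage folder that will
# be passed as artifact_uri when uploading the model.
client = storage.Client()
bucket = client.bucket("my-bucket")  # hypothetical bucket
blob = bucket.blob("densenet/model.mar")
blob.upload_from_filename("densenet161.mar")  # archive produced by Torch Model Archiver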
Then, with the model artifacts in your Cloud Storage folder, upload your model as shown:
model = aiplatform.Model.upload(
display_name='DenseNet TPU model from SDK PyTorch 2.1',
artifact_uri="gs://model-artifact-uri",
serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-tpu.2-1:latest",
serving_container_args=[],
serving_container_predict_route="/predictions/model",
serving_container_health_route="/ping",
serving_container_ports=[8080]
)
For more information, see export model artifacts for PyTorch and the Jupyter Notebook for Serve a PyTorch model using a prebuilt container.
Custom container
For custom containers, your model does not need to be a TensorFlow model, but it must be TPU-optimized. For information on producing a TPU-optimized model, see the following guides for common ML frameworks:
For information on serving models trained with JAX, TensorFlow, or PyTorch on Cloud TPU v5e, see Cloud TPU v5e Inference.
Make sure your custom container meets the custom container requirements.
You must raise the locked memory limit so the driver can communicate with the TPU chips over direct memory access (DMA). For example:
Command line
ulimit -l 68719476736
Python
import resource
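# Note: RLIMIT_MEMLOCK is specified in bytes, whereas the ulimit -l value
# above is specified in kilobytes, which is why the numbers differ.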
resource.setrlimit(
resource.RLIMIT_MEMLOCK,
(
68_719_476_736_000, # soft limit
68_719_476_736_000, # hard limit
),
)
Then, see Use a custom container for prediction for information on importing a model with a custom container. If you want to implement pre- or post-processing logic, consider using Custom prediction routines.
Create an endpoint
The instructions for creating an endpoint for Cloud TPUs are the same as the instructions for creating any endpoint.
For example, the following command creates an endpoint resource:
endpoint = aiplatform.Endpoint.create(display_name='My endpoint')
The response contains the new endpoint's ID, which you use in subsequent steps.
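For example, one way to see the ID is to print the endpoint's resource name, which ends with the numeric endpoint ID:

# The resource name has the form
# projects/PROJECT_NUMBER/locations/REGION/endpoints/ENDPOINT_ID.
print(endpoint.resource_name)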
For more information on creating an endpoint, see deploy a model to an endpoint.
Deploy a model
The instructions for deploying a model to Cloud TPUs are the same as the instructions for deploying any model, except you specify one of the following supported Cloud TPU machine types:
Machine Type | Number of TPU chips
---|---
ct5lp-hightpu-1t | 1
ct5lp-hightpu-4t | 4
ct5lp-hightpu-8t | 8
TPU accelerators are built into the machine type. You don't need to specify an accelerator type or accelerator count.
For example, the following command deploys a model by calling deployModel:
machine_type = 'ct5lp-hightpu-1t'
deployed_model = model.deploy(
endpoint=endpoint,
deployed_model_display_name='My deployed model',
machine_type=machine_type,
traffic_percentage=100,
    min_replica_count=1,
sync=True,
)
For more information, see deploy a model to an endpoint.
Get online predictions
The instructions for getting online predictions from a Cloud TPU are the same as the instructions for getting online predictions from any other deployed model.
For example, the following command sends an online prediction request by calling predict:
deployed_model.predict(...)
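As a concrete illustration, the following is a minimal sketch of a request, assuming the deployed model accepts a list of numeric feature vectors; the instance format depends entirely on your model's signature:

# Send one instance with four hypothetical feature values.
response = deployed_model.predict(instances=[[0.1, 0.2, 0.3, 0.4]])
print(response.predictions)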
For custom containers, see the prediction request and response requirements for custom containers.
Securing capacity
By default, the quota for Custom model serving TPU v5e cores per region is 0.
To request an increase, see Request a higher quota limit.
Pricing
TPU machine types are billed per hour, just like all other machine types in Vertex Prediction. For more information, see Prediction pricing.
What's next
- Learn how to get an online prediction