Training with TPU accelerators

Vertex AI supports training with a variety of frameworks and libraries on TPU VMs. When you configure compute resources, you can specify TPU v2, TPU v3, or TPU v5e VMs. TPU v5e supports JAX 0.4.6 and later, TensorFlow 2.15 and later, and PyTorch 2.1 and later. For details about configuring TPU VMs for custom training, see Configure compute resources for custom training.
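As a rough illustration of what specifying a TPU VM in the compute resources might look like, the following Python sketch builds a worker pool spec as a plain dictionary. The helper name `tpu_worker_pool_spec`, the default values `cloud-tpu`, `TPU_V3`, and `8`, and the snake_case field names are assumptions made for this example; check the compute resources page for the values that apply to your TPU version.

```python
from typing import Any, Dict, List, Optional


def tpu_worker_pool_spec(
    image_uri: str,
    machine_type: str = "cloud-tpu",             # assumed machine type for TPU v2/v3
    accelerator_type: Optional[str] = "TPU_V3",  # assumed accelerator name
    accelerator_count: Optional[int] = 8,        # assumed core count for one TPU VM
    args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Builds one worker pool spec dictionary for a TPU training job.

    This is a hypothetical helper, not part of any SDK. It only assembles
    a dictionary; it does not call Vertex AI.
    """
    machine_spec: Dict[str, Any] = {"machine_type": machine_type}
    # Some machine types may not take an explicit accelerator type,
    # so the accelerator fields are optional here.
    if accelerator_type is not None:
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count
    return {
        "machine_spec": machine_spec,
        "replica_count": 1,
        "container_spec": {"image_uri": image_uri, "args": list(args or [])},
    }


if __name__ == "__main__":
    # The image URI below is a placeholder, not a real image.
    print(tpu_worker_pool_spec("us-docker.pkg.dev/my-project/my-repo/trainer:latest"))
```

Keeping the spec as a plain dictionary makes it easy to inspect or log before passing it to whichever client you use to submit the job.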

TensorFlow training

Prebuilt container

Use a prebuilt training container that supports TPUs, and create a Python training application.

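Custom container

Use a custom container in which you have installed versions of tensorflow and libtpu that are built specifically for TPU VMs. These libraries are maintained by the Cloud TPU service and are listed in the Supported TPU configurations documentation.

Select the tensorflow version that you want and its corresponding libtpu library, and install them in your Docker container image when you build the container.

For example, to use TensorFlow 2.15, include the following instructions in your Dockerfile: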

  # Download and install `tensorflow`.
  RUN pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.15.0/tensorflow-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  # Download and install `libtpu`.
  # You must save `libtpu.so` in the '/lib' directory of the container image.
  RUN curl -L https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/libtpu/1.9.0/libtpu.so -o /lib/libtpu.so

  # TensorFlow training on TPU v5e requires the PJRT runtime. To enable the PJRT
  # runtime, configure the following environment variables in your Dockerfile.
  # For details, see https://cloud.google.com/tpu/docs/runtimes#tf-pjrt-support.
  # ENV NEXT_PLUGGABLE_DEVICE_USE_C_API=true
  # ENV TF_PLUGGABLE_DEVICE_LIBRARY_PATH=/lib/libtpu.so
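The Dockerfile instructions above save libtpu.so under /lib and, for TPU v5e, set two PJRT environment variables. A trainer could check these at startup before importing TensorFlow. The following is a minimal sketch that uses only the Python standard library; the function name `find_tpu_container_problems` and its parameters are hypothetical, and the check reflects only the requirements stated in the Dockerfile comments.

```python
import os
from typing import List, Mapping

LIBTPU_PATH = "/lib/libtpu.so"  # location required by the Dockerfile comments


def find_tpu_container_problems(
    env: Mapping[str, str],
    libtpu_path: str = LIBTPU_PATH,
    require_pjrt: bool = False,
) -> List[str]:
    """Returns a list of human-readable problems; empty means none were found."""
    problems: List[str] = []
    if not os.path.isfile(libtpu_path):
        problems.append(f"libtpu not found at {libtpu_path}")
    if require_pjrt:
        # Set require_pjrt=True for TPU v5e, which needs the PJRT runtime.
        if env.get("NEXT_PLUGGABLE_DEVICE_USE_C_API") != "true":
            problems.append("NEXT_PLUGGABLE_DEVICE_USE_C_API is not 'true'")
        if env.get("TF_PLUGGABLE_DEVICE_LIBRARY_PATH") != libtpu_path:
            problems.append("TF_PLUGGABLE_DEVICE_LIBRARY_PATH does not point to libtpu")
    return problems


if __name__ == "__main__":
    for problem in find_tpu_container_problems(os.environ, require_pjrt=True):
        print("WARNING:", problem)
```

Returning a list instead of raising on the first failure lets the trainer log every misconfiguration in one pass.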

TPU Pod

TensorFlow training on a TPU Pod requires additional setup in the training container. Vertex AI maintains base Docker images that handle the initial setup.

Image URIs	Python version and TPU version
`us-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest` `europe-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest` `asia-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest`	3.8
`us-docker.pkg.dev/vertex-ai/training/tf-tpu.2-15-pod-base-cp310:latest` `europe-docker.pkg.dev/vertex-ai/training/tf-tpu.2-15-pod-base-cp310:latest` `asia-docker.pkg.dev/vertex-ai/training/tf-tpu.2-15-pod-base-cp310:latest`	3.10

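To build your custom container, follow these steps:

Choose the base image for the Python version that you want to use. TPU TensorFlow wheels for TensorFlow 2.12 and earlier support Python 3.8. TensorFlow 2.13 and later support Python 3.10 or later. For the specific TensorFlow wheels, see Cloud TPU configurations.
Extend the image with your trainer code and the startup command.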

# Specifies base image and tag
FROM us-docker.pkg.dev/vertex-ai/training/tf-tpu-pod-base-cp38:latest
WORKDIR /root

# Download and install `tensorflow`.
RUN pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.12.0/tensorflow-2.12.0-cp38-cp38-linux_x86_64.whl

# Download and install `libtpu`.
# You must save `libtpu.so` in the '/lib' directory of the container image.
RUN curl -L https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/libtpu/1.6.0/libtpu.so -o /lib/libtpu.so

# Copies the trainer code to the docker image.
COPY your-path-to/model.py /root/model.py
COPY your-path-to/trainer.py /root/trainer.py

# The base image is set up so that it runs the CMD that you provide.
# You can provide CMD inside the Dockerfile as follows.
# Alternatively, you can pass it as an `args` value in ContainerSpec:
# (https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#containerspec)
CMD ["python3", "trainer.py"]
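The rule for choosing a base image by Python version (TensorFlow 2.12 and earlier use Python 3.8 wheels; 2.13 and later use Python 3.10 or later) can be sketched as a small Python helper. The function names are hypothetical, and `cp310` stands in for "3.10 or later" as the minimum supported tag; confirm the exact wheel for your version against the Cloud TPU configurations page.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Parses '2.12.0' or '2.12' into (2, 12)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def python_tag_for_tf(tf_version: str) -> str:
    """Returns the wheel Python tag implied by the TensorFlow version.

    Versions are compared as integer tuples so that, for example,
    2.9 sorts before 2.12 (a plain string comparison would not).
    """
    if parse_major_minor(tf_version) <= (2, 12):
        return "cp38"
    return "cp310"  # minimum tag; later Python versions may also be supported


if __name__ == "__main__":
    for version in ("2.9.1", "2.12.0", "2.13.0", "2.15.0"):
        print(version, python_tag_for_tf(version))
```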

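PyTorch training

When you train with TPUs, you can use either a prebuilt container or a custom container for PyTorch.

Prebuilt container

Use a prebuilt training container that supports TPUs, and create a Python training application.

Custom container

Use a custom container in which you have installed the PyTorch library.

For example, your Dockerfile might look like the following: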

FROM python:3.10

# v5e specific requirement - enable PJRT runtime
ENV PJRT_DEVICE=TPU

# install pytorch and torch_xla
RUN pip3 install torch~=2.1.0 torchvision 'torch_xla[tpu]~=2.1.0' \
 -f https://storage.googleapis.com/libtpu-releases/index.html

# Add your artifacts here
COPY trainer.py .

# Run the trainer code
CMD ["python3", "trainer.py"]

TPU Pod

Training runs on all hosts of the TPU Pod (see Run PyTorch code on TPU Pod slices).

Vertex AI waits for a response from every host before it decides that the job is complete.

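JAX training

Prebuilt container

There are no prebuilt containers for JAX.

Custom container

Use a custom container in which you have installed the JAX library.

For example, your Dockerfile might look like the following: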

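# Base image. This is an assumption for this example; use any image that provides Python and pip.
FROM python:3.10

# Install JAX.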
RUN pip install 'jax[tpu]>=0.4.6' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Add your artifacts here
COPY trainer.py trainer.py

# Set an entrypoint.
ENTRYPOINT ["python3", "trainer.py"]

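TPU Pod

Training runs on all hosts of the TPU Pod (see Run JAX code on TPU Pod slices).

Vertex AI watches the first host of the TPU Pod to decide when the job is complete. You can use the following code snippet to make sure that all hosts exit at the same time: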

import jax
import jax.numpy as jnp

# Your training logic
...

if jax.process_count() > 1:
  # Make sure all hosts stay up until the end of main.
  x = jnp.ones([jax.local_device_count()])
  x = jax.device_get(jax.pmap(lambda x: jax.lax.psum(x, 'i'), 'i')(x))
  assert x[0] == jax.device_count()

Environment variables

The following table describes the environment variables that are available inside the container:

Name	Value
TPU_NODE_NAME	my-first-tpu-node
TPU_CONFIG	{"project": "tenant-project-xyz", "zone": "us-central1-b", "tpu_node_name": "my-first-tpu-node"}
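Because the TPU_CONFIG value shown in the table is a JSON object, a trainer can decode it with the standard library. The following is a minimal sketch; the function name `read_tpu_config` is hypothetical, and the keys it reads are only those that appear in the example value above.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional


def read_tpu_config(env: Mapping[str, str] = os.environ) -> Optional[Dict[str, Any]]:
    """Decodes TPU_CONFIG from the environment, or returns None if it is unset."""
    raw = env.get("TPU_CONFIG")
    if not raw:
        return None
    return json.loads(raw)


if __name__ == "__main__":
    config = read_tpu_config()
    if config is None:
        print("TPU_CONFIG is not set; probably not running in a TPU training container.")
    else:
        # .get() is used because only the example keys are known here.
        print("zone:", config.get("zone"))
        print("TPU node:", config.get("tpu_node_name"))
```

Passing the environment as a parameter keeps the function easy to exercise outside a TPU container.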

Custom service account

You can use a custom service account for TPU training. To learn how, see the page about using a custom service account.

Private IP (VPC Network Peering) for training

You can use a private IP for TPU training. See the page about using a private IP for custom training.

VPC Service Controls

Projects that have VPC Service Controls enabled can submit TPU training jobs.

Limitations

The following limitation applies when you train with a TPU VM:

TPUs are available only in specific Vertex AI regions.

TPU types

For more information about TPU accelerators, such as their memory limits, see TPU types.
