Serve open source models using TPUs on GKE with Optimum TPU

Standard

This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.

This guide is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.

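This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.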

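Google Cloud offers multiple options for LLM inference. You can use services such as Vertex AI, GKE, and Google Compute Engine, and incorporate serving libraries such as JetStream and vLLM, as well as other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.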

Optimum TPU supports the following features:

Continuous batching
Token streaming
Greedy search and multinomial sampling using transformers
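
The last feature names two decoding strategies. The following is a minimal sketch of how they differ, written with the Python standard library only. It is not Optimum TPU code: the function names, the toy logits, and the temperature parameter are illustrative assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy search: always take the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def multinomial_pick(logits, rng, temperature=1.0):
    """Multinomial sampling: draw an index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-token scores for a four-token vocabulary.
toy_logits = [1.0, 3.0, 0.5, 2.0]
rng = random.Random(0)  # seeded so repeated runs draw the same samples

greedy_choice = greedy_pick(toy_logits)
sampled_choices = [multinomial_pick(toy_logits, rng) for _ in range(5)]
print(greedy_choice, sampled_choices)
```

Greedy search is deterministic for a given set of scores, while multinomial sampling can return a lower-scoring token, which is why sampled output varies between requests.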

Objectives

Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
Deploy Optimum TPU on GKE.
Use Optimum TPU to serve the supported models through curl.
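
To make the last objective concrete, here is a rough sketch of the kind of HTTP request that curl would send to the model server, built with the Python standard library. Everything specific in it is an assumption and does not come from this page: the local address `127.0.0.1:8080` (for example, reached through port forwarding), the `/generate` path, and the `inputs` and `parameters.max_new_tokens` fields. Check the deployed server's own API before relying on any of them.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=40,
                           base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a JSON POST request for text generation.

    The URL path and payload field names are assumptions, not confirmed
    by this tutorial.
    """
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("What is Deep Learning?")
print(request.get_method(), request.full_url)

# To send the request once a model server is reachable:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request is only constructed here, so the sketch runs without a cluster; uncomment the last lines after the model server is deployed and reachable.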

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/artifactregistry.admin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
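  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.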
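
The role check above can also be scripted. The sketch below, which assumes you already have the list of roles granted to you, reports which required roles are missing and prints a `gcloud projects add-iam-policy-binding` command for each one. The helper names, the project ID `my-project`, and the email address are hypothetical placeholders.

```python
# Roles that this tutorial requires on the project.
REQUIRED_ROLES = (
    "roles/container.admin",
    "roles/iam.serviceAccountAdmin",
    "roles/artifactregistry.admin",
)


def missing_roles(granted_roles):
    """Return the required roles that are absent from granted_roles."""
    granted = set(granted_roles)
    return [role for role in REQUIRED_ROLES if role not in granted]


def grant_commands(project_id, user_email, roles):
    """Build one gcloud command string per role; nothing is executed."""
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f'--member="user:{user_email}" --role={role}'
        for role in roles
    ]


# Hypothetical values for illustration only.
todo = missing_roles(["roles/container.admin"])
for command in grant_commands("my-project", "me@example.com", todo):
    print(command)
```

The commands are printed rather than run, so you can review them before granting anything.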
