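Deploy models with custom weights

Deploying models with custom weights is a Preview offering. You can fine-tune models based on a predefined set of base models and deploy your customized models on Vertex AI Model Garden. To deploy a custom model, use the custom weights import: upload your model artifacts to a Cloud Storage bucket in your project, then deploy from that bucket, which takes a single click in Vertex AI.

Supported models

The public preview of deploying models with custom weights supports the following base models: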

Model name	Versions
Llama	Llama-2: 7B, 13B; Llama-3.1: 8B, 70B; Llama-3.2: 1B, 3B; Llama-4: Scout-17B, Maverick-17B; CodeLlama-13B
Gemma	Gemma-2: 27B; Gemma-3: 1B, 4B, 12B, 27B; Medgemma: 4B, 27B-text
Qwen	Qwen2: 1.5B; Qwen2.5: 0.5B, 1.5B, 7B, 32B; Qwen3: 0.6B, 1.7B, 8B, 32B, Qwen3-Coder-480B-A35B-Instruct, Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking
Deepseek	Deepseek-R1; Deepseek-V3; DeepSeek-V3.1
Mistral and Mixtral	Mistral-7B-v0.1; Mixtral-8x7B-v0.1; Mistral-Nemo-Base-2407
Phi-4	Phi-4-reasoning
OpenAI OSS	gpt-oss: 20B, 120B

Limitations

Custom weights don't support importing quantized models.
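
One way to catch a quantized checkpoint before uploading is to inspect its `config.json`. The following is a minimal sketch, assuming the checkpoint was saved with Hugging Face Transformers, which typically records quantization settings under a `quantization_config` key. The helper name `looks_quantized` is hypothetical, and a missing key does not guarantee that Vertex AI accepts the model.

```python
import json


def looks_quantized(config_text):
    """Return True if a config.json payload declares quantization settings.

    Assumption: quantized Hugging Face checkpoints usually carry a
    non-empty "quantization_config" entry in config.json.
    """
    config = json.loads(config_text)
    return bool(config.get("quantization_config"))


if __name__ == "__main__":
    # Illustrative payload, not taken from a real model.
    sample = '{"model_type": "llama", "quantization_config": {"quant_method": "gptq"}}'
    print(looks_quantized(sample))
```

In practice you would read the text from the `config.json` file that sits next to your weights and skip the upload if the check returns `True`.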

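Model files

You must supply the model files in the Hugging Face weights format. For more information about the Hugging Face weights format, see Use Hugging Face Models.

If the required files aren't provided, the model deployment might fail.

The following table lists the types of model files, which depend on the model's architecture:

Model file content	File type
Model configuration	`config.json`
Model weights	`.safetensors` `.bin`
Weights index	`*.index.json`
Tokenizer files	`tokenizer.model` `tokenizer.json` `tokenizer_config.json`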
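
A quick local check of the file list can surface a missing file before a deployment fails. The sketch below is an illustration only: the helper `missing_groups` is hypothetical, it treats a table row as satisfied when at least one of its listed files is present (an assumption, because the exact set depends on the model's architecture), and it skips the weights index row, which is assumed to apply only to sharded checkpoints.

```python
from fnmatch import fnmatch

# Patterns taken from the table above. Which files a given architecture
# actually needs is not specified there, so this check is deliberately loose.
REQUIRED_GROUPS = {
    "model configuration": ["config.json"],
    "model weights": ["*.safetensors", "*.bin"],
    "tokenizer files": ["tokenizer.model", "tokenizer.json", "tokenizer_config.json"],
}


def missing_groups(file_names):
    """Return the groups for which no file name matches any pattern."""
    missing = []
    for group, patterns in REQUIRED_GROUPS.items():
        matched = any(fnmatch(name, pattern)
                      for name in file_names
                      for pattern in patterns)
        if not matched:
            missing.append(group)
    return missing


if __name__ == "__main__":
    # Hypothetical directory listing with no weight files.
    print(missing_groups(["config.json", "tokenizer.json"]))
```

You could feed it the output of `os.listdir()` on the local model directory before copying the directory to Cloud Storage.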

Locations

You can deploy custom models to all regions through the Model Garden service.

Prerequisites

This section covers the setup required before you deploy a custom model.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

This tutorial assumes that you are using Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, perform the following additional configuration:

Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:
```
gcloud init
```

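Deploy a custom model

This section demonstrates how to deploy your custom model.

If you're using the command-line interface (CLI), Python, or JavaScript, replace the following variables with values that make the code samples work:

REGION: Your region. For example, us-central1.
MODEL_GCS: The Cloud Storage URI of your model artifacts. For example, gs://custom-weights-fishfooding/meta-llama/Llama-3.2-1B-Instruct.
PROJECT_ID: Your project ID.
MODEL_ID: Your model ID.
MACHINE_TYPE: Your machine type. For example, g2-standard-12.
ACCELERATOR_TYPE: Your accelerator type. For example, NVIDIA_L4.
ACCELERATOR_COUNT: Your accelerator count.
PROMPT: Your text prompt.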
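
Mistyped values, such as a region without its hyphen, tend to surface only when a request fails. The following sketch shows one way to sanity-check a few of the variables first. The function `check_settings` and the sample bucket path are hypothetical, and the region pattern is a heuristic based on names such as us-central1, not an official rule.

```python
import re

# Assumption: region names are lowercase words joined by a hyphen and
# ending in digits, like "us-central1". This is a heuristic only.
REGION_PATTERN = re.compile(r"^[a-z]+-[a-z]+[0-9]+$")


def check_settings(region, model_gcs, accelerator_count):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    if not REGION_PATTERN.match(region):
        problems.append(f"REGION {region!r} does not look like a region name")
    if not model_gcs.startswith("gs://") or model_gcs == "gs://":
        problems.append(f"MODEL_GCS {model_gcs!r} is not a gs:// URI")
    if accelerator_count < 1:
        problems.append("ACCELERATOR_COUNT must be a positive integer")
    return problems


if __name__ == "__main__":
    # Hypothetical values; prints an empty list because all checks pass.
    print(check_settings("us-central1", "gs://my-bucket/my-model", 1))
```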

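Console

The following steps show you how to use the Google Cloud console to deploy your model with custom weights.

In the Google Cloud console, go to the [Model Garden] page.

Go to Model Garden
Click [Deploy model with custom weights]. The [Deploy a model with custom weights on Vertex AI] pane appears.
In the [Model source] section, do the following:
1. Click [Browse], choose the bucket where your model is stored, and click [Select].
2. Optional: Enter a name for your model in the [Model name] field.
In the [Deployment settings] section, do the following:
1. In the [Region] field, select your region, and click [OK].
2. In the [Machine spec] field, select the machine specification to use for deploying your model.
3. Optional: The [Endpoint name] field shows your model's endpoint by default. You can enter a different endpoint name in the field.
Click [Deploy model with custom weights].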

gcloud CLI

This command demonstrates how to deploy the model to a specific region.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --region ${REGION}

This command demonstrates how to deploy the model to a specific region with a machine type, accelerator type, and accelerator count. If you want to select a specific machine configuration, you must set all three fields.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --machine-type=${MACHINE_TYPE} --accelerator-type=${ACCELERATOR_TYPE} --accelerator-count=${ACCELERATOR_COUNT} --region ${REGION}
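
When the command is assembled by a script, the all-three-fields rule is easy to break by accident. The sketch below builds the argument list in Python and rejects a partial machine configuration. The function name `build_deploy_args` is hypothetical, and only the flags shown in the commands above are used.

```python
def build_deploy_args(model_gcs, region, machine_type=None,
                      accelerator_type=None, accelerator_count=None):
    """Build the gcloud argument list for a custom-weights deployment.

    Raises ValueError when only some of the three machine fields are set,
    mirroring the rule that they must be set together.
    """
    machine_fields = [machine_type, accelerator_type, accelerator_count]
    provided = [field is not None for field in machine_fields]
    if any(provided) and not all(provided):
        raise ValueError(
            "Set machine_type, accelerator_type, and accelerator_count "
            "together, or leave all three unset."
        )

    args = [
        "gcloud", "ai", "model-garden", "models", "deploy",
        f"--model={model_gcs}",
        f"--region={region}",
    ]
    if all(provided):
        args += [
            f"--machine-type={machine_type}",
            f"--accelerator-type={accelerator_type}",
            f"--accelerator-count={accelerator_count}",
        ]
    return args


if __name__ == "__main__":
    # Hypothetical bucket path; the machine values are the examples above.
    print(" ".join(build_deploy_args(
        "gs://my-bucket/my-model", "us-central1",
        "g2-standard-12", "NVIDIA_L4", 1)))
```

The returned list can be passed to `subprocess.run(args, check=True)` on a machine where the gcloud CLI is installed and authenticated.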

Python

import vertexai
from vertexai.preview import model_garden

# Replace each ${...} placeholder with your own value before running.
vertexai.init(project="${PROJECT_ID}", location="${REGION}")
custom_model = model_garden.CustomModel(
  gcs_uri="${MODEL_GCS}",
)
endpoint = custom_model.deploy(
  machine_type="${MACHINE_TYPE}",
  accelerator_type="${ACCELERATOR_TYPE}",
  accelerator_count=int("${ACCELERATOR_COUNT}"),  # must be an integer
  model_display_name="custom-model",
  endpoint_display_name="custom-model-endpoint")

# Send a test prompt to the newly created dedicated endpoint.
endpoint.predict(instances=[{"prompt": "${PROMPT}"}], use_dedicated_endpoint=True)

Alternatively, you can call the custom_model.deploy() method without passing any parameters.

import vertexai
from vertexai.preview import model_garden

# Replace each ${...} placeholder with your own value before running.
vertexai.init(project="${PROJECT_ID}", location="${REGION}")
custom_model = model_garden.CustomModel(
  gcs_uri="${MODEL_GCS}",
)
# With no arguments, the machine configuration is left to the service.
endpoint = custom_model.deploy()

endpoint.predict(instances=[{"prompt": "${PROMPT}"}], use_dedicated_endpoint=True)

curl


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    }
  }'

Alternatively, you can use the API to explicitly set the machine type.


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    },
    "deploy_config": {
      "dedicated_resources": {
        "machine_spec": {
          "machine_type": "'"${MACHINE_TYPE}"'",
          "accelerator_type": "'"${ACCELERATOR_TYPE}"'",
          "accelerator_count": '"${ACCELERATOR_COUNT}"'
        },
        "min_replica_count": 1
      }
    }
  }'
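
Hand-quoted JSON inside a shell string is easy to get subtly wrong, for example by leaving a trailing comma. As a hedged alternative, the sketch below builds the same deploy URL and request body in Python and serializes it with `json.dumps`. The function `build_deploy_request` and its `machine_spec` argument are hypothetical, the field names simply mirror the curl samples above, and sending the request (with an access token) is left out.

```python
import json


def build_deploy_request(project_id, region, model_gcs, model_id,
                         machine_spec=None):
    """Return (url, json_body) for the custom-weights deploy call.

    machine_spec is an optional dict with machine_type, accelerator_type,
    and accelerator_count, matching the second curl sample.
    """
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{region}:deploy"
    )
    body = {
        "custom_model": {"gcs_uri": model_gcs},
        "destination": f"projects/{project_id}/locations/{region}",
        "model_config": {"model_user_id": model_id},
    }
    if machine_spec is not None:
        body["deploy_config"] = {
            "dedicated_resources": {
                "machine_spec": dict(machine_spec),
                "min_replica_count": 1,
            }
        }
    return url, json.dumps(body, indent=2)


if __name__ == "__main__":
    # Hypothetical project, bucket, and model ID.
    url, payload = build_deploy_request(
        "my-project", "us-central1", "gs://my-bucket/my-model", "my-model",
        machine_spec={"machine_type": "g2-standard-12",
                      "accelerator_type": "NVIDIA_L4",
                      "accelerator_count": 1},
    )
    print(url)
    print(payload)
```

The serialized body can be saved to a file and passed to curl with `-d @file.json`, which avoids the nested shell quoting entirely.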

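Make a query

After your model is deployed, custom weights support the public dedicated endpoint. You can send queries using the API or the SDK.

Before sending queries, you must get your endpoint URL, endpoint ID, and model ID from the Google Cloud console.

Follow these steps to get the information:

In the Google Cloud console, go to the [Model Garden] page.

Go to Model Garden
Click [View my endpoints & models].
Select your region from the [Region] list.
To get the endpoint ID and endpoint URL, click your endpoint in the [My endpoints] section.

Your endpoint ID is displayed in the [Endpoint ID] field.

Your public endpoint URL is displayed in the [Dedicated endpoint] field.
To get the model ID, find your model in the [Deployed models] section, and do the following:
1. Click your deployed model's name in the [Model] field.
2. Click [Version details]. Your model ID is displayed in the [Model ID] field.

After you get your endpoint and deployed model information, see the following code samples for how to send an inference request, or see Send an online inference request to a dedicated public endpoint.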

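API

The following code samples demonstrate different ways to use the API based on your use case.

Chat completion (unary)

This sample request sends a complete chat message to the model and gets the response in a single chunk after the entire response has been generated. This is similar to sending a text message and getting a single full reply.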

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}/chat/completions" \
    -d '{
    "model": "'"${MODEL_ID}"'",
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 154,
    "ignore_eos": true,
    "messages": [
      {
        "role": "user",
        "content": "How to tell the time by looking at the sky?"
      }
    ]
  }'

Chat completion (streaming)

This request is the streaming version of the unary chat completion request. When you add "stream": true to the request, the model sends its response piece by piece as it's generated. This is useful for creating a real-time, typewriter-like effect in a chat application.

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}/chat/completions" \
    -d '{
    "model": "'"${MODEL_ID}"'",
    "stream": true,
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 154,
    "ignore_eos": true,
    "messages": [
      {
        "role": "user",
        "content": "How to tell the time by looking at the sky?"
      }
    ]
  }'
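
A client has to stitch the streamed pieces back together. The sketch below shows one way to do that, under the assumption that the endpoint emits OpenAI-style server-sent events: lines that start with `data:`, each carrying a JSON chunk with `choices[].delta.content`, and a final `data: [DONE]` line. This page does not confirm that wire format, so check an actual response before relying on it. The function name `collect_stream_text` is hypothetical.

```python
import json


def collect_stream_text(lines):
    """Join the text fragments from an iterable of streamed lines.

    Assumes OpenAI-style server-sent events; other line types are ignored.
    """
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
    return "".join(parts)


if __name__ == "__main__":
    # Hand-written sample lines, not captured from a real endpoint.
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Look at "}}]}',
        'data: {"choices": [{"delta": {"content": "the sun."}}]}',
        "data: [DONE]",
    ]
    print(collect_stream_text(sample))
```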

Predict

This request sends a direct prompt to get an inference from the model. It is often used for tasks that aren't necessarily conversational, such as text summarization or classification.

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:predict" \
    -d '{
    "instances": [
      {
        "prompt": "How to tell the time by looking at the sky?",
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 154,
        "ignore_eos": true
      }
    ]
  }'

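Raw predict

This request is the streaming version of the Predict request. By using the :streamRawPredict endpoint and including "stream": true, it sends a direct prompt and receives the model's output as a continuous stream of data as it's generated, similar to the streaming chat completion request.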

  curl -X POST \
    -N \
    --output - \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:streamRawPredict" \
    -d '{
    "instances": [
      {
        "prompt": "How to tell the time by looking at the sky?",
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 154,
        "ignore_eos": true,
        "stream": true
      }
    ]
  }'
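
The four requests above share one base path and differ only in the suffix. The sketch below collects the three distinct URLs in one place so that a script can switch between request types. The function `endpoint_request_urls` and the sample values are hypothetical, and the path layout is copied from the curl samples.

```python
def endpoint_request_urls(endpoint_url, project_id, location, endpoint_id):
    """Return the request URLs used by the curl samples, keyed by purpose."""
    base = (
        f"https://{endpoint_url}/v1beta1/projects/{project_id}"
        f"/locations/{location}/endpoints/{endpoint_id}"
    )
    return {
        "chat_completions": f"{base}/chat/completions",  # unary and streaming
        "predict": f"{base}:predict",
        "stream_raw_predict": f"{base}:streamRawPredict",
    }


if __name__ == "__main__":
    # Placeholder host and IDs; use the values from the console instead.
    urls = endpoint_request_urls(
        "dedicated-endpoint.example", "my-project", "us-central1", "1234567890")
    for name, url in urls.items():
        print(name, url)
```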

SDK

This code sample uses the SDK to send a query to a model and get a response back from that model.

from google.cloud import aiplatform

project_id = ""
location = ""
endpoint_id = "" # Use the short ID here

aiplatform.init(project=project_id, location=location)

endpoint = aiplatform.Endpoint(endpoint_id)

prompt = "How to tell the time by looking at the sky?"
instances=[{"text": prompt}]
response = endpoint.predict(instances=instances, use_dedicated_endpoint=True)
print(response.predictions)

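For another example of how to use the API, see the Import Custom Weights notebook.

Learn more about self-deployed models in Vertex AI

For more information about self-deployed models, see Overview of self-deployed models.
For more information about Model Garden, see Overview of Model Garden.
For more information about deploying models, see Use models in Model Garden.
Use Gemma open models
Use Llama open models
Use Hugging Face open models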