Deploy models with custom weights

This guide shows you how to deploy models with custom weights on Vertex AI. It covers the supported base models, the required model files, and the methods you can use to deploy.

Supported models

You can deploy custom weights for the following base models:

Model family Supported versions
Llama
  • Llama-2: 7B, 13B
  • Llama-3.1: 8B, 70B
  • Llama-3.2: 1B, 3B
  • Llama-4: Scout-17B, Maverick-17B
  • CodeLlama-13B
Gemma
  • Gemma-2: 27B
  • Gemma-3: 1B, 4B, 12B, 27B
  • MedGemma: 4B, 27B-text
Qwen
  • Qwen2: 1.5B
  • Qwen2.5: 0.5B, 1.5B, 7B, 32B
  • Qwen3: 0.6B, 1.7B, 8B, 32B, Qwen3-Coder-480B-A35B-Instruct
Deepseek
  • Deepseek-R1
  • Deepseek-V3
Mistral and Mixtral
  • Mistral-7B-v0.1
  • Mixtral-8x7B-v0.1
  • Mistral-Nemo-Base-2407
Phi-4
  • Phi-4-reasoning
OpenAI OSS
  • gpt-oss: 20B, 120B

Limitations

You can't import quantized models as custom weights.

Model files

You must provide the model files in the Hugging Face weights format. For more information on the Hugging Face weights format, see Use Hugging Face Models.

If the required files aren't provided, the model deployment might fail.

This table lists the types of model files, which depend on the model's architecture:

Model file content File name(s)
Model configuration
  • config.json
Model weights
  • *.safetensors
  • *.bin
Weights index
  • *.index.json
Tokenizer file(s)
  • tokenizer.model
  • tokenizer.json
  • tokenizer_config.json
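
Before you deploy, you can verify that these files are present in your Cloud Storage bucket. A quick check, assuming your weights are already uploaded (the bucket and path are placeholder values):

# List the model directory; the output should include config.json, the
# *.safetensors or *.bin weights, the weights index, and the tokenizer files.
gcloud storage ls gs://your-bucket/your-model/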

Locations

You can deploy custom models in all regions where Model Garden is available.

Prerequisites

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI API. You can also enable the API from the command line, as shown in the sketch after this list.

    Enable the API

  5. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
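
As an alternative to the Enable the API button in step 4, you can enable the Vertex AI API from Cloud Shell. A minimal sketch, assuming the gcloud CLI is authenticated and PROJECT_ID is your project ID:

# Enable the Vertex AI API; this is a no-op if the API is already enabled.
gcloud services enable aiplatform.googleapis.com --project=PROJECT_ID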

This tutorial uses Cloud Shell to interact with Google Cloud. If you use a shell other than Cloud Shell, perform the following additional configuration:

  1. Install the Google Cloud CLI.

  2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  3. To initialize the gcloud CLI, run the following command:

    gcloud init
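
After you initialize the gcloud CLI, you can optionally set a default project and region so that later commands don't require explicit flags. A short sketch; the values shown are examples:

# Set defaults that subsequent gcloud commands, including gcloud ai, pick up.
gcloud config set project PROJECT_ID
gcloud config set ai/region us-central1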

Deploy the custom model

You can deploy your custom model by using any of the following methods:

  • Google Cloud console: A graphical user interface that guides you through the deployment process. Best for quick deployments, visual confirmation of settings, and users who prefer a UI over command-line tools.
  • gcloud CLI: A command-line tool for managing Google Cloud resources that allows for scripted, repeatable deployments. Ideal for developers and administrators who work in the terminal and want to automate deployment tasks.
  • Python: The Vertex AI SDK for Python lets you programmatically deploy and manage models within your applications or notebooks. Suitable for integrating model deployment into a larger Python-based MLOps workflow or application.
  • curl (REST API): Make direct HTTP requests to the Vertex AI API for the most control over the deployment configuration. Useful for developers using languages other than Python or for environments where installing SDKs is not feasible.

If you're using the gcloud CLI, Python, or curl, replace the following variables in your code samples:

  • REGION: Your region. For example, us-central1.
  • MODEL_GCS: The Cloud Storage URI of your model weights. For example, gs://custom-weights-fishfooding/meta-llama/Llama-3.2-1B-Instruct.
  • PROJECT_ID: Your project ID.
  • MODEL_ID: Your model ID.
  • MACHINE_TYPE: Your machine type. For example, g2-standard-12.
  • ACCELERATOR_TYPE: Your accelerator type. For example, NVIDIA_L4.
  • ACCELERATOR_COUNT: Your accelerator count.
  • PROMPT: Your text prompt.
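
If you work in a shell, you can export these values once so that the gcloud and curl samples below pick them up. A sketch; every value is a placeholder to replace with your own:

# Placeholder values -- replace each one before running the samples.
export REGION="us-central1"
export MODEL_GCS="gs://your-bucket/your-model"
export PROJECT_ID="your-project-id"
export MODEL_ID="my-custom-model"
export MACHINE_TYPE="g2-standard-12"
export ACCELERATOR_TYPE="NVIDIA_L4"
export ACCELERATOR_COUNT=1
export PROMPT="What is machine learning?"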

Console

To deploy your model with custom weights using the Google Cloud console, follow these steps:

  1. In the Google Cloud console, go to the Model Garden page.

    Go to Model Garden

  2. Click Deploy model with custom weights. The Deploy a model with custom weights on Vertex AI pane appears.

  3. In the Model source section, do the following:

    1. Click Browse, select the bucket that contains your model, and click Select.

    2. Optional: Enter a name for your model in the Model name field.

  4. In the Deployment settings section, do the following:

    1. From the Region field, select your region, and click OK.

    2. In the Machine Spec field, select your machine specification.

    3. Optional: The Endpoint name field is populated with a default name. You can enter a different name.

  5. Click Deploy model with custom weights.

gcloud CLI

This command deploys the model to a specific region.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --region ${REGION}

This command deploys the model to a specific region with a specified machine type, accelerator type, and accelerator count. To specify a machine configuration, you must set all three fields.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --machine-type=${MACHINE_TYPE} --accelerator-type=${ACCELERATOR_TYPE} --accelerator-count=${ACCELERATOR_COUNT} --region ${REGION}
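
The deploy command starts a long-running operation. After it completes, you can confirm that an endpoint was created. A quick check, assuming you deployed to the same region:

# List Vertex AI endpoints in the region; the new deployment appears here.
gcloud ai endpoints list --region=${REGION}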

Python

import vertexai
from vertexai.preview import model_garden

# Replace the uppercase placeholders with your own values.
vertexai.init(project="PROJECT_ID", location="REGION")
custom_model = model_garden.CustomModel(
    gcs_uri="MODEL_GCS",
)
endpoint = custom_model.deploy(
    machine_type="MACHINE_TYPE",
    accelerator_type="ACCELERATOR_TYPE",
    accelerator_count=ACCELERATOR_COUNT,  # an integer, for example 1
    model_display_name="custom-model",
    endpoint_display_name="custom-model-endpoint",
)

endpoint.predict(instances=[{"prompt": "PROMPT"}], use_dedicated_endpoint=True)

Alternatively, you can call the custom_model.deploy() method without arguments to use the default machine configuration.

import vertexai
from vertexai.preview import model_garden

vertexai.init(project="PROJECT_ID", location="REGION")
custom_model = model_garden.CustomModel(
    gcs_uri="MODEL_GCS",
)
endpoint = custom_model.deploy()

endpoint.predict(instances=[{"prompt": "PROMPT"}], use_dedicated_endpoint=True)

curl


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    }
  }'

Alternatively, you can use the API to explicitly set the machine configuration.


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    },
    "deploy_config": {
      "dedicated_resources": {
        "machine_spec": {
          "machine_type": "'"${MACHINE_TYPE}"'",
          "accelerator_type": "'"${ACCELERATOR_TYPE}"'",
          "accelerator_count": '"${ACCELERATOR_COUNT}"'
        },
        "min_replica_count": 1
      }
    }
  }'
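
The :deploy call returns a long-running operation rather than a finished deployment. A sketch of polling it with curl, assuming OPERATION_NAME is the name field from the JSON response (of the form projects/.../locations/.../operations/...):

# Poll the operation; the deployment is complete when the response
# contains "done": true.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/OPERATION_NAME"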

What's next