Overview of self-deployed models

This document describes the different types of self-deployed models available in Model Garden and covers the following topics:

  • Choose a self-deployment option: Compare the available self-deployment options.
  • Self-deploy open models: Learn about open weight and open source models.
  • Self-deployed partner models: Learn about proprietary partner models that you purchase through Cloud Marketplace.
  • Deploy models with custom weights: Deploy your own fine-tuned versions of supported base models.

Choose a self-deployment option

The following table compares the self-deployment options available on Vertex AI.

Self-deploy open models
  • Description: Freely available models with public weights. You manage the deployment infrastructure.
  • Pros: High transparency, no model licensing cost, and portability.
  • Cons: You're responsible for all infrastructure costs and management.

Self-deployed partner models
  • Description: Proprietary models from third-party partners that you purchase and deploy through Cloud Marketplace.
  • Pros: Access to specialized, commercial-grade models, with partner support available.
  • Cons: Incurs model usage costs, weights can't be exported, and some platform limitations apply (for example, no VPC Service Controls support).

Deploy models with custom weights
  • Description: Your own fine-tuned versions of supported base models, deployed by providing custom model weights.
  • Pros: Maximum customization for your specific use case on your preferred infrastructure.
  • Cons: Requires you to prepare model files in a specific format and doesn't support quantized models during import.

In Model Garden, you can deploy and serve open, partner, and custom models on Vertex AI. Unlike serverless model-as-a-service (MaaS) offerings, self-deployed models run securely within your Google Cloud project and VPC network.

Self-deploy open models

Open models provide pretrained capabilities for various AI tasks, including Gemma models that excel in multimodal processing. An open model is freely available: you can publish its outputs and use the model anywhere, as long as you adhere to its licensing terms. Vertex AI offers both open (also known as open weight) and open source models.

When you use an open model with Vertex AI, your deployment runs on Vertex AI infrastructure. You can also run open models on other infrastructure by using frameworks such as PyTorch or JAX.

Open weight models

Many open models are open weight large language models (LLMs). A model's weights are the numerical values stored in its neural network architecture that represent the patterns and relationships learned from the training data. With open weight models, these pretrained parameters are released, which provides more transparency than models whose weights aren't public, and you can use the model for inference and tuning. However, details such as the original dataset, model architecture, and training code aren't always provided.

Open source models

Open weight models differ from open source AI models. Open weight models expose the weights, which are the core numerical representation of learned patterns, but they don't necessarily provide the full source code or training details. Publishing the weights offers a level of transparency that lets you understand a model's capabilities without having to build it yourself.

Self-deployed partner models

Model Garden helps you purchase and manage model licenses from partners who offer proprietary models as a self-deploy option. After you purchase access to a model from Cloud Marketplace, you can choose to deploy on on-demand hardware or use your Compute Engine reservations and committed use discounts to meet your budget requirements. You are charged for model usage and for the Vertex AI infrastructure that you use.

To request usage of a self-deployed partner model, find the relevant model in the Model Garden console, click Contact sales, and then complete the form. This action initiates contact with a Google Cloud sales representative.

For more information about deploying and using partner models, see Deploy a partner model and make prediction requests.

Considerations

Consider the following limitations when using self-deployed partner models:

  • Unlike with open models, you cannot export weights.
  • If you have VPC Service Controls set up for your project, you can't upload models, which prevents you from deploying partner models.
  • For endpoints, only the shared public endpoint type is supported.

Support for model-specific issues is provided by the partner. To contact a partner about model performance issues, use the contact details in the Support section of their Model Garden model card.

Deploy models with custom weights

You can fine-tune a supported base model from a predefined set and deploy the customized model in Vertex AI Model Garden. To deploy your custom model, you import custom weights by uploading the model artifacts to a Cloud Storage bucket in your project, as shown in the sketch that follows.
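
For example, here's a minimal sketch of staging the model artifacts in Cloud Storage with the gcloud CLI. The bucket name and local directory are illustrative assumptions; substitute your own values.

# Create a bucket for the model artifacts.
# "my-custom-weights" and "us-central1" are illustrative values.
gcloud storage buckets create gs://my-custom-weights --location=us-central1

# Upload the model directory (configuration, weights, and tokenizer files).
gcloud storage cp -r ./my-tuned-model gs://my-custom-weights/my-tuned-model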

Supported models

The public preview of deploying models with custom weights supports the following base models, listed by model family:

Llama
  • Llama-2: 7B, 13B
  • Llama-3.1: 8B, 70B
  • Llama-3.2: 1B, 3B
  • Llama-4: Scout-17B, Maverick-17B
  • CodeLlama-13B
Gemma
  • Gemma-2: 27B
  • Gemma-3: 1B, 4B, 12B, 27B
  • MedGemma: 4B, 27B-text
Qwen
  • Qwen2: 1.5B
  • Qwen2.5: 0.5B, 1.5B, 7B, 32B
  • Qwen3: 0.6B, 1.7B, 8B, 32B
DeepSeek
  • DeepSeek-R1
  • DeepSeek-V3
Mistral and Mixtral
  • Mistral-7B-v0.1
  • Mixtral-8x7B-v0.1
  • Mistral-Nemo-Base-2407
Phi-4
  • Phi-4-reasoning

Limitations

Custom weights don't support the import of quantized models.

Model files

You must supply the model files in the Hugging Face weights format. For more information, see Use Hugging Face Models.

If the required files aren't provided, the model deployment might fail.

The following list shows the required model files, grouped by content type. The exact files depend on the model's architecture:

Model configuration
  • config.json
Model weights
  • *.safetensors
  • *.bin
Weights index
  • *.index.json
Tokenizer file(s)
  • tokenizer.model
  • tokenizer.json
  • tokenizer_config.json
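
For example, a sharded safetensors checkpoint staged in Cloud Storage typically contains files like the following. The bucket path and shard file names are illustrative assumptions:

gs://my-custom-weights/my-tuned-model/
  config.json
  model-00001-of-00002.safetensors
  model-00002-of-00002.safetensors
  model.safetensors.index.json
  tokenizer.json
  tokenizer_config.json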

Locations

You can deploy custom models in all regions where Model Garden is available.

Prerequisites

Before you deploy your custom model, complete the following initial setup steps.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI API.

    Enable the API

  5. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

The following instructions use Cloud Shell. If you use a local development environment, you must authenticate to Google Cloud:

  1. Install the Google Cloud CLI.

  2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  3. To initialize the gcloud CLI, run the following command:

    gcloud init
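
If you run the Python sample locally, the Vertex AI SDK typically authenticates through Application Default Credentials (ADC). As a sketch, you can set up ADC with the following command:

    gcloud auth application-default login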

Deploy the custom model

If you're using the gcloud CLI, Python, or curl, replace the following variables in the code samples. An example of setting them as shell environment variables follows the list.

  • REGION: Your region (for example, us-central1).
  • MODEL_GCS: The Cloud Storage path to your model (for example, gs://custom-weights-fishfooding/meta-llama/Llama-3.2-1B-Instruct).
  • PROJECT_ID: Your project ID.
  • MODEL_ID: Your model ID.
  • MACHINE_TYPE: Your machine type (for example, g2-standard-12).
  • ACCELERATOR_TYPE: Your accelerator type (for example, NVIDIA_L4).
  • ACCELERATOR_COUNT: Your accelerator count.
  • PROMPT: Your text prompt.
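
For example, here's a minimal sketch of exporting these variables in Cloud Shell before you run the gcloud CLI or curl samples. All values are illustrative assumptions:

# Illustrative values; replace each with your own.
export REGION="us-central1"
export PROJECT_ID="my-project"
export MODEL_GCS="gs://my-custom-weights/my-tuned-model"
export MODEL_ID="my-tuned-model"
export MACHINE_TYPE="g2-standard-12"
export ACCELERATOR_TYPE="NVIDIA_L4"
export ACCELERATOR_COUNT=1
export PROMPT="What is Vertex AI?"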

Console

The following steps show you how to use the Google Cloud console to deploy your model with custom weights.

  1. In the Google Cloud console, go to the Model Garden page.

    Go to Model Garden

  2. Click Deploy model with custom weights. The Deploy a model with custom weights on Vertex AI pane appears.

  3. In the Model source section, do the following:

    1. Click Browse, select the bucket where your model is stored, and click Select.

    2. Optional: In the Model name field, enter a name for your model.

  4. In the Deployment settings section, do the following:

    1. From the Region list, select your region.

    2. In the Machine Spec field, select the machine specification to use for deploying your model.

    3. Optional: In the Endpoint name field, you can change the default endpoint name.

  5. Click Deploy model with custom weights.

gcloud CLI

This command demonstrates how to deploy the model to a specific region.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --region ${REGION}

This command demonstrates how to deploy the model to a specific region with its machine type, accelerator type, and accelerator count. To select a specific machine configuration, you must set all three fields.

gcloud ai model-garden models deploy --model=${MODEL_GCS} --machine-type=${MACHINE_TYPE} --accelerator-type=${ACCELERATOR_TYPE} --accelerator-count=${ACCELERATOR_COUNT} --region ${REGION}
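
After the deployment completes, one way to confirm it is to list the Vertex AI endpoints in the region. This sketch uses the standard gcloud ai endpoints list command:

gcloud ai endpoints list --region=${REGION} --project=${PROJECT_ID}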

Python

import vertexai
from vertexai.preview import model_garden

# Replace PROJECT_ID, REGION, MODEL_GCS, and PROMPT with your values.
vertexai.init(project=PROJECT_ID, location=REGION)

# Reference the custom weights that you uploaded to Cloud Storage.
custom_model = model_garden.CustomModel(
    gcs_uri=MODEL_GCS,
)

# Deploy to an endpoint with an explicit machine configuration.
endpoint = custom_model.deploy(
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    model_display_name="custom-model",
    endpoint_display_name="custom-model-endpoint",
)

# Send a test prompt to the deployed model.
response = endpoint.predict(
    instances=[{"prompt": PROMPT}], use_dedicated_endpoint=True
)
print(response.predictions)

Alternatively, you can call the custom_model.deploy() method without arguments to use default settings.

import vertexai
from vertexai.preview import model_garden

vertexai.init(project=PROJECT_ID, location=REGION)

custom_model = model_garden.CustomModel(
    gcs_uri=MODEL_GCS,
)

# Deploy with default machine settings chosen by Model Garden.
endpoint = custom_model.deploy()

response = endpoint.predict(
    instances=[{"prompt": PROMPT}], use_dedicated_endpoint=True
)
print(response.predictions)

curl


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    }
  }'
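
The deploy call returns a long-running operation. As a sketch, you can check its status with a GET request to the operation name from the response; OPERATION_NAME here is a placeholder of the form projects/PROJECT_ID/locations/REGION/operations/OPERATION_ID:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/OPERATION_NAME"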

Alternatively, you can use the API to explicitly set the machine type.


curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:deploy" \
  -d '{
    "custom_model": {
      "gcs_uri": "'"${MODEL_GCS}"'"
    },
    "destination": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'",
    "model_config": {
      "model_user_id": "'"${MODEL_ID}"'"
    },
    "deploy_config": {
      "dedicated_resources": {
        "machine_spec": {
          "machine_type": "'"${MACHINE_TYPE}"'",
          "accelerator_type": "'"${ACCELERATOR_TYPE}"'",
          "accelerator_count": '"${ACCELERATOR_COUNT}"'
        },
        "min_replica_count": 1
      }
    }
  }'
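
After the deployment completes, you can send a prediction request with the PROMPT value. This is a sketch that uses the standard Vertex AI predict method; ENDPOINT_ID is a placeholder for the endpoint that the deployment created, and if you deployed to a dedicated endpoint, the request must target the endpoint's dedicated DNS name instead:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/ENDPOINT_ID:predict" \
  -d '{
    "instances": [{"prompt": "'"${PROMPT}"'"}]
  }'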

What's next