Llama models

Llama models on Vertex AI are offered as fully managed, serverless APIs. To use a Llama model on Vertex AI, send a request directly to the Vertex AI API endpoint. Because Llama models use a managed API, there's no need to provision or manage infrastructure.

You can stream responses to reduce perceived end-user latency. A streamed response uses server-sent events (SSE) to return the response incrementally.

There are no charges during the Preview period. If you require a production-ready service, use the self-hosted Llama models.

Available Llama 3.1 models

Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

The following Llama models are available from Meta to use in Vertex AI. To access a Llama model, go to its Model Garden model card.

Llama 3.1 405B

Llama 3.1 405B is Meta's most powerful and versatile model to date. It's the largest openly available foundation model, with capabilities that range from synthetic data generation and model distillation to steerability, math, tool use, multilingual translation, and more. For more information, see Meta's Llama 3.1 site.

Llama 3.1 405B is optimized for the following use cases:

  • Enterprise-level applications
  • Research and development
  • Synthetic data generation and model distillation
Go to the Llama 3.1 405B model card

Use Llama models

When you send requests to Llama models, use the following model names:

  • For Llama 3.1 405B, use llama3-405b-instruct-maas.
  • We recommend that you use a model version that includes a suffix starting with the @ symbol, because model versions can differ. If you don't specify a model version, the latest version is always used, which can inadvertently affect your workflows when the model version changes (see the sketch after this list).
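
For illustration, the following is a minimal sketch of how the model name appears in a request body, with and without a pinned version. The @&lt;version&gt; suffix shown is a placeholder, not a real version identifier; the available versions are listed on the model card in Model Garden.

    # Hypothetical sketch (Python): choosing between the latest and a pinned model version.
    # "@<version>" is a placeholder; look up the actual version suffixes on the model card.
    request_body = {
        "model": "meta/llama3-405b-instruct-maas",              # latest version is used
        # "model": "meta/llama3-405b-instruct-maas@<version>",  # pin a specific version
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 128,
        "stream": False,
    }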

    Before you begin

    To use Llama models with Vertex AI, you must perform the following steps. The Vertex AI API (aiplatform.googleapis.com) must be enabled. If you already have a project with the Vertex AI API enabled, you can use that project instead of creating a new one.

    Make sure you have the required permissions to enable and use partner models. For more information, see Grant the required permissions.

    1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
    2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

      Go to project selector

    3. Make sure that billing is enabled for your Google Cloud project.

    4. Enable the Vertex AI API.

      Enable the API

    5. Go to the Model Garden model card for the Llama model that you want to use, then click Enable.

    Make a streaming call to a Llama model

    The following sample makes a streaming call to a Llama model.

    REST

    After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

    Before using any of the request data, make the following replacements:

    • LOCATION: A region that supports Llama models.
    • MODEL: The model name you want to use.
    • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
    • CONTENT: The content, such as text, of the user or assistant message.
    • MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

      Specify a lower value for shorter responses and a higher value for potentially longer responses.

    • STREAM: A boolean that specifies whether the response is streamed. Streaming reduces perceived end-user latency. Set to true to stream the response and false to return the response all at once.

    HTTP method and URL:

    POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

    Request JSON body:

    {
      "model": "meta/MODEL",
      "messages": [
        {
          "role": "ROLE",
          "content": "CONTENT"
        }
      ],
      "max_tokens": MAX_OUTPUT_TOKENS,
      "stream": true
    }
    

    To send your request, choose one of these options:

    curl

    Save the request body in a file named request.json, and execute the following command:

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

    PowerShell

    Save the request body in a file named request.json, and execute the following command:

    $cred = gcloud auth print-access-token
    $headers = @{ "Authorization" = "Bearer $cred" }

    Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

    The response is streamed back incrementally as server-sent events, each containing a JSON chunk of the generated text.
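
    If you prefer to call the endpoint from Python rather than curl or PowerShell, the following is a minimal streaming sketch. It assumes the requests and google-auth packages are installed, that application-default credentials are configured, and that the endpoint streams OpenAI-style server-sent events (data: lines ending with a [DONE] marker); adjust the project, region, and model values for your environment.

    import json

    import google.auth
    import google.auth.transport.requests
    import requests

    LOCATION = "us-central1"          # a region that supports Llama models
    PROJECT_ID = "your-project-id"    # replace with your project ID
    MODEL = "meta/llama3-405b-instruct-maas"

    # Obtain an OAuth 2.0 access token from application-default credentials
    # (equivalent to `gcloud auth print-access-token` in the curl sample).
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    credentials.refresh(google.auth.transport.requests.Request())

    url = (f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/"
           f"{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi/chat/completions")

    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 128,
        "stream": True,
    }

    with requests.post(url, json=body, stream=True, headers={
            "Authorization": f"Bearer {credentials.token}",
            "Content-Type": "application/json; charset=utf-8"}) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # Each server-sent event is a line of the form "data: {...}".
            if not line or not line.startswith("data: "):
                continue
            payload = line[len("data: "):]
            if payload.strip() == "[DONE]":
                break
            chunk = json.loads(payload)
            # Each chunk carries an incremental "delta" with newly generated text.
            if chunk.get("choices"):
                print(chunk["choices"][0]["delta"].get("content", ""),
                      end="", flush=True)
    print()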

    Make a unary call to a Llama model

    The following sample makes a unary call to a Llama model.

    REST

    After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

    Before using any of the request data, make the following replacements:

    • LOCATION: A region that supports Llama models.
    • MODEL: The model name you want to use.
    • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
    • CONTENT: The content, such as text, of the user or assistant message.
    • MAX_OUTPUT_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

      Specify a lower value for shorter responses and a higher value for potentially longer responses.

    • STREAM: A boolean that specifies whether the response is streamed. Streaming reduces perceived end-user latency. Set to true to stream the response and false to return the response all at once.

    HTTP method and URL:

    POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions

    Request JSON body:

    {
      "model": "meta/MODEL",
      "messages": [
        {
          "role": "ROLE",
          "content": "CONTENT"
        }
      ],
      "max_tokens": MAX_OUTPUT_TOKENS,
      "stream": false
    }
    

    To send your request, choose one of these options:

    curl

    Save the request body in a file named request.json, and execute the following command:

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions"

    PowerShell

    Save the request body in a file named request.json, and execute the following command:

    $cred = gcloud auth print-access-token
    $headers = @{ "Authorization" = "Bearer $cred" }

    Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/openapi/chat/completions" | Select-Object -Expand Content

    You should receive a single JSON response that contains the complete generated text.
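
    For a Python equivalent of the unary curl call, the following is a minimal sketch under the same assumptions as the streaming example above (requests and google-auth installed, application-default credentials configured, OpenAI-compatible response fields); adjust the project, region, and model values for your environment.

    import google.auth
    import google.auth.transport.requests
    import requests

    LOCATION = "us-central1"          # a region that supports Llama models
    PROJECT_ID = "your-project-id"    # replace with your project ID
    MODEL = "meta/llama3-405b-instruct-maas"

    # Obtain an OAuth 2.0 access token from application-default credentials.
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    credentials.refresh(google.auth.transport.requests.Request())

    url = (f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/"
           f"{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi/chat/completions")

    body = {
        "model": MODEL,
        "messages": [{"role": "user",
                      "content": "Explain model distillation in two sentences."}],
        "max_tokens": 256,
        "stream": False,   # unary: the full response is returned at once
    }

    resp = requests.post(url, json=body, headers={
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json; charset=utf-8"})
    resp.raise_for_status()

    # With "stream": false, the body is a single JSON object; the generated
    # text is expected in choices[0].message.content (OpenAI-compatible format).
    print(resp.json()["choices"][0]["message"]["content"])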

    Examples

    To see examples of using Llama models, run the following notebooks:

    • Use Llama Guard to safeguard LLM inputs and outputs. Open in Colab, GitHub, or Vertex AI Workbench.
    • Evaluate Llama 3.1 models by using automatic side-by-side (AutoSxS) evaluation. Open in Colab, GitHub, or Vertex AI Workbench.

    Llama model region availability and quotas

    For Llama models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM).

    The supported regions, default quotas, and maximum context length for each Llama model are listed in the following table:

    Llama 3.1 405B

    Region        Default quota   Supported context length
    us-central1   15 QPM          32,000 tokens
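
    Requests that exceed the per-region quota are typically rejected with an HTTP 429 error. The following is a minimal, hypothetical retry-with-backoff sketch in Python (the post_with_backoff helper is illustrative, not part of any Vertex AI library):

    import random
    import time

    import requests

    def post_with_backoff(url, headers, body, max_attempts=5):
        """Retry a request on HTTP 429 (quota exceeded) with exponential backoff."""
        for attempt in range(max_attempts):
            resp = requests.post(url, json=body, headers=headers)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp
            # Back off before retrying; jitter avoids synchronized retries.
            time.sleep(min(2 ** attempt + random.random(), 30))
        resp.raise_for_status()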

    If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.