Self-deployed Llama models

Llama is a collection of open models developed by Meta. You can fine-tune and deploy these models on Vertex AI. Llama offers pre-trained and instruction-tuned generative text and multimodal models.

This document describes the Llama models available on Vertex AI, including the following:

  • Llama 4: A family of powerful multimodal models that use a Mixture-of-Experts (MoE) architecture for efficient inference.
  • Llama 3.3: A high-performance, text-only instruction-tuned model.
  • Llama 3.2: Multimodal models designed for efficiency, on-device applications, and image reasoning.
  • Llama 3.1: A collection of multilingual text-only models optimized for dialogue.
  • Llama 3: Instruction-tuned text models optimized for dialogue use cases.
  • Llama 2: A collection of pre-trained and fine-tuned generative text models.
  • Code Llama: Models specialized for code synthesis, completion, and understanding.
  • Llama Guard 3: A safety model for classifying content based on a safety taxonomy.

To help you choose a Llama model for your use case, the following table compares the available model families.

Model Family | Description | Primary Use Case
Llama 4 | Multimodal models (text, image) with a Mixture-of-Experts (MoE) architecture. Includes Scout (long context) and Maverick (highest capability). | Advanced image analysis, visual Q&A, creative text generation, and reasoning over large documents or codebases.
Llama 3.3 | A 70B parameter, text-only, instruction-tuned model with enhanced performance for text applications. | High-performance text-only tasks where it can approach the performance of much larger models.
Llama 3.2 | Efficient multimodal models (text, image) designed for a range of applications, including on-device use cases. | Image reasoning, chart analysis, on-device summarization, and multilingual knowledge retrieval.
Llama 3.1 | Multilingual text-only models (8B, 70B, 405B) optimized for dialogue. | Multilingual dialogue and chat applications.
Llama 3 | Instruction-tuned text-only models optimized for dialogue. | General dialogue and chat applications.
Llama 2 | A collection of pre-trained and fine-tuned generative text models (7B to 70B). | General-purpose generative text tasks.
Code Llama | Text-to-code models based on Llama 2. | Code generation, completion, and debugging.
Llama Guard 3 | A safety model for classifying content against a risk taxonomy. Multilingual and enhanced over previous versions. | Content moderation and implementing safety layers for generative AI applications.

Llama 4

The Llama 4 family of models is a collection of multimodal models that use the Mixture-of-Experts (MoE) architecture. The MoE architecture allows models with large parameter counts to activate only a subset of parameters for any given input, which results in more efficient inference. Additionally, Llama 4 uses early fusion, which integrates text and vision information from the initial processing stages. This method helps Llama 4 models better understand complex relationships between text and images. Model Garden on Vertex AI offers two Llama 4 models: Llama 4 Scout and Llama 4 Maverick.

For more information, see the Llama 4 model card in Model Garden or view the Introducing Llama 4 on Vertex AI blog post.

Llama 4 Maverick

Llama 4 Maverick is the largest and most capable Llama 4 model. It performs well on coding, reasoning, and image benchmarks. It features 17 billion active parameters out of 400 billion total parameters with 128 experts. Llama 4 Maverick uses alternating dense and MoE layers, where each token activates a shared expert plus one of the 128 routed experts. You can use the model as a pre-trained (PT) model or an instruction-tuned (IT) model with FP8 support. The model is pre-trained on 200 languages and is optimized for high-quality chat interactions through a refined post-training pipeline.
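To make the routing described above more concrete, the following toy sketch shows one MoE layer in which every token passes through a shared expert and through a single routed expert chosen by a router. It is an illustration only, not Meta's implementation: the `ToyMoELayer` class, the softmax gate, and the use of simple per-dimension scale vectors in place of real feed-forward experts are all assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into gate weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMoELayer:
    """Hypothetical toy layer: one shared expert plus top-1 routing."""

    def __init__(self, dim, num_routed_experts, seed=0):
        rng = random.Random(seed)

        def random_vector():
            return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

        # Each "expert" is a scale vector standing in for a feed-forward network.
        self.shared_expert = random_vector()
        self.routed_experts = [random_vector() for _ in range(num_routed_experts)]
        # One router row per routed expert; it scores how well a token matches.
        self.router = [random_vector() for _ in range(num_routed_experts)]

    def forward(self, token):
        scores = [sum(w * x for w, x in zip(row, token)) for row in self.router]
        gates = softmax(scores)
        # Top-1 routing: only the best-scoring routed expert is evaluated.
        chosen = max(range(len(gates)), key=lambda i: gates[i])
        shared_out = [w * x for w, x in zip(self.shared_expert, token)]
        routed_out = [
            gates[chosen] * w * x
            for w, x in zip(self.routed_experts[chosen], token)
        ]
        return [s + r for s, r in zip(shared_out, routed_out)], chosen


layer = ToyMoELayer(dim=4, num_routed_experts=128)
output, chosen = layer.forward([0.5, -1.0, 2.0, 0.1])
print(f"routed expert {chosen} of {len(layer.routed_experts)}; experts evaluated: 2")
```

In this sketch, each token touches the weights of only two experts (the shared one and the chosen routed one), even though the layer stores 129. This is why the active parameter count can be much smaller than the total parameter count.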

Llama 4 Maverick is a multimodal model with a 1 million token context window. It is suited for use cases that require advanced intelligence and image understanding, such as the following:

  • Advanced image captioning, analysis, and precise image understanding
  • Visual Q&A
  • Creative text generation
  • General-purpose AI assistants
  • Sophisticated chatbots

Llama 4 Scout

Llama 4 Scout delivers strong performance for its size class. With a 10 million token context window, it performs well on several benchmarks compared to previous Llama generations and other open and proprietary models. It features 17 billion active parameters out of 109 billion total parameters with 16 experts and is available as a pre-trained (PT) model or an instruction-tuned (IT) model.
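As a rough worked example of what "active parameters" means, the following sketch computes the share of each Llama 4 model's parameters that is used for a single token, from the figures stated in this document. The `active_fraction` helper is hypothetical, and the ratio is only a proxy for per-token compute; it does not predict latency or memory use, because all parameters must still be loaded to serve the model.

```python
def active_fraction(active_billion, total_billion):
    """Return the share of parameters that is active for one token."""
    return active_billion / total_billion


# (active, total) parameter counts in billions, as stated in this document.
LLAMA_4_PARAMETERS = {
    "Llama 4 Maverick": (17, 400),
    "Llama 4 Scout": (17, 109),
}

for name, (active, total) in LLAMA_4_PARAMETERS.items():
    share = active_fraction(active, total)
    print(f"{name}: {share:.2%} of parameters active per token")
```

Both models activate the same 17 billion parameters per token, so Maverick uses roughly 4% of its parameters per token and Scout roughly 16%.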

Llama 4 Scout is suited for tasks that require reasoning over large amounts of information, such as the following:

  • Retrieval tasks within long contexts
  • Summarizing multiple large documents
  • Analyzing extensive user interaction logs for personalization
  • Reasoning across large codebases
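When you plan a long-context task like the ones listed above, it can help to check up front whether your input fits in a model's context window. The following is a minimal sketch that uses the context window sizes stated in this document. The dictionary keys and function names are hypothetical labels for this example (not Model Garden model IDs), token counts are assumed to come from your own tokenizer, and the window that a given deployment actually serves may be smaller.

```python
# Context window sizes in tokens, as stated in this document.
# The keys are hypothetical labels, not Model Garden model IDs.
CONTEXT_WINDOW_TOKENS = {
    "llama-4-maverick": 1_000_000,
    "llama-4-scout": 10_000_000,
}


def fits_in_context(document_token_counts, model, reserved_output_tokens=0):
    """Check whether all documents plus reserved output fit in the window."""
    budget = CONTEXT_WINDOW_TOKENS[model] - reserved_output_tokens
    return sum(document_token_counts) <= budget


def models_that_fit(document_token_counts, reserved_output_tokens=0):
    """List the models whose context window can hold the whole input."""
    return [
        model
        for model in CONTEXT_WINDOW_TOKENS
        if fits_in_context(document_token_counts, model, reserved_output_tokens)
    ]


# Three large documents totaling 1,150,000 tokens, with room kept for output.
documents = [400_000, 450_000, 300_000]
print(models_that_fit(documents, reserved_output_tokens=8_000))
```

In this example, the combined input exceeds the 1 million token window but fits comfortably in the 10 million token window, so only the Scout label is returned.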

Llama 3.3

Llama 3.3 is a 70B parameter, text-only, instruction-tuned model. For text-only applications, it offers enhanced performance compared to Llama 3.1 70B and Llama 3.2 90B. For some applications, Llama 3.3 70B approaches the performance of Llama 3.1 405B.

For more information, see the Llama 3.3 model card in Model Garden.

Llama 3.2

With Llama 3.2 models, you can build and deploy generative AI applications that use capabilities such as image reasoning. The smaller Llama 3.2 models are also designed for on-device applications.

Key features of Llama 3.2 include the following:

  • On-device processing: The smaller models support on-device processing for a more private and personalized AI experience.
  • Efficiency: The models are designed for reduced latency and improved performance across a wide range of applications.
  • Llama Stack: The models use the Llama Stack, a standardized interface that simplifies building and deploying applications.
  • Vision support: A new model architecture integrates image encoder representations into the language model to support vision tasks.

The 1B and 3B models are lightweight text-only models that support on-device use cases such as multilingual local knowledge retrieval, summarization, and rewriting.

The 11B and 90B models are small and medium-sized multimodal models with image reasoning capabilities. For example, they can analyze visual data from charts to provide more accurate responses and extract details from images to generate text descriptions.

For more information, see the Llama 3.2 model card in Model Garden.

Considerations

When you use the 11B and 90B models, there are no restrictions for text-only prompts. However, if you include an image in your prompt, the image must be at the beginning of your prompt, and you can include only one image. You cannot, for example, include some text and then an image.
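The image rules above can be checked before you send a request. The following is a minimal sketch of such a check. The `Part` type and the `validate_prompt` function are hypothetical and not part of any Vertex AI SDK; they assume that a prompt is represented as an ordered list of text and image parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """Hypothetical prompt part: kind is "text" or "image"."""
    kind: str
    value: str


MULTIMODAL_SIZES = {"11B", "90B"}
TEXT_ONLY_SIZES = {"1B", "3B"}


def validate_prompt(parts, model_size):
    """Return a list of rule violations; an empty list means the prompt is OK."""
    if model_size not in MULTIMODAL_SIZES | TEXT_ONLY_SIZES:
        raise ValueError(f"Unknown Llama 3.2 model size: {model_size}")

    image_positions = [i for i, part in enumerate(parts) if part.kind == "image"]
    if not image_positions:
        # Text-only prompts have no restrictions.
        return []
    if model_size in TEXT_ONLY_SIZES:
        return [f"Llama 3.2 {model_size} is text-only; remove the image."]

    errors = []
    if len(image_positions) > 1:
        errors.append("Only one image is allowed per prompt.")
    if image_positions[0] != 0:
        errors.append("The image must be the first part of the prompt.")
    return errors


ok = [Part("image", "chart.png"), Part("text", "Summarize this chart.")]
bad = [Part("text", "Summarize this chart."), Part("image", "chart.png")]
print(validate_prompt(ok, "11B"))
print(validate_prompt(bad, "11B"))
```

The first prompt passes because its single image comes first. The second prompt is rejected because text precedes the image.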

Llama 3.1

The Llama 3.1 family of models is a collection of multilingual, pre-trained, and instruction-tuned generative text models available in 8B, 70B, and 405B sizes. The Llama 3.1 instruction-tuned models are optimized for multilingual dialogue use cases and perform well on common industry benchmarks compared to many available open-source and proprietary chat models.

For more information, see the Llama 3.1 model card in Model Garden.

Llama 3

The Llama 3 instruction-tuned models are a collection of LLMs optimized for dialogue use cases. Llama 3 models perform well on common industry benchmarks compared to many available open-source chat models.

For more information, see the Llama 3 model card in Model Garden.

Llama 2

The Llama 2 LLMs are a collection of pre-trained and fine-tuned generative text models, ranging in size from 7B to 70B parameters.

For more information, see the Llama 2 model card in Model Garden.

Code Llama

The Code Llama models from Meta are designed for code synthesis, code understanding, and instruction following.

For more information, see the Code Llama model card in Model Garden.

Llama Guard 3

Llama Guard 3 builds on the capabilities of Llama Guard 2, adding three new categories: Defamation, Elections, and Code Interpreter Abuse. Additionally, this model is multilingual and has a prompt format that is consistent with Llama 3 or later instruct models.
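If you use Llama Guard 3 as a safety layer, your application typically needs to interpret the model's verdict. The following sketch parses a response under two assumptions that you should verify against the model card: that the response starts with a line containing `safe` or `unsafe`, followed for unsafe content by a comma-separated list of category codes, and that the three new categories use the codes `S5`, `S13`, and `S14`. The `parse_guard_output` function is hypothetical.

```python
# Assumed codes for the three categories added in Llama Guard 3.
# Verify these against the model card before you rely on them.
NEW_CATEGORIES = {
    "S5": "Defamation",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}


def parse_guard_output(raw):
    """Parse an assumed "safe" / "unsafe\\nS1,S2" style response."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("Empty response")

    verdict = lines[0].lower()
    if verdict == "safe":
        return {"safe": True, "categories": []}
    if verdict != "unsafe":
        raise ValueError(f"Unexpected verdict: {lines[0]!r}")

    codes = []
    if len(lines) > 1:
        codes = [code.strip() for code in lines[1].split(",") if code.strip()]
    # Codes outside the three new categories are kept but left unlabeled.
    categories = [(code, NEW_CATEGORIES.get(code)) for code in codes]
    return {"safe": False, "categories": categories}


print(parse_guard_output("safe"))
print(parse_guard_output("\n\nunsafe\nS13,S14"))
```

Raising an error on an unexpected verdict, instead of treating it as safe, keeps the safety layer from failing open if the response format differs from what this sketch assumes.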

For more information, see the Llama Guard model card in Model Garden.

Resources

For more information about Model Garden, see Explore AI models in Model Garden.