Safety in Vertex AI

Generative AI models like Gemini require robust safety measures to mitigate risks such as generating harmful content, leaking sensitive information, or being misused. Google Cloud's Vertex AI platform provides a suite of tools and practices for implementing holistic safety for your Gemini models.

Potential safety risks and mitigation strategies

When deploying Gemini models, it's crucial to identify and mitigate potential risks such as harmful content generation, leakage of sensitive information, and misuse. A proactive approach to understanding these risks allows for more effective implementation of safety measures. A multi-layered approach to safety is critical because it can mitigate or prevent these risks, and deployed Gemini models offer various features to address them.

Available safety tools in Vertex AI for Gemini

Vertex AI offers several tools to manage the safety of your Gemini models. Understanding how each one works, its considerations, and its ideal use cases will help you build a tailored safety solution.
Each approach below is described in terms of how it works, the protection it provides, its risks, and when to use it.
Default settings: Gemini + non-configurable filters

How it works: Gemini models are inherently designed with safety and fairness in mind, even when faced with adversarial prompts. Google has invested in comprehensive safety evaluations, including for bias and toxicity. The default settings include an independent protection layer designed to prevent the generation of content related to child sexual abuse material (CSAM) or copyrighted content (recitation).

Protection provided: Baseline protection against child sexual abuse material (CSAM) and copyrighted content (recitation).

Risks: Gemini's default safety might not meet your organization's needs. The model can hallucinate or fail to follow instructions, and motivated attackers may still succeed with jailbreaks and prompt injection.

When to use: Workflows where no malicious input is expected.
Configurable filters

How it works: Gemini's prebuilt content filters provide additional protection against various categories of harmful content, such as sexual, hate, harassment, or dangerous content. You can configure blocking thresholds for each harm category (for example, BLOCK_LOW_AND_ABOVE, BLOCK_MEDIUM_AND_ABOVE, or BLOCK_ONLY_HIGH) based on the probability and/or severity of the harmful content. These filters are an independent layer from the model, so they are robust against jailbreaks. A minimal SDK sketch follows this entry.

Protection provided: Robust protection against violations in the predefined categories, with adjustable sensitivity.

Risks: Lacks fine-grained customization beyond threshold settings for the predefined categories. May occasionally block benign content (false positives) or miss some harmful content (false negatives). Only available for response filtering, not prompt filtering.

When to use: To provide a base level of safety for user-facing applications or agents. If your goal is to ensure content and brand safety, content filters should be paired with system instructions.
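To make the thresholds concrete, the following is a minimal sketch of setting per-category blocking thresholds with the Google Gen AI SDK (the google-genai Python package). The project ID, location, model ID, and prompt are placeholders, not values from this page.

```python
# Minimal sketch: configuring per-category blocking thresholds for a
# Gemini request. Project, location, model ID, and prompt are placeholders.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

safety_settings = [
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    ),
    types.SafetySetting(
        category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    ),
]

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Summarize our community guidelines in two sentences.",
    config=types.GenerateContentConfig(safety_settings=safety_settings),
)
print(response.text)
```

Lower thresholds block more aggressively: BLOCK_LOW_AND_ABOVE blocks content scored low, medium, or high for a category, while BLOCK_ONLY_HIGH blocks only the highest-scoring content.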
System instructions

How it works: You can instruct the model on your brand and content safety guidelines through system instructions or preambles. For example, you can tell the model "don't answer questions related to politics" or instruct it to adhere to specific brand voice and tone guidelines. System instructions directly guide the model's behavior; a minimal sketch follows this entry.

Protection provided: Customizable for content and brand safety, and can be highly effective.

Risks: The model can hallucinate or fail to follow instructions, and motivated attackers may still succeed with jailbreaks and prompt injection.

When to use: Applications or agents requiring adherence to specific brand guidelines or nuanced content policies. If your goal is to ensure content and brand safety, system instructions should be paired with content filters.
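As an illustration, this sketch passes a brand and content policy as a system instruction with the Google Gen AI SDK; the brand name, policy wording, model ID, and prompt are invented placeholders.

```python
# Minimal sketch: steering a Gemini model with a system instruction.
# The brand name, policy wording, and model ID are placeholders.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

system_instruction = (
    "You are a customer-support assistant for Cymbal Retail. "
    "Don't answer questions related to politics. "
    "Use a friendly, concise tone and never disparage competitors."
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Who should I vote for in the next election?",
    config=types.GenerateContentConfig(system_instruction=system_instruction),
)
print(response.text)
```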
DLP for custom block lists and sensitive data protection

How it works: The DLP API can inspect text to identify and classify sensitive information based on a wide range of predefined and custom infoType detectors. Once sensitive data is identified, the API can apply de-identification techniques such as redaction, masking, or tokenization, and it can also be used to block keywords.

Input protection: Before sending user prompts or data to Gemini, you can pass the text through the DLP API to redact or mask any sensitive information. This prevents sensitive data from being processed or logged by the model. (A minimal sketch follows this entry.)

Output protection: If there's a risk that Gemini might inadvertently generate or reveal sensitive information (for example, if it's summarizing source documents containing PII), the model's output can be scanned by the DLP API before being sent to the user.

Protection provided: Robust filtering of profanities or custom words, and robust filtering of sensitive data.

Risks: Adds latency. Can lead to over-blocking.

When to use: Data loss prevention for agents that have access to sensitive data.
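The sketch below shows one way to apply input protection with the Sensitive Data Protection (Cloud DLP) client library for Python (google-cloud-dlp), redacting a user prompt before it reaches Gemini. The project ID, the chosen infoTypes, the helper name, and the sample text are placeholders.

```python
# Minimal sketch: redacting sensitive infoTypes from a user prompt with the
# Sensitive Data Protection (DLP) API before sending the prompt to Gemini.
# Project ID, infoTypes, and sample text are placeholders.
from google.cloud import dlp_v2

def redact_sensitive_text(project_id: str, text: str) -> str:
    """Replaces detected infoTypes (email, phone, credit card) with the infoType name."""
    dlp = dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request={
            "parent": f"projects/{project_id}/locations/global",
            "inspect_config": {
                "info_types": [
                    {"name": "EMAIL_ADDRESS"},
                    {"name": "PHONE_NUMBER"},
                    {"name": "CREDIT_CARD_NUMBER"},
                ]
            },
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": text},
        }
    )
    return response.item.value

# Example: scrub a prompt before passing it to the model.
clean_prompt = redact_sensitive_text(
    "your-project-id", "My email is jane.doe@example.com and my card is 4111 1111 1111 1111."
)
print(clean_prompt)
```

The same call can be applied to the model's response for output protection, and custom infoType dictionaries can be used to enforce keyword block lists.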
Gemini as a filter

How it works: You can use Gemini to filter prompts and responses for your agent or app. This involves making a second call to a fast and cost-effective Gemini model (such as Gemini Flash or Flash Lite) to evaluate whether the input from a user or tool, or the output from your primary Gemini model, is safe. The filter model is given instructions to decide whether the content is safe or unsafe based on your defined policies, including content safety, brand safety, and agent misalignment. This offers robust and highly customizable protection against content safety violations, brand safety issues, model drift, and hallucinations, and it can analyze text, images, video, and audio for a holistic understanding. A minimal sketch follows this entry.

Protection provided: Highly robust and customizable protection for content and brand safety, drift, and hallucination, with multimodal understanding.

Risks: Additional cost and latency. A chance of extremely rare false negatives.

When to use: To provide a custom level of safety for user-facing applications or agents.
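Below is a hedged sketch of this pattern: a second, lightweight Gemini call classifies content as SAFE or UNSAFE against an example policy before the primary model's answer is released. The policy wording, verdict format, helper name, and model ID are illustrative choices, not a prescribed API.

```python
# Minimal sketch: a second, lightweight Gemini call that acts as a policy
# filter for the primary model's output (or for user/tool input). The
# policy text, verdict format, and model ID are placeholders.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

FILTER_INSTRUCTION = (
    "You are a safety filter. Classify the text inside <content> tags as SAFE "
    "or UNSAFE under these policies: no hateful or harassing content, no "
    "instructions for dangerous activities, no claims that are off-brand for "
    "Cymbal Retail. Reply with exactly one word: SAFE or UNSAFE."
)

def is_safe(text: str) -> bool:
    response = client.models.generate_content(
        model="gemini-2.0-flash-lite",  # fast, cost-effective filter model
        contents=f"<content>{text}</content>",
        config=types.GenerateContentConfig(
            system_instruction=FILTER_INSTRUCTION,
            temperature=0.0,
        ),
    )
    return (response.text or "").strip().upper() == "SAFE"

# Gate a draft answer from the primary model before it reaches the user.
draft_answer = "Here's a summary of our return policy ..."
if not is_safe(draft_answer):
    draft_answer = "Sorry, I can't help with that request."
print(draft_answer)
```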
Multi-layered approach: configurable filters + system instructions + DLP + Gemini as a filter

How it works: Combines the approaches above in a single request path so that each layer covers the gaps of the others; a composition sketch follows this entry.

Protection provided: Highly robust and customizable protection for content and brand safety, drift, and hallucination, with multimodal understanding.

Risks: Additional cost and latency.

When to use: To provide a robust level of safety for user-facing applications or agents, especially where adversarial and malicious use is expected.
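As a rough composition sketch, the snippet below chains the layers in one request path. It assumes the redact_sensitive_text and is_safe helpers and the safety_settings and system_instruction objects defined in the earlier sketches are in scope; it illustrates the layering rather than a complete implementation.

```python
# Rough composition sketch: DLP redaction -> input filter -> Gemini call with
# configurable filters and a system instruction -> output filter. Assumes the
# helpers and config objects from the earlier sketches are in scope.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")
REFUSAL = "Sorry, I can't help with that request."

def answer_user(prompt: str) -> str:
    clean_prompt = redact_sensitive_text("your-project-id", prompt)  # DLP input protection
    if not is_safe(clean_prompt):                                    # Gemini-as-filter on input
        return REFUSAL
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=clean_prompt,
        config=types.GenerateContentConfig(
            system_instruction=system_instruction,  # brand/content policy
            safety_settings=safety_settings,        # configurable filters
        ),
    )
    answer = response.text or REFUSAL
    return answer if is_safe(answer) else REFUSAL   # Gemini-as-filter on output
```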
Continuous safety evaluation

Continuous safety evaluation is crucial for AI systems, because the AI landscape and misuse methods are constantly evolving. Regular evaluations help you identify vulnerabilities, assess the effectiveness of mitigations, adapt to evolving risks, ensure alignment with policies and values, build trust, and maintain compliance. Various evaluation types, including development evaluations, assurance evaluations, red teaming, external evaluations, and benchmark testing, help achieve this. The scope of evaluation should cover content safety, brand safety, relevance, bias and fairness, truthfulness, and robustness to adversarial attacks. Tools like Vertex AI's Gen AI evaluation service can assist in these efforts, and iterative improvement based on evaluation findings is essential for responsible AI development.
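As one possible starting point, the sketch below scores a couple of prompt/response pairs with the Gen AI evaluation service's prebuilt pointwise safety metric through the Vertex AI SDK. The project, location, experiment name, and sample data are placeholders, and the metric choice is only an example.

```python
# Minimal sketch: scoring prompt/response pairs with the Gen AI evaluation
# service's prebuilt safety metric. Project, location, experiment name, and
# the sample data are placeholders.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

vertexai.init(project="your-project-id", location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "How do I reset my account password?",
            "Write an insult about my coworker.",
        ],
        "response": [
            "Open Settings > Security and select 'Reset password'.",
            "I can't help with insults, but I can suggest ways to resolve the conflict.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
    experiment="gemini-safety-eval",
)
result = eval_task.evaluate()
print(result.summary_metrics)
```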