Enhance Gemini model security with content filters and system instructions
Salah Ahmed
Senior Product Manager, Google Cloud
Anand Iyer
Group Product Manager, Google Cloud
As organizations rush to adopt generative AI-driven chatbots and agents, it’s important to reduce the risk of exposure to threat actors who try to coerce AI models into creating harmful content.
We want to highlight two powerful capabilities of Vertex AI that can help manage this risk — content filters and system instructions. Today, we’ll show how you can use them to ensure consistent and trustworthy interactions.
Content filters: Post-response defenses
By analyzing generated text and blocking responses that trigger specific criteria, content filters can help block the output of harmful content. They function independently from Gemini models as part of a layered defense against threat actors who attempt to jailbreak the model.
Gemini models on Vertex AI use two types of content filters:
- Non-configurable safety filters automatically block outputs containing prohibited content, such as child sexual abuse material (CSAM) and personally identifiable information (PII).
- Configurable content filters allow you to define blocking thresholds in four harm categories (hate speech, harassment, sexually explicit, and dangerous content), based on probability and severity scores. These filters are off by default, but you can configure them to match your needs.
It's important to note that, like any automated system, these filters can occasionally produce false positives, incorrectly flagging benign content. This can negatively impact user experience, particularly in conversational settings. System instructions (below) can help mitigate some of these limitations.
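If you call Gemini through the Vertex AI SDK for Python, configuring these thresholds looks roughly like the sketch below. The project ID, model name, and thresholds are illustrative placeholders, and enum names can vary slightly between SDK versions, so treat this as a starting point rather than a drop-in configuration.

```python
import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    SafetySetting,
)

# Illustrative project and location; replace with your own.
vertexai.init(project="your-project-id", location="us-central1")

# Blocking thresholds for the four configurable harm categories.
safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH,
    ),
]

model = GenerativeModel("gemini-1.5-pro", safety_settings=safety_settings)
response = model.generate_content("Tell me about your return policy.")

# A response blocked by a configurable filter carries a safety-related
# finish reason and per-category safety ratings; inspect them before
# surfacing the answer to the user.
candidate = response.candidates[0]
print(candidate.finish_reason, candidate.safety_ratings)
```

Lowering a threshold (for example, to BLOCK_LOW_AND_ABOVE) blocks more aggressively at the cost of more false positives; raising it does the opposite.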
System instructions: Proactive model steering for custom safety
System instructions for Gemini models in Vertex AI provide direct guidance to the model on how to behave and what type of content to generate. By providing specific instructions, you can proactively steer the model away from generating undesirable content to meet your organization’s unique needs.
You can craft system instructions to define content safety guidelines, such as prohibited and sensitive topics, and disclaimer language, as well as brand safety guidelines to ensure the model's outputs align with your brand's voice, tone, values, and target audience.
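As a rough sketch using the same Vertex AI Python SDK, system instructions are passed when you create the model. The guidelines below are purely illustrative; you would replace them with your own content and brand safety policies.

```python
from vertexai.generative_models import GenerativeModel

# Illustrative safety and brand guidance; tailor to your organization.
system_instruction = [
    "You are a support assistant for a home-electronics retailer.",
    "Only answer questions about our products, orders, and store policies.",
    "Do not provide medical, legal, or financial advice; politely decline and "
    "suggest the user consult a qualified professional.",
    "Never request or repeat personally identifiable information such as "
    "payment card numbers.",
    "Keep the tone friendly, concise, and consistent with our brand voice.",
]

model = GenerativeModel(
    "gemini-1.5-pro",
    system_instruction=system_instruction,
)

response = model.generate_content("Can you help me reset my smart thermostat?")
print(response.text)
```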
System instructions have the following advantages over content filters:
- You can define specific harms and topics you want to avoid, so you’re not restricted to a small set of categories.
- You can be prescriptive and detailed. For example, instead of just saying “avoid nudity,” you can define what you mean by nudity in your cultural context and outline allowed exceptions.
- You can iterate on instructions to meet your needs. For example, if you notice that the instruction “avoid dangerous content” leads to the model being excessively cautious or avoiding a wider range of topics than intended, you can make the instruction more specific, such as “don’t generate violent content” or “avoid discussion of illegal drug use.”
However, system instructions have the following limitations:
- They are theoretically more susceptible to zero-shot and other complex jailbreak techniques.
- They can cause the model to be overly cautious on borderline topics.
- In some situations, a complex system instruction for safety may inadvertently impact overall output quality.
Because each approach has different strengths and limitations, we recommend using content filters and system instructions together.
Evaluate your safety configuration
You can create your own evaluation sets, and test model performance with your specific configurations ahead of time. We recommend creating separate harmful and benign sets, so you can measure how effective your configuration is at catching harmful content and how often it incorrectly blocks benign content.
Investing in an evaluation set can help reduce the time it takes to test the model when implementing changes in the future.
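A lightweight way to do this is to run both sets through your configured model and track how often responses are blocked. The sketch below assumes the Vertex AI Python SDK and uses tiny illustrative prompt lists; real evaluation sets should be larger and curated, and the exact finish-reason values may vary by SDK version.

```python
from vertexai.generative_models import GenerativeModel

# Illustrative evaluation sets; real sets should be larger and curated.
harmful_prompts = [
    "<adversarial prompt from your red-team set>",
    "<another adversarial prompt>",
]
benign_prompts = [
    "How do I track my order?",
    "What is your return policy?",
]

def is_blocked(response) -> bool:
    """Heuristic check for whether a response was blocked for safety reasons."""
    if not response.candidates:
        return True  # the prompt itself was blocked before generation
    return response.candidates[0].finish_reason.name == "SAFETY"

def block_rate(model: GenerativeModel, prompts: list[str]) -> float:
    """Fraction of prompts whose responses were blocked."""
    return sum(is_blocked(model.generate_content(p)) for p in prompts) / len(prompts)

# Instantiate with the same safety settings and system instructions you plan to ship.
model = GenerativeModel("gemini-1.5-pro")

# A high block rate on the harmful set and a low block rate on the benign set
# indicate a well-balanced configuration.
print("harmful prompts blocked:", block_rate(model, harmful_prompts))
print("benign prompts blocked: ", block_rate(model, benign_prompts))
```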
How to get started
Both content filters and system instructions play a role in ensuring safe and responsible use of Gemini. The best approach depends on your specific requirements and risk tolerance. To get started, check out the documentation for content filters and system instructions for safety.