Configure safety settings for the PaLM API

Safety attribute confidence and severity scoring

Content processed through the Vertex AI PaLM API is assessed against a list of safety attributes, which include "harmful categories" and topics that can be considered sensitive.

Each safety attribute has an associated confidence score between 0.0 and 1.0, rounded to one decimal place, reflecting the likelihood of the input or response belonging to a given category.

Four of these safety attributes (harassment, hate speech, dangerous content, and sexually explicit) are assigned a safety rating (severity level) and a severity score ranging from 0.0 to 1.0, rounded to one decimal place. These ratings and scores reflect the predicted severity of content belonging to a given category.

Sample response

{
  "predictions": [
    {
      "safetyAttributes": {
        "categories": [
          "Derogatory",
          "Toxic",
          "Violent",
          "Sexual",
          "Insult",
          "Profanity",
          "Death, Harm & Tragedy",
          "Firearms & Weapons",
          "Public Safety",
          "Health",
          "Religion & Belief",
          "Illicit Drugs",
          "War & Conflict",
          "Politics",
          "Finance",
          "Legal"
        ],
        "scores": [
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1,
          0.1
        ],
        "safetyRatings": [
          {"category": "Hate Speech", "severity": "NEGLIGIBLE", "severityScore": 0.0, "probabilityScore": 0.1},
          {"category": "Dangerous Content", "severity": "LOW", "severityScore": 0.3, "probabilityScore": 0.1},
          {"category": "Harassment", "severity": "MEDIUM", "severityScore": 0.6, "probabilityScore": 0.1},
          {"category": "Sexually Explicit", "severity": "HIGH", "severityScore": 0.9, "probabilityScore": 0.1}
        ],
        "blocked": false
      },
      "content": "<>"
    }
  ]
}

Note: Categories with a score that rounds to 0.0 are omitted in the response. This sample response is for illustrative purposes only.
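
Because the categories and scores arrays are parallel, they can be paired into a mapping and checked against a confidence threshold of your choosing. The following Python sketch assumes a response already parsed into a dictionary shaped like the sample above; the 0.6 threshold and the function name are illustrative only, not part of the API.

# Minimal sketch: pair each safety attribute with its confidence score and
# flag the attributes whose score meets or exceeds a threshold you choose.
# The response dictionary mirrors the sample response above; the threshold
# value is illustrative only.
def flag_confidence(response: dict, threshold: float = 0.6) -> dict:
    attrs = response["predictions"][0]["safetyAttributes"]
    scores = dict(zip(attrs.get("categories", []), attrs.get("scores", [])))
    return {category: score for category, score in scores.items() if score >= threshold}

# With the sample response above, every score is 0.1, so nothing is flagged.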

Sample response when blocked

{
  "predictions": [
    {
      "safetyAttributes": {
        "blocked": true,
        "errors": [
          150,
          152,
          250
        ]
      },
      "content": ""
    }
  ]
}
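
When content is blocked, the safetyAttributes object carries a blocked flag and a list of error codes, and the content field is empty. A minimal check, again assuming a response parsed into a dictionary shaped like the samples on this page (the helper name is illustrative):

# Minimal sketch: report whether a prediction was blocked and, if so,
# return the accompanying safety error codes.
def check_blocked(response: dict) -> tuple[bool, list[int]]:
    attrs = response["predictions"][0]["safetyAttributes"]
    return attrs.get("blocked", False), attrs.get("errors", [])

# For the blocked sample above, this returns (True, [150, 152, 250]).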

Safety attribute descriptions

Derogatory: Negative or harmful comments targeting identity and/or protected attributes.
Toxic: Content that is rude, disrespectful, or profane.
Sexual: Contains references to sexual acts or other lewd content.
Violent: Describes scenarios depicting violence against an individual or group, or general descriptions of gore.
Insult: Insulting, inflammatory, or negative comments towards a person or a group of people.
Profanity: Obscene or vulgar language such as cursing.
Death, Harm & Tragedy: Human deaths, tragedies, accidents, disasters, and self-harm.
Firearms & Weapons: Content that mentions knives, guns, personal weapons, and accessories such as ammunition and holsters.
Public Safety: Services and organizations that provide relief and ensure public safety.
Health: Human health, including health conditions, diseases, and disorders; medical therapies, medication, vaccination, and medical practices; and resources for healing, including support groups.
Religion & Belief: Belief systems that deal with the possibility of supernatural laws and beings; religion, faith, belief, spiritual practice, churches, and places of worship. Includes astrology and the occult.
Illicit Drugs: Recreational and illicit drugs; drug paraphernalia and cultivation, headshops, and more. Includes medicinal use of drugs typically used recreationally (for example, marijuana).
War & Conflict: War, military conflicts, and major physical conflicts involving large numbers of people. Includes discussion of military services, even if not directly related to a war or conflict.
Finance: Consumer and business financial services, such as banking, loans, credit, investing, insurance, and more.
Politics: Political news and media; discussions of social, governmental, and public policy.
Legal: Law-related content, including law firms, legal information, primary legal materials, paralegal services, legal publications and technology, expert witnesses, litigation consultants, and other legal service providers.

Safety attributes with safety ratings

The following safety attributes are assigned a safety rating with the levels High, Medium, Low, and Negligible:

Hate Speech: Negative or harmful comments targeting identity and/or protected attributes.
Harassment: Malicious, intimidating, bullying, or abusive comments targeting another individual.
Sexually Explicit: Contains references to sexual acts or other lewd content.
Dangerous Content: Promotes or enables access to harmful goods, services, and activities.

Safety thresholds

Safety thresholds are in place for the following safety attributes:

  • Hate Speech
  • Harassment
  • Sexually Explicit
  • Dangerous Content

Google blocks model responses that exceed the designated severity scores for these safety attributes. To request the ability to modify a safety threshold, contact your Google Cloud account team.
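
These server-side thresholds cannot be changed in code; requests to modify them go through your account team as described above. If you want additional client-side screening on top of what Google returns, one approach is to compare the reported severity levels in safetyRatings against a minimum level of your own. The sketch below assumes the level ordering Negligible < Low < Medium < High and is not an official mechanism.

# Client-side post-filter sketch. This does not change Google's server-side
# thresholds; it only rejects responses whose reported severity level meets
# or exceeds a level you choose. The level ordering is an assumption based
# on the rating names documented above.
SEVERITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

def exceeds_severity(response: dict, block_at: str = "MEDIUM") -> bool:
    ratings = response["predictions"][0]["safetyAttributes"].get("safetyRatings", [])
    block_index = SEVERITY_ORDER.index(block_at)
    return any(SEVERITY_ORDER.index(rating["severity"]) >= block_index for rating in ratings)

# With the sample response above, exceeds_severity(response, "MEDIUM") is True
# because Harassment is MEDIUM and Sexually Explicit is HIGH.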

Testing your confidence and severity thresholds

You can test Google's safety filters and define confidence thresholds that are right for your business. By using these thresholds, you can take comprehensive measures to detect content that violates Google's usage policies or terms of service and take appropriate action.

The confidence scores are predictions only, and you shouldn't depend on the scores for reliability or accuracy. Google is not responsible for interpreting or using these scores for business decisions.

Important: Probability and Severity

With the exception of the four safety attributes that have safety ratings, the PaLM API safety filter confidence scores are based on the probability of content being unsafe, not on its severity. This is important to consider because some content can have a low probability of being unsafe even though its severity of harm could still be high. For example, compare the following sentences:

  1. The robot punched me.
  2. The robot slashed me up.

Sentence 1 might result in a higher probability of being unsafe, but you might consider sentence 2 to have a higher severity in terms of violence.

Given this, it is important that customers carefully test and consider the appropriate level of blocking needed to support their key use cases while minimizing harm to end users.
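
One practical consequence is that a client-side check may want to look at probabilityScore and severityScore together rather than relying on either value alone. A sketch, assuming the safetyRatings structure from the sample response; both threshold values are assumptions for illustration, not recommendations.

# Illustrative sketch: flag a rating when either its probability score or its
# severity score crosses a threshold you choose. The 0.7 and 0.5 defaults are
# assumptions, not values recommended by Google.
def risky_ratings(response: dict,
                  probability_threshold: float = 0.7,
                  severity_threshold: float = 0.5) -> list[str]:
    ratings = response["predictions"][0]["safetyAttributes"].get("safetyRatings", [])
    return [rating["category"] for rating in ratings
            if rating["probabilityScore"] >= probability_threshold
            or rating["severityScore"] >= severity_threshold]

# With the sample response above, this returns ["Harassment", "Sexually Explicit"].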

Safety errors

Safety error codes are three-digit codes that represent the reason that a prompt or response was blocked. The first digit is a prefix that indicates whether the code applies to the prompt or the response, and the remaining digits identify the reason that the prompt or response was blocked. For example, an error code of 251 indicates that the response was blocked due to an issue with hate speech content in the response from the model.

Multiple error codes can be returned in a single response.

If you encounter errors that block the content in your response from the model (prefix = 2, for example 250), adjust the temperature setting in your request. This helps generate a different set of responses that is less likely to be blocked.
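
A simple way to act on this advice is a retry loop that re-sends the request at a different temperature whenever a response-side block (prefix 2) is reported. The sketch below takes your own request function as a parameter, since the exact client call depends on how you invoke the API; it is not an official retry mechanism.

# Sketch of a retry loop. send_request is your own callable that takes
# (prompt, temperature) and returns a parsed response dictionary shaped like
# the samples above; the temperature values are illustrative.
def generate_with_retries(send_request, prompt: str, temperatures=(0.2, 0.5, 0.8)):
    response = None
    for temperature in temperatures:
        response = send_request(prompt, temperature)
        attrs = response["predictions"][0]["safetyAttributes"]
        errors = attrs.get("errors", [])
        # Retry only when the block applies to the model response (prefix 2).
        if not attrs.get("blocked", False) or not any(200 <= code < 300 for code in errors):
            break
    return response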

Error code prefix

The error code prefix is the first digit of the error code.

1: The error code applies to the prompt sent to the model.
2: The error code applies to the response from the model.

Error code reason

The error code reason is the second and third digits of the error code.

Error code reasons that begin with 3 or 4 indicate prompts or responses blocked because the confidence threshold for a safety attribute violation was met.

Error code reasons that begin with 5 indicate prompts or responses where unsafe content was found.

10: The response was blocked because of a quality issue or a parameter setting that affects citation metadata. This applies only to responses from the model; that is, error code 210. The citation checker identifies quality issues or issues stemming from a parameter setting. Try increasing the temperature, top-k, or top-p parameters to generate a different response. For more information, see Citation metadata.
20: The supplied or returned language is unsupported. For a list of supported languages, see Language support.
30: The prompt or response was blocked because it was found to be potentially harmful. The content includes a term from the terminology blocklist. Rephrase your prompt.
31: The content might include Sensitive Personally Identifiable Information (SPII). Rephrase your prompt.
40: The prompt or response was blocked because it was found to be potentially harmful. The content violates SafeSearch settings. Rephrase your prompt.
50: The prompt or response was blocked because it might contain sexually explicit content. Rephrase your prompt.
51: The prompt or response was blocked because it might contain hate speech content. Rephrase your prompt.
52: The prompt or response was blocked because it might contain harassment content. Rephrase your prompt.
53: The prompt or response was blocked because it might contain dangerous content. Rephrase your prompt.
54: The prompt or response was blocked because it might contain toxic content. Rephrase your prompt.
00: Reason unknown. Rephrase your prompt.
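
Putting the prefix and reason tables together, a small decoder can translate a raw error code into a readable explanation. The descriptions below are abbreviated from the tables above; this is a convenience sketch, not an official mapping.

# Decode a three-digit safety error code into what was blocked (prefix) and
# why (reason), using descriptions abbreviated from the tables above.
ERROR_PREFIXES = {
    "1": "prompt sent to the model",
    "2": "response from the model",
}
ERROR_REASONS = {
    "10": "quality issue or parameter setting affecting citation metadata",
    "20": "unsupported language",
    "30": "term from the terminology blocklist",
    "31": "possible Sensitive Personally Identifiable Information (SPII)",
    "40": "violates SafeSearch settings",
    "50": "possible sexually explicit content",
    "51": "possible hate speech content",
    "52": "possible harassment content",
    "53": "possible dangerous content",
    "54": "possible toxic content",
    "00": "reason unknown",
}

def describe_error(code: int) -> str:
    prefix, reason = str(code)[0], str(code)[1:]
    what = ERROR_PREFIXES.get(prefix, "unknown target")
    why = ERROR_REASONS.get(reason, "unknown reason")
    return f"{code}: {what} blocked ({why})"

# Example: describe_error(251) returns
# "251: response from the model blocked (possible hate speech content)"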

Safety settings in Vertex AI Studio

With the adjustable safety filter threshold, you can adjust how likely you are to see responses that could be harmful. Model responses are blocked based on the probability that they contain harassment, hate speech, dangerous content, or sexually explicit content. The safety filter setting is located on the right side of the prompt field in Vertex AI Studio. You can choose from three options: block most, block some, and block few.

