Create alert rules on the PA organization

Use the GDCH console to create groups of Observability alert rules for the metrics or logs of your project. Metric rules send alerts based on metric data, and log rules send alerts based on logging data. For each rule, you must enter a query language expression that determines whether the alert moves to a pending state. You can also include optional values such as labels and annotations.

Labels let you differentiate the characteristics of an alert as a map of key-value pairs. Use labels to add or overwrite information, such as the level of severity (error, critical, warning, or information), the alert code, and a short name to identify the resource.

Use annotations, by contrast, to add non-identifying metadata to alerts. For example, you can include messages and expressions that display in fields of the user interface (UI), or runbook URLs that help you resolve the issue.
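
As a rough illustration of the difference, the fragment below shows how labels and annotations might appear on a single alert rule. The keys are the ones named in the custom resource templates on this page. Every value, including the metric name http_requests_total, the threshold, and the runbook URL, is a hypothetical placeholder.

```yaml
# Sketch of the labels and annotations of one alert rule.
# All values are illustrative placeholders, not defaults.
labels:
  severity: warning   # level of severity
  code: 202           # alert code
  resource: AIS       # short name of the related resource
annotations:
  # Text shown in the Message field of the UI
  message: Request rate for bob-service is above the expected level.
  # Text shown in the Rule field of the UI (hypothetical query)
  expression: 'rate(http_requests_total{service_name="bob-service"}[1m]) > 100'
  # Link shown in the Actions to take field of the UI (hypothetical URL)
  runbookurl: https://example.com/runbooks/ais-202
```

Labels identify and classify the alert, while annotations only carry descriptive text for whoever responds to it.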

Alternatively, you can create alert rules using the Observability API by working directly with a custom resource and applying changes in your project namespace.

Create rules

You can create alert rules for data observability using the GDCH console, which is the preferred method, or by deploying a custom resource using the Observability API in your project namespace.

Console

Work through the following steps to create alert rules for data observability from the GDCH console:

  1. In the GDCH console, select a project.
  2. In the navigation menu, click Operations > Alerting.
  3. Click the Alerting Policy tab.
  4. Click Create Rule Group.
  5. Select whether you want to create a group for Metrics or Logs. Metric rules send alerts based on system monitoring data, and log rules send alerts based on system logging data.
  6. In the Alert rule group name field, enter a name for the group.
  7. In the Rule evaluation interval field, enter the number of seconds for each interval.
  8. In the Limit field, enter the maximum number of alerts. Enter 0 for unlimited alerts.
  9. In the Alert rules section, click Create Alert Rule.
  10. Enter a name for the alert rule.
  11. Enter an expression for the alert rule:

    • For a system logging rule, enter a LogQL (Log Query Language) expression.
    • For a system monitoring rule, enter a PromQL (Prometheus Query Language) expression.

    This expression must evaluate to true or false, which determines whether the alert moves to a pending state.

  12. In the Duration field, enter the number of seconds that the condition must be met before an active alert moves from the pending state to the open state.

  13. In the Severity field, choose the level of severity, such as Error or Warning.

  14. Enter a short name to identify the related resource, such as AIS or DHCP.

  15. Enter an alert code to identify the alert.

  16. Enter a runbook URL or information to help resolve the issue.

  17. Enter a message or description of the alert.

  18. Optional: Click Add label to add labels as key-value pairs.

  19. Optional: Click Add annotation to add annotations as key-value pairs.

  20. Click Save to create the rule.

  21. Click Create to create the rule group. The rule group appears in the Alert rule group list.

API

You can create system monitoring and logging rules in GDCH using the Observability API by deploying custom resources. A MonitoringRule or LoggingRule custom resource consists of one or more queries and expressions to form a condition, the frequency of evaluation, and, optionally, the duration over which the condition is met.

Work through the following steps to create alert rules by deploying a custom resource in your project namespace:

  1. Create a YAML file for the custom resource using the template for system monitoring or logging alert rules from the section Configure system logging and monitoring rules from custom resources.
  2. In the namespace field of the custom resource, enter your project namespace.
  3. In the name field, enter a name for the alerting rule configuration.
  4. Optional: If you are configuring the LoggingRule custom resource for logging rules, you can choose the log source for alerts in the source field. For example, enter a value such as operational or audit.
  5. In the interval field, enter the rule evaluation interval in seconds, such as 60s.
  6. Optional: In the limit field, enter the maximum number of alerts. Enter 0 for unlimited alerts.
  7. Optional: If you also want to calculate metrics and configure recording rules, enter the following information in the recordRules field:

    • In the record field, enter the recording name. This value defines the time series in which to write the recording rule and it must be a valid metric name.
    • In the expr field, enter an expression for the recording rule:

      • For a system logging rule, enter a LogQL (Log Query Language) expression.
      • For a system monitoring rule, enter a PromQL (Prometheus Query Language) expression.

      This expression must resolve to a numeric value to be recorded as a new metric.

    • Optional: In the labels field, define the labels that you want to add or overwrite as key-value pairs.

  8. In the alertRules field, enter the following information to configure alert rules:

    • In the alert field, enter the alert name.
    • In the expr field, enter an expression for the alert rule:

      • For a system logging rule, enter a LogQL expression.
      • For a system monitoring rule, enter a PromQL expression.

      This expression must evaluate to true or false, which determines whether the alert moves to a pending state.

    • Optional: In the for field, enter the duration in seconds over which the specified condition must be met to move the alert from the pending state to the open state. If you don't specify a value, the default is 0 seconds.

    • In the labels field, define the labels that you want to add or overwrite as key-value pairs. The following labels are required:

      • severity: Choose the level of severity, such as error, critical, warning, or info.
      • code: Enter the alert code to identify the alert.
      • resource: Enter a short name to identify the related resource, such as AIS or DHCP.
    • Optional: In the annotations field, add annotations as key-value pairs.

  9. Save the YAML file of the custom resource.

  10. Deploy the custom resource in your project namespace of the admin cluster to create the alert rules.
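
Taken together, the steps above might produce a file like the following minimal sketch for a system monitoring rule. It sets only the fields that the steps do not mark as optional and leaves out limit, recordRules, for, and annotations. The resource name, the alert name, the metric http_requests_total, and the threshold are assumptions chosen for illustration.

```yaml
# Minimal sketch of a MonitoringRule with a single alert rule.
apiVersion: monitoring.gdc.goog/v1
kind: MonitoringRule
metadata:
  namespace: PROJECT_NAMESPACE      # replace with your project namespace
  name: bob-service-alerts          # hypothetical configuration name
spec:
  # Evaluate the rule every 60 seconds
  interval: 60s
  alertRules:
  - alert: BobServiceHighRequestRate   # hypothetical alert name
    # The comparison makes the expression evaluate to true or false.
    # The metric name and threshold are placeholders.
    expr: 'rate(http_requests_total{service_name="bob-service"}[1m]) > 100'
    # Required labels
    labels:
      severity: warning
      code: 202
      resource: AIS
```

Because for is omitted, the default of 0 seconds applies and the alert moves from the pending state to the open state as soon as the condition is met.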

Configure system logging and monitoring rules from custom resources

This section contains the YAML templates you must use to create alert rules by deploying custom resources. If you create alerts from the GDCH console, you can skip this section.

The MonitoringRule custom resource

To create system monitoring rules, you must create a MonitoringRule custom resource. A MonitoringRule consists of recording rules and alert rules that describe the conditions to send an alert.

The following YAML file shows a template of the MonitoringRule custom resource:

# Configures either an alert or a target record for precomputation
apiVersion: monitoring.gdc.goog/v1
kind: MonitoringRule
metadata:
  # Choose namespace that matches the project namespace
  # Note: The alert or record will be produced in the same namespace
  namespace: PROJECT_NAMESPACE
  name: alerting-config
spec:
  # Rule evaluation interval
  interval: 60s

  # Configure limit for number of alerts (0: no limit)
  # Optional. Default: 0 (no limit)
  limit: 0

  # Configure recording rules to generate new metrics based on pre-existing metrics.
  # Recording rules precompute expressions that are frequently needed or computationally expensive.
  # These rules save their result as a new set of time series.
  recordRules:
    # Define which timeseries to write to. The value must be a valid metric name.
  - record: MyMetricsName

    # Define PromQL expression to evaluate for this rule
    expr: rate({service_name="bob-service"} [1m])

    # Define labels to add or overwrite
    # Optional. Map of key-value pairs
    labels:
      <label_key>: <label_value>

  # Configure alert rules
  alertRules:
    # Define alert name
  - alert: <string>

    # Define PromQL expression to evaluate for this rule
    # https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
    expr: rate({service_name="bob-service"} [1m])

    # Define when an active alert moves from pending to open
    # Optional. Default: 0s
    for: 0s

    # Define labels to add or overwrite
    # Required, Map of key-value pairs
    # Required labels:
    #     severity: [error, critical, warning, info]
    #     code:
    #     resource: component/service/hardware related to the alert
    #     additional labels are optional
    labels:
      severity: error
      code: 202
      resource: AIS
      <label_key>: <label_value>

    # Define annotations to add
    # Optional. Map of key-value pairs
    # Recommended annotations:
    #     message: value of Message field in UI
    #     expression: value of Rule field in UI
    #     runbookurl: URL for link in Actions to take field in UI
    annotations:
      <annotation_key>: <annotation_value>

Replace PROJECT_NAMESPACE with the namespace of your project.
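
One way to use both rule types together is to record a frequently needed expression once and then alert on the recorded series. The sketch below assumes a hypothetical counter named http_requests_total and a hypothetical recorded metric name. The threshold, the 300s duration, the extra team label, and the runbook URL are likewise illustrative. It also assumes that an alert rule can query a metric written by a recording rule in the same project, which is how Prometheus-style rules commonly behave but is not confirmed on this page.

```yaml
# Sketch: a recording rule that feeds an alert rule.
apiVersion: monitoring.gdc.goog/v1
kind: MonitoringRule
metadata:
  namespace: PROJECT_NAMESPACE
  name: bob-service-recorded-alerts   # hypothetical configuration name
spec:
  interval: 60s
  limit: 0                            # no limit on the number of alerts
  recordRules:
    # Precompute the total request rate and store it as a new time series.
  - record: bob_service:http_requests:rate5m   # hypothetical metric name
    # Resolves to a numeric value, as recording rules require
    expr: 'sum(rate(http_requests_total{service_name="bob-service"}[5m]))'
    labels:
      team: bob                       # hypothetical optional label
  alertRules:
  - alert: BobServiceSustainedHighRequestRate
    # Compare the recorded series against a placeholder threshold
    expr: 'bob_service:http_requests:rate5m > 100'
    # Stay pending for 5 minutes before moving to the open state
    for: 300s
    labels:
      severity: error
      code: 202
      resource: AIS
    annotations:
      message: Request rate for bob-service has stayed above 100 per second.
      expression: 'bob_service:http_requests:rate5m > 100'
      runbookurl: https://example.com/runbooks/ais-202   # hypothetical URL
```

The recording rule resolves to a number and the alert rule to a true or false condition, which matches the two kinds of expression described in the API steps.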

The LoggingRule custom resource

To create system logging rules, you must create a LoggingRule custom resource. A LoggingRule consists of recording rules and alert rules that describe the conditions to send an alert.

The following YAML file shows a template of the LoggingRule custom resource:

# Configures either an alert or a target record for precomputation
apiVersion: logging.gdc.goog/v1
kind: LoggingRule
metadata:
  # Choose namespace that matches the project namespace
  # Note: The alert or record will be produced in the same namespace
  namespace: PROJECT_NAMESPACE
  name: alerting-config
spec:
  # Choose which log source to base alerts on (Operational/Audit Logs)
  # Optional. Default: Operational
  source: operational

  # Rule evaluation interval
  interval: 60s

  # Configure limit for number of alerts (0: no limit)
  # Optional. Default: 0 (no limit)
  limit: 0

  # Configure recording rules to generate new metrics based on pre-existing logs.
  # Use recording rules for complex alerts, which query the same expression repeatedly every time they are evaluated.
  recordRules:
    # Define which timeseries to write to. The value must be a valid metric name.
  - record: MyMetricsName

    # Define LogQL expression to evaluate for this rule
    # https://grafana.com/docs/loki/latest/rules/
    expr: rate({service_name="bob-service"} [1m])

    # Define labels to add or overwrite
    # Optional. Map of key-value pairs
    labels:
      <label_key>: <label_value>

  # Configure alert rules
  alertRules:
    # Define alert name
  - alert: <string>

    # Define LogQL expression to evaluate for this rule
    expr: rate({service_name="bob-service"} [1m])

    # Define when an active alert moves from pending to open
    # Optional. Default: 0s
    for: 0s

    # Define labels to add or overwrite
    # Required, Map of key-value pairs
    # Required labels:
    #     severity: [error, critical, warning, info]
    #     code:
    #     resource: component/service/hardware related to alert
    #     additional labels are optional
    labels:
      severity: warning
      code: 202
      resource: AIS
      <label_key>: <label_value>

    # Define annotations to add
    # Optional. Map of key-value pairs
    # Recommended annotations:
    #     message: value of Message field in UI
    #     expression: value of Rule field in UI
    #     runbookurl: URL for link in Actions to take field in UI
    annotations:
      <annotation_key>: <annotation_value>

Replace PROJECT_NAMESPACE with the namespace of your project.
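
For comparison with the template, the following sketch fills in a LoggingRule that alerts on a burst of error lines. The resource name, the alert name, the "error" line filter, the threshold, the 120s duration, and the runbook URL are all assumptions for illustration. The service_name selector is taken from the template above.

```yaml
# Sketch of a LoggingRule with a single alert rule on operational logs.
apiVersion: logging.gdc.goog/v1
kind: LoggingRule
metadata:
  namespace: PROJECT_NAMESPACE
  name: bob-service-log-alerts        # hypothetical configuration name
spec:
  source: operational                 # base alerts on operational logs
  interval: 60s
  alertRules:
  - alert: BobServiceErrorLogSpike    # hypothetical alert name
    # LogQL: per-second rate of log lines containing "error" over 5 minutes,
    # summed across streams and compared against a placeholder threshold.
    expr: 'sum(rate({service_name="bob-service"} |= "error" [5m])) > 10'
    # Stay pending for 2 minutes before moving to the open state
    for: 120s
    labels:
      severity: error
      code: 202
      resource: AIS
    annotations:
      message: bob-service is logging more than 10 error lines per second.
      runbookurl: https://example.com/runbooks/ais-202   # hypothetical URL
```

To alert on audit logs instead, set the source field to audit.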