Query and view open alerts

After creating alert rules in your Google Distributed Cloud (GDC) air-gapped appliance project, you can query and view alerts on dashboards from the user interface (UI) of the project's system monitoring instance or query alerts from the GDC Observability HTTP API.

Query and view alerts on dashboards

You can view alerts on dashboards from the system monitoring instance of the platform-obs project.

The system monitoring instance includes project-level metrics, logs, and alerts to perform monitoring processes such as network monitoring and server monitoring.

Before you begin

Before querying and viewing alerts on dashboards, you must obtain access to the system monitoring instance. For more information, see Get access to dashboards.

To sign in and visualize alerts, ask your Project IAM Admin to grant you the Project Grafana Viewer (project-grafana-viewer) role. This role-based access control process lets you access data visualizations safely.

System monitoring instance endpoint

For Application Operator (AO):

Open the following URL to access the endpoint of your project:

https://GDC_URL/PROJECT_NAMESPACE/grafana

Replace the following:

  • GDC_URL: The URL of your organization in GDC.
  • PROJECT_NAMESPACE: The namespace of your project.

The project's UI contains default dashboards such as the Alerts - Overview dashboard with information about alerts. Querying alerts from the UI lets you visually retrieve alerting information from your project and get an integrated view of resources for awareness and quick resolution of problems.

For Platform Admin (PA):

Open the following URL to access the endpoint of your platform-obs project:

https://GDC_URL/platform-obs/grafana

Replace GDC_URL with the URL of your organization in GDC.

The user interface (UI) of the system monitoring instance contains default dashboards such as the Alerts - Overview dashboard with information about alerts for data observability. Querying alerts from the UI lets you visually retrieve alerting information from your project and get an integrated view of resources for awareness and quick resolution of problems.

The Alerts - Overview dashboard shows information about the number of alerts for a specific data source and a line graph of the alerts history, showing the number of alerts open per hour for the data source.

Figure 1. The Alerts - Overview dashboard on the Grafana UI.

Alertmanager

Alertmanager lets you monitor alert notifications from client applications. You can use Alertmanager to inspect, silence, filter, and group alerts. For example, you can silence the Loki rejection audit logs alerts in the root admin cluster.

Figure 2. Menu option to query audit logs from the Alertmanager.

Predefined alerting policies

The following list describes the pre-installed alerting rules in Prometheus, with the severity of each rule in parentheses:

  • KubeAPIDown (critical): KubeAPI has disappeared from Prometheus target discovery for 15 minutes.
  • KubeClientErrors (warning): The Kubernetes API server client errors ratio has been greater than 0.01 for 15 minutes.
  • KubeClientErrors (critical): The Kubernetes API server client errors ratio has been greater than 0.1 for 15 minutes.
  • KubePodCrashLooping (warning): A Pod has been in a crash-looping state for longer than 15 minutes.
  • KubePodNotReady (warning): A Pod has been in a non-ready state for longer than 15 minutes.
  • KubePersistentVolumeFillingUp (critical): The free bytes of a claimed PersistentVolume are less than 0.03.
  • KubePersistentVolumeFillingUp (warning): The free bytes of a claimed PersistentVolume are less than 0.15.
  • KubePersistentVolumeErrors (critical): The persistent volume has been in the Failed or Pending phase for five minutes.
  • KubeNodeNotReady (warning): A node has been unready for more than 15 minutes.
  • KubeNodeCPUUsageHigh (critical): Node CPU usage is greater than 80%.
  • KubeNodeMemoryUsageHigh (critical): Node memory usage is greater than 80%.
  • NodeFilesystemSpaceFillingUp (warning): Node file system usage is greater than 60%.
  • NodeFilesystemSpaceFillingUp (critical): Node file system usage is greater than 85%.
  • CertManagerCertExpirySoon (warning): A certificate is expiring in 21 days.
  • CertManagerCertNotReady (critical): A certificate is not ready to serve traffic after 10 minutes.
  • CertManagerHittingRateLimits (critical): A rate limit has been hit creating and renewing certificates for five minutes.
  • DeploymentNotReady (critical): A Deployment on the org admin cluster has been in a non-ready state for longer than 15 minutes.
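To make the KubePersistentVolumeFillingUp thresholds above concrete, the following sketch classifies a volume by comparing free bytes to capacity. Interpreting the 0.03 and 0.15 thresholds as a free-to-capacity ratio is an assumption here, and the function name and sample values are illustrative, not part of GDC:

```python
# Classify a PersistentVolume by its free-space ratio, mirroring the
# KubePersistentVolumeFillingUp thresholds listed above (assumption:
# the thresholds compare free bytes to total capacity).

def pv_fill_severity(free_bytes: int, capacity_bytes: int) -> str:
    """Return the severity the volume's free-space ratio would map to."""
    ratio = free_bytes / capacity_bytes
    if ratio < 0.03:
        return "critical"
    if ratio < 0.15:
        return "warning"
    return "ok"

if __name__ == "__main__":
    # A 100 GiB volume with 2 GiB free has a ratio of 0.02.
    print(pv_fill_severity(2 * 1024**3, 100 * 1024**3))   # critical
    # The same volume with 10 GiB free has a ratio of 0.10.
    print(pv_fill_severity(10 * 1024**3, 100 * 1024**3))  # warning
```
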

Sample alertmanagerConfigurationConfigmaps

The syntax of the configurations in the ConfigMaps that alertmanagerConfigurationConfigmaps lists must follow the Alertmanager configuration specification at https://prometheus.io/docs/alerting/latest/configuration/.

apiVersion: observability.gdc.goog/v1alpha1
kind: ObservabilityPipeline
metadata:
  # Choose a namespace that matches the project's namespace
  namespace: kube-system
  name: observability-config
spec:
  # Configure Alertmanager
  alerting:
    # Storage size for alerting data within the organization
    # Permission: PA
    localStorageSize: 1Gi

    # Permission: PA & AO
    # The Alertmanager config must be under the key "alertmanager.yml" in the ConfigMap
    alertmanagerConfig: <configmap-for-alertmanager-config>

    # Permission: PA
    volumes:
      - <volume referenced in volumeMounts>

    # Permission: PA
    volumeMounts:
      - <volumeMount referenced in alertmanagerConfig>

Sample rule configuration

# Configures either an alert or a target record for precomputation
apiVersion: monitoring.gdc.goog/v1alpha1
kind: MonitoringRule
metadata:
  # Choose namespace that contains the metrics that rules are based on
  # Note: alert/record will be produced in the same namespace
  namespace: g-fleetns-a
  name: alerting-config
spec:
  # Rule evaluation interval
  interval: <duration>

  # Configure limit for number of alerts (0: no limit)
  # Optional, Default: 0 (no limit)
  limit: <int>

  # Configure record rules
  recordRules:
    # Define which timeseries to write to (must be a valid metric name)
  - record: <string>

    # Define PromQL expression to evaluate for this rule
    expr: <string>

    # Define labels to add or overwrite
    # Optional, Map of {key, value} pairs
    labels:
      <labelname>: <labelvalue>

  # Configure alert rules
  alertRules:
    # Define alert name 
  - alert: <string>

    # Define PromQL expression to evaluate for this rule
    # https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
    expr: <string>

    # Define when an active alert moves from pending to firing
    # Optional, Default: 0s
    for: <duration>

    # Define labels to add or overwrite
    # Required, Map of {key, value} pairs
    # Required labels: 
    #     severity: [error, critical, warning, info]
    #     code: 
    #     resource: component/service/hardware related to alert
    #     additional labels are optional
    labels:
      severity: <enum: [error, critical, warning, info]>
      code: 
      resource: <Short name of the related operable component>
      <labelname>: <tmpl_string>

    # Define annotations to add
    # Optional, Map of {key, value} pairs
    # Recommended annotations:
    #     message: value of Message field in UI
    #     expression: value of Rule field in UI
    #     runbookurl: URL for link in Actions to take field in UI
    annotations:
      <labelname>: <tmpl_string>

# Configures either an alert or a target record for precomputation
apiVersion: logging.gdc.goog/v1alpha1
kind: LoggingRule
metadata:
  # Choose namespace that contains the logs that rules are based on
  # Note: alert/record will be produced in the same namespace
  namespace: g-fleetns-a
  name: alerting-config
spec:
  # Choose which log source to base alerts on (Operational/Audit/Security Logs)
  # Optional, Default: Operational
  source: <string>

  # Rule evaluation interval
  interval: <duration>

  # Configure limit for number of alerts (0: no limit)
  # Optional, Default: 0 (no limit)
  limit: <int>

  # Configure record rules
  recordRules:
    # Define which timeseries to write to (must be a valid metric name)
  - record: <string>

    # Define LogQL expression to evaluate for this rule
    # https://grafana.com/docs/loki/latest/rules/
    expr: <string>

    # Define labels to add or overwrite
    # Optional, Map of {key, value} pairs
    labels:
      <labelname>: <labelvalue>

  # Configure alert rules
  alertRules:
    # Define alert name
  - alert: <string>

    # Define LogQL expression to evaluate for this rule
    expr: <string>

    # Define when an active alert moves from pending to firing
    # Optional, Default: 0s
    for: <duration>

    # Define labels to add or overwrite
    # Required, Map of {key, value} pairs
    # Required labels: 
    #     severity: [error, critical, warning, info]
    #     code: 
    #     resource: component/service/hardware related to alert
    #     additional labels are optional
    labels:
      severity: <enum: [error, critical, warning, info]>
      code:
      resource: <Short name of the related operable component>
      <labelname>: <tmpl_string>

    # Define annotations to add
    # Optional, Map of {key, value} pairs
    # Recommended annotations:
    #     message: value of Message field in UI
    #     expression: value of Rule field in UI
    #     runbookurl: URL for link in Actions to take field in UI
    annotations:
      <labelname>: <tmpl_string>
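The label contract in the samples above requires severity, code, and resource, with severity limited to error, critical, warning, or info. A minimal sketch of how a rule author could check that contract before applying a rule; the helper name and rule literals are hypothetical, and the actual GDC controllers may validate differently:

```python
# Check that an alert rule's labels satisfy the contract described in the
# samples above: severity, code, and resource are required, and severity
# must be one of four allowed values. Helper name is illustrative.

ALLOWED_SEVERITIES = {"error", "critical", "warning", "info"}
REQUIRED_LABELS = {"severity", "code", "resource"}

def validate_alert_labels(labels: dict) -> list:
    """Return a list of problems; an empty list means the labels are valid."""
    problems = [f"missing required label: {name}"
                for name in sorted(REQUIRED_LABELS - labels.keys())]
    severity = labels.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"invalid severity: {severity}")
    return problems

if __name__ == "__main__":
    ok = {"severity": "critical", "code": "ABC-001", "resource": "kube-apiserver"}
    print(validate_alert_labels(ok))                   # []
    print(validate_alert_labels({"severity": "fatal"}))
```
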

Query alerts from the HTTP API

The Observability platform exposes an HTTP API endpoint for querying and reading metrics, alerts, and other time series data from your project for system monitoring.

Query alerts directly from the Observability HTTP API to set up automated tasks, adapt responses, and build integrations according to your use case. For example, insert the output into another command, export details to text file formats, or configure a Linux cron job. You can call the Observability HTTP API from the command-line interface (CLI) or a web browser and obtain the result in JSON format.

This section explains how to call the Observability HTTP API endpoint from the CLI using the Alertmanager API specification to query alerts.

Before you begin

To get the permissions you need to access the Observability HTTP API endpoint, ask your Project IAM Admin to grant you the Project Cortex Alertmanager Viewer (project-cortex-alertmanager-viewer) role in your project namespace.

The Project IAM Admin can grant you access by creating a role binding:

a. Infrastructure Operator (IO) Root-Admin - Project Cortex Alertmanager Viewer:

kubectl --kubeconfig $HOME/root-admin-kubeconfig create rolebinding \
  io-cortex-alertmanager-viewer-binding -n infra-obs \
  --user=fop-infrastructure-operator@example.com \
  --role=project-cortex-alertmanager-viewer

b. Platform-Admin (PA) Root-Admin - Project Cortex Alertmanager Viewer:

kubectl --kubeconfig $HOME/root-admin-kubeconfig create rolebinding \
  pa-cortex-alertmanager-viewer-binding -n platform-obs \
  --user=fop-platform-admin@example.com \
  --role=project-cortex-alertmanager-viewer

c. Application Operator (AO) Root-Admin - Project Cortex Alertmanager Viewer, where $AO_PROJECT is the project namespace and $AO_USER is the AO user name:

kubectl --kubeconfig $HOME/root-admin-kubeconfig create rolebinding \
  project-cortex-alertmanager-viewer-binding -n $AO_PROJECT \
  --user=$AO_USER \
  --role=project-cortex-alertmanager-viewer

After the role binding is created, you can access the corresponding Alertmanager with your login username.

Verify the role binding:

kubectl --kubeconfig $HOME/org-1-admin-kubeconfig get rolebinding -n platform-obs

For information about setting role bindings from the GDC console, see Grant access to resources.

Cortex endpoint

The following URL is the Cortex endpoint for accessing alerts:

https://GDC_URL/PROJECT_NAME/cortex/alertmanager/

Replace the following:

  • GDC_URL: The URL of your organization in GDC.
  • PROJECT_NAME: The name of your project.
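Assembling the endpoint from these two values is mechanical, as the following sketch shows. The organization URL and project name match the curl example later in this section; the helper function itself is hypothetical:

```python
# Build the Cortex Alertmanager endpoint URL from an organization URL and
# a project name, following the template above. The helper is illustrative.

def cortex_alertmanager_url(gdc_url: str, project_name: str) -> str:
    return f"https://{gdc_url}/{project_name}/cortex/alertmanager/"

if __name__ == "__main__":
    print(cortex_alertmanager_url("console.org-1.zone1.google.gdch.test", "alice"))
```
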

Call the API endpoint

Follow these steps to reach the Cortex API endpoint from the CLI and query alerts:

  1. Ensure you meet the prerequisites.
  2. Open the CLI.
  3. Use the curl tool to call the Cortex endpoint URL, extending the URL with the standard Prometheus Alertmanagers API path described at https://prometheus.io/docs/prometheus/latest/querying/api/#alertmanagers to query alerts. For example:

    curl https://console.org-1.zone1.google.gdch.test/alice/cortex/alertmanager/api/v1/alertmanagers
    

The output appears in the CLI after you run the command. The API response is in JSON format.
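Because the response is JSON, you can post-process it in a script instead of reading it by hand, for example to list only active alerts. The sketch below filters a response body; the payload is a simplified, hypothetical sample, not the exact schema the endpoint returns:

```python
import json

# Filter a JSON alert payload for alerts in the active state and print
# their names and severities. SAMPLE_RESPONSE is a hypothetical,
# simplified sample; the real response schema may differ.

SAMPLE_RESPONSE = """
[
  {"labels": {"alertname": "KubePodNotReady", "severity": "warning"},
   "status": {"state": "active"}},
  {"labels": {"alertname": "KubeAPIDown", "severity": "critical"},
   "status": {"state": "suppressed"}}
]
"""

def active_alerts(body: str) -> list:
    """Return (alertname, severity) pairs for alerts in the active state."""
    return [(a["labels"].get("alertname", ""), a["labels"].get("severity", ""))
            for a in json.loads(body)
            if a.get("status", {}).get("state") == "active"]

if __name__ == "__main__":
    for name, severity in active_alerts(SAMPLE_RESPONSE):
        print(f"{name}: {severity}")
```
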