This document describes how to identify deployment failures and operational incidents that you might encounter in Google Distributed Cloud (GDC) air-gapped appliance. It also describes the alerts that the system displays to help you solve common issues with logging and monitoring components.
Identify Observability components
The Observability platform deploys its components to the obs-system
namespace in all cluster types, including user clusters.
The Grafana instance of the Infrastructure Operator (IO) provides access to organization-level metrics to monitor infrastructure components such as CPU utilization and storage consumption. It also provides access to operational and audit logs. Additionally, it gives access to alerts, logs, and metrics from the GDC operable components.
The GDC monitoring and logging stacks use open source solutions as part of the Observability platform. These components collect logs and metrics from Kubernetes pods, bare metal machines, network switches, storage, and managed services.
The following table contains a description of all the components that make up the Observability platform:
| Component | Description |
|---|---|
| Prometheus | Prometheus is a time-series database for collecting and storing metrics and evaluating alerts. Prometheus stores metrics in the Cortex instance of the admin cluster for long-term storage. Prometheus adds labels as key-value pairs and collects metrics from Kubernetes nodes, pods, bare metal machines, network switches, and storage appliances. The database stores metrics from the user cluster in the same cluster and aggregates metrics from all clusters in the admin cluster. |
| Alertmanager | Alertmanager is a user-defined manager system that sends alerts when logs or metrics indicate system components fail or do not operate normally. It manages the routing, silencing, and aggregation of Prometheus alerts. |
| Loki | Loki is a time-series database that stores and aggregates logs from various sources. It indexes logs for efficient querying. |
| Grafana | Grafana provides a user interface (UI) for viewing metrics that Prometheus collects and querying audit and operational logs from the corresponding Loki instances. The UI lets you visualize dashboards of metrics and alerts. |
| Fluent Bit | Fluent Bit is a processor that pulls logs from various components or locations and sends them into Loki. It runs on every node of all clusters. |
Identify deployment failures
If your deployment is running and healthy, the components run in the READY
state.
Work through the following steps to identify deployment failures:
Confirm the current state of a component:
kubectl get -n obs-system TYPE/COMPONENT
Replace the following:
- TYPE: the component type
- COMPONENT: the component name
You get an output similar to the following:
NAME        READY   AGE
COMPONENT   1/1     23h
If the component is healthy, the READY column of the output shows N/N as a value. If the READY column doesn't show a value, it doesn't necessarily indicate failure. The service might need more time to process.

Check the pods in each component:
kubectl get pods -n obs-system | awk 'NR==1 || /COMPONENT/'
Replace COMPONENT with the component name.

You get an output similar to the following:

NAME        READY   STATUS    RESTARTS   AGE
COMPONENT   1/1     Running   0          23h
Verify that the READY column shows N/N as a value, the STATUS column shows a Running value, and the number of RESTARTS does not exceed a value of 2.

A high number of restarts indicates the following symptoms:

- The pods fail and Kubernetes restarts them.
- The STATUS column shows the CrashLoopBackOff value.
To resolve the failing status, view the logs of the pods, as shown in the example after this procedure.
If a pod is in a state of PENDING, this state indicates one or more of the following symptoms:

- The pod is waiting for network access to download the necessary container.
- A configuration issue prevents the pod from starting. For example, a Secret value that the pod requires is missing.
- Your Kubernetes cluster has run out of resources to schedule the pod, which occurs if many applications are running on the cluster.
Determine the cause of a PENDING state:

kubectl describe -n obs-system pod/POD_NAME
Replace POD_NAME with the name of the pod that shows the PENDING state.

The output shows more details about the pod.
Navigate to the Events section of the output to view a table listing the recent events of the pod and a summary of the cause of the PENDING state.

The following output shows a sample Events section for a Grafana StatefulSet object:

Events:
  Type    Reason            Age                From                    Message
  ----    ------            ----               ----                    -------
  Normal  SuccessfulCreate  13s (x3 over 12d)  statefulset-controller  create Pod grafana-0 in StatefulSet grafana successful
If there are no events in your pod or any other resource for an extended time, you receive the following output:
Events: <none>
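For example, if a pod reports a high restart count or a CrashLoopBackOff status, the following commands are a minimal sketch of how to inspect its logs. The grafana-0 pod name is only an illustrative placeholder; use the pod name from your own output:

```bash
# Hypothetical pod name; replace it with the failing pod from your output.
POD_NAME=grafana-0

# Logs of the currently running container.
kubectl logs -n obs-system "${POD_NAME}"

# Logs of the previous, failed container instance, if one exists.
kubectl logs -n obs-system "${POD_NAME}" --previous
```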
Check that the Observability logging stack is running
Work through the following steps to verify that the logging stack is running:
Check that Istio Service Mesh includes the following Observability logging components in each of the following clusters:
- Root admin cluster: Verify that all Loki instances or pods have the Istio sidecar injected.
- All clusters: Verify that all Fluent Bit pods named anthos-audit-logs-forwarder-SUFFIX and anthos-log-forwarder-SUFFIX have the Istio sidecar injected (see the check at the end of this procedure).
Check that all Loki instances are running without errors in the root admin cluster.
In all clusters, check the status of the anthos-audit-logs-forwarder and anthos-log-forwarder DaemonSet objects to verify that all instances are running on all nodes without errors.

Verify that you get operational logs from kube-apiserver-SUFFIX containers and audit logs from the Kubernetes API server for the last five minutes in all clusters. To do so, run the following queries in the Grafana instance:

- Operational logs:
sum (count_over_time({service_name="apiserver"} [5m])) by (cluster, fluentbit_pod)
- Audit logs:
sum (count_over_time({cluster=~".+"} [5m])) by (cluster, node)
You must obtain non-zero values for all control plane nodes in all clusters.
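To check the Istio sidecar injection from the first step, you can list the containers of each forwarder pod and confirm that an istio-proxy container appears next to the Fluent Bit container. This is a minimal sketch, assuming the forwarder pods run in the obs-system namespace:

```bash
# Print each forwarder pod with its container names; every pod should list
# an istio-proxy container in addition to the Fluent Bit container.
kubectl get pods -n obs-system \
  -o custom-columns='NAME:.metadata.name,CONTAINERS:.spec.containers[*].name' \
  | grep -E 'anthos-audit-logs-forwarder|anthos-log-forwarder'
```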
Check that the Observability monitoring stack is running
Work through the following steps to verify that the monitoring stack is running:
Check that the Grafana instances are running in the root admin cluster. The grafana-0 pods must run without errors in the following namespaces:

- obs-system
- infra-obs-obs-system
- platform-obs-obs-system

Ensure that all monitoring components are in the Istio Service Mesh. Work through the steps of the Identify deployment failures section. Each of the following pods must show all containers as ready in the READY column. For example, a value of 3/3 means that three containers out of three are ready. Additionally, the pods must have an istio-proxy container. If the pods don't meet these conditions, restart the pod:

Root admin cluster
| Pod name | Number of containers ready |
|---|---|
| cortex-* | 2/2 |
| cortex-etcd-0 | 2/2 |
| cortex-proxy-server-* | 2/2 |
| cortex-tenant-* | 2/2 |
| meta-blackbox-exporter-* | 2/2 |
| meta-grafana-0 | 3/3 |
| meta-grafana-proxy-server-* | 2/2 |
| meta-prometheus-0 | 4/4 |
System cluster
| Pod name | Number of containers ready |
|---|---|
| cortex-alertmanager-* | 2/2 |
| cortex-compactor-* | 2/2 |
| cortex-distributor-* | 2/2 |
| cortex-etcd-0 | 2/2 |
| cortex-ingester-* | 2/2 |
| cortex-querier-* | 2/2 |
| cortex-query-frontend-* | 2/2 |
| cortex-query-scheduler-* | 2/2 |
| cortex-ruler-* | 2/2 |
| cortex-store-gateway-* | 2/2 |
| cortex-tenant-* | 2/2 |
| grafana-proxy-server-* | 2/2 |
| meta-blackbox-exporter-* | 2/2 |
| meta-grafana-0 | 3/3 |
| meta-grafana-proxy-server-* | 2/2 |
| meta-prometheus-0 | 4/4 |
Ensure that Cortex is running without errors.
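To compare the running pods against the preceding tables, you can print each pod's readiness and container names in one pass. This is a minimal sketch, assuming you run it against the obs-system namespace of the cluster that you are checking:

```bash
# List the monitoring pods with their per-container readiness and container
# names so you can confirm the expected counts and the istio-proxy sidecar.
kubectl get pods -n obs-system \
  -o custom-columns='NAME:.metadata.name,READY:.status.containerStatuses[*].ready,CONTAINERS:.spec.containers[*].name' \
  | grep -E '^(cortex|grafana-proxy-server|meta-)'
```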
Retrieve Observability logs
The following table contains the commands that you must run to retrieve the logs for each of the components the Observability platform deploys.
| Component | Log-retrieval command |
|---|---|
| grafana | kubectl logs -n obs-system statefulset/grafana |
| anthos-prometheus-k8s | kubectl logs -n obs-system statefulset/anthos-prometheus-k8s |
| alertmanager | kubectl logs -n obs-system statefulset/alertmanager |
| ops-logs-loki-io | kubectl logs -n obs-system statefulset/ops-logs-loki-io |
| ops-logs-loki-io-read | kubectl logs -n obs-system statefulset/ops-logs-loki-io-read |
| ops-logs-loki-all | kubectl logs -n obs-system statefulset/ops-logs-loki-all |
| ops-logs-loki-all-read | kubectl logs -n obs-system statefulset/ops-logs-loki-all-read |
| audit-logs-loki-io | kubectl logs -n obs-system statefulset/audit-logs-loki-io |
| audit-logs-loki-io-read | kubectl logs -n obs-system statefulset/audit-logs-loki-io-read |
| audit-logs-loki-pa | kubectl logs -n obs-system statefulset/audit-logs-loki-pa |
| audit-logs-loki-pa-read | kubectl logs -n obs-system statefulset/audit-logs-loki-pa-read |
| audit-logs-loki-all | kubectl logs -n obs-system statefulset/audit-logs-loki-all |
| audit-logs-loki-all-read | kubectl logs -n obs-system statefulset/audit-logs-loki-all-read |
| anthos-log-forwarder | kubectl logs -n obs-system daemonset/anthos-log-forwarder |
| anthos-audit-logs-forwarder | kubectl logs -n obs-system daemonset/anthos-audit-logs-forwarder |
| oplogs-forwarder | kubectl logs -n obs-system daemonset/oplogs-forwarder |
| logmon-operator | kubectl logs -n obs-system deployment/logmon-operator |
To view the logs of a component's previous instance, add the -p
flag at the
end of each command. Adding the -p
flag lets you review logs of a previous
failed instance instead of the current running instance.
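For example, to review the logs of the previous logmon-operator instance:

```bash
# The -p flag returns the logs of the previously terminated container
# instead of the currently running one.
kubectl logs -n obs-system deployment/logmon-operator -p
```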
View the configuration
The Observability stack uses Kubernetes API custom resources to configure monitoring and logging pipelines.
The LoggingPipeline
custom resource is deployed in the root admin cluster and configures Loki instances.
The following commands show the available actions that you can perform on the logging pipeline:
View the current configuration of your logging pipeline deployment:
kubectl get loggingpipeline -n obs-system default -o yaml
Change the configuration of your logging pipeline deployment:
kubectl edit loggingpipeline -n obs-system default
GDC uses a logging and monitoring operator named logmon-operator
to manage the deployment of Observability components such as Prometheus and Fluent Bit. The API to the logmon-operator
component is the logmon
custom resource definition. The logmon
custom resource definition instructs the logmon-operator
on how to configure Observability for your deployment. This custom resource definition includes the properties of volumes to store your metrics, alert rules for Alertmanager, Prometheus configurations to collect metrics, and Grafana configurations for dashboards.
The following commands show the available actions that you can perform on the logmon
custom resource definition:
View the current configuration for your Observability deployment:
kubectl get logmon -n obs-system logmon-default -o yaml
Change the configuration of your Observability deployment:
kubectl edit logmon -n obs-system logmon-default
The output from running either command might reference multiple Kubernetes ConfigMap objects for further configuration. For example, you can configure Alertmanager rules in a separate ConfigMap object, which is referenced in the logmon custom resource definition by name. You can view and change the Alertmanager configuration through the ConfigMap object named gpc-alertmanager-config.
To view the Alertmanager configuration, run:
kubectl get configmap -n obs-system gpc-alertmanager-config -o yaml
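If you only need the Alertmanager configuration itself rather than the full ConfigMap object, you can extract the data key directly. This is a sketch that assumes the configuration is stored under a key named alertmanager.yml, as described in the Alertmanager troubleshooting section later in this document:

```bash
# Print only the Alertmanager configuration from the ConfigMap data section.
# The alertmanager.yml key name is an assumption based on the Alertmanager
# troubleshooting steps in this document.
kubectl get configmap -n obs-system gpc-alertmanager-config \
  -o jsonpath='{.data.alertmanager\.yml}'
```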
Common issues
This section contains common issues you might face when deploying the Observability platform.
You cannot access Grafana
By default, Grafana is not exposed to machines external to your Kubernetes admin
cluster. To temporarily access the Grafana interface from outside of the admin
cluster, you can port-forward the Service
to localhost. To port-forward the
Service
, run:
kubectl port-forward -n gpc-system service/grafana 33000:3000
In your web browser, navigate to http://localhost:33000
to view the Grafana
dashboard for your deployment. To end the process, press the Control+C keys.
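While the port-forward is active, you can optionally confirm that Grafana responds before opening the browser. The /api/health endpoint is part of Grafana's standard HTTP API:

```bash
# Run this in a second terminal while the port-forward is active.
curl -s http://localhost:33000/api/health
```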
Grafana runs slowly
Grafana running slowly indicates the following:
- Queries to Prometheus or Loki return excessive data.
- Queries return more data than is reasonable to display on a single graph.
To resolve slow speeds within Grafana, check the queries on your custom Grafana dashboards. If the queries return more data than is reasonable to display on a single graph, consider reducing the amount of data shown to improve dashboard performance.
Grafana dashboard shows no metrics or logs
Grafana showing no metrics or logs could be caused by the following reasons:
- Grafana data sources are not properly set.
- The system has connectivity issues to either monitoring or logging data sources.
- The system is not collecting metrics or logs.
To view the logs and metrics, follow these steps:
- In the Grafana user interface, click Dashboard settings.
- Select Data Sources.
- On the Data Sources page, make sure you see the following sources:
| Name | Organization | URL |
|---|---|---|
| Audit Logs | All | http://audit-logs-loki-io-read.obs-system.svc:3100 |
| Operational Logs | Root | http://ops-logs-loki-io-read.obs-system.svc:3100 |
| Operational Logs | Org | http://ops-logs-loki-all-read.obs-system.svc:3100 |
| prometheus | | http://anthos-prometheus-k8s.obs-system.svc:9090 |
Missing these data sources indicates that the Observability stack failed to configure Grafana correctly.
If you configured the data sources correctly but no data shows, this might
indicate an issue with Service
objects that collect metrics or logs to feed
into Prometheus or Loki.
As Prometheus collects metrics, it follows a pull model to periodically query
your Service
objects for metrics and store the values found. For Prometheus to
discover your Service
objects for metric collection, the following must be true:
- All pods for Service objects are annotated with 'monitoring.gke.io/scrape: "true"' (a quick check for this annotation follows this list).
- The pod metrics are exposed over HTTP in the Prometheus metric format. By default, Prometheus looks for these metrics at the endpoint http://POD_NAME:80/metrics. If needed, you can override the port, endpoint, and scheme through annotations.
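The following is a minimal check for the scrape annotation, with my-namespace as a placeholder for the namespace of your workload. Pods that print an empty value for the annotation are not discovered by Prometheus:

```bash
# Print each pod name with the value of its monitoring.gke.io/scrape
# annotation; dots in the annotation key are escaped for JSONPath.
kubectl get pods -n my-namespace \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.monitoring\.gke\.io/scrape}{"\n"}{end}'
```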
Fluent Bit collects logs and is intended to run on every node of your Kubernetes clusters. Fluent Bit sends the logs to Loki for long-term storage.
If no logs are present in Grafana, try the following workarounds:
Check the logs of Loki instances to ensure they run without errors.
If the Loki instances are running properly but the logs do not appear, check the logs in Fluent Bit to ensure the service works as expected. To review how to pull logs, see Retrieve Observability logs.
Alertmanager is not opening alerts
If Alertmanager fails to open alerts, work through the following steps:
- In your configMap object within the gpc-system namespace, ensure the label logmon: system_metrics exists.
- Verify that the configMap data section includes a key named alertmanager.yml. The value for the alertmanager.yml key must be the alert rules contained in your valid Alertmanager configuration file.
- Ensure the logmon custom resource definition named logmon-default in the gpc-system namespace contains a reference to the configMap object. The logmon-default custom resource definition must contain the name of the configMap object, as shown in the following example:

apiVersion: addons.gke.io/v1
kind: Logmon
spec:
  system_metrics:
    outputs:
      default_prometheus:
        deployment:
          components:
            alertmanager:
              alertmanagerConfigurationConfigmaps:
                - alerts-config
The alerts-config value in the example is the name of your configMap object.
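A sketch of how you might check the first two conditions from the command line, assuming your ConfigMap carries the logmon: system_metrics label in the gpc-system namespace and is named alerts-config as in the example:

```bash
# List ConfigMap objects that carry the expected label.
kubectl get configmap -n gpc-system -l logmon=system_metrics

# Confirm that the data section contains the alertmanager.yml key.
# Replace alerts-config with the name of your ConfigMap object.
kubectl get configmap -n gpc-system alerts-config \
  -o jsonpath='{.data.alertmanager\.yml}'
```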
Alertmanager is not sending alerts to the configured notification channels
A configuration error might prevent you from receiving notifications in the external software you configured as a notification channel, such as Slack, even if Alertmanager generates alerts in the Grafana instance.
To receive alerts in your external software, follow these steps:
Check that the values in your Alertmanager configuration file are properly formatted. When Alertmanager triggers an alert, it queries a webhook on the external software.
Ensure the webhook URLs that connect to the external software are correct. If the URLs are correct, ensure the software is configured to accept webhooks. You might need to generate an API key to access the external service API, or your current API key might be expired, and you need to refresh it.
If the external service is outside of your GDC air-gapped appliance deployment, ensure your admin cluster has its egress rules configured. This configuration lets Alertmanager send requests to a service outside of the internal Kubernetes network. Failure to verify the egress rules could result in Alertmanager being unable to find the external software.
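As an example of verifying the notification channel itself, you can send a test payload to the webhook from a machine that can reach it. The payload below follows the Slack incoming-webhook format; adjust it for other software:

```bash
# Replace WEBHOOK_URL with the webhook URL from your Alertmanager
# configuration file. A 200 response indicates that the channel accepts
# requests; an error points to a URL, credential, or egress problem.
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Alertmanager connectivity test"}' \
  WEBHOOK_URL
```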
You cannot see metrics from your project-scoped workload
Work through the following steps to apply a workaround and get metrics from your workload:
- Ensure the MonitoringTarget custom resource has a Ready status.
- To scrape your workload, you must declare all the target information specified in the MonitoringTarget in the workload's pod specification. For example, if you declare that metrics are available on port 8080, the workload pod must declare to Kubernetes that port 8080 is open. Otherwise, Prometheus ignores the workload.
- Prometheus runs multiple shards, which means that not all Prometheus pods are expected to scrape your pod. You can identify the shard number in the name of each Prometheus pod. For example, primary-prometheus-shard0-replica0-0 is part of shard 0. Check for the pod you want to scrape from each Prometheus shard:
  - Port-forward the primary-prometheus-shardSHARD_NUMBER-replica0-0 pod of Prometheus in the obs-system namespace to gain access to the Prometheus UI, as shown in the sketch after this list. Replace SHARD_NUMBER in the pod name with increasing numbers every time you check a new shard.
  - Go to the Prometheus UI in your web browser and follow these steps:
    - Click Status > Targets.
    - Ensure the pod you want to scrape is in the list. If not, check the next shard. If there are no more shards to check, revalidate that Prometheus has enough information to discover it.
  - Check the logs of the primary-prometheus-shardSHARD_NUMBER-replica0-0 pod of Prometheus in the obs-system namespace for errors.
- Check the logs of the cortex-tenant pod in the obs-system namespace for errors.
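The following is a minimal sketch of the port-forward step, assuming the Prometheus UI listens on its default port 9090. Replace the shard number as you iterate through the shards:

```bash
# Forward the Prometheus UI of shard 0 to your workstation, then open
# http://localhost:9090 and go to Status > Targets. The 9090 port is the
# Prometheus default and is an assumption about this deployment.
kubectl port-forward -n obs-system pod/primary-prometheus-shard0-replica0-0 9090:9090
```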
A dashboard was not created
Work through the following steps to apply a workaround and find why a dashboard is not created:
- Review the status of the Dashboard custom resource to look for any errors. The custom resource must have a Ready status.
- Ensure you are checking the correct Grafana instance. For example, if you deployed the dashboard in your project namespace named my-namespace, then the dashboard must be in the Grafana instance at the https://GDCH_URL/my-namespace/grafana endpoint.
- Check the logs of the fleet-admin-controller in the gpc-system namespace. Look for any errors related to the dashboard by searching for the dashboard name in the logs. If you find errors, the JSON file in your configMap object has an incorrect format, and you must correct it.
- Check the Grafana logs in the PROJECT_NAME-obs-system namespace to look for any errors. Dashboards query the Grafana REST API, so Grafana must be working for a dashboard to be created.
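A sketch of the first check, assuming the Dashboard custom resource is exposed through kubectl under the dashboard resource name, with my-namespace and DASHBOARD_NAME as placeholders for your project namespace and dashboard:

```bash
# Inspect the status and recent events of the Dashboard custom resource.
# The "dashboard" resource name is an assumption; adjust it to the resource
# kind exposed in your deployment.
kubectl describe dashboard -n my-namespace DASHBOARD_NAME
```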
Your alert is not opening
Work through the following steps to apply a workaround and find why an alert is not opening:
- Ensure Cortex and Loki are both in bucket-storage mode. Rules do not work unless the backend is backed by bucket storage.
- Verify that the status of the MonitoringRule and LoggingRule custom resources is Ready.
- Check the following alerting conditions:
  - PromQL and LogQL expressions: Compare all the functions you are using against the Create alert rules documentation to ensure your rules are configured as you want. Make sure that the expressions return a true or false value.
  - Duration: The for field of the custom resource defines how long a condition must be true. The interval field defines how often to evaluate the condition. Check the values of these fields against each other and ensure your conditions are logical.
- Use the Alerts page of the Grafana UI to check whether the alert is open.
- If Grafana shows that the alert is open, check your notification channels and ensure Alertmanager can contact them to produce the alert.
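A sketch of the status check, assuming the MonitoringRule and LoggingRule custom resources are exposed through kubectl under the monitoringrule and loggingrule resource names, with my-namespace as a placeholder:

```bash
# Confirm that the rule resources report a Ready status. The resource names
# are assumptions; adjust them to the kinds exposed in your deployment.
kubectl get monitoringrule -n my-namespace
kubectl get loggingrule -n my-namespace
```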
Expected logs are not available
Work through the following steps if you don't see operational logs from your component:
- Check if your component is running and producing logs.
- Check if your component logs should be collected as a built-in functionality. If not, ensure you have the LoggingTarget custom resource deployed with a valid specification and with a Ready status.
Work through the following steps if you don't see audit logs from your component:
- If your component writes logs to a file, ensure the file actually exists on the filesystem of the node at the path /var/log/audit/SERVICE_NAME/NAMESPACE/ACCESS_LEVEL/audit.log.
- Verify that the anthos-audit-logs-forwarder-SUFFIX pod on the same node has no errors.
- If your component uses a syslog endpoint to receive logs, ensure you have the AuditLoggingTarget custom resource deployed with a valid specification and with a Ready status.
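For the file-based case, the following is a minimal sketch of the check, assuming you have shell access to the node that runs your component. SERVICE_NAME, NAMESPACE, and ACCESS_LEVEL follow the path convention described above:

```bash
# On the node that runs your component, confirm that the audit log file
# exists and is being written to.
ls -l /var/log/audit/SERVICE_NAME/NAMESPACE/ACCESS_LEVEL/audit.log
tail -n 20 /var/log/audit/SERVICE_NAME/NAMESPACE/ACCESS_LEVEL/audit.log
```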
Identify predefined alert rules
This section contains information about the predefined alerting rules that exist in Observability components to notify you about system failures.
Predefined alert rules in Loki
The following table provides the preinstalled alerting rules in Loki for audit logging failures:
| Name | Type | Description |
|---|---|---|
| FluentBitAuditLoggingWriteFailure | Critical | Fluent Bit failed to forward audit logs in the last five minutes. |
| LokiAuditLoggingWriteFailure | Critical | Loki failed to write audit logs to backend storage. |
When one or more of these alerts are displayed, the system has lost at least one audit record.
Predefined alert rules in Prometheus
The following table provides the preinstalled alerting rules in Prometheus for Kubernetes components:
| Name | Type | Description |
|---|---|---|
| KubeAPIDown | Critical | The Kube API has disappeared from the Prometheus target discovery for 15 minutes. |
| KubeClientErrors | Warning | The errors ratio from the Kubernetes API server client has been greater than 0.01 for 15 minutes. |
| KubeClientErrors | Critical | The errors ratio from the Kubernetes API server client has been greater than 0.1 for 15 minutes. |
| KubePodCrashLooping | Warning | The Pod has been in a crash-looping state for longer than 15 minutes. |
| KubePodNotReady | Warning | The Pod has been in an unready state for longer than 15 minutes. |
| KubePersistentVolumeFillingUp | Critical | The free bytes of a claimed PersistentVolume object are less than 0.03. |
| KubePersistentVolumeFillingUp | Warning | The free bytes of a claimed PersistentVolume object are less than 0.15. |
| KubePersistentVolumeErrors | Critical | The persistent volume has been in a Failed or Pending phase for five minutes. |
| KubeNodeNotReady | Warning | The node has not been ready for more than 15 minutes. |
| KubeNodeCPUUsageHigh | Critical | The CPU usage of the node is greater than 80%. |
| KubeNodeMemoryUsageHigh | Critical | The memory usage of the node is greater than 80%. |
| NodeFilesystemSpaceFillingUp | Warning | The file system usage of the node is greater than 60%. |
| NodeFilesystemSpaceFillingUp | Critical | The file system usage of the node is greater than 85%. |
| CertManagerCertExpirySoon | Warning | A certificate is expiring in 21 days. |
| CertManagerCertNotReady | Critical | A certificate is not ready to serve traffic after 10 minutes. |
| CertManagerHittingRateLimits | Critical | You reached a rate limit when creating or renewing certificates for five minutes. |
| DeploymentNotReady | Critical | A Deployment on the admin cluster has been in a non-ready state for longer than 15 minutes. |
| StatefulSetNotReady | Critical | A StatefulSet object on the admin cluster has been in a non-ready state for longer than 15 minutes. |
| AuditLogsForwarderDown | Critical | The anthos-audit-logs-forwarder DaemonSet has been down for longer than 15 minutes. |
| AuditLogsLokiDown | Critical | The audit-logs-loki StatefulSet has been in a non-ready state for longer than 15 minutes. |