Anthos Config Management uses Prometheus to collect and show metrics related to its processes.
Scraping the Metrics
All Prometheus metrics are available for scraping at port 8675. Before you can scrape metrics, you need to configure your cluster for Prometheus in one of two ways. Either:
Follow the Prometheus documentation to configure your cluster for scraping, or
Use the Prometheus Operator provided by CoreOS along with the following manifests, which will scrape all Anthos Config Management metrics every 10 seconds.
Create a temporary directory to hold the manifest files.
mkdir acm-monitor
cd acm-monitor
Download the Prometheus Operator manifest from the CoreOS repository.
curl -o bundle.yaml https://raw.githubusercontent.com/coreos/prometheus-operator/master/bundle.yaml
This manifest is configured to use the `default` Namespace, which is not recommended. The next step modifies the configuration to use a Namespace called `monitoring` instead. To use a different Namespace, substitute it where you see `monitoring` in the remaining steps.
Create a file to update the Namespace of the ClusterRoleBinding in the bundle above.
# patch-crb.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring # we are patching from default namespace
Create a `kustomization` file that applies the patch and modifies the Namespace for other resources in the manifest.
# kustomization.yaml
resources:
- bundle.yaml
namespace: monitoring
patchesStrategicMerge:
- patch-crb.yaml
Create the `monitoring` Namespace. You can use a different name for the Namespace, but if you do, also change the value of `namespace` in the YAML manifests from the previous steps.
kubectl create namespace monitoring
Apply the kustomized manifest using the following commands:
kubectl apply -k .

until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; \
do date; sleep 1; echo ""; done
The second command blocks until the CRDs are available on the cluster.
Create the manifest for the resources necessary to configure a Prometheus server which scrapes metrics from Anthos Config Management.
# acm.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-acm
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus-acm
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus-acm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-acm
subjects:
- kind: ServiceAccount
  name: prometheus-acm
  namespace: monitoring
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: acm
  namespace: monitoring
  labels:
    prometheus: acm
spec:
  replicas: 2
  serviceAccountName: prometheus-acm
  serviceMonitorSelector:
    matchLabels:
      prometheus: config-management
  podMonitorSelector:
    matchLabels:
      prometheus: config-management
  alerting:
    alertmanagers:
    - namespace: default
      name: alertmanager
      port: web
  resources:
    requests:
      memory: 400Mi
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-acm
  namespace: monitoring
  labels:
    prometheus: acm
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 31900
    port: 9190
    protocol: TCP
    targetPort: web
  selector:
    app: prometheus
    prometheus: acm
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: acm-service
  namespace: monitoring
  labels:
    prometheus: config-management
spec:
  selector:
    matchLabels:
      monitored: "true"
  namespaceSelector:
    matchNames:
    - config-management-system
  endpoints:
  - port: metrics
    interval: 10s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cnrm
  namespace: monitoring
  labels:
    prometheus: config-management
spec:
  endpoints:
  - interval: 10s
    port: metrics
  namespaceSelector:
    matchNames:
    - cnrm-system
  selector:
    matchLabels:
      cnrm.cloud.google.com/monitored: "true"
      cnrm.cloud.google.com/system: "true"
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: acm-pod
  namespace: monitoring
  labels:
    prometheus: config-management
spec:
  selector:
    matchLabels:
      monitored: "true"
  namespaceSelector:
    matchNames:
    - gatekeeper-system
  podMetricsEndpoints:
  - port: metrics
    interval: 10s
Apply the manifest using the following commands:
kubectl apply -f acm.yaml

until kubectl rollout status statefulset/prometheus-acm -n monitoring; \
do sleep 1; done
The second command blocks until the Pods are running.
You can verify the installation by forwarding the web port of the Prometheus server to your local machine.
kubectl -n monitoring port-forward svc/prometheus-acm 9190
You can now access the Prometheus web UI at `http://localhost:9190`.
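As a quick check, with the port-forward still running you can query the server's readiness endpoint from another terminal. This is a standard Prometheus endpoint, not specific to Anthos Config Management:

```shell
# Assumes the port-forward from the previous step is still active.
# Returns HTTP 200 once the server is ready to serve traffic.
curl -s http://localhost:9190/-/ready
```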
Remove the temporary directory.
cd ..
rm -rf acm-monitor
Anthos Config Management collects the following metrics and makes them available to Prometheus. The Labels column lists all labels that are applicable to each metric. Metrics without labels represent a single measurement over time, while metrics with labels represent multiple measurements, one for each combination of label values.
If this table becomes out of sync, you can filter metrics by prefix in the Prometheus user interface; all of the metrics start with the prefix `gkeconfig_`.
| Name | Type | Labels | Description |
|---|---|---|---|
| | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (distributed into buckets by duration of each cycle) |
| | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (ignoring duration) |
| | Histogram | status | Sum of the durations of all cycles that the importer has attempted to import configs to the cluster |
| | Gauge | | Number of namespace configs in current state |
| `gkeconfig_monitor_errors` | Gauge | component | Number of errors in the config repo, grouped by the component where they occurred |
| | Gauge | state | Number of configs (cluster and namespace), grouped by their sync status |
| | Gauge | | Timestamp of the most recent import |
| | Gauge | | Timestamp of the most recent sync |
| | Histogram | | Number of import-to-sync measurements taken (distributed into buckets by latency between the two) |
| | Histogram | | Number of import-to-sync measurements taken (ignoring latency between the two) |
| | Histogram | | Sum of the latencies of all import-to-sync measurements taken |
| | Histogram | operation, type, status | Number of calls made by the syncer to the API server (distributed into buckets by duration of each call) |
| | Histogram | operation, type, status | Number of calls made by the syncer to the API server (ignoring duration) |
| | Histogram | operation, type, status | Sum of the durations of all calls made by the syncer to the API server |
| | Counter | source | Total number of restarts for the namespace and cluster config controllers |
| | Counter | operation, type, status | Total number of operations that have been performed to sync resources to configs |
| | Histogram | type, status | Number of reconcile events processed by the syncer (distributed into buckets by duration) |
| | Histogram | type, status | Number of reconcile events processed by the syncer (ignoring duration) |
| | Histogram | type, status | Sum of the durations of all reconcile events processed by the syncer |
| | Gauge | type | Timestamps when syncer reconcile events occurred |
If you are using Config Connector, you can find the list of metrics in Monitoring Config Connector with Prometheus.
If Policy Controller is enabled on your cluster, the following additional metrics are available (all prefixed with `gatekeeper_`):
| Name | Type | Labels | Description |
|---|---|---|---|
| | Histogram | | Audit cycle duration distribution |
| | Gauge | | The epoch timestamp of the last audit run, given as seconds in floating point |
| | Counter | status | Total number of constraint template ingestion actions |
| | Histogram | status | Constraint template ingestion duration distribution |
| | Gauge | status | Current number of constraint templates |
| | Gauge | enforcement_action, status | Current number of constraints |
| | Counter | admission_status | Count of admission requests from the API server |
| | Histogram | admission_status | Admission request duration distribution |
| | Gauge | enforcement_action | Number of audit violations detected in the last audit cycle |
| `gatekeeper_watch_manager_intended_watch_gvk` | Gauge | | How many unique GroupVersionKinds Policy Controller is meant to be watching; a combination of synced resources and constraints |
| | Gauge | | How many unique GroupVersionKinds Policy Controller is actually watching; meant to converge on being equal to `gatekeeper_watch_manager_intended_watch_gvk` |
| | Gauge | | Whether the watch manager is running, either 1 or 0; if 0, new constraints and synced resources are not ingested |
| | Gauge | | The epoch timestamp of the last time Policy Controller's watch manager was checked for restart; this should happen frequently, on the order of seconds. Given in floating-point seconds |
| | Gauge | | The epoch timestamp of the last time Policy Controller's watch manager was restarted; this is expected to happen when the set of watched resources changes (caused by modifying the sync config or adding/removing constraint templates) |
| | Counter | | The total number of times Policy Controller's watch manager has restarted; a rapidly increasing number could mean flapping |
Example debugging procedures
The following examples illustrate some patterns for using Prometheus metrics, object status fields, and object annotations to detect and diagnose problems related to Anthos Config Management. These examples show how you can start with high-level monitoring that detects a problem, then progressively refine your search to drill down and diagnose its root cause.
Querying configs by status
The monitor process provides high-level metrics that give a useful overall view of how Anthos Config Management is operating on the cluster. You can see whether any errors have occurred, and you can even set up alerts for them.
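For example, with the Prometheus Operator installed as described above, an alerting rule on the `gkeconfig_monitor_errors` metric could look like the following sketch. The rule name, labels, and duration are illustrative, and the Prometheus object from the earlier manifest would also need a `ruleSelector` matching this PrometheusRule's labels for it to be loaded:

```yaml
# Sketch only: alert when any component reports errors for 5 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: acm-alerts            # illustrative name
  namespace: monitoring
  labels:
    prometheus: acm           # must match the Prometheus object's ruleSelector
spec:
  groups:
  - name: acm.rules
    rules:
    - alert: ConfigManagementErrors
      expr: gkeconfig_monitor_errors > 0
      for: 5m
      annotations:
        summary: "Anthos Config Management errors in {{ $labels.component }}"
```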
Using `nomos status` to display errors
In addition to using Prometheus metrics to monitor the status of
Anthos Config Management on your clusters, you can use the
`nomos status` command, which prints errors from all of your clusters on the command line.
Querying import and sync operations by status
Anthos Config Management uses a two-step process to apply configs from the
repo to a cluster. The
`gkeconfig_monitor_errors` metric is labeled by component
so you can see where any errors occurred.
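As a sketch, you can run such a query through the port-forwarded Prometheus API from the verification step above; the PromQL below simply sums the error gauge by its `component` label:

```shell
# Assumes the prometheus-acm port-forward on localhost:9190 is active.
curl -s 'http://localhost:9190/api/v1/query' \
  --data-urlencode 'query=sum by (component) (gkeconfig_monitor_errors)'
```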
You can also check the metrics for the importer and syncer processes themselves.
Checking config object status
Anthos Config Management defines two custom Kubernetes objects: ClusterConfig
and NamespaceConfig. These objects define a `status` field which contains information about the change that was last applied to the config and any errors that occurred. For instance, if there is an error in a Namespace called `shipping-dev`, you can check the status of the corresponding NamespaceConfig.
kubectl get namespaceconfig shipping-dev -o yaml
Checking an object's annotations
You may want to know when a managed Kubernetes object was last updated by Anthos Config Management. Each managed object is annotated with the hash of the Git commit when it was last modified, as well as the path to the config that contained the modification.
kubectl get clusterrolebinding namespace-readers -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    configmanagement.gke.io/source-path: cluster/namespace-reader-clusterrolebinding.yaml
    configmanagement.gke.io/token: bbb6a1e2f3db692b17201da028daff0d38797771
  name: namespace-readers
...
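If you only need one annotation, a `jsonpath` output expression can extract it directly. The command below is a sketch using the same ClusterRoleBinding; note that dots inside the annotation key must be escaped:

```shell
# Prints just the Git commit hash recorded by Anthos Config Management.
kubectl get clusterrolebinding namespace-readers \
  -o jsonpath='{.metadata.annotations.configmanagement\.gke\.io/token}'
```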
For more information, see labels and annotations.