Monitoring Anthos Config Management

Anthos Config Management uses Prometheus to collect and expose metrics about its processes.

You can also configure Stackdriver Monitoring to pull custom metrics from Prometheus. The custom metrics are then visible in both Prometheus and Stackdriver. For more information, see Using Prometheus.

Scraping the Metrics

All Prometheus metrics are available for scraping at port 8675. Before you can scrape metrics, you need to configure your cluster for Prometheus in one of two ways. Either:

  • Follow the Prometheus documentation to configure your cluster for scraping, or

  • Use the Prometheus Operator provided by CoreOS along with the following manifests, which scrape all Anthos Config Management metrics every 10 seconds.

    1. Create a temporary directory to hold the manifest files.

      mkdir acm-monitor
      cd acm-monitor
      
    2. Download the Prometheus Operator manifest from the CoreOS repository, using the curl command:

      curl -o bundle.yaml https://raw.githubusercontent.com/coreos/prometheus-operator/master/bundle.yaml
      

      This manifest is configured to use the default Namespace, which is not recommended. The next step modifies the configuration to use a Namespace called monitoring instead. To use a different Namespace, substitute it where you see monitoring in the remaining steps.

    3. Create a file to update the Namespace of the ClusterRoleBinding in the bundle above.

      # patch-crb.yaml
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: prometheus-operator
      subjects:
      - kind: ServiceAccount
        name: prometheus-operator
        namespace: monitoring # we are patching from default namespace
      
    4. Create a kustomization file that applies the patch and modifies the Namespace for other resources in the manifest.

      # kustomization.yaml
      resources:
      - bundle.yaml
      
      namespace: monitoring
      
      patchesStrategicMerge:
      - patch-crb.yaml
      
    5. Create the monitoring Namespace. You can use a different name for the Namespace, but if you do, also change the value of namespace in the YAML manifests from the previous steps.

      kubectl create namespace monitoring
      
    6. Apply the kustomized manifest using the following commands:

      kubectl apply -k .
      
      until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; \
      do date; sleep 1; echo ""; done

      The second command blocks until the CRDs are available on the cluster.

    7. Create the manifest for the resources necessary to configure a Prometheus server which scrapes metrics from Anthos Config Management.

      # acm.yaml
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: prometheus-acm
        namespace: monitoring
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: prometheus-acm
      rules:
      - apiGroups: [""]
        resources:
        - nodes
        - services
        - endpoints
        - pods
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources:
        - configmaps
        verbs: ["get"]
      - nonResourceURLs: ["/metrics"]
        verbs: ["get"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: prometheus-acm
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: prometheus-acm
      subjects:
      - kind: ServiceAccount
        name: prometheus-acm
        namespace: monitoring
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Prometheus
      metadata:
        name: acm
        namespace: monitoring
        labels:
          prometheus: acm
      spec:
        replicas: 2
        serviceAccountName: prometheus-acm
        serviceMonitorSelector:
          matchLabels:
            prometheus: config-management
        podMonitorSelector:
          matchLabels:
            prometheus: config-management
        alerting:
          alertmanagers:
          - namespace: default
            name: alertmanager
            port: web
        resources:
          requests:
            memory: 400Mi
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: prometheus-acm
        namespace: monitoring
        labels:
          prometheus: acm
      spec:
        type: NodePort
        ports:
        - name: web
          nodePort: 31900
          port: 9190
          protocol: TCP
          targetPort: web
        selector:
          app: prometheus
          prometheus: acm
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: acm-service
        namespace: monitoring
        labels:
          prometheus: config-management
      spec:
        selector:
          matchLabels:
            monitored: "true"
        namespaceSelector:
          matchNames:
          - config-management-system
        endpoints:
        - port: metrics
          interval: 10s
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: cnrm
        namespace: monitoring
        labels:
          prometheus: config-management
      spec:
        endpoints:
        - interval: 10s
          port: metrics
        namespaceSelector:
          matchNames:
          - cnrm-system
        selector:
          matchLabels:
            cnrm.cloud.google.com/monitored: "true"
            cnrm.cloud.google.com/system: "true"
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        name: acm-pod
        namespace: monitoring
        labels:
          prometheus: config-management
      spec:
        selector:
          matchLabels:
            monitored: "true"
        namespaceSelector:
          matchNames:
          - gatekeeper-system
        podMetricsEndpoints:
        - port: metrics
          interval: 10s
      
    8. Apply the manifest using the following commands:

      kubectl apply -f acm.yaml
      
      until kubectl rollout status statefulset/prometheus-acm -n monitoring; \
      do sleep 1; done
      

      The second command blocks until the Pods are running.

    9. You can verify the installation by forwarding the web port of the Prometheus server to your local machine.

      kubectl -n monitoring port-forward svc/prometheus-acm 9190
      

      You can now access the Prometheus web UI at http://localhost:9190.

    10. Remove the temporary directory.

      cd ..
      rm -rf acm-monitor
      

Available metrics

Anthos Config Management collects the following metrics and makes them available to Prometheus. The Labels column lists all labels that are applicable to each metric. Metrics without labels represent a single measurement over time while metrics with labels represent multiple measurements, one for each combination of label values.

If this table becomes out of date, you can filter metrics by prefix in the Prometheus user interface; all of the metrics start with the prefix gkeconfig_.

| Name | Type | Labels | Description |
|---|---|---|---|
| gkeconfig_importer_cycle_duration_seconds_bucket | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (distributed into buckets by duration of each cycle) |
| gkeconfig_importer_cycle_duration_seconds_count | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (ignoring duration) |
| gkeconfig_importer_cycle_duration_seconds_sum | Histogram | status | Sum of the durations of all cycles that the importer has attempted to import configs to the cluster |
| gkeconfig_importer_namespace_configs | Gauge | | Number of namespace configs in current state |
| gkeconfig_monitor_errors | Gauge | component | Number of errors in the config repo, grouped by the component where they occurred |
| gkeconfig_monitor_configs | Gauge | state | Number of configs (cluster and namespace) grouped by their sync status |
| gkeconfig_monitor_last_import_timestamp | Gauge | | Timestamp of the most recent import |
| gkeconfig_monitor_last_sync_timestamp | Gauge | | Timestamp of the most recent sync |
| gkeconfig_monitor_sync_latency_seconds_bucket | Histogram | | Number of import-to-sync measurements taken (distributed into buckets by latency between the two) |
| gkeconfig_monitor_sync_latency_seconds_count | Histogram | | Number of import-to-sync measurements taken (ignoring latency between the two) |
| gkeconfig_monitor_sync_latency_seconds_sum | Histogram | | Sum of the latencies of all import-to-sync measurements taken |
| gkeconfig_syncer_api_duration_seconds_bucket | Histogram | operation, type, status | Number of calls made by the syncer to the API server (distributed into buckets by duration of each call) |
| gkeconfig_syncer_api_duration_seconds_count | Histogram | operation, type, status | Number of calls made by the syncer to the API server (ignoring duration) |
| gkeconfig_syncer_api_duration_seconds_sum | Histogram | operation, type, status | Sum of the durations of all calls made by the syncer to the API server |
| gkeconfig_syncer_controller_restarts_total | Counter | source | Total number of restarts for the namespace and cluster config controllers |
| gkeconfig_syncer_operations_total | Counter | operation, type, status | Total number of operations that have been performed to sync resources to configs |
| gkeconfig_syncer_reconcile_duration_seconds_bucket | Histogram | type, status | Number of reconcile events processed by the syncer (distributed into buckets by duration) |
| gkeconfig_syncer_reconcile_duration_seconds_count | Histogram | type, status | Number of reconcile events processed by the syncer (ignoring duration) |
| gkeconfig_syncer_reconcile_duration_seconds_sum | Histogram | type, status | Sum of the durations of all reconcile events processed by the syncer |
| gkeconfig_syncer_reconcile_event_timestamps | Gauge | type | Timestamps when syncer reconcile events occurred |
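As one example of working with the histogram metrics above, the average import-to-sync latency over a recent window can be computed with the standard Prometheus sum/count pattern (a sketch; the 5m range is arbitrary):

```
rate(gkeconfig_monitor_sync_latency_seconds_sum[5m])
/
rate(gkeconfig_monitor_sync_latency_seconds_count[5m])
```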

If you are using Config Connector, you can find the list of metrics in Monitoring Config Connector with Prometheus.

If Policy Controller is enabled on your cluster, the following additional metrics are available (all prefixed with gatekeeper_):

| Name | Type | Labels | Description |
|---|---|---|---|
| gatekeeper_audit_duration_seconds | Histogram | | Audit cycle duration distribution |
| gatekeeper_audit_last_run_time | Gauge | | The epoch timestamp of the last audit run, given as floating-point seconds |
| gatekeeper_constraint_template_ingestion_count | Counter | status | Total number of constraint template ingestion actions |
| gatekeeper_constraint_template_ingestion_duration_seconds | Histogram | status | Constraint template ingestion duration distribution |
| gatekeeper_constraint_templates | Gauge | status | Current number of constraint templates |
| gatekeeper_constraints | Gauge | enforcement_action, status | Current number of constraints |
| gatekeeper_request_count | Counter | admission_status | Count of admission requests from the API server |
| gatekeeper_request_duration_seconds | Histogram | admission_status | Admission request duration distribution |
| gatekeeper_violations | Gauge | enforcement_action | Number of audit violations detected in the last audit cycle |
| gatekeeper_watch_manager_intended_watch_gvk | Gauge | | How many unique GroupVersionKinds Policy Controller is meant to be watching (a combination of synced resources and constraints) |
| gatekeeper_watch_manager_watched_gvk | Gauge | | How many unique GroupVersionKinds Policy Controller is actually watching; this is meant to converge to gatekeeper_watch_manager_intended_watch_gvk |
| gatekeeper_watch_manager_is_running | Gauge | | Whether the watch manager is running (1 or 0); if 0, new constraints and synced resources are not ingested |
| gatekeeper_watch_manager_last_restart_check_time | Gauge | | The epoch timestamp of the last time Policy Controller's watch manager was checked for restart, given as floating-point seconds; this should happen frequently, on the order of seconds |
| gatekeeper_watch_manager_last_restart_time | Gauge | | The epoch timestamp of the last time Policy Controller's watch manager was restarted; a restart is expected when the set of watched resources changes (caused by modifying the sync config or adding/removing constraint templates) |
| gatekeeper_watch_manager_restart_attempts | Counter | | Total number of times Policy Controller's watch manager has restarted; a rapidly increasing number could indicate flapping |
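For instance, using the Policy Controller metrics above, you might graph current audit violations per enforcement action, or alert when the watch manager stops running (query sketches):

```
sum by (enforcement_action) (gatekeeper_violations)

gatekeeper_watch_manager_is_running == 0
```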

Example debugging procedures

The following examples illustrate some patterns for using Prometheus metrics, object status fields, and object annotations to detect and diagnose problems related to Anthos Config Management. These examples show how you can start with high-level monitoring that detects a problem, then progressively refine your search to drill down and diagnose its root cause.

Querying configs by status

The monitor process provides high-level metrics that give useful insight into an overall view of how Anthos Config Management is operating on the cluster. You can see if any errors have occurred, and can even set up alerts for them.

gkeconfig_monitor_errors
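As an example of alerting on this metric, a Prometheus alerting rule might look like the following sketch. The group name, alert name, and thresholds are illustrative; gkeconfig_monitor_errors and its component label come from the metrics table above.

```yaml
# Illustrative Prometheus alerting rule (names and thresholds are arbitrary).
groups:
- name: acm
  rules:
  - alert: AnthosConfigManagementErrors
    expr: gkeconfig_monitor_errors > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Anthos Config Management is reporting errors in {{ $labels.component }}"
```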

Using nomos status to display errors

In addition to using Prometheus metrics to monitor the status of Anthos Config Management on your clusters, you can use the nomos status command, which prints errors from all of your clusters on the command line.

Querying import and sync operations by status

Anthos Config Management uses a two-step process to apply configs from the repo to a cluster. The gkeconfig_monitor_errors metric is labeled by component so you can see where any errors occurred.

gkeconfig_monitor_errors{component="importer"}
gkeconfig_monitor_errors{component="syncer"}

You can also check the metrics for the importer and syncer processes themselves.

gkeconfig_importer_cycle_duration_seconds_count{status="error"}
gkeconfig_syncer_reconcile_duration_seconds_count{status="error"}
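Because these histogram counters only ever increase, taking a rate over a recent window distinguishes ongoing errors from errors that occurred long ago (a sketch; the 5m range is arbitrary):

```
rate(gkeconfig_importer_cycle_duration_seconds_count{status="error"}[5m])
rate(gkeconfig_syncer_reconcile_duration_seconds_count{status="error"}[5m])
```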

Checking config object status

Anthos Config Management defines two custom Kubernetes objects: ClusterConfig and NamespaceConfig. These objects define a status field which contains information about the change that was last applied to the config and any errors that occurred. For instance, if there is an error in a Namespace called shipping-dev, you can check the status of the corresponding NamespaceConfig.

kubectl get namespaceconfig shipping-dev -o yaml

Checking an object's token annotation

You may want to know when a managed Kubernetes object was last updated by Anthos Config Management. Each managed object is annotated with the hash of the Git commit when it was last modified, as well as the path to the config that contained the modification.

kubectl get clusterrolebinding namespace-readers -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    configmanagement.gke.io/source-path: cluster/namespace-reader-clusterrolebinding.yaml
    configmanagement.gke.io/token: bbb6a1e2f3db692b17201da028daff0d38797771
  name: namespace-readers
...
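To read just the token annotation programmatically, you can pipe the object's JSON through jq. The sketch below runs against an inline sample that mirrors the annotation shown above; against a live cluster you would replace the printf with `kubectl get clusterrolebinding namespace-readers -o json`.

```shell
# Sketch: extract the configmanagement.gke.io/token annotation with jq.
# The sample JSON mirrors the output above; on a live cluster, replace the
# printf with: kubectl get clusterrolebinding namespace-readers -o json
printf '%s' '{"metadata":{"annotations":{"configmanagement.gke.io/token":"bbb6a1e2f3db692b17201da028daff0d38797771"}}}' \
  | jq -r '.metadata.annotations["configmanagement.gke.io/token"]'
```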

For more information, see labels and annotations.

What's next?