Monitor Config Sync using Prometheus

Config Sync uses Prometheus to collect and show metrics related to its processes.

You can also configure Cloud Monitoring to pull custom metrics from Prometheus. Then you can see custom metrics in both Prometheus and Monitoring. For more information, see Using Prometheus.

If you are syncing from multiple repositories in Config Sync, there are more ways that you can monitor your resources in addition to using Prometheus. To learn more, see Monitoring Config Sync in multi-repo mode.

Scraping the metrics

All Prometheus metrics are available for scraping at port 8675. Before you can scrape metrics, you need to configure your cluster for Prometheus in one of two ways:

  • Follow the Prometheus documentation to configure your cluster for scraping (a sample scrape configuration appears after these steps), or

  • Use the Prometheus Operator along with the following manifests, which scrape all Anthos Config Management metrics every 10 seconds.

    1. Create a temporary directory to hold the manifest files.

      mkdir acm-monitor
      cd acm-monitor
      
    2. Download the Prometheus Operator manifest from the CoreOS repository, using the curl command:

      curl -o bundle.yaml https://raw.githubusercontent.com/coreos/prometheus-operator/master/bundle.yaml
      

      This manifest is configured to use the default namespace, which is not recommended. The next step modifies the configuration to use a namespace called monitoring instead. To use a different namespace, substitute it where you see monitoring in the remaining steps.

    3. Create a file to update the namespace of the ClusterRoleBinding in the bundle above.

      # patch-crb.yaml
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: prometheus-operator
      subjects:
      - kind: ServiceAccount
        name: prometheus-operator
        namespace: monitoring # we are patching from default namespace
      
    4. Create a kustomization.yaml file that applies the patch and modifies the namespace for other resources in the manifest.

      # kustomization.yaml
      resources:
      - bundle.yaml
      
      namespace: monitoring
      
      patchesStrategicMerge:
      - patch-crb.yaml
      
    5. Create the monitoring namespace. You can use a different name for the namespace, but if you do, also change the value of namespace in the YAML manifests from the previous steps.

      kubectl create namespace monitoring
      
    6. Apply the kustomized manifest using the following commands:

      kubectl apply -k .
      
      until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; \
      do date; sleep 1; echo ""; done

      The second command blocks until the CRDs are available on the cluster.

    7. Create the manifest for the resources necessary to configure a Prometheus server that scrapes metrics from Anthos Config Management.

      # acm.yaml
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: prometheus-acm
        namespace: monitoring
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: prometheus-acm
      rules:
      - apiGroups: [""]
        resources:
        - nodes
        - services
        - endpoints
        - pods
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources:
        - configmaps
        verbs: ["get"]
      - nonResourceURLs: ["/metrics"]
        verbs: ["get"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: prometheus-acm
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: prometheus-acm
      subjects:
      - kind: ServiceAccount
        name: prometheus-acm
        namespace: monitoring
      ---
      apiVersion: monitoring.coreos.com/v1
      kind: Prometheus
      metadata:
        name: acm
        namespace: monitoring
        labels:
          prometheus: acm
      spec:
        replicas: 2
        serviceAccountName: prometheus-acm
        serviceMonitorSelector:
          matchLabels:
            prometheus: config-management
        podMonitorSelector:
          matchLabels:
            prometheus: config-management
        alerting:
          alertmanagers:
          - namespace: default
            name: alertmanager
            port: web
        resources:
          requests:
            memory: 400Mi
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: prometheus-acm
        namespace: monitoring
        labels:
          prometheus: acm
      spec:
        type: NodePort
        ports:
        - name: web
          nodePort: 31900
          port: 9190
          protocol: TCP
          targetPort: web
        selector:
          app: prometheus
          prometheus: acm
      --- 
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: acm-service
        namespace: monitoring
        labels:
          prometheus: config-management
      spec:
        selector:
          matchLabels:
            monitored: "true"
        namespaceSelector:
          matchNames:
          # If you are using multi-repo mode, change
          # config-management-system to config-management-monitoring
          - config-management-system 
        endpoints:
        - port: metrics
          interval: 10s 
      
    8. Apply the manifest using the following commands:

      kubectl apply -f acm.yaml
      
      until kubectl rollout status statefulset/prometheus-acm -n monitoring; \
      do sleep 1; done
      

      The second command blocks until the Pods are running.

    9. You can verify the installation by forwarding the web port of the Prometheus server to your local machine.

      kubectl -n monitoring port-forward svc/prometheus-acm 9190
      

      You can now access the Prometheus web UI at http://localhost:9190.
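
      To confirm that the server is scraping Config Sync metrics, you can also query the Prometheus HTTP API directly. The following is a quick sanity check; it assumes the port-forward above is still running:

      curl 'http://localhost:9190/api/v1/query?query=up'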

    10. Remove the temporary directory.

      cd ..
      rm -rf acm-monitor
      
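If you configure scraping yourself (the first option above) rather than using the Prometheus Operator, a minimal scrape job might look like the following sketch. The job name is arbitrary, and the relabel rule assumes that, as in the ServiceMonitor above, the metrics endpoints in the config-management-system namespace expose a port named metrics:

# prometheus.yml (fragment); a sketch, not a complete configuration
scrape_configs:
- job_name: config-sync        # arbitrary job name
  scrape_interval: 10s
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - config-management-system
  relabel_configs:
  # Keep only endpoints whose port is named "metrics"
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    regex: metrics
    action: keep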

Available metrics

Config Sync collects the following metrics and makes them available to Prometheus. The Labels column lists all labels that are applicable to each metric. Metrics without labels represent a single measurement over time while metrics with labels represent multiple measurements, one for each combination of label values.

If this table becomes out of date, you can filter metrics by prefix in the Prometheus user interface; all Config Sync metrics start with the prefix gkeconfig_.
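
For example, the following PromQL query lists every time series whose metric name matches that prefix:

{__name__=~"gkeconfig_.*"}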

| Name | Type | Labels | Description |
| --- | --- | --- | --- |
| gkeconfig_importer_cycle_duration_seconds_bucket | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (distributed into buckets by duration of each cycle) |
| gkeconfig_importer_cycle_duration_seconds_count | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (ignoring duration) |
| gkeconfig_importer_cycle_duration_seconds_sum | Histogram | status | Sum of the durations of all cycles that the importer has attempted to import configs to the cluster |
| gkeconfig_importer_namespace_configs | Gauge | | Number of namespace configs in current state |
| gkeconfig_monitor_errors | Gauge | component | Number of errors in the config repo, grouped by the component where they occurred |
| gkeconfig_monitor_configs | Gauge | state | Number of configs (cluster and namespace) grouped by their sync status |
| gkeconfig_monitor_last_import_timestamp | Gauge | | Timestamp of the most recent import |
| gkeconfig_monitor_last_sync_timestamp | Gauge | | Timestamp of the most recent sync |
| gkeconfig_monitor_sync_latency_seconds_bucket | Histogram | | Number of import-to-sync measurements taken (distributed into buckets by latency between the two) |
| gkeconfig_monitor_sync_latency_seconds_count | Histogram | | Number of import-to-sync measurements taken (ignoring latency between the two) |
| gkeconfig_monitor_sync_latency_seconds_sum | Histogram | | Sum of the latencies of all import-to-sync measurements taken |
| gkeconfig_syncer_api_duration_seconds_bucket | Histogram | operation, type, status | Number of calls made by the syncer to the API server (distributed into buckets by duration of each call) |
| gkeconfig_syncer_api_duration_seconds_count | Histogram | operation, type, status | Number of calls made by the syncer to the API server (ignoring duration) |
| gkeconfig_syncer_api_duration_seconds_sum | Histogram | operation, type, status | Sum of the durations of all calls made by the syncer to the API server |
| gkeconfig_syncer_controller_restarts_total | Counter | source | Total number of restarts for the namespace and cluster config controllers |
| gkeconfig_syncer_operations_total | Counter | operation, type, status | Total number of operations that have been performed to sync resources to configs |
| gkeconfig_syncer_reconcile_duration_seconds_bucket | Histogram | type, status | Number of reconcile events processed by the syncer (distributed into buckets by duration) |
| gkeconfig_syncer_reconcile_duration_seconds_count | Histogram | type, status | Number of reconcile events processed by the syncer (ignoring duration) |
| gkeconfig_syncer_reconcile_duration_seconds_sum | Histogram | type, status | Sum of the durations of all reconcile events processed by the syncer |
| gkeconfig_syncer_reconcile_event_timestamps | Gauge | type | Timestamps when syncer reconcile events occurred |

Example debugging procedures

The following examples illustrate some patterns for using Prometheus metrics, object status fields, and object annotations to detect and diagnose problems related to Config Sync. These examples show how you can start with high-level monitoring that detects a problem, and then progressively refine your search to drill down and diagnose the root cause.

Query configs by status

The monitor process provides high-level metrics that give you a useful overall view of how Config Sync is operating on the cluster. You can see whether any errors have occurred, and you can set up alerts for them.

gkeconfig_monitor_errors
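
For example, the following queries summarize current errors by component, and count configs that are in an error state. The exact values of the state label are not listed in this document, so "error" below is an assumption:

# Errors grouped by the component where they occurred
sum by (component) (gkeconfig_monitor_errors)

# Configs currently in an error state ("error" is an assumed label value)
sum(gkeconfig_monitor_configs{state="error"})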

Query metrics by reconciler

If you are using Config Sync to sync to multiple repositories, then you can monitor the RootSync and RepoSync objects. The RootSync and RepoSync objects are instrumented with high-level metrics that give you useful insight into how Config Sync is operating on the cluster. Almost all metrics are tagged by the reconciler name, so you can see if any errors have occurred and can set up alerts for them in Prometheus.

A reconciler is a Pod that syncs manifests from a Git repository to a cluster. When you create a RootSync object, Config Sync creates a reconciler called root-reconciler. When you create a RepoSync object, Config Sync creates a reconciler called ns-reconciler-NAMESPACE, where NAMESPACE is the namespace where you created your RepoSync object.

In Prometheus, you can use the following filters for the reconcilers:

# Querying the root reconciler
config_sync_reconciler_errors{reconciler="root-reconciler"}

# Querying the namespace reconciler for a namespace called retail
config_sync_reconciler_errors{reconciler="ns-reconciler-retail"}
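
As a sketch, a Prometheus alerting rule built on this metric might look like the following; the rule name, duration, and severity label are illustrative:

# config-sync-rules.yaml (a sketch)
groups:
- name: config-sync-alerts
  rules:
  - alert: ConfigSyncReconcilerErrors
    # Fire if any reconciler has reported errors for more than five minutes
    expr: config_sync_reconciler_errors > 0
    for: 5m
    labels:
      severity: warning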

Use nomos status to display errors

In addition to using Prometheus metrics to monitor the status of Config Sync on your clusters, you can use the nomos status command, which prints errors from all of your clusters on the command line.
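
For example:

# Print sync status and any errors for all clusters in your kubeconfig
nomos status

# Limit the output to specific clusters (assuming a context named my-cluster)
nomos status --contexts=my-cluster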

Query import and sync operations by status

Config Sync uses a two-step process to apply configs from the repo to a cluster: the importer reads configs from the repo, and the syncer applies them to the cluster. The gkeconfig_monitor_errors metric is labeled by component, so you can see in which of these steps any errors occurred.

gkeconfig_monitor_errors{component="importer"}
gkeconfig_monitor_errors{component="syncer"}

You can also check the metrics for the importer and syncer processes themselves.

gkeconfig_importer_cycle_duration_seconds_count{status="error"}
gkeconfig_syncer_reconcile_duration_seconds_count{status="error"}
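
To turn these counters into an error rate for dashboards or alerts, you can wrap them in rate(), for example:

# Fraction of syncer reconciles that failed over the last ten minutes
sum(rate(gkeconfig_syncer_reconcile_duration_seconds_count{status="error"}[10m]))
/
sum(rate(gkeconfig_syncer_reconcile_duration_seconds_count[10m]))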

In Config Sync's multi-repo mode, sourcing configs from a Git repository and syncing them to a cluster are handled by the reconcilers. The reconciler_errors metric is labeled by component, so you can see where any errors occurred.

In Prometheus, you could use the following queries:

# Check for errors that occurred when sourcing configs from the Git repository.
config_sync_reconciler_errors{component="source"}

# Check for errors that occurred when syncing configs to the cluster.
config_sync_reconciler_errors{component="sync"}

You can also check the metrics for the source and sync processes themselves:

config_sync_parse_duration_seconds{status="error"}
config_sync_apply_duration_seconds{status="error"}
config_sync_remediate_duration_seconds{status="error"}
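
Because these are duration histograms, wrapping the corresponding _count series in rate() gives an error rate over time (a sketch, assuming the standard Prometheus histogram suffixes):

# Failed applies per second, averaged over the last five minutes
rate(config_sync_apply_duration_seconds_count{status="error"}[5m])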

Check a config's object status

Config Sync defines two custom Kubernetes objects: ClusterConfig and NamespaceConfig. These objects define a status field which contains information about the change that was last applied to the config and any errors that occurred. For instance, if there is an error in a namespace called shipping-dev, you can check the status of the corresponding NamespaceConfig.

kubectl get namespaceconfig shipping-dev -o yaml
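
To pull out just the status field, you can use a jsonpath query:

kubectl get namespaceconfig shipping-dev -o jsonpath='{.status}'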

Check an object's token annotation

You may want to know when a managed Kubernetes object was last updated by Config Sync. Each managed object is annotated with the hash of the Git commit when it was last modified, as well as the path to the config that contained the modification.

kubectl get clusterrolebinding namespace-readers -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    configmanagement.gke.io/source-path: cluster/namespace-reader-clusterrolebinding.yaml
    configmanagement.gke.io/token: bbb6a1e2f3db692b17201da028daff0d38797771
  name: namespace-readers
...
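
You can also extract the commit token directly with jsonpath; note that the dots in the annotation key must be escaped:

kubectl get clusterrolebinding namespace-readers \
  -o jsonpath="{.metadata.annotations['configmanagement\.gke\.io/token']}"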

For more information, see labels and annotations.

What's next