Monitor Config Sync
This page describes how to monitor your Config Sync resources.
When you enable the RootSync and RepoSync APIs, Config Sync uses OpenCensus to create and record metrics, and OpenTelemetry to export its metrics to Prometheus and Cloud Monitoring. You can also use OpenTelemetry to export metrics to a custom monitoring system. This gives you three ways to monitor your resources: with Cloud Monitoring, with a custom monitoring system, or with Prometheus.
If you don't enable the RootSync and RepoSync APIs, you can only monitor resources with Prometheus.
Available OpenTelemetry metrics
When you have the RootSync and RepoSync APIs enabled, Config Sync and the Resource Group Controller collect the following metrics and make them available to OpenTelemetry. The Tags column lists all tags that are applicable to each metric. Metrics with tags represent multiple measurements, one for each combination of tag values.
Config Sync metrics
The following metrics are available from Anthos Config Management versions 1.10.1 and later.
Name | Type | Tags | Description |
---|---|---|---|
api_duration_seconds | Distribution | operation, status | The latency distribution of API server calls |
apply_duration_seconds | Distribution | status | The latency distribution of applier resource sync events |
apply_operations_total | Count | operation, status | The total number of operations that have been performed to sync resources to the source of truth |
declared_resources | Last Value | reconciler | The number of declared resources parsed from Git |
internal_errors_total | Count | reconciler, source | The total number of internal errors triggered by Config Sync |
last_sync_timestamp | Last Value | reconciler | The timestamp of the most recent sync from Git |
parser_duration_seconds | Distribution | reconciler, status, trigger, source | The latency distribution of the parse-apply-watch loop |
pipeline_error_observed | Last Value | name, reconciler, component | The status of RootSync and RepoSync custom resources. A value of 1 indicates a failure. |
reconcile_duration_seconds | Distribution | status | The latency distribution of reconcile events handled by the reconciler manager. |
reconciler_errors | Last Value | reconciler, component | The number of errors in the RootSync and RepoSync reconcilers |
remediate_duration_seconds | Distribution | status | The latency distribution of remediator reconciliation events |
resource_conflicts_total | Count | reconciler | The total number of resource conflicts resulting from a mismatch between the cached resources and cluster resources |
resource_fights_total | Count | reconciler | The total number of resources that are being synced too frequently. Any result higher than zero indicates a problem. For more information, see KNV2005: ResourceFightWarning. |
rendering_count_total | Count | reconciler | The count of sync executions that used Kustomize or Helm chart rendering on the resources. |
skip_rendering_count_total | Count | reconciler | The count of sync executions that did not use Kustomize or Helm chart rendering on the resources. |
resource_override_count_total | Count | reconciler, container, resource | The count of resource overrides specified in resource patch |
git_sync_depth_override_count_total | Count | - | The count of Root/RepoSync objects where the spec.override.gitSyncDepth override is set. Git depth can be used to improve performance when syncing from large repos. |
no_ssl_verify_count_total | Count | - | The count of Root/RepoSync objects with .spec.git.noSSLVerify override set. |
Resource Group Controller metrics
The Resource Group Controller is a component of Config Sync that tracks managed resources and checks whether each individual resource is ready or reconciled. The following metrics are available from Anthos Config Management versions 1.10 and later.
Name | Type | Tags | Description |
---|---|---|---|
reconcile_duration_seconds | Distribution | stallreason | The distribution of time taken to reconcile a ResourceGroup CR |
resource_group_total | Last Value | - | The current number of ResourceGroup CRs |
resource_count_total | Sum | - | The total number of resources tracked by all ResourceGroup CRs in the cluster |
resource_count | Last Value | resourcegroup | The total number of resources tracked by a ResourceGroup |
ready_resource_count_total | Sum | - | The total number of ready resources across all ResourceGroup CRs in the cluster |
ready_resource_count | Last Value | resourcegroup | The total number of ready resources in a ResourceGroup |
resource_ns_count | Last Value | resourcegroup | The number of namespaces used by resources in a ResourceGroup |
cluster_scoped_resource_count | Last Value | resourcegroup | The number of cluster-scoped resources in a ResourceGroup |
crd_count | Last Value | resourcegroup | The number of CRDs in a ResourceGroup |
kcc_resource_count_total | Sum | - | The total number of Config Connector resources across all ResourceGroup CRs in the cluster |
kcc_resource_count | Gauge | resourcegroup | The total number of Config Connector (KCC) resources in a ResourceGroup |
pipeline_error_observed | Last Value | name, reconciler, component | The status of RootSync and RepoSync custom resources. A value of 1 indicates a failure. |
Config Sync metric tags
You can use metric tags to aggregate metric data. Tags are selectable in the Group By drop-down list in the Monitoring console.
Custom tags
The custom tags for each metric are listed in the Tags column of the preceding tables.
Name | Values | Description |
---|---|---|
operation | create, patch, update, delete | The type of operation performed |
status | success, error | The execution status of an operation |
reconciler | rootsync, reposync | The type of the reconciler |
source | parser, differ, remediator | The source of the internal error |
trigger | retry, watchUpdate, managementConflict, resync, reimport | The trigger of a reconciliation event |
name | The name of the reconciler | The name of the reconciler |
component | parsing, source, sync, rendering, readiness | The name of the component or stage that the reconciliation is currently at |
container | reconciler, git-sync | The name of the container |
resource | cpu, memory | The type of the resource |
Resource tags
Config Sync custom monitoring metrics use the K8s_Pod resource type, which comes with the following tags:
Name | Description |
---|---|
project | The ID of the project |
cluster_name | The name of the cluster |
location | The location/zone of the cluster |
namespace_name | The name of the Namespace from which the metrics are exported |
pod_name | The name of the Pod from which the metrics are exported |
Understand the pipeline_error_observed metric
The `pipeline_error_observed` metric can help you quickly identify RepoSync or RootSync CRs that are not in sync, or that contain resources that are not reconciled to the desired state. This metric is available from Anthos Config Management versions 1.10 and later.

- For a successful sync by a RootSync or RepoSync, the metrics for all components (`rendering`, `source`, `sync`, `readiness`) are observed with value 0.
- When the latest commit fails the automated rendering, the metric with the component `rendering` is observed with value 1.
- When checking out the latest commit encounters an error, or the latest commit contains invalid configuration, the metric with the component `source` is observed with value 1.
- When a resource fails to be applied to the cluster, the metric with the component `sync` is observed with value 1.
- When a resource is applied but fails to reach its desired state, the metric with the component `readiness` is observed with value 1. For example, a Deployment is applied to the cluster, but the corresponding Pods are not created successfully.
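If you also export metrics to Prometheus, queries along the following lines can surface failing components. The metric name `config_sync_pipeline_error_observed` is an assumption based on the `config_sync_` prefix used by the other Prometheus queries on this page; verify the exact name in your Prometheus UI:

```promql
# Hypothetical: list RootSync/RepoSync components currently reporting a failure.
config_sync_pipeline_error_observed == 1

# Hypothetical: restrict to rendering failures only.
config_sync_pipeline_error_observed{component="rendering"} == 1
```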
Monitor resources with Cloud Monitoring
If Config Sync is running inside a Google Cloud environment that has a default service account, Config Sync automatically exports metrics to Cloud Monitoring.
If Workload Identity is enabled, complete the following steps:

1. Bind the Kubernetes ServiceAccount `default` in the namespace `config-management-monitoring` to a Google service account with the metric writer role:

   ```
   gcloud iam service-accounts add-iam-policy-binding \
       --role roles/iam.workloadIdentityUser \
       --member "serviceAccount:PROJECT_ID.svc.id.goog[config-management-monitoring/default]" \
       GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
   ```

   Replace the following:
   - `PROJECT_ID`: your project ID
   - `GSA_NAME`: the Google service account with the metric writer role

   This action requires the `iam.serviceAccounts.setIamPolicy` permission on the project.

2. Annotate the Kubernetes ServiceAccount using the email address of the Google service account:

   ```
   kubectl annotate serviceaccount \
       --namespace config-management-monitoring \
       default \
       iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
   ```
For examples of how to view these metrics, see the following Example debugging procedures section and the OpenCensus metrics in Cloud Monitoring article.
Example debugging procedures for Cloud Monitoring
The following Cloud Monitoring examples illustrate some patterns for using OpenCensus metrics to detect and diagnose problems related to Config Sync when you are using the RootSync and RepoSync APIs.
Metric format
In Cloud Monitoring, metrics have the following format: `custom.googleapis.com/opencensus/config_sync/METRIC`.

The metric name is composed of the following components:

- `custom.googleapis.com`: all custom metrics have this prefix
- `opencensus`: this prefix is added because Config Sync uses the OpenCensus library
- `config_sync/`: metrics that Config Sync exports to Cloud Monitoring have this prefix
- `METRIC`: the name of the metric that you want to query
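As a quick sanity check before pasting a name into Metrics Explorer or the Monitoring API, you can assemble the fully qualified metric name in a shell. This is a minimal sketch that uses `reconciler_errors` as the example metric:

```shell
# Compose the fully qualified Cloud Monitoring metric name from its parts.
PREFIX="custom.googleapis.com/opencensus/config_sync"
METRIC="reconciler_errors"   # the metric you want to query
echo "${PREFIX}/${METRIC}"
# → custom.googleapis.com/opencensus/config_sync/reconciler_errors
```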
Query metrics by reconciler
RootSync and RepoSync objects are instrumented with high-level metrics that give you useful insight into how Config Sync is operating on the cluster. Almost all metrics are tagged by the reconciler name, so you can see if any errors have occurred and can set up alerts for them in Cloud Monitoring.
A reconciler is a Pod, deployed as a Deployment, that syncs manifests from a Git repository to a cluster. When you create a RootSync object, Config Sync creates a reconciler called `root-reconciler`. When you create a RepoSync object, Config Sync creates a reconciler called `ns-reconciler-NAMESPACE`, where NAMESPACE is the namespace that you created your RepoSync object in.
For example, to filter by the reconciler name when you are using Cloud Monitoring, complete the following tasks:

1. In the Google Cloud console, go to Monitoring.
2. In the Monitoring navigation pane, click Metrics explorer.
3. In the Select a metric drop-down list, add `custom.googleapis.com/opencensus/config_sync/reconciler_errors`.
4. In the Filter drop-down list, select reconciler. A filter fields box appears.
5. In the filter fields box, select = in the first field and root-reconciler in the second.
6. Click Apply.

You can now see metrics for your RootSync objects.
For more instructions on how to filter by a specific data type, see Filtering the data.
Query Config Sync operations by component and status
When you have enabled the RootSync and RepoSync APIs, importing and sourcing from a Git repository and syncing to a cluster is handled by the reconcilers. The `reconciler_errors` metric is labeled by component, so you can see where any errors occurred.
For example, to filter by component when you are using Cloud Monitoring, complete the following tasks:

1. In the Google Cloud console, go to Monitoring.
2. In the Monitoring navigation pane, click Metrics explorer.
3. In the Select a metric drop-down list, add `custom.googleapis.com/opencensus/config_sync/reconciler_errors`.
4. In the Filter drop-down list, select component. A filter fields box appears.
5. In the filter fields box, select = in the first field and source in the second.
6. Click Apply.

You can now see errors that occurred when sourcing from a Git repository for your reconcilers.
You can also check the metrics for the source and sync processes themselves by querying the following metrics and filtering by the `status` tag:

- `custom.googleapis.com/opencensus/config_sync/parser_duration_seconds`
- `custom.googleapis.com/opencensus/config_sync/apply_duration_seconds`
- `custom.googleapis.com/opencensus/config_sync/remediate_duration_seconds`
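When querying through the Monitoring API rather than the console, the same selection can be expressed with a Monitoring filter string. This is a sketch; the metric and status value shown are examples, so substitute the metric you want:

```
metric.type = "custom.googleapis.com/opencensus/config_sync/apply_duration_seconds" AND
metric.labels.status = "error"
```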
Configure a custom OpenTelemetry exporter
If you want to send your metrics to a different monitoring system, you can modify the OpenTelemetry configuration. For a list of supported monitoring systems, see OpenTelemetry Collector Exporters and OpenTelemetry Collector-Contrib Exporters.
OpenTelemetry monitoring resources are managed in a separate `config-management-monitoring` namespace. To configure a custom OpenTelemetry exporter for use with Config Sync, create a ConfigMap named `otel-collector-custom` in the `config-management-monitoring` namespace. The ConfigMap must contain an `otel-collector-config.yaml` key whose value is the file contents of the custom OpenTelemetry Collector configuration. For more information on the configuration options, see the OpenTelemetry Collector configuration documentation.
The following example shows a ConfigMap with a custom logging exporter:

```yaml
# otel-collector-custom-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-custom
  namespace: config-management-monitoring
  labels:
    app: opentelemetry
    component: otel-collector
data:
  otel-collector-config.yaml: |
    receivers:
      opencensus:
    exporters:
      logging:
        logLevel: debug
    processors:
      batch:
    extensions:
      health_check:
    service:
      extensions: [health_check]
      pipelines:
        metrics:
          receivers: [opencensus]
          processors: [batch]
          exporters: [logging]
```
All custom configurations must define an `opencensus` receiver and a `metrics` pipeline. The rest of the fields are optional and configurable, but we recommend that you include a `batch` processor and a health check extension, as in the example.
After you have created the ConfigMap, use `kubectl` to create the resource:

```
kubectl apply -f otel-collector-custom-cm.yaml
```
The OpenTelemetry Collector Deployment picks up this ConfigMap and automatically restarts to apply the custom configuration.
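For example, to expose metrics to a Prometheus server instead of logging them, you could swap the `logging` exporter for the Collector's `prometheus` exporter. The following is an unverified sketch of such a variant; the listen address is an arbitrary choice, so confirm the exporter options against the OpenTelemetry Collector documentation:

```yaml
# Hypothetical otel-collector-custom ConfigMap that exports to a Prometheus endpoint.
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-custom
  namespace: config-management-monitoring
  labels:
    app: opentelemetry
    component: otel-collector
data:
  otel-collector-config.yaml: |
    receivers:
      opencensus:
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8675"   # arbitrary listen address for scraping
    processors:
      batch:
    extensions:
      health_check:
    service:
      extensions: [health_check]
      pipelines:
        metrics:
          receivers: [opencensus]
          processors: [batch]
          exporters: [prometheus]
```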
Monitor resources with Prometheus
Config Sync uses Prometheus to collect and show metrics related to its processes.
You can also configure Cloud Monitoring to pull custom metrics from Prometheus. Then you can see custom metrics in both Prometheus and Monitoring. For more information, see Using Prometheus.
Scrape the metrics
All Prometheus metrics are available for scraping at port 8675. Before you can scrape metrics, you need to configure your cluster for Prometheus in one of two ways:

- Follow the Prometheus documentation to configure your cluster for scraping, or
- Use the Prometheus Operator along with the following manifests, which scrape all Anthos Config Management metrics every 10 seconds.
1. Create a temporary directory to hold the manifest files:

   ```
   mkdir acm-monitor
   cd acm-monitor
   ```

2. Download the Prometheus Operator manifest from the coreos/prometheus-operator repository using the `curl` command:

   ```
   curl -o bundle.yaml https://raw.githubusercontent.com/coreos/prometheus-operator/master/bundle.yaml
   ```

   This manifest is configured to use the `default` namespace, which is not recommended. The next step modifies the configuration to use a namespace called `monitoring` instead. To use a different namespace, substitute it where you see `monitoring` in the remaining steps.

3. Create a file to update the namespace of the ClusterRoleBinding in the bundle above:

   ```yaml
   # patch-crb.yaml
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: prometheus-operator
   subjects:
   - kind: ServiceAccount
     name: prometheus-operator
     namespace: monitoring # we are patching from the default namespace
   ```

4. Create a `kustomization.yaml` file that applies the patch and modifies the namespace for other resources in the manifest:

   ```yaml
   # kustomization.yaml
   resources:
   - bundle.yaml
   namespace: monitoring
   patchesStrategicMerge:
   - patch-crb.yaml
   ```

5. Create the `monitoring` namespace if one does not exist. You can use a different name for the namespace, but if you do, also change the value of `namespace` in the YAML manifests from the previous steps:

   ```
   kubectl create namespace monitoring
   ```

6. Apply the Kustomize manifest using the following commands:

   ```
   kubectl apply -k .

   until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; \
     do date; sleep 1; echo ""; done
   ```

   The second command blocks until the CRDs are available on the cluster.

7. Create the manifest for the resources necessary to configure a Prometheus server that scrapes metrics from Anthos Config Management:

   ```yaml
   # acm.yaml
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: prometheus-acm
     namespace: monitoring
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: prometheus-acm
   rules:
   - apiGroups: [""]
     resources:
     - nodes
     - services
     - endpoints
     - pods
     verbs: ["get", "list", "watch"]
   - apiGroups: [""]
     resources:
     - configmaps
     verbs: ["get"]
   - nonResourceURLs: ["/metrics"]
     verbs: ["get"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: prometheus-acm
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: ClusterRole
     name: prometheus-acm
   subjects:
   - kind: ServiceAccount
     name: prometheus-acm
     namespace: monitoring
   ---
   apiVersion: monitoring.coreos.com/v1
   kind: Prometheus
   metadata:
     name: acm
     namespace: monitoring
     labels:
       prometheus: acm
   spec:
     replicas: 2
     serviceAccountName: prometheus-acm
     serviceMonitorSelector:
       matchLabels:
         prometheus: config-management
     podMonitorSelector:
       matchLabels:
         prometheus: config-management
     alerting:
       alertmanagers:
       - namespace: default
         name: alertmanager
         port: web
     resources:
       requests:
         memory: 400Mi
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: prometheus-acm
     namespace: monitoring
     labels:
       prometheus: acm
   spec:
     type: NodePort
     ports:
     - name: web
       nodePort: 31900
       port: 9190
       protocol: TCP
       targetPort: web
     selector:
       app: prometheus
       prometheus: acm
   ---
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: acm-service
     namespace: monitoring
     labels:
       prometheus: config-management
   spec:
     selector:
       matchLabels:
         monitored: "true"
     namespaceSelector:
       matchNames:
       # If you are using RootSync and RepoSync APIs, change
       # config-management-system to config-management-monitoring
       - config-management-system
     endpoints:
     - port: metrics
       interval: 10s
   ---
   ```

8. Apply the manifest using the following commands:

   ```
   kubectl apply -f acm.yaml

   until kubectl rollout status statefulset/prometheus-acm -n monitoring; \
     do sleep 1; done
   ```

   The second command blocks until the Pods are running.

9. Verify the installation by forwarding the web port of the Prometheus server to your local machine:

   ```
   kubectl -n monitoring port-forward svc/prometheus-acm 9190
   ```

   You can now access the Prometheus web UI at `http://localhost:9190`.

10. Remove the temporary directory:

    ```
    cd ..
    rm -rf acm-monitor
    ```
Available Prometheus metrics
Config Sync collects the following metrics and makes them available to Prometheus. The Labels column lists all labels that are applicable to each metric. Metrics without labels represent a single measurement over time while metrics with labels represent multiple measurements, one for each combination of label values.
If this table becomes out of sync, you can filter metrics by prefix in the Prometheus user interface; all of the Config Sync metrics start with the prefix `gkeconfig_`.
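For example, the following standard PromQL selector matches every series with that prefix:

```promql
# Match all Config Sync metrics by name prefix.
{__name__=~"gkeconfig_.+"}
```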
Name | Type | Labels | Description |
---|---|---|---|
gkeconfig_importer_cycle_duration_seconds_bucket | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (distributed into buckets by duration of each cycle) |
gkeconfig_importer_cycle_duration_seconds_count | Histogram | status | Number of cycles that the importer has attempted to import configs to the cluster (ignoring duration) |
gkeconfig_importer_cycle_duration_seconds_sum | Histogram | status | Sum of the durations of all cycles that the importer has attempted to import configs to the cluster |
gkeconfig_importer_namespace_configs | Gauge | - | Number of namespace configs in current state |
gkeconfig_monitor_errors | Gauge | component | Number of errors in the config repo grouped by the component where they occurred |
gkeconfig_monitor_configs | Gauge | state | Number of configs (cluster and namespace) grouped by their sync status |
gkeconfig_monitor_last_import_timestamp | Gauge | - | Timestamp of the most recent import |
gkeconfig_monitor_last_sync_timestamp | Gauge | - | Timestamp of the most recent sync |
gkeconfig_monitor_sync_latency_seconds_bucket | Histogram | - | Number of import-to-sync measurements taken (distributed into buckets by latency between the two) |
gkeconfig_monitor_sync_latency_seconds_count | Histogram | - | Number of import-to-sync measurements taken (ignoring latency between the two) |
gkeconfig_monitor_sync_latency_seconds_sum | Histogram | - | Sum of the latencies of all import-to-sync measurements taken |
gkeconfig_syncer_api_duration_seconds_bucket | Histogram | operation, type, status | Number of calls made by the syncer to the API server (distributed into buckets by duration of each call) |
gkeconfig_syncer_api_duration_seconds_count | Histogram | operation, type, status | Number of calls made by the syncer to the API server (ignoring duration) |
gkeconfig_syncer_api_duration_seconds_sum | Histogram | operation, type, status | Sum of the durations of all calls made by the syncer to the API server |
gkeconfig_syncer_controller_restarts_total | Counter | source | Total number of restarts for the namespace and cluster config controllers |
gkeconfig_syncer_operations_total | Counter | operation, type, status | Total number of operations that have been performed to sync resources to configs |
gkeconfig_syncer_reconcile_duration_seconds_bucket | Histogram | type, status | Number of reconcile events processed by the syncer (distributed into buckets by duration) |
gkeconfig_syncer_reconcile_duration_seconds_count | Histogram | type, status | Number of reconcile events processed by the syncer (ignoring duration) |
gkeconfig_syncer_reconcile_duration_seconds_sum | Histogram | type, status | Sum of the durations of all reconcile events processed by the syncer |
gkeconfig_syncer_reconcile_event_timestamps | Gauge | type | Timestamps when syncer reconcile events occurred |
Example debugging procedures for Prometheus
The following examples illustrate some patterns for using Prometheus metrics, object status fields, and object annotations to detect and diagnose problems related to Config Sync. These examples show how you can start with high level monitoring that detects a problem and then progressively refine your search to drill down and diagnose the root cause of the problem.
Query configs by status
The monitor process provides high-level metrics that give you useful insight into how Config Sync is operating on the cluster. You can see if any errors have occurred, and you can even set up alerts for them:

```
gkeconfig_monitor_errors
```
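For example, a Prometheus alerting rule along these lines fires when any component reports errors. This is a sketch; the group, alert name, and thresholds are placeholders to adapt to your setup:

```yaml
# Hypothetical alerting rule: fire when any Config Sync component reports errors.
groups:
- name: config-sync-example
  rules:
  - alert: ConfigSyncMonitorErrors
    expr: gkeconfig_monitor_errors > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Config Sync is reporting errors (component {{ $labels.component }})"
```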
Query metrics by reconciler
If you are using Config Sync RootSync and RepoSync APIs, then you can monitor the RootSync and RepoSync objects. The RootSync and RepoSync objects are instrumented with high-level metrics that give you useful insight into how Config Sync is operating on the cluster. Almost all metrics are tagged by the reconciler name, so you can see if any errors have occurred and can set up alerts for them in Prometheus.
A reconciler is a Pod that syncs manifests from a Git repository to a cluster. When you create a RootSync object, Config Sync creates a reconciler called `root-reconciler`. When you create a RepoSync object, Config Sync creates a reconciler called `ns-reconciler-NAMESPACE`, where NAMESPACE is the namespace that you created your RepoSync object in.
In Prometheus, you can use the following filters for the reconcilers (the metrics are tagged with the `reconciler` name):

```
# Query the root reconciler
config_sync_reconciler_errors{reconciler="root-reconciler"}

# Query the namespace reconciler for a namespace called retail
config_sync_reconciler_errors{reconciler="ns-reconciler-retail"}
```
Use nomos status to display errors
In addition to using Prometheus metrics to monitor the status of Config Sync on your clusters, you can use the `nomos status` command, which prints errors from all of your clusters on the command line.
Query import and sync operations by status
Config Sync uses a two-step process to apply configs from the repo to a cluster. The `gkeconfig_monitor_errors` metric is labeled by component, so you can see where any errors occurred:

```
gkeconfig_monitor_errors{component="importer"}
gkeconfig_monitor_errors{component="syncer"}
```
You can also check the metrics for the importer and syncer processes themselves:

```
gkeconfig_importer_cycle_duration_seconds_count{status="error"}
gkeconfig_syncer_reconcile_duration_seconds_count{status="error"}
```
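Because these `_count` series are cumulative counters, their rate of increase is usually more informative than their raw value. For example, the following standard PromQL expression shows the importer error-cycle rate over the last five minutes:

```promql
# Importer error cycles per second, averaged over a 5-minute window.
rate(gkeconfig_importer_cycle_duration_seconds_count{status="error"}[5m])
```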
When you have enabled the RootSync and RepoSync APIs, importing and sourcing from a Git repository and syncing to a cluster is handled by the reconcilers. The `reconciler_errors` metric is labeled by component, so you can see where any errors occurred.
In Prometheus, you could use the following queries:

```
# Check for errors that occurred when sourcing configs from the Git repository.
config_sync_reconciler_errors{component="source"}

# Check for errors that occurred when syncing configs to the cluster.
config_sync_reconciler_errors{component="sync"}
```
You can also check the metrics for the source and sync processes themselves:

```
config_sync_parse_duration_seconds{status="error"}
config_sync_apply_duration_seconds{status="error"}
config_sync_remediate_duration_seconds{status="error"}
```
Check a config's object status
Config Sync defines two custom Kubernetes objects: ClusterConfig and NamespaceConfig. These objects define a status field that contains information about the change that was last applied to the config and any errors that occurred. For instance, if there is an error in a namespace called `shipping-dev`, you can check the status of the corresponding NamespaceConfig:

```
kubectl get namespaceconfig shipping-dev -o yaml
```
Check an object's token annotation
You may want to know when a managed Kubernetes object was last updated by Config Sync. Each managed object is annotated with the hash of the Git commit when it was last modified, as well as the path to the config that contained the modification.
```
kubectl get clusterrolebinding namespace-readers -o yaml
```

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    configmanagement.gke.io/source-path: cluster/namespace-reader-clusterrolebinding.yaml
    configmanagement.gke.io/token: bbb6a1e2f3db692b17201da028daff0d38797771
  name: namespace-readers
...
```
For more information, see labels and annotations.
What's next
- Learn more about how to monitor RootSync and RepoSync objects.