Monitor Config Sync with Prometheus

This page describes how to send metrics from Config Sync to Prometheus and how to view them there. To export metrics, we recommend using either Prometheus (this page) or Cloud Monitoring. You can also use custom metrics.

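Config Sync automatically collects metrics and exports them to Prometheus. You can also configure Cloud Monitoring to pull custom metrics from Prometheus. You can then view the custom metrics in both Prometheus and Monitoring. For more information, see Using Prometheus in the GKE documentation.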

Scrape the metrics

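All Prometheus metrics are available for scraping on port 8675. Before you can scrape metrics, you need to configure your cluster for Prometheus in one of the following two ways:

Follow the Prometheus documentation to configure your cluster for scraping, or

Use the Prometheus Operator along with the following manifests, which scrape all Config Sync metrics every 10 seconds.

Create a temporary directory to hold the manifest files.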

mkdir config-sync-monitor
cd config-sync-monitor

Download the Prometheus Operator manifest from the CoreOS repository using the curl command:
```
curl -o bundle.yaml https://raw.githubusercontent.com/coreos/prometheus-operator/master/bundle.yaml
```
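This manifest is configured to use the default namespace, which is not recommended. The next step modifies the configuration to use a namespace called monitoring. To use a different namespace, substitute it wherever monitoring appears in the remaining steps.

Create a file to update the namespace of the ClusterRoleBinding that is included in the bundle above.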

# patch-crb.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring # we are patching from default namespace

Create a kustomization.yaml file that applies the patch and modifies the namespace for the other resources in the manifest.

# kustomization.yaml
resources:
- bundle.yaml

namespace: monitoring

patchesStrategicMerge:
- patch-crb.yaml

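Create the monitoring namespace if it does not exist. You can use a different name for the namespace, but if you do, also change the value of namespace in the YAML manifests from the previous steps.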
```
kubectl create namespace monitoring
```

Apply the Kustomize manifest using the following commands:

kubectl apply -k .

until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; \
do date; sleep 1; echo ""; done

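The second command blocks until the CRDs are available on the cluster.

Create the manifest for the resources needed to configure a Prometheus server that scrapes metrics from Config Sync.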

# config-sync-monitoring.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-config-sync
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-config-sync
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-config-sync
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-config-sync
subjects:
- kind: ServiceAccount
  name: prometheus-config-sync
  namespace: monitoring
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: config-sync
  namespace: monitoring
  labels:
    prometheus: config-sync
spec:
  replicas: 2
  serviceAccountName: prometheus-config-sync
  serviceMonitorSelector:
    matchLabels:
      prometheus: config-management
  alerting:
    alertmanagers:
    - namespace: default
      name: alertmanager
      port: web
  resources:
    requests:
      memory: 400Mi
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-config-sync
  namespace: monitoring
  labels:
    prometheus: config-sync
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 31900
    port: 9190
    protocol: TCP
    targetPort: web
  selector:
    prometheus: config-sync
---

Apply the manifest using the following commands:

kubectl apply -f config-sync-monitoring.yaml

until kubectl rollout status statefulset/prometheus-config-sync -n monitoring; \
do sleep 1; done

The second command blocks until the Pods are running.

You can verify the installation by forwarding the web port of the Prometheus server to your local machine.
```
kubectl -n monitoring port-forward svc/prometheus-config-sync 9190
```
You can now access the Prometheus web UI at http://localhost:9190.

Remove the temporary directory.
```
cd ..
rm -rf config-sync-monitor
```
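
Besides the web UI, the verification step can also be done from a script. The following is a minimal Python sketch, not part of Config Sync, that assumes the port-forward to localhost:9190 from the verification step is still running in another terminal. It uses the Prometheus HTTP API instant-query endpoint (/api/v1/query); the function names and the choice of config_sync_reconciler_errors as the query are illustrative assumptions.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the kubectl port-forward from the verification step is running.
BASE_URL = "http://localhost:9190"


def instant_query_url(promql, base_url=BASE_URL):
    """Build the URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def nonzero_series(response):
    """Return (labels, value) pairs whose sample value is not zero."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    pairs = []
    for item in response["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value as string"].
        value = float(item["value"][1])
        if value != 0:
            pairs.append((item["metric"], value))
    return pairs


def print_reconciler_errors(base_url=BASE_URL):
    """Fetch config_sync_reconciler_errors and print series that report errors."""
    url = instant_query_url("config_sync_reconciler_errors", base_url)
    with urllib.request.urlopen(url, timeout=10) as resp:
        for labels, value in nonzero_series(json.load(resp)):
            print(labels, value)
```

Calling print_reconciler_errors() while the port-forward is active prints one line per series with a non-zero error count. An empty output with no exception suggests that Prometheus is reachable and that no errors are currently reported.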

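Available Prometheus metrics

Config Sync collects the following metrics and makes them available to Prometheus. The Labels column lists all labels that apply to each metric. Metrics without labels represent a single measurement over time, while metrics with labels represent multiple measurements, one for each combination of label values.

If this table becomes out of sync, you can filter metrics by prefix in the Prometheus user interface. All of the metrics start with the prefix config_sync_.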

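Name	Type	Labels	Description
`config_sync_api_duration_seconds_bucket`	Histogram	status, operation	Latency distribution of API server calls (distributed into buckets by the duration of each cycle)
`config_sync_api_duration_seconds_count`	Histogram	status, operation	Count of API server calls observed, ignoring their duration
`config_sync_api_duration_seconds_sum`	Histogram	status, operation	Sum of the durations of all API server calls
`config_sync_apply_duration_seconds_bucket`	Histogram	commit, status	Latency distribution of applying resources declared from the source of truth to a cluster (distributed into buckets by the duration of each cycle)
`config_sync_apply_duration_seconds_count`	Histogram	commit, status	Count of applies of resources declared from the source of truth to a cluster, ignoring their duration
`config_sync_apply_duration_seconds_sum`	Histogram	commit, status	Sum of the durations of all applies of resources declared from the source of truth to a cluster
`config_sync_apply_operations_total`	Counter	operation, status, controller	Number of operations performed to sync resources from the source of truth to a cluster
`config_sync_cluster_scoped_resource_count`	Gauge	resourcegroup	Number of cluster-scoped resources in a ResourceGroup
`config_sync_crd_count`	Gauge	resourcegroup	Number of CRDs in a ResourceGroup
`config_sync_declared_resources`	Gauge	commit	Number of declared resources parsed from Git
`config_sync_internal_errors_total`	Counter	source	Number of internal errors triggered by Config Sync. The metric might not appear if no internal error has occurred
`config_sync_kcc_resource_count`	Gauge	resourcegroup	Number of Config Connector resources in a ResourceGroup
`config_sync_last_apply_timestamp`	Gauge	commit, status	Timestamp of the most recent apply operation
`config_sync_last_sync_timestamp`	Gauge	commit, status	Timestamp of the most recent sync from Git
`config_sync_parser_duration_seconds_bucket`	Histogram	status, trigger, source	Latency distribution of the different stages involved in syncing from the source of truth to a cluster
`config_sync_parser_duration_seconds_count`	Histogram	status, trigger, source	Count of runs of the different stages involved in syncing from the source of truth to a cluster, ignoring their duration
`config_sync_parser_duration_seconds_sum`	Histogram	status, trigger, source	Sum of the latencies of the different stages involved in syncing from the source of truth to a cluster
`config_sync_pipeline_error_observed`	Gauge	name, reconciler, component	Status of RootSync and RepoSync custom resources. A value of 1 indicates a failure
`config_sync_ready_resource_count`	Gauge	resourcegroup	Total number of ready resources in a ResourceGroup
`config_sync_reconcile_duration_seconds_bucket`	Histogram	status	Latency distribution of reconcile events handled by the reconciler manager (distributed into buckets by the duration of each call)
`config_sync_reconcile_duration_seconds_count`	Histogram	status	Count of reconcile events handled by the reconciler manager, ignoring their duration
`config_sync_reconcile_duration_seconds_sum`	Histogram	status	Sum of the durations of all reconcile events handled by the reconciler manager
`config_sync_reconciler_errors`	Gauge	component, errorclass	Number of errors encountered while syncing resources from the source of truth to a cluster
`config_sync_remediate_duration_seconds_bucket`	Histogram	status	Latency distribution of remediator reconciliation events (distributed into buckets by duration)
`config_sync_remediate_duration_seconds_count`	Histogram	status	Count of remediator reconciliation events, ignoring their duration
`config_sync_remediate_duration_seconds_sum`	Histogram	status	Sum of the durations of all remediator reconciliation events
`config_sync_resource_count`	Gauge	resourcegroup	Number of resources tracked by a ResourceGroup
`config_sync_resource_conflicts_total`	Counter	commit	Number of resource conflicts resulting from a mismatch between the cached resources and the cluster resources. The metric might not appear if no resource conflict has occurred
`config_sync_resource_fights_total`	Counter		Number of resources that are being synced too frequently. The metric might not appear if no resource fight has occurred
`config_sync_resource_group_total`	Gauge		Number of ResourceGroup CRs
`config_sync_resource_ns_count`	Gauge	resourcegroup	Number of namespaces used by resources in a ResourceGroup
`config_sync_rg_reconcile_duration_seconds_bucket`	Histogram	stallreason	Time distribution of reconciling a ResourceGroup CR (distributed into buckets by duration)
`config_sync_rg_reconcile_duration_seconds_count`	Histogram	stallreason	Count of ResourceGroup CR reconciliations, ignoring their duration
`config_sync_rg_reconcile_duration_seconds_sum`	Histogram	stallreason	Sum of all time spent reconciling ResourceGroup CRs
`config_sync_kustomize_build_latency_bucket`	Histogram		Latency distribution of `kustomize build` execution time (distributed into buckets by the duration of each operation)
`config_sync_kustomize_build_latency_count`	Histogram		Count of `kustomize build` executions, ignoring their duration
`config_sync_kustomize_build_latency_sum`	Histogram		Sum of all `kustomize build` execution times
`config_sync_kustomize_ordered_top_tier_metrics`	Gauge	top_tier_field	Usage of Resources, Generators, SecretGenerator, ConfigMapGenerator, Transformers, and Validators
`config_sync_kustomize_builtin_transformers`	Gauge	k8s_builtin_transformer	Usage of built-in transformers related to Kubernetes object metadata
`config_sync_kustomize_resource_count`	Gauge		Number of resources output by `kustomize build`
`config_sync_kustomize_field_count`	Gauge	field_name	Number of times a particular field is used in the kustomization files
`config_sync_kustomize_patch_count`	Gauge	patch_field	Number of patches in the `patches`, `patchesStrategicMerge`, and `patchesJson6902` fields
`config_sync_kustomize_base_count`	Gauge	base_source	Number of remote and local bases
`kustomize_deprecating_field_count`	Gauge	deprecating_field	Usage of fields that might become deprecated
`kustomize_simplification_adoption_count`	Gauge	simplification_field	Usage of the simplification transformers images, replicas, and replacements
`kustomize_helm_inflator_count`	Gauge	helm_inflator	Usage of helm in kustomize, whether through the built-in fields or the custom function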
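
To make the difference between labeled and unlabeled metrics concrete, and to show filtering by the config_sync_ prefix outside the Prometheus UI, here is a minimal Python sketch. It is an illustration under stated assumptions, not a Config Sync tool: the sample text is hypothetical scrape output with made-up values and label values, the function name is invented for this example, and the parser is deliberately simplified (it assumes no trailing timestamps on sample lines).

```python
from collections import defaultdict

# Hypothetical scrape output in the Prometheus text exposition format.
# The values and label values are made up for illustration.
SAMPLE = """\
# HELP config_sync_reconciler_errors Illustrative help text
# TYPE config_sync_reconciler_errors gauge
config_sync_reconciler_errors{component="source",errorclass="1xxx"} 0
config_sync_reconciler_errors{component="sync",errorclass="2xxx"} 3
config_sync_resource_group_total 2
go_goroutines 41
"""


def series_by_metric(text, prefix="config_sync_"):
    """Map each metric name starting with `prefix` to its list of label sets."""
    result = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series = line.rsplit(" ", 1)[0]  # drop the sample value
        name, _, labels = series.partition("{")
        if name.startswith(prefix):
            result[name].append(labels.rstrip("}"))
    return dict(result)


for name, label_sets in series_by_metric(SAMPLE).items():
    print(name, len(label_sets), "series")
```

In this sample, the labeled gauge yields one series per combination of label values, the unlabeled gauge yields a single series, and the metric without the prefix is filtered out.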

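Example debugging procedures for Prometheus

The following examples illustrate some patterns for using Prometheus metrics, object status fields, and object annotations to detect and diagnose problems related to Config Sync. These examples show how you can start with high-level monitoring that detects a problem and then progressively refine your search to drill down and diagnose the root cause.

Query configs by status

The reconciler process provides high-level metrics that give you an overall view of how Config Sync is operating on the cluster. You can see whether any errors have occurred, and you can even set up alerts for them.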

config_sync_reconciler_errors

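Query metrics by reconciler

If you are using the Config Sync RootSync and RepoSync APIs, you can monitor the RootSync and RepoSync objects. These objects are instrumented with high-level metrics that give you an overall view of how Config Sync is operating on the cluster. Almost all metrics are tagged with the reconciler name, so you can see whether any errors have occurred and set up alerts for them in Prometheus.

See the full list of available metric labels for filtering.

In Prometheus, you can use the following filters for RootSyncs or RepoSyncs: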

# Querying RootSync
config_sync_reconciler_errors{configsync_sync_name="ROOT_SYNC_NAME"}

# Querying RepoSync
config_sync_reconciler_errors{configsync_sync_name="REPO_SYNC_NAME"}
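
The filters above pass the sync name as a label value, and PromQL requires label values to be written as quoted strings. When such queries are generated by a script, a small helper can handle the quoting. The following is a minimal Python sketch under that assumption; the helper name selector and the sync name root-sync are hypothetical and chosen only for illustration.

```python
def selector(metric, **labels):
    """Render a PromQL selector with exact-match label filters."""

    def quote(value):
        # PromQL label values are double-quoted strings; escape \ and ".
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # Sort the labels so that the generated query text is deterministic.
    matchers = ",".join(f"{key}={quote(val)}" for key, val in sorted(labels.items()))
    return f"{metric}{{{matchers}}}" if matchers else metric


# "root-sync" is a placeholder; substitute the name of your RootSync or RepoSync.
print(selector("config_sync_reconciler_errors", configsync_sync_name="root-sync"))
```

The resulting string can be pasted into the Prometheus expression browser or sent to the Prometheus HTTP API as the query parameter.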

Query import and sync operations by status

In Prometheus, you can use the following queries:

# Check for errors that occurred when sourcing configs.
config_sync_reconciler_errors{component="source"}

# Check for errors that occurred when syncing configs to the cluster.
config_sync_reconciler_errors{component="sync"}

You can also check the metrics for the source and sync processes themselves:

config_sync_parser_duration_seconds_count{status="error"}
config_sync_apply_duration_seconds_count{status="error"}
config_sync_remediate_duration_seconds_count{status="error"}

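Monitor resources with Google Cloud Managed Service for Prometheus

Google Cloud Managed Service for Prometheus is Google Cloud's fully managed multi-cloud solution for Prometheus metrics. It supports two modes of data collection: managed collection (the recommended mode) and self-deployed data collection. Complete the following steps to monitor Config Sync with Google Cloud Managed Service for Prometheus in managed collection mode.

Enable Managed Prometheus on your cluster by following the instructions in Set up managed collection.

Save the following sample manifest as pod-monitoring-config-sync-monitoring.yaml. This manifest configures a PodMonitoring resource to scrape the Config Sync metrics on port 8675 of the otel-collector-* Pod in the config-management-monitoring namespace. The PodMonitoring resource uses a Kubernetes label selector to find the otel-collector-* Pod.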

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: config-sync-monitoring
  namespace: config-management-monitoring
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  endpoints:
  - port: 8675
    interval: 10s

Apply the manifest to the cluster:

kubectl apply -f pod-monitoring-config-sync-monitoring.yaml

Verify that your Prometheus data is being exported by using the Cloud Monitoring Metrics Explorer page in the Google Cloud console, following the instructions in Managed Service for Prometheus data in Cloud Monitoring.

What's next

Use Prometheus alerting rules with Config Sync SLIs