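Logging and monitoring

Google Distributed Cloud for VMware (software only) offers several options for cluster logging and monitoring, including cloud-hosted managed services, open source tools, and validated compatibility with third-party commercial solutions. This document describes these options and provides basic guidance to help you choose the right solution for your environment.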

Options for Google Distributed Cloud

You can use several logging and monitoring options with Google Distributed Cloud:

Cloud Logging and Cloud Monitoring
Google Cloud Managed Service for Prometheus (Preview)
Validated configurations with third-party solutions

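Cloud Logging and Cloud Monitoring

Google Cloud Observability (formerly Stackdriver) is the built-in observability solution for Google Cloud. It provides fully managed logging, metrics collection, monitoring, dashboards, and alerting. Cloud Monitoring monitors Google Distributed Cloud clusters in much the same way that it monitors cloud-based GKE clusters.

You can configure the monitoring and logging scope of the in-cluster agents, as well as the level of metrics collected:

The logging and monitoring scope can be set to system components only (the default) or to both system components and applications.
The level of metrics collected can be set to an optimized set of metrics or to all metrics.

For more information, see Configuring logging and monitoring agents for Google Distributed Cloud in this document.

Cloud Logging and Cloud Monitoring are a good fit for customers who want a single, easy-to-configure, powerful cloud-based observability solution. We strongly recommend Logging and Monitoring if you run workloads only on Google Distributed Cloud, or on both GKE and Google Distributed Cloud. For applications whose components run on both Google Distributed Cloud and traditional on-premises infrastructure, you might consider other solutions to get an end-to-end view of those applications.

For details about the architecture, the configuration, and the Google Distributed Cloud data that is copied to your Google Cloud project by default, see the section How Logging and Monitoring for Google Distributed Cloud works.
For more information about Cloud Logging, see the Cloud Logging documentation.
For more information about Cloud Monitoring, see the Cloud Monitoring documentation.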

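Third-party solutions

Google works with several third-party logging and monitoring solution providers to help their products work well with Google Distributed Cloud. These providers include Datadog, Elastic, and Splunk. Additional validated third parties will be added in the future.

For more information about using third-party solutions with Google Distributed Cloud, see the following: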

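How Logging and Monitoring for Google Distributed Cloud works

When you create a new admin cluster or user cluster, Logging and Monitoring agents are installed and activated in each cluster. The agents collect data about system components, and you can configure the scope of what they collect.

To view the collected data in the Google Cloud console, you must configure the Google Cloud project that stores the logs and metrics you want to view.

The Logging and Monitoring agents on each cluster include:

GKE metrics agent (gke-metrics-agent): a DaemonSet that sends metrics to the Cloud Monitoring API.
Log Forwarder (stackdriver-log-forwarder): a Fluent Bit DaemonSet that forwards logs from each machine to Cloud Logging. The Log Forwarder buffers log entries locally on the node and retries sending them for up to 4 hours. If the buffer fills up, or if the Log Forwarder cannot reach the Cloud Logging API for more than 4 hours, logs are dropped.
Global GKE metrics agent (gke-metrics-agent-global): a Deployment that sends metrics to the Cloud Monitoring API.
Metadata agent (stackdriver-metadata-agent): a Deployment that sends metadata for Kubernetes resources such as Pods, Deployments, and nodes to the Stackdriver Resource Metadata API. This data enriches metric queries by letting you query by Deployment name, node name, or even Kubernetes Service name.
kube-state-metrics: a Deployment that listens to the API server and generates metrics about the state of objects.
node-exporter: a DaemonSet that generates hardware and operating system metrics.

To see all the agent Deployments, run the following command: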

  kubectl --kubeconfig CLUSTER_KUBECONFIG get deployments -l "managed-by=stackdriver" --all-namespaces

where CLUSTER_KUBECONFIG is the path of the kubeconfig file for the cluster.

The output of this command is similar to the following:

gke-metrics-agent-global                      1/1     Running   0   4h31m
stackdriver-metadata-agent-cluster-level      1/1     Running   0   4h31m

To see all the agent DaemonSets, run the following command:

  kubectl --kubeconfig CLUSTER_KUBECONFIG get daemonsets -l "managed-by=stackdriver" --all-namespaces

The output of this command is similar to the following:

gke-metrics-agent                             1/1     Running   0   4h31m
stackdriver-log-forwarder                     1/1     Running   0   4h31m
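
The two listing commands above can also feed a quick presence check. The following is a minimal sketch, not an official tool: the helper name missing_agents is hypothetical, and it assumes you add kubectl's `-o name` output format so that each line has the form `kind/name`. It prints every expected agent name that does not appear in the input.

```shell
# missing_agents EXPECTED_NAME...
# Reads `kubectl get ... -o name` output on stdin (one "kind/name" per line)
# and prints each expected name that is absent from that output.
# The helper name is hypothetical and used only for illustration.
missing_agents() {
  # Strip the "daemonset.apps/" or "deployment.apps/" style prefix.
  found=$(sed 's|^.*/||')
  for want in "$@"; do
    # -x: match the whole line, -F: treat the name as a fixed string.
    printf '%s\n' "$found" | grep -qxF -- "$want" || printf '%s\n' "$want"
  done
}

# Example usage (requires access to a cluster, so it is left commented out):
# kubectl --kubeconfig CLUSTER_KUBECONFIG get daemonsets \
#   -l "managed-by=stackdriver" --all-namespaces -o name \
#   | missing_agents gke-metrics-agent stackdriver-log-forwarder
```

Empty output means every expected agent was listed. Exact matching keeps gke-metrics-agent from being mistaken for gke-metrics-agent-global.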

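Configuring logging and monitoring agents for Google Distributed Cloud

The agents installed with Google Distributed Cloud collect data about system components, subject to your settings and configuration, for the purpose of maintaining clusters and troubleshooting issues with them.

System components only (default scope)

After installation, the agents collect logs and metrics, including performance details (such as CPU and memory utilization) and similar metadata, for the system components that Google provides. These include all workloads in the admin cluster and, for user clusters, the workloads in the kube-system, gke-system, gke-connect, istio-system, and config-management-system namespaces. You can configure or disable the agents as described in the following sections.

You can also extend the scope of collected logs and metrics to include applications. For instructions on enabling application logging and monitoring, see Enabling Logging and Monitoring for user applications.

Optimized metrics (default metrics)

By default, the metrics agents running in the cluster collect an optimized set of container, kubelet, and kube-state-metrics metrics and report them to Google Cloud Observability (formerly Stackdriver).

Collecting this optimized set of metrics requires fewer resources, which improves overall performance and scalability. This matters most for container-level and kube-level metrics, because a large number of objects need to be monitored.

Excluded container metrics

The following container metrics are excluded from optimized metrics: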

container_cpu_cfs_periods_total
container_cpu_cfs_throttled_periods_total
container_cpu_load_average_10s
container_cpu_system_seconds_total
container_cpu_user_seconds_total
container_fs_io_current
container_fs_io_time_seconds_total
container_fs_io_time_weighted_seconds_total
container_fs_read_seconds_total
container_fs_reads_bytes_total
container_fs_reads_merged_total
container_fs_reads_total
container_fs_sector_reads_total
container_fs_sector_writes_total
container_fs_write_seconds_total
container_fs_writes_bytes_total
container_fs_writes_merged_total
container_fs_writes_total
container_last_seen
container_memory_cache
container_memory_failcnt
container_memory_mapped_file
container_memory_max_usage_bytes
container_memory_swap
container_network_receive_packets_dropped_total
container_network_receive_packets_total
container_network_transmit_packets_dropped_total
container_network_transmit_packets_total
container_start_time_seconds
container_spec_cpu_period
container_spec_cpu_quota
container_spec_cpu_shares
container_spec_memory_limit_bytes
container_spec_memory_reservation_limit_bytes
container_spec_memory_swap_limit_bytes
container_tasks_state

The complete set of Google Distributed Cloud metrics is documented in Google Distributed Cloud metrics.

Excluded kubelet metrics

The following kubelet metrics are excluded from optimized metrics:

kubelet_runtime_operations_duration_seconds
kubelet_runtime_operations_errors
kubelet_runtime_operations_latency_microseconds
kubelet_runtime_operations_latency_microseconds_count
kubelet_runtime_operations_latency_microseconds_sum
rest_client_request_duration_seconds
rest_client_request_latency_seconds

The complete set of Google Distributed Cloud metrics is documented in Google Distributed Cloud metrics.

Excluded kube-state-metrics metrics

The following kube-state-metrics metrics are excluded from optimized metrics:

kube_certificatesigningrequest_cert_length
kube_certificatesigningrequest_condition
kube_certificatesigningrequest_created
kube_certificatesigningrequest_labels
kube_configmap_annotations
kube_configmap_info
kube_configmap_labels
kube_configmap_metadata_resource_version
kube_daemonset_annotations
kube_daemonset_created
kube_daemonset_labels
kube_daemonset_metadata_generation
kube_daemonset_status_observed_generation
kube_deployment_annotations
kube_deployment_created
kube_deployment_labels
kube_deployment_spec_paused
kube_deployment_spec_strategy_rollingupdate_max_surge
kube_deployment_spec_strategy_rollingupdate_max_unavailable
kube_deployment_status_condition
kube_deployment_status_replicas_ready
kube_endpoint_annotations
kube_endpoint_created
kube_endpoint_info
kube_endpoint_labels
kube_endpoint_ports
kube_horizontalpodautoscaler_annotations
kube_horizontalpodautoscaler_info
kube_horizontalpodautoscaler_labels
kube_horizontalpodautoscaler_metadata_generation
kube_horizontalpodautoscaler_status_condition
kube_job_annotations
kube_job_complete
kube_job_created
kube_job_info
kube_job_labels
kube_job_owner
kube_job_spec_completions
kube_job_spec_parallelism
kube_job_status_completion_time
kube_job_status_start_time
kube_job_status_succeeded
kube_lease_owner
kube_lease_renew_time
kube_limitrange
kube_limitrange_created
kube_mutatingwebhookconfiguration_info
kube_namespace_labels
kube_networkpolicy_annotations
kube_networkpolicy_labels
kube_networkpolicy_spec_egress_rules
kube_networkpolicy_spec_ingress_rules
kube_node_annotations
kube_node_role
kube_persistentvolume_annotations
kube_persistentvolume_labels
kube_persistentvolumeclaim_access_mode
kube_persistentvolumeclaim_annotations
kube_persistentvolumeclaim_labels
kube_pod_annotations
kube_pod_completion_time
kube_pod_container_resource_limits
kube_pod_container_resource_requests
kube_pod_container_state_started
kube_pod_created
kube_pod_init_container_info
kube_pod_init_container_resource_limits
kube_pod_init_container_resource_requests
kube_pod_init_container_status_last_terminated_reason
kube_pod_init_container_status_ready
kube_pod_init_container_status_restarts_total
kube_pod_init_container_status_running
kube_pod_init_container_status_terminated
kube_pod_init_container_status_terminated_reason
kube_pod_init_container_status_waiting
kube_pod_init_container_status_waiting_reason
kube_pod_labels
kube_pod_owner
kube_pod_restart_policy
kube_pod_spec_volumes_persistentvolumeclaims_readonly
kube_pod_start_time
kube_poddisruptionbudget_annotations
kube_poddisruptionbudget_created
kube_poddisruptionbudget_labels
kube_poddisruptionbudget_status_expected_pods
kube_poddisruptionbudget_status_observed_generation
kube_poddisruptionbudget_status_pod_disruptions_allowed
kube_replicaset_annotations
kube_replicaset_created
kube_replicaset_labels
kube_replicaset_metadata_generation
kube_replicaset_owner
kube_replicaset_status_observed_generation
kube_resourcequota_created
kube_secret_annotations
kube_secret_info
kube_secret_labels
kube_secret_metadata_resource_version
kube_secret_type
kube_service_annotations
kube_service_created
kube_service_info
kube_service_labels
kube_service_spec_type
kube_statefulset_annotations
kube_statefulset_created
kube_statefulset_labels
kube_statefulset_status_current_revision
kube_statefulset_status_update_revision
kube_storageclass_annotations
kube_storageclass_created
kube_storageclass_info
kube_storageclass_labels
kube_validatingwebhookconfiguration_info
kube_validatingwebhookconfiguration_metadata_resource_version
kube_volumeattachment_created
kube_volumeattachment_info
kube_volumeattachment_labels
kube_volumeattachment_spec_source_persistentvolume
kube_volumeattachment_status_attached
kube_volumeattachment_status_attachment_metadata

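The complete set of Google Distributed Cloud metrics is documented in Google Distributed Cloud metrics.

To disable optimized kube-state-metrics metrics (not recommended), set the optimizedMetrics field to false in your Stackdriver custom resource. For more information about changing the Stackdriver custom resource, see Configuring Stackdriver component resources. All Google Distributed Cloud metrics, including those excluded by default, are described in Google Distributed Cloud metrics.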
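
The optimizedMetrics change described above could be applied non-interactively with a JSON merge patch. The sketch below only builds the patch body. It rests on assumptions that this document does not confirm: that the field lives at spec.optimizedMetrics, and that the custom resource is named stackdriver in the kube-system namespace. Check both against your cluster before running the commented kubectl command. The helper name optimized_metrics_patch is hypothetical.

```shell
# optimized_metrics_patch true|false
# Prints a JSON merge patch that sets optimizedMetrics on the Stackdriver
# custom resource. ASSUMPTION: the field path is spec.optimizedMetrics.
optimized_metrics_patch() {
  case "$1" in
    true|false)
      printf '{"spec":{"optimizedMetrics":%s}}\n' "$1"
      ;;
    *)
      echo "usage: optimized_metrics_patch true|false" >&2
      return 1
      ;;
  esac
}

# Example usage (left commented out; it modifies a live cluster).
# ASSUMPTION: the resource is named "stackdriver" in namespace "kube-system".
# kubectl --kubeconfig CLUSTER_KUBECONFIG -n kube-system \
#   patch stackdriver stackdriver --type merge \
#   -p "$(optimized_metrics_patch false)"
```

A merge patch is used here because it changes only the named field and leaves the rest of the resource untouched.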

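Enabling and disabling Stackdriver

You can enable or disable the Logging and Monitoring agents completely by enabling or disabling the Stackdriver custom resource. This feature is in Preview.

Before you disable the Logging and Monitoring agents, see the support page for details about how doing so affects the service level agreement (SLA) for Google Cloud Support.

The Logging and Monitoring agents capture locally stored data, subject to your storage and retention configuration. The data is copied to the Google Cloud project specified at installation, using a service account that is authorized to write data to that project. As mentioned earlier, you can disable these agents at any time.

You can also manage and delete the data that the Logging and Monitoring agents send to Cloud Logging and Cloud Monitoring. For more information, see the Cloud Monitoring documentation.

Configuration requirements for Logging and Monitoring

To view Cloud Logging and Cloud Monitoring data, you must configure the Google Cloud project that stores the logs and metrics you want to view. This Google Cloud project is called the logging-monitoring project.

Enable the following APIs in the logging-monitoring project:
Grant the following IAM roles on the logging-monitoring project to the logging-monitoring service account: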
- logging.logWriter
- monitoring.metricWriter
- stackdriver.resourceMetadata.writer
- monitoring.dashboardEditor
- opsconfigmonitoring.resourceMetadata.writer
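
As a rough illustration of the role grants listed above, the following sketch prints one `gcloud projects add-iam-policy-binding` command per role so that you can review the commands before running them. The helper name print_role_bindings and the example project ID and service account email are hypothetical placeholders. The sketch also assumes the listed roles are referenced with the usual `roles/` prefix.

```shell
# print_role_bindings PROJECT_ID SERVICE_ACCOUNT_EMAIL
# Prints (does not run) one gcloud command for each IAM role in the list above.
# ASSUMPTION: each role is addressed as "roles/<name>".
print_role_bindings() {
  project=$1
  sa_email=$2
  for role in \
    logging.logWriter \
    monitoring.metricWriter \
    stackdriver.resourceMetadata.writer \
    monitoring.dashboardEditor \
    opsconfigmonitoring.resourceMetadata.writer
  do
    printf 'gcloud projects add-iam-policy-binding %s --member "serviceAccount:%s" --role "roles/%s"\n' \
      "$project" "$sa_email" "$role"
  done
}

# Example usage with placeholder values:
# print_role_bindings my-project my-sa@my-project.iam.gserviceaccount.com
# Pipe the reviewed output to `sh` to apply the bindings.
```

Printing the commands instead of running them keeps the sketch safe to try and lets you check each binding first.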

Log tags

Many Google Distributed Cloud logs have an F tag:

logtag: "F"

This tag indicates that the log entry is complete, or full. To learn more about this tag, see Log format in the Kubernetes design proposals on GitHub.
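
To make the meaning of the tag concrete, here is a small sketch that reassembles entries from raw container log lines. It assumes the line layout from the Kubernetes design proposal, `TIMESTAMP STREAM TAG MESSAGE`, where a P tag marks a partial chunk and an F tag marks the chunk that completes an entry. The helper name join_cri_entries is hypothetical, and the sketch does not separate stdout from stderr.

```shell
# join_cri_entries
# Reads log lines in "TIMESTAMP STREAM TAG MESSAGE" form on stdin and prints
# one line per complete entry: P (partial) chunks are concatenated until the
# closing F (full) chunk arrives.
join_cri_entries() {
  awk '{
    tag = $3
    msg = $0
    # Drop the timestamp, stream, and tag fields, keeping the message as is.
    sub(/^[^ ]+ [^ ]+ [^ ]+ ?/, "", msg)
    buf = buf msg
    if (tag == "F") {
      print buf
      buf = ""
    }
  }'
}

# Example usage with made-up log lines:
# printf '%s\n' \
#   '2016-10-06T00:17:09.669794202Z stdout P hel' \
#   '2016-10-06T00:17:09.669794203Z stdout F lo' \
#   | join_cri_entries
```

An entry tagged F on its first line passes through unchanged, which is the common case for the logs described above.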

What's next

Use logging and monitoring