Monitor environments across projects with Terraform

Cloud Composer 3 | Cloud Composer 2 | Cloud Composer 1

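This page explains how to implement an integrated monitoring dashboard for multiple Cloud Composer environments across selected projects in the same organization.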

Overview

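The solution described here helps central enterprise platform teams support Cloud Composer environments used by other teams. It can monitor all Cloud Composer environments, including those that were not created with Terraform.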

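This guide implements a Cloud Monitoring dashboard for Cloud Composer, together with alerting policies that continuously report key metrics of Cloud Composer environments and raise incidents when problems occur. The dashboard automatically scans all Cloud Composer environments in the projects selected for this monitoring. The implementation relies on Terraform.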

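The model uses one Google Cloud project as the monitoring project. This project monitors (read-only) the Cloud Composer environments deployed in multiple monitored projects. The central dashboard renders its contents from the Cloud Monitoring metrics of the monitored projects.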

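Diagram showing the monitoring project, which contains the monitoring dashboard, and three monitored projects, each of which contains Composer environments. Each monitored project has an arrow labeled "Metrics" that points to the monitoring project.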

The dashboard monitors multiple metrics and creates alerts for them, including environment health:

Screenshot of the monitoring dashboard showing environment health, database health, web server health, and scheduler heartbeat

or CPU metrics:

Screenshot of the monitoring dashboard showing database CPU, scheduler CPU, worker CPU, and web server CPU

Hover over a specific line to see which environment it represents. The dashboard then displays the project name and the resource:

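Screenshot of the monitoring dashboard showing the pop-up that appears when you hover over a line. The pop-up lists four resources, one of which corresponds to the hovered line.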

If a metric crosses its predefined threshold, an incident is raised and the corresponding alert is displayed in the chart for that metric:

Screenshot of the open incidents view showing two open incidents. Each listed incident includes a link for viewing its details.

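List of monitored metrics

The full list of monitored metrics:

  • Cloud Composer environment health (based on the monitoring DAG)
  • Database health
  • Web server health
  • Scheduler heartbeat
  • CPU and memory utilization of all workers
  • CPU and memory utilization of the Airflow database
  • CPU and memory utilization of the web server
  • CPU and memory utilization of Airflow schedulers
  • Proportion of queued, scheduled, and queued-or-scheduled tasks in an environment (helps to spot Airflow concurrency configuration issues)
  • DAG parsing time
  • Current number of workers compared with the minimum number of workers (helps to understand worker stability or scaling issues)
  • Worker pod evictions
  • Number of errors thrown in logs by workers, schedulers, the web server, or other components (individual charts)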

Before you begin

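To use Cloud Composer and Cloud Monitoring, you must create a Google Cloud project and enable billing. The project must contain a Cloud Composer environment. This guide refers to this project as the monitoring project.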

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  3. Verify that billing is enabled for your Google Cloud project.

  4. Install Terraform if it is not already installed.
  5. Configure the metrics scope of your project. By default, a project can display or monitor only the time series data that it stores. To display or monitor data stored in multiple projects, configure the project's metrics scope. For more information, see the metrics scope overview.

Implementation steps

    1. On the local computer where you run Terraform, set the GOOGLE_CLOUD_PROJECT environment variable to the ID of your monitoring project:

      export GOOGLE_CLOUD_PROJECT=MONITORING_PROJECT_ID
      
    2. Make sure that your Terraform Google provider is authenticated and has the following roles:

      • roles/monitoring.editor in the monitoring project
      • roles/monitoring.viewer and roles/logging.viewer in all monitored projects
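
      The role grants listed in this step could be expressed in Terraform roughly as follows. This is a minimal sketch under stated assumptions, not part of main.tf: TERRAFORM_PRINCIPAL and all local value names are hypothetical placeholders, and the configuration would have to be applied separately by an identity that is already allowed to change IAM policy on each project.

      ```hcl
      # Sketch only: every name and value below is a placeholder (an assumption).
      locals {
        # Identity that runs Terraform for the monitoring setup, for example
        # "serviceAccount:NAME@PROJECT_ID.iam.gserviceaccount.com".
        terraform_principal = "TERRAFORM_PRINCIPAL"

        monitoring_project = "YOUR_MONITORING_PROJECT"
        monitored_projects = ["YOUR_PROJECT_TO_MONITOR_1", "YOUR_PROJECT_TO_MONITOR_2"]
        viewer_roles       = ["roles/monitoring.viewer", "roles/logging.viewer"]
      }

      # roles/monitoring.editor in the monitoring project.
      resource "google_project_iam_member" "monitoring_editor" {
        project = local.monitoring_project
        role    = "roles/monitoring.editor"
        member  = local.terraform_principal
      }

      # Both viewer roles in every monitored project: one binding per
      # (project, role) pair produced by setproduct().
      resource "google_project_iam_member" "monitored_viewers" {
        for_each = {
          for pair in setproduct(local.monitored_projects, local.viewer_roles) :
          "${pair[0]}/${pair[1]}" => { project = pair[0], role = pair[1] }
        }

        project = each.value.project
        role    = each.value.role
        member  = local.terraform_principal
      }
      ```

      google_project_iam_member is additive, so it leaves other bindings on the projects untouched. If the roles are already granted through the console or another tool, this sketch is not needed.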
    3. Copy the following main.tf file to the local computer where you run Terraform.

      #   Monitoring for multiple Cloud Composer environments
      #
      #   Usage:
      #       1. Create a new project that you will use for monitoring of Cloud Composer environments in other projects
      #       2. Replace YOUR_MONITORING_PROJECT with the name of this project in the "metrics_scope" parameter that is part of the "Add Monitored Projects to the Monitoring project" section
      #       3. Replace the list of projects to monitor with your list of projects with Cloud Composer environments to be monitored in the "for_each" parameter of the "Add Monitored Projects to the Monitoring project" section
      #       4. Set up your environment and apply the configuration following these steps: https://cloud.google.com/docs/terraform/basic-commands. Your GOOGLE_CLOUD_PROJECT environment variable should be the new monitoring project you just created.
      #
      #   The script creates the following resources in the monitoring project:
      #           1. Adds monitored projects to Cloud Monitoring
      #           2. Creates Alert Policies
      #           3. Creates Monitoring Dashboard
      #
      
      
      
      #######################################################
      #
      # Add Monitored Projects to the Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_monitored_project" "projects_monitored" {
        for_each      = toset(["YOUR_PROJECT_TO_MONITOR_1", "YOUR_PROJECT_TO_MONITOR_2", "YOUR_PROJECT_TO_MONITOR_3"])
        metrics_scope = join("", ["locations/global/metricsScopes/", "YOUR_MONITORING_PROJECT"])
        name          = each.value
      }
      
      
      #######################################################
      #
      # Create alert policies in Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_alert_policy" "environment_health" {
        display_name = "Environment Health"
        combiner     = "OR"
        conditions {
          display_name = "Environment Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| {metric 'composer.googleapis.com/environment/dagbag_size'",
              "| group_by 5m, [value_dagbag_size_mean: if(mean(value.dagbag_size) > 0, 1, 0)]",
              "| align mean_aligner(5m)",
              "| group_by [resource.project_id, resource.environment_name],    [value_dagbag_size_mean_aggregate: aggregate(value_dagbag_size_mean)];  ",
              "metric 'composer.googleapis.com/environment/healthy'",
              "| group_by 5m,    [value_sum_signals: aggregate(if(value.healthy,1,0))]",
              "| align mean_aligner(5m)| absent_for 5m }",
              "| outer_join 0",
              "| group_by [resource.project_id, resource.environment_name]",
              "| value val(2)",
              "| align mean_aligner(5m)",
              "| window(5m)",
              "| condition val(0) < 0.9"
            ])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_health" {
        display_name = "Database Health"
        combiner     = "OR"
        conditions {
          display_name = "Database Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database_health'",
              "| group_by 5m,",
              "    [value_database_health_fraction_true: fraction_true(value.database_health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_database_health_fraction_true_aggregate:",
              "       aggregate(value_database_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_database_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_database_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_health" {
        display_name = "Web Server Health"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/web_server/health'",
              "| group_by 5m, [value_health_fraction_true: fraction_true(value.health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_health_fraction_true_aggregate:",
              "       aggregate(value_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_heartbeat" {
        display_name = "Scheduler Heartbeat"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Heartbeat"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/scheduler_heartbeat_count'",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate:",
              "      aggregate(value.scheduler_heartbeat_count)]",
              "| every 10m",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate_mean:",
              "       mean(value_scheduler_heartbeat_count_aggregate)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_scheduler_heartbeat_count_aggregate_mean_aggregate:",
              "       aggregate(value_scheduler_heartbeat_count_aggregate_mean)]",
            "| condition val() < 80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_cpu" {
        display_name = "Database CPU"
        combiner     = "OR"
        conditions {
          display_name = "Database CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/cpu/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_cpu" {
        display_name = "Scheduler CPU"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_cpu" {
        display_name = "Worker CPU"
        combiner     = "OR"
        conditions {
          display_name = "Worker CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_cpu" {
        display_name = "Web Server CPU"
        combiner     = "OR"
        conditions {
          display_name = "Web Server CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "parsing_time" {
        display_name = "DAG Parsing Time"
        combiner     = "OR"
        conditions {
          display_name = "DAG Parsing Time"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/dag_processing/total_parse_time'",
              "| group_by 5m, [value_total_parse_time_mean: mean(value.total_parse_time)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val(0) > cast_units(30,\"s\")"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_memory" {
        display_name = "Database Memory"
        combiner     = "OR"
        conditions {
          display_name = "Database Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/memory/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_memory" {
        display_name = "Scheduler Memory"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        documentation {
          content = join("", [
            "Scheduler Memory exceeds a threshold, summed across all schedulers in the environment. ",
          "Add more schedulers OR increase scheduler's memory OR reduce scheduling load (e.g. through lower parsing frequency or lower number of DAGs/tasks running)"])
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_memory" {
        display_name = "Worker Memory"
        combiner     = "OR"
        conditions {
          display_name = "Worker Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_memory" {
        display_name = "Web Server Memory"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduled_tasks_percentage" {
        display_name = "Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'scheduled'",
            "| condition val() > 0.80"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_tasks_percentage" {
        display_name = "Queued Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'queued'",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.95"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_or_scheduled_tasks_percentage" {
        display_name = "Queued or Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued or Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], or(metric.state = 'queued', metric.state = 'scheduled' )",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      
      resource "google_monitoring_alert_policy" "workers_above_minimum" {
        display_name = "Workers above minimum (negative = missing workers)"
        combiner     = "OR"
        conditions {
          display_name = "Workers above minimum"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| { metric 'composer.googleapis.com/environment/num_celery_workers'",
              "| group_by 5m, [value_num_celery_workers_mean: mean(value.num_celery_workers)]",
              "| every 5m",
              "; metric 'composer.googleapis.com/environment/worker/min_workers'",
              "| group_by 5m, [value_min_workers_mean: mean(value.min_workers)]",
              "| every 5m }",
              "| outer_join 0",
              "| sub",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() < 0"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "pod_evictions" {
        display_name = "Worker pod evictions"
        combiner     = "OR"
        conditions {
          display_name = "Worker pod evictions"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/worker/pod_eviction_count'",
              "| align delta(1m)",
              "| every 1m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0"])
            duration = "60s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_errors" {
        display_name = "Scheduler Errors"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-scheduler' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_errors" {
        display_name = "Worker Errors"
        combiner     = "OR"
        conditions {
          display_name = "Worker Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-worker' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_errors" {
        display_name = "Web Server Errors"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-webserver' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "other_errors" {
        display_name = "Other Errors"
        combiner     = "OR"
        conditions {
          display_name = "Other Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter",
              "    (metric.log !~ 'airflow-scheduler|airflow-worker|airflow-webserver'",
              "     && metric.severity == 'ERROR')",
              "| group_by 5m, [value_log_entry_count_max: max(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_max_aggregate: aggregate(value_log_entry_count_max)]",
            "| condition val() > 10"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      
      #######################################################
      #
      # Create Monitoring Dashboard
      #
      ########################################################
      
      
      resource "google_monitoring_dashboard" "Composer_Dashboard" {
        dashboard_json = <<EOF
      {
        "category": "CUSTOM",
        "displayName": "Cloud Composer - Monitoring Platform",
        "mosaicLayout": {
          "columns": 12,
          "tiles": [
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "MARKDOWN"
                },
                "title": "Health"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 0
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.environment_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_health.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 5
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_heartbeat.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 5
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow Task Execution and DAG Parsing"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 9
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_or_scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 14
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.parsing_time.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 14
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Workers presence"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 18
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.workers_above_minimum.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 19
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.pod_evictions.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 19
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "CPU Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 23
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 28
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 28
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Memory Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 32
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 37
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 37
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow component errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 41
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 48
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.other_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 48
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Task errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 52
            }
          ]
        }
      }
      EOF
      }
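    4. Modify the "google_monitoring_monitored_project" resource block:

      1. Replace the list of projects in the for_each argument with your monitored projects.
      2. In metrics_scope, replace "YOUR_MONITORING_PROJECT" with the name of your monitoring project.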
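    5. Review the configuration and verify that the resources that Terraform is going to create or update match your expectations. Make corrections if necessary.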

      terraform plan
      
    6. Apply the Terraform configuration by running the following command and entering yes at the prompt:

      terraform apply
      
    7. In the Google Cloud console for your monitoring project, go to the Monitoring dashboards page:

      Go to Monitoring dashboards

    8. On the Custom tab, find the custom dashboard named Cloud Composer - Monitoring Platform.
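
As an illustration of step 4, the edited resource block might look like the following sketch. The resource label `projects_monitored` and the project IDs `example-monitored-project-1` and `example-monitored-project-2` are placeholders chosen for this example, not values taken from this guide. `YOUR_MONITORING_PROJECT` is the placeholder that step 4 asks you to replace.

```hcl
# Sketch only: the resource label and the project IDs are hypothetical placeholders.
resource "google_monitoring_monitored_project" "projects_monitored" {
  # Terraform creates one resource instance per monitored project in this set.
  for_each = toset([
    "example-monitored-project-1",
    "example-monitored-project-2",
  ])

  # Metrics scope of the monitoring project that hosts the dashboard.
  # Replace YOUR_MONITORING_PROJECT with your monitoring project.
  metrics_scope = "locations/global/metricsScopes/YOUR_MONITORING_PROJECT"

  # The monitored project that is added to the metrics scope.
  name = each.value
}
```

With this shape, monitoring another project should only require adding its ID to the `for_each` set and running `terraform plan` and `terraform apply` again.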

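The dashboard definition repeats the same tile structure for every alert chart and changes only the policy reference and the grid position. If you prefer to generate these tiles instead of maintaining them by hand, a `for` expression combined with `jsonencode` is one option. The following sketch is an assumption about how that could be done and is not part of the configuration in this guide. It rebuilds only the four tiles of the CPU Utilization row, and the names `cpu_alert_policy_names`, `cpu_tiles`, and `cpu_tiles_json` are hypothetical.

```hcl
locals {
  # Alert policies shown in the "CPU Utilization" row, in display order.
  # These resources are the ones that the dashboard JSON already references.
  cpu_alert_policy_names = [
    google_monitoring_alert_policy.database_cpu.name,
    google_monitoring_alert_policy.scheduler_cpu.name,
    google_monitoring_alert_policy.worker_cpu.name,
    google_monitoring_alert_policy.webserver_cpu.name,
  ]

  # Two tiles per row, each 6 columns wide and 4 rows high.
  # The first row of the group starts at yPos 24, as in the dashboard JSON.
  cpu_tiles = [
    for i, policy_name in local.cpu_alert_policy_names : {
      height = 4
      width  = 6
      xPos   = (i % 2) * 6
      yPos   = 24 + floor(i / 2) * 4
      widget = {
        alertChart = {
          name = policy_name
        }
      }
    }
  ]
}

# Hypothetical output for inspecting the generated tiles before using them.
output "cpu_tiles_json" {
  value = jsonencode(local.cpu_tiles)
}
```

The generated positions are (0, 24), (6, 24), (0, 28), and (6, 28), which match the hand-written CPU tiles. Before replacing any part of the heredoc with generated tiles, compare the output against the existing JSON.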
    What's next