Cross-project environment monitoring with Terraform

Cloud Composer 3 | Cloud Composer 2 | Cloud Composer 1

This page describes how to implement an integrated monitoring dashboard for multiple Cloud Composer environments across selected projects in the same organization.

Overview

The described solution can help central enterprise platform teams support Cloud Composer environments that are used by other teams. This implementation can be used to monitor all Cloud Composer environments, including those that were not created with Terraform.

This guide implements a Cloud Monitoring dashboard for Cloud Composer, together with alerting policies that continuously report key metrics of Cloud Composer environments and raise incidents when problems occur. The dashboard automatically scans all Cloud Composer environments in the projects selected for this monitoring. The implementation relies on Terraform.

The model uses one Google Cloud project as the monitoring project, which monitors (read-only) the Cloud Composer environments deployed in multiple monitored projects. The central dashboard builds its contents from the Cloud Monitoring metrics of the monitored projects.

Diagram showing the monitoring project, which contains the Monitoring dashboard, and three monitored projects, each of which contains Composer environments. Each monitored project has an arrow labeled "Metrics" that points to the monitoring project.

The dashboard monitors several metrics and creates alerts for them, including the health of the environment:

Screenshot of the Monitoring dashboard showing the values for environment health, database health, web server health, and scheduler heartbeat

or CPU metrics:

Screenshot of the Monitoring dashboard showing the values for database CPU, scheduler CPU, worker CPU, and web server CPU

Hover over a specific line to see which environment it represents. The dashboard then displays a project name and a resource:

Screenshot of the Monitoring dashboard with the pop-up that appears when you hover over a line. The pop-up lists four resources, one of which corresponds to the line.

If a metric exceeds a predefined threshold, an incident is raised and a corresponding alert is displayed in the chart for that metric:

Screenshot of the "Open incidents" view with two open incidents. Each listed incident includes a link that opens its details.

List of monitored metrics

The complete list of monitored metrics:

  • Cloud Composer environment health (based on the monitoring DAG)
  • Database health
  • Web server health
  • Scheduler heartbeats
  • CPU and memory utilization for all workers
  • CPU and memory utilization for the Airflow database
  • CPU and memory utilization for the web server
  • CPU and memory utilization for Airflow schedulers
  • Proportion of queued tasks, scheduled tasks, and queued or scheduled tasks in an environment (useful for detecting problems with the Airflow concurrency configuration)
  • DAG parsing time
  • Current versus minimum number of workers (useful for detecting worker stability or scaling problems)
  • Worker pod evictions
  • Number of errors written to the logs of workers, schedulers, web servers, or other components (individual charts)

Before you begin

To use Cloud Composer and Cloud Monitoring, you must create a Google Cloud project and enable billing. The project must contain a Cloud Composer environment. This guide refers to this project as the monitoring project.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  3. Verify that billing is enabled for your Google Cloud project.

  4. Install Terraform if it is not already installed.
  5. Configure your project's metrics scope. By default, a project can display or monitor only the time series data that it stores. To view or monitor data stored in multiple projects, configure the project's metrics scope. For more information, see Metrics scopes overview.
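
The metrics scope can also be managed from Terraform. The following is a minimal sketch of a variable-driven variant of the "Add Monitored Projects to the Monitoring project" resource from main.tf. The variable names monitoring_project and monitored_projects and the resource label scoped are hypothetical, chosen only for illustration. Use either this variant or the hard-coded resource in main.tf, not both.

```hcl
# Hypothetical variable: ID of the project that hosts the dashboard and alert policies.
variable "monitoring_project" {
  type        = string
  description = "ID of the monitoring (scoping) project"
}

# Hypothetical variable: IDs of the projects whose Cloud Composer environments are monitored.
variable "monitored_projects" {
  type        = set(string)
  description = "IDs of the monitored projects"
}

# Adds each monitored project to the metrics scope of the monitoring project.
resource "google_monitoring_monitored_project" "scoped" {
  for_each      = var.monitored_projects
  metrics_scope = "locations/global/metricsScopes/${var.monitoring_project}"
  name          = each.value
}
```

With this layout, the project lists can be supplied through a .tfvars file or -var flags instead of being edited inside the resource.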
Implementation steps

    1. On the local computer where you run Terraform, set the GOOGLE_CLOUD_PROJECT environment variable to the ID of your monitoring project:

      export GOOGLE_CLOUD_PROJECT=MONITORING_PROJECT_ID
      
    2. Verify that your Terraform Google provider is authenticated and has the following roles:

      • roles/monitoring.editor in the monitoring project
      • roles/monitoring.viewer and roles/logging.viewer in all monitored projects
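
      If these roles are not yet granted, they could be assigned with Terraform as well. The following is a sketch under stated assumptions: the variables monitoring_project, monitored_projects, and terraform_principal are hypothetical, and the configuration must be applied by a principal that is already allowed to manage IAM in the affected projects, separately from main.tf.

      ```hcl
      # Hypothetical variables, declared here so that the block stands alone.
      variable "monitoring_project" {
        type = string
      }

      variable "monitored_projects" {
        type = set(string)
      }

      # Hypothetical: the identity that runs Terraform,
      # for example "serviceAccount:NAME@PROJECT_ID.iam.gserviceaccount.com".
      variable "terraform_principal" {
        type = string
      }

      # Editor access to Cloud Monitoring in the monitoring project.
      resource "google_project_iam_member" "monitoring_editor" {
        project = var.monitoring_project
        role    = "roles/monitoring.editor"
        member  = var.terraform_principal
      }

      locals {
        viewer_roles = ["roles/monitoring.viewer", "roles/logging.viewer"]

        # One entry per (monitored project, role) pair, keyed by "project/role".
        viewer_bindings = {
          for pair in setproduct(var.monitored_projects, local.viewer_roles) :
          "${pair[0]}/${pair[1]}" => { project = pair[0], role = pair[1] }
        }
      }

      # Read-only access to metrics and logs in every monitored project.
      resource "google_project_iam_member" "monitored_viewer" {
        for_each = local.viewer_bindings
        project  = each.value.project
        role     = each.value.role
        member   = var.terraform_principal
      }
      ```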
    3. Copy the following main.tf file to the local computer where you run Terraform.

      #   Monitoring for multiple Cloud Composer environments
      #
      #   Usage:
      #       1. Create a new project that you will use for monitoring of Cloud Composer environments in other projects
      #       2. Replace YOUR_MONITORING_PROJECT with the name of this project in the "metrics_scope" parameter that is part of the "Add Monitored Projects to the Monitoring project" section
      #       3. Replace the list of projects to monitor with your list of projects with Cloud Composer environments to be monitored in the "for_each" parameter of the "Add Monitored Projects to the Monitoring project" section
      #       4. Set up your environment and apply the configuration following these steps: https://cloud.google.com/docs/terraform/basic-commands. Your GOOGLE_CLOUD_PROJECT environment variable should be the new monitoring project you just created.
      #
      #   The script creates the following resources in the monitoring project:
      #           1. Adds monitored projects to Cloud Monitoring
      #           2. Creates Alert Policies
      #           3. Creates Monitoring Dashboard
      #
      
      
      
      #######################################################
      #
      # Add Monitored Projects to the Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_monitored_project" "projects_monitored" {
        for_each      = toset(["YOUR_PROJECT_TO_MONITOR_1", "YOUR_PROJECT_TO_MONITOR_2", "YOUR_PROJECT_TO_MONITOR_3"])
        metrics_scope = join("", ["locations/global/metricsScopes/", "YOUR_MONITORING_PROJECT"])
        name          = each.value
      }
      
      
      #######################################################
      #
      # Create alert policies in Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_alert_policy" "environment_health" {
        display_name = "Environment Health"
        combiner     = "OR"
        conditions {
          display_name = "Environment Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| {metric 'composer.googleapis.com/environment/dagbag_size'",
              "| group_by 5m, [value_dagbag_size_mean: if(mean(value.dagbag_size) > 0, 1, 0)]",
              "| align mean_aligner(5m)",
              "| group_by [resource.project_id, resource.environment_name],    [value_dagbag_size_mean_aggregate: aggregate(value_dagbag_size_mean)];  ",
              "metric 'composer.googleapis.com/environment/healthy'",
              "| group_by 5m,    [value_sum_signals: aggregate(if(value.healthy,1,0))]",
              "| align mean_aligner(5m)| absent_for 5m }",
              "| outer_join 0",
              "| group_by [resource.project_id, resource.environment_name]",
              "| value val(2)",
              "| align mean_aligner(5m)",
              "| window(5m)",
              "| condition val(0) < 0.9"
            ])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_health" {
        display_name = "Database Health"
        combiner     = "OR"
        conditions {
          display_name = "Database Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database_health'",
              "| group_by 5m,",
              "    [value_database_health_fraction_true: fraction_true(value.database_health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_database_health_fraction_true_aggregate:",
              "       aggregate(value_database_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_database_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_database_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_health" {
        display_name = "Web Server Health"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/web_server/health'",
              "| group_by 5m, [value_health_fraction_true: fraction_true(value.health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_health_fraction_true_aggregate:",
              "       aggregate(value_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_heartbeat" {
        display_name = "Scheduler Heartbeat"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Heartbeat"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/scheduler_heartbeat_count'",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate:",
              "      aggregate(value.scheduler_heartbeat_count)]",
              "| every 10m",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate_mean:",
              "       mean(value_scheduler_heartbeat_count_aggregate)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_scheduler_heartbeat_count_aggregate_mean_aggregate:",
              "       aggregate(value_scheduler_heartbeat_count_aggregate_mean)]",
            "| condition val() < 80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_cpu" {
        display_name = "Database CPU"
        combiner     = "OR"
        conditions {
          display_name = "Database CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/cpu/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_cpu" {
        display_name = "Scheduler CPU"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_cpu" {
        display_name = "Worker CPU"
        combiner     = "OR"
        conditions {
          display_name = "Worker CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_cpu" {
        display_name = "Web Server CPU"
        combiner     = "OR"
        conditions {
          display_name = "Web Server CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "parsing_time" {
        display_name = "DAG Parsing Time"
        combiner     = "OR"
        conditions {
          display_name = "DAG Parsing Time"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/dag_processing/total_parse_time'",
              "| group_by 5m, [value_total_parse_time_mean: mean(value.total_parse_time)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val(0) > cast_units(30,\"s\")"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_memory" {
        display_name = "Database Memory"
        combiner     = "OR"
        conditions {
          display_name = "Database Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/memory/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_memory" {
        display_name = "Scheduler Memory"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        documentation {
          content = join("", [
            "Scheduler Memory exceeds a threshold, averaged across all schedulers in the environment. ",
          "Add more schedulers OR increase scheduler's memory OR reduce scheduling load (e.g. through lower parsing frequency or lower number of DAGs/tasks running)"])
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_memory" {
        display_name = "Worker Memory"
        combiner     = "OR"
        conditions {
          display_name = "Worker Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_memory" {
        display_name = "Web Server Memory"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduled_tasks_percentage" {
        display_name = "Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'scheduled'",
            "| condition val() > 0.80"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_tasks_percentage" {
        display_name = "Queued Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'queued'",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.95"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_or_scheduled_tasks_percentage" {
        display_name = "Queued or Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued or Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], or(metric.state = 'queued', metric.state = 'scheduled' )",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      
      resource "google_monitoring_alert_policy" "workers_above_minimum" {
        display_name = "Workers above minimum (negative = missing workers)"
        combiner     = "OR"
        conditions {
          display_name = "Workers above minimum"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| { metric 'composer.googleapis.com/environment/num_celery_workers'",
              "| group_by 5m, [value_num_celery_workers_mean: mean(value.num_celery_workers)]",
              "| every 5m",
              "; metric 'composer.googleapis.com/environment/worker/min_workers'",
              "| group_by 5m, [value_min_workers_mean: mean(value.min_workers)]",
              "| every 5m }",
              "| outer_join 0",
              "| sub",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() < 0"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "pod_evictions" {
        display_name = "Worker pod evictions"
        combiner     = "OR"
        conditions {
          display_name = "Worker pod evictions"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/worker/pod_eviction_count'",
              "| align delta(1m)",
              "| every 1m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0"])
            duration = "60s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_errors" {
        display_name = "Scheduler Errors"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-scheduler' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_errors" {
        display_name = "Worker Errors"
        combiner     = "OR"
        conditions {
          display_name = "Worker Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-worker' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_errors" {
        display_name = "Web Server Errors"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-webserver' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "other_errors" {
        display_name = "Other Errors"
        combiner     = "OR"
        conditions {
          display_name = "Other Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter",
              "    (metric.log !~ 'airflow-scheduler|airflow-worker|airflow-webserver'",
              "     && metric.severity == 'ERROR')",
              "| group_by 5m, [value_log_entry_count_max: max(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_max_aggregate: aggregate(value_log_entry_count_max)]",
            "| condition val() > 10"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # Uncomment to set an auto-close strategy for the alert.
        # auto_close is a duration in seconds, for example "1800s" for 30 minutes.
        #alert_strategy {
        #    auto_close = "1800s"
        #}
      }
      
      
      #######################################################
      #
      # Create Monitoring Dashboard
      #
      ########################################################
      
      
      resource "google_monitoring_dashboard" "Composer_Dashboard" {
        dashboard_json = <<EOF
      {
        "category": "CUSTOM",
        "displayName": "Cloud Composer - Monitoring Platform",
        "mosaicLayout": {
          "columns": 12,
          "tiles": [
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "MARKDOWN"
                },
                "title": "Health"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 0
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.environment_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_health.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 5
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_heartbeat.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 5
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow Task Execution and DAG Parsing"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 9
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_or_scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 14
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.parsing_time.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 14
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Workers presence"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 18
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.workers_above_minimum.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 19
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.pod_evictions.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 19
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "CPU Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 23
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 28
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 28
            },
      
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Memory Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 32
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 37
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 37
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow component errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 41
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 48
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.other_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 48
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Task errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 52
            }
          ]
        }
      }
      EOF
      }
    4. Edit the "google_monitoring_monitored_project" resource block:

      1. Replace the list of projects in the for_each argument with your monitored projects.
      2. Replace "YOUR_MONITORING_PROJECT" in metrics_scope with the name of your monitoring project.
    5. Review the configuration and verify that the resources Terraform is going to create or update match your expectations. Make corrections as needed.

      terraform plan
      
    6. Apply the Terraform configuration by running the following command and entering "yes" at the prompt:

      terraform apply
      
    7. In the Google Cloud console of your monitoring project, go to the Monitoring dashboards page:

      Go to Monitoring dashboards

    8. On the Custom tab, find the custom dashboard named Cloud Composer - Monitoring Platform.
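
As an illustration of step 4, the edited "google_monitoring_monitored_project" block might look roughly like the sketch below. The resource label projects_monitored and the project IDs example-project-a and example-project-b are hypothetical placeholders, not values from this guide; the metrics_scope format follows the Terraform Google provider documentation, and YOUR_MONITORING_PROJECT still has to be replaced with the ID of your own monitoring project.

```hcl
# Sketch only: the resource label and the project IDs are placeholders.
resource "google_monitoring_monitored_project" "projects_monitored" {
  # One monitored-project resource is created per entry in this set.
  for_each = toset([
    "example-project-a", # hypothetical monitored project ID
    "example-project-b", # hypothetical monitored project ID
  ])

  # Metrics scope of the monitoring project that hosts the dashboard.
  metrics_scope = "locations/global/metricsScopes/YOUR_MONITORING_PROJECT"

  # Project added to the metrics scope in this iteration.
  name = each.value
}
```

With a block of this shape, adding another project to monitoring should only require a new entry in the for_each set, followed by another terraform plan and terraform apply run.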

    Next steps