Monitoring environments across multiple projects with Terraform

Cloud Composer 3 | Cloud Composer 2 | Cloud Composer 1

This page explains how to implement an integrated monitoring dashboard for multiple Cloud Composer environments across selected projects in the same organization.

Overview

The described solution can help central enterprise platform teams support Cloud Composer environments used by other teams. This implementation can monitor all Cloud Composer environments, including those that were not created with Terraform.

This guide implements a Cloud Monitoring dashboard for Cloud Composer, together with alerting policies that continuously report key metrics of Cloud Composer environments and raise incidents when problems occur. The dashboard automatically scans all Cloud Composer environments in the projects selected for monitoring. The implementation relies on Terraform.

The model uses one Google Cloud project acting as the Monitoring project, which monitors (read-only) the Cloud Composer environments deployed in multiple monitored projects. The central dashboard draws its content from the Cloud Monitoring metrics of the monitored projects.

Diagram showing the Monitoring project, which contains the monitoring dashboard, and three monitored projects, each containing Composer environments. An arrow labeled "metrics" points from each monitored project to the Monitoring project.

The dashboard monitors a number of metrics and creates alerts for them, including environment health:

Screenshot of the monitoring dashboard showing environment health, database health, web server health, and scheduler heartbeat

or CPU metrics:

Screenshot of the monitoring dashboard showing database CPU, scheduler CPU, worker CPU, and web server CPU

Hover over a line to see which environment it represents. The dashboard then displays the project name and the resource:

Screenshot of the monitoring dashboard showing the pop-up that appears when you hover over a line. The pop-up lists four resources, one of which corresponds to the line.

If a metric exceeds a predefined threshold, an incident is raised and a corresponding alert appears in the chart for that metric:

Screenshot of the open incidents view showing two open incidents. Each listed incident has a link to view its details.

List of monitored metrics

Here is the full list of monitored metrics:

  • Cloud Composer environment health (based on the monitoring DAG)
  • Database health
  • Web server health
  • Scheduler heartbeats
  • CPU and memory utilization for all workers
  • CPU and memory utilization for the Airflow database
  • CPU and memory utilization for the web server
  • CPU and memory utilization for the Airflow schedulers
  • Proportion of tasks in an environment that are queued, scheduled, or either queued or scheduled (useful for spotting Airflow concurrency configuration issues)
  • DAG parsing time
  • Current number of workers compared with the minimum number of workers (useful for understanding worker stability or scaling issues)
  • Worker pod evictions
  • Number of errors logged by workers, schedulers, the web server, or other components (individual charts)

Before you begin

To use Cloud Composer and Cloud Monitoring, you must create a Google Cloud project and enable billing. The project must contain a Cloud Composer environment. In this guide, this project is called the Monitoring project.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Install Terraform if it is not already installed.
  5. Configure the metrics scope of your project. By default, a project can display or monitor only the time series data that it stores. To display or monitor data stored in multiple projects, configure the project's metrics scope. For more information, see Metrics scopes overview.

Implementation steps

    1. On the local computer where you run Terraform, set the GOOGLE_CLOUD_PROJECT environment variable to the ID of your Monitoring project:

      export GOOGLE_CLOUD_PROJECT=MONITORING_PROJECT_ID
      
    2. Make sure that your Terraform Google provider is authenticated and has the following roles:

      • roles/monitoring.editor in the Monitoring project
      • roles/monitoring.viewer and roles/logging.viewer in all monitored projects
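
      If the identity that Terraform uses does not have these roles yet, an administrator who is allowed to set IAM policy on the projects could grant them with a separate configuration along the following lines. This is a minimal sketch, not part of the original guide: the local names, the resource names, and the TERRAFORM_SA_EMAIL placeholder are assumptions for illustration.

      # Hypothetical sketch: grant the roles listed above to the identity
      # that runs Terraform. Replace all placeholder values before applying.
      locals {
        monitored_projects = ["YOUR_PROJECT_TO_MONITOR_1", "YOUR_PROJECT_TO_MONITOR_2"]
        viewer_roles       = ["roles/monitoring.viewer", "roles/logging.viewer"]
        terraform_identity = "serviceAccount:TERRAFORM_SA_EMAIL"
      }

      # Editor access to Cloud Monitoring in the Monitoring project.
      resource "google_project_iam_member" "monitoring_editor" {
        project = "YOUR_MONITORING_PROJECT"
        role    = "roles/monitoring.editor"
        member  = local.terraform_identity
      }

      # Read-only access in every monitored project: one binding per
      # (project, role) pair.
      resource "google_project_iam_member" "monitored_viewers" {
        for_each = {
          for pair in setproduct(local.monitored_projects, local.viewer_roles) :
          "${pair[0]}/${pair[1]}" => { project = pair[0], role = pair[1] }
        }
        project = each.value.project
        role    = each.value.role
        member  = local.terraform_identity
      }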
    3. Copy the following main.tf file to the local computer where you run Terraform.

      #   Monitoring for multiple Cloud Composer environments
      #
      #   Usage:
      #       1. Create a new project that you will use for monitoring of Cloud Composer environments in other projects
      #       2. Replace YOUR_MONITORING_PROJECT with the name of this project in the "metrics_scope" parameter that is part of the "Add Monitored Projects to the Monitoring project" section
      #       3. Replace the list of projects to monitor with your list of projects with Cloud Composer environments to be monitored in the "for_each" parameter of the "Add Monitored Projects to the Monitoring project" section
      #       4. Set up your environment and apply the configuration following these steps: https://cloud.google.com/docs/terraform/basic-commands. Your GOOGLE_CLOUD_PROJECT environment variable should be the new monitoring project you just created.
      #
      #   The script creates the following resources in the monitoring project:
      #           1. Adds monitored projects to Cloud Monitoring
      #           2. Creates Alert Policies
      #           3. Creates Monitoring Dashboard
      #
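
      #   Optional sketch (an assumption, not part of the original script):
      #   declare the Google provider explicitly instead of relying only on the
      #   GOOGLE_CLOUD_PROJECT environment variable. Replace the placeholder with
      #   the ID of your Monitoring project, and remove this block if you already
      #   configure the provider in another file.
      terraform {
        required_providers {
          google = {
            source = "hashicorp/google"
          }
        }
      }

      provider "google" {
        project = "YOUR_MONITORING_PROJECT"
      }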
      
      
      
      #######################################################
      #
      # Add Monitored Projects to the Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_monitored_project" "projects_monitored" {
        for_each      = toset(["YOUR_PROJECT_TO_MONITOR_1", "YOUR_PROJECT_TO_MONITOR_2", "YOUR_PROJECT_TO_MONITOR_3"])
        metrics_scope = join("", ["locations/global/metricsScopes/", "YOUR_MONITORING_PROJECT"])
        name          = each.value
      }
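
      # Optional sketch (an assumption, not part of the original script):
      # expose the for_each keys of the resource above, that is, the monitored
      # project IDs, so that `terraform output` shows which projects were added
      # to the metrics scope. The output name is hypothetical.
      output "monitored_project_ids" {
        description = "IDs of the projects added to the metrics scope of the Monitoring project"
        value       = keys(google_monitoring_monitored_project.projects_monitored)
      }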
      
      
      #######################################################
      #
      # Create alert policies in Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_alert_policy" "environment_health" {
        display_name = "Environment Health"
        combiner     = "OR"
        conditions {
          display_name = "Environment Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| {metric 'composer.googleapis.com/environment/dagbag_size'",
              "| group_by 5m, [value_dagbag_size_mean: if(mean(value.dagbag_size) > 0, 1, 0)]",
              "| align mean_aligner(5m)",
              "| group_by [resource.project_id, resource.environment_name],    [value_dagbag_size_mean_aggregate: aggregate(value_dagbag_size_mean)];  ",
              "metric 'composer.googleapis.com/environment/healthy'",
              "| group_by 5m,    [value_sum_signals: aggregate(if(value.healthy,1,0))]",
              "| align mean_aligner(5m)| absent_for 5m }",
              "| outer_join 0",
              "| group_by [resource.project_id, resource.environment_name]",
              "| value val(2)",
              "| align mean_aligner(5m)",
              "| window(5m)",
              "| condition val(0) < 0.9"
            ])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
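
      # Optional sketch (an assumption, not part of the original script):
      # as written, the alert policies in this file open incidents but set no
      # notification_channels. The variable and resource names below are
      # hypothetical. With the default empty value, no channel is created.
      variable "alert_email" {
        description = "Email address that receives alert notifications; leave empty to skip"
        type        = string
        default     = ""
      }

      resource "google_monitoring_notification_channel" "email" {
        count        = var.alert_email == "" ? 0 : 1
        display_name = "Cloud Composer alerts (email)"
        type         = "email"
        labels = {
          email_address = var.alert_email
        }
      }

      # To attach the channel, add this argument inside an alert policy resource:
      #   notification_channels = google_monitoring_notification_channel.email[*].name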
      
      resource "google_monitoring_alert_policy" "database_health" {
        display_name = "Database Health"
        combiner     = "OR"
        conditions {
          display_name = "Database Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database_health'",
              "| group_by 5m,",
              "    [value_database_health_fraction_true: fraction_true(value.database_health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_database_health_fraction_true_aggregate:",
              "       aggregate(value_database_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_database_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_database_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_health" {
        display_name = "Web Server Health"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/web_server/health'",
              "| group_by 5m, [value_health_fraction_true: fraction_true(value.health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_health_fraction_true_aggregate:",
              "       aggregate(value_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_heartbeat" {
        display_name = "Scheduler Heartbeat"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Heartbeat"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/scheduler_heartbeat_count'",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate:",
              "      aggregate(value.scheduler_heartbeat_count)]",
              "| every 10m",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate_mean:",
              "       mean(value_scheduler_heartbeat_count_aggregate)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_scheduler_heartbeat_count_aggregate_mean_aggregate:",
              "       aggregate(value_scheduler_heartbeat_count_aggregate_mean)]",
            "| condition val() < 80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_cpu" {
        display_name = "Database CPU"
        combiner     = "OR"
        conditions {
          display_name = "Database CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/cpu/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_cpu" {
        display_name = "Scheduler CPU"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_cpu" {
        display_name = "Worker CPU"
        combiner     = "OR"
        conditions {
          display_name = "Worker CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_cpu" {
        display_name = "Web Server CPU"
        combiner     = "OR"
        conditions {
          display_name = "Web Server CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "parsing_time" {
        display_name = "DAG Parsing Time"
        combiner     = "OR"
        conditions {
          display_name = "DAG Parsing Time"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/dag_processing/total_parse_time'",
              "| group_by 5m, [value_total_parse_time_mean: mean(value.total_parse_time)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val(0) > cast_units(30,\"s\")"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_memory" {
        display_name = "Database Memory"
        combiner     = "OR"
        conditions {
          display_name = "Database Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/memory/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_memory" {
        display_name = "Scheduler Memory"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        documentation {
          content = join("", [
            "Scheduler Memory exceeds a threshold, summed across all schedulers in the environment. ",
          "Add more schedulers OR increase scheduler's memory OR reduce scheduling load (e.g. through lower parsing frequency or lower number of DAGs/tasks running)"])
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_memory" {
        display_name = "Worker Memory"
        combiner     = "OR"
        conditions {
          display_name = "Worker Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_memory" {
        display_name = "Web Server Memory"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduled_tasks_percentage" {
        display_name = "Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'scheduled'",
            "| condition val() > 0.80"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_tasks_percentage" {
        display_name = "Queued Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'queued'",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.95"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_or_scheduled_tasks_percentage" {
        display_name = "Queued or Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued or Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], or(metric.state = 'queued', metric.state = 'scheduled' )",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      
      resource "google_monitoring_alert_policy" "workers_above_minimum" {
        display_name = "Workers above minimum (negative = missing workers)"
        combiner     = "OR"
        conditions {
          display_name = "Workers above minimum"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| { metric 'composer.googleapis.com/environment/num_celery_workers'",
              "| group_by 5m, [value_num_celery_workers_mean: mean(value.num_celery_workers)]",
              "| every 5m",
              "; metric 'composer.googleapis.com/environment/worker/min_workers'",
              "| group_by 5m, [value_min_workers_mean: mean(value.min_workers)]",
              "| every 5m }",
              "| outer_join 0",
              "| sub",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() < 0"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "pod_evictions" {
        display_name = "Worker pod evictions"
        combiner     = "OR"
        conditions {
          display_name = "Worker pod evictions"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/worker/pod_eviction_count'",
              "| align delta(1m)",
              "| every 1m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0"])
            duration = "60s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_errors" {
        display_name = "Scheduler Errors"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-scheduler' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_errors" {
        display_name = "Worker Errors"
        combiner     = "OR"
        conditions {
          display_name = "Worker Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-worker' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_errors" {
        display_name = "Web Server Errors"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-webserver' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "other_errors" {
        display_name = "Other Errors"
        combiner     = "OR"
        conditions {
          display_name = "Other Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter",
              "    (metric.log !~ 'airflow-scheduler|airflow-worker|airflow-webserver'",
              "     && metric.severity == 'ERROR')",
              "| group_by 5m, [value_log_entry_count_max: max(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_max_aggregate: aggregate(value_log_entry_count_max)]",
            "| condition val() > 10"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # Uncomment to set an auto-close strategy for the alert.
        # auto_close is a duration in seconds, for example "1800s" (30 minutes).
        #alert_strategy {
        #    auto_close = "1800s"
        #}
      }
      
      
      #######################################################
      #
      # Create Monitoring Dashboard
      #
      ########################################################
      
      
      resource "google_monitoring_dashboard" "Composer_Dashboard" {
        dashboard_json = <<EOF
      {
        "category": "CUSTOM",
        "displayName": "Cloud Composer - Monitoring Platform",
        "mosaicLayout": {
          "columns": 12,
          "tiles": [
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "MARKDOWN"
                },
                "title": "Health"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 0
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.environment_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_health.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 5
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_heartbeat.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 5
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow Task Execution and DAG Parsing"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 9
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_or_scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 14
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.parsing_time.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 14
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Workers presence"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 18
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.workers_above_minimum.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 19
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.pod_evictions.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 19
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "CPU Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 23
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 28
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 28
            },
      
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Memory Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 32
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 37
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 37
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow component errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 41
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 48
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.other_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 48
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Task errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 52
            }
          ]
        }
      }
      EOF
      }
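      
      # Optional sketch, not part of the original configuration: expose the
      # dashboard's resource ID after `terraform apply` so that the dashboard is
      # easier to locate in the monitoring project. The output name
      # "composer_dashboard_id" is hypothetical; "id" is the identifier that
      # Terraform exports for the dashboard resource.
      output "composer_dashboard_id" {
        description = "Resource ID of the Cloud Composer monitoring dashboard."
        value       = google_monitoring_dashboard.Composer_Dashboard.id
      }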
    4. Edit the resource "google_monitoring_monitored_project" block:

      1. Replace the list of projects in the for_each argument with your monitored projects.
      2. Replace "YOUR_MONITORING_PROJECT" in metrics_scope with the name of your monitoring project.
    5. Review the configuration by running the following command, and verify that the resources Terraform is going to create or update match your expectations. Make corrections as needed.

      terraform plan
      
    6. Apply the Terraform configuration by running the following command and entering "yes" at the prompt:

      terraform apply
      
    7. In the Google Cloud console of your monitoring project, go to the Monitoring dashboards page:

      Go to Monitoring dashboards

    8. Find your custom dashboard, named Cloud Composer - Monitoring Platform, on the Custom tab.
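
    The edit described in step 4 might look like the following minimal sketch. It assumes that the block iterates over a set of project IDs with for_each and builds metrics_scope from the monitoring project. The resource label "projects_monitored" and the placeholder project names are hypothetical; adapt them to the block that is already in your configuration.

    ```hcl
    # Sketch only: attach each monitored project to the metrics scope of the
    # monitoring project. Replace every YOUR_... placeholder with your own values.
    resource "google_monitoring_monitored_project" "projects_monitored" {
      # Hypothetical IDs of the projects that contain Cloud Composer environments.
      for_each = toset([
        "YOUR_MONITORED_PROJECT_1",
        "YOUR_MONITORED_PROJECT_2",
      ])

      # Metrics scope of the monitoring project that hosts the dashboard.
      metrics_scope = "locations/global/metricsScopes/YOUR_MONITORING_PROJECT"

      # Project added to the metrics scope in this iteration.
      name = each.value
    }
    ```

    Because for_each creates one resource instance per project, adding or removing a project from the set changes only that project's attachment on the next terraform apply.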

    What's next