Cross-project environment monitoring with Terraform

Cloud Composer 3 | Cloud Composer 2 | Cloud Composer 1

This page shows how to implement an integrated monitoring dashboard for multiple Cloud Composer environments in selected projects within the same organization.

Overview

This solution can help central enterprise platform teams support Cloud Composer environments used by other teams. It can monitor all Cloud Composer environments, including those that were not created with Terraform.

This guide implements a Cloud Monitoring dashboard for Cloud Composer, together with alerting policies that continuously track key metrics of Cloud Composer environments and raise incidents when problems occur. The dashboard automatically scans all Cloud Composer environments in the projects selected for monitoring. The implementation is based on Terraform.

The model uses one Google Cloud project as the monitoring project, which monitors (read-only) the Cloud Composer environments deployed in multiple monitored projects. The central dashboard uses Cloud Monitoring metrics from the monitored projects to display its content.

Diagram showing the monitoring project, which contains the monitoring dashboard, and three monitored projects, each containing Composer environments. An arrow labeled "metrics" connects each monitored project to the monitoring project.

The dashboard monitors and creates alerts for multiple metrics, including environment health:

Screenshot of the monitoring dashboard showing Environment Health, Database Health, Web Server Health, and Scheduler Heartbeat

or CPU metrics:

Screenshot of the monitoring dashboard showing Database CPU, Scheduler CPU, Worker CPU, and Web Server CPU

Hold the pointer over a line to see which environment it represents. The dashboard then displays the project's name and resource:

Screenshot of the monitoring dashboard showing the popup that appears when you hold the pointer over a line. The popup lists four resources, one of which corresponds to the line.

If a metric exceeds a predefined threshold, an incident is opened and a corresponding alert is shown on the chart for that metric:

Screenshot of the open incidents view showing two open incidents. Each listed incident has a link to view its details.

List of monitored metrics

The complete list of monitored metrics:

  • Cloud Composer environment health (based on the monitoring DAG)
  • Database health
  • Web server health
  • Scheduler heartbeat
  • CPU and memory usage for all workers
  • CPU and memory usage for the Airflow database
  • CPU and memory usage for the web server
  • CPU and memory usage for the Airflow schedulers
  • Proportion of queued tasks, scheduled tasks, and queued or scheduled tasks in an environment (useful for detecting Airflow concurrency configuration problems)
  • DAG parsing time
  • Current number of workers compared with the minimum: useful for understanding worker stability or scaling problems
  • Worker pod evictions
  • Number of errors written to logs by workers, schedulers, web servers, or other components (individual charts)

Before you begin

To use Cloud Composer and Cloud Monitoring, you must create a Google Cloud project and enable billing. The project must contain a Cloud Composer environment. In this guide, this project is referred to as the monitoring project.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Install Terraform if it is not already installed.
  5. Configure the project's metrics scope. By default, a project can only display or monitor the time series data that it stores. To display or monitor data stored in multiple projects, configure the project's metrics scope. For more information, see the metrics scopes overview.
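
With Terraform installed, the configuration also needs the Google provider. The following is a minimal sketch of a provider setup that could sit in the same directory as the main.tf file used later in this guide. The file name providers.tf is an assumption, and no provider version is pinned here. Because the provider block sets no project, the provider falls back to environment variables such as GOOGLE_CLOUD_PROJECT.

```hcl
# providers.tf (hypothetical file name, not part of the original guide)
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

# No project argument is set here: the provider reads the project from the
# environment, for example from the GOOGLE_CLOUD_PROJECT variable.
provider "google" {}
```

Keeping the project out of the provider block means the same files can be reused for a different monitoring project by changing only the environment variable.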

Implementation steps

  1. On the local computer where you run Terraform, set the GOOGLE_CLOUD_PROJECT environment variable to the ID of your monitoring project:

    export GOOGLE_CLOUD_PROJECT=MONITORING_PROJECT_ID
    
  2. Make sure that the Terraform Google provider is authenticated and has been granted the following roles:

    • roles/monitoring.editor in the monitoring project
    • roles/monitoring.viewer and roles/logging.viewer in all monitored projects
  3. Copy the following main.tf file to the local computer where you run Terraform.

    #   Monitoring for multiple Cloud Composer environments
    #
    #   Usage:
    #       1. Create a new project that you will use for monitoring of Cloud Composer environments in other projects
    #       2. Replace YOUR_MONITORING_PROJECT with the ID of this project in the "metrics_scope" parameter that is part of the "Add Monitored Projects to the Monitoring project" section
    #       3. Replace the list of projects to monitor with your list of projects with Cloud Composer environments to be monitored in the "for_each" parameter of the "Add Monitored Projects to the Monitoring project" section
    #       4. Set up your environment and apply the configuration following these steps: https://cloud.google.com/docs/terraform/basic-commands. Your GOOGLE_CLOUD_PROJECT environment variable should be the new monitoring project you just created.
    #
    #   The script creates the following resources in the monitoring project:
    #           1. Adds monitored projects to Cloud Monitoring
    #           2. Creates Alert Policies
    #           3. Creates Monitoring Dashboard
    #
    
    
    
    #######################################################
    #
    # Add Monitored Projects to the Monitoring project
    #
    ########################################################
    
    resource "google_monitoring_monitored_project" "projects_monitored" {
      for_each      = toset(["YOUR_PROJECT_TO_MONITOR_1", "YOUR_PROJECT_TO_MONITOR_2", "YOUR_PROJECT_TO_MONITOR_3"])
      metrics_scope = join("", ["locations/global/metricsScopes/", "YOUR_MONITORING_PROJECT"])
      name          = each.value
    }
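    
    # Optional sketch (not part of the original configuration): keep the project
    # IDs in variables instead of editing the resource above. The variable names
    # "monitoring_project" and "monitored_projects" are hypothetical. To use it,
    # uncomment the variables and replace the two hard-coded arguments above with:
    #   for_each      = toset(var.monitored_projects)
    #   metrics_scope = "locations/global/metricsScopes/${var.monitoring_project}"
    #
    #variable "monitoring_project" {
    #  type        = string
    #  description = "ID of the monitoring project that hosts the dashboard"
    #}
    #
    #variable "monitored_projects" {
    #  type        = list(string)
    #  description = "IDs of the projects with Cloud Composer environments to monitor"
    #}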
    
    
    #######################################################
    #
    # Create alert policies in Monitoring project
    #
    ########################################################
    
    resource "google_monitoring_alert_policy" "environment_health" {
      display_name = "Environment Health"
      combiner     = "OR"
      conditions {
        display_name = "Environment Health"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| {metric 'composer.googleapis.com/environment/dagbag_size'",
            "| group_by 5m, [value_dagbag_size_mean: if(mean(value.dagbag_size) > 0, 1, 0)]",
            "| align mean_aligner(5m)",
            "| group_by [resource.project_id, resource.environment_name],    [value_dagbag_size_mean_aggregate: aggregate(value_dagbag_size_mean)];  ",
            "metric 'composer.googleapis.com/environment/healthy'",
            "| group_by 5m,    [value_sum_signals: aggregate(if(value.healthy,1,0))]",
            "| align mean_aligner(5m)| absent_for 5m }",
            "| outer_join 0",
            "| group_by [resource.project_id, resource.environment_name]",
            "| value val(2)",
            "| align mean_aligner(5m)",
            "| window(5m)",
            "| condition val(0) < 0.9"
          ])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
    
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
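    
    # Optional sketch (not part of the original configuration): route incidents
    # to an email notification channel. The resource name "composer_oncall" and
    # the address below are placeholders. To use it, uncomment the resource and
    # add the notification_channels line to each alert policy that should notify.
    #
    #resource "google_monitoring_notification_channel" "composer_oncall" {
    #  display_name = "Composer on-call"
    #  type         = "email"
    #  labels = {
    #    email_address = "oncall@example.com"
    #  }
    #}
    #
    # Inside a google_monitoring_alert_policy block:
    #  notification_channels = [google_monitoring_notification_channel.composer_oncall.name]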
    
    resource "google_monitoring_alert_policy" "database_health" {
      display_name = "Database Health"
      combiner     = "OR"
      conditions {
        display_name = "Database Health"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/database_health'",
            "| group_by 5m,",
            "    [value_database_health_fraction_true: fraction_true(value.database_health)]",
            "| every 5m",
            "| group_by 5m,",
            "    [value_database_health_fraction_true_aggregate:",
            "       aggregate(value_database_health_fraction_true)]",
            "| every 5m",
            "| group_by [resource.project_id, resource.environment_name],",
            "    [value_database_health_fraction_true_aggregate_aggregate:",
            "       aggregate(value_database_health_fraction_true_aggregate)]",
          "| condition val() < 0.95"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "webserver_health" {
      display_name = "Web Server Health"
      combiner     = "OR"
      conditions {
        display_name = "Web Server Health"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/web_server/health'",
            "| group_by 5m, [value_health_fraction_true: fraction_true(value.health)]",
            "| every 5m",
            "| group_by 5m,",
            "    [value_health_fraction_true_aggregate:",
            "       aggregate(value_health_fraction_true)]",
            "| every 5m",
            "| group_by [resource.project_id, resource.environment_name],",
            "    [value_health_fraction_true_aggregate_aggregate:",
            "       aggregate(value_health_fraction_true_aggregate)]",
          "| condition val() < 0.95"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
    
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "scheduler_heartbeat" {
      display_name = "Scheduler Heartbeat"
      combiner     = "OR"
      conditions {
        display_name = "Scheduler Heartbeat"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/scheduler_heartbeat_count'",
            "| group_by 10m,",
            "    [value_scheduler_heartbeat_count_aggregate:",
            "      aggregate(value.scheduler_heartbeat_count)]",
            "| every 10m",
            "| group_by 10m,",
            "    [value_scheduler_heartbeat_count_aggregate_mean:",
            "       mean(value_scheduler_heartbeat_count_aggregate)]",
            "| every 10m",
            "| group_by [resource.project_id, resource.environment_name],",
            "    [value_scheduler_heartbeat_count_aggregate_mean_aggregate:",
            "       aggregate(value_scheduler_heartbeat_count_aggregate_mean)]",
          "| condition val() < 80"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
    
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "database_cpu" {
      display_name = "Database CPU"
      combiner     = "OR"
      conditions {
        display_name = "Database CPU"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/database/cpu/utilization'",
            "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
            "| every 10m",
            "| group_by [resource.project_id, resource.environment_name]",
          "| condition val() > 0.8"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
    
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "scheduler_cpu" {
      display_name = "Scheduler CPU"
      combiner     = "OR"
      conditions {
        display_name = "Scheduler CPU"
        condition_monitoring_query_language {
          query = join("", [
            "fetch k8s_container",
            "| metric 'kubernetes.io/container/cpu/limit_utilization'",
            "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
            "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
            "| every 10m",
            "| group_by [resource.cluster_name],",
            "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
          "| condition val() > 0.8"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
    
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "worker_cpu" {
      display_name = "Worker CPU"
      combiner     = "OR"
      conditions {
        display_name = "Worker CPU"
        condition_monitoring_query_language {
          query = join("", [
            "fetch k8s_container",
            "| metric 'kubernetes.io/container/cpu/limit_utilization'",
            "| filter (resource.pod_name =~ 'airflow-worker.*')",
            "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
            "| every 10m",
            "| group_by [resource.cluster_name],",
            "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
          "| condition val() > 0.8"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
    
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "webserver_cpu" {
      display_name = "Web Server CPU"
      combiner     = "OR"
      conditions {
        display_name = "Web Server CPU"
        condition_monitoring_query_language {
          query = join("", [
            "fetch k8s_container",
            "| metric 'kubernetes.io/container/cpu/limit_utilization'",
            "| filter (resource.pod_name =~ 'airflow-webserver.*')",
            "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
            "| every 10m",
            "| group_by [resource.cluster_name],",
            "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
          "| condition val() > 0.8"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
    
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "parsing_time" {
      display_name = "DAG Parsing Time"
      combiner     = "OR"
      conditions {
        display_name = "DAG Parsing Time"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/dag_processing/total_parse_time'",
            "| group_by 5m, [value_total_parse_time_mean: mean(value.total_parse_time)]",
            "| every 5m",
            "| group_by [resource.project_id, resource.environment_name]",
          "| condition val(0) > cast_units(30,\"s\")"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "database_memory" {
      display_name = "Database Memory"
      combiner     = "OR"
      conditions {
        display_name = "Database Memory"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/database/memory/utilization'",
            "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
            "| every 10m",
            "| group_by [resource.project_id, resource.environment_name]",
          "| condition val() > 0.8"])
          duration = "0s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "scheduler_memory" {
      display_name = "Scheduler Memory"
      combiner     = "OR"
      conditions {
        display_name = "Scheduler Memory"
        condition_monitoring_query_language {
          query = join("", [
            "fetch k8s_container",
            "| metric 'kubernetes.io/container/memory/limit_utilization'",
            "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
            "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
            "| every 10m",
            "| group_by [resource.cluster_name],",
            "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
          "| condition val() > 0.8"])
          duration = "0s"
          trigger {
            count = "1"
          }
        }
      }
      documentation {
        content = join("", [
          "Scheduler Memory exceeds a threshold, averaged across all schedulers in the environment. ",
        "Add more schedulers OR increase scheduler's memory OR reduce scheduling load (e.g. through lower parsing frequency or lower number of DAGs/tasks running)."])
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
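    
    # Optional sketch (not part of the original configuration): the other alert
    # policies in this file can carry a documentation block in the same way, so
    # that an incident explains what to check. For example, inside
    # google_monitoring_alert_policy.worker_memory (wording is illustrative only):
    #
    #documentation {
    #  mime_type = "text/markdown"
    #  content   = "Average worker memory utilization is above 80% of the container limit. Consider increasing worker memory or lowering worker concurrency."
    #}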
    
    resource "google_monitoring_alert_policy" "worker_memory" {
      display_name = "Worker Memory"
      combiner     = "OR"
      conditions {
        display_name = "Worker Memory"
        condition_monitoring_query_language {
          query = join("", [
            "fetch k8s_container",
            "| metric 'kubernetes.io/container/memory/limit_utilization'",
            "| filter (resource.pod_name =~ 'airflow-worker.*')",
            "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
            "| every 10m",
            "| group_by [resource.cluster_name],",
            "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
          "| condition val() > 0.8"])
          duration = "0s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "webserver_memory" {
      display_name = "Web Server Memory"
      combiner     = "OR"
      conditions {
        display_name = "Web Server Memory"
        condition_monitoring_query_language {
          query = join("", [
            "fetch k8s_container",
            "| metric 'kubernetes.io/container/memory/limit_utilization'",
            "| filter (resource.pod_name =~ 'airflow-webserver.*')",
            "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
            "| every 10m",
            "| group_by [resource.cluster_name],",
            "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
          "| condition val() > 0.8"])
          duration = "0s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "scheduled_tasks_percentage" {
      display_name = "Scheduled Tasks Percentage"
      combiner     = "OR"
      conditions {
        display_name = "Scheduled Tasks Percentage"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
            "| align mean_aligner(10m)",
            "| every(10m)",
            "| window(10m)",
            "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'scheduled'",
          "| condition val() > 0.80"])
          duration = "300s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "queued_tasks_percentage" {
      display_name = "Queued Tasks Percentage"
      combiner     = "OR"
      conditions {
        display_name = "Queued Tasks Percentage"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
            "| align mean_aligner(10m)",
            "| every(10m)",
            "| window(10m)",
            "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'queued'",
            "| group_by [resource.project_id, resource.environment_name]",
          "| condition val() > 0.95"])
          duration = "300s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "queued_or_scheduled_tasks_percentage" {
      display_name = "Queued or Scheduled Tasks Percentage"
      combiner     = "OR"
      conditions {
        display_name = "Queued or Scheduled Tasks Percentage"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
            "| align mean_aligner(10m)",
            "| every(10m)",
            "| window(10m)",
            "| filter_ratio_by [resource.project_id, resource.environment_name], or(metric.state = 'queued', metric.state = 'scheduled' )",
            "| group_by [resource.project_id, resource.environment_name]",
          "| condition val() > 0.80"])
          duration = "120s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    
    resource "google_monitoring_alert_policy" "workers_above_minimum" {
      display_name = "Workers above minimum (negative = missing workers)"
      combiner     = "OR"
      conditions {
        display_name = "Workers above minimum"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| { metric 'composer.googleapis.com/environment/num_celery_workers'",
            "| group_by 5m, [value_num_celery_workers_mean: mean(value.num_celery_workers)]",
            "| every 5m",
            "; metric 'composer.googleapis.com/environment/worker/min_workers'",
            "| group_by 5m, [value_min_workers_mean: mean(value.min_workers)]",
            "| every 5m }",
            "| outer_join 0",
            "| sub",
            "| group_by [resource.project_id, resource.environment_name]",
          "| condition val() < 0"])
          duration = "0s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "pod_evictions" {
      display_name = "Worker pod evictions"
      combiner     = "OR"
      conditions {
        display_name = "Worker pod evictions"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'composer.googleapis.com/environment/worker/pod_eviction_count'",
            "| align delta(1m)",
            "| every 1m",
            "| group_by [resource.project_id, resource.environment_name]",
          "| condition val() > 0"])
          duration = "60s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "scheduler_errors" {
      display_name = "Scheduler Errors"
      combiner     = "OR"
      conditions {
        display_name = "Scheduler Errors"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'logging.googleapis.com/log_entry_count'",
            "| filter (metric.log == 'airflow-scheduler' && metric.severity == 'ERROR')",
            "| group_by 5m,",
            "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
            "| every 5m",
            "| group_by [resource.project_id, resource.environment_name],",
            "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
          "| condition val() > 50"])
          duration = "300s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "worker_errors" {
      display_name = "Worker Errors"
      combiner     = "OR"
      conditions {
        display_name = "Worker Errors"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'logging.googleapis.com/log_entry_count'",
            "| filter (metric.log == 'airflow-worker' && metric.severity == 'ERROR')",
            "| group_by 5m,",
            "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
            "| every 5m",
            "| group_by [resource.project_id, resource.environment_name],",
            "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
          "| condition val() > 50"])
          duration = "300s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "webserver_errors" {
      display_name = "Web Server Errors"
      combiner     = "OR"
      conditions {
        display_name = "Web Server Errors"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'logging.googleapis.com/log_entry_count'",
            "| filter (metric.log == 'airflow-webserver' && metric.severity == 'ERROR')",
            "| group_by 5m,",
            "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
            "| every 5m",
            "| group_by [resource.project_id, resource.environment_name],",
            "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
          "| condition val() > 50"])
          duration = "300s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    resource "google_monitoring_alert_policy" "other_errors" {
      display_name = "Other Errors"
      combiner     = "OR"
      conditions {
        display_name = "Other Errors"
        condition_monitoring_query_language {
          query = join("", [
            "fetch cloud_composer_environment",
            "| metric 'logging.googleapis.com/log_entry_count'",
            "| filter",
            "    (metric.log !~ 'airflow-scheduler|airflow-worker|airflow-webserver'",
            "     && metric.severity == 'ERROR')",
            "| group_by 5m, [value_log_entry_count_max: max(value.log_entry_count)]",
            "| every 5m",
            "| group_by [resource.project_id, resource.environment_name],",
            "    [value_log_entry_count_max_aggregate: aggregate(value_log_entry_count_max)]",
          "| condition val() > 10"])
          duration = "300s"
          trigger {
            count = "1"
          }
        }
      }
      # uncomment to set an auto close strategy for the alert
      #alert_strategy {
      #    auto_close = "1800s" # 30 minutes; the API expects a duration in seconds
      #}
    }
    
    
    #######################################################
    #
    # Create Monitoring Dashboard
    #
    ########################################################
    
    
    resource "google_monitoring_dashboard" "Composer_Dashboard" {
      dashboard_json = <<EOF
    {
      "category": "CUSTOM",
      "displayName": "Cloud Composer - Monitoring Platform",
      "mosaicLayout": {
        "columns": 12,
        "tiles": [
          {
            "height": 1,
            "widget": {
              "text": {
                "content": "",
                "format": "MARKDOWN"
              },
              "title": "Health"
            },
            "width": 12,
            "xPos": 0,
            "yPos": 0
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.environment_health.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 1
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.database_health.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 1
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.webserver_health.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 5
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.scheduler_heartbeat.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 5
          },
          {
            "height": 1,
            "widget": {
              "text": {
                "content": "",
                "format": "RAW"
              },
              "title": "Airflow Task Execution and DAG Parsing"
            },
            "width": 12,
            "xPos": 0,
            "yPos": 9
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.scheduled_tasks_percentage.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 10
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.queued_tasks_percentage.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 10
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.queued_or_scheduled_tasks_percentage.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 14
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.parsing_time.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 14
          },
          {
            "height": 1,
            "widget": {
              "text": {
                "content": "",
                "format": "RAW"
              },
              "title": "Workers presence"
            },
            "width": 12,
            "xPos": 0,
            "yPos": 18
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.workers_above_minimum.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 19
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.pod_evictions.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 19
          },
          {
            "height": 1,
            "widget": {
              "text": {
                "content": "",
                "format": "RAW"
              },
              "title": "CPU Utilization"
            },
            "width": 12,
            "xPos": 0,
            "yPos": 23
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.database_cpu.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 24
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.scheduler_cpu.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 24
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.worker_cpu.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 28
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.webserver_cpu.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 28
          },
    
          {
            "height": 1,
            "widget": {
              "text": {
                "content": "",
                "format": "RAW"
              },
              "title": "Memory Utilization"
            },
            "width": 12,
            "xPos": 0,
            "yPos": 32
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.database_memory.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 33
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.scheduler_memory.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 33
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.worker_memory.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 37
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.webserver_memory.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 37
          },
          {
            "height": 1,
            "widget": {
              "text": {
                "content": "",
                "format": "RAW"
              },
              "title": "Airflow component errors"
            },
            "width": 12,
            "xPos": 0,
            "yPos": 41
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.scheduler_errors.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 42
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.worker_errors.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 42
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.webserver_errors.name}"
              }
            },
            "width": 6,
            "xPos": 0,
            "yPos": 48
          },
          {
            "height": 4,
            "widget": {
              "alertChart": {
                "name": "${google_monitoring_alert_policy.other_errors.name}"
              }
            },
            "width": 6,
            "xPos": 6,
            "yPos": 48
          },
          {
            "height": 1,
            "widget": {
              "text": {
                "content": "",
                "format": "RAW"
              },
              "title": "Task errors"
            },
            "width": 12,
            "xPos": 0,
            "yPos": 52
          }
        ]
      }
    }
    EOF
    }
  4. Edit the "google_monitoring_monitored_project" resource block:

    1. Replace the list of projects in the for_each argument with your monitored projects.
    2. Replace "YOUR_MONITORING_PROJECT" in metrics_scope with the project ID of your monitoring project.
  5. Run the following command to review the configuration and verify that the resources Terraform is going to create or update match your expectations. Make corrections as needed.

    terraform plan
    
  6. Apply the Terraform configuration by running the following command and entering yes at the prompt:

    terraform apply
    
  7. In the Google Cloud console, with your monitoring project selected, go to the Monitoring dashboards page:

    Go to Monitoring dashboards


  8. Find the custom dashboard named Cloud Composer - Monitoring Platform on the Custom tab.
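
As an illustration of the edit described in step 4, the following is a minimal sketch of how the "google_monitoring_monitored_project" block might look after customization. The resource label projects_monitored, the locals block, and the two example project IDs are assumptions made for this sketch and are not taken from the configuration above. Only the for_each and metrics_scope arguments and the "YOUR_MONITORING_PROJECT" placeholder come from the steps themselves.

```hcl
# Hypothetical values for illustration only; replace them with your own project IDs.
locals {
  # Project that hosts the dashboard and the alert policies.
  monitoring_project = "YOUR_MONITORING_PROJECT"

  # Projects whose Cloud Composer environments are monitored (read-only).
  monitored_projects = [
    "example-monitored-project-1",
    "example-monitored-project-2",
  ]
}

# Adds each monitored project to the metrics scope of the monitoring project.
# The resource label "projects_monitored" is an assumption; keep the label
# that your configuration already uses.
resource "google_monitoring_monitored_project" "projects_monitored" {
  for_each      = toset(local.monitored_projects)
  metrics_scope = "locations/global/metricsScopes/${local.monitoring_project}"
  name          = each.value
}
```

Keeping the project IDs in a locals block is only one possible design choice. It places the values that differ between organizations in a single spot, so the resource block itself does not need to change when a monitored project is added or removed. The sketch is meant as a replacement for the existing block, not as an additional resource alongside it.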

What's next