Cross-project environment monitoring with Terraform

This page shows how to implement an integrated monitoring dashboard for multiple Cloud Composer environments across selected projects in the same organization.

Overview

The solution described here can help central enterprise platform teams support Cloud Composer environments used by other teams. This implementation can monitor all Cloud Composer environments, including those that were not created with Terraform.

This guide implements a Cloud Monitoring dashboard for Cloud Composer, together with alerting policies that continuously report key metrics of Cloud Composer environments and raise incidents when problems occur. The dashboard automatically scans all Cloud Composer environments in the projects selected for this monitoring. The implementation relies on Terraform.

The model uses one Google Cloud project that acts as the monitoring project. It is used to monitor (read-only) Cloud Composer environments deployed in multiple monitored projects. The central dashboard uses Cloud Monitoring metrics from the monitored projects to render its contents.

Diagram showing the monitoring project, which contains the monitoring dashboard, and three monitored projects that contain Composer environments. Each monitored project has an arrow labeled "metrics" pointing to the monitoring project.

The dashboard monitors and creates alerts for several metrics, including environment health:

Screenshot of the monitoring dashboard showing environment health, database health, web server health, and scheduler heartbeat.

or CPU metrics:

Screenshot of the monitoring dashboard showing database CPU, scheduler CPU, worker CPU, and web server CPU.

Hold the pointer over a specific line to see which environment it represents. The dashboard then displays a project name and a resource:

Screenshot of the monitoring dashboard showing the pop-up that appears when you hold the pointer over a line. The pop-up lists four resources, one of which corresponds to the line.

If a metric exceeds a predefined threshold, an incident is raised and a corresponding alert is displayed in the chart for that metric:

Screenshot of the open incidents view showing two open incidents. Each listed incident has a link to view its details.

List of monitored metrics

The complete list of monitored metrics:

  • Cloud Composer environment health (based on the monitoring DAG)
  • Database health
  • Web server health
  • Scheduler heartbeats
  • CPU and memory utilization for all workers
  • CPU and memory utilization for the Airflow database
  • CPU and memory utilization for the web server
  • CPU and memory utilization for Airflow schedulers
  • Proportion of queued tasks, scheduled tasks, and queued or scheduled tasks in an environment (useful for detecting Airflow concurrency configuration problems)
  • DAG parsing time
  • Current versus minimum number of workers (useful for understanding worker stability or scaling problems)
  • Worker pod evictions
  • Number of errors written to the logs by workers, schedulers, the web server, or other components (individual charts)

Before you begin

To use Cloud Composer and Cloud Monitoring, you must create a Google Cloud project and enable billing. The project must contain a Cloud Composer environment. This guide refers to this project as the monitoring project.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Install Terraform if it is not already installed.
  5. Configure the metrics scope of your project. By default, a project can display or monitor only the time series data that it stores. To display or monitor data stored in multiple projects, configure the project's metrics scope. For more information, see Metrics scopes overview.
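
The metrics scope can also be expressed in Terraform. The following is a minimal sketch, not part of the original guide: the variable names `monitoring_project_id` and `monitored_project_ids` and the resource label `scope` are hypothetical. It uses the same `google_monitoring_monitored_project` resource type that the main.tf file in this guide relies on, so declare the monitored projects in only one place.

```hcl
# Hypothetical variable names, introduced for illustration only.
variable "monitoring_project_id" {
  type        = string
  description = "ID of the project that hosts the dashboard and alert policies."
}

variable "monitored_project_ids" {
  type        = set(string)
  description = "IDs of the projects that contain Cloud Composer environments."
}

# Adds each monitored project to the metrics scope of the monitoring project.
resource "google_monitoring_monitored_project" "scope" {
  for_each      = var.monitored_project_ids
  metrics_scope = "locations/global/metricsScopes/${var.monitoring_project_id}"
  name          = each.value
}
```

Passing the project list as a variable keeps project IDs out of the resource definition, which may be convenient when the set of monitored projects changes often.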

Implementation steps

    1. On the local computer where you run Terraform, set the GOOGLE_CLOUD_PROJECT environment variable to the ID of your monitoring project:

      export GOOGLE_CLOUD_PROJECT=MONITORING_PROJECT_ID
      
    2. Make sure that your Terraform Google provider is authenticated and has the following roles:

      • roles/monitoring.editor in the monitoring project
      • roles/monitoring.viewer and roles/logging.viewer in all monitored projects
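
      If you manage IAM with Terraform as well, the roles above could be granted along these lines. This is a sketch under assumptions, not part of the original guide: the variable names `iam_member`, `iam_monitoring_project`, and `iam_monitored_projects` are hypothetical, and the configuration must be applied by an identity that is already allowed to change IAM policy on these projects.

      ```hcl
      # Hypothetical variable names, introduced for illustration only.
      variable "iam_member" {
        type        = string
        description = "IAM member that runs Terraform, in the form serviceAccount:EMAIL or user:EMAIL."
      }

      variable "iam_monitoring_project" {
        type        = string
        description = "ID of the monitoring project."
      }

      variable "iam_monitored_projects" {
        type        = set(string)
        description = "IDs of the monitored projects."
      }

      # Editor access to Cloud Monitoring in the monitoring project.
      resource "google_project_iam_member" "monitoring_editor" {
        project = var.iam_monitoring_project
        role    = "roles/monitoring.editor"
        member  = var.iam_member
      }

      # Read-only access to metrics and logs in every monitored project.
      # setproduct pairs each project ID with each role.
      resource "google_project_iam_member" "monitored_viewer" {
        for_each = {
          for pair in setproduct(var.iam_monitored_projects, ["roles/monitoring.viewer", "roles/logging.viewer"]) :
          "${pair[0]}/${pair[1]}" => { project = pair[0], role = pair[1] }
        }
        project = each.value.project
        role    = each.value.role
        member  = var.iam_member
      }
      ```

      `google_project_iam_member` adds a single binding without replacing other members of the role, so existing access in the monitored projects is left in place.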
    3. Copy the following main.tf file to the local computer where you run Terraform.

      #   Monitoring for multiple Cloud Composer environments
      #
      #   Usage:
      #       1. Create a new project that you will use for monitoring of Cloud Composer environments in other projects
      #       2. Replace YOUR_MONITORING_PROJECT with the name of this project in the "metrics_scope" parameter that is part of the "Add Monitored Projects to the Monitoring project" section
      #       3. Replace the list of projects to monitor with your list of projects with Cloud Composer environments to be monitored in the "for_each" parameter of the "Add Monitored Projects to the Monitoring project" section
      #       4. Set up your environment and apply the configuration following these steps: https://cloud.google.com/docs/terraform/basic-commands. Your GOOGLE_CLOUD_PROJECT environment variable should be the new monitoring project you just created.
      #
      #   The script creates the following resources in the monitoring project:
      #           1. Adds monitored projects to Cloud Monitoring
      #           2. Creates Alert Policies
      #           3. Creates Monitoring Dashboard
      #
      
      
      
      #######################################################
      #
      # Add Monitored Projects to the Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_monitored_project" "projects_monitored" {
        for_each      = toset(["YOUR_PROJECT_TO_MONITOR_1", "YOUR_PROJECT_TO_MONITOR_2", "YOUR_PROJECT_TO_MONITOR_3"])
        metrics_scope = join("", ["locations/global/metricsScopes/", "YOUR_MONITORING_PROJECT"])
        name          = each.value
      }
      
      
      #######################################################
      #
      # Create alert policies in Monitoring project
      #
      ########################################################
      
      resource "google_monitoring_alert_policy" "environment_health" {
        display_name = "Environment Health"
        combiner     = "OR"
        conditions {
          display_name = "Environment Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| {metric 'composer.googleapis.com/environment/dagbag_size'",
              "| group_by 5m, [value_dagbag_size_mean: if(mean(value.dagbag_size) > 0, 1, 0)]",
              "| align mean_aligner(5m)",
              "| group_by [resource.project_id, resource.environment_name],    [value_dagbag_size_mean_aggregate: aggregate(value_dagbag_size_mean)];  ",
              "metric 'composer.googleapis.com/environment/healthy'",
              "| group_by 5m,    [value_sum_signals: aggregate(if(value.healthy,1,0))]",
              "| align mean_aligner(5m)| absent_for 5m }",
              "| outer_join 0",
              "| group_by [resource.project_id, resource.environment_name]",
              "| value val(2)",
              "| align mean_aligner(5m)",
              "| window(5m)",
              "| condition val(0) < 0.9"
            ])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_health" {
        display_name = "Database Health"
        combiner     = "OR"
        conditions {
          display_name = "Database Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database_health'",
              "| group_by 5m,",
              "    [value_database_health_fraction_true: fraction_true(value.database_health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_database_health_fraction_true_aggregate:",
              "       aggregate(value_database_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_database_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_database_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_health" {
        display_name = "Web Server Health"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Health"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/web_server/health'",
              "| group_by 5m, [value_health_fraction_true: fraction_true(value.health)]",
              "| every 5m",
              "| group_by 5m,",
              "    [value_health_fraction_true_aggregate:",
              "       aggregate(value_health_fraction_true)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_health_fraction_true_aggregate_aggregate:",
              "       aggregate(value_health_fraction_true_aggregate)]",
            "| condition val() < 0.95"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_heartbeat" {
        display_name = "Scheduler Heartbeat"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Heartbeat"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/scheduler_heartbeat_count'",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate:",
              "      aggregate(value.scheduler_heartbeat_count)]",
              "| every 10m",
              "| group_by 10m,",
              "    [value_scheduler_heartbeat_count_aggregate_mean:",
              "       mean(value_scheduler_heartbeat_count_aggregate)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_scheduler_heartbeat_count_aggregate_mean_aggregate:",
              "       aggregate(value_scheduler_heartbeat_count_aggregate_mean)]",
            "| condition val() < 80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_cpu" {
        display_name = "Database CPU"
        combiner     = "OR"
        conditions {
          display_name = "Database CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/cpu/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_cpu" {
        display_name = "Scheduler CPU"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_cpu" {
        display_name = "Worker CPU"
        combiner     = "OR"
        conditions {
          display_name = "Worker CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_cpu" {
        display_name = "Web Server CPU"
        combiner     = "OR"
        conditions {
          display_name = "Web Server CPU"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/cpu/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
      
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "parsing_time" {
        display_name = "DAG Parsing Time"
        combiner     = "OR"
        conditions {
          display_name = "DAG Parsing Time"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/dag_processing/total_parse_time'",
              "| group_by 5m, [value_total_parse_time_mean: mean(value.total_parse_time)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val(0) > cast_units(30,\"s\")"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "database_memory" {
        display_name = "Database Memory"
        combiner     = "OR"
        conditions {
          display_name = "Database Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/database/memory/utilization'",
              "| group_by 10m, [value_utilization_mean: mean(value.utilization)]",
              "| every 10m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_memory" {
        display_name = "Scheduler Memory"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-scheduler-.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        documentation {
          content = join("", [
            "Scheduler Memory exceeds a threshold, summed across all schedulers in the environment. ",
          "Add more schedulers OR increase scheduler's memory OR reduce scheduling load (e.g. through lower parsing frequency or lower number of DAGs/tasks running)."])
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_memory" {
        display_name = "Worker Memory"
        combiner     = "OR"
        conditions {
          display_name = "Worker Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-worker.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_memory" {
        display_name = "Web Server Memory"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Memory"
          condition_monitoring_query_language {
            query = join("", [
              "fetch k8s_container",
              "| metric 'kubernetes.io/container/memory/limit_utilization'",
              "| filter (resource.pod_name =~ 'airflow-webserver.*')",
              "| group_by 10m, [value_limit_utilization_mean: mean(value.limit_utilization)]",
              "| every 10m",
              "| group_by [resource.cluster_name],",
              "    [value_limit_utilization_mean_mean: mean(value_limit_utilization_mean)]",
            "| condition val() > 0.8"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduled_tasks_percentage" {
        display_name = "Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'scheduled'",
            "| condition val() > 0.80"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_tasks_percentage" {
        display_name = "Queued Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], metric.state = 'queued'",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.95"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "queued_or_scheduled_tasks_percentage" {
        display_name = "Queued or Scheduled Tasks Percentage"
        combiner     = "OR"
        conditions {
          display_name = "Queued or Scheduled Tasks Percentage"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/unfinished_task_instances'",
              "| align mean_aligner(10m)",
              "| every(10m)",
              "| window(10m)",
              "| filter_ratio_by [resource.project_id, resource.environment_name], or(metric.state = 'queued', metric.state = 'scheduled' )",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0.80"])
            duration = "120s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      
      resource "google_monitoring_alert_policy" "workers_above_minimum" {
        display_name = "Workers above minimum (negative = missing workers)"
        combiner     = "OR"
        conditions {
          display_name = "Workers above minimum"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| { metric 'composer.googleapis.com/environment/num_celery_workers'",
              "| group_by 5m, [value_num_celery_workers_mean: mean(value.num_celery_workers)]",
              "| every 5m",
              "; metric 'composer.googleapis.com/environment/worker/min_workers'",
              "| group_by 5m, [value_min_workers_mean: mean(value.min_workers)]",
              "| every 5m }",
              "| outer_join 0",
              "| sub",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() < 0"])
            duration = "0s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "pod_evictions" {
        display_name = "Worker pod evictions"
        combiner     = "OR"
        conditions {
          display_name = "Worker pod evictions"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'composer.googleapis.com/environment/worker/pod_eviction_count'",
              "| align delta(1m)",
              "| every 1m",
              "| group_by [resource.project_id, resource.environment_name]",
            "| condition val() > 0"])
            duration = "60s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "scheduler_errors" {
        display_name = "Scheduler Errors"
        combiner     = "OR"
        conditions {
          display_name = "Scheduler Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-scheduler' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "worker_errors" {
        display_name = "Worker Errors"
        combiner     = "OR"
        conditions {
          display_name = "Worker Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-worker' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "webserver_errors" {
        display_name = "Web Server Errors"
        combiner     = "OR"
        conditions {
          display_name = "Web Server Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter (metric.log == 'airflow-webserver' && metric.severity == 'ERROR')",
              "| group_by 5m,",
              "    [value_log_entry_count_aggregate: aggregate(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_aggregate_max: max(value_log_entry_count_aggregate)]",
            "| condition val() > 50"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      resource "google_monitoring_alert_policy" "other_errors" {
        display_name = "Other Errors"
        combiner     = "OR"
        conditions {
          display_name = "Other Errors"
          condition_monitoring_query_language {
            query = join("", [
              "fetch cloud_composer_environment",
              "| metric 'logging.googleapis.com/log_entry_count'",
              "| filter",
              "    (metric.log !~ 'airflow-scheduler|airflow-worker|airflow-webserver'",
              "     && metric.severity == 'ERROR')",
              "| group_by 5m, [value_log_entry_count_max: max(value.log_entry_count)]",
              "| every 5m",
              "| group_by [resource.project_id, resource.environment_name],",
              "    [value_log_entry_count_max_aggregate: aggregate(value_log_entry_count_max)]",
            "| condition val() > 10"])
            duration = "300s"
            trigger {
              count = "1"
            }
          }
        }
        # uncomment to set an auto close strategy for the alert
        #alert_strategy {
        #    auto_close = "30m"
        #}
      }
      
      
      #######################################################
      #
      # Create Monitoring Dashboard
      #
      ########################################################
      
      
      resource "google_monitoring_dashboard" "Composer_Dashboard" {
        dashboard_json = <<EOF
      {
        "category": "CUSTOM",
        "displayName": "Cloud Composer - Monitoring Platform",
        "mosaicLayout": {
          "columns": 12,
          "tiles": [
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "MARKDOWN"
                },
                "title": "Health"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 0
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.environment_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_health.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 1
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_health.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 5
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_heartbeat.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 5
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow Task Execution and DAG Parsing"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 9
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 10
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.queued_or_scheduled_tasks_percentage.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 14
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.parsing_time.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 14
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Workers presence"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 18
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.workers_above_minimum.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 19
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.pod_evictions.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 19
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "CPU Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 23
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 24
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 28
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_cpu.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 28
            },
      
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Memory Utilization"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 32
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.database_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 33
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_memory.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 37
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_memory.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 37
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Airflow component errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 41
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.scheduler_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.worker_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 42
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.webserver_errors.name}"
                }
              },
              "width": 6,
              "xPos": 0,
              "yPos": 48
            },
            {
              "height": 4,
              "widget": {
                "alertChart": {
                  "name": "${google_monitoring_alert_policy.other_errors.name}"
                }
              },
              "width": 6,
              "xPos": 6,
              "yPos": 48
            },
            {
              "height": 1,
              "widget": {
                "text": {
                  "content": "",
                  "format": "RAW"
                },
                "title": "Task errors"
              },
              "width": 12,
              "xPos": 0,
              "yPos": 52
            }
          ]
        }
      }
      EOF
      }
    4. Edit the "google_monitoring_monitored_project" resource block:

      1. Replace the list of projects in the for_each argument with your monitored projects.
      2. Replace "YOUR_MONITORING_PROJECT" in metrics_scope with the name of your monitoring project.
    5. Review the configuration and verify that the resources Terraform is going to create or update match your expectations. Make corrections if necessary.

      terraform plan
      
    6. Apply the Terraform configuration by running the following command and entering yes at the prompt:

      terraform apply
      
    7. In the Google Cloud console for your monitoring project, go to the Monitoring dashboard page:

      Go to Monitoring dashboard

    8. Find your custom dashboard named Cloud Composer - Monitoring Platform on the Custom tab.
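
    As an illustration of step 4, the edited resource block might look like the following. This is a minimal sketch under stated assumptions, not the exact block from the configuration: the resource label projects_monitored, the two locals, and all project IDs are hypothetical placeholders, and metrics_scope is written in the full Metrics Scope resource-name form.

    ```hcl
    # Hypothetical project IDs: replace them with your own values.
    locals {
      # Projects whose Cloud Composer environments are monitored (read-only).
      monitored_projects = ["team-a-composer", "team-b-composer"]

      # Project that hosts the dashboard and the alert policies.
      monitoring_project = "central-monitoring"
    }

    # Adds each monitored project to the metrics scope of the monitoring project.
    # The label "projects_monitored" is an assumption made for this sketch.
    resource "google_monitoring_monitored_project" "projects_monitored" {
      for_each      = toset(local.monitored_projects)
      metrics_scope = "locations/global/metricsScopes/${local.monitoring_project}"
      name          = each.value
    }
    ```

    Keeping the project list in a local value means that adding a monitored project later is a one-line change, after which terraform plan should show a single new google_monitoring_monitored_project instance.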

    What's next?