You can use Cloud Monitoring and Cloud Logging with Cloud Composer.
Cloud Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. Cloud Monitoring collects and ingests metrics, events, and metadata from Cloud Composer to generate insights in dashboards and charts. You can use Cloud Monitoring to understand the performance and health of your Cloud Composer environments and Airflow metrics.
Logging captures logs produced by the scheduler and worker containers in your environment's cluster. These logs contain system-level and Airflow dependency information to help with debugging. For information about viewing logs, see View Airflow logs.
Before you begin
The following permissions are required to access logs and metrics for your Cloud Composer environment:
- Read-only access to logs and metrics: `logging.viewer` and `monitoring.viewer`
- Read-only access to logs, including private logs: `logging.privateLogViewer`
- Read/write access to metrics: `monitoring.editor`
For more information about other permissions and roles for Cloud Composer, see Access control.
To avoid duplicate logging, Cloud Logging for Google Kubernetes Engine is disabled.
Cloud Logging produces an entry for each status and event that occurs in your Google Cloud project. You can use exclusion filters to reduce the volume of logs, including the logs that Cloud Logging produces for Cloud Composer.
Excluding logs from `jobs.py` can cause health check failures and `CrashLoopBackOff` errors. You must include `-jobs.py` in exclusion filters to prevent these logs from being excluded.

Monitoring cannot plot the count values for DAGs and tasks that execute more than once per minute, and does not plot metrics for failed tasks.
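If you manage exclusion filters programmatically, the following sketch shows one way to do this with the `google-cloud-logging` client library. The project ID, exclusion name, and filter expression are illustrative assumptions, and the final negated clause is the kind of `-jobs.py` term that keeps those entries from being excluded; the exact log field that contains `jobs.py` depends on your log format.

```python
# Minimal sketch, assuming the google-cloud-logging client library and
# hypothetical names. The filter drops low-severity Composer logs while the
# final negated clause keeps jobs.py entries from being excluded.
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogExclusion

client = ConfigServiceV2Client()

exclusion = LogExclusion(
    name="composer-verbose-logs",  # hypothetical exclusion name
    description="Drop low-severity Composer logs, but never exclude jobs.py entries.",
    filter=(
        'resource.type="cloud_composer_environment" '
        "severity<=INFO "
        '-textPayload:"jobs.py"'
    ),
)

client.create_exclusion(
    parent="projects/PROJECT_ID",  # replace with your project ID
    exclusion=exclusion,
)
```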
Environment metrics
You can use environment metrics to check the resource usage and health of your Cloud Composer environments.
Environment health
To check the health of your environment, you can use the following health status metric: `composer.googleapis.com/environment/healthy`.

Cloud Composer runs a liveness DAG named `airflow_monitoring`, which runs on a schedule and reports environment health as follows:

- If the liveness DAG run finishes successfully, the health status is `True`.
- If the liveness DAG run fails, the health status is `False`.
The liveness DAG is stored in the `dags/` folder and is visible in the Airflow UI. The frequency and contents of the liveness DAG are immutable and must not be modified. Changes to the liveness DAG do not persist.
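For context, a liveness DAG of this kind is just a small scheduled DAG with a trivial task. The sketch below only illustrates the pattern; the DAG ID, schedule, and command are assumptions, not the actual `airflow_monitoring` definition, which is managed by Cloud Composer and must not be changed.

```python
# Illustrative sketch of the liveness-DAG pattern. The DAG ID, schedule, and
# command are assumptions; the real airflow_monitoring DAG is managed by
# Cloud Composer.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="liveness_check_example",            # hypothetical DAG ID
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=10),    # assumed frequency
    catchup=False,
    dagrun_timeout=timedelta(minutes=5),
):
    # If this trivial task succeeds, the DAG run succeeds; while DAG runs keep
    # succeeding, the environment health metric reports True.
    BashOperator(task_id="echo", bash_command="echo test")
```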
Environment's dependencies checks
Cloud Composer periodically checks that the environment can reach the services required for its operation and that it has enough permissions to interact with them. Examples of services required for the environment's operation are Artifact Registry, Cloud Logging, and Cloud Monitoring.
The following metrics are available for the environment's dependencies checks:
Dependency metric | API | Description |
---|---|---|
Number of dependency checks | `composer.googleapis.com/environment/health/dependency_check_count` | This metric tracks the number of times reachability checks are performed on services required for the environment's operation. |
Number of dependency permissions checks | `composer.googleapis.com/environment/health/dependency_permissions_check_count` | This metric tracks the number of times permission checks are performed on services required for the environment's operation. |
Database health
To check the health of your database, you can use the following health status metric: `composer.googleapis.com/environment/database_health`.

The Airflow monitoring pod pings the database every minute and reports the health status as `True` if a SQL connection can be established, or `False` if not.
Database metrics
The following environment metrics are available for the Airflow metadata database used by Cloud Composer environments. You can use these metrics to monitor the performance and resource usage of your environment's database instance.
For example, you might want to increase the size of your environment if it approaches resource limits, or you might want to optimize the size of your database by doing a database cleanup.
Database metric | API | Description |
---|---|---|
Database CPU usage | `composer.googleapis.com/environment/database/cpu/usage_time` | |
Database CPU cores | `composer.googleapis.com/environment/database/cpu/reserved_cores` | |
Database CPU utilization | `composer.googleapis.com/environment/database/cpu/utilization` | |
Database memory usage | `composer.googleapis.com/environment/database/memory/bytes_used` | |
Database memory quota | `composer.googleapis.com/environment/database/memory/quota` | |
Database memory utilization | `composer.googleapis.com/environment/database/memory/utilization` | |
Database disk usage | `composer.googleapis.com/environment/database/disk/bytes_used` | |
Database disk quota | `composer.googleapis.com/environment/database/disk/quota` | |
Database disk utilization | `composer.googleapis.com/environment/database/disk/utilization` | |
Database connections limit | `composer.googleapis.com/environment/database/network/max_connections` | |
Database connections | `composer.googleapis.com/environment/database/network/connections` | |
Database available for failover | `composer.googleapis.com/environment/database/available_for_failover` | Is `True` if the environment's Cloud SQL instance is in the high availability mode and is ready for failover. |
Database automatic failover requests count | `composer.googleapis.com/environment/database/auto_failover_request_count` | Total number of auto-failover requests of the environment's Cloud SQL instance. |
Worker metrics
The following environment metric is available for the Airflow workers used by Cloud Composer 2 environments.
This metric is used to automatically scale the number of workers in your environment. The Horizontal Pod Autoscaler sets this metric, and the Airflow Worker Set Controller environment component then uses its value to scale the number of Airflow workers up or down.
Worker metric | API |
---|---|
Scale Factor Target | `composer.googleapis.com/environment/worker/scale_factor_target` |
Scheduler metrics
Name | API | Description |
---|---|---|
Active schedulers | `composer.googleapis.com/environment/active_schedulers` | Number of active scheduler instances. |
Triggerer metrics
The following triggerer metrics are provided exclusively for Cloud Composer:
Name | API | Description |
---|---|---|
Active triggerers | `composer.googleapis.com/environment/active_triggerers` | The number of active triggerer instances. |
Additionally, the following Airflow metrics are available via Cloud Composer metrics:
Name | API | Name in Airflow | Description |
---|---|---|---|
Total number of running triggers | `composer.googleapis.com/workload/triggerer/num_running_triggers` | `triggers.running` | The number of running triggers per triggerer instance. |
Blocking triggers | `composer.googleapis.com/environment/trigger/blocking_count` | `triggers.blocked_main_thread` | Number of triggers that blocked the main thread (likely because of not being fully asynchronous). |
Failed triggers | `composer.googleapis.com/environment/trigger/failed_count` | `triggers.failed` | Number of triggers that failed with an error before they could fire an event. |
Succeeded triggers | `composer.googleapis.com/environment/trigger/succeeded_count` | `triggers.succeeded` | Number of triggers that have fired at least one event. |
Web server metrics
The following environment metrics are available for the Airflow web server used by Cloud Composer environments. You can use these metrics to check the performance and resource usage of your environment's Airflow web server instance.
For example, you might want to increase the web server scale and performance parameters if it constantly approaches resource limits.
Name | API | Description |
---|---|---|
Web server CPU usage | `composer.googleapis.com/environment/web_server/cpu/usage_time` | |
Web server CPU quota | `composer.googleapis.com/environment/web_server/cpu/reserved_cores` | |
Web server memory usage | `composer.googleapis.com/environment/web_server/memory/bytes_used` | |
Web server memory quota | `composer.googleapis.com/environment/web_server/memory/quota` | |
Active web servers | `composer.googleapis.com/environment/active_webservers` | Number of active web server instances. |
DAG metrics
To help you monitor the efficiency of your DAG runs and identify tasks that cause high latency, the following DAG metrics are available.
DAG metric | API |
---|---|
Number of DAG runs | `composer.googleapis.com/workflow/run_count` |
Duration of each DAG run | `composer.googleapis.com/workflow/run_duration` |
Number of task runs | `composer.googleapis.com/workflow/task/run_count` |
Duration of each task run | `composer.googleapis.com/workflow/task/run_duration` |
Cloud Monitoring shows only the metrics for completed workflow and task runs (success or failure). No Data displays when there is no workflow activity and for in-progress workflow and task runs.
Celery Executor metrics
The following Celery Executor metrics are available. These metrics can help you determine if there are sufficient worker resources in your environment.
Celery Executor metric | API |
---|---|
Number of tasks in the queue | `composer.googleapis.com/environment/task_queue_length` |
Number of online Celery workers | `composer.googleapis.com/environment/num_celery_workers` |
Airflow metrics
The following Airflow metrics are available. These metrics correspond to metrics provided by Airflow.
Name | API | Name in Airflow | Description |
---|---|---|---|
Celery task non-zero exit codes | `composer.googleapis.com/environment/celery/execute_command_failure_count` | `celery.execute_command.failure` | Number of non-zero exit codes from Celery tasks. |
Celery task publish timeouts | `composer.googleapis.com/environment/celery/task_timeout_error_count` | `celery.task_timeout_error` | Number of `AirflowTaskTimeout` errors raised when publishing a task to the Celery broker. |
Serialized DAG fetch duration | `composer.googleapis.com/environment/collect_db_dag_duration` | `collect_db_dags` | Time taken to fetch all serialized DAGs from the database. |
DAG callback exceptions | `composer.googleapis.com/environment/dag_callback/exception_count` | `dag.callback_exceptions` | Number of exceptions raised from DAG callbacks. When this happens, it means that a DAG callback is not working. |
DAG refresh errors | `composer.googleapis.com/environment/dag_file/refresh_error_count` | `dag_file_refresh_error` | Number of failures when loading any DAG files. |
DAG file load time | `composer.googleapis.com/environment/dag_processing/last_duration` | `dag_processing.last_duration.<dag_file>` | Time taken to load a specific DAG file. |
Time since DAG file processing | `composer.googleapis.com/environment/dag_processing/last_run_elapsed_time` | `dag_processing.last_run.seconds_ago.<dag_file>` | Seconds since a DAG file was last processed. |
DagFileProcessorManager stall count | `composer.googleapis.com/environment/dag_processing/manager_stall_count` | `dag_processing.manager_stalls` | Number of stalled DagFileProcessorManager processes. |
DAG parsing errors | `composer.googleapis.com/environment/dag_processing/parse_error_count` | `dag_processing.import_errors` | Number of errors generated when parsing DAG files. |
Running DAG parsing processes | `composer.googleapis.com/environment/dag_processing/processes` | `dag_processing.processes` | Number of currently running DAG parsing processes. |
Processor timeouts | `composer.googleapis.com/environment/dag_processing/processor_timeout_count` | `dag_processing.processor_timeouts` | Number of file processors that were killed because they took too long. |
Time taken to scan and import all DAG files | `composer.googleapis.com/environment/dag_processing/total_parse_time` | `dag_processing.total_parse_time` | Total time taken to scan and import all DAG files once. |
Current DAG bag size | `composer.googleapis.com/environment/dagbag_size` | `dagbag_size` | Number of DAGs found when the scheduler ran a scan based on its configuration. |
Failed SLA miss email notifications | `composer.googleapis.com/environment/email/sla_notification_failure_count` | `sla_email_notification_failure` | Number of failed SLA miss email notification attempts. |
Open slots on executor | `composer.googleapis.com/environment/executor/open_slots` | `executor.open_slots` | Number of open slots on the executor. |
Queued tasks on executor | `composer.googleapis.com/environment/executor/queued_tasks` | `executor.queued_tasks` | Number of queued tasks on the executor. |
Running tasks on executor | `composer.googleapis.com/environment/executor/running_tasks` | `executor.running_tasks` | Number of running tasks on the executor. |
Task instance successes/failures | `composer.googleapis.com/environment/finished_task_instance_count` | `ti_failures`, `ti_successes` | Overall task instance successes/failures. |
Started/finished jobs | `composer.googleapis.com/environment/job/count` | `<job_name>_start`, `<job_name>_end` | Number of started/finished jobs, such as SchedulerJob, LocalTaskJob. |
Job heartbeat failures | `composer.googleapis.com/environment/job/heartbeat_failure_count` | `<job_name>_heartbeat_failure` | Number of failed heartbeats for a job. |
Tasks created per operator | `composer.googleapis.com/environment/operator/created_task_instance_count` | `task_instance_created-<operator_name>` | Number of task instances created for a given operator. |
Operator executions | `composer.googleapis.com/environment/operator/finished_task_instance_count` | `operator_failures_<operator_name>`, `operator_successes_<operator_name>` | Number of finished task instances per operator. |
Open slots in the pool | `composer.googleapis.com/environment/pool/open_slots` | `pool.open_slots.<pool_name>` | Number of open slots in the pool. |
Queued slots in the pool | `composer.googleapis.com/environment/pool/queued_slots` | `pool.queued_slots.<pool_name>` | Number of queued slots in the pool. |
Running slots in the pool | `composer.googleapis.com/environment/pool/running_slots` | `pool.running_slots.<pool_name>` | Number of running slots in the pool. |
Starving tasks in the pool | `composer.googleapis.com/environment/pool/starving_tasks` | `pool.starving_tasks.<pool_name>` | Number of starving tasks in the pool. |
Time spent in scheduler's critical section | `composer.googleapis.com/environment/scheduler/critical_section_duration` | `scheduler.critical_section_duration` | Time spent in the critical section of the scheduler loop. Only a single scheduler can enter this loop at a time. |
Critical section lock failures | `composer.googleapis.com/environment/scheduler/critical_section_lock_failure_count` | `scheduler.critical_section_busy` | Count of times a scheduler process tried to get a lock on the critical section (needed to send tasks to the executor) and found it locked by another process. |
Externally killed tasks | `composer.googleapis.com/environment/scheduler/task/externally_killed_count` | `scheduler.tasks.killed_externally` | Number of tasks killed externally. |
Orphaned tasks | `composer.googleapis.com/environment/scheduler/task/orphan_count` | `scheduler.orphaned_tasks.cleared`, `scheduler.orphaned_tasks.adopted` | Number of orphaned tasks cleared/adopted by the scheduler. |
Running/starving/executable tasks | `composer.googleapis.com/environment/scheduler/tasks` | `scheduler.tasks.running`, `scheduler.tasks.starving`, `scheduler.tasks.executable` | Number of running/starving/executable tasks. |
Scheduler heartbeats | `composer.googleapis.com/environment/scheduler_heartbeat_count` | `scheduler_heartbeat` | Scheduler heartbeats. |
Failed SLA callback notifications | `composer.googleapis.com/environment/sla_callback_notification_failure_count` | `sla_callback_notification_failure` | Number of failed SLA miss callback notification attempts. |
Smart sensor poking exception failures | `composer.googleapis.com/environment/smart_sensor/exception_failures` | `smart_sensor_operator.exception_failures` | Number of failures caused by an exception in the previous smart sensor poking loop. |
Smart sensor poking infrastructure failures | `composer.googleapis.com/environment/smart_sensor/infra_failures` | `smart_sensor_operator.infra_failures` | Number of infrastructure failures in the previous smart sensor poking loop. |
Smart sensor poking exceptions | `composer.googleapis.com/environment/smart_sensor/poked_exception` | `smart_sensor_operator.poked_exception` | Number of exceptions in the previous smart sensor poking loop. |
Smart sensor successfully poked tasks | `composer.googleapis.com/environment/smart_sensor/poked_success` | `smart_sensor_operator.poked_success` | Number of newly succeeded tasks poked by the smart sensor in the previous poking loop. |
Smart sensor poked tasks | `composer.googleapis.com/environment/smart_sensor/poked_tasks` | `smart_sensor_operator.poked_tasks` | Number of tasks poked by the smart sensor in the previous poking loop. |
Previously succeeded task instances | `composer.googleapis.com/environment/task_instance/previously_succeeded_count` | `previously_succeeded` | Number of previously succeeded task instances. |
Killed zombie tasks | `composer.googleapis.com/environment/zombie_task_killed_count` | `zombies_killed` | Number of killed zombie tasks. |
DAG run duration | `composer.googleapis.com/workflow/dag/run_duration` | `dagrun.duration.success.<dag_id>`, `dagrun.duration.failed.<dag_id>` | Time taken for a DagRun to reach a success/failed state. |
DAG dependency check duration | `composer.googleapis.com/workflow/dependency_check_duration` | `dagrun.dependency-check.<dag_id>` | Time taken to check DAG dependencies. This metric is different from the environment's dependency and permission check metrics and applies to DAGs. |
DAG run schedule delay | `composer.googleapis.com/workflow/schedule_delay` | `dagrun.schedule_delay.<dag_id>` | Delay between the scheduled DagRun start date and the actual DagRun start date. |
Finished tasks | `composer.googleapis.com/workflow/task_instance/finished_count` | `ti.finish.<dag_id>.<task_id>.<state>` | Number of completed tasks in a given DAG. |
Task instance run duration | `composer.googleapis.com/workflow/task_instance/run_duration` | `dag.<dag_id>.<task_id>.duration` | Time taken to finish a task. |
Started tasks | `composer.googleapis.com/workflow/task_instance/started_count` | `ti.start.<dag_id>.<task_id>` | Number of started tasks in a given DAG. |
Task instance queued duration | `composer.googleapis.com/workflow/task_instance/queued_duration` | `dag.<dag_id>.<task_id>.queued_duration` | Time that a task spends in the Queued state before it switches to the Running state. |
Tasks removed from DAG | `composer.googleapis.com/workflow/task/removed_from_dag_count` | `task_removed_from_dag.<dag_id>` | Number of tasks removed for a given DAG (that is, the task no longer exists in the DAG). |
Tasks restored to DAG | `composer.googleapis.com/workflow/task/restored_to_dag_count` | `task_restored_to_dag.<dag_id>` | Number of tasks restored for a given DAG (that is, a task instance that was previously in the REMOVED state in the database is added back to the DAG file). |
Task schedule delay | `composer.googleapis.com/workflow/task/schedule_delay` | `dagrun.schedule_delay.<dag_id>` | Time elapsed between the first task's start_date and the expected DagRun start. |
Using Monitoring for Cloud Composer environments
Console
You can use Metrics Explorer to display metrics related to your environments and DAGs:
- The Cloud Composer Environment resource contains metrics for environments. To show metrics for a specific environment, filter metrics by the `environment_name` label. You can also filter by other labels, such as the environment's location or image version.
- The Cloud Composer Workflow resource contains metrics for DAGs. To show metrics for a specific DAG or task, filter metrics by the `workflow_name` and `task_name` labels. You can also filter by other labels, such as task status or Airflow operator name.
API and gcloud
You can create and manage custom dashboards and their widgets through the Cloud Monitoring API and the `gcloud monitoring dashboards` command. For more information, see Manage dashboards by API.
For more information about resources, metrics, and filters, see the Cloud Monitoring API reference.
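As an illustration of querying these metrics programmatically, the following sketch uses the Cloud Monitoring API Python client (`google-cloud-monitoring`) to read the environment health metric for one environment over the last hour. The project ID and environment name are placeholder assumptions.

```python
# Sketch: read composer.googleapis.com/environment/healthy for one environment
# over the last hour. PROJECT_ID and ENVIRONMENT_NAME are assumptions.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"             # replace with your project ID
ENVIRONMENT_NAME = "my-environment"   # replace with your environment name

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "composer.googleapis.com/environment/healthy" '
            f'AND resource.labels.environment_name = "{ENVIRONMENT_NAME}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        # environment/healthy is a boolean metric.
        print(point.interval.end_time, point.value.bool_value)
```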
Using Cloud Monitoring alerts
You can create alerting policies to monitor the values of metrics and to notify you when those metrics violate a condition.
- In the Google Cloud console, go to the Alerting page. If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- If you haven't created your notification channels and if you want to be notified, then click Edit Notification Channels and add your notification channels. Return to the Alerting page after you add your channels.
- From the Alerting page, select Create policy.
- To select the metric, expand the Select a metric menu and then do the following:
  - To limit the menu to relevant entries, enter `Cloud Composer` into the filter bar. If there are no results after you filter the menu, then disable the Show only active resources & metrics toggle.
  - For the Resource type, select Cloud Composer Environment or Cloud Composer Workflow.
  - Select a Metric category and a Metric, and then select Apply.
- Click Next.
- The settings in the Configure alert trigger page determine when the alert is triggered. Select a condition type and, if necessary, specify a threshold. For more information, see Create metric-threshold alerting policies.
- Click Next.
- Optional: To add notifications to your alerting policy, click Notification channels. In the dialog, select one or more notification channels from the menu, and then click OK.
- Optional: Update the Incident autoclose duration. This field determines when Monitoring closes incidents in the absence of metric data.
- Optional: Click Documentation, and then add any information that you want included in a notification message.
- Click Alert name and enter a name for the alerting policy.
- Click Create Policy.
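If you prefer to manage alerting policies programmatically instead of through the console, a minimal sketch along these lines, using the same `google-cloud-monitoring` client library, creates a metric-threshold policy that fires when the Celery task queue length stays above an assumed threshold. The policy name, threshold, duration, and notification channel are placeholder assumptions; adjust them to your environment.

```python
# Sketch: create a metric-threshold alerting policy on task_queue_length.
# PROJECT_ID, the threshold, duration, and policy names are assumptions.
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # replace with your project ID

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Task queue too long",  # hypothetical condition name
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "composer.googleapis.com/environment/task_queue_length" '
            'AND resource.type = "cloud_composer_environment"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=50,            # assumed threshold
        duration={"seconds": 600},     # condition must hold for 10 minutes
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period={"seconds": 300},
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Composer task queue length",  # hypothetical policy name
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    # Optional: attach existing notification channels by resource name.
    # notification_channels=["projects/my-project/notificationChannels/1234567890"],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print(created.name)
```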