This page describes how to access and use the monitoring dashboard for a Cloud Composer environment. The dashboard contains metrics and charts for monitoring trends in the DAG runs in your environment and for identifying issues with Airflow components and Cloud Composer resources.
Accessing the monitoring dashboard
1. Open the Environments page in the Cloud Console.
2. In the list, find the name of the environment you want to monitor. Click the environment name to open the Monitoring tab of the Environment details page.
Selecting a time range
You can select a time range for the data in the dashboard using the list of ranges in the top-right area of the page.
You can also zoom in on a specific time range by clicking and dragging on any chart; the new time range is applied to all charts. To reset the zoom, click the RESET ZOOM button to the left of the time range list.
Setting up alerts
You can set up alerts for a metric by clicking the bell icon in the corner of the monitoring card.
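The bell icon opens Cloud Monitoring's alert-creation flow. The same kind of policy can also be created from a JSON file, for example with `gcloud alpha monitoring policies create --policy-from-file=policy.json` (command availability may vary by gcloud version). Below is a minimal sketch of such a policy file; the metric type, threshold, and duration are illustrative choices, not values prescribed by Cloud Composer:

```python
# Sketch: build an alerting-policy JSON document for Cloud Monitoring.
# The metric type, threshold, and duration below are illustrative
# placeholders -- adapt them to the metric you want to alert on.
import json


def eviction_alert_policy(duration_s: int = 300) -> dict:
    """Policy that fires whenever a Composer worker Pod eviction is recorded."""
    return {
        "displayName": "Composer worker Pod evictions",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "Worker Pod evictions > 0",
                "conditionThreshold": {
                    # Filter selecting the Pod-eviction metric for
                    # Cloud Composer environments.
                    "filter": (
                        'resource.type = "cloud_composer_environment" AND '
                        'metric.type = "composer.googleapis.com/'
                        'environment/worker/pod_eviction_count"'
                    ),
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 0,
                    # Protobuf Duration in JSON form, e.g. "300s".
                    "duration": f"{duration_s}s",
                },
            }
        ],
    }


print(json.dumps(eviction_alert_policy(), indent=2))
```

Saving this output as `policy.json` gives you a file you can pass to the gcloud command above, or edit further in the Monitoring UI after creation.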
Viewing a metric in Monitoring
You can get a closer look at a metric by viewing it in Monitoring. To navigate there from the Cloud Composer monitoring dashboard, click the three dots in the upper-right corner of a metric card and select View in Metrics explorer.
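In Metrics Explorer, and in the Monitoring API's `projects.timeSeries.list` method that backs it, a card's data is identified by a metric type plus a resource filter and a time interval. As a rough sketch, assuming the `composer.googleapis.com/environment/healthy` metric type and an environment named `my-environment` (both placeholders), the query parameters could be built like this:

```python
# Sketch: build the filter and interval for a Monitoring API
# projects.timeSeries.list call. The metric type and environment name
# are placeholders; substitute the ones shown on your dashboard card.
from datetime import datetime, timedelta, timezone


def time_series_query(metric_type: str, environment: str, hours: int = 6) -> dict:
    """Flattened query parameters for one metric of one environment."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "filter": (
            f'metric.type = "{metric_type}" AND '
            'resource.type = "cloud_composer_environment" AND '
            f'resource.labels.environment_name = "{environment}"'
        ),
        # RFC 3339 timestamps bounding the window of interest.
        "interval.startTime": start.isoformat(),
        "interval.endTime": end.isoformat(),
    }


query = time_series_query(
    "composer.googleapis.com/environment/healthy", "my-environment"
)
```

The same filter string can be pasted into the Metrics Explorer filter field to reproduce a dashboard card there.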
Each Cloud Composer environment has its own monitoring dashboard. The metrics described below track only the DAG runs, Airflow components, and environment details of the currently selected environment.
|Environment metric|Description|
|---|---|
|CPU usage per node|A chart showing the usage of CPU cores, aggregated over all running Pods in the node, measured as a core time usage ratio. This does not include the CPU usage of the App Engine instance used for the Airflow UI or of the Cloud SQL instance. High CPU usage is often the root cause of Worker Pod evictions. If you see very high usage, consider scaling out your Composer environment or changing the schedule of your DAG runs.|
|Memory usage per node|Memory usage per node in the GKE cluster. This does not include the memory usage of the App Engine instance used for the Airflow UI or of the Cloud SQL instance. High memory usage is often the root cause of Worker Pod evictions, which may lead to DAG failures.|
|Environment health|A timeline showing the health of the Composer deployment. A green status doesn't mean that all Airflow components were operational and DAGs were able to run; it only reflects the status of the Composer deployment.|
|Database health|A timeline showing the status of the connection to the Cloud SQL instance of the environment.|
|Web server health|A timeline showing the status of the Airflow web server, generated from the HTTP status codes returned by the UI server.|
|Scheduler heartbeat|A timeline showing when the Airflow scheduler was providing a healthy heartbeat (that is, when it was responding). Check for red areas to identify Airflow scheduler issues.|
|Active workers|A chart showing the number of active workers over the selected time range. By default, this should equal the number of nodes in the Airflow cluster, but it may grow if the environment is scaled out. A drop in the number of active workers may indicate worker process failures (see the Worker Pod evictions chart).|
|Worker Pod evictions¹|A chart showing the number of GKE Worker Pod evictions over time. Pod evictions are often caused by GKE resource exhaustion; see the CPU and memory usage per node charts for more details.|
|Zombie tasks killed¹|A chart showing the number of zombie tasks killed in a small time window. Zombie tasks are often caused by the external termination of Airflow processes. The Airflow scheduler kills zombie tasks periodically, which is reflected in this chart.|
|DAG run metric|Description|
|---|---|
|Successful DAG runs|The total number of successful runs for all DAGs in the environment during the selected time range. If this drops below expected levels, it could indicate failures (see Failed DAG runs) or a scheduling issue.|
|Failed DAG runs|The total number of failed runs for all DAGs in the environment during the selected time range.|
|Failed tasks¹|The total number of tasks that failed in the environment during the selected time range. Failed tasks don't always cause a DAG run to fail, but they can be a useful signal for troubleshooting DAG errors.|
|Completed DAG runs|A bar chart showing the number of DAG successes and failures for intervals in the selected time range. This can help identify transient issues with DAG runs and correlate them with other events, such as Worker Pod evictions.|
|Median DAG run duration|A chart showing the median duration of DAG runs that completed during a small time window. This chart can help identify performance problems and spot trends in DAG duration.|
|Completed tasks¹|A chart showing the number of tasks completed in the environment in a small time window, with a breakdown of successful and failed tasks.|
|Running¹ and queued tasks|A chart showing the number of tasks running and queued at a given time. Consult the number of queued tasks to identify performance bottlenecks or excessive load; the queue grows longer when tasks can't be executed immediately. Consult the number of running tasks to spot scheduling issues; for example, a significant drop in the number of running tasks may suggest a scheduling problem.|
|DAG bag size¹|A chart showing the number of DAGs deployed to the Cloud Storage bucket and processed by Airflow at a given time. This can be helpful when analyzing performance bottlenecks; for example, an increased number of DAG deployments may degrade performance due to excessive load.|
|DAG file import errors¹|A chart showing the number of DAG parsing errors in a small time window. This can help identify when corrupted DAGs were processed by Airflow, pointing to issues in DAG source code.|
|Total parse time for all DAGs¹|A chart showing the total time required for Airflow to process all DAGs in the environment. Increased parsing time can affect scheduling efficiency.|
¹ Available for environments using Composer version 1.10.0 or higher and Airflow version 1.10.6 or higher.
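The DAG run counts on the dashboard can also be pulled programmatically, for example to feed an external report. A rough sketch of the request parameters for a `projects.timeSeries.list` call that totals DAG runs over a window, grouped by outcome: the `composer.googleapis.com/workflow/run_count` metric is taken from Composer's published metrics, but the `state` grouping label is an assumption here; verify both against the metric descriptors in your project.

```python
# Sketch: request parameters for a Monitoring API projects.timeSeries.list
# call that sums DAG runs over a window, grouped by outcome. The "state"
# label name is an assumption -- check the run_count metric's descriptor.
def dag_run_request(project_id: str, hours: int = 24) -> dict:
    """Flattened query parameters for a grouped DAG-run-count query."""
    period = f"{hours * 3600}s"
    return {
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "composer.googleapis.com/workflow/run_count"',
        # Align each series into one bucket covering the whole window,
        # then sum across series so each state yields a single total.
        "aggregation.alignmentPeriod": period,
        "aggregation.perSeriesAligner": "ALIGN_SUM",
        "aggregation.crossSeriesReducer": "REDUCE_SUM",
        "aggregation.groupByFields": ["metric.labels.state"],
    }


request = dag_run_request("my-project", hours=24)  # placeholder project ID
```

The response would contain one time series per state value, mirroring the success/failure breakdown of the Completed DAG runs chart.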