Scalable alerting for Apache Airflow to improve data orchestration reliability and performance
Christian Yarros
Strategic Cloud Engineer, Google
About
Apache Airflow is a popular tool for orchestrating data workflows. Google Cloud offers Cloud Composer, a fully managed workflow orchestration service built on Apache Airflow that enables you to author, schedule, and monitor pipelines. When running Cloud Composer, it’s important to have a robust logging and alerting setup to monitor your DAGs (Directed Acyclic Graphs) and minimize downtime in your data pipelines.
In this guide, we will review the hierarchy of alerting on Cloud Composer and the various alerting options available to Google Cloud engineers using Cloud Composer and Apache Airflow.
Getting started
Hierarchy of alerting on Cloud Composer
Composer environment
Cloud Composer environments are self-contained Airflow deployments based on Google Kubernetes Engine. They work with other Google Cloud services using connectors built into Airflow.
Cloud Composer provisions Google Cloud services that run your workflows and all Airflow components. The main components of an environment are GKE cluster, Airflow web server, Airflow database, and Cloud Storage bucket. For more information, check out Cloud Composer environment architecture.
Alerts at this level primarily consist of cluster and Airflow component performance and health.
Airflow DAG Runs
A DAG Run is an object representing an instantiation of a DAG at a point in time. Any time the DAG is executed, a DAG Run is created and all tasks inside it are executed. The status of the DAG Run depends on the states of its tasks. DAG Runs execute independently of one another, meaning that you can have many runs of a DAG at the same time.
Alerts at this level primarily consist of DAG Run state changes such as Success and Failure, as well as SLA Misses. Airflow’s Callback functionality can trigger code to send these alerts.
Airflow Task instances
A Task is the basic unit of execution in Airflow. Tasks are arranged into DAGs, and then have upstream and downstream dependencies set between them in order to express the order they should run in. Airflow tasks include Operators and Sensors.
Like Airflow DAG Runs, Airflow Tasks can utilize Airflow Callbacks to trigger code to send alerts.
Summary
To summarize Airflow’s alerting hierarchy: Google Cloud → Cloud Composer Service → Cloud Composer Environment → Airflow Components (Worker) → Airflow DAG Run → Airflow Task Instance.
Any production-level implementation of Cloud Composer should have alerting and monitoring capabilities at each level in the hierarchy. Our Cloud Composer engineering team has extensive documentation around monitoring and alerting at the service/environment level.
Airflow Alerting on Google Cloud
Now, let’s consider three options for alerting at the Airflow DAG Run and Airflow Task level.
Option 1: Log-based alerting policies
Google Cloud offers native tools for logging and alerting within your Airflow environment. Cloud Logging centralizes logs from various sources, including Airflow, while Cloud Monitoring lets you set up alerting policies based on specific log entries or metrics thresholds.
You can configure an alerting policy to notify you whenever a specific message appears in your included logs. For example, if you want to know when an audit log records a particular data-access message, you can get notified when the message appears. These types of alerting policies are called log-based alerting policies. Check out Configure log-based alerting policies | Cloud Logging to learn more.
These services combine nicely with Airflow’s Callback feature mentioned above. To accomplish this:
- Define a Callback function and set it at the DAG or Task level.
- Use Python’s native logging library to write a specific log message to Cloud Logging.
- Define a log-based alerting policy that is triggered by the specific log message and sends alerts to a notification channel.
Pros and cons
Pros:
- Lightweight, minimal setup: no third-party tools, no email server setup, no additional Airflow providers required
- Integration with Logs Explorer and log-based metrics for deeper insights and historical analysis
- Multiple notification channel options
Cons:
- Email alerts contain minimal info
- Learning curve and overhead for setting up log sinks and alerting policies
- Costs associated with Cloud Logging and Cloud Monitoring usage
Sample code
Airflow DAG Callback:
This Airflow DAG uses a PythonOperator to intentionally miss a defined SLA and/or raise an AirflowException. If the DAG Run enters a failed state, it triggers the log_on_dag_failure callback function; if it misses an SLA, it triggers the log_on_sla_miss callback function. Both callbacks log a specific message string, "Airflow DAG Failure:" and "Airflow SLA Miss:" respectively. These are the messages that the log-based alerting policy matches and uses to send an alert to the defined notification channel.
Airflow Task callback:
In this example, the task instance itself calls back to log_on_task_failure. Because you can set callback functions at the task level, you have fine-grained control over when and how alerts are sent for a given task.
Option 2: Email alerts via SendGrid
SendGrid is an SMTP service provider and Cloud Composer’s email notification service of choice. For more information, check out how to Configure email notifications on Cloud Composer.
Pros and cons
Pros:
- Widely supported and reliable notification method
- Detailed emails with formatted log snippets for analysis
- Uses the native Airflow EmailOperator
- Flexible recipient lists on a per-task basis
Cons:
- Can be overwhelming with a high volume of alerts
- Requires configuring an external email provider (SendGrid) and managing email templates
- Might get lost in inboxes if not prioritized or filtered correctly
- Costs associated with SendGrid
Sample code
EmailOperator
Option 3: Third-party tools such as Slack, PagerDuty
Since Airflow is open source, there are other providers to choose from that can handle alerting and notifications for you, such as Slack or PagerDuty.
Pros and cons
Pros:
- Real-time notifications in a familiar communication channel
- Customizable formatting and the ability to send messages to specific channels or users
- Integration with your team's existing communication workflow: alerts can be discussed directly, keeping context and resolution steps together, which promotes faster troubleshooting and knowledge sharing compared to isolated emails or log entries
Cons:
- Requires a third-party workspace, webhook, and API token setup
- Requires management of additional Airflow connections
- Might lead to notification fatigue if not used judiciously
- Potential security concerns if the webhook or API token is compromised
- Potentially limited long-term log storage within third-party message history
- Costs associated with third-party tools
Sample code
Slack:
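One lightweight approach is a failure callback that posts to a Slack incoming webhook using only the standard library. The webhook URL below is a placeholder, and the callback and helper names are illustrative; the Airflow Slack provider's SlackWebhookOperator is an alternative if you prefer managing the webhook as an Airflow connection:

```python
import json
import logging
import urllib.request

logger = logging.getLogger(__name__)

# Placeholder webhook URL -- in practice, store the real one in an
# Airflow Variable or Secret Manager rather than hard-coding it.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_slack_payload(dag_id: str, task_id: str) -> dict:
    """Build the JSON body for a Slack incoming-webhook message."""
    return {"text": f"Airflow DAG Failure: {dag_id} (task: {task_id})"}


def slack_on_failure(context):
    """Failure callback: POST the alert message to a Slack incoming webhook."""
    ti = context["task_instance"]
    payload = build_slack_payload(ti.dag_id, ti.task_id)
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Attach it with `on_failure_callback=slack_on_failure` at the DAG or task level, just like the logging callbacks in Option 1.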
PagerDuty:
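A sketch of a PagerDuty failure callback, assuming the apache-airflow-providers-pagerduty package is installed and a connection named `pagerduty_events_default` holds an Events API integration key (both the callback name and the `source` value are illustrative):

```python
def pagerduty_on_failure(context):
    """Failure callback that triggers a PagerDuty incident via the Events API."""
    # Imported inside the callback so the provider package is only
    # required at alert time, not at DAG parse time.
    from airflow.providers.pagerduty.hooks.pagerduty_events import PagerdutyEventsHook

    dag_id = context["dag"].dag_id
    hook = PagerdutyEventsHook(pagerduty_events_conn_id="pagerduty_events_default")
    hook.create_event(
        summary=f"Airflow DAG Failure: {dag_id}",
        severity="error",
        source="cloud-composer",  # illustrative source label
        action="trigger",
    )
```

As with the Slack example, wire it in with `on_failure_callback=pagerduty_on_failure` on the DAG or on individual tasks.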
Opinionated guidance and next steps
Considering the pros and cons, we recommend log-based alerting policies (Option 1) for Airflow alerting in production environments. This approach offers scalable log collection, simple threshold-based alerting, diverse notification channels, metric exploration, and integration with other Google Cloud services. Logging is intuitive and integrated with Cloud Composer, eliminating the need for extra provider packages.
By incorporating logging and alerting into your Airflow DAGs, you proactively monitor your data pipelines and leverage the full potential of Google Cloud.
To learn more about Cloud Composer, Apache Airflow, and the alerting mechanisms discussed in this guide, consider exploring the following resources:
Also check out some of our other Cloud Composer-related Google Cloud blogs: