Optimize Cloud Composer via Better Airflow DAGs

January 20, 2023

Christian Yarros

Strategic Cloud Engineer

Hosting, orchestrating, and managing data pipelines is a complex process for any business. Google Cloud offers Cloud Composer - a fully managed workflow orchestration service - enabling businesses to create, schedule, monitor, and manage workflows that span across clouds and on-premises data centers. Cloud Composer is built on the popular Apache Airflow open source project and operates using the Python programming language. Apache Airflow allows users to create directed acyclic graphs (DAGs) of tasks, which can be scheduled to run at specific intervals or triggered by external events.

This guide contains a generalized checklist of activities when authoring Apache Airflow DAGs. These items follow best practices determined by Google Cloud and the open source community. A collection of performant DAGs will enable Cloud Composer to work optimally and standardized authoring will help developers manage hundreds or even thousands of DAGs. Each item will benefit your Cloud Composer environment and your development process.

Get Started

1. Standardize file names. Help other developers browse your collection of DAG files.
a. ex) team_project_workflow_version.py
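
A naming convention is easier to keep when it is checked automatically. The sketch below is one possible check, assuming a convention of lowercase segments followed by a version such as v1_0_0; the pattern and the function name are illustrative, not part of Airflow or Cloud Composer.

```python
import re

# Hypothetical pattern for <team>_<project>_<workflow>_<version>.py.
# Adjust the segments to whatever convention your team agrees on.
DAG_FILE_PATTERN = re.compile(
    r"^[a-z0-9]+_[a-z0-9]+_[a-z0-9]+_v\d+(?:_\d+)*\.py$"
)


def is_standard_dag_filename(filename: str) -> bool:
    """Return True when the file name follows the assumed convention."""
    return DAG_FILE_PATTERN.match(filename) is not None
```

A check like this could run in a pre-commit hook or CI step before DAG files are uploaded.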

2. DAGs should be deterministic.
a. A given input will always produce the same output.

3. DAGs should be idempotent.
a. Triggering the DAG multiple times with the same input has the same effect as triggering it once.

4. Tasks should be atomic and idempotent.
a. Each task should be responsible for one operation that can be re-run independently of the others. In an atomized task, a success in part of the task means a success of the entire task.
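
To make the idea of an idempotent task concrete, here is a minimal sketch that uses a plain dictionary as a stand-in for a date-partitioned table. The function names and the in-memory table are illustrative and not part of Airflow. Overwriting the partition for the run's date gives the same result however often the task is re-run, whereas appending does not.

```python
from collections import defaultdict


def load_partition_append(table, logical_date, rows):
    """Not idempotent: every re-run appends the same rows again."""
    table[logical_date].extend(rows)


def load_partition_overwrite(table, logical_date, rows):
    """Idempotent: a re-run replaces the partition for that date."""
    table[logical_date] = list(rows)


rows = [{"id": 1}, {"id": 2}]
appended = defaultdict(list)
overwritten = defaultdict(list)

for _ in range(3):  # simulate the task being retried or re-triggered
    load_partition_append(appended, "2022-01-01", rows)
    load_partition_overwrite(overwritten, "2022-01-01", rows)

print(len(appended["2022-01-01"]))     # 6 rows: duplicates pile up
print(len(overwritten["2022-01-01"]))  # 2 rows: same result every time
```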

5. Simplify DAGs as much as possible.
a. Simpler DAGs with fewer dependencies between tasks tend to have better scheduling performance because they have less overhead. A linear structure (e.g. A -> B -> C) is generally more efficient than a deeply nested tree structure with many dependencies.

6. Apply Python docstring conventions at the top of each file and for each function.
a. Python docstring conventions help other developers and platform engineers understand your Airflow DAGs.
b. Document BashOperators the same way you would document a function. For developers unfamiliar with the bash scripts referenced in a DAG, it’s difficult to troubleshoot without documentation on each bash script’s intent.

Standardize DAG Creation

7. Add an owner to your default_args.
a. Determine whether you’d prefer the email address / id of a developer, or a distribution list / team name.

8. Use with DAG() as dag: instead of dag = DAG().
a. This removes the need to pass the dag object to every operator or task group.

9. Set a version in the DAG ID.
a. Update the version after any code change in the DAG.
b. This prevents logs of deleted tasks from vanishing from the UI, no-status tasks from being generated for old DAG runs, and general confusion about when DAGs have changed.
c. Airflow open-source has plans to implement versioning in the future.

10. Add tags to your DAGs.
a. Help developers navigate the Airflow UI via tag filtering.
b. Group DAGs by organization, team, project, application, etc.

11. Add a DAG description.
a. Help other developers understand your DAG.

12. Pause your DAGs on creation.
a. This will help avoid accidental DAG runs that add load to the Cloud Composer environment.

13. Set catchup=False to avoid automatic catch ups overloading your Cloud Composer Environment.

14. Set a dagrun_timeout to prevent unfinished DAG runs from holding Cloud Composer environment resources or colliding with retries.

15. Ensure all tasks have the same start_date by default by passing start_date to the DAG during instantiation.

16. Use a static start_date with your DAGs.
a. A dynamic start_date is misleading, and can cause failures when clearing out failed task instances and missing DAG runs.

17. Set retries as a default_arg applied at the DAG level and get more granular for specific tasks only where necessary.

a. A good range is 1–4 retries. Too many retries will add unnecessary load to the Cloud Composer environment.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Default parameters applied to every task in the DAG, such as the owner and retry settings
default_args = {
    'owner': 'me',
    'retries': 2,  # 1-4 retries is a good range
    'retry_delay': timedelta(minutes=5),
}

# Use the `with` statement to define the DAG object and specify the unique DAG ID and default_args dictionary
with DAG(
    'dag_id_v1_0_0',  # versioned ID
    default_args=default_args,
    description='This is a detailed description of the DAG',  # detailed description
    start_date=datetime(2022, 1, 1),  # static start date
    dagrun_timeout=timedelta(minutes=10),  # timeout specific to this DAG
    is_paused_upon_creation=True,
    catchup=False,
    tags=['example', 'versioned_dag_id'],  # tags specific to this DAG
    schedule_interval=None,
) as dag:
    # Define a task using the BashOperator; the variable name matches the task_id
    bash_task = BashOperator(
        task_id='bash_task',
        bash_command='echo "Hello World"',
    )

18. Define what should occur for each callback function (send an email, log the context, message a Slack channel, etc.). Depending on the DAG, you may be comfortable doing nothing.
a. success
b. failure
c. sla_miss
d. retry

Example:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def on_success_callback(context):
    # called when the DAG run succeeds
    print(f"DAG run {context['run_id']} succeeded!")

def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    # called when tasks in the DAG miss their SLA; Airflow passes these five arguments
    print(f"SLA missed in DAG {dag.dag_id} for tasks: {task_list}")

def on_retry_callback(context):
    # called when a task retries (a task-level callback, so it is applied through default_args)
    print(f"Task {context['task_instance_key_str']} retrying...")

def on_failure_callback(context):
    # called when the DAG run fails
    print(f"DAG run {context['run_id']} failed!")

default_args = {
    'owner': 'me',
    'retries': 2,  # 1-4 retries is a good range
    'retry_delay': timedelta(minutes=5),
    'on_retry_callback': on_retry_callback,  # what to do on retry
}

# Create a DAG and set the callbacks
with DAG(
    'dag_id_v1_0_0',
    default_args=default_args,
    description='This is a detailed description of the DAG',
    start_date=datetime(2022, 1, 1),
    dagrun_timeout=timedelta(minutes=10),
    tags=['example', 'versioned_dag_id'],
    is_paused_upon_creation=True,
    catchup=False,
    schedule_interval=None,
    on_success_callback=on_success_callback,  # what to do on success
    sla_miss_callback=sla_miss_callback,  # what to do on SLA miss
    on_failure_callback=on_failure_callback,  # what to do on failure
) as dag:

    def run_example_task(**kwargs):
        # This is an example task that will be part of the DAG
        print(f"Running example task with context: {kwargs}")

    # Create a task and add it to the DAG.
    # Airflow 2 passes the context automatically, so provide_context is not needed.
    example_task = PythonOperator(
        task_id="example_task",
        python_callable=run_example_task,
    )

19. Use Task Groups to organize Tasks.

Example:

20. Operator variable name and task_id argument should match.

21. Add SLAs to your tasks.
a. Get notifications when tasks run longer than expected by setting SLAs at the task level.

Reduce the Load on Your Composer Environment

22. Use Jinja Templating / Macros instead of python functions.
a. Airflow's template fields allow you to incorporate values from environment variables and jinja templates into your DAGs. This helps make your DAGs idempotent (meaning multiple invocations do not change the result) and prevents unnecessary function execution during Scheduler heartbeats.
b. The Airflow engine passes a few variables by default that are accessible in all templates.

Contrary to best practices, the following example defines variables based on datetime Python functions:
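
The original listing is not reproduced here; a minimal sketch of that anti-pattern, with illustrative variable names, could look like this:

```python
from datetime import datetime, timedelta

# Anti-pattern: evaluated every time the DAG file is parsed, and always
# relative to the wall clock rather than to the DAG run being executed.
today = datetime.today()
yesterday = today - timedelta(days=1)

# The resulting command changes from one parse to the next.
bash_command = f"echo processing data for {yesterday:%Y-%m-%d}"
```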

If this code is in a DAG file, these functions execute on every Scheduler heartbeat, which may not be performant. Even more importantly, this doesn't produce an idempotent DAG. You can't rerun a previously failed DAG run for a past date because datetime.today() is relative to the current date, not the DAG execution date.

A better way of implementing this is to use one of Airflow's built-in template variables, such as {{ ds }}, which Airflow renders from the DAG run's logical date.
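
As a sketch of the templated approach, assuming the value ends up in a templated operator field such as the BashOperator's bash_command: `{{ ds }}` and `macros.ds_add` are built into Airflow, while the command itself is illustrative.

```python
# These are plain strings at parse time, so nothing is computed when the
# Scheduler parses the file. Airflow renders the placeholders for each
# DAG run, so re-running a past run yields that run's dates.
run_date = "{{ ds }}"                          # logical date, YYYY-MM-DD
previous_date = "{{ macros.ds_add(ds, -1) }}"  # one day before the logical date

bash_command = f"echo processing data from {previous_date} to {run_date}"
```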

23. Avoid creating your own additional Airflow Variables.
a. The metadata database stores these variables and requires database connections to retrieve them. This can affect the performance of the Cloud Composer Environment. Use Environment Variables or Google Cloud Secrets instead.
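
A minimal sketch of reading configuration from an environment variable instead of an Airflow Variable. The variable name OUTPUT_BUCKET and the default value are assumptions for illustration; the variable would be set on the Cloud Composer environment itself.

```python
import os

# Reading an environment variable does not open a connection to the
# Airflow metadata database, unlike retrieving an Airflow Variable.
# OUTPUT_BUCKET and its default are hypothetical names.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "my-default-bucket")

output_path = f"gs://{OUTPUT_BUCKET}/exports/"
```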

24. Avoid running all DAGs on the exact same schedules (disperse workload as much as possible).
a. Prefer cron expressions for schedule intervals over Airflow macros or timedelta objects. Cron expressions give a more rigid schedule and make it easier to spread workloads throughout the day, which eases the load on your Cloud Composer environment.
b. Crontab.guru can help with generating specific cron expression schedules. Check out the examples here.

Examples:
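
The original examples are not reproduced here. A few standard cron expressions, followed by a hypothetical helper (not part of Airflow) that offsets daily schedules so that DAGs do not all start at the same minute:

```python
# Standard five-field cron expressions (minute hour day-of-month month day-of-week):
#   "0 2 * * *"     every day at 02:00
#   "30 */6 * * *"  at minute 30 of every 6th hour
#   "15 8 * * 1-5"  at 08:15, Monday through Friday


def staggered_daily_cron(index: int, start_hour: int = 1, step_minutes: int = 10) -> str:
    """Return a daily cron expression offset by index * step_minutes minutes."""
    total_minutes = start_hour * 60 + index * step_minutes
    hour, minute = divmod(total_minutes, 60)
    return f"{minute} {hour % 24} * * *"


# Hypothetical DAG names, each assigned its own start time.
dag_names = ["ingest_orders", "ingest_customers", "ingest_payments"]
schedules = {name: staggered_daily_cron(i) for i, name in enumerate(dag_names)}
```

Each resulting string can be passed as the DAG's schedule_interval.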

25. Avoid XComs except for small amounts of data.
a. These add storage and introduce more connections to the database.
b. Use JSON dicts as values if absolutely necessary. (one connection for many values inside dict)
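
A sketch of the "one small dict" approach. The callable below is hypothetical; when a function like it is used as a PythonOperator callable, its return value is stored as a single XCom, so it should carry metadata about the data rather than the data itself.

```python
import json


def summarize_load(row_count: int, table: str) -> dict:
    """Hypothetical task callable: return small metadata, never the rows."""
    return {"table": table, "row_count": row_count, "status": "ok"}


summary = summarize_load(1250, "comments_partitioned")

# The value should be JSON-serializable and small.
payload = json.dumps(summary)
```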

26. Avoid adding unnecessary objects in the dags/ Google Cloud Storage path.
a. If you must, add an .airflowignore file to GCS paths that the Airflow Scheduler does not need to parse. (sql, plug-ins, etc.)

27. Set execution timeouts for tasks.

Example:
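
The original listing is not reproduced here. One way to apply a timeout to every task is through default_args, sketched below with illustrative values; execution_timeout is a standard operator argument that takes a timedelta.

```python
from datetime import timedelta

# Passed to the DAG as DAG(..., default_args=default_args). A task that
# runs longer than execution_timeout is failed and, if retries remain,
# retried.
default_args = {
    "owner": "me",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=30),  # illustrative value
}

# A single slow task can override the default by passing its own
# argument, for example execution_timeout=timedelta(hours=2).
```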

28. Use Deferrable Operators over Sensors when possible.
a. A deferrable operator can suspend itself and free up the worker when it knows it has to wait, and hand off the job of resuming it to a Trigger. As a result, while it is suspended (deferred), it is not taking up a worker slot, and your cluster wastes fewer resources on idle Operators or Sensors.

Example:

29. When using Sensors, always define mode, poke_interval, and timeout.
a. Sensors require Airflow workers to run.
b. Sensor checking every n seconds (i.e. poke_interval < 60)? Use mode=poke. A sensor in mode=poke will continuously poll every n seconds and hold Airflow worker resources.
c. Sensor checking every n minutes (i.e. poke_interval >= 60)? Use mode=reschedule. A sensor in mode=reschedule will free up Airflow worker resources between poke intervals.

Example:
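
The original listing is not reproduced here. The rule of thumb above can be sketched as a small helper that always returns all three settings; the helper itself is hypothetical, while mode, poke_interval, and timeout are standard sensor arguments.

```python
def sensor_settings(poke_interval: int, timeout: int) -> dict:
    """Build explicit sensor kwargs: poke under 60 seconds, else reschedule."""
    mode = "poke" if poke_interval < 60 else "reschedule"
    return {"mode": mode, "poke_interval": poke_interval, "timeout": timeout}


# Check every 5 minutes and give up after 1 hour.
slow_check = sensor_settings(poke_interval=300, timeout=60 * 60)

# Check every 10 seconds and give up after 2 minutes.
fast_check = sensor_settings(poke_interval=10, timeout=120)
```

Inside a DAG, the result can be unpacked into any sensor, for example `**slow_check` alongside the sensor's own arguments.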

30. Offload processing to external services (BigQuery, Dataproc, Cloud Functions, etc.) to minimize load on the Cloud Composer environment.
a. These services usually have their own Airflow Operators for you to utilize.

31. Do not use sub-DAGs.
a. Sub-DAGs were a feature in older versions of Airflow that allowed users to create reusable groups of tasks within DAGs. However, Airflow 2.0 deprecated sub-DAGs because they caused performance and functional issues.

32. Use Pub/Sub for DAG-to-DAG dependencies.
a. Here is an example for multi-cluster / dag-to-dag dependencies.

33. Make DAGs load faster.
a. Avoid unnecessary “Top-level” Python code. DAGs with many imports, variables, functions outside of the DAG will introduce greater parse times for the Airflow Scheduler and in turn reduce the performance and scalability of Cloud Composer / Airflow.
b. Moving imports and functions within the DAG can reduce parse time (in the order of seconds).
c. Ensure that developed DAGs do not increase DAG parse times too much.

Example:
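
The original listing is not reproduced here. A minimal sketch of moving work out of top-level code, using the standard statistics module as a stand-in for a heavy library such as pandas; the function name and the commented-out expensive_lookup are hypothetical.

```python
# Avoid at module level, because it runs on every parse of the DAG file:
#     import statistics
#     BASELINE = statistics.mean(expensive_lookup())


def compute_baseline(values):
    """Task callable: the import and the work happen only when the task runs."""
    import statistics  # stand-in for a heavy library

    return statistics.mean(values)


# Nothing expensive executes at parse time. Inside a DAG, the function is
# only referenced, e.g. python_callable=compute_baseline.
```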

Improve Development and Testing

34. Implement “self-checks” (via Sensors or Deferrable Operators).
a. To ensure that tasks are functioning as expected, you can add checks to your DAG. For example, if a task pushes data to a BigQuery partition, you can add a check in the next task to verify that the partition was created and that the data is correct.

Example:

from datetime import datetime

from airflow import DAG
from airflow.operators import dummy_operator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.sensors.bigquery import BigQueryTablePartitionExistenceSensor

# The DAG arguments reuse the earlier examples; adjust them to your own DAG.
with DAG(
    'dag_id_v1_0_0',
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # ------------------------------------------------------------
    # Transform source data and transfer to partitioned table
    # ------------------------------------------------------------
    create_or_replace_partitioned_table_job = BigQueryInsertJobOperator(
        task_id="create_or_replace_comments_partitioned_query_job",
        configuration={
            "query": {
                "query": 'sql/create_or_replace_comments_partitioned.sql',
                "useLegacySql": False,
            }
        },
        location="US",
    )

    create_or_replace_partitioned_table_job_error = dummy_operator.DummyOperator(
        task_id="create_or_replace_partitioned_table_job_error",
        trigger_rule="one_failed",
    )

    create_or_replace_partitioned_table_job_ok = dummy_operator.DummyOperator(
        task_id="create_or_replace_partitioned_table_job_ok",
        trigger_rule="one_success",
    )

    # ------------------------------------------------------------
    # Determine if today's partition exists in comments_partitioned
    # ------------------------------------------------------------
    # project_id and dataset are assumed to be available in the template context.
    table_partition_sensor = BigQueryTablePartitionExistenceSensor(
        project_id="{{ project_id }}",
        task_id="bq_check_table_partition",
        dataset_id="{{ dataset }}",
        table_id="comments_partitioned",
        partition_id="{{ ds_nodash }}",
        mode="reschedule",
        poke_interval=60,
        timeout=60 * 5,
    )

    create_or_replace_partitioned_table_job >> [
        create_or_replace_partitioned_table_job_error,
        create_or_replace_partitioned_table_job_ok,
    ]
    create_or_replace_partitioned_table_job_ok >> table_partition_sensor

35. Look for opportunities to dynamically generate similar tasks/task groups/DAGs via Python code.
a. This can simplify and standardize the development process for DAGs.

Example:

36. Implement unit testing for your DAGs.

Example:
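
The original listing is not reproduced here. One layer of DAG testing that needs no running Airflow environment is testing the Python callables behind your tasks. The sketch below uses the standard unittest module; build_partition_id is a hypothetical callable. Import-level checks on whole DAG files, for example through Airflow's DagBag, are a complementary layer not shown here.

```python
import unittest
from datetime import date


def build_partition_id(logical_date: date) -> str:
    """Hypothetical task callable: format a date as a YYYYMMDD partition ID."""
    return logical_date.strftime("%Y%m%d")


class BuildPartitionIdTest(unittest.TestCase):
    def test_formats_date_without_dashes(self):
        self.assertEqual(build_partition_id(date(2022, 1, 1)), "20220101")

    def test_same_input_gives_same_output(self):
        day = date(2023, 12, 31)
        self.assertEqual(build_partition_id(day), build_partition_id(day))
```

Tests like these can be run with `python -m unittest` in CI before DAG files are deployed.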

37. Perform local development via the Composer Local Development CLI Tool.
a. Composer Local Development CLI tool streamlines Apache Airflow DAG development for Cloud Composer 2 by running an Airflow environment locally. This local Airflow environment uses an image of a specific Cloud Composer version.

38. If possible, keep a staging Cloud Composer environment to fully test the complete DAG run before deploying to production.
a. Parameterize your DAG to change the variables, e.g., the output path of Google Cloud Storage operation or the database used to read the configuration. Do not hard code values inside the DAG and then change them manually according to the environment.
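
A minimal sketch of environment-based parameterization. The DEPLOY_ENV variable, bucket names, and dataset names are placeholders assumed for illustration; the same DAG file is deployed unchanged to both environments.

```python
import os

# Hypothetical per-environment settings.
SETTINGS = {
    "staging": {
        "output_path": "gs://my-staging-bucket/output/",
        "dataset": "analytics_staging",
    },
    "production": {
        "output_path": "gs://my-prod-bucket/output/",
        "dataset": "analytics",
    },
}


def load_settings(environment: str) -> dict:
    """Return the settings for one environment, failing loudly on typos."""
    try:
        return SETTINGS[environment]
    except KeyError:
        raise ValueError(f"Unknown environment: {environment!r}") from None


# DEPLOY_ENV is an assumed variable set on each Composer environment.
settings = load_settings(os.environ.get("DEPLOY_ENV", "staging"))
```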

39. Use a Python linting tool such as Pylint or Flake8 for standardized code.

40. Use a Python formatting tool such as Black or YAPF for standardized code.

Next Steps

In summary, this blog post provides a comprehensive checklist of best practices for developing Airflow DAGs for use in Google Cloud Composer. By following these best practices, developers can help ensure that Cloud Composer is working optimally and that their DAGs are well-organized and easy to manage.

For more information about Cloud Composer, check out the following related blog posts and documentation pages:
