Troubleshooting DAGs (workflows)

This page provides troubleshooting steps and information for common workflow issues.

Troubleshooting workflow

To begin troubleshooting:

  1. Check the Airflow logs.
  2. Review the logs in Google Cloud's operations suite.
  3. In the Cloud Console, check for errors on the pages for the Google Cloud components running your environment.
  4. In the Airflow web interface, check in the DAG's Graph View for failed task instances.

    Tip: To navigate through a large DAG to look for failed task instances, change the graph view orientation from LR to RL by overriding the web server's default dag_orientation configuration:

    Section   | Key             | Value
    webserver | dag_orientation | LR, TB, RL, or BT

Debugging operator failures

To debug an operator failure:

  1. Check for task-specific errors.
  2. Check the Airflow logs.
  3. Review the logs in Google Cloud's operations suite.
  4. Check the operator-specific logs.
  5. Fix the errors.
  6. Upload the fixed DAG to the dags/ folder in your environment's bucket.
  7. In the Airflow web interface, clear the past states for the DAG.
  8. Resume or run the DAG.

Common issues

The following sections describe symptoms and potential fixes for some common DAG issues.

Task fails without emitting logs

Logs are buffered. If a worker dies before the buffer flushes, logs are not emitted. A task that fails without emitting logs usually indicates that the Airflow workers were restarted because they ran out of memory (OOM).

DAG execution is RAM-limited. Each task execution involves two Airflow processes: one for task execution and one for monitoring. Each node can run up to 6 concurrent tasks (approximately 12 processes loaded with Airflow modules). More memory can be consumed, depending on the nature of the DAG.

Symptom

  1. In the Cloud Console, go to the Kubernetes Engine -> Workloads panel.

  2. If there are airflow-worker pods that show Evicted, click each evicted pod and look for the message The node was low on resource: memory at the top of the window.

Fix

To reduce the memory pressure on the nodes, use a machine type with more memory for your environment's nodes, or reduce the number of tasks that run concurrently on each worker.

DAG load import timeout

Symptom:

  • Airflow web UI: At the top of the DAGs list page, a red alert box shows Broken DAG: [/path/to/dagfile] Timeout.
  • Google Cloud's operations suite: The airflow-scheduler logs contain entries similar to:
    • “ERROR - Process timed out”
    • “ERROR - Failed to import: /path/to/dagfile”
    • “AirflowTaskTimeout: Timeout”

Fix:

Override the dagbag_import_timeout Airflow configuration option and allow more time for DAG parsing:

Section | Key                   | Value
core    | dagbag_import_timeout | New timeout value (in seconds)

Increased network traffic to and from the Airflow database

The amount of network traffic between your environment's GKE cluster and the Airflow database depends on the number of DAGs, the number of tasks in DAGs, and the way DAGs access data in the Airflow database. The following factors might influence the network usage:

  • Queries to the Airflow database. If your DAGs make many queries, they generate large amounts of traffic. Examples: checking the status of tasks before proceeding with other tasks, querying the XCom table, and dumping Airflow database content (see the sketch after this list).

  • Large number of tasks. The more tasks there are to schedule, the more network traffic is generated. This consideration applies both to the total number of tasks in your DAGs and to the scheduling frequency. When the Airflow scheduler schedules DAG runs, it queries the Airflow database and generates traffic.

  • The Airflow web interface generates network traffic because it queries the Airflow database. Intensive use of pages with graphs, tasks, and diagrams can generate large volumes of network traffic.
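
For example, reading an Airflow Variable inside a loop issues one query to the Airflow database per call, while reading it once and reusing the value does not. The following sketch illustrates the difference; the variable name and the process_item helper are hypothetical.

    from airflow.models import Variable

    def process_item(item_id, config):
        # Placeholder for the real per-item work.
        pass

    def process_items(item_ids):
        # Generates one query to the Airflow database per iteration.
        for item_id in item_ids:
            config = Variable.get("my_config")
            process_item(item_id, config)

    def process_items_with_less_traffic(item_ids):
        # Reads the value once, so only one query is made.
        config = Variable.get("my_config")
        for item_id in item_ids:
            process_item(item_id, config)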

DAG crashes the Airflow web server or causes it to return a 502 gateway timeout error

Web server failures can occur for several different reasons. Check the airflow-webserver logs in Cloud Logging to determine the cause of the 502 gateway timeout error.

Heavyweight computation

Avoid running heavyweight computation at DAG parse time. Unlike the worker and scheduler nodes, whose machine types can be customized to have greater CPU and memory capacity, the web server uses a fixed machine type, which can lead to DAG parsing failures if the parse-time computation is too heavyweight.

Note that the web server has 2 vCPUs and 2 GB of memory. The default value for core-dagbag_import_timeout is 30 seconds. This timeout value defines the upper limit for how long Airflow spends loading a Python module in the dags/ folder.
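
As an illustration of the difference, the following sketch keeps the expensive work inside the task callable, so it runs on a worker at execution time rather than every time the scheduler or web server parses the file. The load_large_dataset helper is hypothetical.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def load_large_dataset():
        # Placeholder for an expensive download or computation.
        return []

    # Avoid module-level calls like the following; they run on every parse
    # of this file by the scheduler and the web server:
    # rows = load_large_dataset()

    def process_data():
        # Runs only when the task executes on a worker, not at parse time.
        rows = load_large_dataset()
        print("Loaded {} rows".format(len(rows)))

    with DAG(
        dag_id="parse_time_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    ) as dag:
        PythonOperator(task_id="process_data", python_callable=process_data)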

Incorrect permissions

The web server does not run under the same service account as the workers and scheduler. As such, the workers and scheduler might be able to access user-managed resources that the web server cannot access.

We recommend that you avoid accessing non-public resources during DAG parsing. Sometimes, this is unavoidable, and you will need to grant permissions to the web server's service account. The service account name is derived from your web server domain. For example, if the domain is foo-tp.appspot.com, the service account is foo-tp@appspot.gserviceaccount.com.

DAG errors

The web server runs on App Engine and is separate from your environment's GKE cluster. The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG. Airflow continues to work normally without a functional web server, as long as the problematic DAG does not break any processes running in GKE. In this case, you can use gcloud composer environments run to retrieve details from your environment, and as a workaround if the web server becomes unavailable.

In other cases, you can run DAG parsing in GKE and look for DAGs that throw fatal Python exceptions or that time out (default 30 seconds). To troubleshoot, connect to a remote shell in an Airflow worker container and test for syntax errors. For more information, see Testing DAGs.
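
One way to do this is to load the DAG folder with Airflow's DagBag and print any import errors, for example from a Python shell inside an Airflow worker container. The sketch below assumes the usual Cloud Composer DAG location of /home/airflow/gcs/dags.

    from airflow.models import DagBag

    dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags", include_examples=False)

    for filename, stacktrace in dag_bag.import_errors.items():
        print("Failed to import {}:".format(filename))
        print(stacktrace)

    print("{} DAG files could not be imported".format(len(dag_bag.import_errors)))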

Lost connection to MySQL server during query exception is thrown during task execution or right after it

Lost connection to MySQL server during query exceptions often happen when the following conditions are met:

  • Your DAG uses PythonOperator or a custom operator.
  • Your DAG makes queries to the Airflow database.

If several queries are made from a callable function, tracebacks might incorrectly point to the self.refresh_from_db(lock_for_update=True) line in the Airflow code; this is the first database query after the task execution. The actual cause of the exception occurs earlier, when an SQLAlchemy session is not properly closed.

SQLAlchemy sessions are scoped to a thread, so a session created in a callable function can later be continued inside the Airflow code. If there are significant delays between queries within one session, the connection might already be closed by the MySQL or PostgreSQL server. The connection timeout in Cloud Composer environments is set to approximately 10 minutes.
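
The following sketch shows the problematic shape of such a callable: a single session stays open across a long-running step, so the next query on that session can fail once the server has dropped the connection. The query itself is only illustrative.

    import time

    from airflow import settings
    from airflow.models import TaskInstance

    def my_callable(**kwargs):
        session = settings.Session()
        # First query succeeds.
        session.query(TaskInstance).count()
        # Long-running work keeps the same session open well past the
        # roughly 10 minute connection timeout.
        time.sleep(15 * 60)
        # The next query on the same session can fail with
        # "Lost connection to MySQL server during query".
        session.query(TaskInstance).count()
        session.close()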

Fix:

  • Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and correctly closes the session at the end of the function.
  • Do not use a single long-running function. Instead, move all database queries into separate functions, each decorated with airflow.utils.db.provide_session. That way, sessions are automatically closed after query results are retrieved (see the sketch after this list).
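
A minimal sketch of this pattern, assuming Airflow 1.10-style imports; the DAG ID and helper names are hypothetical:

    from airflow.models import TaskInstance
    from airflow.utils.db import provide_session

    @provide_session
    def count_task_instances(dag_id, session=None):
        # provide_session injects a valid session and closes it when the
        # function returns.
        return session.query(TaskInstance).filter(
            TaskInstance.dag_id == dag_id).count()

    @provide_session
    def count_failed_task_instances(dag_id, session=None):
        # Each decorated helper uses its own short-lived session instead of
        # one long-lived session that spans the whole task.
        return session.query(TaskInstance).filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.state == "failed").count()

    def my_callable(**kwargs):
        print(count_task_instances("example_dag"))
        print(count_failed_task_instances("example_dag"))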