This page provides troubleshooting steps and information for common workflow issues.
To begin troubleshooting:
- Check the Airflow logs.
- Review logs in Google Cloud's operations suite.
- In the Cloud Console, check for errors on the pages for the Google Cloud components running your environment.
Tip: To navigate through a large DAG to look for failed task instances, change the graph view orientation from LR to RL by overriding the web server's default orientation with this Airflow configuration option:

| Section | Key | Value |
|-----------|-----------------|-------|
| webserver | dag_orientation | RL |
Debugging operator failures
To debug an operator failure:
- Check for task-specific errors.
- Check the Airflow logs.
- Review logs in Google Cloud's operations suite.
- Check the operator-specific logs.
- Fix the errors.
- Upload the fixed DAG to your environment.
- In the Airflow web interface, clear the past states for the DAG.
- Resume or run the DAG.
The following sections describe symptoms and potential fixes for some common DAG issues.
Task fails without emitting logs
Google Kubernetes Engine pods are subject to the Kubernetes Pod Lifecycle and pod eviction. Task spikes and co-scheduling of workers are the two most common causes for pod eviction in Cloud Composer.
Pod eviction can occur when a particular pod overuses resources of a node, relative to the configured resource consumption expectations for the node. For example, eviction might happen when several memory-heavy tasks run in a pod, and their combined load causes the node where this pod runs to exceed the memory consumption limit.
If an Airflow worker pod is evicted, all task instances running on that pod are interrupted, and later marked as failed by Airflow.
Logs are buffered. If a worker pod is evicted before the buffer flushes, logs are not emitted. Task failure without logs is an indication that the Airflow workers were restarted due to an out-of-memory (OOM) condition. Some logs might be present in Cloud Logging even though the Airflow logs were not emitted. For example, you can view the logs by selecting your environment in the Google Cloud Console, navigating to the Logs tab, and viewing the logs of individual workers under All logs -> Airflow logs -> Workers -> (individual worker).
DAG execution is RAM limited. Each task execution starts with two Airflow processes: task execution and monitoring. Each node can run up to 6 concurrent tasks (approximately 12 processes loaded with Airflow modules). More memory can be consumed, depending on the nature of the DAG.
In the Cloud Console, go to the Kubernetes Engine -> Workloads panel. If there are airflow-worker pods that show Evicted, click each evicted pod and look for the The node was low on resource: memory message at the top of the window.
To fix this issue:
- Create a new Cloud Composer environment with a larger machine type than the current machine type.
- Check logs from airflow-worker pods for possible eviction causes. For more information about fetching logs from individual pods, see Troubleshooting issues with deployed workloads.
- Ensure that the tasks in the DAG are idempotent and retriable.
- Configure task retries (see the sketch after this list).
DAG load import timeout
- Airflow web UI: At the top of the DAGs list page, a red alert box shows Broken DAG: [/path/to/dagfile] Timeout.
- Google Cloud's operations suite: The airflow-scheduler logs contain entries similar to:
- “ERROR - Process timed out”
- “ERROR - Failed to import: /path/to/dagfile”
- “AirflowTaskTimeout: Timeout”
To fix this issue, override the dagbag_import_timeout Airflow configuration option and allow more time for DAG parsing:

| Section | Key | Value |
|---------|-----------------------|-------------------|
| core | dagbag_import_timeout | New timeout value |
Increased network traffic to and from the Airflow database
The amount of network traffic between your environment's GKE cluster and the Airflow database depends on the number of DAGs, the number of tasks in DAGs, and the way DAGs access data in the Airflow database. The following factors might influence the network usage:
Queries to the Airflow database. If your DAGs do a lot of queries, they generate large amounts of traffic. Examples: checking the status of tasks before proceeding with other tasks, querying the XCom table, dumping Airflow database content (see the sketch after this list).
Large number of tasks. The more tasks there are to schedule, the more network traffic is generated. This consideration applies both to the total number of tasks in your DAGs and to the scheduling frequency. When the Airflow scheduler schedules DAG runs, it makes queries to the Airflow database and generates traffic.
The Airflow web interface generates network traffic because it makes queries to the Airflow database. Intensive use of pages with graphs, tasks, and diagrams can generate large volumes of network traffic.
DAG crashes the Airflow web server or causes it to return a 502 gateway timeout error
Web server failures can occur for several different reasons. Check the airflow-webserver logs in Google Cloud's operations suite to determine the cause of the 502 gateway timeout error.
Avoid running heavyweight computation at DAG parse time. Unlike the worker and scheduler nodes, whose machine types can be customized to have greater CPU and memory capacity, the web server uses a fixed machine type, which can lead to DAG parsing failures if the parse-time computation is too heavyweight.
Note that the web server has 2 vCPUs and 2 GB of memory.
The default value for core-dagbag_import_timeout is 30 seconds. This timeout value defines the upper limit for how long Airflow spends loading a Python module in the dags/ folder.
The web server does not run under the same service account as the workers and scheduler. As such, the workers and scheduler might be able to access user-managed resources that the web server cannot access.
We recommend that you avoid accessing non-public resources during DAG parsing. Sometimes, this is unavoidable, and you will need to grant permissions to the web server's service account. The service account name is derived from your web server domain. For example, if the domain is foo-tp.appspot.com, the service account is foo-tp@appspot.gserviceaccount.com.
The web server runs on App Engine and is separate from your environment's GKE cluster. The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG. Airflow works normally without a functional web server if the problematic DAG is not breaking any processes running in GKE. In this case, you can use gcloud composer environments run to retrieve details from your environment and as a workaround if the web server becomes unavailable.
In other cases, you can run DAG parsing in GKE and look for DAGs that throw fatal Python exceptions or that time out (default 30 seconds). To troubleshoot, connect to a remote shell in an Airflow worker container and test for syntax errors. For more information, see Testing DAGs.
Lost connection to MySQL server during query exception is thrown during task execution or right after it
Lost connection to MySQL server during query exceptions often happen when the following conditions are met:
- Your DAG uses PythonOperator or a custom operator.
- Your DAG makes queries to the Airflow database.
If several queries are made from a callable function, tracebacks might incorrectly point to the self.refresh_from_db(lock_for_update=True) line in the Airflow code; it is the first database query after the task execution. The actual cause of the exception happens before this, when an SQLAlchemy session is not properly closed.
SQLAlchemy sessions are scoped to a thread. A session created in a callable function can be later continued inside the Airflow code. If there are significant delays between queries within one session, the connection might already be closed by the MySQL or PostgreSQL server. The connection timeout in Cloud Composer environments is set to approximately 10 minutes.
To fix this issue:
- Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and correctly closes the session at the end of the function.
- Do not use a single long-running function. Instead, move all database queries to separate functions, so that there are multiple functions with the airflow.utils.db.provide_session decorator. In this case, sessions are automatically closed after retrieving query results (see the sketch after this list).