Troubleshooting DAGs (workflows)

This page provides troubleshooting steps and information for common workflow issues.

Troubleshooting workflow

To begin troubleshooting:

  1. Check the Airflow logs.
  2. Review the Stackdriver logs.
  3. In the GCP Console, check for errors on the pages for the GCP components running your environment.
  4. In the Airflow web interface, check in the DAG's Graph View for failed task instances.

    Tip: To navigate through a large DAG to look for failed task instances, change the graph view orientation from LR to RL by overriding the web server's default dag_orientation configuration.

Debugging operator failures

To debug an operator failure:

  1. Check for task-specific errors.
  2. Check the Airflow logs.
  3. Review the Stackdriver logs.
  4. Check the operator-specific logs.
  5. Fix the errors.
  6. Upload the DAG to the dags/ folder.
  7. In the Airflow web interface, clear the past states for the DAG.
  8. Resume or run the DAG.

Common issues

The following sections describe symptoms and potential fixes for some common workflow issues.

Task fails without emitting logs

Logs are buffered. If a worker dies before the buffer flushes, logs are not emitted. A task that fails without emitting logs is an indication that the Airflow workers were restarted due to an out-of-memory (OOM) condition.

DAG execution is RAM-limited. Each task execution starts two Airflow processes: one for task execution and one for monitoring. Currently, each node can run up to 6 concurrent tasks (approximately 12 processes loaded with Airflow modules). Depending on the size of the DAG, more memory can be consumed.
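
If memory pressure comes from many tasks running in parallel on one node, one way to reduce it is to cap how many task instances a DAG runs at once. The following is a minimal sketch, assuming Airflow 1.10-style parameters as used by Cloud Composer 1; the DAG name, schedule, and limits are illustrative, not recommended values.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id='example_memory_friendly_dag',  # hypothetical DAG name
        start_date=datetime(2019, 1, 1),
        schedule_interval='@daily',
        concurrency=3,        # max task instances of this DAG running at once
        max_active_runs=1,    # only one DAG run active at a time
    )

    task = BashOperator(task_id='do_work', bash_command='echo work', dag=dag)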

Symptom

  1. In the GCP Console, go to the GKE workloads panel.
  2. If there are ‘airflow-worker’ pods that show ‘Evicted’, click each evicted pod and look for the message The node was low on resource: memory at the top of the window.

Fix

  1. Create a new Cloud Composer environment with a larger machine type than the current machine type.
  2. Ensure that the tasks in the DAG are idempotent and retriable.
  3. Configure task retries, as shown in the sketch after this list.
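
For example, retries can be configured through the DAG's default_args or on individual operators. The following is a minimal sketch, assuming Airflow 1.10-style imports; the DAG name, retry count, and delay are illustrative.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    default_args = {
        'retries': 2,                          # retry failed tasks twice
        'retry_delay': timedelta(minutes=5),   # wait 5 minutes between retries
    }

    dag = DAG(
        dag_id='example_retry_dag',            # hypothetical DAG name
        default_args=default_args,
        start_date=datetime(2019, 1, 1),
        schedule_interval='@daily',
    )

    def idempotent_work(**kwargs):
        # Write results to a deterministic location keyed by the execution date,
        # so re-running the task after a retry does not duplicate data.
        print('processing {}'.format(kwargs['ds']))

    task = PythonOperator(
        task_id='process_data',
        python_callable=idempotent_work,
        provide_context=True,  # needed in Airflow 1.10 to receive kwargs such as ds
        dag=dag,
    )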

DAG load import timeout

Symptom

  • Airflow web UI: At the top of the DAGs list page, a red alert box shows Broken DAG: [/path/to/dagfile] Timeout.
  • Stackdriver: The airflow-scheduler logs contain entries similar to:
    • “ERROR - Process timed out”
    • “ERROR - Failed to import: /path/to/dagfile”
    • “AirflowTaskTimeout: Timeout”

Fix

Override the core-dagbag_import_timeout Airflow configuration option to allow more time for DAG parsing.

DAG crashes the Airflow web server or causes it to return a 502 gateway timeout error

Web server failures can occur for a few reasons. If you are running composer-1.5.2 or later, check the airflow-webserver logs in Stackdriver Logging to debug the 502 gateway timeout error.

Heavyweight computation

Avoid running heavyweight computation at DAG parse time. Unlike the worker and scheduler nodes, whose machine types can be customized to have greater CPU and memory capacity, the web server uses a fixed machine type, which can lead to DAG parsing failures if the parse-time computation is too heavyweight.

Note that the web server has 2 vCPUs and 2 GB of memory. The default value for core-dagbag_import_timeout is 30 seconds. This timeout value defines the upper limit for how long Airflow spends loading a Python module in the dags/ folder.
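For example, queries or API calls made at the top level of a DAG file run every time the file is parsed, including on the web server; moving that work into a task callable defers it to execution time on the workers. The following is a minimal sketch; the DAG name and the helper it references are hypothetical.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Avoid: top-level calls like the following run on every DAG parse,
    # including on the resource-constrained web server.
    #   rows = run_expensive_query('SELECT ...')   # hypothetical helper

    def run_expensive_query_task(**kwargs):
        # Do the heavyweight work here instead; this code runs only on a worker
        # when the task executes, not when the DAG file is parsed.
        pass

    dag = DAG(
        dag_id='example_deferred_work_dag',  # hypothetical DAG name
        start_date=datetime(2019, 1, 1),
        schedule_interval='@daily',
    )

    query_task = PythonOperator(
        task_id='run_expensive_query',
        python_callable=run_expensive_query_task,
        dag=dag,
    )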

Incorrect permissions

The web server does not run under the same service account as the workers and scheduler. As such, the workers and scheduler might be able to access user-managed resources that the web server cannot access.

We recommend that you avoid accessing non-public resources during DAG parsing. Sometimes, this is unavoidable, and you will need to grant permissions to the web server's service account. The service account name is derived from your web server domain. For example, if the domain is foo-tp.appspot.com, the service account is foo-tp@appspot.gserviceaccount.com.

DAG errors

The web server runs on App Engine and is separate from your environment's GKE cluster. The web server parses the DAG definition files, and a 502 gateway timeout can occur if there are errors in the DAG. Airflow works normally without a functional web server, provided that the problematic DAG does not break any processes running in GKE. In this case, you can use gcloud composer environments run to retrieve details from your environment as a workaround while the web server is unavailable.

In other cases, you can run DAG parsing in GKE and look for DAGs that throw fatal Python exceptions or that time out (default 30 seconds). To troubleshoot, connect to a remote shell in an Airflow worker container and test for syntax errors. For more information, see Testing DAGs.
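
For example, from a shell inside an Airflow worker container you can load the DAG folder with Python and print any import errors. The following is a minimal sketch; the dags/ path shown is the typical mount point on Composer workers, so verify it in your environment.

    from airflow.models import DagBag

    # Parse all files in the DAG folder; import errors (syntax errors,
    # failed imports, parse timeouts) are collected instead of raising.
    dag_bag = DagBag(dag_folder='/home/airflow/gcs/dags', include_examples=False)

    for filename, error in dag_bag.import_errors.items():
        print('{}: {}'.format(filename, error))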
