Troubleshooting KubernetesExecutor tasks

This page describes how to troubleshoot issues with tasks run by KubernetesExecutor and provides solutions for common issues.

General approach to troubleshooting KubernetesExecutor

To troubleshoot issues with a task executed by KubernetesExecutor, follow these steps in the listed order:

  1. Check the task's logs in the DAG UI or Airflow UI.

  2. Check the scheduler logs in the Google Cloud console:

    1. In the Google Cloud console, go to the Environments page.

    2. In the list of environments, click the name of your environment. The Environment details page opens.

    3. Go to the Logs tab and check the Airflow logs > Scheduler section.

    4. For the relevant time range, inspect the KubernetesExecutor worker pod that ran the task. If the pod no longer exists, skip this step. The pod's name has the airflow-k8s-worker prefix and contains the DAG or task name. Look for any reported issues, such as a failed task or the task being unschedulable. One way to query these logs programmatically is sketched after these steps.
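
The following Python sketch shows one way to pull scheduler entries that mention a KubernetesExecutor worker pod out of Cloud Logging, as an alternative to browsing the Logs tab. It uses the google-cloud-logging client; the PROJECT_ID, LOCATION, ENVIRONMENT_NAME, and timestamp values are placeholders, and the log and label names follow the usual Cloud Composer log layout, so verify them in the Logs Explorer for your environment.

# A minimal sketch, assuming the google-cloud-logging client library is
# installed and that scheduler logs use the usual Cloud Composer resource
# labels and log ID. Replace the placeholder values with your own.
from google.cloud import logging

client = logging.Client(project="PROJECT_ID")

# Scheduler entries from one environment that mention a KubernetesExecutor
# worker pod. The bare quoted term searches all payload fields.
log_filter = """
resource.type="cloud_composer_environment"
resource.labels.environment_name="ENVIRONMENT_NAME"
resource.labels.location="LOCATION"
log_id("airflow-scheduler")
"airflow-k8s-worker"
timestamp>="2024-05-01T00:00:00Z"
"""

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=50
):
    print(entry.timestamp, entry.payload)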

Common troubleshooting scenarios for KubernetesExecutor

This section lists common troubleshooting scenarios that you might encounter with KubernetesExecutor.

The task gets to the Running state, then fails during execution

Symptoms:

  • There are logs for the task in the Airflow UI and on the Logs tab in the Workers section.

Solution: The task logs indicate the problem.

The task instance gets to the Queued state, then is marked as UP_FOR_RETRY or FAILED after some time

Symptoms:

  • There are no logs for the task in the Airflow UI or on the Logs tab in the Workers section.
  • There are logs on the Logs tab in the Scheduler section with a message that the task is marked as UP_FOR_RETRY or FAILED.

Solution:

  • Inspect the scheduler logs for details about the issue.

Possible causes:

  • If the scheduler logs contain the Adopted tasks were still pending after... message followed by the printed task instance, check that CeleryKubernetesExecutor is enabled in your environment. The sketch below shows how tasks are routed to KubernetesExecutor when that executor is enabled.
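
As a hedged illustration of why the executor setting matters: with CeleryKubernetesExecutor, Airflow routes a task to KubernetesExecutor only if its queue matches the [celery_kubernetes_executor] kubernetes_queue option (kubernetes by default); all other tasks run on Celery workers. The DAG ID and callable below are illustrative only.

# A minimal DAG sketch, assuming CeleryKubernetesExecutor is enabled in the
# environment and that the kubernetes_queue option keeps its default value
# of "kubernetes". The DAG ID and task callable are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def print_hello():
    print("Hello from a KubernetesExecutor worker pod")


with DAG(
    dag_id="kubernetes_executor_routing_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    PythonOperator(
        task_id="run_in_kubernetes",
        python_callable=print_hello,
        queue="kubernetes",  # matching queue routes the task to KubernetesExecutor
    )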

The task instance gets to the Queued state and is immediately marked as UP_FOR_RETRY or FAILED

Symptoms:

  • There are no logs for the task in the Airflow UI or on the Logs tab in the Workers section.
  • The scheduler logs on the Logs tab in the Scheduler section contain the Pod creation failed with reason ... Failing task message, and a message that the task is marked as UP_FOR_RETRY or FAILED.

Solution:

  • Check the scheduler logs for the exact response and failure reason.

Possible cause:

If the error message is quantities must match the regular expression ..., then the issue is most likely caused by custom values set for the Kubernetes resource requests and limits of task worker pods.
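
This error usually means that a resource value doesn't use a valid Kubernetes quantity format (for example, 0.5GB instead of 512Mi). The following sketch assumes the custom values are set per task through executor_config with a pod_override, which is one common way to customize KubernetesExecutor worker pods; the DAG ID and task are illustrative.

# A sketch of per-task resource overrides, assuming they are set through
# executor_config / pod_override. The point is the quantity format: values
# such as "500m", "512Mi", or "1Gi" match the Kubernetes quantity regular
# expression, while values such as "0.5GB" or "512 MB" do not.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def do_work():
    print("Task with custom resource requests")


resource_override = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # Airflow's main task container
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "512Mi"},
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                )
            ]
        )
    )
}

with DAG(
    dag_id="kubernetes_executor_resources_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    PythonOperator(
        task_id="task_with_resources",
        python_callable=do_work,
        executor_config=resource_override,
    )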

KubernetesExecutor tasks fail without logs when a large number of tasks is executed

When your environment executes a large number of tasks with KubernetesExecutor or KubernetesPodOperator at the same time, Cloud Composer 3 doesn't accept new tasks until some of the existing tasks are finished. Extra tasks are marked as failed, and Airflow retries them later, if you define retries for the tasks (Airflow does this by default).

Symptom: Tasks executed with KubernetesExecutor or KubernetesPodOperator fail without task logs in the Airflow UI or DAG UI. In the scheduler logs, you can see error messages similar to the following:

pods \"airflow-k8s-worker-*\" is forbidden: exceeded quota: k8s-resources-quota,
requested: pods=1, used: pods=*, limited: pods=*","reason":"Forbidden"

Possible solutions:

  • Adjust the DAG run schedule so that tasks are distributed more evenly over time (see the sketch after this list).
  • Reduce the number of tasks by consolidating small tasks.
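
As one hedged way to spread the load, you can stagger cron schedules so that DAGs that previously all started at the top of the hour start at different minutes instead. The DAG IDs and minute offsets below are illustrative.

# A sketch of staggering cron schedules, assuming the DAGs previously all
# started at the same minute. DAG IDs and offsets are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

for dag_index, minute in enumerate((0, 15, 30, 45)):
    with DAG(
        dag_id=f"staggered_dag_{dag_index}",
        start_date=datetime(2024, 1, 1),
        schedule=f"{minute} * * * *",  # each DAG starts at a different minute
        catchup=False,
    ) as dag:
        EmptyOperator(task_id="placeholder_task")
    # Keep a module-level reference to each generated DAG, a common pattern
    # for dynamically generated DAGs.
    globals()[dag.dag_id] = dag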

Workaround:

If you prefer tasks to stay in the Scheduled state until your environment can execute them, you can define an Airflow pool with a limited number of slots in the Airflow UI and then associate all container-based tasks with this pool. We recommend setting the number of slots in the pool to 50 or less. Extra tasks stay in the Scheduled state until the pool has a free slot to execute them. If you use this workaround without applying the possible solutions, you can still experience a large queue of tasks in the Airflow pool.
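
The following sketch assumes a pool named kubernetes_tasks_pool with 50 or fewer slots already exists (created in the Airflow UI under Admin > Pools, or with the airflow pools set CLI command); the pool name, DAG ID, and task are illustrative. Tasks beyond the slot limit stay in the Scheduled state until a slot frees up.

# A sketch of the pool-based workaround, assuming a pool named
# "kubernetes_tasks_pool" already exists in the environment. The pool name,
# DAG ID, and task callable are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def container_task():
    print("Runs only when the pool has a free slot")


with DAG(
    dag_id="pooled_kubernetes_tasks_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    PythonOperator(
        task_id="pooled_task",
        python_callable=container_task,
        pool="kubernetes_tasks_pool",  # limits concurrent container-based tasks
    )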

What's next