Handling KubernetesPodOperator Failures in Composer

Problem

When using the KubernetesPodOperator to run custom pods, pod failures can cause the DAGs that use those pods to fail.

Environment

  • Directed Acyclic Graph (DAG) using KubernetesPodOperator
  • Composer v1 and v2

Solution

  1. Write DAGs whose tasks are idempotent, so that if a pod occasionally fails and the corresponding Airflow task is marked as failed, the task can be safely retried.
  2. Configure the retries property on the operator so that the pod is retried before the task is declared failed.
  3. Handle failures of this task in a downstream task that is triggered only when the parent task fails – the child task should have its trigger_rule set to all_failed.
    • Consider logging warning messages in the child task.
    • This way the DAG execution can continue, and the whole DAG run can be marked as success if that is the desired operational outcome.
  4. If these tasks should not be marked as failed when a KubernetesPodOperator pod fails, consider creating a custom operator derived from KubernetesPodOperator that handles the exception differently (e.g., logs a warning and ignores it), or have the child task change the status of the parent task (not recommended, but an option suitable for some edge cases).

Cause

Custom pods used with the KubernetesPodOperator can fail for many reasons. How these failures are handled should be determined by the needs of the DAGs.