Troubleshooting environment updates and upgrades


This page provides troubleshooting information for problems that you might encounter while updating or upgrading Cloud Composer environments.

For troubleshooting information related to creating environments, see Troubleshooting environment creation.

When Cloud Composer environments are updated, most issues happen for the following reasons:

  • Service account permission problems
  • PyPI dependency issues
  • Size of the Airflow database

Insufficient permissions to update or upgrade an environment

If Cloud Composer cannot update or upgrade an environment because of insufficient permissions, it outputs the following error message:

ERROR: (gcloud.composer.environments.update) PERMISSION_DENIED: The caller does not have permission

Solution: Assign roles both to your account and to the service account of your environment, as described in Access control.

The service account of the environment has insufficient permissions

When creating a Cloud Composer environment, you specify a service account that runs the environment's GKE cluster nodes. If this service account does not have enough permissions for the requested operation, Cloud Composer outputs an error:

    UPDATE operation on this environment failed 3 minutes ago with the
    following error message:
    Composer Backend timed out. Currently running tasks are [stage:
    CP_COMPOSER_AGENT_RUNNING
    description: "No agent response published."
    response_timestamp {
      seconds: 1618203503
      nanos: 291000000
    }
    ].

Solution: Assign roles both to your account and to the service account of your environment, as described in Access control.

The size of the Airflow database is too big to perform the operation

A Cloud Composer upgrade operation might fail because the size of the Airflow database is too large.

If the size of the Airflow database is more than 16 GB, Cloud Composer outputs the following error:

Airflow database uses more than 16 GB. Please clean the database before upgrading.

Solution: Perform the Airflow database cleanup, as described in Airflow database maintenance.
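
If you want to confirm the current database size before retrying the upgrade, you can query it from within the environment. The following function is a minimal sketch rather than an official tool; it assumes the default airflow_db connection and a PostgreSQL-based Airflow database, the same assumptions as the DAGs shown later on this page.

import logging

from airflow.providers.postgres.hooks.postgres import PostgresHook


def log_airflow_db_size():
    # Query the total on-disk size of the Airflow metadata database.
    hook = PostgresHook(postgres_conn_id="airflow_db")
    size_pretty, size_bytes = hook.get_first(
        "SELECT pg_size_pretty(pg_database_size(current_database())),"
        " pg_database_size(current_database())"
    )
    logging.info("Airflow database size: %s", size_pretty)
    # Approximate check against the 16 GB limit from the error message above.
    if size_bytes > 16 * 1024**3:
        logging.warning("The Airflow database is still larger than 16 GB.")

You can run this function from a PythonOperator task in a manually triggered DAG.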

An upgrade to a new Cloud Composer version fails because of PyPI package conflicts

When you upgrade an environment with installed custom PyPI packages, you might encounter errors related to PyPI package conflicts. This might happen because the new Cloud Composer image contains newer versions of preinstalled packages that cause dependency conflicts with PyPI packages that you installed in your environment.

Solution:

  • To get detailed information about package conflicts, run an upgrade check.
  • Loosen version constraints for installed custom PyPI packages. For example, instead of specifying a version as ==1.0.1, specify it as >=1.0.1.
  • For more information about changing version requirements to resolve conflicting dependencies, see the pip documentation.

Lack of connectivity to DNS can cause problems while performing upgrades or updates

Such connectivity problems might result in log entries like the following:

WARNING - Compute Engine Metadata server unavailable attempt 1 of 5. Reason: [Errno -3] Temporary failure in name resolution Error

This warning usually means that there is no route to the DNS server. Make sure that the metadata.google.internal DNS name can be resolved to an IP address from within the cluster, Pods, and Services networks. Check that Private Google Access is turned on in the VPC network (in the host or service project) where your environment is created.
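
To check name resolution from inside the environment itself, you can run a small diagnostic task. The DAG below is an illustrative sketch (its dag_id and task are not part of Cloud Composer); it only attempts to resolve metadata.google.internal and logs the result.

import datetime
import logging
import socket

from airflow import DAG
from airflow.operators.python import PythonOperator


def check_metadata_dns():
    # Resolve the metadata server name from within the environment's network.
    try:
        addresses = socket.getaddrinfo("metadata.google.internal", 80)
        logging.info("Resolved to: %s", sorted({a[4][0] for a in addresses}))
    except socket.gaierror as error:
        logging.error("DNS resolution failed: %s", error)
        raise


with DAG(
    dag_id="check_metadata_dns_dag",
    start_date=datetime.datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    PythonOperator(task_id="check_metadata_dns", python_callable=check_metadata_dns)

If the task fails with a name resolution error, the environment's network configuration, not Airflow itself, is the most likely cause.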


Triggerer CPU exceeds the 1 vCPU limit

Cloud Composer 2 versions 2.4.4 and higher introduce a different triggerer resource allocation strategy to improve performance scaling. If you encounter an error related to triggerer CPU when performing an environment update, it means that your current triggerers are configured to use more than 1 vCPU per triggerer.

Solution:

Adjust your environment so that each triggerer uses at most 1 vCPU, then retry the update operation.

Inspect failed migration warnings

When upgrading Airflow to a later version, sometimes new constraints are applied to the Airflow database. If some rows violate these constraints, Airflow moves them to new tables. The Airflow UI displays a warning message until the moved data tables are renamed or dropped.

Solution:

You can use the following two DAGs to inspect the moved data and rename the tables.

The list_moved_tables_after_upgrade_dag DAG lists rows that were moved from every table where constraints could not be applied. Inspect the data and decide whether you want to keep it. To keep it, you need to manually fix the data in the Airflow database, for example, by adding the rows back with the correct data.

If you don't need the data or if you already fixed it, then you can run the rename_moved_tables_after_upgrade_dag DAG. This DAG renames the moved tables. The tables and their data are not deleted, so you can review the data at a later point.

"""
When upgrading Airflow to a newer version,
some data might not be migrated,
often because of constraint changes in the metadata database.
This file contains 2 DAGs:

1. 'list_moved_tables_after_upgrade_dag'
  Prints the rows which failed to be migrated.
2. 'rename_moved_tables_after_upgrade_dag'
  Renames the tables which contain the failed migrations. This will remove the
  warning message from Airflow.
"""

import datetime
import logging

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.settings import AIRFLOW_MOVED_TABLE_PREFIX


def get_moved_tables():
    # List tables whose names start with the prefix that Airflow uses for
    # data moved aside during a failed migration.
    hook = PostgresHook(postgres_conn_id="airflow_db")
    return hook.get_records(
        "SELECT schemaname, tablename FROM pg_catalog.pg_tables WHERE tablename"
        f" LIKE '{AIRFLOW_MOVED_TABLE_PREFIX}_%'"
    )


def list_moved_records():
    tables = get_moved_tables()
    if not tables:
        logging.info("No moved tables found")
        return

    hook = PostgresHook(postgres_conn_id="airflow_db")
    for schema, table in tables:
        df = hook.get_pandas_df(f"SELECT * FROM {schema}.{table}")
        logging.info(df.to_markdown())


def rename_moved_tables():
    tables = get_moved_tables()
    if not tables:
        return

    hook = PostgresHook(postgres_conn_id="airflow_db")
    for schema, table in tables:
        # Renaming the table removes the warning from the Airflow UI while
        # keeping the data available for later review.
        hook.run(f"ALTER TABLE {schema}.{table} RENAME TO _abandoned_{table}")


with DAG(
    dag_id="list_moved_tables_after_upgrade_dag",
    start_date=datetime.datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    t1 = PythonOperator(
        task_id="list_moved_records", python_callable=list_moved_records
    )

with DAG(
    dag_id="rename_moved_tables_after_upgrade_dag",
    start_date=datetime.datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    t1 = PythonOperator(
        task_id="rename_moved_tables", python_callable=rename_moved_tables
    )
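
To use these DAGs, upload them to the dags/ folder in your environment's bucket. Because both DAGs have schedule_interval set to None, they run only when you trigger them manually. Trigger list_moved_tables_after_upgrade_dag first and review its task logs, then trigger rename_moved_tables_after_upgrade_dag after you have inspected or fixed the data.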

What's next