Known issues

This page lists known Cloud Composer issues. Fixes for some of these issues are in progress and will be available in future versions.

Some issues affect older versions, and can be fixed by upgrading your environment.

Non-RFC 1918 address ranges are partially supported for Pods and Services

Cloud Composer depends on GKE to deliver support for non-RFC 1918 addresses for Pods and Services. Currently, only the following list of Non-RFC 1918 ranges is supported in Cloud Composer:

  • 100.64.0.0/10
  • 192.0.0.0/24
  • 192.0.2.0/24
  • 192.88.99.0/24
  • 198.18.0.0/15
  • 198.51.100.0/24
  • 203.0.113.0/24
  • 240.0.0.0/4
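
As a rough illustration, a planned Pod or Service range can be checked against this list before you create an environment. The sketch below uses only the Python standard library. The function name is hypothetical, and treating any sub-range of a listed range as supported is an assumption, not something this page states.

```python
import ipaddress

# Non-RFC 1918 ranges listed on this page as supported in Cloud Composer.
SUPPORTED_NON_RFC_1918_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "100.64.0.0/10",
        "192.0.0.0/24",
        "192.0.2.0/24",
        "192.88.99.0/24",
        "198.18.0.0/15",
        "198.51.100.0/24",
        "203.0.113.0/24",
        "240.0.0.0/4",
    )
]


def is_listed_non_rfc_1918_range(cidr: str) -> bool:
    """Return True if `cidr` lies entirely inside one of the listed ranges.

    Assumption: a sub-range of a listed range is treated as supported.
    RFC 1918 ranges (10.0.0.0/8 and so on) are not considered here.
    """
    network = ipaddress.ip_network(cidr)
    return any(network.subnet_of(allowed) for allowed in SUPPORTED_NON_RFC_1918_RANGES)
```

For example, `is_listed_non_rfc_1918_range("100.64.0.0/14")` evaluates to `True`, while `is_listed_non_rfc_1918_range("11.0.0.0/8")` evaluates to `False`.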

Airflow UI does not show task logs when DAG Serialization is on in Composer 1.10.2 and Composer 1.10.3

Enabling DAG serialization in environments using Composer versions 1.10.2 and 1.10.3 prevents logs from showing in the Airflow web server. Upgrade to version 1.10.4 (or later) to fix this issue.

GKE Workload Identity is not supported

This issue applies only to Cloud Composer 1 environments. Cloud Composer 2 environments use Workload Identity.

You cannot turn on Workload Identity for Cloud Composer environment clusters. As a result, you may see the WORKLOAD_IDENTITY_DISABLED finding in Security Command Center.

GKE Pod Security Policies are not supported

You cannot turn on GKE Pod Security Policies for Cloud Composer environment clusters. As a result, you may see the POD_SECURITY_POLICY_DISABLED finding in Security Command Center.

Environment labels added during an update are not fully propagated

Updated environment labels are not applied to Compute Engine VMs. As a workaround, those labels can be applied manually.

GKE upgrades in the context of the CVE-2020-14386 vulnerability

We are working on addressing the CVE-2020-14386 vulnerability for all Cloud Composer environments. As part of the fix, the GKE clusters of all existing Cloud Composer environments will be updated to a newer version.

Customers who decide to address the vulnerability immediately can upgrade the Composer GKE cluster by following these instructions with the following considerations:

Step 1. If you're running a Cloud Composer version earlier than 1.7.2, upgrade to a newer version of Cloud Composer. If you already have version 1.7.2 or later, go to the next step.

Step 2. Upgrade the GKE cluster (master and nodes) to the latest 1.15 patch version containing the fix for this vulnerability.

Airflow task logs are unavailable in the Airflow web server after upgrading from Airflow 1.9.0 to Airflow 1.10.x

Airflow 1.10.x introduced backwards-incompatible changes to the naming convention for log files. Zone information is now added to the log names for Airflow tasks.

Airflow 1.9.0 stores and expects the log names to be in the following format: BUCKET/logs/DAG/2020-03-30T10:29:06/1.log

Airflow 1.10.x stores and expects the log names to be in the following format: BUCKET/logs/DAG/2020-03-30T10:29:06+00:00/1.log

As a result, if you upgrade from Airflow 1.9.0 to Airflow 1.10.x and want to read the log for a task executed with Airflow 1.9.0, the Airflow web server shows the following error message: Unable to read remote log from BUCKET/logs/DAG/2020-03-30T10:29:06+00:00/1.log

Workaround: Rename the logs generated by Airflow 1.9.0 in the Cloud Storage bucket so that they use the Airflow 1.10.x format: BUCKET/logs/DAG/2020-03-30T10:29:06+00:00/1.log
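
The name mapping behind this workaround can be sketched as a small Python helper. It is a minimal sketch only: the function name is hypothetical, the `+00:00` offset is assumed from the example paths above, and the rename in Cloud Storage itself (for example with `gsutil mv`) is left out.

```python
import re

# A path segment that is a timestamp without a UTC offset,
# for example ".../2020-03-30T10:29:06/...".
_NAIVE_TIMESTAMP_SEGMENT = re.compile(
    r"(?<=/)(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?=/)"
)


def to_airflow_1_10_log_name(name: str, utc_offset: str = "+00:00") -> str:
    """Append the UTC offset to an Airflow 1.9.0-style timestamp segment.

    Names that already carry an offset do not match the pattern and are
    returned unchanged.
    """
    return _NAIVE_TIMESTAMP_SEGMENT.sub(
        lambda match: match.group(1) + utc_offset, name
    )
```

Applied to `BUCKET/logs/DAG/2020-03-30T10:29:06/1.log`, the helper returns `BUCKET/logs/DAG/2020-03-30T10:29:06+00:00/1.log`. Because already converted names pass through unchanged, running it twice over the same listing should be harmless.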

Cannot create Cloud Composer environments with the organization policy constraints/compute.disableSerialPortLogging enforced

Cloud Composer environment creation will fail if constraints/compute.disableSerialPortLogging is enforced on the target project.

Diagnosis

To determine if you're impacted by this issue, follow this procedure:

Go to the GKE menu in Google Cloud console.

Then, select your newly created cluster. Check for the following error:

Not all instances running in IGM after 123.45s.
Expect <number of desired instances in IGM>. Current errors:

Constraint constraints/compute.disableSerialPortLogging violated for
project <target project number>.

Workarounds:

  1. Disable the organization policy on the project where the Cloud Composer environment will be created.

    An organization policy can always be disabled at the project level even if the parent resources (organization or folder) have it enabled. See the Customizing policies for boolean constraints page for more details.

  2. Use exclusion filters

    Using an exclusion filter for serial port logs accomplishes the same goal as disabling the organization policy, as there will be serial console logs in Logging. For more details, see the Exclusion filters page.

Usage of Deployment Manager to manage GCP resources protected by VPC Service Controls

Composer uses Deployment Manager to create components of Cloud Composer environments.

In December 2020, you may have received information that you may need to perform additional VPC Service Controls configuration to be able to use Deployment Manager to manage resources protected by VPC Service Controls.

We would like to clarify that no action is required on your side if you are using Composer and you are not using Deployment Manager directly to manage GCP resources mentioned in the Deployment Manager's announcement.

Cannot delete an environment after its GKE cluster is deleted

If you delete your environment's cluster before the environment itself, then attempts to delete your environment result in the following error:

 Got error "" during CP_DEPLOYMENT_DELETING [Rerunning Task. ]

To delete an environment when its GKE cluster is already deleted:

  1. Open the Deployment Manager page in the Google Cloud console.

  2. Find all deployments marked with labels:

    • goog-composer-environment:<environment-name>
    • goog-composer-location:<environment-location>

    You should see two deployments that are marked with the described labels:

    • A deployment named <environment-location>-<environment-name-prefix>-<hash>-sd
    • A deployment named addons-<uuid>
  3. Manually delete resources that are still listed in these two deployments and exist in the project (for example, Pub/Sub topics and subscriptions). To do so:

    1. Select the deployments.

    2. Click Delete.

    3. Select the Delete 2 deployments and all resources created by them, such as VMs, load balancers and disks option and click Delete all.

    The deletion operation fails, but the leftover resources are deleted.

  4. Delete the deployments using one of these options:

    • In Google Cloud console, select both deployments again. Click Delete, then select the Delete 2 deployments, but keep resources created by them option.

    • Run a gcloud command to delete the deployments with the ABANDON policy:

      gcloud deployment-manager deployments delete addons-<uuid> \
          --delete-policy=ABANDON
      
      gcloud deployment-manager deployments delete <location>-<env-name-prefix>-<hash>-sd \
          --delete-policy=ABANDON
      
  5. Delete your Cloud Composer environment.

Deployment Manager displays information about an unsupported feature

You might see the following warning in the Deployment Manager tab:

The deployment uses actions, which are an unsupported feature. We recommend
that you avoid using actions.

For Deployment Manager deployments owned by Cloud Composer, you can ignore this warning.

Warnings about duplicate entries of 'echo' task belonging to the 'echo-airflow_monitoring' DAG

You might see the following entry in the Airflow logs:

in _query db.query(q) File "/opt/python3.6/lib/python3.6/site-packages/MySQLdb/
connections.py", line 280, in query _mysql.connection.query(self, query)
_mysql_exceptions.IntegrityError: (1062, "Duplicate entry
'echo-airflow_monitoring-2020-10-20 15:59:40.000000' for key 'PRIMARY'")

You can ignore these log entries, because this error doesn't impact Airflow DAG and task processing.

We are working on improving Cloud Composer to remove these warnings from Airflow logs.

Cloud Composer 1 environment creation fails when the compute.requireOsLogin policy is enabled

If the compute.requireOsLogin policy is set to true in your project, then Cloud Composer 1 environment creation operations fail.

To create Cloud Composer 1 environments, disable this policy in your project.

For more information about this organizational policy, see Organization policy constraints.

Cloud Composer environment creation fails when the compute.vmExternalIpAccess policy is disabled

Cloud Composer-owned GKE clusters configured in the Public IP mode require external connectivity for their VMs. Because of this, the compute.vmExternalIpAccess policy cannot forbid the creation of VMs with external IP addresses. For more information about this organizational policy, see Organization policy constraints.

Cloud Composer environment creation fails when the compute.vmCanIpForward policy is disabled

Cloud Composer 1 environments created in the non-VPC-Native (using alias IP) mode require this policy to allow the creation of VMs with the enabled "IP Forwarding" feature. For more information about this organizational policy, see Organization policy constraints.

First DAG run for an uploaded DAG file has several failed tasks

When you upload a DAG file, sometimes the first few tasks from the first DAG run for it fail with the Unable to read remote log... error. This problem happens because the DAG file is synchronized between your environment's bucket, Airflow workers, and Airflow schedulers of your environment. These synchronizations are done independently. If the scheduler gets the DAG file and schedules it to be executed by a worker, and if the worker does not have the DAG file yet, then the task execution fails.

As a workaround, Airflow 2 environments in Cloud Composer 1.17.0-preview.9 and later versions are configured to perform two retries for a failed task by default. If a task fails, it is retried twice at 5-minute intervals.

To use the workaround for this problem in Airflow 1, override the core-default_task_retries Airflow configuration option and set it to a number greater than or equal to 2.
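
As an alternative to the environment-wide override, retries can also be set per DAG through `default_args`. The snippet below is a hedged sketch of that approach, not the configuration Cloud Composer applies itself: `DEFAULT_ARGS` is a hypothetical name, and the values mirror the two retries at 5-minute intervals described above.

```python
from datetime import timedelta

# Hypothetical default arguments for a DAG. Pass the dict to the DAG
# constructor as DAG(..., default_args=DEFAULT_ARGS) so that every task
# in the DAG inherits these retry settings.
DEFAULT_ARGS = {
    # Retry a failed task twice before marking it as failed.
    "retries": 2,
    # Wait 5 minutes between attempts, which gives the DAG file time
    # to reach the worker.
    "retry_delay": timedelta(minutes=5),
}
```

Individual tasks can still override these values with their own `retries` and `retry_delay` arguments.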

Task fails with 'OSError: [Errno 5] Input/output error' in Airflow 1.10.15 or earlier versions

A bug in Airflow 1 versions causes tasks to be put to the Redis queue twice in some rare cases.

Sometimes this leads to a race condition on the log file and a subsequent task failure. Tasks fail with "OSError: [Errno 5] Input/output error" in Cloud Logging and with "Task is in the 'running' state which is not a valid state for execution." in the task attempt log.

This bug is fixed in Airflow 2. If you encounter this issue in Airflow 1 in a long-running task, increase the value of the [celery_broker_transport_options]visibility_timeout Airflow configuration option (the default value is 604800 for Composer 1.17.0, 21600 for older environments). For short-running tasks, consider adding additional retries to the affected tasks or migrating your environment to Airflow 2.

Dataproc/Dataflow operators fail with Negsignal.SIGSEGV

This is an intermittent issue of the grpcio library when it is used from a Celery worker. This issue affects Airflow 1.10.14 and later versions.

The workaround is to change the grpcio polling strategy by adding the following environment variable to your environment: GRPC_POLL_STRATEGY=epoll1. This workaround is already applied in Cloud Composer 1.17.1 and later versions.

Announcements about the removal of support for deprecated Beta APIs from GKE versions

Cloud Composer manages underlying Cloud Composer-owned GKE clusters. Unless you explicitly use such APIs in your DAGs and your code, you can ignore announcements about GKE API deprecations. Cloud Composer takes care of any migrations, if necessary.

GKE upgrades in the context of CVE-2021-25741 security issue

The GKE clusters of all existing Cloud Composer environments will be auto-upgraded to newer GKE versions with a fix for the issues described in CVE-2021-25741.

If you want to address this vulnerability immediately, upgrade the GKE cluster of your environment by following the instructions for upgrading a cluster:

  • If you have a Cloud Composer 1 environment and GKE version 1.18.x or earlier then upgrade to 1.18.20-gke.4501.

  • If you have a Cloud Composer 1 environment and GKE version 1.19.x then upgrade to 1.19.14-gke.301.

  • If you have a Cloud Composer 2 environment and GKE version 1.21.x then upgrade to 1.21.4-gke.301.
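
The three rules above can be expressed as a small lookup. This is a sketch for illustration: the function name and signature are hypothetical, and it does not check whether a cluster already runs a patch version at or above the target.

```python
from typing import Optional


def recommended_gke_version(composer_major: int, gke_version: str) -> Optional[str]:
    """Return the upgrade target from the list above, or None if no rule applies.

    `composer_major` is 1 or 2; `gke_version` is a string such as
    "1.19.9-gke.1900". Only the major and minor parts are inspected.
    """
    major, minor = (int(part) for part in gke_version.split(".")[:2])
    if composer_major == 1 and (major, minor) <= (1, 18):
        return "1.18.20-gke.4501"
    if composer_major == 1 and (major, minor) == (1, 19):
        return "1.19.14-gke.301"
    if composer_major == 2 and (major, minor) == (1, 21):
        return "1.21.4-gke.301"
    return None
```

A return value of `None` means only that this page lists no target for that combination.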

Cloud Composer should not be impacted by Apache Log4j 2 Vulnerability (CVE-2021-44228)

In response to Apache Log4j 2 Vulnerability (CVE-2021-44228), Cloud Composer has conducted a detailed investigation and we believe that Cloud Composer is not vulnerable to this exploit.

Cloud Composer 2: Airflow workers or schedulers might experience issues when accessing Cloud Storage buckets

In sporadic cases, when an Airflow worker or Airflow scheduler in a Cloud Composer 2 environment is restarted, it might malfunction and have problems accessing the content of the environment's Cloud Storage bucket.

In such a situation, you might see errors starting with: Transport endpoint is not connected in Airflow logs.

For example, error log for Airflow worker might look like this:

[Errno 107] Transport endpoint is not connected: '/home/airflow/gcs/logs/airflow_monitoring/echo/2022-01-11T22:50:48+00:00'

Solution:

  • Upgrade to Cloud Composer 2.0.26 or a more recent version.

Airflow UI might sometimes not re-load a plugin once it is changed

If a plugin consists of many files that import other modules, then the Airflow UI might not recognize that the plugin should be reloaded. In this case, trigger a restart of the Airflow web server. You can do that by adding an environment variable or by installing or uninstalling PyPI dependencies. You can also restart the Airflow web server.

Intermittent issues when communicating to Airflow Metadata Database

This known issue applies only to Cloud Composer 1.

Some older Cloud Composer 1 environments (1.16.3 or earlier) created before August 12, 2021 might experience transient issues related to communication to Airflow Metadata DBs.

If you experience this issue, you will see the following error message in the Airflow task logs:

"Can't connect to MySQL server on 'airflow-sqlproxy-service.default.svc.cluster.local' (104)"

The Cloud Composer team is working on resolving this problem. In the meantime, if you believe that you are highly impacted by this issue, you can do the following to eliminate it:

  1. In Google Cloud console, go to the Environment Configuration page of the impacted Cloud Composer environments.
  2. Follow the view cluster details link to navigate to the environment's underlying GKE cluster.
  3. Navigate to the Nodes tab and click default-pool in the Node Pools section.
  4. Click Edit at the top of the page.
  5. Change the image type to Container-Optimized OS with containerd and save the configuration.
  6. Once the change is submitted, your default-pool node pool will be reconfigured to use containerd as its container runtime. Some of your Airflow tasks might fail while the node pool is reconfiguring. If those tasks have retries configured, they will be re-run by Airflow once the operation on the node pool has completed.

The environment's cluster has workloads in the Unschedulable state

This known issue applies only to Cloud Composer 2.

In Cloud Composer 2, after an environment is created, several workloads in the environment's cluster remain in the Unschedulable state. This is a known issue. Unschedulable DaemonSet workloads named composer-gcsfuse and composer-fluentd do not affect your environment.

Error 504 when accessing the Airflow UI

You can get the 504 Gateway Timeout error when accessing the Airflow UI. This error can have several causes:

  • Transient communication issue. In this case, attempt to access the Airflow UI later.
  • (Cloud Composer 2 only) Connectivity issue. If Airflow UI is permanently unavailable, and timeout or 504 errors are generated, make sure that your environment can access *.composer.cloud.google.com. If you use Private Google Access and send traffic over private.googleapis.com Virtual IPs, or VPC Service Controls and send traffic over restricted.googleapis.com Virtual IPs, make sure that your Cloud DNS is configured also for *.composer.cloud.google.com domain names.
  • Unresponsive Airflow web server. If the error 504 persists, but you can still access the Airflow UI at certain times, then the Airflow web server might be unresponsive because it's overwhelmed. Attempt to increase the scale and performance parameters of the web server.

Error 502 when accessing Airflow UI

The error 502 Internal server exception indicates that Airflow UI cannot serve incoming requests. This error can have several causes:

  • Transient communication issue. Try to access Airflow UI later.

  • Failure to start the web server. In order to start, the web server requires configuration files to be synchronized first. Check web server logs for log entries that look similar to: GCS sync exited with 1: gsutil -m cp gs://<bucket-name>/airflow.cfg /home/airflow/gcs/airflow.cfg.tmp or GCS sync exited with 1: gsutil -m cp gs://<bucket-name>/env_var.json.cfg /home/airflow/gcs/env_var.json.tmp. If you see these errors, check if files mentioned in error messages are still present in the environment's bucket.

    In case of their accidental removal (for example, because a retention policy was configured), you can restore them:

    1. Set a new environment variable in your environment. You can use any variable name and value.

    2. Override an Airflow configuration option. You can use a non-existent Airflow configuration option.

Hovering over task instance in Tree view throws uncaught TypeError

In Airflow 2, the Tree view in the Airflow UI might sometimes not work properly when a non-default timezone is used. As a workaround for this issue, configure the timezone explicitly in the Airflow UI.

Airflow UI in Airflow 2.2.3 or earlier versions is vulnerable to CVE-2021-45229

As pointed out in CVE-2021-45229, the "Trigger DAG with config" screen was susceptible to XSS attacks through the origin query argument.

Recommendation: Upgrade to the latest Cloud Composer version that supports Airflow 2.2.5.

Workers in Airflow 2.3 require more memory than in previous Airflow versions

Symptom: In your Cloud Composer 2 environment with Airflow 2.3, all Airflow worker workloads in the environment's cluster are in the CrashLoopBackOff status and do not execute tasks. If you are impacted by this issue, you can also see OOMKilling warnings. This issue happens shortly after an environment is created or upgraded to Cloud Composer 2 and Airflow 2.3, or after changing CPU and memory parameters of Airflow workers.

Cause: Airflow worker memory requirements in Airflow 2.3 are 30% higher compared to workers in Airflow 2.2 or Airflow 2.1. If you use custom settings for Airflow worker CPU and memory, you might experience this issue. Cloud Composer 2 calculates worker_concurrency based on the CPU of Airflow workers. If Airflow workers in your environment do not have enough memory for the specified number of CPUs, then this issue occurs.

Solution:

  • If you create a new environment with Airflow 2.3 and use custom resource settings for Airflow workers, increase the amount of memory available to workers. As an alternative, you can override the [celery]worker_concurrency Airflow configuration option and set it to a lower value than the automatically calculated value. If your environment is impacted by this issue, then do not set [celery]worker_concurrency higher than 8 per 1GB of memory given to an Airflow worker.
  • If you upgrade from Airflow 2.1 or Airflow 2.2, then apply one of the proposed solutions, and then upgrade to Airflow 2.3.
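
The limit of 8 per 1 GB of worker memory translates into a simple upper bound for the [celery]worker_concurrency override. The helper below is a hypothetical sketch of that arithmetic. Rounding down and enforcing a floor of 1 are assumptions made for illustration.

```python
import math

# Upper bound from this page: 8 concurrent tasks per 1 GB of worker memory.
MAX_CONCURRENCY_PER_GB = 8


def max_worker_concurrency(worker_memory_gb: float) -> int:
    """Return the highest worker_concurrency value within the stated limit.

    The result is rounded down and never falls below 1, so that a worker
    can still run a task.
    """
    return max(1, math.floor(worker_memory_gb * MAX_CONCURRENCY_PER_GB))
```

For a worker with 2 GB of memory, this gives an upper bound of 16. Any value you set should also stay below the automatically calculated one, as described above.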

DAG triggering via private networks using Cloud Functions

Triggering DAGs with Cloud Functions via private networks with the use of VPC Connector is not supported by Cloud Composer.

Recommendation: Use Cloud Functions to publish messages on Pub/Sub. Such events can actuate Pub/Sub Sensors to trigger Airflow DAGs or implement an approach based on deferrable operators.

Problem with gcloud composer commands in 410.0.0 version

In the 410.0.0 version of gcloud, the following Cloud Composer commands:

  • gcloud composer environments run
  • gcloud composer environments list-packages

return a non-zero error code and display this error message:

  (ERROR: gcloud crashed (TypeError): 'NoneType' object is not callable)

This behaviour occurs in addition to regular output produced by the gcloud commands and doesn't impact their functionality.

If this issue doesn't affect your operations, you can continue using the 410.0.0 version and ignore the incorrect error message. If you need to use the 410.0.0 version and you call gcloud commands programmatically, implement additional logic to ignore the non-zero error code and the error stack trace in the output. You can also check the Solution section for other workarounds.
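
One possible shape for that additional logic is sketched below. The function names are hypothetical. Looking for the crash message in the combined standard output and standard error is an assumption, because this page does not say which stream carries it.

```python
import subprocess

# Error text quoted on this page for gcloud 410.0.0.
KNOWN_CRASH_MARKER = (
    "ERROR: gcloud crashed (TypeError): 'NoneType' object is not callable"
)


def is_spurious_failure(returncode: int, output: str) -> bool:
    """Return True if a non-zero exit looks like the known 410.0.0 crash."""
    return returncode != 0 and KNOWN_CRASH_MARKER in output


def run_gcloud(args):
    """Run a gcloud command and tolerate only the known spurious crash.

    Any other non-zero exit still raises CalledProcessError.
    """
    result = subprocess.run(["gcloud", *args], capture_output=True, text=True)
    combined = result.stdout + result.stderr
    if result.returncode != 0 and not is_spurious_failure(result.returncode, combined):
        raise subprocess.CalledProcessError(
            result.returncode, result.args, result.stdout, result.stderr
        )
    return result.stdout
```

A wrapper like this also hides a real failure that happens to print the same crash message, so it is best limited to the two affected commands and removed once you move off the 410.0.0 version.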

Solution
