Data Stored in Cloud Storage


This page describes what data Cloud Composer stores for your environment in Cloud Storage.

When you create an environment, Cloud Composer creates a Cloud Storage bucket and associates the bucket with your environment. The name of the bucket is based on the environment region, name, and a random ID, for example us-central1-b1-6efannnn-bucket.

Cloud Composer expects the source code for your workflows (DAGs) and their dependencies in specific folders in Cloud Storage and uses Cloud Storage FUSE to map the folders to the Airflow components in your Cloud Composer environment.

Folders in the Cloud Storage bucket

Folder | Storage path | Mapped directory | Description
DAG | gs://bucket-name/dags | /home/airflow/gcs/dags | Stores the DAGs for your environment. Only the DAGs in this folder are scheduled for your environment.
Plugins | gs://bucket-name/plugins | /home/airflow/gcs/plugins | Stores your custom plugins, such as custom in-house Airflow operators, hooks, sensors, or interfaces.
Data | gs://bucket-name/data | /home/airflow/gcs/data | Stores the data that tasks produce and use. This folder is mounted on all worker nodes.
Logs | gs://bucket-name/logs | | Stores the Airflow logs for tasks. Logs are also available in the Airflow web interface and on the Logs tab in the Cloud Composer UI.
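
The mapping in the table can be expressed as a small path translation. The following is a minimal sketch, not part of Cloud Composer: the helper name to_mapped_path is hypothetical, and it covers only the three folders for which the table lists a mapped directory.

```python
from pathlib import PurePosixPath

# Mapped root and folder names taken from the table above.
MAPPED_ROOT = PurePosixPath("/home/airflow/gcs")
MAPPED_FOLDERS = ("dags", "plugins", "data")


def to_mapped_path(gcs_uri: str) -> str:
    """Translate gs://<bucket>/<folder>/<rest> into the mapped local path.

    Hypothetical helper for illustration only.
    """
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {gcs_uri}")
    # Drop the scheme, then split the bucket name from the object path.
    _bucket, _, object_path = gcs_uri[len(prefix):].partition("/")
    folder = object_path.split("/", 1)[0]
    if folder not in MAPPED_FOLDERS:
        raise ValueError(f"folder {folder!r} has no mapped directory in the table")
    return str(MAPPED_ROOT / object_path)


print(to_mapped_path("gs://bucket-name/dags/example_dag.py"))
```

A task that needs a file from the bucket can open the translated path directly on the component's file system instead of calling the Cloud Storage API.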

Capacity considerations

Data from the dags/, plugins/, and data/ folders is synced to Airflow schedulers and workers.

  • In Airflow 2, the content of the plugins/ folder is also synced to the Airflow webserver.

  • In Airflow 1, the content of the dags/ and plugins/ folders is synced to the Airflow webserver only if DAG Serialization is turned off. Otherwise, the synchronization is not performed.

The more data you put into these folders, the more space is occupied in the local storage of Airflow components. Saving too much data in dags/ and plugins/ can disrupt your operations and lead to issues such as:

  • A worker or a scheduler runs out of local storage and is evicted because of insufficient space on the local disk of the component.

  • Syncing files from dags/ and plugins/ folders to workers and schedulers takes a long time.

  • Syncing files from dags/ and plugins/ folders to workers and schedulers becomes impossible. For example, you store a 2 GB file in the dags/ folder, but the local disk of your component can only accommodate 1 GB. During synchronization, your component runs out of local storage and synchronization can't complete.
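
One way to catch the last case before uploading is to compare the size of the folders with the local disk space you expect a component to have. The following is a rough sketch under stated assumptions: the function names are hypothetical, and the capacity value is something you supply yourself, because this page does not state the disk size of any component.

```python
import os


def folder_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def fits_on_local_disk(roots, capacity_bytes: int) -> bool:
    """Return True if the combined size of roots is within capacity_bytes.

    capacity_bytes is an assumed budget chosen by the caller.
    """
    return sum(folder_size_bytes(root) for root in roots) <= capacity_bytes
```

For example, calling fits_on_local_disk(["dags", "plugins"], 1 * 1024**3) on a local checkout returns False when the two folders together exceed 1 GiB, which mirrors the 2 GB file scenario described above.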

DAGs and plugins

To avoid workflow failures, store your DAGs, plugins, and Python modules in the dags/ or plugins/ folders, even if your Python modules don't contain DAGs or plugins.

For example, suppose you use a DataFlowPythonOperator that references a Dataflow pipeline through its py_file argument. That py_file doesn't contain DAGs or plugins, but you should still store it in the dags/ or plugins/ folder.

Data

There are scenarios when certain files from the data/ folder are synced to a specific Airflow component. For example, when Cloud Composer attempts to read a given file for the first time during:

  • DAG parsing: When a file is read for the first time during DAG parsing, Cloud Composer synchronizes it to the scheduler that parses the DAG.

  • DAG execution: When a file is read for the first time during DAG execution, Cloud Composer synchronizes it to the worker running the execution.

Airflow components have limited local storage, so consider deleting downloaded files to free up disk space in your components. Note that local storage usage can also temporarily increase if you have concurrent tasks that download the same file to a single Airflow worker.
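
One possible pattern for cleaning up is to work on a scratch copy that is removed as soon as the task finishes with it. The following is a minimal sketch using only the Python standard library. The name temporary_copy is hypothetical, and the pattern is an illustration rather than a Cloud Composer feature.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_copy(source: str):
    """Yield a scratch copy of source and delete it when the block exits."""
    # Each call gets its own directory, so concurrent tasks do not collide.
    scratch_dir = tempfile.mkdtemp(prefix="task-scratch-")
    try:
        # shutil.copy returns the path of the newly created file.
        yield Path(shutil.copy(source, scratch_dir))
    finally:
        # Remove the scratch directory even if the task raised an error.
        shutil.rmtree(scratch_dir, ignore_errors=True)
```

A task could wrap its work in "with temporary_copy(path) as local_file:" so that the copy never outlives the task. Because each concurrent task still holds its own copy while it runs, the temporary increase in disk usage described above still applies.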

Logs

The logs/ folder is synced from Airflow workers to the bucket using the Cloud Storage API.

Cloud Storage API quota is calculated by the amount of data moved, so the number of Airflow tasks your system runs can increase your Cloud Storage API usage: the more tasks you run, the bigger your log files.

Webserver, dags/, plugins/, and data/ folders

  • Airflow 2 uses DAG serialization out of the box. The plugins/ folder is automatically synced to the Webserver so that plugins can be loaded by the Airflow UI. You can't turn off DAG serialization in Airflow 2.

  • In Airflow 1, DAG serialization is supported and is turned on by default in Cloud Composer.

    • When DAG serialization is turned on, the files from dags/ and plugins/ folders aren't synced to the Webserver.
    • When DAG serialization is turned off, the files from dags/ and plugins/ are synced to the Webserver.
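
The rules in this section can be summarized as a small decision function. This is only a restatement of the list above in code form. The function name and parameters are hypothetical and do not correspond to any Cloud Composer or Airflow API.

```python
def folders_synced_to_webserver(airflow_major: int, dag_serialization: bool = True) -> set:
    """Return the bucket folders synced to the Webserver, per the rules above."""
    if airflow_major == 2:
        # DAG serialization is always on in Airflow 2; only plugins/ is synced.
        return {"plugins"}
    if airflow_major == 1:
        # With serialization on, nothing is synced; with it off, both folders are.
        return set() if dag_serialization else {"dags", "plugins"}
    raise ValueError(f"unsupported Airflow major version: {airflow_major}")


print(folders_synced_to_webserver(1, dag_serialization=False))
```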

Data synchronization

When you modify DAGs or plugins in the Cloud Storage bucket, Cloud Composer synchronizes the data across all the nodes in the cluster.

Cloud Composer synchronizes the dags/ and plugins/ folders unidirectionally by copying them locally. Unidirectional syncing means that local changes in these folders are overwritten.

The data/ and logs/ folders synchronize bi-directionally by using Cloud Storage FUSE.
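
To illustrate what unidirectional syncing implies for dags/ and plugins/, the following sketch simulates it with two ordinary local directories. It is not how Cloud Composer implements synchronization: the function names are hypothetical, a temporary directory stands in for the bucket, and the sketch requires Python 3.8 or later for dirs_exist_ok.

```python
import shutil
import tempfile
from pathlib import Path


def sync_one_way(bucket_dir: Path, local_dir: Path) -> None:
    """Copy every file from bucket_dir into local_dir, overwriting local edits."""
    shutil.copytree(bucket_dir, local_dir, dirs_exist_ok=True)


def demo() -> str:
    """Show that a local edit is lost on the next one-way sync."""
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp) / "bucket" / "dags"  # stands in for gs://bucket-name/dags
        local = Path(tmp) / "local" / "dags"    # stands in for the mapped directory
        bucket.mkdir(parents=True)
        (bucket / "example_dag.py").write_text("version from bucket")

        sync_one_way(bucket, local)
        # A change made only on the component...
        (local / "example_dag.py").write_text("local edit")
        # ...is overwritten by the next sync from the bucket.
        sync_one_way(bucket, local)
        return (local / "example_dag.py").read_text()


print(demo())
```

The sketch only overwrites files that exist in the source directory. Whether files that exist only locally are removed is not described on this page, so the sketch leaves them alone.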

Differences between Cloud Composer 1 and Cloud Composer 2

Cloud Composer 1 and Cloud Composer 2 allocate different amounts of CPU and memory to the components that perform data synchronization.

Depending on your configuration, the speed of content synchronization for dags/, plugins/, and data/ folders between the Cloud Storage bucket and Airflow components might differ if you compare Cloud Composer 1 and Cloud Composer 2.

What's next