This page describes what data Cloud Composer stores for your environment in Cloud Storage.
When you create an environment, Cloud Composer creates a Cloud Storage bucket
and associates the bucket with your environment. The name of the bucket is
based on the environment region, name, and a random ID, such as
`us-central1-b1-6efannnn-bucket`.
Cloud Composer expects the source code for your workflows (DAGs) and their dependencies in specific folders in Cloud Storage and uses Cloud Storage FUSE to map the folders to the Airflow components in your Cloud Composer environment.
Folders in the Cloud Storage bucket
| Folder | Storage path | Mapped directory | Description |
|---|---|---|---|
| DAGs | `gs://bucket-name/dags` | `/home/airflow/gcs/dags` | Stores the DAGs for your environment. Only the DAGs in this folder are scheduled for your environment. |
| Plugins | `gs://bucket-name/plugins` | `/home/airflow/gcs/plugins` | Stores your custom plugins, such as custom in-house Airflow operators, hooks, sensors, or interfaces. |
| Data | `gs://bucket-name/data` | `/home/airflow/gcs/data` | Stores the data that tasks produce and use. This folder is mounted on all worker nodes. |
| Logs | `gs://bucket-name/logs` | | Stores the Airflow logs for tasks. Logs are also available in the Airflow web interface and in the Logs tab of the Cloud Composer UI. |
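The folder-to-mount mapping in the table above can be sketched as a small helper. The bucket name and the helper itself are illustrative, not an API that Cloud Composer provides:

```python
# Sketch: map a folder in the environment's bucket to the directory where
# Cloud Storage FUSE mounts it on Airflow components.
# The bucket name below is a made-up example.
BUCKET = "us-central1-example-environment-0123abcd-bucket"

# Folders that Cloud Composer mounts on components (per the table above).
MOUNT_ROOT = "/home/airflow/gcs"
FOLDERS = ("dags", "plugins", "data")

def mapped_directory(storage_path: str) -> str:
    """Translate gs://<bucket>/<folder>/... to its mounted local path."""
    prefix = f"gs://{BUCKET}/"
    if not storage_path.startswith(prefix):
        raise ValueError(f"not in the environment bucket: {storage_path}")
    folder, _, rest = storage_path[len(prefix):].partition("/")
    if folder not in FOLDERS:
        raise ValueError(f"folder {folder!r} is not mounted on components")
    return f"{MOUNT_ROOT}/{folder}/{rest}" if rest else f"{MOUNT_ROOT}/{folder}"

print(mapped_directory(f"gs://{BUCKET}/dags/my_dag.py"))
# /home/airflow/gcs/dags/my_dag.py
```

A DAG running in the environment can open `/home/airflow/gcs/dags/my_dag.py` directly, because FUSE exposes the bucket object at that path.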
Capacity considerations
Data from the `dags/`, `plugins/`, and `data/` folders is synced to the Airflow
schedulers and workers.

In Airflow 2, the content of the `plugins/` folder is also synced to the
Airflow web server. In Airflow 1, the content of the `dags/` and `plugins/`
folders is synced to the Airflow web server only if DAG serialization is
turned off. Otherwise, the synchronization is not performed.
The more data you put into these folders, the more space is occupied in the
local storage of Airflow components. Saving too much data in `dags/` and
`plugins/` can disrupt your operations and lead to issues such as:

- A worker or a scheduler runs out of local storage and is evicted because of
  insufficient space on the local disk of the component.
- Syncing files from the `dags/` and `plugins/` folders to workers and
  schedulers takes a long time.
- Syncing files from the `dags/` and `plugins/` folders to workers and
  schedulers becomes impossible. For example, you store a 100 GB file in the
  `dags/` folder, but the local disk of your component can only accommodate
  50 GB. During synchronization, your component runs out of local storage and
  synchronization can't complete.
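The last failure mode is simple arithmetic, sketched below with hypothetical numbers; Cloud Composer doesn't expose a check like this:

```python
# Sketch: why a 100 GB file in dags/ can't sync to a 50 GB local disk.
# All sizes are hypothetical.
GB = 1024 ** 3

def can_sync(folder_object_sizes: list[int], local_disk_bytes: int) -> bool:
    """Return True if all of the folder's contents fit on the local disk."""
    return sum(folder_object_sizes) <= local_disk_bytes

dags_folder = [100 * GB]   # one 100 GB file stored in dags/
worker_disk = 50 * GB      # the worker's local storage
print(can_sync(dags_folder, worker_disk))  # False: sync can't complete
```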
DAGs and plugins
To avoid workflow failures, store your DAGs, plugins, and Python modules in the
`dags/` or `plugins/` folders, even if your Python modules don't contain DAGs
or plugins.

For example, suppose you use a `DataFlowPythonOperator` that references a
`py_file` Dataflow pipeline. That `py_file` doesn't contain DAGs or plugins,
but you should still store it in the `dags/` or `plugins/` folder.
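A minimal sketch of a placement check along these lines; the helper and the file paths are illustrative, not part of Cloud Composer or Airflow:

```python
# Sketch: verify that a file referenced by a DAG (for example, the py_file
# passed to a Dataflow operator) lives under dags/ or plugins/, so that
# Cloud Composer syncs it to the workers. Paths are illustrative.
from pathlib import PurePosixPath

SYNCED_DIRS = (
    PurePosixPath("/home/airflow/gcs/dags"),
    PurePosixPath("/home/airflow/gcs/plugins"),
)

def is_synced_to_workers(path: str) -> bool:
    """True if the path is inside a folder that is synced to Airflow workers."""
    p = PurePosixPath(path)
    return any(d in p.parents for d in SYNCED_DIRS)

print(is_synced_to_workers("/home/airflow/gcs/dags/pipelines/wordcount.py"))  # True
print(is_synced_to_workers("/tmp/wordcount.py"))  # False: workers never see it
```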
Data
In general, the contents of the `data/` folder are not synchronized to Airflow
components by default. There are scenarios when certain files from the `data/`
folder are synced to a specific Airflow component, such as when Cloud Composer
reads a given file for the first time during:

- DAG parsing: when a file is read for the first time during DAG parsing,
  Cloud Composer synchronizes it to the scheduler that parses the DAG.
- DAG execution: when a file is read for the first time during DAG execution,
  Cloud Composer synchronizes it to the worker running the execution.

Airflow components have limited local storage, so consider deleting downloaded
files to free up disk space in your components. Note that local storage usage
can also temporarily go up if you have concurrent tasks that download the same
file to a single Airflow worker.
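The cleanup suggested above can be sketched as a read-then-delete helper. The file name is illustrative, and a temporary directory stands in for `/home/airflow/gcs/data`:

```python
# Sketch: read a file that was synced to a component's local data/ directory,
# then delete the local copy to free disk space on the component.
# A temporary directory stands in for /home/airflow/gcs/data here.
import tempfile
from pathlib import Path

def read_then_free(path: Path) -> bytes:
    """Read a synced file, then remove the local copy to reclaim disk."""
    content = path.read_bytes()
    path.unlink()  # frees local storage; the object remains in the bucket
    return content

with tempfile.TemporaryDirectory() as data_dir:
    f = Path(data_dir) / "lookup_table.csv"   # illustrative file name
    f.write_bytes(b"id,value\n1,foo\n")
    data = read_then_free(f)
    print(len(data), f.exists())  # 15 False
```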
Logs
The logs/
folder is synced from Airflow workers
to the bucket using the Cloud Storage API.
Cloud Storage API quota is calculated by the amount of data moved, so the number of Airflow tasks your system runs can increase your Cloud Storage API usage: the more tasks you run, the bigger your log files.
Webserver, dags/, plugins/, and data/ folders

Airflow 2 uses DAG serialization out of the box. The `plugins/` folder is
automatically synced to the web server so that plugins can be loaded by the
Airflow UI. You can't turn off DAG serialization in Airflow 2.

In Airflow 1, DAG serialization is supported and is turned on by default in
Cloud Composer.

- When DAG serialization is turned on, the files from the `dags/` and
  `plugins/` folders aren't synced to the web server.
- When DAG serialization is turned off, the files from the `dags/` and
  `plugins/` folders are synced to the web server.

In both Airflow 1 and Airflow 2, the contents of the `data/` folder aren't
synced to the web server.
Data synchronization
When you modify DAGs or plugins in the Cloud Storage bucket, Cloud Composer
synchronizes the data across all the nodes in the cluster.

Cloud Composer synchronizes the `dags/` and `plugins/` folders
unidirectionally by copying locally. Unidirectional syncing means that local
changes in these folders are overwritten.

The `data/` and `logs/` folders synchronize bidirectionally by using
Cloud Storage FUSE.
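A minimal sketch of the difference between the two sync modes, using plain dicts to stand in for folder contents. In reality `data/` and `logs/` are FUSE mounts rather than copied snapshots, so this is only a simplified model:

```python
# Sketch: unidirectional sync (dags/, plugins/) vs. bidirectional sync
# (data/, logs/), modeled with dicts mapping file name -> content.
def sync_unidirectional(bucket: dict, local: dict) -> None:
    """The bucket is the source of truth; local changes are overwritten."""
    local.clear()
    local.update(bucket)

def sync_bidirectional(bucket: dict, local: dict) -> None:
    """Changes propagate both ways (as with a Cloud Storage FUSE mount)."""
    merged = {**bucket, **local}
    bucket.update(merged)
    local.update(merged)

bucket_dags = {"my_dag.py": "v2"}
local_dags = {"my_dag.py": "v1", "scratch.py": "local-only edit"}
sync_unidirectional(bucket_dags, local_dags)
print(local_dags)  # {'my_dag.py': 'v2'} -- local-only changes are lost

bucket_data = {"report.csv": "rows"}
local_data = {"output.txt": "task result"}
sync_bidirectional(bucket_data, local_data)
print(sorted(bucket_data))  # ['output.txt', 'report.csv'] -- both sides merged
```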
Differences between Cloud Composer 1 and Cloud Composer 2
Cloud Composer 1 and Cloud Composer 2 allocate different amounts of CPU and memory to the components that perform data synchronization.
Depending on your configuration, the speed of content synchronization for the
`dags/`, `plugins/`, and `data/` folders between the Cloud Storage bucket and
Airflow components might differ between Cloud Composer 1 and Cloud Composer 2.