This page describes what data Cloud Composer stores for your environment in Cloud Storage.
When you create an environment, Cloud Composer creates a
Cloud Storage bucket and associates the bucket
with your environment. The name of the bucket is based on the environment region, name,
and a random ID.
Cloud Composer stores the source code for your workflows (DAGs) and their dependencies in specific folders in Cloud Storage and uses Cloud Storage FUSE to map the folders to the Airflow instances in your Cloud Composer environment.
Folders in the Cloud Storage bucket
| Folder | Description | Storage path | Mapped directory |
| DAG | Stores the DAGs for your environment. Only the DAGs in this folder are scheduled for your environment. | | |
| Plugins | Stores your custom plugins, such as custom in-house Airflow operators, hooks, sensors, or interfaces. | | |
| Data | Stores the data that tasks produce and use. This folder is mounted on all worker nodes. | | |
| Logs | Stores the Airflow logs for tasks. Logs are also available in the Airflow web interface. | | |
DAGs and plugins: By default, Cloud Composer provisions
100 GB of capacity for your environment, the
dags/ folder, and the plugins/ folder.
To avoid workflow failures, store only the DAGs and plugins for your
environment in these folders.
Data and logs: The
data/ folder and
logs/ folder are not subject to
capacity limits. However, to avoid a webserver error, make sure that any data the
webserver needs to parse (not run) a DAG is available in the dags/ folder.
Otherwise, the webserver can't access the data or load the Airflow web interface.
Currently, you cannot change the storage capacity.
When you modify DAGs or plugins in the Cloud Storage bucket,
Cloud Composer synchronizes the data across all the nodes in the
cluster. Cloud Composer synchronizes the dags/ and plugins/ folders
uni-directionally by copying locally and synchronizes the data/ and logs/ folders
bi-directionally by using Cloud Storage FUSE.
Data is not synchronized to the webserver because of limited capacity and because the webserver parses but doesn't run DAGs in a Cloud Composer environment. The workers run the DAGs.
Place any custom Python libraries in a DAG's ZIP archive in a nested directory. Do not place libraries at the top level of the DAGs directory.
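The layout described above can be sketched with Python's standard `zipfile` module. This is a minimal illustration, not a Cloud Composer tool: the function name `package_dag_archive` and the file names `my_dag.py` and `libs/helpers.py` are hypothetical, chosen only to show a DAG module at the archive root and library code in a nested directory.

```python
import tempfile
import zipfile
from pathlib import Path


def package_dag_archive(source_dir: Path, archive_path: Path) -> list:
    """Zip every .py file under source_dir, preserving relative paths.

    DAG modules at the root of source_dir end up at the archive root;
    library code keeps its nested directory.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*.py")):
            archive.write(path, path.relative_to(source_dir).as_posix())
    # Re-open the archive to report what was actually stored.
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


# Hypothetical example layout: one DAG module plus a nested library.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "src"
    (source / "libs").mkdir(parents=True)
    (source / "my_dag.py").write_text("from airflow import DAG\n")
    (source / "libs" / "helpers.py").write_text("def helper():\n    return 1\n")
    archive_names = package_dag_archive(source, Path(tmp) / "my_dag.zip")

print(archive_names)
```

The resulting archive would then be uploaded to the DAGs folder of the environment's bucket; the upload step is omitted here because it depends on your tooling.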
When Airflow scans the
dags/ folder, it checks for DAGs only in Python modules at the top level of the DAGs folder and at the top level of a ZIP archive that is itself located in the top-level
dags/ folder. If Airflow encounters a Python module in a ZIP archive that does not contain both the
airflow and DAG substrings, Airflow stops processing the ZIP archive and returns only the DAGs found up to that point.
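The scanning rule can be approximated in a few lines of standard-library Python. The sketch below is an illustration of the behavior as described here, not Airflow's own code: `scan_zip_for_candidate_modules` and the example file names are hypothetical, and the check simply looks for the byte strings `airflow` and `DAG` in each top-level module.

```python
import io
import zipfile
from typing import List


def scan_zip_for_candidate_modules(zip_bytes: bytes) -> List[str]:
    """Return top-level modules that pass the substring check, in archive order.

    Scanning stops at the first top-level module that lacks either substring,
    mirroring the "stops processing the ZIP archive" behavior.
    """
    candidates = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            name = info.filename
            # Modules in nested directories are never considered as DAG files.
            if "/" in name or not name.endswith(".py"):
                continue
            content = archive.read(name)
            if not (b"airflow" in content and b"DAG" in content):
                break  # everything after this module is ignored
            candidates.append(name)
    return candidates


def build_example_zip() -> bytes:
    """Build an in-memory archive with hypothetical file names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("first_dag.py", "from airflow import DAG\n")
        archive.writestr("libs/helper.py", "def helper():\n    return 1\n")
        archive.writestr("notes.py", "print('no marker strings here')\n")
        archive.writestr("second_dag.py", "from airflow import DAG\n")
    return buffer.getvalue()


scanned = scan_zip_for_candidate_modules(build_example_zip())
print(scanned)
```

In this example `second_dag.py` is never reached because `notes.py` sits at the top level without the two substrings, which is why helper code belongs in a nested directory.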
For fault tolerance, do not define multiple DAG objects in the same Python module.
Do not define subDAGs as top-level objects.
In general, Airflow picks up DAG objects in the global namespace of a module in the
dags/ directory as top-level DAGs. Any subDAGs defined as top-level objects execute on their own schedules in addition to the schedules of other DAGs that embed the subDAGs.
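To see why a subDAG at module level is treated as a DAG in its own right, consider this simplified sketch of global-namespace collection. It assumes a stand-in `DAG` class so that it runs without Airflow installed; `collect_top_level_dags`, `build_subdag`, and the DAG IDs are hypothetical names used only for illustration.

```python
import types


class DAG:
    """Stand-in for airflow.DAG so the sketch runs without Airflow installed."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


MODULE_SOURCE = '''
def build_subdag(parent_id):
    # Created only when called, so it never sits in the module's globals.
    return DAG(parent_id + ".section")

parent = DAG("parent")
stray_subdag = DAG("parent.stray")  # top-level subDAG: collected as its own DAG
'''


def collect_top_level_dags(source):
    """Execute module source and list the IDs of DAG objects in its globals."""
    module = types.ModuleType("example_dag_module")
    module.__dict__["DAG"] = DAG
    exec(source, module.__dict__)
    return sorted(
        obj.dag_id for obj in vars(module).values() if isinstance(obj, DAG)
    )


found = collect_top_level_dags(MODULE_SOURCE)
print(found)
```

Building the subDAG inside a factory function, as `build_subdag` does, keeps it out of the module's global namespace, so only the parent is collected as a top-level DAG.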
Place files that are required at DAG parse time in the
dags/ directory, not in the data/ directory. The
data/ directory is not mounted in the webserver.
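One way to follow this guideline is to resolve parse-time files relative to the DAG module itself. The sketch below assumes a hypothetical `settings.json` stored next to the DAG file; `load_parse_time_config` is an illustrative helper, not part of Airflow or Cloud Composer.

```python
import json
import tempfile
from pathlib import Path


def load_parse_time_config(dag_file, name="settings.json"):
    """Read a JSON file that sits in the same directory as the DAG module."""
    config_path = Path(dag_file).resolve().parent / name
    return json.loads(config_path.read_text())


# Simulate a dags/ directory; in a real DAG module you would pass __file__.
with tempfile.TemporaryDirectory() as tmp:
    dags_dir = Path(tmp) / "dags"
    dags_dir.mkdir()
    (dags_dir / "settings.json").write_text(json.dumps({"schedule": "@daily"}))
    fake_dag_file = dags_dir / "my_dag.py"
    config = load_parse_time_config(fake_dag_file)

print(config)
```

Because the path is derived from the DAG file's own location, the same code works wherever the dags/ directory is available, including components that parse the DAG but do not run it.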