Data Stored in Cloud Storage

This page describes what data Cloud Composer stores for your environment in Cloud Storage.

When you create an environment, Cloud Composer creates a Cloud Storage bucket and associates the bucket with your environment. The bucket name is based on the environment's region and name, plus a random ID, for example us-central1-b1-6efannnn-bucket.

Cloud Composer stores the source code for your workflows (DAGs) and their dependencies in specific folders in Cloud Storage and uses Cloud Storage FUSE to map the folders to the Airflow instances in your Cloud Composer environment.
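For example, you can add a DAG to your environment by copying it into the dags/ folder of the bucket. The following is a minimal sketch that uses the google-cloud-storage Python client; the bucket name and file names are placeholders for your own values.

```python
# Minimal sketch: upload a local DAG file into the environment bucket's dags/
# folder. The bucket name and file names below are placeholders.
from google.cloud import storage

BUCKET_NAME = "us-central1-example-environment-0123abcd-bucket"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# The object lands at gs://BUCKET_NAME/dags/my_dag.py and becomes visible to
# Airflow at /home/airflow/gcs/dags/my_dag.py.
blob = bucket.blob("dags/my_dag.py")
blob.upload_from_filename("my_dag.py")
print(f"Uploaded gs://{BUCKET_NAME}/{blob.name}")
```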

Folders in the Cloud Storage bucket

| Folder | Description | Storage path | Mapped directory |
| --- | --- | --- | --- |
| dags/ | Stores the DAGs for your environment. Only the DAGs in this folder are scheduled for your environment. | gs://bucket-name/dags | /home/airflow/gcs/dags |
| plugins/ | Stores your custom plugins, such as custom in-house Airflow operators, hooks, sensors, or interfaces. | gs://bucket-name/plugins | /home/airflow/gcs/plugins |
| data/ | Stores the data that tasks produce and use. This folder is mounted on all worker nodes. | gs://bucket-name/data | /home/airflow/gcs/data |
| logs/ | Stores the Airflow logs for tasks. Logs are also available in the Airflow web interface. | gs://bucket-name/logs | /home/airflow/gcs/logs |
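Because the folders are mapped into the Airflow instances, tasks can read and write bucket objects through ordinary file paths. The following minimal sketch assumes Airflow 2 import paths; the DAG ID, schedule, and file names are illustrative.

```python
# Minimal sketch: a task writes to the mounted data/ folder, so the file also
# appears at gs://bucket-name/data/. Assumes Airflow 2 import paths.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

DATA_DIR = "/home/airflow/gcs/data"  # mapped to gs://bucket-name/data


def write_report():
    # Anything written here is synchronized back to the data/ folder in the bucket.
    with open(f"{DATA_DIR}/report.txt", "w") as f:
        f.write("task output\n")


with DAG(
    dag_id="write_to_data_folder",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="write_report", python_callable=write_report)
```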

Capacity considerations

DAGs and plugins: By default, Cloud Composer provisions 100 GB of capacity for your environment and for the dags/ and plugins/ folders.

To avoid a workflow failure, store your DAGs, plugins, and Python modules in the dags/ or plugins/ folders—even if your Python modules do not contain DAGs or plugins. For example, you should store the py_file that a DataFlowPythonOperator references in dags/ or plugins/.
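For instance, a Dataflow pipeline file stored in the dags/ folder can be referenced through the mapped path. This is a hedged sketch that assumes the Airflow 1.10 contrib import path for DataFlowPythonOperator; the DAG ID, pipeline file, and project ID are placeholders.

```python
# Hedged sketch: the Dataflow pipeline file is stored in the dags/ folder of the
# bucket and referenced through the mounted path. Assumes the Airflow 1.10
# contrib import; names and project ID are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(
    dag_id="launch_dataflow_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    DataFlowPythonOperator(
        task_id="launch_pipeline",
        # Stored at gs://bucket-name/dags/pipelines/wordcount.py, visible to
        # workers on the mounted path below.
        py_file="/home/airflow/gcs/dags/pipelines/wordcount.py",
        dataflow_default_options={"project": "your-project-id"},
    )
```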

Data and logs: The data/ folder and logs/ folder are not subject to capacity limits.

To avoid a webserver error, make sure that any data the webserver needs to parse a DAG (not to run it) is available in the dags/ folder. Otherwise, the webserver can't access the data or load the Airflow web interface.
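For example, a DAG that reads a configuration file at parse time (at module import, outside of any task) should read it from the dags/ folder so that the webserver can parse the DAG too. A minimal sketch, assuming Airflow 2 import paths and placeholder file and DAG names:

```python
# Minimal sketch: configuration that is read while the DAG file is parsed must
# live under dags/, because the webserver parses DAGs but can't see data/.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Parse-time read: keep this file at gs://bucket-name/dags/config/tables.json.
with open("/home/airflow/gcs/dags/config/tables.json") as f:
    TABLES = json.load(f)


def process(table):
    print(f"processing {table}")


with DAG(
    dag_id="parse_time_config",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for table in TABLES:
        PythonOperator(
            task_id=f"process_{table}",
            python_callable=process,
            op_args=[table],
        )

# Files that only tasks need at run time can stay in /home/airflow/gcs/data instead.
```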

Data synchronization

When you modify DAGs or plugins in the Cloud Storage bucket, Cloud Composer synchronizes the data across all the nodes in the cluster. Cloud Composer synchronizes the dags/ and plugins/ folders uni-directionally by copying them locally, and synchronizes the data/ and logs/ folders bi-directionally by using Cloud Storage FUSE.

The data/ folder is not synchronized to the webserver because of the webserver's limited capacity and because the webserver only parses DAGs in a Cloud Composer environment; it doesn't run them. The workers run the DAGs.

Guidelines

  1. Place any custom Python libraries in a DAG's ZIP archive in a nested directory. Do not place libraries at the top level of the DAGs directory.

    When Airflow scans the dags/ folder, Airflow only checks for DAGs in Python modules that are in the top level of the dags/ folder and in the top level of a ZIP archive that is also located in the top-level dags/ folder. If Airflow encounters a Python module in a ZIP archive that does not contain both the airflow and DAG substrings, Airflow stops processing the ZIP archive and returns only the DAGs found up to that point. For an example layout, see the ZIP packaging sketch after this list.

  2. For fault tolerance, do not define multiple DAG objects in the same Python module.

  3. Do not define SubDAGs as top-level objects.

    In general, Airflow picks up any DAG object in the global namespace of a module in the dags/ directory as a top-level DAG. SubDAGs that are defined as top-level objects execute on their own schedules, in addition to the schedules of the DAGs that embed them. For an example, see the SubDAG sketch after this list.

  4. Place files that are required at DAG parse time in the dags/ directory, not in the data/ directory. The data/ directory is not mounted on the webserver.
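ZIP packaging sketch for guideline 1: keep the DAG module at the top level of the archive and put helper libraries in a nested package. The archive layout and names below are illustrative, and the example assumes that Airflow makes modules inside the archive importable when it loads zipped DAGs.

```python
# Illustrative layout of my_dags.zip (names are placeholders):
#
#   my_dags.zip
#   ├── my_dag.py          # top level: scanned by Airflow for DAGs
#   └── helpers/           # nested directory: custom libraries
#       ├── __init__.py
#       └── transforms.py
#
# my_dag.py (top level of the archive), assuming Airflow 2 import paths:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Nested libraries inside the archive are imported from the DAG module.
from helpers.transforms import clean_rows


def run_transform():
    clean_rows([])


with DAG(
    dag_id="zipped_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=run_transform)
```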
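SubDAG sketch for guideline 3: build the SubDAG inside a factory function and attach it with SubDagOperator, so that the parent DAG is the only top-level DAG object in the module. This is a hedged sketch assuming Airflow 2 import paths; DAG and task IDs are placeholders.

```python
# Minimal sketch: the SubDAG is created inside a function, not at module top
# level, so Airflow only schedules the parent DAG. Assumes Airflow 2 imports.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.subdag import SubDagOperator

PARENT_DAG_ID = "parent_dag"
START_DATE = datetime(2024, 1, 1)


def make_subdag(parent_dag_id, child_id, start_date):
    # The returned DAG is embedded by SubDagOperator; it is not a top-level object.
    subdag = DAG(
        dag_id=f"{parent_dag_id}.{child_id}",
        start_date=start_date,
        schedule_interval=None,
    )
    DummyOperator(task_id="inner_task", dag=subdag)
    return subdag


with DAG(
    dag_id=PARENT_DAG_ID,
    start_date=START_DATE,
    schedule_interval=None,
    catchup=False,
) as parent:
    SubDagOperator(
        task_id="child",
        subdag=make_subdag(PARENT_DAG_ID, "child", START_DATE),
    )
```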
