Overview of Cloud Composer

This page provides an overview of a Cloud Composer environment and the Google Cloud Platform products used for an Apache Airflow deployment.

Cloud Composer is a managed workflow orchestration service that is built on Airflow. Similar to an on-premises deployment, Cloud Composer deploys multiple components to run Airflow. This page describes the GCP components, their functions, and how you run your workflows.

Also similar to on premises, Cloud Composer relies on certain configurations to successfully execute your workflows. Altering configurations that Cloud Composer relies on for connections or communications can have unintended consequences or break your Airflow deployment. This page identifies the important environment configurations.

Environments

Airflow is built on a microservice architecture. To deploy Airflow in a distributed setup, Cloud Composer provisions several GCP components, which are collectively known as a Cloud Composer environment.

Environments are a core concept in Cloud Composer. You can create one or more Cloud Composer environments inside of a project. Environments are self-contained Airflow deployments based on Google Kubernetes Engine. These environments work with GCP services through connectors that are built into Airflow.

You create Cloud Composer environments in supported regions, and the environments run within a Compute Engine zone. For simple use cases, you can create one environment in one region. For complex use cases, you can create multiple environments within a single region or across multiple regions. Airflow communicates with other GCP products through the products' public APIs.
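
As a hedged illustration, the following Python sketch creates an environment through the Cloud Composer REST API. It assumes the google-api-python-client package, application-default credentials, and the Cloud Composer API enabled in your project; the project, region, and environment names are placeholders.

    from googleapiclient import discovery

    # Build a client for the Cloud Composer API
    # (assumes application-default credentials are configured).
    composer = discovery.build('composer', 'v1')

    # Hypothetical project, region, and environment name.
    parent = 'projects/my-project/locations/us-central1'
    body = {'name': parent + '/environments/my-environment'}

    # Creating an environment returns a long-running operation.
    operation = composer.projects().locations().environments().create(
        parent=parent, body=body).execute()
    print(operation['name'])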

Architecture

When you create an environment, Cloud Composer distributes the environment's resources between a Google-managed tenant project and a customer project, as shown in the following diagram:

Diagram: Cloud Composer environment resources in the tenant project and the customer project.

Tenant project resources

For unified Cloud Identity and Access Management access control and an additional layer of data security, Cloud Composer deploys Cloud SQL and App Engine in the tenant project.

Cloud SQL

Cloud SQL stores the Airflow metadata. To protect sensitive connection and workflow information, Cloud Composer limits database access to the default or custom service account that was used to create the environment. Cloud Composer backs up the Airflow metadata daily to minimize potential data loss.

The service account used to create the Cloud Composer environment is the only account that can access your data in the Cloud SQL database. To remotely authorize access to your Cloud SQL database from an application, client, or other GCP service, Cloud Composer provides the Cloud SQL proxy in the GKE cluster.
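
For example, here is a minimal sketch of querying the metadata database through the proxy, run from a pod inside the environment's GKE cluster. The proxy service name, credentials, and database name below are assumptions that vary by environment, and the PyMySQL package is assumed to be available:

    import pymysql

    # Hypothetical in-cluster address of the Cloud SQL proxy service;
    # check your environment's GKE cluster for the actual service name.
    connection = pymysql.connect(
        host='airflow-sqlproxy-service.default',
        port=3306,
        user='root',           # assumed database user
        database='airflow-db', # assumed metadata database name
    )
    with connection.cursor() as cursor:
        # Read a few DAG IDs from the Airflow metadata tables.
        cursor.execute('SELECT dag_id FROM dag LIMIT 5')
        for (dag_id,) in cursor.fetchall():
            print(dag_id)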

App Engine

App Engine flexible environment hosts the Airflow webserver. By default, the Airflow webserver is integrated with Cloud Identity-Aware Proxy. Cloud Composer hides the Cloud IAP integration details and enables you to use the Cloud Composer IAM policy to manage webserver access. To grant access only to the Airflow webserver, you can assign the composer.user role, or you can assign other Cloud Composer roles that also provide access to other resources in your environment. For organizations with additional access-control requirements, Cloud Composer also supports deploying a self-managed Airflow webserver in the customer project.
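
As a sketch, assuming the google-api-python-client package and a caller with permission to set the project IAM policy, a composer.user binding could be added programmatically; the project ID and member email are placeholders:

    from googleapiclient import discovery

    crm = discovery.build('cloudresourcemanager', 'v1')
    project = 'my-project'  # hypothetical project ID

    # Read-modify-write the project IAM policy to grant webserver access.
    policy = crm.projects().getIamPolicy(resource=project, body={}).execute()
    policy.setdefault('bindings', []).append({
        'role': 'roles/composer.user',
        'members': ['user:alice@example.com'],  # hypothetical user
    })
    crm.projects().setIamPolicy(
        resource=project, body={'policy': policy}).execute()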

Customer project resources

Cloud Composer deploys Cloud Storage, Google Kubernetes Engine, and Stackdriver in your customer project.

Cloud Storage

Cloud Storage provides the storage bucket for staging DAGs, plugins, data dependencies, and logs. To deploy workflows (DAGs), you copy your files to the bucket for your environment. Cloud Composer takes care of synchronizing the DAGs among workers, schedulers, and the webserver. With Cloud Storage, you can store your workflow artifacts in the data/ and logs/ folders without worrying about size limitations, and you retain full access control over your data.
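
For example, here is a minimal sketch of deploying a DAG with the google-cloud-storage client library; the bucket name is a placeholder (you can find your environment's bucket in the environment details):

    from google.cloud import storage

    client = storage.Client()
    # Hypothetical bucket name for the environment.
    bucket = client.bucket('us-central1-my-environment-bucket')

    # Copy a local DAG file into the dags/ folder; Cloud Composer
    # synchronizes it to the scheduler, workers, and webserver.
    blob = bucket.blob('dags/my_dag.py')
    blob.upload_from_filename('my_dag.py')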

Google Kubernetes Engine

Cloud Composer deploys core components, such as the Airflow scheduler, worker nodes, and the CeleryExecutor, in a GKE cluster. Redis, the message broker for the CeleryExecutor, runs as a StatefulSet application so that messages persist across container restarts.

Running the scheduler and workers on GKE enables you to use the KubernetesPodOperator to run any container workload. By default, Cloud Composer enables auto-upgrade and auto-repair to protect the GKE clusters from security vulnerabilities.
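
As a minimal sketch (using the Airflow 1.10 contrib import path; the image and namespace are placeholders), a DAG can launch an arbitrary container in the environment's cluster:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator)

    # Minimal DAG that runs one container workload on GKE.
    with DAG('pod_example',
             start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        hello = KubernetesPodOperator(
            task_id='hello-pod',
            name='hello-pod',
            namespace='default',   # placeholder namespace
            image='ubuntu:18.04',  # placeholder image
            cmds=['echo', 'hello from GKE'],
        )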

Note that the Airflow worker and scheduler nodes and the Airflow webserver run under different service accounts.

  • Scheduler and workers: If you do not specify a service account during environment creation, the environment runs under the default Compute Engine service account.
  • Webserver: The service account is auto-generated during environment creation and derived from the webserver domain. For example, if the domain is foo-tp.appspot.com, the service account is foo-tp@appspot.gserviceaccount.com.

You can see serviceAccount and airflowUri information in the environment details.
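
For example, here is a sketch of reading both fields through the Cloud Composer REST API with google-api-python-client; the project, region, and environment names are placeholders:

    from googleapiclient import discovery

    composer = discovery.build('composer', 'v1')
    name = ('projects/my-project/locations/us-central1/'
            'environments/my-environment')  # hypothetical

    env = composer.projects().locations().environments().get(
        name=name).execute()
    print(env['config']['nodeConfig']['serviceAccount'])  # scheduler/workers
    print(env['config']['airflowUri'])                    # webserver URL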

Stackdriver

Cloud Composer integrates with Stackdriver for monitoring and logging, so you have a central place to view all Airflow service and workflow logs. Because of the streaming nature of Stackdriver Logging, you can view the logs that the Airflow scheduler and workers emit immediately, instead of waiting for the Airflow logging module to synchronize. And because the Stackdriver logs for Cloud Composer are based on google-fluentd, you have access to all logs that the scheduler and worker containers produce. These logs greatly improve debugging and contain useful system-level and Airflow dependency information.
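
As an illustration, here is a sketch that reads recent worker errors with the google-cloud-logging client library; the filter is an assumption and may need adjusting for your project:

    from google.cloud import logging

    client = logging.Client()
    # Illustrative filter for Airflow worker error logs.
    log_filter = 'logName:"airflow-worker" AND severity>=ERROR'
    for entry in client.list_entries(filter_=log_filter, page_size=10):
        print(entry.timestamp, entry.payload)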

Important configuration information

  • Some Airflow parameters are preconfigured for Cloud Composer environments, and you cannot change them. You configure other parameters when you create your environment.
  • Cloud Composer relies on the following configurations to execute workflows successfully:
    • The Cloud Composer service backend coordinates with its GKE service agent through Cloud Pub/Sub. Do not delete .*-composer-.* topics.
    • The Cloud Composer service coordinates logging with Stackdriver. To limit the number of logs in your GCP project, you can stop all log ingestion. Do not disable Stackdriver.
    • Do not modify the Cloud Identity and Access Management policy binding for the Cloud Composer service account, for example service-your-project-number@cloudcomposer-accounts.iam.gserviceaccount.com.
    • Do not change the Airflow database schema.
  • The gcloud command-line tool enables you to run the Airflow CLI; see the sketch after this list. To avoid affecting database operations, do not issue the resetdb, initdb, or upgradedb Airflow commands.
  • A Cloud Composer release running a stable Airflow version can include Airflow updates that are backported from a future Airflow version.
  • The worker and scheduler nodes have a different capacity and run under a different service account than the Airflow webserver. To avoid DAG failures on the Airflow webserver, do not perform heavyweight computation at DAG parse time, and do not access GCP resources that the webserver cannot access.
  • Deleting your environment does not delete the following data in your customer project: the Cloud Storage bucket for your environment and the Stackdriver logs. To avoid incurring charges to your GCP account, export and delete your data as needed.
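
As referenced above, here is a minimal sketch of invoking the Airflow CLI through the gcloud wrapper from Python; the environment name and location are placeholders, and list_dags is a read-only Airflow 1 subcommand (unlike resetdb, initdb, and upgradedb):

    import subprocess

    # Run a read-only Airflow CLI command via gcloud.
    subprocess.run([
        'gcloud', 'composer', 'environments', 'run',
        'my-environment', '--location', 'us-central1',
        'list_dags',
    ], check=True)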
