Overview of Cloud Composer

This page provides an overview of a Cloud Composer environment and the Google Cloud Platform products used for an Apache Airflow deployment.

Cloud Composer is a managed workflow orchestration service that is built on Airflow. Similar to an on-premises deployment, Cloud Composer deploys multiple components to run Airflow. This page describes the GCP components, their functions, and how you run your workflows.

Also similar to on-premises, Cloud Composer relies on certain configurations to successfully execute your workflows. Altering configurations that Cloud Composer relies on for connections or communications can have unintended consequences or break your Airflow deployment. This page identifies the important environment configurations.

Environments

Airflow has a micro-service architecture. To deploy Airflow in a distributed setup, Cloud Composer provisions several GCP components, which are collectively known as a Cloud Composer environment.

Environments are a core concept in Cloud Composer. You can create one or more Cloud Composer environments inside of a project. Environments are self-contained Airflow deployments based on Google Kubernetes Engine. These environments work with GCP services through connectors that are built into Airflow.

You create Cloud Composer environments in supported regions, and the environments run within a Compute Engine zone. For simple use cases, you can create one environment in one region. For complex use cases, you can create multiple environments within a single region or across multiple regions. Airflow communicates with other GCP products through the products' public APIs.
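
For illustration, the following minimal DAG sketch uses one of Airflow's built-in GCP connectors to run a BigQuery query. The import path matches the Airflow 1.x releases that Cloud Composer environments run, and the project, dataset, and table names are hypothetical:

```python
# A minimal sketch of a DAG that uses a built-in GCP connector.
# Cloud Composer supplies the default connection credentials, so the DAG
# needs no explicit authentication.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('example_gcp_dag',
         default_args=default_args,
         schedule_interval='@daily') as dag:

    # Run a query through Airflow's built-in BigQuery connector.
    aggregate = BigQueryOperator(
        task_id='aggregate_events',
        sql='SELECT COUNT(*) FROM `my-project.my_dataset.events`',  # hypothetical table
        use_legacy_sql=False,
    )
```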

Architecture

When you create an environment, Cloud Composer distributes the environment's resources between a Google-managed tenant project and a customer project, as shown in the following diagram:

Figure: Cloud Composer environment resources in the tenant project and the customer project

Tenant project resources

For unified Cloud Identity and Access Management access control and an additional layer of data security, Cloud Composer deploys Cloud SQL and App Engine in the tenant project.

Cloud SQL

Cloud SQL stores the Airflow metadata. To protect sensitive connection and workflow information, Cloud Composer limits database access to the default or the specified custom service account used to create the environment. Cloud Composer backs up the Airflow metadata daily to minimize potential data loss.

The service account used to create the Cloud Composer environment is the only account that can access your data in the Cloud SQL database. To remotely authorize access to your Cloud SQL database from an application, client, or other GCP service, Cloud Composer provides the Cloud SQL proxy in the GKE cluster.
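
For example, a client running inside the GKE cluster might reach the metadata database through the proxy with something like the following sketch. The proxy service hostname, user, and database name are assumptions; check your cluster for the actual values:

```python
# A minimal sketch of connecting to the Airflow metadata database through
# the Cloud SQL proxy from inside the GKE cluster. The hostname, user, and
# database name below are assumptions, not documented constants.
import pymysql

connection = pymysql.connect(
    host='airflow-sqlproxy-service.default',  # assumed in-cluster proxy service
    user='root',                              # assumed MySQL user
    database='airflow-db',                    # assumed metadata database name
)
try:
    with connection.cursor() as cursor:
        # List a few DAGs from the Airflow metadata tables.
        cursor.execute('SELECT dag_id, is_paused FROM dag LIMIT 10')
        for dag_id, is_paused in cursor.fetchall():
            print(dag_id, is_paused)
finally:
    connection.close()
```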

App Engine

App Engine flexible environment hosts the Airflow web server. By default, the Airflow web server is integrated with Cloud Identity-Aware Proxy. Cloud Composer hides the Cloud IAP integration details and enables you to use the Cloud Composer IAM policy to manage web server access. To grant access only to the Airflow web server, you can assign the composer.user role, or you can assign different Cloud Composer roles that provide access to other resources in your environment. For organizations with additional access-control requirements, Cloud Composer also supports deploying a self-managed Airflow web server in the customer project.
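
For example, you could grant the composer.user role at the project level through the Cloud Resource Manager API, as in the following sketch. The project ID and member are hypothetical:

```python
# A minimal sketch of granting roles/composer.user so a user can access the
# Airflow web server. The project ID and member below are hypothetical.
from googleapiclient import discovery

project_id = 'my-project'            # hypothetical
member = 'user:jane@example.com'     # hypothetical

crm = discovery.build('cloudresourcemanager', 'v1')
policy = crm.projects().getIamPolicy(resource=project_id, body={}).execute()

# Append a binding for the Cloud Composer user role.
policy['bindings'].append({
    'role': 'roles/composer.user',
    'members': [member],
})
crm.projects().setIamPolicy(
    resource=project_id, body={'policy': policy}).execute()
```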

Customer project resources

Cloud Composer deploys Cloud Storage, Google Kubernetes Engine, and Stackdriver in your customer project.

Cloud Storage

Cloud Storage provides the storage bucket for staging DAGs, plugins, data dependencies, and logs. To deploy workflows (DAGs), you copy your files to the bucket for your environment. Cloud Composer takes care of synchronizing the DAGs among workers, schedulers, and the web server. With Cloud Storage, you can store your workflow artifacts in the data/ and logs/ folders without worrying about size limitations, and you retain full access control over your data.
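
For example, the following sketch deploys a DAG file with the Cloud Storage Python client. The bucket name is hypothetical; you can find your environment's bucket in the environment details:

```python
# A minimal sketch of deploying a DAG by copying it into the environment's
# bucket. The bucket name below is hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('us-central1-my-environment-bucket')  # hypothetical

# DAG files go in the dags/ folder; Cloud Composer syncs them to the
# workers, schedulers, and web server.
blob = bucket.blob('dags/example_gcp_dag.py')
blob.upload_from_filename('example_gcp_dag.py')
```

Copying the file with gsutil cp example_gcp_dag.py gs://<your-bucket>/dags/ achieves the same result.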

Google Kubernetes Engine

By default, Cloud Composer deploys core components, such as the Airflow scheduler, worker nodes, and the CeleryExecutor, in a GKE cluster. For additional scale and security, Cloud Composer also supports VPC-native clusters using alias IPs.

Redis, the message broker for the CeleryExecutor, runs as a StatefulSet application so that messages persist across container restarts.

Running the scheduler and workers on GKE enables you to use the KubernetesPodOperator to run any container workload. By default, Cloud Composer enables auto-upgrade and auto-repair to protect the GKE clusters from security vulnerabilities. If you need to upgrade your Cloud Composer GKE cluster before the auto-upgrade cycle, you can perform a manual upgrade.
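
For example, the following sketch runs an arbitrary container image as a task. The import path matches Airflow 1.x, and the image name is hypothetical:

```python
# A minimal sketch of running an arbitrary container image with the
# KubernetesPodOperator in the environment's GKE cluster.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG('pod_example',
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    KubernetesPodOperator(
        task_id='run-container',
        name='run-container',
        namespace='default',
        image='gcr.io/my-project/my-image:latest',  # hypothetical image
        cmds=['echo'],
        arguments=['hello from a pod'],
    )
```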

Note that the Airflow worker and scheduler nodes and the Airflow web server run under different service accounts.

  • Scheduler and workers: If you do not specify a service account during environment creation, the environment runs under the default Compute Engine service account.
  • Web server: The service account is auto-generated during environment creation and is derived from the web server domain. For example, if the domain is foo-tp.appspot.com, the service account is foo-tp@appspot.gserviceaccount.com.

You can see serviceAccount and airflowUri information in the environment details.
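
For example, the following sketch reads those fields through the Cloud Composer API. The project, location, and environment names are hypothetical:

```python
# A minimal sketch of reading an environment's details, including the node
# service account and the Airflow web server URI.
from googleapiclient import discovery

composer = discovery.build('composer', 'v1')
name = ('projects/my-project/locations/us-central1/'
        'environments/my-environment')  # hypothetical
env = composer.projects().locations().environments().get(name=name).execute()

print(env['config']['nodeConfig']['serviceAccount'])
print(env['config']['airflowUri'])
```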

Stackdriver

Cloud Composer integrates with Stackdriver Logging and Stackdriver Monitoring, so you have a central place to view all Airflow service and workflow logs.

Because of the streaming nature of Stackdriver Logging, you can view the logs that the Airflow scheduler and workers emit immediately instead of waiting for Airflow logging module synchronization. And because the Stackdriver logs for Cloud Composer are based on google-fluentd, you have access to all logs the scheduler and worker containers produce. These logs greatly improve debugging and contain useful system-level and Airflow dependency information.

Stackdriver Monitoring collects and ingests metrics, events, and metadata from Cloud Composer to generate insights via dashboards and charts.

Networking and security

By default, Cloud Composer deploys a route-based GKE cluster that uses the default VPC network for machine communications. For additional security and networking flexibility, Cloud Composer also supports the following features.

Shared VPC

Shared VPC enables shared network resource management from a central host project to enforce consistent network policies across projects.

When Cloud Composer participates in a shared VPC, the Cloud Composer environment is in a service project and can invoke services hosted in other GCP projects. Resources within your service projects communicate securely across project boundaries using internal IP addresses. For network and host project requirements, see Configuring shared VPC.

VPC-native Cloud Composer environment

With VPC-native clusters, pod and service IP addresses in the GKE cluster are natively routable within the GCP network, including through VPC Network Peering.

In this configuration, Cloud Composer deploys a VPC-native GKE cluster using alias IP addresses in your environment. When you use VPC-native clusters, GKE automatically chooses a secondary range. For specific networking requirements, you can also configure the secondary ranges for your GKE pods and GKE services during Cloud Composer environment configuration.
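
For example, the following sketch creates a VPC-native environment with explicit secondary ranges through the Cloud Composer API. All names and CIDR ranges are hypothetical; omit the CIDR fields to let GKE choose ranges automatically:

```python
# A minimal sketch of creating a VPC-native environment with explicit
# secondary ranges. All names and ranges below are hypothetical.
from googleapiclient import discovery

composer = discovery.build('composer', 'v1')
parent = 'projects/my-project/locations/us-central1'  # hypothetical

body = {
    'name': parent + '/environments/my-environment',
    'config': {
        'nodeConfig': {
            'ipAllocationPolicy': {
                'useIpAliases': True,
                'clusterIpv4CidrBlock': '10.0.0.0/14',   # pods, hypothetical
                'servicesIpv4CidrBlock': '10.4.0.0/20',  # services, hypothetical
            },
        },
    },
}
composer.projects().locations().environments().create(
    parent=parent, body=body).execute()
```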

Private IP Cloud Composer environment

With private IP, Cloud Composer workflows are fully isolated from the public internet.

In this configuration, Cloud Composer deploys a VPC-native GKE cluster using alias IP addresses in the customer project. The GKE cluster for your environment is configured as a private cluster, and the Cloud SQL instance is configured for private IP. Cloud Composer also creates a peering connection between your customer project's VPC network and your tenant project's VPC network.

Important configuration information

  • Some Airflow parameters are preconfigured for Cloud Composer environments, and you cannot change them. You configure other parameters when you create your environment.
  • Any quotas or limits that apply to the standalone GCP products that Cloud Composer uses for your Airflow deployment also apply to your environment.
  • Cloud Composer relies on the following configurations to execute workflows successfully:
    • The Cloud Composer service backend coordinates with its GKE service agent through Cloud Pub/Sub using subscriptions and relies on Cloud Pub/Sub's default behavior to manage messages. Do not delete .*-composer-.* topics. Cloud Pub/Sub supports a maximum of 10,000 topics per project.
    • The Cloud Composer service coordinates logging with Stackdriver. To limit the number of logs in your GCP project, you can stop all log ingestion. Do not disable Stackdriver.
    • Do not modify the Cloud Identity and Access Management policy binding for the Cloud Composer service account, for example service-your-project-number@cloudcomposer-accounts.iam.gserviceaccount.com.
    • Do not change the Airflow database schema.
  • A Cloud Composer release running a stable Airflow version can include Airflow updates that are backported from a later Airflow version.
  • The worker and scheduler nodes have a different capacity and run under a different service account than the Airflow web server. To avoid DAG failures on the Airflow web server, do not perform heavyweight computation at DAG parse time, and do not access GCP resources that the web server cannot reach (see the sketch after this list).
  • Deleting your environment does not delete the following data in your customer project: the Cloud Storage bucket for your environment, Stackdriver logs, and Cloud Pub/Sub topics. To avoid incurring charges to your GCP account, export and delete your data, as needed.
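
The following sketch contrasts parse-time and run-time work in a DAG file; the heavyweight BigQuery call happens only inside the task callable, so it runs on a worker rather than at parse time on the web server:

```python
# A minimal sketch contrasting parse-time and run-time work in a DAG file.
# The web server parses this file, so keep module-level code lightweight.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Anti-pattern (avoid): calling a GCP service at module level runs at every
# parse, including on the web server, which may lack access to the resource.
#   rows = bigquery_client.query('SELECT ...').result()

def fetch_rows(**kwargs):
    # Good: the GCP call happens only when the task runs on a worker,
    # under the worker's service account.
    from google.cloud import bigquery
    client = bigquery.Client()
    return [dict(row) for row in client.query('SELECT 1').result()]

with DAG('parse_time_safe_dag',
         start_date=datetime(2019, 1, 1),
         schedule_interval='@daily') as dag:
    PythonOperator(task_id='fetch_rows',
                   python_callable=fetch_rows,
                   provide_context=True)
```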
