Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Cloud Composer is now in beta: build and run practical workflows with minimal effort

Tuesday, May 1, 2018

By James Malone, Product Manager

Today, Google Cloud is announcing the beta launch of Cloud Composer, a managed Apache Airflow service, to make workflow creation and management easy, powerful, and consistent. We’ve experienced and investigated the challenges of workflow orchestration, and worked to deliver a better experience that saves users both time and stress.

Google Cloud Composer (left) provides a managed Apache Airflow service (right)

Analysts, data engineers, and other users can use Cloud Composer’s fully managed capabilities to author, schedule, and monitor their workflows. Composer offers a unique package of attributes to users, including:

  • End-to-end GCP integration: Orchestrate your full GCP pipeline through Cloud Composer’s deep integration within Google Cloud Platform.
  • Hybrid and multi-cloud environments: Connect your pipeline and break down silos through a single orchestration service whether your workflow lives on-premises or in multiple clouds.
  • Easy workflow orchestration: Get moving quickly by coding your directed acyclic graphs (DAGs) using Python, and increase the reliability of your workflows through graphs that highlight areas to troubleshoot.
  • Open source at its core: Ensure freedom from lock-in through Apache Airflow’s open source portability, and gain access to new integrations developed by the community.

We believe there should be an easy and reliable workflow solution at a platform level, like other cloud services. With the aforementioned features and others outlined below, Cloud Composer delivers a single managed solution to create and manage workflows, regardless of where they live, and gives you the portability to take these critical pieces of infrastructure with you if you migrate your environment.

With this broad context, let’s explore traditional workflow construction and the Apache Airflow project, then take a deeper dive into Cloud Composer, including how to get started and some early successes users have had with the service.

The pros and cons of traditional workflows

Creating a workflow to automate a process is usually pretty easy. For example, it’s common to fire up a text editor and write some Bash or Python to automate a process. Once you opt to schedule a script via a mechanism like cron, you can focus your time and attention on other more important tasks. You minimize time spent and likelihood of errors: that’s a win every time, right? Unfortunately, the ability for one person to automate is a double-edged sword.

When creating this workflow, did the author use standard tools and save time by reusing previously developed code from other workflows? Do other people on the team or in the organization know this workflow exists and how it works? Is it easy for everyone to understand the state of this workflow and to investigate any problems when they occur? Will workflow authors all readily know the APIs needed to create rich workflows? Without a common workflow language and system, the answer to these questions is most frequently “no.”

Considering these workflows can be mission-critical, we believe it should be easy to answer “yes” to all of these questions. Anyone from an analyst to an experienced software developer should be able to author and manage workflows in a way that saves time and reduces risk.

Why Apache Airflow?

When we started building Cloud Composer, we knew we wanted to base the product on an open-source project. Consistent with GCP’s “open Cloud” philosophy, one of our must-have features was a no-lock-in approach. If you develop a workflow on Google Cloud Platform (GCP), it should be easily portable to other environments. Likewise, we wanted Composer to work across a wide range of technologies beyond those in GCP.

A number of awesome open-source projects have focused on workflows, including Apache Oozie, Luigi, Azkaban, and Apache Airflow. We chose to base Cloud Composer on Airflow for many reasons, but specifically because Airflow:

  1. Has an active and diverse developer community
  2. Is based on Python with support for custom plugins
  3. Includes connectors (called operators in Airflow) for many clouds and common technologies
  4. Features an elegant web user interface and command-line tooling
  5. Provides support for multi-cloud and hybrid cloud orchestration
  6. Has been used in production settings by companies large and small, including WePay, Airbnb (who created Airflow initially), Quizlet, and others.

If you are new to Airflow, you may want to check out the Airflow core concepts.

Cloud Composer overview

In building Cloud Composer, we wanted to combine the strengths of Google Cloud Platform with Airflow. We set out to build a service that offers the best of Airflow without the overhead of installing and managing Airflow yourself. As a result, you spend less time on low-value work, such as getting Airflow installed, and more time on what matters: your workflows.

This initial beta release of Cloud Composer includes several features to save you time and increase the usefulness of your Airflow deployments, including:

This initial release is our starting point, and we have a number of features planned, including additional Google Cloud regions, Airflow and Python version selection, and autoscaling.

We are passionate about being an active member of the Apache Airflow community. Leading up to this beta release, the Cloud Composer team submitted over 20 pull requests to the Airflow project. Currently, we are working on several efforts to help extend Airflow, including our work on KubernetesExecutor. The Cloud Composer team has a number of future efforts planned in core Airflow and looks forward to collaborating with the wider Airflow community.

Getting started

If you're already familiar with Apache Airflow, you already know how to use Google Cloud Composer. If you are new to Apache Airflow, the Airflow DAG tutorial is a good place to start. The Airflow API reference is also useful, since it explains core concepts in Airflow's technical design.

Using Cloud Composer with GCP products is easy. Currently, Cloud Composer and Airflow have support for BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datastore, Cloud Storage, and Cloud Pub/Sub. The Airflow GCP documentation includes specifics on how to use the operators for these products.

As an example, suppose you want to run a query with BigQuery and then export the results to Cloud Storage. In the Airflow DAG that you run on Cloud Composer, you’d simply use the BigQueryOperator and BigQueryToCloudStorageOperator operators.

To start, you can configure variables used by these operators, such as the dataset name, query parameters, and file output path. These variables can be local to the DAG or defined at the environment level, such as the gcs_bucket variable.

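from airflow import models
from airflow.contrib.operators import bigquery_operator, bigquery_to_gcs

# {{ ds_nodash }} is an Airflow template macro: the execution date as YYYYMMDD.
bq_dataset_name = 'airflow_bq_dataset_{{ ds_nodash }}'
bq_githib_commits_table_id = bq_dataset_name + '.github_commits'
output_file = 'gs://{gcs_bucket}/github_commits.csv'.format(
    gcs_bucket=models.Variable.get('gcs_bucket'))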
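To make the naming concrete, here is a minimal sketch in plain Python of what these strings resolve to for a single run. It does not use Airflow: the `render_ds_nodash` helper is hypothetical and only mimics the `{{ ds_nodash }}` macro with a string replacement, and the bucket name `my-composer-bucket` is an assumed stand-in for the value of the `gcs_bucket` variable.

```python
from datetime import date


def render_ds_nodash(template: str, execution_date: date) -> str:
    """Hypothetical stand-in for Airflow's template rendering.

    Replaces only the '{{ ds_nodash }}' placeholder with the execution
    date formatted as YYYYMMDD; Airflow itself renders templates with Jinja.
    """
    return template.replace('{{ ds_nodash }}', execution_date.strftime('%Y%m%d'))


# Same string construction as in the DAG snippet above.
bq_dataset_name = 'airflow_bq_dataset_{{ ds_nodash }}'
bq_table_id = bq_dataset_name + '.github_commits'

# 'my-composer-bucket' is an assumed value for the gcs_bucket variable.
output_file = 'gs://{gcs_bucket}/github_commits.csv'.format(
    gcs_bucket='my-composer-bucket')

rendered_table_id = render_ds_nodash(bq_table_id, date(2018, 5, 1))
print(rendered_table_id)  # airflow_bq_dataset_20180501.github_commits
print(output_file)        # gs://my-composer-bucket/github_commits.csv
```

The dataset name changes with each run's execution date, while the output path is fixed once, when the DAG file is parsed.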

After defining variables, you can then use the built-in Airflow operators to query BigQuery and then export the results to Cloud Storage.

# Perform query of Airflow GitHub commits
bq_airflow_commits_query = bigquery_operator.BigQueryOperator(
    task_id='bq_airflow_commits_query',
    bql="""
    SELECT commit, subject, message
    FROM [bigquery-public-data:github_repos.commits]
    WHERE repo_name contains 'airflow'
    """,
    destination_dataset_table=bq_githib_commits_table_id)

# Export query result to Cloud Storage
export_commits_to_gcs = bigquery_to_gcs.BigQueryToCloudStorageOperator(
    task_id='export_airflow_commits_to_gcs',
    source_project_dataset_table=bq_githib_commits_table_id,
    destination_cloud_storage_uris=[output_file],
    export_format='CSV')

To query BigQuery and export the results, you only need to focus on your custom code (SQL) and configuration parameters, such as the Cloud Storage bucket name. This DAG can be scheduled in Cloud Composer, where you can view its state and logs and manage its execution.
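The export only makes sense once the query has written its table, so the two tasks form a small directed acyclic graph. The sketch below illustrates that ordering idea with Python's standard-library `graphlib` module (Python 3.9+). It is not Airflow code, and `run_task` is a hypothetical placeholder for the work each operator would do. In an Airflow DAG, the same dependency is typically declared with `set_downstream` or the `>>` operator.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# Task names mirror the task_id values used in the operators above.
dependencies = {
    'bq_airflow_commits_query': set(),
    'export_airflow_commits_to_gcs': {'bq_airflow_commits_query'},
}


def run_task(task_id: str) -> str:
    """Hypothetical placeholder: a real scheduler would execute the operator."""
    return 'ran ' + task_id


# static_order() yields tasks so that every dependency comes first.
execution_order = list(TopologicalSorter(dependencies).static_order())
results = [run_task(task_id) for task_id in execution_order]

for line in results:
    print(line)
```

Adding a cycle to `dependencies` would make `static_order()` raise `graphlib.CycleError`, which is the "acyclic" part of the DAG requirement.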

Early success with alpha customers

Cloud Composer’s alpha program gave access to hundreds of users who wanted a better way to orchestrate their workflows. Their feedback has helped us improve the product and has been broadly positive, including:

"At Blue Apron, our data workflows need to operate flawlessly to enable on-time delivery of perishable food. Cloud Composer helps us orchestrate our pipelines more reliably by allowing us to author, schedule, and monitor our workflows from one place using Python."

—Michael Collis, Staff Software Engineer, Blue Apron

"Cloud Composer gives us the flexibility we need in workflow orchestration. The ability to define our workflows in Python and use a range of graphs to monitor them makes the challenges of data processing easier for us."

—Akshar Dave, Engineering Manager, Demandbase

Pricing

Pricing for Cloud Composer is consumption based, so you pay for what you use, as measured by vCPU/hour, GB/month, and GB transferred/month. We have multiple pricing units because Cloud Composer uses several GCP products as building blocks. Pricing is uniform across all levels of consumption and sustained usage. For more information, visit the Cloud Composer product page.
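As a rough illustration of how the three units combine into one bill, the sketch below multiplies usage by rate for each unit and sums the results. Every rate and usage figure in it is a made-up placeholder, not an actual Cloud Composer price, and the names `HYPOTHETICAL_RATES` and `estimate_monthly_cost` are invented for this example.

```python
# Placeholder rates in USD; these are NOT real Cloud Composer prices.
HYPOTHETICAL_RATES = {
    'vcpu_hours': 0.10,         # per vCPU per hour
    'storage_gb_months': 0.25,  # per GB stored per month
    'transfer_gb': 0.15,        # per GB transferred in the month
}


def estimate_monthly_cost(usage: dict, rates: dict) -> float:
    """Sum usage * rate over every pricing unit present in `rates`."""
    return sum(usage.get(unit, 0.0) * rate for unit, rate in rates.items())


# Assumed usage: 2 vCPUs for a 30-day month, 20 GB stored, 10 GB transferred.
example_usage = {
    'vcpu_hours': 2 * 24 * 30,
    'storage_gb_months': 20,
    'transfer_gb': 10,
}

total = estimate_monthly_cost(example_usage, HYPOTHETICAL_RATES)
print('Estimated monthly cost: ${:.2f}'.format(total))
```

Because pricing is uniform across consumption levels, a linear sum like this is a reasonable mental model. Substitute the current rates from the product page before relying on the result.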

Next steps

Try Cloud Composer, either by following the “Getting started” section above or by working through the Cloud Composer quickstart in the documentation. For additional examples of using GCP in Airflow DAGs, check out the Cloud Composer documentation. You can get help and share your thoughts about Cloud Composer via:

This beta release of Cloud Composer is a major milestone, but the first of many releases. You can follow our release notes for updates and fixes to Cloud Composer. We’d like to thank all of our alpha testers and the Apache Airflow community, and we look forward to future releases of Airflow and Cloud Composer!
