Schedule workloads

BigQuery tasks are usually part of larger workloads, with external tasks triggering BigQuery operations and BigQuery operations in turn triggering downstream tasks. Workload scheduling helps data administrators, analysts, and developers organize and optimize this chain of actions, creating a seamless connection across data resources and processes. Scheduling methods and tools assist in designing, building, implementing, and monitoring these complex data workloads.

Choose a scheduling method

To select a scheduling method, identify whether your workloads are event-driven, time-driven, or both. An event is a state change, such as a change to data in a database or a file added to a storage system. In event-driven scheduling, an action on a website might trigger a data activity, or an object landing in a certain bucket might need to be processed immediately on arrival. In time-driven scheduling, new data might need to be loaded once per day, or frequently enough to produce hourly reports. You can combine event-driven and time-driven scheduling in scenarios where, for example, objects must be loaded into a data lake in real time but activity reports on the data lake are generated only daily.
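The two methods can be sketched in a few lines of standard-library Python; the bucket, object names, and handler below are hypothetical stand-ins for real notification triggers and cron services:

```python
import sched
import time

processed = []  # records which objects were handled, in order

# Hypothetical handler shared by both scheduling styles.
def process_new_object(bucket, name):
    processed.append(f"gs://{bucket}/{name}")

# Event-driven: react to a state change as soon as it is observed.
# A plain callback stands in for a storage notification.
def on_object_finalized(event):
    process_new_object(event["bucket"], event["name"])

on_object_finalized({"bucket": "raw-data", "name": "orders.csv"})

# Time-driven: run at a scheduled time regardless of events.
# The stdlib sched module stands in for a cron-style service.
timer = sched.scheduler(time.time, time.sleep)
timer.enter(0.01, 1, process_new_object, ("raw-data", "daily-export.csv"))
timer.run()  # blocks until the scheduled action has fired
```

The event-driven path runs as soon as the callback fires; the time-driven path waits for its scheduled moment even if no event has occurred.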

Choose a scheduling tool

Scheduling tools assist with tasks that are involved in managing complex data workloads, such as combining multiple Google Cloud or third-party services with BigQuery jobs, or running multiple BigQuery jobs in parallel. Each workload has unique requirements for dependency and parameter management to ensure that tasks are executed in the correct order using the correct data. Google Cloud provides several scheduling options that are based on scheduling method and workload requirements.

We recommend using Dataform, Workflows, Cloud Composer, or Vertex AI Pipelines for most use cases. Consult the following chart for a side-by-side comparison:

|                  | Dataform                      | Workflows                          | Cloud Composer                         | Vertex AI Pipelines  |
|------------------|-------------------------------|------------------------------------|----------------------------------------|----------------------|
| Focus            | Data transformation           | Microservices                      | ETL or ELT                             | Machine learning     |
| Complexity       | *                             | **                                 | ***                                    | **                   |
| User profile     | Data analyst or administrator | Data architect                     | Data engineer                          | Data analyst         |
| Code type        | JavaScript and SQL            | YAML or JSON                       | Python                                 | Python               |
| Serverless?      | Yes                           | Yes                                | Fully managed                          | Yes                  |
| Not suitable for | Chains of external services   | Data transformation and processing | Low-latency or event-driven pipelines  | Infrastructure tasks |

The following sections detail these scheduling tools and several others.

Scheduled queries

The simplest form of workload scheduling is scheduling recurring queries directly in BigQuery. While this is the least complex approach to scheduling, we recommend it only for straightforward query chains with no external dependencies. Queries scheduled in this way must be written in GoogleSQL and can include data definition language (DDL) and data manipulation language (DML) statements.

Scheduling method: time-driven
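For example, a scheduled query might use a DML MERGE statement to keep a rollup table current; the dataset and table names below are illustrative:

```sql
-- Illustrative GoogleSQL for a scheduled query: refresh a daily
-- rollup table from an events table (names are hypothetical).
MERGE mydataset.daily_totals AS t
USING (
  SELECT DATE(event_time) AS day, COUNT(*) AS events
  FROM mydataset.events
  WHERE DATE(event_time) = CURRENT_DATE()
  GROUP BY day
) AS s
ON t.day = s.day
WHEN MATCHED THEN UPDATE SET t.events = s.events
WHEN NOT MATCHED THEN INSERT (day, events) VALUES (s.day, s.events);
```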

Dataform

Dataform is a free, SQL-based, opinionated transformation framework that schedules complex data transformation tasks in BigQuery. When raw data is loaded into BigQuery, Dataform helps you create an organized, tested, version-controlled collection of datasets and tables. Use Dataform to schedule runs for your data preparations, notebooks, and BigQuery workflows.

Scheduling method: time-driven
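As an illustration, a Dataform SQLX file pairs a config block with a SQL query; the dataset and table names here are hypothetical:

```
config {
  type: "table",
  schema: "reporting",
  description: "Daily order totals"
}

SELECT
  DATE(order_time) AS day,
  SUM(amount) AS total
FROM ${ref("orders")}
GROUP BY day
```

The `${ref("orders")}` reference lets Dataform infer dependencies between tables, so scheduled runs execute in the correct order.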

Workflows

Workflows is a serverless tool that schedules HTTP-based services with very low latency. It is best for chaining microservices together, automating infrastructure tasks, integrating with external systems, or creating a sequence of operations in Google Cloud. To learn more about using Workflows with BigQuery, see Run multiple BigQuery jobs in parallel.

Scheduling method: event-driven and time-driven
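As a sketch, a minimal Workflows definition can run a BigQuery query through the Workflows connector for the BigQuery API; the query here runs against a public sample dataset:

```yaml
main:
  steps:
    - runQuery:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          body:
            useLegacySql: false
            query: SELECT corpus, COUNT(*) AS n
              FROM `bigquery-public-data.samples.shakespeare`
              GROUP BY corpus ORDER BY n DESC LIMIT 5
        result: queryResult
    - returnResult:
        return: ${queryResult}
```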

Cloud Composer

Cloud Composer is a fully managed tool built on Apache Airflow. It is best for extract, transform, load (ETL) or extract, load, transform (ELT) workloads as it supports several operator types and patterns, as well as task execution across other Google Cloud products and external targets. To learn more about using Cloud Composer with BigQuery, see Run a data analytics DAG in Google Cloud.

Scheduling method: time-driven

Vertex AI Pipelines

Vertex AI Pipelines is a serverless tool based on Kubeflow Pipelines, designed specifically for scheduling machine learning workloads. It automates and connects all of the tasks in your model development and deployment, from training data to code, giving you a complete view of how your models work. To learn more about using Vertex AI Pipelines with BigQuery, see Export and deploy a BigQuery machine learning model for prediction.

Scheduling method: event-driven

Apigee Integration

Apigee Integration is an extension of the Apigee platform that includes connectors and data transformation tools. It is best for integrating with external enterprise applications, like Salesforce. To learn more about using Apigee Integration with BigQuery, see Get started with Apigee Integration and a Salesforce trigger.

Scheduling method: event-driven and time-driven

Cloud Data Fusion

Cloud Data Fusion is a data integration tool that offers code-free ETL/ELT pipelines and more than 150 preconfigured connectors and transformations. To learn more about using Cloud Data Fusion with BigQuery, see Replicating data from MySQL to BigQuery.

Scheduling method: event-driven and time-driven

Cloud Scheduler

Cloud Scheduler is a fully managed scheduler for jobs, such as batch jobs or infrastructure operations, that should run on defined time intervals. To learn more about using Cloud Scheduler with BigQuery, see Scheduling workflows with Cloud Scheduler.

Scheduling method: time-driven
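As a sketch, a Cloud Scheduler job defined through the REST API can invoke an HTTP endpoint on a cron schedule, for example a service that starts a BigQuery load every hour. The project, job name, URL, and service account below are hypothetical:

```json
{
  "name": "projects/my-project/locations/us-central1/jobs/hourly-load",
  "schedule": "0 * * * *",
  "timeZone": "Etc/UTC",
  "httpTarget": {
    "uri": "https://example-service-url/run-load-job",
    "httpMethod": "POST",
    "oidcToken": {
      "serviceAccountEmail": "scheduler@my-project.iam.gserviceaccount.com"
    }
  }
}
```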

Cloud Tasks

Cloud Tasks is a fully managed service for asynchronous task distribution of jobs that can execute independently, outside of your main workload. It is best for delegating slow background operations or managing API call rates. To learn more about using Cloud Tasks with BigQuery, see Add a task to a Cloud Tasks queue.

Scheduling method: event-driven
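As an illustration, an HTTP task body for the Cloud Tasks REST API might look like the following; the worker URL is hypothetical, and the body field is the base64 encoding of {"report": "daily"}:

```json
{
  "httpRequest": {
    "url": "https://example-worker-url/export-report",
    "httpMethod": "POST",
    "headers": { "Content-Type": "application/json" },
    "body": "eyJyZXBvcnQiOiAiZGFpbHkifQ=="
  }
}
```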

Third-party tools

You can also connect to BigQuery using a number of popular third-party tools such as CData and SnapLogic. The BigQuery Ready program offers a full list of validated partner solutions.

Messaging tools

Many data workloads require additional messaging connections between decoupled microservices that only need to be activated when certain events occur. Google Cloud provides two tools that are designed to integrate with BigQuery.

Pub/Sub

Pub/Sub is an asynchronous messaging tool for data integration pipelines. It is designed to ingest and distribute data like server events and user interactions. It can also be used for parallel processing and data streaming from IoT devices. To learn more about using Pub/Sub with BigQuery, see Stream from Pub/Sub to BigQuery.
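For example, a Pub/Sub subscription can write messages directly to a BigQuery table. A sketch of such a subscription body for the REST API, with hypothetical project, topic, and table names, assuming the topic has an associated schema:

```json
{
  "topic": "projects/my-project/topics/server-events",
  "bigqueryConfig": {
    "table": "my-project.events_dataset.raw_events",
    "useTopicSchema": true,
    "writeMetadata": false
  }
}
```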

Eventarc

Eventarc is an event-driven tool that lets you manage the flow of state changes throughout your data pipeline. This tool has a wide range of use cases including automated error remediation, resource labeling, image retouching, and more. To learn more about using Eventarc with BigQuery, see Build a BigQuery processing pipeline with Eventarc.

What's next