This document helps you understand the architecture and execution of SQL workflows in Dataform.
You can use Dataform to develop, test, and version control SQL workflows that you can execute in BigQuery to transform data for analytics purposes. You can develop SQL workflows with Dataform core, using SQLX files and optionally JavaScript files, or with JavaScript.
A SQL workflow can consist of the following objects:
- Data source declarations: declarations of BigQuery data sources that let you reference those data sources in Dataform table definitions and SQL operations.
- Tables: tables that you create in Dataform based on the declared data sources or on other tables in your SQL workflow. Dataform supports the following table types: table, incremental table, view, and materialized view.
- Assertions: data quality test queries that you can use to validate table data. Dataform runs assertions every time it updates your SQL workflow and alerts you if any assertions fail.
- Custom SQL operations: SQL statements that Dataform runs in BigQuery as is, without modification.
- Includes: JavaScript files with definitions of variables and functions that you can reuse across your SQL workflow.
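As an illustration, a minimal workflow might pair a data source declaration with a table definition that references it. All names below (`my-project`, `raw`, `events`, and the column names) are placeholders, not values from this document:

```sqlx
-- definitions/source_events.sqlx: declares an existing BigQuery table
-- so that other files can reference it with ref()
config {
  type: "declaration",
  database: "my-project",
  schema: "raw",
  name: "events"
}
```

A table definition can then build on that declaration, with built-in assertions validating the result:

```sqlx
-- definitions/daily_events.sqlx: defines a table from the declared source;
-- the assertions block checks data quality on every update
config {
  type: "table",
  assertions: {
    uniqueKey: ["event_id"],
    nonNull: ["event_timestamp"]
  }
}

SELECT
  event_id,
  event_timestamp,
  user_id
FROM ${ref("events")}
```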
Visualization of a SQL workflow
You can view your SQL workflow visualized in the form of a Directed Acyclic Graph (DAG). The DAG displays all objects of the SQL workflow defined in your workspace, and the relationships between them. You can zoom in and out, and use drag and drop to navigate the DAG. If there are compilation errors in your SQL workflow, Dataform displays an error message instead of the DAG.
To view the DAG of your SQL workflow, in your workspace, click Compiled graph.
Execution of a SQL workflow
In your development workspace, you can manually trigger execution of your entire SQL workflow, a selection of actions, or a selection of tags.
You can schedule executions with Dataform release configurations and workflow configurations. First, create a release configuration to create compilation results of your repository. Then, create a workflow configuration, select a release configuration, select SQL workflow actions that you want to execute, and set the execution schedule.
Alternatively, you can schedule executions either with Cloud Composer, or with Workflows and Cloud Scheduler.
During execution, Dataform executes SQL queries in BigQuery, following the order of object dependencies in your SQL workflow. After execution, you can use your defined tables and views for all your analytics purposes in BigQuery.
Execution configuration options
To execute a specific group of your SQL workflow objects, you can add Dataform execution tags to the files you select. You can then execute only the files with a selected tag when you manually trigger execution.
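For example, adding a tag to a table definition's config block groups it with other files that carry the same tag; the tag and table names below are placeholders:

```sqlx
-- definitions/daily_sales.sqlx: tagged so it can be executed
-- together with other files tagged "daily"
config {
  type: "table",
  tags: ["daily"]
}

SELECT order_id, SUM(amount) AS total
FROM ${ref("orders")}
GROUP BY order_id
```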
By default, Dataform executes your SQL workflow with the execution settings defined in the dataform.json file.
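A typical dataform.json sets the default BigQuery project, dataset, and location for the workflow; the values below are placeholders:

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "my-project",
  "defaultSchema": "dataform",
  "defaultLocation": "US",
  "assertionSchema": "dataform_assertions"
}
```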
You can override these execution settings with compilation overrides.
With workspace compilation overrides, you can turn workspaces into isolated execution environments. This means that when you manually trigger execution in a workspace, Dataform executes the output in an isolated location in BigQuery.
To create and execute a single compilation result with compilation overrides, you can send requests through the Dataform API.
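As a sketch, a request body for the Dataform API method that creates a compilation result can carry overrides in its codeCompilationConfig field; the commitish and override values below are hypothetical:

```json
{
  "gitCommitish": "main",
  "codeCompilationConfig": {
    "defaultDatabase": "my-staging-project",
    "schemaSuffix": "_staging",
    "tablePrefix": "test_"
  }
}
```

With overrides like these, the resulting compilation writes its output to a separate project, suffixed datasets, and prefixed table names instead of the defaults from dataform.json.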
With release configurations, you can configure compilation overrides for your whole repository, as well as the frequency of creating compilation results with the applied settings.
To learn more about ways to configure compilation and code lifecycle in Dataform, see Introduction to code lifecycle in Dataform.
What's next
- To learn how to declare a data source, see Declare a data source.
- To learn how to declare dependencies to define relationships between objects in your SQL workflow, see Declare dependencies.
- To learn how to define custom SQL operations, see Add custom SQL operations.
- To learn how to reuse code across your SQL workflow with includes, see Reuse variables and functions with includes.