Dataform overview

This document introduces you to Dataform concepts and processes.

Dataform is a service for data analysts to develop, test, version control, and schedule complex workflows for data transformation in BigQuery.

Dataform lets you manage data transformation in the Extraction, Loading, and Transformation (ELT) process for data integration. After raw data is extracted from source systems and loaded into BigQuery, Dataform helps you to transform it into a well-defined, tested, and documented suite of data tables.

Dataform lets you perform the following data transformation actions:

  • Develop and execute workflows for data transformation.
  • Collaborate with team members on workflow development through Git.
  • Manage a large number of tables and their dependencies.
  • Declare source data and manage table dependencies.
  • View a visualization of the dependency tree of your workflow.
  • Manage data with SQL code in a central repository.
  • Reuse code with JavaScript.
  • Test data correctness with quality tests on source and output tables.
  • Version control SQL code.
  • Document data tables inside SQL code.

Data transformation processes in Dataform

The data transformation workflow for Dataform is as follows:

  1. Dataform lets you create repositories to manage your code.
  2. Dataform lets you create workspaces for development.
  3. Dataform lets you develop Dataform core in a development workspace.
  4. Dataform compiles Dataform core into SQL.
  5. Dataform executes the dependency tree.

Dataform lets you create repositories to manage your code

In a Dataform repository, you use Dataform core, an extension of SQL, to write SQLX files in which you define your workflow. Dataform repositories support version control. You can link a Dataform repository to a third-party Git provider.

Dataform lets you create workspaces for development

You can create development workspaces inside a Dataform repository for Dataform core development. In a development workspace, you can make changes to the repository contents, compile and test them, and then push them to branches of the repository through Git.

Dataform lets you develop Dataform core in a development workspace

In a development workspace, you can define and document tables, their dependencies, and transformation logic to build your workflow. You can also configure actions in JavaScript.
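
For example, actions can be written entirely in JavaScript files in the definitions directory. The following is a minimal sketch that uses the Dataform JavaScript publish function to define a table action; the file name, table name, and source table are illustrative assumptions, not part of this document's examples:

// definitions/orders_summary.js (hypothetical file name)
// Defines a table action in JavaScript instead of SQLX.
publish("orders_summary", {
  type: "table",
  description: "Daily revenue per order date",
}).query(
  (ctx) => `
    SELECT
      order_date,
      SUM(amount) AS revenue
    FROM ${ctx.ref("store_clean")}
    GROUP BY order_date
  `
);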

Dataform compiles Dataform core

During compilation, Dataform performs the following tasks:

  • Compiles Dataform core into a workflow of Standard SQL.
  • Adds boilerplate SQL statements, such as CREATE TABLE or INSERT, to your code, in line with your query configuration.
  • Transpiles (compiles source-to-source) JavaScript into SQL.
  • Resolves dependencies and checks for errors including missing or circular dependencies.
  • Builds the dependency tree of all actions to be run in BigQuery.

Dataform compilation is hermetic to ensure compilation consistency, meaning that the same code compiles to the same SQL compilation result every time. Dataform compiles your code in a sandbox environment with no internet access. No additional actions, such as calling external APIs, are available during compilation.

To debug in real time, you can inspect the compiled workflow of your project in an interactive graph in your development workspace.

Dataform executes the dependency tree

In BigQuery, Dataform performs the following tasks:

  • Executes SQL commands, following the order of the dependency tree.
  • Executes assertion queries against your tables and views to check data correctness.
  • Executes other SQL operations that you defined.

After the execution, you can use your tables and views for all your analytics purposes.

You can view logs to see what tables were created, if assertions passed or failed, how long each action took to complete, and other information. You can also view the exact SQL code that was run in BigQuery.

Dataform features

With Dataform, you can develop and deploy tables, incremental tables, or views to BigQuery. Dataform offers a web environment for the following activities:

  • Workflow development
  • Connection with GitHub, GitLab, Azure DevOps Services, and Bitbucket
  • Continuous integration and continuous deployment
  • Workflow execution

The following sections describe the main features of Dataform.

Repositories

Each Dataform project is stored in a repository. A Dataform repository houses a collection of JSON configuration files, SQLX files, and JavaScript files.

Dataform repositories contain the following types of files, illustrated in the sample layout after the list:

  • Config files

    Config JSON or SQLX files let you configure your workflows. They contain general configuration, execution schedules, or schema for creating new tables and views.

  • Definitions

    Definitions are SQLX and JavaScript files that define new tables, views, and additional SQL operations to run in BigQuery.

  • Includes

    Includes are JavaScript files where you can define variables and functions to use in your project.
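
As a rough sketch, a repository containing these file types might be laid out as follows; the file and directory names are illustrative, not prescriptive:

definitions/      # SQLX and JavaScript files that define tables, views, assertions, and operations
  store_clean.sqlx
  orders.sqlx
includes/         # JavaScript files with reusable constants and functions
  constants.js
dataform.json     # config file with general settings, such as the default project and schema
package.json      # JavaScript package file that lists the project's dependencies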

Each Dataform repository is connected to a service account. You can select a service account when you create a repository or edit the service account later.

By default, Dataform uses a service account derived from your project number in the following format:

service-YOUR_PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com
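
For example, if your project number were 123456789, the default service account would be service-123456789@gcp-sa-dataform.iam.gserviceaccount.com. As a hedged sketch, and depending on the roles your setup requires, you might grant it access to BigQuery data with gcloud; the project ID and role here are illustrative:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:service-123456789@gcp-sa-dataform.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"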

Version control

Dataform uses the Git version control system to maintain a record of each change made to project files and to manage file versions.

Each Dataform repository can manage its own Git repository, or be connected to a remote third-party Git repository. You can connect a Dataform repository to a GitHub, GitLab, Azure DevOps Services, or Bitbucket repository.

Users version control their workflow code inside Dataform workspaces. In a Dataform workspace, you can pull changes from the repository, commit all or selected changes, and push them to Git branches of the repository.

Workflow development

In Dataform, you make changes to files and directories inside a development workspace. A development workspace is a virtual, editable copy of the contents of a Git repository. Dataform preserves the state of files in your development workspace between sessions.

In a development workspace, you can develop SQL workflow actions by using Dataform core with SQLX and JavaScript, or exclusively with JavaScript. You can automatically format your Dataform core or JavaScript code.

Each element of a Dataform workflow, such as a table or assertion, corresponds to an action that Dataform performs in BigQuery. For example, a table definition file corresponds to the action of creating or updating that table in BigQuery.

In a Dataform workspace, you can develop the following workflow actions:

  • Source data declarations
  • Tables, incremental tables, views, and materialized views
  • Data quality tests, called assertions
  • Additional SQL operations
  • Documentation of tables and columns

You can use JavaScript to reuse your Dataform workflow code in the following ways:

  • JavaScript blocks and inline JavaScript, for reuse within a single SQLX file
  • Includes, for reuse across your entire repository

Dataform compiles the workflow code in your workspace in real time. In your workspace, you can view the compiled queries and details of actions in each file. You can also view the compilation status and errors in the edited file or in the repository.

To test the output of a compiled SQL query before you execute it in BigQuery, you can run a preview of the query in your Dataform workspace.

To inspect the entire workflow defined in your workspace, you can view an interactive compiled graph that shows all compiled actions in your workflow and relationships between them.

Workflow compilation

Dataform uses default compilation settings, configured in the workflow settings file, to compile the workflow code in your workspace to SQL in real time, creating a compilation result of the workspace.
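
Depending on your Dataform core version, these defaults live in a dataform.json or workflow_settings.yaml file in the root of the repository. The following is a minimal dataform.json sketch; the project, dataset names, and variable are placeholders:

{
  "warehouse": "bigquery",
  "defaultDatabase": "my-gcp-project",
  "defaultSchema": "dataform",
  "assertionSchema": "dataform_assertions",
  "defaultLocation": "US",
  "vars": {
    "order_status_filter": "delivered"
  }
}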

You can override compilation settings to customize how Dataform compiles your workflow into a compilation result.

With workspace compilation overrides, you can configure compilation overrides for all workspaces in a repository. You can set dynamic workspace overrides to create custom compilation results for each workspace, turning workspaces into isolated development environments. You can override the Google Cloud project in which Dataform executes the contents of a workspace, add a prefix to the names of all compiled tables, and add a suffix to the default schema.

With release configurations, you can configure templates of compilation settings for creating compilation results of a Dataform repository. In a release configuration, you can override the Google Cloud project in which Dataform executes compilation results, add a prefix to the names of all compiled tables, add a suffix to the default schema, and add compilation variables. You can also set the frequency of creating compilation results. To schedule executions of compilation results created in a selected release configuration, you can create a workflow configuration.
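
For example, compilation variables defined in the workflow settings file or overridden in a release configuration can be read in SQLX code through dataform.projectConfig.vars. The variable name below is an illustrative assumption:

config { type: "view" }

SELECT *
FROM ${ref("store_clean")}
WHERE order_status = "${dataform.projectConfig.vars.order_status_filter}"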

Workflow execution

During workflow execution, Dataform executes compilation results of workflows to create or update assets in BigQuery.

To create or refresh the tables and views defined in your workflow in BigQuery, you can start a workflow execution manually in a development workspace or schedule executions.

You can schedule Dataform executions in BigQuery in the following ways:

  • Create workflow configurations to schedule executions of compilation results created in release configurations.
  • Schedule executions with Cloud Composer.
  • Schedule executions with Workflows and Cloud Scheduler.

To debug errors, you can monitor executions in the following ways:

  • View workflow execution logs in Dataform.
  • View Dataform logs in Cloud Logging.

Dataform core

Dataform core is an open source meta-language to create SQL tables and workflows. Dataform core extends SQL by providing a dependency management system, automated data quality testing, and data documentation.

You can use Dataform core for the following purposes:

  • Defining tables, views, materialized views, or incremental tables
  • Defining data transformation logic
  • Declaring source data and managing table dependencies (see the declaration sketch after this list)
  • Documenting table and column descriptions inside code
  • Reusing functions and variables across different queries
  • Writing data assertions to ensure data consistency
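
For example, source data that already exists in BigQuery can be declared in its own SQLX file so that other files can reference it with the ref function. The following is a minimal sketch; the project, dataset, and table names are illustrative assumptions:

config {
  type: "declaration",
  database: "my-gcp-project",
  schema: "raw_data",
  name: "store"
}

Other SQLX files can then reference the declared source as ${ref("store")}, and Dataform adds it to the dependency tree.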

In Dataform, you use Dataform core to develop workflows and deploy assets to BigQuery.

Dataform core is part of the open source Dataform data modeling framework, which also includes the Dataform CLI. You can compile and run Dataform core locally through the Dataform CLI, outside of Google Cloud.
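
As a brief sketch of local use, assuming Node.js is installed and BigQuery credentials are configured for the CLI, you might install the CLI and then compile and run a repository from its root directory:

npm install -g @dataform/cli    # install the Dataform CLI
dataform compile                # compile the repository and print the compilation result
dataform run                    # execute the compiled dependency tree in BigQuery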

To use Dataform core, you write SQLX files. Each SQLX file contains a query that defines a database relation that Dataform creates and updates inside BigQuery.

Dataform compiles your Dataform core code in real time to create a SQL compilation result that you can execute in BigQuery.

As noted earlier, Dataform compilation is hermetic: the same code always compiles to the same SQL compilation result, in a sandbox environment with no internet access.

SQLX file config block

A SQLX file consists of a config block and a body. All config properties, and the config block itself, are optional. Given this, any plain SQL file is a valid SQLX file that Dataform executes as-is.

In the config block, you can perform the following actions:

Specify query metadata

Using config metadata, you can configure how Dataform materializes queries in BigQuery, for example, the output table type, the target database, or labels.

Document data

You can document your tables and their fields directly in the config block. Documentation of your tables is pushed directly to BigQuery. You can parse this documentation and push it out to other tools.

Define data quality tests

You can define data quality tests, called assertions, to check for uniqueness, null values, or a custom condition. Dataform adds assertions defined in the config block to your workflow dependency tree after table creation. You can also define assertions outside the config block, in a separate SQLX file.
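
For example, a standalone assertion is a SQLX file of type "assertion" whose query should return zero rows; any rows it returns cause the assertion to fail. The following is a minimal sketch, and the negative-amount condition is an illustrative assumption:

config { type: "assertion" }

SELECT *
FROM ${ref("store_clean")}
WHERE amount < 0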

The following code sample shows you how to define the output table type, document the table, and define a quality test in a config block of a SQLX file.

config {
  type: "table",
  description: "This table joins orders information from OnlineStore & payment information from PaymentApp",
  columns: {
    order_date: "The date when a customer placed their order",
    id: "Order ID as defined by OnlineStore",
    order_status: "The status of an order e.g. sent, delivered",
    customer_id: "Unique customer ID",
    payment_status: "The status of a payment e.g. pending, paid",
    payment_method: "How the customer chose to pay",
    item_count: "The number of items the customer ordered",
    amount: "The amount the customer paid"
  },
  assertions: {
    uniqueKey: ["id"]
  }
}

SQLX file body

In the body of a SQLX file, you can perform the following actions:

Define a table

To define a new table, you can use SQL SELECT statements and the ref function.

The ref function is a SQLX built-in function that is critical to dependency management in Dataform. The ref function lets you reference tables defined in your Dataform project instead of hard coding the schema and table names of your data table.

Dataform uses the ref function to build a dependency tree of all the tables to be created or updated. During compilation, Dataform adds boilerplate statements such as CREATE OR REPLACE TABLE or INSERT.

The following code sample shows you how to reference a table in a SQLX file with the ref function.

config { type: "table" }

SELECT
  order_date AS date,
  order_id AS order_id,
  order_status AS order_status,
  SUM(item_count) AS item_count,
  SUM(amount) AS revenue

FROM ${ref("store_clean")}

GROUP BY 1, 2, 3

The output is similar to the following:

CREATE OR REPLACE TABLE Dataform.orders AS

SELECT
  order_date AS date,
  order_id AS order_id,
  order_status AS order_status,
  SUM(item_count) AS item_count,
  SUM(amount) AS revenue

FROM Dataform_stg.store_clean

GROUP BY 1, 2, 3

For more information about dependency management, for example, executing code conditionally or using other Dataform core built-in functions, see the Dataform core reference.

Define additional SQL operations

To configure Dataform to execute one or more SQL statements before or after creating a table or view, you can specify pre-query and post-query operations.

The following code sample shows you how to configure table or view access permissions in a post-query operation.

SELECT * FROM ...

post_operations {
  GRANT `roles/bigquery.dataViewer` ON TABLE ${self()} TO "group:someusers@dataform.co"
}
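
Pre-query operations follow the same pattern with a pre_operations block. The following minimal sketch assumes that the declared variable is visible to the generated table query, which runs in the same BigQuery script; the variable name and date value are illustrative:

config { type: "table" }

pre_operations {
  DECLARE report_start_date DATE DEFAULT DATE '2024-01-01'
}

SELECT *
FROM ${ref("store_clean")}
WHERE order_date >= report_start_date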

Encapsulate SQL code

To define reusable functions to generate repetitive parts of SQL code, you can use JavaScript blocks. You can reuse code defined in a JavaScript block only inside the SQLX file where the block is defined. To reuse code across your entire repository, you can create includes.
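
For example, a file in the includes directory exports constants or functions, and other files reference them by the file name. The following sketch assumes a hypothetical includes/constants.js file:

// includes/constants.js
const PROJECT_START_DATE = "2020-01-01";
module.exports = { PROJECT_START_DATE };

A SQLX file anywhere in the repository can then use the exported constant inline:

config { type: "view" }

SELECT *
FROM ${ref("store_clean")}
WHERE order_date > "${constants.PROJECT_START_DATE}"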

To dynamically modify a query, you can use inline JavaScript anywhere in the body.

The following code sample shows how to define a JavaScript block in a SQLX file and use it inline inside a query:

js {
  const columnName = "foo";
}

SELECT 1 AS ${columnName} FROM "..."

Limitations

Dataform has the following known limitations:

  • Dataform in Google Cloud runs on a plain V8 runtime and does not support additional capabilities and modules provided by Node.js. If your existing codebase requires any Node.js modules, you need to remove these dependencies.

    Projects without a name field in package.json generate diffs on package-lock.json every time packages are installed. To avoid this outcome, you need to add a name property in package.json.

  • git+https:// URLs for dependencies in package.json are not supported.

    Convert such URLs to plain https:// archive URLs. For example, convert git+https://github.com/dataform-co/dataform-segment.git#1.5 to https://github.com/dataform-co/dataform-segment/archive/1.5.tar.gz.

  • Manually running unit tests is not available.

  • Searching for file content in development workspaces is not available.

  • As of Dataform core 3.0.0, Dataform doesn't distribute a Docker image. You can build your own Docker image of Dataform, which you can use to run the equivalent of Dataform CLI commands. To build your own Docker image, see Containerize an application in the Docker documentation.

  • Some Dataform API methods don't comply with the AIP-134 guidelines: they treat the * wildcard entry as a bad request, and when field_mask is omitted they update all fields instead of only the fields that are set.

What's next