Structuring code in a repository

This document describes best practices for structuring and naming SQL workflow files in the root definitions directory of a Dataform repository. The recommended structure of the definitions directory reflects the stages of a SQL workflow. You can adopt any structure that fits your business needs.

You might want to structure the SQL workflow code in the definitions directory for the following reasons:

  • Improving collaboration on the codebase by assigning teams to selected parts of your workflow.
  • Improving maintainability of the SQL workflow in case of organizational changes.
  • Improving navigation through your codebase.
  • Improving scalability of the codebase.
  • Minimizing the administrative overhead for your team.

The root definitions directory in a Dataform repository contains code that creates elements of your SQL workflow. You can organize files in the definitions directory into a structure of directories that reflects the structure of the workflow.

When you develop a SQL workflow, you declare source tables and transform them to create output tables that you can use for business or analytics purposes.

You can distinguish three key stages of a SQL workflow:

  1. Declaration of data sources
  2. Transformation of source data
  3. Definition of output tables from the transformed source data

The following structure of subdirectories in the definitions directory reflects the key stages of a SQL workflow:

sources
Data source declarations and basic transformation of source data, for example, filtering.
intermediate
Tables and actions that read from sources and transform data before you use the transformed data to define outputs tables. Intermediate tables are typically not exposed to additional processes or tools, such as business intelligence (BI) tools, after Dataform executes them in BigQuery.
outputs
Definitions of tables consumed by processes or tools, such as BI tools, after Dataform executes them in BigQuery.
extra
Files outside of the main pipeline of your SQL workflow, for example, files that contain workflow data prepared for additional use, like machine learning. An optional and custom subdirectory.
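
The stages above correspond to the following top-level layout of the definitions directory (the extra subdirectory is optional):

```
definitions/
    sources/
    intermediate/
    outputs/
    extra/
```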

Best practices for sources

The sources subdirectory contains the first stage of your SQL workflow: declaration and basic transformation of source data.

In the sources subdirectory, store data source declarations and tables that filter, categorize, cast, or rename columns.
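
For illustration, a single data source declaration in the sources subdirectory can be a SQLX file that contains only a config block. The project, dataset, and table names below are placeholders:

```sqlx
config {
  type: "declaration",
  database: "gcp_project",
  schema: "source_dataset",
  name: "google_analytics_users"
}
```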

Avoid storing tables that combine data from multiple sources.

Instead, transform source data in tables stored in the intermediate subdirectory.

If you declare data sources from multiple pools, for example, Google Ads or Google Analytics, dedicate a subdirectory to each pool.

The following sample shows a subdirectory structure of sources with two source pools:

definitions/
    sources/
        google_ads/
            google_ads_filtered.sqlx
            google_ads_criteria_metrics.sqlx
            google_ads_criteria_metrics_filtered.sqlx
            google_ads_labels.sqlx
            google_ads_labels_filtered.sqlx
        google_analytics/
            google_analytics_users.sqlx
            google_analytics_users_filtered.sqlx
            google_analytics_sessions.sqlx

If you declare multiple data source tables within the same schema, you can consolidate their declarations into a single JavaScript file. For more information about creating data source declarations with JavaScript, see Create Dataform SQL workflows with JavaScript.

The following code sample shows multiple data sources within one schema, declared in a single JavaScript file:

[
  "source_table_1",
  "source_table_2",
  "source_table_3"
].forEach((name) =>
  declare({
    database: "gcp_project",
    schema: "source_dataset",
    name,
  })
);

To protect your SQL workflow against data source changes, you can create a view for each data source declaration, for example, analytics_users_filtered.sqlx. The view can contain basic filtering and formatting of source data. Store the views in the sources subdirectory.

Then, when you create intermediate or outputs tables, reference the views instead of the raw source tables. This approach lets you test the source tables. If a source table changes, you can modify its view, for example, by adding filters or recasting data.
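
A minimal sketch of such a view, assuming a declared source named google_analytics_users with hypothetical user_id and session_start columns:

```sqlx
config {
  type: "view"
}

select
  user_id,
  cast(session_start as timestamp) as session_start
from ${ref("google_analytics_users")}
where user_id is not null
```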

Best practices for intermediate

The intermediate subdirectory contains the second stage of your SQL workflow: transformation and aggregation of source data from one or multiple sources.

In the intermediate subdirectory, store files that significantly transform source data from one or multiple sources in the sources subdirectory, for example, tables that join data. Tables in the intermediate subdirectory typically query data from source tables or other intermediate tables.

Use intermediate tables to create outputs tables. Typically, intermediate tables are not used for additional purposes, for example, data analytics, after Dataform executes them in BigQuery. You can think of intermediate tables as the data transformation logic that enables creation of output tables.

We recommend that you document and test all intermediate tables.
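
For example, an intermediate table that joins two filtered sources can carry its documentation and tests in the config block. The table and column names below are assumptions for illustration:

```sqlx
config {
  type: "table",
  description: "Ad interactions joined to user sessions.",
  columns: {
    user_id: "Unique user identifier.",
    campaign_id: "Identifier of the ads campaign."
  },
  assertions: {
    nonNull: ["user_id"],
    uniqueKey: ["user_id", "campaign_id"]
  }
}

select
  users.user_id,
  ads.campaign_id
from ${ref("google_analytics_users_filtered")} as users
inner join ${ref("google_ads_filtered")} as ads
  on users.user_id = ads.user_id
```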

Best practices for outputs

The outputs subdirectory contains the final stage of your SQL workflow: creation of output tables for your business purposes from transformed data.

In the outputs directory, store tables that you plan to use in additional processes or tools after Dataform executes them in BigQuery, for example, reports or dashboards. Tables in the outputs directory typically query data from intermediate tables.

Group outputs tables by the business entity they are related to, for example, marketing, orders, or analytics. Dedicate a subdirectory to each business entity.

To store output tables separately in BigQuery, you can configure a dedicated schema for output tables. For instructions to configure table schema, see Configure additional table settings.
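
For example, assuming a dedicated BigQuery dataset named reporting and an intermediate table named stg_analytics_concept, an output table can override its schema in the config block:

```sqlx
config {
  type: "table",
  schema: "reporting"
}

select * from ${ref("stg_analytics_concept")}
```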

The following sample shows a subdirectory structure of outputs with two business entities:

definitions/
    outputs/
        orders/
            orders.sqlx
            returns.sqlx
        sales/
            sales.sqlx
            revenue.sqlx
        marketing/
            campaigns.sqlx
        

We recommend that you document and test all outputs tables.

Naming strategy

Names of all files in Dataform must conform to BigQuery table naming guidelines.

We recommend that the names of files in the definitions directory in a Dataform repository reflect the subdirectory structure.

In the sources subdirectory, the filenames should point to the source that the file is related to. Add the name of the source as a prefix to filenames, for example, analytics_filtered.sqlx.

In the intermediate subdirectory, the filenames should identify the subdirectory, so that collaborators can clearly distinguish intermediate files. Select a unique prefix, and apply it only to files in the intermediate directory. For example, stg_ads_concept.sqlx.

In the outputs subdirectory, the filenames should be concise, for example, orders.sqlx. If you have outputs tables with the same names in different entity subdirectories, add a prefix that identifies the entity, for example, sales_revenue.sqlx and ads_revenue.sqlx.

The following example shows a subdirectory structure inside the definitions directory with filenames that conform to the recommended naming strategy:

definitions/
    sources/
        google_analytics.sqlx
        google_analytics_filtered.sqlx
    intermediate/
        stg_analytics_concept.sqlx
    outputs/
        customers.sqlx
        sales/
            sales.sqlx
            sales_revenue.sqlx
        ads/
            campaigns.sqlx
            ads_revenue.sqlx

What's next