This document describes best practices for structuring and naming SQL workflow files in the root definitions
directory of a Dataform repository. The recommended structure of the definitions
directory reflects the stages
of a SQL workflow. You can adopt any structure that fits your business needs.
You might want to structure the SQL workflow code
in the definitions
directory for the following reasons:
- Improving collaboration on codebase by designating teams to selected parts of your workflow.
- Improving maintainability of the SQL workflow in case of organizational changes.
- Improving navigation through your codebase.
- Improving scalability of the codebase.
- Minimizing the administrative overhead for your team.
Recommended structure of the definitions
directory
The root definitions
directory in a Dataform repository
contains code that creates elements of your SQL workflow. You can organize
files in the definitions
directory into a structure of directories
that reflects the structure of the workflow.
When you develop a SQL workflow, you declare source tables and transform them to create output tables that you can use for business or analytics purposes.
You can distinguish three key stages of a SQL workflow:
- Declaration of data sources
- Transformation of source data
- Definition of output tables from the transformed source data
The following structure of subdirectories in the definitions
directory
reflects the key stages of a SQL workflow:
sources
- Data source declarations and basic transformation of source data, for example, filtering.
intermediate
- Tables and actions that read from
sources
and transform data before you use the transformed data to defineoutputs
tables. Tables typically not exposed to additional processes or tools, such as business intelligence (BI) tools, after Dataform executes them to BigQuery. outputs
- Definitions of tables consumed by processes or tools, such as BI, after Dataform executes them in BigQuery.
extra
- Files outside of the main pipeline of your SQL workflow, for example, files that contain workflow data prepared for additional use, like machine learning. An optional and custom subdirectory.
Best practices for sources
The sources
subdirectory contains the first stage of your SQL workflow:
declaration and basic transformation of source data.
In the sources
subdirectory, store data source declarations and tables
that filter, categorize, cast, or rename columns.
Avoid storing tables that combine data from multiple sources.
Transform sources
data in tables stored in the intermediate
subdirectory.
If you declare data sources from multiple pools, for example, Google Ads or Google Analytics, dedicate a subdirectory to each pool.
The following sample shows a subdirectory structure of sources
with two source pools:
definitions/
sources/
google_ads/
google_ads_filtered.sqlx
google_ads_criteria_metrics.sqlx
google_ads_criteria_metrics_filtered.sqlx
google_ads_labels.sqlx
google_ads_labels_filtered.sqlx
google_analytics/
google_analytics_users.sqlx
google_analytics_users_filtered.sqlx
google_analytics_sessions.sqlx
If you declare multiple data source tables within the same schema, you can consolidate their declarations into a single JavaScript file. In a JavaScript file, you can store multiple data source declarations. For more information about creating data source declarations with JavaScript, see Create Dataform SQL workflows with JavaScript.
The following code sample shows multiple data sources within one schema, declared in a single JavaScript file:
[
"source_table_1",
"source_table_2",
"source_table_3"
].forEach((name) =>
declare({
database: "gcp_project",
schema: "source_dataset",
name,
})
);
To protect your SQL workflow against data source changes, you can create a view
for each data source declaration, for example, analytics_users_filtered.sqlx
.
The view can contain basic filtering and formatting of source data.
Store the views in the sources
subdirectory.
Then, when you create intermediate
or outputs
tables, reference the views
instead of raw source tables. This approach lets you test the source tables.
In case a source table changes, you can modify its view, for example,
by adding filters or recasting data.
Best practices for intermediate
The intermediate
subdirectory contains the second stage of your SQL workflow:
transformation and aggregation of source data from one or multiple sources.
In the intermediate
subdirectory, store files that significantly transform
source data from one or multiple sources in the sources
subdirectory,
for example, tables that join data. Tables in the intermediate
subdirectory
typically query data from source tables or other intermediate
tables.
Use intermediate
tables to create outputs
tables.
Typically, intermediate
tables are not used for additional purposes,
for example,
data analytics, after Dataform executes them to BigQuery.
You can think of intermediate
tables as the data transformation logic
that enables creation of output tables.
We recommend that you document and
test all intermediate
tables.
Best practices for outputs
The outputs
subdirectory contains the final stage of your SQL workflow:
creation of output tables for your business purposes from transformed data.
In the outputs
directory, store tables that you plan to use in additional
processes or tools after Dataform executes them to BigQuery,
for example, reports or dashboards. Tables in the outputs
directory typically
query data from intermediate
tables.
Group outputs
tables by the business entity they are related to, for example,
marketing, orders, or analytics. Dedicate a subdirectory to each business entity.
To store output tables separately in BigQuery, you can configure a dedicated schema for output tables. For instructions to configure table schema, see Configure additional table settings.
The following sample shows a subdirectory structure of outputs
with sales
and marketing
business entities:
definitions/
outputs/
orders/
orders.sqlx
returns.sqlx
sales/
sales.sqlx
revenue.sqlx
marketing/
campaigns.sqlx
We recommend that you document and
test all outputs
tables.
Naming strategy
Names of all files in Dataform must conform to BigQuery table naming guidelines.
We recommend that the names of files in the definitions
directory
in a Dataform repository reflect the subdirectory structure.
In the sources
subdirectory, the filenames should point to the source
that file is related to. Add the name of the source as a prefix to filenames,
for example, analytics_filtered.sqlx
In the intermediate
subdirectory, the filenames should identify the subdirectory,
so that collaborators can clearly distinguish intermediate
files.
Select a unique prefix, and apply it only to files in the intermediate
directory. For example, stg_ads_concept.sqlx
.
In the outputs
subdirectory, the filenames should be concise,
for example orders.sqlx
. If you have outputs
tables with the same names in
different entity subdirectories, add a prefix that identifies the entity,
for example, sales_revenue.sqlx
and ads_revenue.sqlx
.
The following example shows a subdirectory structure inside the definitions
directory with filenames that conform to the recommended naming strategy:
definitions/
sources/
google_analytics.sqlx
google_analytics_filtered.sqlx
intermediate/
stg_analytics_concept.sqlx
outputs/
customers.sqlx
sales/
sales.sqlx
sales_revenue.sqlx
ads/
campaigns.sqlx
ads_revenue.sqlx
What's next
- To learn more about SQL workflows in Dataform, see Introduction to SQL workflows.
- To learn more about Dataform repositories, see Introduction to repositories.
- To learn more about splitting repositories, see
Introduction to splitting repositories.