This document explains how repository size affects SQL workflow development and compilation resource usage in Dataform, and how to estimate the compilation resource usage of your repository.
About repository size in Dataform
The size of a repository impacts the following aspects of development in Dataform:
- Collaboration: Multiple collaborators working on a large repository can create an excessive number of pull requests, increasing the risk of merge conflicts.
- Codebase readability: A larger number of files that make up a SQL workflow in a single repository can make it difficult to navigate through the repository.
- Development processes: Some areas of a large SQL workflow in a single repository might require custom permissions or processes, such as scheduling, different from the permissions and processes applied to the rest of the SQL workflow. Large repository size makes it difficult to tailor development processes to specific areas of the SQL workflow.
- Workflow compilation: Dataform enforces usage limits on compilation resources. Large repository size can lead to exceeding these limits, causing compilation to fail.
- Workflow execution: During execution, Dataform executes all repository code inside your workspace and deploys assets to BigQuery. The larger the repository, the more time it takes Dataform to execute it.
If the large size of your repository negatively impacts your development in Dataform, you can split the repository into multiple smaller repositories.
About repository compilation resources limits
During development, Dataform compiles all repository code inside your workspace to generate a representation of the SQL workflow in your repository, called a compilation result. Dataform enforces usage limits on compilation resources.
Your repository might exceed the usage limits for the following reasons:
- An infinite loop bug in the repository code.
- A memory leak bug in the repository code.
- A large repository, roughly more than 1,000 SQL workflow nodes.
For more information on usage limits on compilation resources, see Dataform compilation resources limits.
Estimate compilation resources usage of your repository
You can estimate the usage of the following compilation resources for your repository:
- CPU time usage
- Maximum total serialized data size of the generated graph of actions defined in your repository
To get a rough approximation of the CPU time currently used to compile your repository, you can time the compilation of your Dataform SQL workflow on a local Linux or macOS machine.
- To time the compilation of your SQL workflow, inside your repository, execute the Dataform CLI dataform compile command in the following format:
time dataform compile
The following code sample shows a result of executing the time dataform compile command:
real 0m3.480s
user 0m1.828s
sys 0m0.260s
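You can treat the real result as a rough indicator of the CPU time usage for the compilation of your repository. Strictly speaking, real is elapsed wall-clock time, while the sum of user and sys is the CPU time the process consumed, so real tends to overstate CPU time when the process spends time waiting.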
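As a rough sketch, the user and sys values reported by time can be summed to get the CPU seconds the process consumed, which you can read alongside the real value. The cpu_seconds helper below is a hypothetical name, not part of the Dataform CLI, and it assumes the default bash time format shown in the sample output (for example, 0m1.828s).

```shell
# cpu_seconds: hypothetical helper, not part of the Dataform CLI.
# Reads `time` output on stdin and prints user + sys in seconds.
# Assumes the default bash format, for example "user 0m1.828s".
cpu_seconds() {
  awk '
    $1 == "user" || $1 == "sys" {
      split($2, t, "m")        # "0m1.828s" -> t[1] = "0", t[2] = "1.828s"
      sub(/s$/, "", t[2])      # drop the trailing "s"
      total += t[1] * 60 + t[2]
    }
    END { printf "%.3f\n", total }
  '
}

# Applied to the sample output shown above:
printf 'real 0m3.480s\nuser 0m1.828s\nsys 0m0.260s\n' | cpu_seconds   # prints 2.088

# In bash, inside a Dataform repository, a real run could be piped in like this:
# { time dataform compile > /dev/null; } 2>&1 | cpu_seconds
```

Other shells, such as zsh, print timing information in a different format, so the parsing would need to be adapted there.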
To obtain a rough approximation of the total size of the generated graph of actions in your repository, you can write the output of the graph to a JSON file. You can treat the size of the uncompressed JSON file as a rough indicator of the total graph size.
- To write the output of the compiled graph of your SQL workflow to a JSON file, inside your repository, execute the following Dataform CLI command:
dataform compile --json > graph.json
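Once graph.json exists, its size can be read and compared against a limit. The following is a minimal sketch: graph_size_bytes and graph_size_percent are hypothetical helper names, and the limit is a value you supply yourself from Dataform compilation resources limits, since it is not defined here.

```shell
# Hypothetical helpers, not part of the Dataform CLI.

# Print the size of a file in bytes.
graph_size_bytes() {
  # wc -c counts bytes; tr removes the padding some wc versions add
  wc -c < "$1" | tr -d ' '
}

# Print the file size as a percentage of a caller-supplied limit in bytes.
# The limit is an input: look it up in the Dataform limits documentation.
graph_size_percent() {
  awk -v size="$(graph_size_bytes "$1")" -v limit="$2" \
    'BEGIN { printf "%.1f\n", 100 * size / limit }'
}

# Usage, after running: dataform compile --json > graph.json
# graph_size_bytes graph.json
# graph_size_percent graph.json LIMIT_IN_BYTES
```

Tracking these numbers over time gives an early signal that a repository is growing toward the limit, before compilation starts to fail.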
What's next
- To learn more about Dataform compilation resource limits, see Dataform compilation resources limits.
- To learn more about splitting a repository in Dataform, see Splitting repositories.