Keep up with the latest announcements from Google Cloud Next '21. Click here.

Developers & Practitioners

The BigQuery admin reference guide: Resource Hierarchy

Starting this week, we’re adding new content to the BigQuery Spotlight Youtube series. Throughout the summer we’ll be adding new videos and blog posts focused on helping new BigQuery architects and administrators master the fundamentals. You can find complimentary material for the topics discussed in the official BigQuery documentation, which is linked below. First up, the BigQuery Resource Model!

BigQuery, like other Google Cloud resources, is organized hierarchically where the Organization node is the root node, the Projects are the children of the Organization, and Datasets are descendants of Projects. In this post, we will look closer at the BigQuery resource model and discuss key considerations for architecting your deployment based on business needs. 

BigQuery core Resource Model 

BigQuery Core Resource Model

Organizations, folders & billing accounts

The Organization resource is the root node of the Google Cloud resource hierarchy. It represents a company and is closely associated with your organization’s domain by being linked to one Google Workspace or Cloud Identity account. While an Organization is not required to get started using BigQuery, it is recommended. With an Organization resource, projects belong to your organization instead of the employee who created the project. Furthermore, organization administrators have central control of all resources. 

Folders are an additional grouping mechanism on top of Projects. They can be seen as sub-organizations within the Organization. Folders can be used to model different legal entities, departments, and teams within a company. Folders act as a policy inheritance point - IAM roles granted on a folder are automatically inherited by all Projects and folders included in that folder. For BigQuery flat-rate customers, slots (units of CPU) can be assigned to Organizations, Folders or Projects where they are distributed fairly among projects to handle job workloads

A Billing Account is required to use BigQuery, unless you are using the BigQuery sandbox. Many times, different teams will want to be billed individually for consuming resources in Google Cloud. Therefore, each billing group will have its own billing account, which results in a single invoice and is tied to a Google Payments profile.  

Projects

A Project is required to use BigQuery and forms the basis for using all Google Cloud services. Projects are analogous to databases in other systems.
Projects
  bq query \
--project_id=project
--use_legacy_sql=false \
'SELECT *
FROM
project1.dataset.table as table1
LEFT JOIN project2.dataset.table as table2
ON table1.col = table2.col'

A project is used both for storing data and for running jobs on, or querying, data. And because storage and compute are separate, these don’t need to be the same project. You can store your data in one project and query it from another, this includes combining data stored in multiple projects in a single query. A project can have only one billing account, the project will be billed for data stored in the project and jobs run in the project. Watch out for per-project limitations and quotas. 

Datasets

Datasets are top-level containers, within a project, that are used to organize and control access to tables, views, routines and machine learning models. A table must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.

Datasets

Your data will be stored in the geographic location that you chose at the dataset's creation time. After a dataset has been created, the location can't be changed. One important consideration is that you will not be able to query across multiple locations, you can read details on location considerations here. Many users chose to store their data in a multi-region location, however some chose to set a specific region that is close to on-premise databases or ETL jobs.

Access controls

Access to data within BigQuery can be controlled at different levels in the resource model, including the Project, Dataset, Table or even column. However, it’s often easier to control access higher in the hierarchy for simpler management.  


Examples of common BigQuery project structures:

By now you probably realize that deciding on a Project structure can have a big influence on data governance, billing and even query efficiency. Many customers chose to deploy some notion of data lakes and data marts by leveraging different Project hierarchies. This is mainly a result of cheap data storage, more advanced SQL offerings which allow for ELT workloads and in-database transformations, plus the separation of storage and compute inside of BigQuery

Central data lake, department data marts

With this structure, there is a common project that stores raw data in BigQuery (Unified Storage project), also referred to as a Data Lake. It’s common for a centralized data platform team to create a pipeline that actually ingest data from various sources into BigQuery within this project. Each department or team would then have their own datamart projects (e.g. Department A Compute) where they can query the data, save results and create aggregate views.


unified storage

How it works:

  • Central data engineering team is granted permission to ingest and edit data in the storage project
  • Department analysts are granted BigQuery Data Viewer role for specific datasets in the Unified Storage project
  • Department analysts are also granted BigQuery Data Editor role and BigQuery Job User role for their department’s compute project
  • Each compute project would be connected to the team’s billing account

This is especially useful for when:

  • Each business unit wants to be billed individually for their queries
  • There is a centralized platform or data engineering team that ingests data into BigQuery across business units
  • Different business units access their data in their own tools or directly in the console
  • You need to avoid too many concurrent queries running in the same project (due to per-project quotas)

Department data lakes, one common data warehouse project

With this option, data for each department is ingested into separate projects - essentially giving each department their own data lake. Analysts are then able to query these datasets or create aggregate views in a central data warehouse project, which can also easily be connected to a business intelligence tool.

Data as storage

How it works:

  • Data engineers who are responsible for ingesting specific data sources are granted BQ Data Editor and BQ Job User roles in their department’s storage project
  • Analysts are granted BQ Data Viewer role to underlying data at the project level, for example an HR analyst might be granted data viewer access to the entire HR storage project
  • Service accounts that are used to connect BigQuery to external business intelligence tools can be also be granted data viewer access to specific projects that contain datasets to be used in visualizations
  • Analysts and Service Account are then granted BQ Job User and BQ Data Editor roles in the Central Data Warehouse project

This is especially useful for when:

  • It’s easier to manage raw data access at the project / department level
  • Central analytics team would rather have a single project for compute, which could make monitoring queries simpler 
  • Users are accessing data from a centralized business intelligence tool
  • Slots can be assigned to the data warehouse project to handle all queries from analysts and external tools

Note that this structure may result in a lot of concurrent queries, so watch out for per-project limitations. This structure works best for flare-rate customers with lifted concurrency limits.  

Department data lakes and department data marts

Here, we combine the previous approaches and create a data lake or storage project for each department. Additionally, each department might have their own datamart project where analysts can run queries.

a data lake

How it works:

  • Department data engineers are granted BQ Data Editor and BQ Job User roles for their department’s data lake project
  • Department data analysts are granted BQ Viewer roles for their department’s data lake project
  • Department data analysts are granted BQ Data Editor and BQ Job User roles for their department’s data mart project
  • Authorized views and authorized user defined functions can be leveraged to give data analysts access to data in projects where they themselves don’t have access

This is especially useful for when:

  • Each business unit wants to be billed individually both for data storage and compute
  • Different business units access their data in their own tools or directly in the console
  • You need to avoid too many concurrent queries running in the same project
  • It’s easier to manage raw data access at the project / department level

Now you should be well on your way to beginning to architect your BigQuery Data Warehouse! Be sure to keep an eye out for more in this series by following me on LinkedIn and Twitter!