BigQuery and surrogate keys: a practical approach
Marco Tranquilin
Cloud Consultant, Professional Services, Google Cloud
When working with tables in data warehouse environments, it is fairly common to come across a situation in which you need to generate surrogate keys. A surrogate key is a system-generated identifier that uniquely identifies a record within a table.
Why use surrogate keys? Quite simply: unlike natural keys, they persist over time (i.e. they are not tied to any business meaning) and they allow for an unlimited number of values. Think about a table collecting IoT data from multiple devices in different regions: since you want to store unbounded data and device identifiers could overlap across regions, a surrogate key helps you uniquely identify each record. Another use case is an "items" table in which you want to store not only information about articles but also the history of changes made to them: here too, surrogate keys are an elegant way to assign a unique identifier to every record.
How to generate surrogate keys
A common way to generate surrogate keys is to assign an incremental number to each row of a table: you can accomplish this goal using the standard SQL function ROW_NUMBER() OVER({window_name | (window_specification)})
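As a minimal sketch of this pattern (MyPrj.MyDataset.MyTable is a hypothetical table name used for illustration):
SELECT
  ROW_NUMBER() OVER() AS SurrogateKey,  -- incremental number per row, computed over the whole table
  *
FROM `MyPrj.MyDataset.MyTable`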
Unfortunately, this approach is limited: to implement ROW_NUMBER(), BigQuery needs to sort values at the root node of the execution tree, and that sort is bounded by the amount of memory available on a single execution node.
The general solution is to add a PARTITION BY clause to the window specification, and then combine the per-partition row number with the partition fields to get a unique id per row.
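For example, a sketch of that idea (again with a hypothetical table, and assuming a STRING column named region to partition on):
SELECT
  -- the row number is only unique within each region, so combining it with the
  -- partition field yields a key that is unique across the whole table
  CONCAT(region, '-', CAST(ROW_NUMBER() OVER(PARTITION BY region) AS STRING)) AS SurrogateKey,
  *
FROM `MyPrj.MyDataset.MyTable`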
An alternative approach
Another solution that you might adopt, especially in streaming scenarios, relies on hashing functions. Since this pattern computes the surrogate key values at run time, it can be applied to datasets of any size. The idea is to compute the SHA256 digest of selected record fields and use the result as the surrogate key.
Two approaches are possible:
Use an INSERT INTO statement when new records have to be inserted into the table (adapt the inner query to select the data that goes into the target table):
INSERT INTO `MyPrj.MyDataset.Wiki100B_With_SK`
(SurrogateKey,year,month,day,wikimedia_project,language,title,views)
SELECT (SHA256(title)) AS SurrogateKey,*
FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`
Use an UPDATE statement when existing records have to be updated (adapt the filters to select the data to be updated):
UPDATE `MyPrj.MyDataset.Wiki100B_With_SK`
SET SurrogateKey = (SHA256(title))
WHERE year = 2010
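Note that SHA256() returns a BYTES value. If you prefer a human-readable key, one optional variation (not required by the pattern) is to wrap the digest in TO_HEX() so it is stored as a hexadecimal STRING:
SELECT TO_HEX(SHA256(title)) AS SurrogateKey,  -- hex-encoded string instead of raw bytes
*
FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`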
When you plan your activities, please note that you are limited to 1,000 INSERT operations per table per day and 200 UPDATE operations per table per day.
If the source table contains duplicate records, a random value can be concatenated to the selected fields before computing the digest: this reduces the probability of key collisions.
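As a sketch of that idea, RAND() can serve as the random value (note that the resulting key becomes non-deterministic, i.e. it can no longer be re-derived from the record fields alone):
SELECT SHA256(CONCAT(title, CAST(RAND() AS STRING))) AS SurrogateKey,  -- random salt lowers collision risk for duplicate titles
*
FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`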
Another way to achieve the same result is the GENERATE_UUID() function, which generates a unique identifier for each row of the query result at run time. The generated key is a lowercase STRING consisting of 32 hexadecimal digits in five groups separated by hyphens, in the form 8-4-4-4-12.
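For example, the SELECT from the INSERT statement above could compute its key this way instead:
SELECT GENERATE_UUID() AS SurrogateKey,  -- e.g. 4192bff0-e1e0-43ce-a4db-912808c32493
*
FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`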
Conclusion
Surrogate keys are common in data warehouse environments because they are:
Context-free
More future-compatible
Potentially infinitely scalable
BigQuery provides end users with the ability to easily deal with surrogate keys, enabling their generation and update at scale.
We hope you enjoyed learning some possible patterns for tackling surrogate key management. The engineering team is continuously improving the service by adding new capabilities and extending existing ones. If you'd like to stay up to date with future additions, consider bookmarking the release notes page.