Simplifying best practices at scale with auto-generated keys in Cloud Spanner
Diem Tran
Software Engineer
Sean Fox
Software Engineer
Innovators in financial services, gaming, retail, and many other industries rely on Cloud Spanner to power demanding relational database workloads that need to scale without downtime. Built on Google’s distributed infrastructure, Spanner provides a fully managed experience with the highest levels of consistency and availability at any scale.
In a continued push to meet developers where they are, we announced support for auto-generated keys in Spanner at Google Cloud Next today. Auto-generated keys allow schema authors to push down critical identity logic into the database. This simplifies application code and reduces opportunities for errors or inconsistencies. Auto-generated keys also help you migrate from relational databases that expect keys based on integer sequences. Like everything in Spanner, auto-generated keys are designed from the ground up to not compromise scalability, availability, or performance over the entire lifecycle of your applications.
What’s in the box?
Auto-generated keys rely on several underlying features:
Primary key default values: Specify an expression in DDL to calculate a DEFAULT primary key value of a table when none is provided by an INSERT. DML queries can get the newly created value using a RETURNING (PostgreSQL) or THEN RETURN (GoogleSQL) clause.
UUID function: Generate a UUID (v4) as part of a table’s primary key DEFAULT expression.
Integer sequences (bit-reversed): Generate unique, uniformly distributed numbers as part of a primary key DEFAULT expression.
Primary key generated columns: Generate a component in a composite primary key, for example to hash an incrementing number or timestamp value to avoid hotspots.
Bit reverse function: Map existing integer keys using the same logic as a bit-reversed sequence to avoid hotspotting.
Together, these capabilities balance the usability of default keys and sequences in legacy databases with best practices for Spanner at scale. All of these are available today in both the PostgreSQL and GoogleSQL dialects.
Counting by one is actually hard
The default guidance for primary keys in most relational databases is to create surrogate keys using integer sequences or an auto-incrementing type. In a single-instance database at low volume, monotonically increasing keys are easy to understand and perform well.
Counting by one is surprisingly difficult to do efficiently at scale, though. Truly monotonic sequences — increasing by exactly one at each step — require a global lock on a counter. Like a line at the deli, this can add latency as transactions wait for their number. It can also introduce a single point of failure. Even single-instance relational databases have added options like sequence sharding and caching that trade off the simplicity of counting by one for performance and availability.
From scale up to scale out
Ordered keys in a massively distributed system like Spanner can also lead to “hotspotting” at scale.
Spanner uses range sharding to distribute groups of rows, called splits, across compute resources in a Spanner instance. This allows the database to parallelize queries and updates and is fundamental to how Spanner scales. Because range sharding arranges rows in sort order by primary key, inserts or updates in the same range of values will tend to land in the same split. If Spanner’s automatic splitting and rebalancing can‘t keep up with the load, a “hot” split can use up its available compute and memory resources. When this happens, Spanner’s work scheduler queues up further requests, increasing latency and potentially leading to timeouts and aborted transactions.
Thus it‘s vitally important to pick primary keys that will allow Spanner to distribute work evenly. Unrelated but close-in-time inserts shouldn't all target the same split.
With Spanner’s range partitioning best practices and developer experience in mind, Cloud Spanner has introduced two new ways of generating primary keys in the database: UUIDs and integer sequences.
Universal Unique Identifiers (UUIDs)
The default recommendation for primary keys in Spanner is to use a Universally Unique Identifier, or UUID. UUIDs are 128-bit identifiers generated using a public specification (rfc-4122). Version 4 of this spec requires 122 bits of random data, resulting in values that are roughly uniformly distributed, thus making them good candidates for non-hotspotting keys in Spanner. 122 bits also provide a very large number of potential values, roughly 5×1036.
The wide range and randomness ensure effective uniqueness regardless of where the UUID is generated. Many Spanner customers generate UUID primary keys today in their applications. With this release, they can also provide a default expression in their DDL that will generate a UUID in the database too, providing additional flexibility. For example, a large bulk insert might generate primary keys externally in an upstream process. However, interactive applications pointing at the same database can fall back to default generation to ensure they’re all using the same logic.
A table can use the new GENERATE_UUID()
function in PostgreSQL or GoogleSQL dialect databases in a primary key default value expression to automatically populate new rows with a UUID v4 value.
In PostgreSQL syntax a Spanner table with a default UUID primary key would look something like:
UUIDs are stored in Spanner today using their string representation. Easy input and output using strings is generally simpler to work with than a potentially more compact byte representation. (As such, the namespaced spanner.generate_uuid()
function in PostgreSQL returns text
.)
You can use RETURNING
(or THEN RETURN
in GoogleSQL) to retrieve the value of the primary key that the database created:
Integer sequences
UUIDs are simplest when you’re designing your schema from scratch. What about when your application, tool, or existing schema uses an integer primary key?
Spanner now offers a SEQUENCE
object that generates unique integers as part of a primary key DEFAULT
expression. However, unlike a monotonic sequence, the values generated by a Spanner sequence are distributed uniformly and thus won’t hotspot at scale. While this isn’t exactly the behavior you get out-of-the-box with a sequence in other databases, it addresses the key requirements of uniqueness and performance and availability at scale without having to significantly change critical application or schema logic that expect default integer keys.
You declare a sequence in Spanner just like you would in another relational database. You can generate a value from that sequence in a default expression using the nextval()
function in PostgreSQL or GET_NEXT_SEQUENCE_VALUE()
in GoogleSQL.
For example, using Spanner’s PostgreSQL dialect:
The bit_reversed_positive
option is required today (or sequence_kind='bit_reversed_positive'
in GoogleSQL). A bit-reversed sequence takes the underlying counting mechanism and applies a transformation before returning it, reordering the bytes in the binary representation of the current value of the counter. The result is a unique integer that‘s generated automatically in the database, but not a simple incrementing value starting at one. A bit-reversed sequence distributes values across all positive integers to avoid hotspots at scale.
While sequences internally use multiple counters for high availability, developers should not rely on the specific implementation. The approach may change in future releases. Sequences in Spanner should be treated more like a random number generator than a counter. Applications that need to track the order in which rows are inserted should create a separate timestamp column to decouple the identity and ordering aspects. You can find additional guidance on indexing timestamps.
Choosing a primary key strategy
For the most flexibility and interoperability, UUID keys are generally the best choice for applications. Sequences are preferable for migrations of existing schemas or application code. Default values using sequence keys incur a small overhead on inserts, typically less than 100μs for coordination and periodically refreshing caches. This small one-time cost, however, provides the benefits of guaranteed uniqueness at scale.
Get started today
Today Google Cloud introduced new capabilities that help users implement best practices for primary keys in Spanner using the SQL concepts they already know. Users can now use UUIDs or integer sequences as default values in their primary keys. These provide flexibility to simplify application code and are designed for performance and availability at scale.
In the next post we’ll dive into migrating schemas and data from databases that use auto-incrementing keys. Stay tuned.
In the meantime, learn more about what makes Spanner unique and how it’s being used today. Or try it yourself for free for 90-days or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive rearchitecture.