Best Practices for Schema Design

This page describes best practices for designing Cloud Spanner schemas to avoid hotspots and for loading data into Cloud Spanner.

Choosing a primary key

As mentioned in Schema and Data Model, you should be careful when choosing a primary key to not accidentally create hotspots in your database. One way to accidentally create hotspots is by choosing a column whose value monotonically increases as the first key part, because this results in all inserts occurring at the end of your key space. This is undesirable because Cloud Spanner divides data among servers by key ranges, which means all your inserts will be directed at a single server that will end up doing all the work.

For example, suppose you want to maintain a last access timestamp column on rows of Users data. The following table definition that uses a timestamp-based primary key as the first key part is an anti-pattern:


  LastAccessTimestamp INT64 NOT NULL,
  UserId              INT64 NOT NULL,
) PRIMARY KEY (LastAccessTimestamp, UserId);

The problem here is that rows will be written to this table in order of last access timestamp, and since last access timestamps are always increasing, they're always written to the end of the table. The hotspot is created because a single Cloud Spanner server will receive all of the writes, which overloads that one server.

The diagram below illustrates this pitfall:

Users table ordered by timestamp with corresponding hotspot

The Users table above includes five example rows of data, which represent five different users taking some sort of user action about a millisecond apart from each other. The diagram also annotates the order in which the rows are inserted (the labeled arrows indicate the order of writes for each row). Because inserts are ordered by timestamp, and the timestamp value is always increasing, the inserts are always added to the end of the table and are directed at the same split. (As discussed in Schema and Data Model, a split is a set of rows from one or more related tables that are stored in order of row key.)

This is problematic because Cloud Spanner assigns work to different servers in units of splits, so the server assigned to this particular split ends up handling all the insert requests. As the frequency of user access events increases, the frequency of insert requests to the corresponding server also increases. The server then becomes prone to becoming a hotspot, as indicated by the red border and background above. Note that in this simplified illustration, each server handles at most one split but in actuality each Cloud Spanner server can be assigned more than one split.

As more inserts are appended to the table, the split grows, and when it reaches its maximum size (~ a few GiB), Cloud Spanner creates another split. Subsequent inserts get appended to this new split and the server that gets assigned to it becomes the new potential hotspot.

When hotspots occur, you might observe that your inserts are slow and other work on the same server might also be slowed down. Changing the order of the LastAccessTimestamp column to ascending order doesn't solve this problem because then all the writes are inserted at the top of the table instead, which still sends all the inserts to a single server.

Schema design best practice #1: Do not choose a column whose value monotonically increases or decreases as the first key part.

Possible fix #1: Swap the order of keys

A simple fix to spread writes over the key space is to swap the order of the keys so the column that contains the monotonically increasing or decreasing value is not the first key part:

  UserId              INT64 NOT NULL,
  LastAccessTimestamp INT64 NOT NULL,
) PRIMARY KEY (UserId, LastAccessTimestamp);

In this modified schema, inserts are now ordered by UserId, rather than by chronological last access timestamp. This spreads writes among different splits because it's unlikely that a single user will produce thousands of events per second.

The diagram below illustrates the five rows from the Users table ordered by UserId instead of by access timestamp:

Users table ordered by UserId with balanced write throughput

Here the Users data is chunked into three splits, with each split containing on the order of a thousand rows of ordered UserId values. This is a reasonable estimate of how the user data could be split, assuming each row contains about 1MB of user data and given a maximum split size on the order of GB. Even though the user events occurred about a millisecond apart, each event was done by a different user, so the order of inserts is much less likely to create a hotspot compared with ordering by timestamp.

See also the related best practice for Ordering timestamp-based keys.

Possible fix #2: Hash the key and spread the writes among N shards

Another common technique for spreading the load across multiple servers is to hash the actual unique key, and use the hash (or the hash + the unique key) as the primary key. This will help avoid hotspots by ensuring that inserted rows are spread more evenly across your key space.

For example, to spread writes to the Users table among N shards, prepend a ShardId key column to the table:

  ShardId             INT64 NOT NULL,
  LastAccessTimestamp INT64 NOT NULL,
  UserId              INT64 NOT NULL,
) PRIMARY KEY (ShardId, LastAccessTimestamp, UserId);

To compute which shard to write a given row to, compute ShardId = hash(key parts) % N. Then if you want to read all the rows that were accessed since time T, read from the N different database splits and collate the results.

The diagram below illustrates how this fix spreads write throughput more evenly across servers:

Users table ordered by ShardId with balanced write throughput

Here the Users table is ordered by ShardId, which is calculated as a hash function of key columns . The number of shards you create and the order of the resulting insert requests depend on the specifics of the hash function that you define. In this example the five Users rows are chunked into three shards, each of which is in a different split. The inserts are spread evenly among the splits, which balances write throughput to the three servers that handle the splits.

Loading data

This section provides best practices for bulk inserting or writing rows into tables in a Cloud Spanner database.

  • Avoid writing rows in primary key order. For example, if you are loading data into a table with primary key UserId, it is important that you make sure your data load does not write all users in UserId order. Ideally each data load worker would be writing to a different part of the key space, so that writes go to many different Cloud Spanner servers instead of just overloading one.

  • Partition the ordered key space into ranges, and then have each range processed by a worker batching the data into that range into approximately 1MB or 10MB chunks. It's good to have a fairly large number of partitions (rule of thumb: 10 x number of nodes) so that individual workers don't cause moving hotspots.

Data loading best practices: Ensure that writes are well distributed and load the data using multiple workers.

Creating indexes

Similar to the previous primary key anti-pattern, it's also a bad idea to create non-interleaved indexes on columns whose values are monotonically increasing or decreasing, even if they aren't primary key columns.

For example, supposed you define the following table, in which LastAccessTimestamp is a non-primary key column:

  LastAccessTimestamp INT64,
  UserId              INT64 NOT NULL,

It might seem convenient to define an index on the LastAccessTimestamp column for quickly querying the database for user accesses "since time X", like this:


CREATE NULL_FILTERED INDEX UsersByLastAccessTimestamp ON Users(LastAccessTimestamp)

However, this results in the same pitfall as described in the previous best practice, because indexes are implemented as tables under the hood, and the resulting index table would use a column whose value monotonically increases as its first key part.

It is okay to create an interleaved index like this though, because rows of interleaved indexes are interleaved in corresponding parent rows, and it's unlikely for a single parent row to produce thousands of events per second.

Schema design best practice #2: Do not create a non-interleaved index on a column whose value monotonically increases or decreases.

If you are going to bulk load data, wait until your data is loaded before you create indexes. If you create indexes before you load your data, each insert will lead to a synchronous index update. Insert transactions with synchronous index updates likely involve more machines and are slower than transactions without index updates, so the bulk load will have lower throughput. If you create your indexes after you bulk load your data, the indexes will be populated asynchronously. Similar to many other relational database management systems, creating an index on pre-existing data is faster than updating an index incrementally as each row is loaded. You can track the status of asynchronous index creation using the database operations API.

Index creation best practice: Create indexes after you bulk load your data.

Ordering timestamp-based keys

If you have a table that's keyed by timestamp, make the timestamp key descending by adding the DESC keyword. Doing that makes reading all your history up to X-long-ago more efficient, since Cloud Spanner won't need to scan all your data to find out where X-long-ago begins, but can just begin reading "at the top".

For example:

  UserId              INT64 NOT NULL,
  LastAccessTimestamp INT64 NOT NULL,
) PRIMARY KEY (UserId, LastAccessTimestamp DESC);

Schema design best practice #3: Use descending order for timestamp-based keys.

Send feedback about...

Cloud Spanner Documentation