Schema and Data Model

Data model summary

A Cloud Spanner database can contain one or more tables. Tables look like relational database tables in that they are structured with rows, columns, and values, and they contain primary keys. Data in Cloud Spanner is strongly typed: you must define a schema for each database and that schema must specify the data types of each column of each table. Allowable data types include scalar and array types, which are explained in more detail in Data Types. You can also define one or more secondary indexes on a table, as described in Secondary Indexes.

Parent-child table relationships

You can define multiple tables in a database, and you can optionally define parent-child relationships between tables if you want Cloud Spanner to physically co-locate their rows for efficient retrieval. For example, if you have a Customers table and an Invoices table, and your application frequently fetches all the invoices for a given customer, you can define Invoices as a child table of Customers. In doing so, you're declaring a data locality relationship between two logically independent tables: you're telling Cloud Spanner to physically store one or more rows of Invoices with one Customers row.

Primary keys

How do you tell Cloud Spanner which Invoices rows to store with which Customers rows? You do so using the primary key of these tables. Every table must have a primary key, and that primary key can be composed of zero or more columns of that table. If you declare a table to be a child of another table, the primary key column(s) of the parent table must be the prefix of the primary key of the child table. This means if a parent table's primary key is composed of N columns, the primary key of each of its child tables must also be composed of those same N columns, in the same order and starting with the same column.

Cloud Spanner stores rows in sorted order by primary key values, with child rows inserted between parent rows that share the same primary key prefix. This insertion of child rows between parent rows along the primary key dimension is called interleaving, and child tables are also called interleaved tables. (See an illustration of interleaved rows in the physical view of schema example 3 below.)

In summary, primary keys in Cloud Spanner allow you to physically co-locate rows of related tables. The schema examples below show what this physical layout looks like.

Choosing a primary key

If your primary key is composed of one or more columns, you'll need a unique primary key for each row. Often your application already has a field that's a natural fit for use as the primary key. For example, in the Customers table example above, there might be an application-supplied CustomerId that serves well as the primary key. In other cases, you may need to generate a primary key when inserting the row, like a unique INT64 value that you generate using a UUID library.

In all cases, you should be careful not to create hotspots with the choice of your primary key. For example, if you insert records with a monotonically increasing integer as the key, you'll always insert at the end of your key space. This is undesirable because Cloud Spanner divides data among servers by key ranges, which means your inserts will be directed at a single server, creating a hotspot. A common technique for spreading the load across multiple servers is to hash the actual unique key, and use the hash (or the hash + the unique key) as the primary key. This will help ensure that inserted records are spread across your key space to avoid hotspots. Read more about this technique in Best Practices.

Database splits

You can define arbitrarily deep hierarchies of parent-child relationships between tables, which means you can co-locate rows of N logically independent tables. If N is small and/or the size of the data in your tables is small, your database can probably be handled by a single Cloud Spanner server. But what happens when your N related tables grow and start reaching the resource limits of an individual server? Cloud Spanner is a distributed database, which means that as your database grows, Cloud Spanner divides your data into chunks called "splits", where each split can move independently from each other and get assigned to different servers, which could be in different physical locations.

You have some control over how your data is split because Cloud Spanner can only draw split boundaries between rows of tables that are at the root of a hierarchy (i.e. tables that are not declared as a child of another table). Furthermore, because rows of an interleaved table are stored in sorted primary key order with the row from its parent table that shares its primary key prefix, this means that the rows of that child table cannot be split from their corresponding row in the parent table. (See an illustration of this in the physical view of schema example 4 below.) Thus, the parent-child table relationships that you define, along with the primary key values that you set for rows of related tables, give you control over how data is sharded under the hood.

As a rule of thumb, the size of every set of related rows in a hierarchy of parent-child tables should be less than a few GiB. A set of related rows in a hierarchy of parent-child tables is defined as: (a single row of a table at the root of a database hierarchy) + (all rows of that table's descendent tables that share the row's primary key) + (all rows of interleaved indexes that share the row's primary key).

The schema examples in the next section show how to create Cloud Spanner tables with and without parent-child relationships, and illustrate the corresponding physical layouts of data.

Schema examples

Creating a table

Suppose you're creating a music application and you need a simple table that stores rows of singer data:

Singers table with 5 rows and 4 columns.

Logical view of rows in a simple Singers table. The primary key column appears to the left of the bolded line.

Note that the table contains one primary key column, SingerId, which appears to the left of the bolded line, and that tables are organized by rows, columns, and values.

You can define the table with a Cloud Spanner schema like this:

CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024),
  SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

Schema example 1 defines the Singers table

Note the following about the example schema:

  • Singers is a table at the root of the database hierarchy (because it's not defined as a child of another table).
  • Primary key columns are usually annotated with NOT NULL (though you can omit this annotation if you want to allow NULL values in key columns; see more in Notes About Key Columns).
  • Columns that are not included in the primary key are called non-key columns, and they can have an optional NOT NULL annotation.
  • Columns that use the STRING or BYTES type must be defined with a length, which represents the maximum number of Unicode characters that can be stored in the field. (More details in Scalar Data Types.)

What does the physical layout of the rows in the Singers table look like? The diagram below shows rows of the Singers table stored by contiguous (aka sorted order of) primary key (i.e. "Singers(1)", then "Singers(2)", and so on, where "Singers(1)" represents the row in the Singers table keyed by 1.).

Example rows of a table stored in contiguous key order.

Physical layout of rows in the Singers table, with an example split boundary that results in splits handled by different servers.

The diagram also illustrates possible split points, which can occur between any rows of Singers, because Singers is at the root of the database hierarchy. It also illustrates an example split boundary between the rows keyed by Singers(3) and Singers(4), with the data from the resulting splits assigned to different servers. This means that as this table grows, it's possible for rows of Singers data to be stored in different locations.

Creating multiple tables

Assume you now want to add some basic data about each singer's albums to the music application:

Albums table with 5 rows and 3 columns

Logical view of rows in an Albums table. Primary key columns appear to the left of the bolded line

Note that the primary key of Albums is composed of two columns: SingerId and AlbumId, to associate each album with its singer. The following example schema defines both the Albums and Singers tables at the root of the database hierarchy, which makes them sibling tables:

-- Schema hierarchy:
-- + Singers (sibling table of Albums)
-- + Albums (sibling table of Singers)
CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024),
  SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId     INT64 NOT NULL,
  AlbumId      INT64 NOT NULL,
  AlbumTitle   STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId);

Schema example 2 defines the Singers and Albums tables, both at the root of the database hierarchy.

The physical layout of the rows of Singers and Albums looks like this, with rows of the Albums table stored by contiguous primary key, then rows of Singers stored by contiguous primary key:

Physical layout of rows: Albums and Singers rows are each stored by key value

Physical layout of rows of Singers and Albums tables, both at the root of the database hiearchy.

One important note about the schema above is that Cloud Spanner assumes no data locality relationships between the Singers and Albums tables, because they are sibling tables, i.e. at the same level of hierarchy in the database. As the database grows, Cloud Spanner can add split points between any of the rows shown above. This means the rows of the Albums table could end up in a different split from the rows of the Singers table, and the two splits could move independently from each other.

Depending on your application's needs, it might be fine to allow Albums data to be located on different splits from Singers data. However, if your application frequently needs to retrieve information about all the albums for a given singer, then you should create Albums as a child table of Singers, which co-locates rows from the two tables along the primary key dimension. The next example explains this in more detail.

Creating interleaved tables

As you're designing your music application, suppose you realize that the app needs to frequently access rows from both Singers and Albums tables for a given primary key (e.g. each time you access the row Singers(1), you also need to access the rows Albums(1, 1) and Albums(1, 2)). In other words, Singers and Albums need to have a strong data locality relationship.

You can declare this data locality relationship by creating Albums as a child, or "interleaved", table of Singers. As mentioned above, an interleaved table is a table that you declare to be a child of another table because you want the rows of the child table to be physically stored together with the associated parent row. As mentioned above, the prefix of the primary key of a child table must be the primary key of the parent table.

The bolded line in the schema below shows how to create Albums as an interleaved table of Singers.

-- Schema hierarchy:
-- + Singers
--   + Albums (interleaved table, child table of Singers)
CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024),
  SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId     INT64 NOT NULL,
  AlbumId      INT64 NOT NULL,
  AlbumTitle   STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

Schema example 3 defines Albums as a child of Singers, with Singers at the root of the database hierarchy.

Notes about this schema:

  • SingerId, which is the prefix of the primary key of the child table Albums, is also the primary key of its parent table Singers. This is not required if Singers and Albums are at the same level of the hierarchy, but is required in this schema because Albums is declared to be a child table of Singers.
  • The ON DELETE CASCADE annotation signifies that if a row from the parent table is deleted, then its child rows should also be deleted.
  • Interleaved rows are ordered first by rows of the parent table, then by contiguous rows of the child table that share the parent's primary key, i.e. "Singers(1)", then "Albums(1, 1)", then "Albums(1, 2)", and so on.
  • The data locality relationship of each singer and his or her album data would be preserved if this database splits, because splits can only be inserted between rows of the Singers table.

Physical layout of
rows: Albums rows are interleaved between Singers rows

Physical layout of rows of Singers and its child table Albums.

Creating a hierarchy of interleaved tables

The parent-child relationship between Singers and Albums can be extended to more descendant tables. For example, you could create an interleaved table called Songs as a child of Albums to store the track list of each album:

Songs table with 6 rows and 4 columns

Logical view of rows in an Songs table. Primary key columns appear to the left of the bolded line

Songs must have a primary key that's composed of all the primary keys of the tables above it in the hierarchy, i.e. SingerId and AlbumId.

-- Schema hierarchy:
-- + Singers
--   + Albums (interleaved table, child table of Singers)
--     + Songs (interleaved table, child table of Albums)
CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024),
  SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId     INT64 NOT NULL,
  AlbumId      INT64 NOT NULL,
  AlbumTitle   STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

CREATE TABLE Songs (
  SingerId     INT64 NOT NULL,
  AlbumId      INT64 NOT NULL,
  TrackId      INT64 NOT NULL,
  SongName     STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId, TrackId),
  INTERLEAVE IN PARENT Albums ON DELETE CASCADE;

Schema example 4 defines Songs as a child of Albums, which is a child of Singers. Singers is at the root of the database hierarchy.

The physical view of interleaved rows shows that the data locality relationship is preserved between a singer and his or her albums and songs data:

Physical views
of rows: Songs are interleaved in Albums, which are interleaved between
Singers

Physical layout of rows of Singers, Albums, and Songs tables, which form a hierarchy of interleaved tables.

In summary, a parent table along with all of its child and descendant tables forms a hierarchy of tables in the schema. Although each table in the hierarchy is logically independent, physically interleaving them this way can improve performance, effectively pre-joining the tables and allowing you to access related rows together while minimizing disk accesses.

Interleaving tables in Cloud Spanner is not required, but is recommended for tables with strong data locality relationships. Avoid interleaving tables if there is a chance that the size of a single row and its descendents will become larger than a few GiB.

Notes about key columns

The keys of a table can't change; you can't add a key column to an existing table or remove a key column from an existing table.

Storing NULLs

Primary key columns can be defined to store NULLs. If you would like to store NULLs in a primary key column, omit the NOT NULL clause for that column in the schema.

Here's an example of omitting the NOT NULL clause on the primary key column SingerId. Note that because SingerId is the primary key, there can be at most only one row in the Singers table that stores NULL in that column.

CREATE TABLE Singers (
  SingerId   INT64,
  FirstName  STRING(1024),
  LastName   STRING(1024),
) PRIMARY KEY (SingerId);

The nullable property of the primary key column must match between the parent and the child table declarations. In this example, Albums.SingerId INT64 NOT NULL is not allowed. The key declaration must omit the NOT NULL clause because Singers.SingerId omits it.

CREATE TABLE Singers (
  SingerId   INT64,
  FirstName  STRING(1024),
  LastName   STRING(1024),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId     INT64 NOT NULL,
  AlbumId      INT64 NOT NULL,
  AlbumTitle   STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

Disallowed types

These cannot be of type ARRAY:

  • A table's key columns.
  • An index's key columns.

Updating schemas

You can only make the following updates to the schema of an existing database:

  • Add new tables. (Columns in such tables may be NOT NULL.)
  • Entirely drop a table, as long as no other tables or indexes are interleaved within it.
  • Add a non-primary-key column to any table. (Such columns may not be NOT NULL.)
  • Remove a non-primary-key column from any table, unless it is used by a secondary index.
  • Add or remove any secondary index.
  • Change a STRING type to a BYTES type or a BYTES type to a STRING type.
  • Increase or decrease the length limit for a STRING or BYTES type (including to MAX), unless it is a primary key column inherited by one or more child tables.
  • Add or remove the NOT NULL annotation to/from a non-primary-key column.

Schema changes that require data validation

Cloud Spanner allows users to change schemas in ways that require validating that the existing data meets the new constraints. When such a schema change is executed, Cloud Spanner disallows conflicting schema changes to the affected schema entities and validates the data in the background. If validation is successful, the schema change completes successfully. If the validation is not successful, the schema change is cancelled with an error message.

And suppose we have a schema that defines the following table with an alternate definition of Singers:

CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024),
  Nickname   STRING(MAX),
  OpaqueData BYTES(MAX),
) PRIMARY KEY (SingerId);

Schema example for schema changes that require validation.

Then the following schema changes are allowed, but require validation and may take time to complete, depending on the amount of existing data:

  • Adding the NOT NULL annotation to a non-primary-key column. Example: ALTER TABLE Singers ALTER COLUMN Nickname STRING(MAX) NOT NULL

  • Reducing the length of a column. Example: ALTER TABLE Singers ALTER COLUMN FirstName STRING(10)

  • Altering from BYTES to STRING. Example: ALTER TABLE Singers ALTER COLUMN OpaqueData STRING(MAX)

It's important to remember that these schema changes fail if the underlying data does not satisfy the new constraints. For example, the ALTER TABLE Singers ALTER COLUMN Nickname STRING(MAX) NOT NULL statement above fails if any Nickname columns contains a NULL, because the existing data does not meet the constraints of the new definition.

Data validation can take from several minutes to many hours. The time to complete data validation depends on:

  • the size of the data set
  • the number of nodes in the instance
  • the load on the instance

Some schema changes may alter the behavior of requests to the database before the change completes. For example, if you're adding NOT NULL to a column, Cloud Spanner will almost immediately begin rejecting writes for new requests that use NULL for the column. If the new schema change ultimately fails for data validation, there may be a period of time when writes were blocked even if they would have been accepted by the old schema.

You can cancel a long-running data validation operation using the projects.instances.databases.operations.cancel method.

Send feedback about...

Cloud Spanner Documentation