Create and manage generated columns

A generated column is a column that is always computed from other columns in a row. These columns can make a query simpler, save the cost of evaluating an expression at query time, and can be indexed or used as a foreign key. In this article, we'll describe how to manage this column type in your database.

Add a generated column to a new table

In the following CREATE TABLE snippet, we create a table to store information about users. We have columns for FirstName and LastName and define a generated column for FullName, which is the concatenation of FirstName and LastName

GoogleSQL

CREATE TABLE Users (
Id STRING(20) NOT NULL,
FirstName STRING(50),
LastName STRING(50),
Age INT64 NOT NULL,
FullName STRING(100) AS (ARRAY_TO_STRING([FirstName, LastName], " ")) STORED,
) PRIMARY KEY (Id);

PostgreSQL

CREATE TABLE users (
id VARCHAR(20) NOT NULL,
firstname VARCHAR(50),
lastname VARCHAR(50),
age BIGINT NOT NULL,
fullname VARCHAR(100) GENERATED ALWAYS AS (firstname || ' ' || lastname) STORED,
PRIMARY KEY(id)
);

The value of FullName is computed when a new row is inserted or when FirstName and/or LastName is updated for an existing row. The computed value is stored along with other columns from the table. The SQL in parentheses is called the generation expression.

  • expression can be any valid SQL expression that's assignable to the column data type with the following restrictions.

  • The STORED attribute that follows the expression stores the result of the expression along with other columns of the table. Subsequent updates to any of the referenced columns cause Spanner to re-evaluate and store the expression.

  • Spanner doesn't allow generated columns unless they use the STORED attribute.

  • Direct writes to generated columns aren't allowed.

  • Column option allow_commit_timestamp isn't allowed on generated columns or any columns that generated columns reference.

  • You can't change the data type of a generated column, or of any columns the generated column references.

  • You can't drop a column a generated column references.

  • You can use a generated column as a primary key with the following additional restrictions:

    • The generated primary key can't reference other generated columns.

    • The generated primary key can reference, at most, one non-key column.

    • The generated primary key can't depend on a non-key column with a DEFAULT clause.

  • The following rules apply when using generated key columns:

    • Read APIs: You must fully specify the key columns, including the generated key columns.
    • Mutation APIs: For INSERT, INSERT_OR_UPDATE, and REPLACE, Spanner doesn't allow you to specify generated key columns. For UPDATE, you can optionally specify generated key columns. For DELETE, you need to fully specify the key columns including the generated keys.
    • DML: You can't explicitly write to generated keys in INSERT or UPDATE statements.
    • Query: In general, we recommend that you use the generated key column as a filter in your query. Optionally, if the expression for the generated key column uses only one column as a reference, the query can apply an equality (=) or IN condition to the referenced column. For more information and an example, see Create a unique key derived from a value column.

The generated column can be queried just like any other column, as shown in the following example.

GoogleSQL

SELECT Id, FullName
FROM Users;

PostgreSQL

SELECT id, fullname
FROM users;

This is a equivalent to the following statement, which doesn't use the stored generated column.

GoogleSQL

SELECT Id, ARRAY_TO_STRING([FirstName, LastName], " ") as FullName
FROM Users;

PostgreSQL

SELECT id, firstname || ' ' || lastname as fullname
FROM users;

Apart from making the query simpler and saving the cost of evaluating the expression at query time, a generated column can also be indexed or used as a foreign key.

Create an index on a generated column

To help with lookups on our FullName generated column, we can create a secondary index as shown in the following snippet.

GoogleSQL

CREATE INDEX UsersByFullName ON Users (FullName);

PostgreSQL

CREATE INDEX UserByFullName ON users (fullname);

Add a generated column to an existing table

Using the following ALTER TABLE statement, we add a generated column to the Users table to generate and store the user's initials.

GoogleSQL

ALTER TABLE Users ADD COLUMN Initials STRING(2)
AS (ARRAY_TO_STRING([SUBSTR(FirstName, 0, 1), SUBSTR(LastName, 0, 1)], "")) STORED;

PostgreSQL

ALTER TABLE users ADD COLUMN Initials VARCHAR(2)
GENERATED ALWAYS AS (SUBSTR(FirstName, 0, 1) || SUBSTR(LastName, 0, 1)) STORED;

Adding a stored generated column to an existing table is a long-running operation to backfill the column values. During backfilling, the stored generated columns cannot be read or queried. The backfilling state is reflected in the INFORMATION_SCHEMA.

Create a partial index using a generated column

What if we only wanted to query users who are over 18? A full scan of the table would be inefficient, so we use a partial index.

  1. Use the following statement to add another generated column which returns the user's age if they are over 18, and returns NULL otherwise.

    GoogleSQL

    ALTER TABLE Users ADD COLUMN AgeAbove18 INT64
    AS (IF(Age > 18, Age, NULL)) STORED;
    

    PostgreSQL

    ALTER TABLE Users ADD COLUMN AgeAbove18 BIGINT
    GENERATED ALWAYS AS (nullif( Age , least( 18, Age) )) STORED;
    
  2. Create an index on this new column, and disable the indexing of NULL values with the NULL_FILTERED keyword in GoogleSQL or the IS NOT NULL predicate in PostgreSQL. This partial index is smaller and more efficient than a normal index because it excludes everyone who is 18 or younger.

    GoogleSQL

    CREATE NULL_FILTERED INDEX UsersAbove18ByAge
    ON Users (AgeAbove18);
    

    PostgreSQL

    CREATE INDEX UsersAbove18ByAge ON users (AgeAbove18)
    WHERE AgeAbove18 IS NOT NULL;
    
  3. To retrieve the Id and Age of all users who are over 18, run the following query.

    GoogleSQL

    SELECT Id, Age
    FROM Users@{FORCE_INDEX=UsersAbove18ByAge}
    WHERE AgeAbove18 IS NOT NULL;
    

    PostgreSQL

    SELECT Id, Age
    FROM users /*@ FORCE_INDEX = UsersAbove18ByAge */
    WHERE AgeAbove18 IS NOT NULL;
    
  4. To filter on a different age, for example, to retrieve all users who are over 21, use the same index and filter on the generated column as follows:

    GoogleSQL

    SELECT Id, Age
    FROM Users@{FORCE_INDEX=UsersAbove18ByAge}
    WHERE AgeAbove18 > 21;
    

    PostgreSQL

    SELECT Id, Age
    FROM users /*@ FORCE_INDEX = UsersAbove18ByAge */
    WHERE AgeAbove18 > 21;
    

Remove a generated column

The following DDL statement drops a generated column from the Users table.

ALTER TABLE Users
DROP COLUMN Initials;

Modify a generated column expression

Modifying the expression of a generated column is not allowed. Instead, you need to drop the existing column and create a new generated column with the new expression.

Create a primary key on a generated column

In Spanner, you can use a generated column in the primary key.

The following example shows a DDL statement that creates the UserInfoLog table with a ShardId generated column. The value of the ShardId column depends on another column. It is derived by using a MOD function on the UserId column. ShardId is declared as part of the primary key.

GoogleSQL

CREATE TABLE UserInfoLog (
  ShardId INT64 NOT NULL
  AS (MOD(UserId, 2048)) STORED,
  UserId INT64 NOT NULL,
  FullName STRING(1024) NOT NULL,
) PRIMARY KEY (ShardId, UserId);

PostgreSQL

CREATE TABLE UserInfoLog (
  ShardId BIGINT GENERATED ALWAYS
  AS (MOD(UserId, '2048'::BIGINT)) STORED NOT NULL,
  UserId BIGINT NOT NULL,
  FullName VARCHAR(1024) NOT NULL,
  PRIMARY KEY(ShardId, UserId));

Normally, to efficiently access a specific row you need to specify all key columns. In the previous example, this would mean providing both a ShardId and UserId. However, Spanner can sometimes infer the value of the generated primary key column if it depends on a single other column and if the value of the column it depends on is fully determined. This is true if the column referenced by the generated primary key column meets one of the following conditions:

  • It's equal to a constant value or bound parameter in the WHERE clause, or
  • It gets its value set by an IN operator in the WHERE clause
  • It gets its value from an equi-join condition

For example, for the following query:

GoogleSQL

SELECT * FROM UserInfoLog
AS T WHERE T.UserId=1;

PostgreSQL

SELECT * FROM UserInfoLog
AS T WHERE T.UserId=1;

Spanner can infer the value of ShardId from the provided UserId. The previous query is equivalent to the following query after query optimization:

GoogleSQL

SELECT * FROM UserInfoLog
AS T WHERE T.ShardId = MOD(1, 2048)
AND T.UserId=1;

PostgreSQL

SELECT * FROM UserInfoLog
AS T WHERE T.ShardId = MOD(1, 2048)
AND T.UserId=1;

The next example shows how to create the Students table and use an expression that retrieves the id field of the StudentInfo JSON column and uses it as the primary key:

GoogleSQL

CREATE TABLE Students (
  StudentId INT64 NOT NULL
  AS (CAST(JSON_VALUE(StudentInfo, "$.id") AS INT64)) STORED,
  StudentInfo JSON NOT NULL,
) PRIMARY KEY (StudentId);

PostgreSQL

CREATE TABLE Students (
  StudentId BIGINT GENERATED ALWAYS
  AS (((StudentInfo ->> 'id'::TEXT))::BIGINT) STORED NOT NULL,
  StudentInfo JSONB NOT NULL,
  PRIMARY KEY(StudentId));

View properties of a generated column

Spanner's INFORMATION_SCHEMA contains information about the generated columns on your database. The following are some examples of the questions you can answer when you query the information schema.

What generated columns are defined in my database?

SELECT c.TABLE_NAME, c.COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS as c
WHERE c.GENERATION_EXPRESSION IS NOT NULL;

What is the current state of generated columns in table Users?

If you have added a generated column to an existing table, you might want to pass SPANNER_STATE in a query to find out the current state of the column. SPANNER_STATE returns the following values:

  • COMMITTED: The column is fully usable.
  • WRITE_ONLY: The column is being backfilled. No read is allowed.

Use the following query to find the state of a column:

GoogleSQL

SELECT c.TABLE_NAME, c.COLUMN_NAME, c.SPANNER_STATE
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_NAME="Users" AND c.GENERATION_EXPRESSION IS NOT NULL;

PostgreSQL

SELECT c.TABLE_NAME, c.COLUMN_NAME, c.SPANNER_STATE
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_NAME='users' AND c.GENERATION_EXPRESSION IS NOT NULL;

Note: A generated column doesn't affect the performance of a read or query operation. However, it could affect performance of write operations (`DML` statements and `Mutations`) because of the overhead of evaluating the column expression of a generated column if a write operation modifies any of the columns referenced in the generated column expression. Since the overhead varies depending on the write workload for the application, schema design, and dataset characteristics, we recommend that you benchmark your applications before using a generated column.

What's next