Create and manage generated columns

A generated column is a column that is always computed from other columns in a row. These columns can make a query simpler, save the cost of evaluating an expression at query time, and can be indexed or used as a foreign key. In this article, we'll describe how to manage this column type in your database.

Add a generated column to a new table

In the following CREATE TABLE snippet, we create a table to store information about users. We have columns for FirstName and LastName and define a generated column for FullName, which is the concatenation of FirstName and LastName. The SQL in parentheses is called the generation expression.

A generated column can be marked as STORED to save the cost of evaluating the expression at query time. As a result, the value of FullName is only computed when a new row is inserted or when FirstName or LastName is updated for an existing row. The computed value is stored along with other columns in the table.

GoogleSQL

CREATE TABLE Users (
Id STRING(20) NOT NULL,
FirstName STRING(50),
LastName STRING(50),
Age INT64 NOT NULL,
FullName STRING(100) AS (ARRAY_TO_STRING([FirstName, LastName], " ")) STORED,
) PRIMARY KEY (Id);

PostgreSQL

CREATE TABLE users (
id VARCHAR(20) NOT NULL,
firstname VARCHAR(50),
lastname VARCHAR(50),
age BIGINT NOT NULL,
fullname VARCHAR(100) GENERATED ALWAYS AS (firstname || ' ' || lastname) STORED,
PRIMARY KEY(id)
);

You can create a non-stored generated column by omitting the STORED attribute in the DDL. This kind of generated column is evaluated at query time and can make a query simpler. In PostgreSQL, you can create a non-stored generated column using the VIRTUAL attribute.

GoogleSQL

FullName STRING(MAX) AS (CONCAT(FirstName, " ", LastName))

PostgreSQL

fullname text GENERATED ALWAYS AS (firstname || ' ' || lastname) VIRTUAL
  • expression can be any valid SQL expression that's assignable to the column data type with the following restrictions.

    • The expression can only reference columns in the same table.

    • The expression can't contain subqueries.

    • Expressions with non-deterministic functions such as PENDING_COMMIT_TIMESTAMP(), CURRENT_DATE(), and CURRENT_TIMESTAMP() can't be made into a STORED generated column or a generated column that is indexed.

    • You can't modify the expression of a STORED or indexed generated column.

  • The STORED attribute that follows the expression stores the result of the expression along with other columns of the table. Subsequent updates to any of the referenced columns cause Spanner to re-evaluate and store the expression.

  • Generated columns that are not STORED can't be marked as NOT NULL.

  • Direct writes to generated columns aren't allowed.

  • Column option allow_commit_timestamp isn't allowed on generated columns or any columns that generated columns reference.

  • For STORED or generated columns that are indexed, you can't change the data type of the column, or of any columns that the generated column references.

  • You can't drop a column a generated column references.

  • You can use a generated column as a primary key with the following additional restrictions:

    • The generated primary key can't reference other generated columns.

    • The generated primary key can reference, at most, one non-key column.

    • The generated primary key can't depend on a non-key column with a DEFAULT clause.

  • The following rules apply when using generated key columns:

    • Read APIs: You must fully specify the key columns, including the generated key columns.
    • Mutation APIs: For INSERT, INSERT_OR_UPDATE, and REPLACE, Spanner doesn't allow you to specify generated key columns. For UPDATE, you can optionally specify generated key columns. For DELETE, you need to fully specify the key columns including the generated keys.
    • DML: You can't explicitly write to generated keys in INSERT or UPDATE statements.
    • Query: In general, we recommend that you use the generated key column as a filter in your query. Optionally, if the expression for the generated key column uses only one column as a reference, the query can apply an equality (=) or IN condition to the referenced column. For more information and an example, see Create a unique key derived from a value column.

The generated column can be queried just like any other column, as shown in the following example.

GoogleSQL

SELECT Id, FullName
FROM Users;

PostgreSQL

SELECT id, fullname
FROM users;

The query using Fullname is equivalent to the query with the generated expression. Hence a generated column can make the query simpler.

GoogleSQL

SELECT Id, ARRAY_TO_STRING([FirstName, LastName], " ") as FullName
FROM Users;

PostgreSQL

SELECT id, firstname || ' ' || lastname as fullname
FROM users;

Create an index on a generated column

You can also index or use a generated column as a foreign key.

To help with lookups on our FullName generated column, we can create a secondary index as shown in the following snippet.

GoogleSQL

CREATE INDEX UsersByFullName ON Users (FullName);

PostgreSQL

CREATE INDEX UserByFullName ON users (fullname);

Add a generated column to an existing table

Using the following ALTER TABLE statement, we can add a generated column to the Users table to generate and store the user's initials.

GoogleSQL

ALTER TABLE Users ADD COLUMN Initials STRING(2)
AS (ARRAY_TO_STRING([SUBSTR(FirstName, 0, 1), SUBSTR(LastName, 0, 1)], "")) STORED;

PostgreSQL

ALTER TABLE users ADD COLUMN initials VARCHAR(2)
GENERATED ALWAYS AS (SUBSTR(firstname, 0, 1) || SUBSTR(lastname, 0, 1)) STORED;

If you add a stored generated column to an existing table, a long-running operation to backfill the column values is started. During backfilling, the stored generated columns can't be read or queried. The backfilling state is reflected in the INFORMATION_SCHEMA table.

Create a partial index using a generated column

What if we only wanted to query users who are over 18? A full scan of the table would be inefficient, so we use a partial index.

  1. Use the following statement to add another generated column which returns the user's age if they are over 18, and returns NULL otherwise.

    GoogleSQL

    ALTER TABLE Users ADD COLUMN AgeAbove18 INT64
    AS (IF(Age > 18, Age, NULL));
    

    PostgreSQL

    ALTER TABLE Users ADD COLUMN AgeAbove18 BIGINT
    GENERATED ALWAYS AS (nullif( Age , least( 18, Age) )) VIRTUAL;
    
  2. Create an index on this new column, and disable the indexing of NULL values with the NULL_FILTERED keyword in GoogleSQL or the IS NOT NULL predicate in PostgreSQL. This partial index is smaller and more efficient than a normal index because it excludes everyone who is 18 or younger.

    GoogleSQL

    CREATE NULL_FILTERED INDEX UsersAbove18ByAge
    ON Users (AgeAbove18);
    

    PostgreSQL

    CREATE INDEX UsersAbove18ByAge ON users (AgeAbove18)
    WHERE AgeAbove18 IS NOT NULL;
    
  3. To retrieve the Id and Age of all users who are over 18, run the following query.

    GoogleSQL

    SELECT Id, Age
    FROM Users@{FORCE_INDEX=UsersAbove18ByAge}
    WHERE AgeAbove18 IS NOT NULL;
    

    PostgreSQL

    SELECT Id, Age
    FROM users /*@ FORCE_INDEX = UsersAbove18ByAge */
    WHERE AgeAbove18 IS NOT NULL;
    
  4. To filter on a different age, for example, to retrieve all users who are over 21, use the same index and filter on the generated column as follows:

    GoogleSQL

    SELECT Id, Age
    FROM Users@{FORCE_INDEX=UsersAbove18ByAge}
    WHERE AgeAbove18 > 21;
    

    PostgreSQL

    SELECT Id, Age
    FROM users /*@ FORCE_INDEX = UsersAbove18ByAge */
    WHERE AgeAbove18 > 21;
    

    An indexed generated column can save the cost of evaluating an expression at query time and avoid storing the values twice (in the base table and index) as compared to a STORED generated column.

Remove a generated column

The following DDL statement drops a generated column from the Users table:

GoogleSQL

  ALTER TABLE Users DROP COLUMN Initials;

PostgreSQL

  ALTER TABLE users DROP COLUMN initials;

Modify a generated column expression

GoogleSQL

ALTER TABLE Users ALTER COLUMN FullName STRING(100) 
AS (ARRAY_TO_STRING(ARRAY_TO_STRING([LastName, FirstName ], " ")));

PostgreSQL

ALTER TABLE users ADD COLUMN Initials VARCHAR(2)
GENERATED ALWAYS AS (lastname || ' ' || firstname) VIRTUAL;

Updating the expression of a STORED generated column or an indexed non-stored generated column is not allowed.

Create a primary key on a generated column

In Spanner, you can use a STORED generated column in the primary key.

The following example shows a DDL statement that creates the UserInfoLog table with a ShardId generated column. The value of the ShardId column depends on another column. It is derived by using a MOD function on the UserId column. ShardId is declared as part of the primary key.

GoogleSQL

CREATE TABLE UserInfoLog (
  ShardId INT64 NOT NULL
  AS (MOD(UserId, 2048)) STORED,
  UserId INT64 NOT NULL,
  FullName STRING(1024) NOT NULL,
) PRIMARY KEY (ShardId, UserId);

PostgreSQL

CREATE TABLE UserInfoLog (
  ShardId BIGINT GENERATED ALWAYS
  AS (MOD(UserId, '2048'::BIGINT)) STORED NOT NULL,
  UserId BIGINT NOT NULL,
  FullName VARCHAR(1024) NOT NULL,
  PRIMARY KEY(ShardId, UserId));

Normally, to efficiently access a specific row you need to specify all key columns. In the previous example, this would mean providing both a ShardId and UserId. However, Spanner can sometimes infer the value of the generated primary key column if it depends on a single other column and if the value of the column it depends on is fully determined. This is true if the column referenced by the generated primary key column meets one of the following conditions:

  • It's equal to a constant value or bound parameter in the WHERE clause, or
  • It gets its value set by an IN operator in the WHERE clause
  • It gets its value from an equi-join condition

For example, for the following query:

GoogleSQL

SELECT * FROM UserInfoLog
AS T WHERE T.UserId=1;

PostgreSQL

SELECT * FROM UserInfoLog
AS T WHERE T.UserId=1;

Spanner can infer the value of ShardId from the provided UserId. The previous query is equivalent to the following query after query optimization:

GoogleSQL

SELECT * FROM UserInfoLog
AS T WHERE T.ShardId = MOD(1, 2048)
AND T.UserId=1;

PostgreSQL

SELECT * FROM UserInfoLog
AS T WHERE T.ShardId = MOD(1, 2048)
AND T.UserId=1;

The next example shows how to create the Students table and use an expression that retrieves the id field of the StudentInfo JSON column and uses it as the primary key:

GoogleSQL

CREATE TABLE Students (
  StudentId INT64 NOT NULL
  AS (CAST(JSON_VALUE(StudentInfo, "$.id") AS INT64)) STORED,
  StudentInfo JSON NOT NULL,
) PRIMARY KEY (StudentId);

PostgreSQL

CREATE TABLE Students (
  StudentId BIGINT GENERATED ALWAYS
  AS (((StudentInfo ->> 'id'::TEXT))::BIGINT) STORED NOT NULL,
  StudentInfo JSONB NOT NULL,
  PRIMARY KEY(StudentId));

View properties of a generated column

Spanner's INFORMATION_SCHEMA contains information about the generated columns on your database. The following are some examples of the questions you can answer when you query the information schema.

What generated columns are defined in my database?

GoogleSQL

SELECT c.TABLE_NAME, c.COLUMN_NAME, C.IS_STORED
FROM INFORMATION_SCHEMA.COLUMNS as c
WHERE c.GENERATION_EXPRESSION IS NOT NULL;

PostgreSQL

SELECT c.TABLE_NAME, c.COLUMN_NAME, C.IS_STORED
FROM INFORMATION_SCHEMA.COLUMNS as c
WHERE c.GENERATION_EXPRESSION IS NOT NULL;

IS_STORED is either YES for stored generated columns, NO for non-stored generated columns, or NULL for non-generated columns.

What is the current state of generated columns in table Users?

If you have added a generated column to an existing table, you might want to pass SPANNER_STATE in a query to find out the current state of the column. SPANNER_STATE returns the following values:

  • COMMITTED: The column is fully usable.
  • WRITE_ONLY: The column is being backfilled. No read is allowed.

Use the following query to find the state of a column:

GoogleSQL

SELECT c.TABLE_NAME, c.COLUMN_NAME, c.SPANNER_STATE
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_NAME="Users" AND c.GENERATION_EXPRESSION IS NOT NULL;

PostgreSQL

SELECT c.TABLE_NAME, c.COLUMN_NAME, c.SPANNER_STATE
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_NAME='users' AND c.GENERATION_EXPRESSION IS NOT NULL;

Note: A generated column that is non-stored can only be accessed using the SQL query. However, if it is indexed, you can use the read API to access the value from the index.

Performance

A STORED generated column doesn't affect the performance of a read or query operation. However, non-stored generated columns used in a query can impact its performance due to the overhead of evaluating the generated column expression.

Performance of write operations (DML statements and mutations) is impacted when using either a STORED generated column or a generated column that is indexed. The overhead is due to evaluating the generated column expression when the write operation inserts or modifies any of the columns referenced in the generated column expression. Since the overhead varies depending on the write workload for the application, schema design, and dataset characteristics, we recommend that you benchmark your applications before using a generated column.

What's next