Best practices for designing a Spanner Graph schema

This document describes best practices for designing a Spanner Graph schema, focusing on efficient queries, optimized edge traversal, and effective data management techniques.

For information about the design of Spanner schemas (not Spanner Graph schemas), see Schema design best practices.

Choose a schema design

Your schema design affects graph performance. The following topics help you choose an effective strategy.

Schematized versus schemaless designs

  • A schematized design stores the graph definition in the Spanner Graph schema, which is suitable for stable graphs with infrequent definition changes. The schema enforces the graph definition, and properties support all Spanner data types.

  • A schemaless design infers the graph definition from the data, offering more flexibility without requiring schema changes. Dynamic labels and properties aren't enforced by default. Properties must be valid JSON values.

The following summarizes the primary differences between schema and schemaless data management. Also consider your graph queries to help decide which type of schema to use.

Feature Schematized data management Schemaless data management
Storing graph definition Graph definition is stored in the Spanner Graph schema. Graph definition is evident from the data. However, Spanner Graph doesn't inspect the data to infer the definition.
Updating graph definition Requires a Spanner Graph schema change. Suitable when the definition is well-defined and changes infrequently. No Spanner Graph schema change needed.
Enforcing graph definition A property graph schema enforces the allowed node types for an edge. It also enforces the allowed properties and property types of a graph node or edge type. Not enforced by default. You can use check constraints to enforce label and property data integrity.
Property data types Support any Spanner data type, for example, timestamp. Dynamic properties must be a valid JSON value.

Choose a schema design based on graph queries

Schematized and schemaless designs typically offer comparable performance. The exception is when queries use quantified path patterns that span multiple node or edge types, in which case a schemaless design offers better performance. This is because schemaless designs store all data in single node and edge tables, which minimizes table scans. By contrast, schematized designs use separate tables for each node and edge type, so queries spanning multiple types must scan and combine data from all corresponding tables.

The following are sample queries that work well with schemaless designs, and a sample query that works well with both designs:

Schemaless design

The following queries perform better with a schemaless design because they use quantified path patterns that can match multiple types of nodes and edges:

  • This query's quantified path pattern uses multiple edge types (Transfer or Withdraw) and doesn't specify intermediate node types for paths longer than one hop.

    GRAPH FinGraph
    MATCH p = (:Account {id:1})-[:Transfer|Withdraw]->{1,3}(:Account)
    RETURN TO_JSON(p) AS p;
    
  • This query's quantified path pattern finds paths of one to three hops between Person and Account nodes, using multiple edge types (Owns or Transfers), without specifying intermediate node types for longer paths. This allows paths to traverse intermediate nodes of various types. For example, (:Person)-[:Owns]->(:Account)-[:Transfers]->(:Account).

    GRAPH FinGraph
    MATCH p = (:Person {id:1})-[:Owns|Transfers]->{1,3}(:Account)
    RETURN TO_JSON(p) AS p;
    
  • This query finds paths of one to three hops between Account nodes using edges of type Owns in any direction (-[:Owns]-). Because paths can traverse edges in either direction and intermediate nodes aren't specified, a two-hop path might go through nodes of different types. For example, (:Account)-[:Owns]-(:Person)-[:Owns]-(:Account).

    GRAPH FinGraph
    MATCH p = (:Account {id:1})-[:Owns]-{1,3}(:Account)
    RETURN TO_JSON(p) AS p;
    

Both designs

The following query performs comparably with both schematized and schemaless designs. Its quantified path, (:Account)-[:Transfer]->{1,3}(:Account), involves one node type, Account, and one edge type, Transfer. Because the path involves only one node type and one edge type, performance is comparable for both designs. Even though intermediate nodes aren't explicitly labeled, the pattern constrains them to be Account nodes. The Person node appears outside of this quantified path.

GRAPH FinGraph
MATCH p = (:Person {id:1})-[:Owns]->(:Account)-[:Transfer]->{1,3}(:Account)
RETURN TO_JSON(p) AS p;

Optimize Spanner Graph schema performance

After you choose to use a schematized or schemaless Spanner Graph schema, you can optimize its performance in the following ways:

Optimize edge traversal

Edge traversal is the process of navigating through a graph by following its edges, starting at a particular node and moving along connected edges to reach other nodes. The schema defines the direction of the edge. Edge traversal is a fundamental operation in Spanner Graph, so improving edge traversal efficiency can significantly improve your application's performance.

You can traverse an edge in two directions:

  • Forward edge traversal follows outgoing edges of the source node.
  • Reverse edge traversal follows incoming edges of the destination node.

Forward and reverse edge traversal query examples

The following example query performs forward edge traversal of Owns edges for a given person:

GRAPH FinGraph
MATCH (person:Person {id: 1})-[owns:Owns]->(accnt:Account)
RETURN accnt.id;

The following example query performs reverse edge traversal of Owns edges for a given account:

GRAPH FinGraph
MATCH (accnt:Account {id: 1})<-[owns:Owns]-(person:Person)
RETURN person.name;

Optimize forward edge traversal

To improve forward edge traversal performance, optimize traversal from source to edge and from edge to destination.

  • To optimize source to edge traversal, interleave the edge input table into the source node input table using the INTERLEAVE IN PARENT clause. Interleaving is a storage optimization technique in Spanner that colocates child table rows with their corresponding parent rows in storage. For more information about interleaving, see Schemas overview.

  • To optimize edge to destination traversal, create a foreign key constraint between the edge and the destination
    node. This enforces the edge-to-destination constraint, which can improve performance by eliminating destination table scans. If enforced foreign keys cause write performance bottlenecks (for example, when updating hub nodes), use an informational foreign key instead.

The following examples show how to use interleaving with an enforced and an informational foreign key constraint.

Enforced foreign key

In this edge table example, PersonOwnAccount does the following:

  • Interleaves into the source node table Person.

  • Creates an enforced foreign key to the destination node table Account.

CREATE TABLE Person (
  id               INT64 NOT NULL,
  name             STRING(MAX),
) PRIMARY KEY (id);

CREATE TABLE Account (
  id               INT64 NOT NULL,
  create_time      TIMESTAMP,
  close_time       TIMESTAMP,
) PRIMARY KEY (id)

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
  CONSTRAINT FK_Account FOREIGN KEY (account_id)
    REFERENCES Account (id)
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

Informational foreign key

In this edge table example, PersonOwnAccount does the following:

  • Interleaves into the source node table Person.

  • Creates an informational foreign key to the destination node table Account.

CREATE TABLE Person (
  id               INT64 NOT NULL,
  name             STRING(MAX),
) PRIMARY KEY (id);

CREATE TABLE Account (
  id               INT64 NOT NULL,
  create_time      TIMESTAMP,
  close_time       TIMESTAMP,
) PRIMARY KEY (id)

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
  CONSTRAINT FK_Account FOREIGN KEY (account_id)
    REFERENCES Account (id) NOT ENFORCED
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

Optimize reverse edge traversal

Optimize reverse edge traversal unless your queries use only forward traversal, because queries involving reverse or any-directional traversal are common.

To optimize reverse edge traversal, you can do the following:

  • Create a secondary index on the edge table.

  • Interleave the index into the destination node input table to colocate the edges with the destination nodes.

  • Store the edge properties in the index.

This example shows a secondary index to optimize reverse edge traversal for the edge table PersonOwnAccount:

  • The INTERLEAVE IN clause colocates the index data with the destination node table Account.

  • The STORING clause stores edge properties in the index.

For more information about interleaving indexes, see Indexes and interleaving.

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

CREATE INDEX AccountOwnedByPerson
ON PersonOwnAccount (account_id)
STORING (create_time),
INTERLEAVE IN Account;

Use secondary indexes to filter properties

A secondary index enables efficient lookup of nodes and edges based on specific property values. Using an index helps avoid a full-table scan and is especially useful for large graphs.

Speed up filtering nodes by property

The following query that finds accounts for a specified nickname. Because it doesn't use a secondary index, all Account nodes must be scanned to find the matching results:

GRAPH FinGraph
MATCH (acct:Account)
WHERE acct.nick_name = "abcd"
RETURN acct.id;

Create a secondary index on the filtered property in your schema to speed up the filtering process:

CREATE TABLE Account (
  id               INT64 NOT NULL,
  create_time      TIMESTAMP,
  is_blocked       BOOL,
  nick_name        STRING(MAX),
) PRIMARY KEY (id);

CREATE INDEX AccountByNickName
ON Account (nick_name);

Speed up filtering edges by property

You can use a secondary index to improve the performance of filtering edges based on property values.

Forward edge traversal

Without a secondary index, this query must scan all of a person's edges to find the edges that match the create_time filter:

GRAPH FinGraph
MATCH (person:Person)-[owns:Owns]->(acct:Account)
WHERE person.id = 1
  AND owns.create_time >= PARSE_TIMESTAMP("%c", "Thu Dec 25 07:30:00 2008")
RETURN acct.id;

The following code improves query efficiency by creating a secondary index on the edge source node reference (id) and the edge property (create_time). The query also defines the index as an interleaved child of the source node input table, which colocates the index with the source node.

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

CREATE INDEX PersonOwnAccountByCreateTime
ON PersonOwnAccount (id, create_time)
INTERLEAVE IN Person;

Reverse edge traversal

Without a secondary index, the following reverse edge traversal query must read all the edges before it can find the person that owns the specified account after the specified create_time:

GRAPH FinGraph
MATCH (acct:Account)<-[owns:Owns]-(person:Person)
WHERE acct.id = 1
  AND owns.create_time >= PARSE_TIMESTAMP("%c", "Thu Dec 25 07:30:00 2008")
RETURN person.id;

The following code improves query efficiency by creating a secondary index on the edge destination node reference (account_id) and the edge property (create_time). The query also defines the index as the interleaved child of the destination node table, which colocates the index with the destination node.

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

CREATE INDEX AccountOwnedByPersonByCreateTime
ON PersonOwnAccount (account_id, create_time),
INTERLEAVE IN Account;

Prevent dangling edges

An edge that connects zero or one node, a dangling edge, can compromise Spanner Graph query efficiency and graph structure integrity. A dangling edge can occur if you delete a node without deleting its associated edges. A dangling edge can also occur if you create an edge but its source or destination node doesn't exist. To prevent dangling edges, incorporate the following in your Spanner Graph schema:

Use referential constraints

You can use interleaving and enforced foreign keys on both endpoints to prevent dangling edges by following these steps:

  1. Interleave the edge input table into the source node input table to ensure that the source node of an edge always exists.

  2. Create an enforced foreign key constraint on edges to ensure that the destination node of an edge always exists. While enforced foreign keys prevent dangling edges, they make inserting and deleting edges more expensive.

The following example uses an enforced foreign key and interleaves the edge input table into the source node input table using the INTERLEAVE IN PARENT clause. Together, using an enforced foreign key and interleaving can also help optimize forward edge traversal.

  CREATE TABLE PersonOwnAccount (
    id               INT64 NOT NULL,
    account_id       INT64 NOT NULL,
    create_time      TIMESTAMP,
    CONSTRAINT FK_Account FOREIGN KEY (account_id) REFERENCES Account (id) ON DELETE CASCADE,
  ) PRIMARY KEY (id, account_id),
    INTERLEAVE IN PARENT Person ON DELETE CASCADE;

Delete edges with ON DELETE CASCADE

When you use interleaving or an enforced foreign key to prevent dangling edges, use the ON DELETE CASCADE clause in your Spanner Graph schema to delete a node's associated edges in the same transaction that deletes the node. For more information, see Delete cascading for interleaved tables and Foreign key actions.

Delete cascade for edges connecting different types of nodes

The following examples show how to use ON DELETE CASCADE in your Spanner Graph schema to delete dangling edges when you delete a source or destination node. In both cases, the deleted node's type and the type of the node connected to it by an edge are different.

Source node

Use interleaving to delete dangling edges when the source node is deleted. The following shows how to use interleaving to delete the outgoing edges when the source node (Person) is deleted. For more information, see Create interleaved tables.

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
  CONSTRAINT FK_Account FOREIGN KEY (account_id) REFERENCES Account (id) ON DELETE CASCADE,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE

Destination node

Use a foreign key constraint to delete dangling edges when the destination node is deleted. The following example shows how to use a foreign key with ON DELETE CASCADE in an edge table to delete incoming edges when the destination node (Account) is deleted:

CONSTRAINT FK_Account FOREIGN KEY(account_id)
  REFERENCES Account(id) ON DELETE CASCADE

Delete cascade for edges connecting the same type of nodes

When an edge's source and destination nodes are of the same type and the edge is interleaved into the source node, you can define ON DELETE CASCADE for either the source or destination node, but not both.

To prevent dangling edges in these scenarios, don't interleave into the source node input table. Instead, create two enforced foreign keys on the source and destination node references.

The following example uses AccountTransferAccount as the edge input table. It defines two foreign keys, one on each end node of the transfer edge, both with the ON DELETE CASCADE action.

CREATE TABLE AccountTransferAccount (
  id               INT64 NOT NULL,
  to_id            INT64 NOT NULL,
  amount           FLOAT64,
  create_time      TIMESTAMP NOT NULL,
  order_number     STRING(MAX),
  CONSTRAINT FK_FromAccount FOREIGN KEY (id) REFERENCES Account (id) ON DELETE CASCADE,
  CONSTRAINT FK_ToAccount FOREIGN KEY (to_id) REFERENCES Account (id) ON DELETE CASCADE,
) PRIMARY KEY (id, to_id);

Configure time to live (TTL) on nodes and edges

TTL lets you expire and remove data after a specified period. You can use TTL in your schema to maintain database size and performance by removing data that has a limited lifespan or relevance. For example, you can configure it to remove session information, temporary caches, or event logs.

The following example uses TTL to delete accounts 90 days after their closure:

  CREATE TABLE Account (
    id               INT64 NOT NULL,
    create_time      TIMESTAMP,
    close_time       TIMESTAMP,
  ) PRIMARY KEY (id),
    ROW DELETION POLICY (OLDER_THAN(close_time, INTERVAL 90 DAY));

When you define a TTL policy on a node table, you must configure how related edges are handled to prevent unintended dangling edges:

  • For interleaved edge tables: If an edge table is interleaved in the node table, you can define the interleave relationship with ON DELETE CASCADE. This ensures that when TTL deletes a node, its associated interleaved edges are also deleted.

  • For edge tables with foreign keys: If an edge table references the node table with a foreign key, you have two options:

    • To automatically delete edges when the referenced node is deleted by TTL, use ON DELETE CASCADE on the foreign key. This maintains referential integrity.
    • To allow edges to remain after the referenced node is deleted (creating a dangling edge), define the foreign key as an informational foreign key.

In the following example, the AccountTransferAccount edge table is subject to two data deletion policies:

  • A TTL policy deletes transfer records that are more than ten years old.
  • The ON DELETE CASCADE clause deletes all transfer records associated with a source when that account is deleted.
CREATE TABLE AccountTransferAccount (
  id               INT64 NOT NULL,
  to_id            INT64 NOT NULL,
  amount           FLOAT64,
  create_time      TIMESTAMP NOT NULL,
  order_number     STRING(MAX),
) PRIMARY KEY (id, to_id),
  INTERLEAVE IN PARENT Account ON DELETE CASCADE,
  ROW DELETION POLICY (OLDER_THAN(create_time, INTERVAL 3650 DAY));

Merge node and edge input tables

To optimize your schema, define a node and its incoming or outgoing edges within a single table. This approach offers the following benefits:

  • Fewer tables: Reduces the number of tables in your schema, which simplifies data management.

  • Improved query performance: Eliminates traversal that uses joins to a separate edge table.

This technique works well when a table's primary key also defines a relationship to another table. For example, if the Account table has a composite primary key (owner_id, account_id), the owner_id part can be a foreign key that references the Person table. This structure allows the Account table to represent both the Account node and the incoming edge from the Person node.

  CREATE TABLE Person (
    id INT64 NOT NULL,
  ) PRIMARY KEY (id);

  -- Assume each account has exactly one owner.
  CREATE TABLE Account (
    owner_id INT64 NOT NULL,
    account_id INT64 NOT NULL,
  ) PRIMARY KEY (owner_id, account_id);

You can use the Account table to define both the Account node and its incoming Owns edge. This is shown in the following CREATE PROPERTY GRAPH statement. In the EDGE TABLES clause, you give the Account table the alias Owns. This is because each element in the graph schema must have a unique name.

  CREATE PROPERTY GRAPH FinGraph
    NODE TABLES (
      Person,
      Account
    )
    EDGE TABLES (
      Account AS Owns
        SOURCE KEY (owner_id) REFERENCES Person
        DESTINATION KEY (owner_id, account_id) REFERENCES Account
    );

What's next