More than just relational data at scale with Spanner’s new JSON data type
Justin Makeig
Senior Product Manager
Valentin Kuznetsov
Cloud Spanner Software Engineer
Try Google Cloud
Start building on Google Cloud with $300 in free credits and 20+ always free products.
Free trialJSON, or JavaScript Object Notation, is the format that developers rely on for hierarchical or semi-structured data. As a subset of JavaScript, JSON’s popularity has been driven by explosive growth of rich, interactive experiences in the browser and scripting environments like Node.js. Cloud Spanner’s new JSON data type allows you to extend your relational data with sparse, nested, or less structured JSON data. This provides flexibility and agility without having to compromise on the availability and consistency at scale that your applications rely on with Spanner.
Relational Is No Longer Enough
There are very few technologies that can match the ubiquity and staying power of the relational data model. E. F. Codd’s original paper likely predates many readers of this blog. Tables of rows and columns, related by keys are a natural way to capture structured data for operational applications: A “Customer” has “Sales Orders” which are made up of “Order Lines”, each with a well defined set of attributes. However, not all today’s data lends itself well to strict modeling in tables. For example, what if Customer data is sourced from three different systems, each with its own set of attributes, or the definition of an Order Line changes frequently or is defined on the fly by users?
Take, for example, a large electronics manufacturer with hundreds of different products. Each of these products has its own unique set of attributes. Modeling this relationally would require schema changes for each new attribute, even if their users or applications don’t need to query on them. With a growing business and new products coming online all the time, the analysis, modeling, deployment, and testing cycle for schema changes can be a drag on innovation. What they really need is the ability to query over a consistent set of key attributes, common to most products, and then to easily manage the long tail of other attributes without having to completely abandon the transactions and rich queries that Spanner provides.
JSON is great for representing key-value pairs (objects), ordered lists (arrays), strings, numbers, and Booleans, without having to predefine anything about the structure or the allowable values.
Our electronics manufacturer might model products with the following (grossly simplified) Products table.
This is standard relational modeling that normalizes attributes into columns. In Spanner—or any relational database—you can use SQL to filter or aggregate by a specific column or join to other tables, for example where an Order Line has a foreign key relationship to Product. Again, Relational 101.
However, in cases where the attributes for individual products vary widely, modeling using columns becomes unwieldy. The SocketSize
attribute might only apply to one product out of millions.
With Spanner, you now have the option to store this long tail of other attributes as JSON. Unlike a strongly typed column, JSON values don’t need to pre-define anything about their structure or values. Thus it’s easy to add new attributes without changing the relational schema.
Because the JSON is stored as part of the table row, it gets all the consistency and guarantees that Spanner provides for queries and updates.
As with any table, you can use SQL to query a table with JSON data. The dot operator (.) gives quick access to the properties of JSON values, in this case to project out the socketSize property of the other attributes.Spanner also provides a rich set of SQL functions that allow you to use JSONPath to traverse JSON values.
The above query uses the relational model to do the heavy lifting of filtering, while still providing the flexibility to project out of the JSON column for the filtered set. This is important because Spanner doesn’t (yet) index data in JSON columns. The built-in query optimizer and indexes rely on explicit column definitions. However, using a generated column, you can automatically extract a value out of a JSON column for indexing or query. Generated columns are automatically updated in the same transaction as the values they depend on, so columns and indexes will always be up-to-date.
For example, let’s say our electronics manufacturer wants to further refine their product type taxonomy by sub-types. Some products have already added a subtype property to a bag of ExtendedAttributes
. A product specific to airplanes might have a sub-type of “aviation”. The extended attributes could be represented in JSON as:
Using SQL, you can insert a new row with a JSON column or update an existing one:
The value of the JSON could be any valid JSON. While the SQL allows you to specify JSON as a string, internally Spanner uses an efficient normalized representation to minimize the storage size and speed up access.
In this case, you can “promote” a value from within a JSON column into its own column. Then, as with any column, you can create an index to speed up queries.
For example, to filter by subtype:
This query will use the ProductSubtypeIdx
index to avoid scanning each row.
Spanner’s new JSON data type gives developers and data architects new flexibility to manage data that doesn’t fit nicely into relational tables. This is useful for handling sparse or changing data. You can query JSON columns with SQL using a rich set of built-in functions. Generated columns allow you to automatically extract values from JSON data into their own columns when you need to filter, join, or aggregate at scale.