GoogleSQL for Bigtable overview

You can use GoogleSQL statements to query your Bigtable data. GoogleSQL is an ANSI-compliant structured query language (SQL) that is also implemented for other Google Cloud services such as BigQuery and Spanner.

This document provides an overview of GoogleSQL for Bigtable. It provides examples of SQL queries that you can use with Bigtable and describes how they relate to a Bigtable table schema. Before you read this document, you should be familiar with the Bigtable storage model and schema design concepts.

You can create and run queries in Bigtable Studio in the Google Cloud console, or you can run them programmatically using the Bigtable client library for Java. For more information, see Use SQL with a Bigtable client library.

SQL queries are handled by cluster nodes in the same way as NoSQL data requests. Therefore, the same best practices apply when creating SQL queries to run against your Bigtable data, such as avoiding full table scans or complex filters. For more information, see Reads and performance.

Use cases

GoogleSQL for Bigtable is optimal for low-latency application development. Additionally, running SQL queries in the Google Cloud console can be useful for quickly getting a visual representation of a table's schema, verifying that certain data was written, or debugging possible data issues.

The current release of GoogleSQL for Bigtable doesn't support come common SQL constructs, including, but not limited to, the following:

  • Data Manipulation Language (DML) statements beyond SELECT, such as INSERT, UPDATE, or DELETE
  • Data Definition Language (DDL) statements such as CREATE, ALTER, or DROP
  • Data Access Control statements
  • Query syntax for subqueries, JOIN, UNION, GROUP BY, UNNEST, and CTEs

For more information, including supported functions, operators, data types, and query syntax, see the GoogleSQL for Bigtable reference documentation .

Key concepts

This section discusses key concepts to be aware of when you use GoogleSQL to query your Bigtable data.

Column families in SQL responses

In Bigtable, a table contains one or more column families, which are used to group columns. When you query a Bigtable table with GoogleSQL, the schema for the table consists of the following:

  • A special column named _key that corresponds to row keys in the queried table
  • A single column for each Bigtable column family in the table, which contains the column family's data in that row

Map data type

GoogleSQL for Bigtable includes the data type MAP<key, value>, which is designed specifically to accommodate column families.

By default, each row in a map column contains key-value pairs, where a key is the Bigtable column qualifier in the queried table, and the value is the most recent value for that column.

The following is an example of a SQL query returns a table with the row key value and the latest value of the qualifier from a map named columnFamily.

  SELECT _key, columnFamily['qualifier'] FROM myTable

If your Bigtable schema involves storing multiple cells – or versions of the data – in columns, you can add a temporal filter, such as with_history, to your SQL statement.

In this case, maps representing column families are nested and returned as an array. In the array, each key is itself a map consisting of a timestamp as the key and cell data as the value. The format is MAP<key, ARRAY<STRUCT<timestamp, value>>>.

The following example returns all cells in the 'info' column family for a single row.

  SELECT _key, info FROM users(with_history => TRUE) WHERE _key = 'user_123';

The returned map looks like the following. In the queried table, info is the column family, user_123 is the row key and city and state are the column qualifiers. Each timestamp-value pair (STRUCT) in an array represents cells in those columns in that row, and they are sorted by timestamp descending.

/*----------+------------------------------------------------------------------+
 |   _key   |                              info                                |
 +----------+------------------------------------------------------------------+
 | user_123 | {"city":{<t5>:"Brooklyn", <t0>:"New York"}, "state":{<t0>:"NY"}} |
 +----------+------------------------------------------------------------------*/

Sparse tables

A key feature of Bigtable is its flexible data model. In a Bigtable table, if a column is unused in a row, no data is stored for the column. A row might have one column and the next row might have 100 columns. In contrast, in a relational database table, all rows contain all columns, and a NULL value is typically stored in the column of a row that has no data for that column.

When you query a Bigtable table with GoogleSQL, however, an unused column is represented with an empty map and is returned as a NULL value. These NULL values can be used as query predicates. For example, a predicate like WHERE family['column1'] IS NOT NULL can be used to return a row only if column1 is used in the row.

Bytes

When you provide a string, GoogleSQL by default implicitly casts from STRING values to BYTES values. This means, for example, that you can provide the string 'qualifier' rather than the bytes sequence b'qualifier'.

Because Bigtable by default treats all data as bytes, most Bigtable columns don't contain type information. However, with GoogleSQL you can define a schema at read time with the CAST function. For more information about casting, see Conversion functions.

Temporal filters

The following table lists the arguments that you can use when accessing temporal elements of a table. Arguments are listed in the order in which they are filtered. For example, with_history is applied before latest_n. You must provide a valid timestamp.

Argument Description
as_of Timestamp. Returns the latest values with timestamps less than or equal to the provided timestamp.
with_history Boolean. Controls whether to return the latest value as a scalar or timestamped values(s) as STRUCT.
after_or_equal Timestamp. Values with timestamps after the input, inclusive. Requires with_history => TRUE
before Timestamp. Values with timestamps before the input, exclusive. Requires with_history => TRUE
latest_n Integer. The number of timestamped values to return per column qualifier (map key). Must be greater than or equal to 1. Requires with_history => TRUE.

For more examples, see Advanced query patterns.

Foundational queries

This section describes and shows examples of basic Bigtable SQL queries and how they work. For additional sample queries, see GoogleSQL for Bigtable query pattern examples.

Retrieve the latest version

Although Bigtable lets you store multiple versions of data in each column, GoogleSQL for Bigtable by default returns the latest version – the most recent cell – of the data for each row.

Consider the following sample dataset, which shows that user1 relocated twice in the state of NY and once within the city of Brooklyn. In this example, address is the column family, and the column qualifiers are street, city, and state. Cells in a column are separated by empty lines.

address
_key street city state
user1 2023/01/10-14:10:01.000:
'113 Xyz Street'

2021/12/20-09:44:31.010:
'76 Xyz Street'

2005/03/01-11:12:15.112:
'123 Abc Street'
2021/12/20-09:44:31.010:
'Brooklyn'

2005/03/01-11:12:15.112:
'Queens'
2005/03/01-11:12:15.112:
'NY'

To retrieve the latest version of each column for user1, you can use a SELECT statement like the following.

   SELECT * FROM myTable WHERE _key = 'user1'

The response contains the current address, which is a combination of the most recent street, city and state values (written at different times) printed as JSON. Timestamps are not included in the response.

_key address
user1 {street:'113 Xyz Street', city:'Brooklyn', state: :'NY'}

Retrieve all versions

To retrieve older versions (cells) of the data, use the with_history flag. You can also alias columns and expressions, as illustrated in the following example.

  SELECT _key, columnFamily['qualifier'] AS col1
  FROM myTable(with_history => TRUE)

To better understand the events leading to the current state of a row, you can retrieve the timestamps for each value by retrieving the full history. For example, to understand when user1 moved to their current address and where they moved from, you could run the following query:

  SELECT
    address['street'][0].value AS moved_to,
    address['street'][1].value AS moved_from,
    FORMAT_TIMESTAMP('%Y-%m-%d', address['street'][0].timestamp) AS moved_on,
  FROM myTable(with_history => TRUE)
  WHERE _key = 'user1'

When you use the with_history flag in your SQL query, the response is returned as MAP<key, ARRAY<STRUCT<timestamp, value>>>. Each item in the array is a timestamped value for the specified row, column family, and column. Timestamps are ordered in reverse chronological order, so the latest data is always the first item returned.

The query response is as follows.

moved_to moved_from moved_on
113 Xyz Street 76 Xyz Street 2023/01/10

You can also retrieve the number of versions in each row using array functions as demonstrated in the following query:

  SELECT _key, ARRAY_LENGTH(MAP_ENTRIES(address)) AS version_count
  FROM myTable(with_history => TRUE)

Retrieve data from a specified time

Using an as_of filter lets you retrieve the state of a row as it was at a certain point in time. For example, if you want to know the address of user as of January 10, 2022 1:14 PM, you can run the following query.

  SELECT address
  FROM myTable(as_of => TIMESTAMP('2022/01/10-13:14:00'))
  WHERE _key = 'user1'

The result shows what would have been the last known address on January 10, 2022 1:14 PM, which is the combination of street and city from the 2021/12/20-09:44:31.010 update and the state from 2005/03/01-11:12:15.112.

address
{street:'76 Xyz Street', city:'Brooklyn', state: :'NY'}

The same result can also be achieved using Unix timestamps.

  SELECT address
  FROM myTable(as_of => TIMESTAMP_FROM_UNIX_MILLIS(1641820440000))
  WHERE _key = 'user1'

Consider the following dataset, which shows the on-or-off state of smoke and carbon monoxide alarms. The column family is alarmType and the column qualifiers are smoke and carbonMonoxide. Cells in each column are separated by empty lines.


alarmType
_key smoke carbonMonoxide
building1#section1 2023/04/01-09:10:15.000:
'off'

2023/04/01-08:41:40.000:
'on'

2020/07/03-06:25:31.000:
'off'

2020/07/03-06:02:04.000:
'on'
2023/04/01-09:22:08.000:
'off'

2023/04/01-08:53:12.000:
'on'
building1#section2 2021/03/11-07:15:04.000:
'off'

2021/03/11-07:00:25.000:
'on'

You can find sections of building1 where a smoke alarm was on at 9 AM on April 1, 2023 and the status of the carbon monoxide alarm at the time using the following query.

  SELECT _key AS location, sensorType['carbonMonoxide'] AS CO_sensor
  FROM alarms(as_of => TIMESTAMP('2023/04/01-09:00:00.000'))
  WHERE _key LIKE 'building1%' and sensorType['smoke'] = 'on'

The result is the following:

location CO_sensor
building1#section1 'on'

Query time series data

A common use case for Bigtable is the storage of time series data. Consider the following sample dataset, which shows temperature and humidity readings for weather sensors. The column family ID is metrics and the column qualifiers are temperature and humidity. Cells in a column are separated by empty lines, and each cell represents a timestamped sensor reading.


metrics
_key temperature humidity
sensorA#20230105 2023/01/05-02:00:00.000:
54

2023/01/05-01:00:00.000:
56

2023/01/05-00:00:00.000:
55
2023/01/05-02:00:00.000:
0.89

2023/01/05-01:00:00.000:
0.9

2023/01/05-00:00:00.000:
0.91
sensorA#20230104 2023/01/04-23:00:00.000:
56

2023/01/04-22:00:00.000:
57
2023/01/04-23:00:00.000:
0.9

2023/01/04-22:00:00.000:
0.91

You can retrieve a specific range of timestamp values using the temporal filters after, before, or after_or_equal. The following example uses after:

   SELECT metrics['temperature'] AS temp_versioned
   FROM
   sensorReadings(after => TIMESTAMP('2023/01/04-23:00:00'),
         before => TIMESTAMP('2023/01/05-01:00:00'))
   WHERE _key LIKE 'sensorA%'

The query returns the data in this format:

temp_versioned
{timestamp: '2023/01/05-01:00:00.000', value:56}
{timestamp: '2023/01/05-00:00:00.000', value: 55}
{timestamp: '2023/01/04-23:00:00.000', value:56}

Query JSON

JSON functions let you manipulate JSON stored as Bigtable values for operational workloads.

For example, you can retrieve the value for the JSON element abc from the latest cell in the session column family along with the row key by using the following query.

  SELECT _key, JSON_VALUE(session['payload'],'$.abc') AS abc FROM analytics

Escape special characters and reserved words

Bigtable offers high flexibility in naming tables and columns. As a result, in your SQL queries, your table names might need to be escaped due to special characters or reserved words.

For example the following query is not valid SQL due to the period in the table name.

  -- ERROR: Table name format not supported

  SELECT * FROM my.table WHERE _key = 'r1'

However, you can resolve this issue by enclosing the items with backtick (`) characters.

  SELECT * FROM `my.table` WHERE _key = 'r1'

If a SQL reserved keyword is used as an identifier, it can similarly be escaped.

  SELECT * FROM `select` WHERE _key = 'r1'

Use SQL with a Bigtable client library

The Bigtable client libraries for Java and Python support querying data with SQL using the executeQuery API. The following examples show how to issue a query and access the data:

Java

To use this feature you must use java-bigtable version 2.41.0 or later. For more information on usage, see executeQuery, Statement, and ResultSet in the Javadoc.

  static void query(BigtableDataClient client) {
    try (ResultSet resultSet =
        client.executeQuery(
            Statement.of(
                "SELECT cf1['bytesCol'] AS bytesCol, CAST(cf2['stringCol'] AS STRING) AS stringCol, cf3 FROM myTable WHERE _key='mykey'"))) {
      while (resultSet.next()) {
        ByteString byteValue = resultSet.getBytes("bytesCol");
        String stringValue = resultSet.getString("stringCol");
        Map<ByteString, ByteString> cf3Value =
            resultSet.getMap("cf3", SqlType.mapOf(SqlType.bytes(), SqlType.bytes()));
        // Do something with the data
      }
    }
  }

Python asyncio

To use this feature you must use python-bigtable version 2.26.0 or later.

  from google.cloud.bigtable.data import BigtableDataClientAsync

  async def execute_query(project_id, instance_id, table_id):
      async with BigtableDataClientAsync(project=project_id) as client:
          query = (
            "SELECT cf1['bytesCol'] AS bytesCol, CAST(cf2['stringCol'] AS STRING) AS stringCol,"
            " cf3 FROM {table_id} WHERE _key='mykey'"
          )
          async for row in await client.execute_query(query, instance_id):
            print(row["_key"], row["bytesCol"], row["stringCol"], row["cf3"])

What's next