GoogleSQL for Bigtable overview
You can use GoogleSQL statements to query your Bigtable data. GoogleSQL is an ANSI-compliant structured query language (SQL) that is also implemented for other Google Cloud services such as BigQuery and Spanner.
This document provides an overview of GoogleSQL for Bigtable. It provides examples of SQL queries that you can use with Bigtable and describes how they relate to a Bigtable table schema. Before you read this document, you should be familiar with the Bigtable storage model and schema design concepts.
You can create and run queries in Bigtable Studio in the Google Cloud console, or you can run them programmatically using the Bigtable client library for Java. For more information, see Use SQL with a Bigtable client library.
SQL queries are handled by cluster nodes in the same way as NoSQL data requests. Therefore, the same best practices apply when creating SQL queries to run against your Bigtable data, such as avoiding full table scans or complex filters. For more information, see Reads and performance.
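For example, restricting a query to a row key prefix keeps the read bounded instead of scanning the whole table. The following is a minimal sketch that assumes a table named myTable with a column family named columnFamily:
SELECT _key, columnFamily['qualifier']
FROM myTable
WHERE _key LIKE 'user1%' -- filtering on the row key limits the rows the query scans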
Use cases
GoogleSQL for Bigtable is optimal for low-latency application development. Additionally, running SQL queries in the Google Cloud console can be useful for quickly getting a visual representation of a table's schema, verifying that certain data was written, or debugging possible data issues.
The current release of GoogleSQL for Bigtable doesn't support some common SQL constructs, including, but not limited to, the following:
- Data Manipulation Language (DML) statements beyond SELECT, such as INSERT, UPDATE, or DELETE
- Data Definition Language (DDL) statements such as CREATE, ALTER, or DROP
- Data access control statements
- Query syntax for subqueries, JOIN, UNION, GROUP BY, UNNEST, and CTEs
For more information, including supported functions, operators, data types, and query syntax, see the GoogleSQL for Bigtable reference documentation.
Key concepts
This section discusses key concepts to be aware of when you use GoogleSQL to query your Bigtable data.
Column families in SQL responses
In Bigtable, a table contains one or more column families, which are used to group columns. When you query a Bigtable table with GoogleSQL, the schema for the table consists of the following:
- A special column named _key that corresponds to row keys in the queried table
- A single column for each Bigtable column family in the table, which contains the column family's data in that row
Map data type
GoogleSQL for Bigtable includes the data type MAP<key, value>, which is designed specifically to accommodate column families.
By default, each row in a map column contains key-value pairs, where a key is the Bigtable column qualifier in the queried table, and the value is the most recent value for that column.
The following is an example of a SQL query that returns a table with the row key value and the latest value of the qualifier column from a map named columnFamily.
SELECT _key, columnFamily['qualifier'] FROM myTable
If your Bigtable schema involves storing multiple cells (versions of the data) in columns, you can add a temporal filter, such as with_history, to your SQL statement. In this case, maps representing column families are nested and returned as an array. Each item in the array is a timestamp-value pair (a STRUCT), with the cell's timestamp as the key and the cell data as the value. The format is MAP<key, ARRAY<STRUCT<timestamp, value>>>.
The following example returns all cells in the 'info' column family for a single row.
SELECT _key, info FROM users(with_history => TRUE) WHERE _key = 'user_123';
The returned map looks like the following. In the queried table, info is the column family, user_123 is the row key, and city and state are the column qualifiers. Each timestamp-value pair (STRUCT) in an array represents cells in those columns in that row, and they are sorted by timestamp in descending order.
/*----------+------------------------------------------------------------------+
| _key | info |
+----------+------------------------------------------------------------------+
| user_123 | {"city":{<t5>:"Brooklyn", <t0>:"New York"}, "state":{<t0>:"NY"}} |
+----------+------------------------------------------------------------------*/
Sparse tables
A key feature of Bigtable is its flexible data model. In a Bigtable table, if a column is unused in a row, no data is stored for the column. A row might have one column and the next row might have 100 columns. In contrast, in a relational database table, all rows contain all columns, and a NULL value is typically stored in the column of a row that has no data for that column.

When you query a Bigtable table with GoogleSQL, however, an unused column is represented with an empty map and is returned as a NULL value. These NULL values can be used as query predicates. For example, a predicate like WHERE family['column1'] IS NOT NULL can be used to return a row only if column1 is used in the row.
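The following is a minimal sketch of such a query, assuming a table named myTable with a column family named family:
-- Return only the rows in which column1 contains data
SELECT _key, family['column1'] AS column1
FROM myTable
WHERE family['column1'] IS NOT NULL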
Bytes
When you provide a string, GoogleSQL by default implicitly casts from STRING values to BYTES values. This means, for example, that you can provide the string 'qualifier' rather than the bytes sequence b'qualifier'.

Because Bigtable by default treats all data as bytes, most Bigtable columns don't contain type information. However, with GoogleSQL you can define a schema at read time with the CAST function. For more information about casting, see Conversion functions.
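For example, the following sketch, which assumes a table named myTable with a column family named cf whose qualifier column stores UTF-8 text, reads the stored bytes as a string:
-- Define a STRING type for the column at read time
SELECT _key, CAST(cf['qualifier'] AS STRING) AS qualifier_text
FROM myTable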
Temporal filters
The following table lists the arguments that you can use when accessing temporal elements of a table. Arguments are listed in the order in which they are filtered. For example, with_history is applied before latest_n. You must provide a valid timestamp.
Argument | Description |
---|---|
as_of | Timestamp. Returns the latest values with timestamps less than or equal to the provided timestamp. |
with_history | Boolean. Controls whether to return the latest value as a scalar or timestamped value(s) as STRUCT. |
after_or_equal | Timestamp. Values with timestamps after the input, inclusive. Requires with_history => TRUE. |
before | Timestamp. Values with timestamps before the input, exclusive. Requires with_history => TRUE. |
latest_n | Integer. The number of timestamped values to return per column qualifier (map key). Must be greater than or equal to 1. Requires with_history => TRUE. |
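For example, the following sketch, which assumes a table named myTable with a column family named columnFamily, combines with_history and latest_n to return at most the two most recent cells for each column qualifier:
SELECT _key, columnFamily
FROM myTable(with_history => TRUE, latest_n => 2)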
For more examples, see Advanced query patterns.
Foundational queries
This section describes and shows examples of basic Bigtable SQL queries and how they work. For additional sample queries, see GoogleSQL for Bigtable query pattern examples.
Retrieve the latest version
Although Bigtable lets you store multiple versions of data in each column, GoogleSQL for Bigtable by default returns the latest version – the most recent cell – of the data for each row.
Consider the following sample dataset, which shows that user1 relocated twice in the state of NY and once within the city of Brooklyn. In this example, address is the column family, and the column qualifiers are street, city, and state. A column that holds multiple cells shows multiple timestamped values.
address | | | |
---|---|---|---|
_key | street | city | state |
user1 | 2023/01/10-14:10:01.000: '113 Xyz Street' 2021/12/20-09:44:31.010: '76 Xyz Street' 2005/03/01-11:12:15.112: '123 Abc Street' | 2021/12/20-09:44:31.010: 'Brooklyn' 2005/03/01-11:12:15.112: 'Queens' | 2005/03/01-11:12:15.112: 'NY' |
To retrieve the latest version of each column for user1, you can use a SELECT statement like the following.
SELECT * FROM myTable WHERE _key = 'user1'
The response contains the current address, which is a combination of the most recent street, city and state values (written at different times) printed as JSON. Timestamps are not included in the response.
_key | address |
---|---|
user1 | {street:'113 Xyz Street', city:'Brooklyn', state:'NY'} |
Retrieve all versions
To retrieve older versions (cells) of the data, use the with_history flag. You can also alias columns and expressions, as illustrated in the following example.
SELECT _key, columnFamily['qualifier'] AS col1
FROM myTable(with_history => TRUE)
To better understand the events leading to the current state of a row, you can retrieve the timestamps for each value by retrieving the full history. For example, to understand when user1 moved to their current address and where they moved from, you could run the following query:
SELECT
address['street'][0].value AS moved_to,
address['street'][1].value AS moved_from,
FORMAT_TIMESTAMP('%Y-%m-%d', address['street'][0].timestamp) AS moved_on,
FROM myTable(with_history => TRUE)
WHERE _key = 'user1'
When you use the with_history flag in your SQL query, the response is returned as MAP<key, ARRAY<STRUCT<timestamp, value>>>. Each item in the array is a timestamped value for the specified row, column family, and column. Timestamps are ordered in reverse chronological order, so the latest data is always the first item returned.
The query response is as follows.
moved_to | moved_from | moved_on |
---|---|---|
113 Xyz Street | 76 Xyz Street | 2023/01/10 |
You can also retrieve the number of versions in each row using array functions as demonstrated in the following query:
SELECT _key, ARRAY_LENGTH(MAP_ENTRIES(address)) AS version_count
FROM myTable(with_history => TRUE)
Retrieve data from a specified time
Using an as_of filter lets you retrieve the state of a row as it was at a certain point in time. For example, if you want to know the address of user1 as of January 10, 2022 at 1:14 PM, you can run the following query.
SELECT address
FROM myTable(as_of => TIMESTAMP('2022/01/10-13:14:00'))
WHERE _key = 'user1'
The result shows what would have been the last known address on January 10, 2022 at 1:14 PM, which is the combination of the street and city from the 2021/12/20-09:44:31.010 update and the state from 2005/03/01-11:12:15.112.
address |
---|
{street:'76 Xyz Street', city:'Brooklyn', state:'NY'} |
The same result can also be achieved using Unix timestamps.
SELECT address
FROM myTable(as_of => TIMESTAMP_FROM_UNIX_MILLIS(1641820440000))
WHERE _key = 'user1'
Consider the following dataset, which shows the on-or-off state of smoke and carbon monoxide alarms. The column family is alarmType and the column qualifiers are smoke and carbonMonoxide. A column that holds multiple cells shows multiple timestamped values.
alarmType | | |
---|---|---|
_key | smoke | carbonMonoxide |
building1#section1 | 2023/04/01-09:10:15.000: 'off' 2023/04/01-08:41:40.000: 'on' 2020/07/03-06:25:31.000: 'off' 2020/07/03-06:02:04.000: 'on' | 2023/04/01-09:22:08.000: 'off' 2023/04/01-08:53:12.000: 'on' |
building1#section2 | 2021/03/11-07:15:04.000: 'off' 2021/03/11-07:00:25.000: 'on' | |
You can find sections of building1 where a smoke alarm was on at 9 AM on April 1, 2023, and the status of the carbon monoxide alarm at that time, by using the following query.
SELECT _key AS location, alarmType['carbonMonoxide'] AS CO_sensor
FROM alarms(as_of => TIMESTAMP('2023/04/01-09:00:00.000'))
WHERE _key LIKE 'building1%' AND alarmType['smoke'] = 'on'
The result is the following:
location | CO_sensor |
---|---|
building1#section1 | 'on' |
Query time series data
A common use case for Bigtable is the storage of
time series data.
Consider the following sample dataset, which shows temperature and humidity readings for weather sensors. The column family ID is metrics and the column qualifiers are temperature and humidity. Each cell represents a timestamped sensor reading, and a column that holds multiple cells shows multiple timestamped values.
metrics | | |
---|---|---|
_key | temperature | humidity |
sensorA#20230105 | 2023/01/05-02:00:00.000: 54 2023/01/05-01:00:00.000: 56 2023/01/05-00:00:00.000: 55 | 2023/01/05-02:00:00.000: 0.89 2023/01/05-01:00:00.000: 0.9 2023/01/05-00:00:00.000: 0.91 |
sensorA#20230104 | 2023/01/04-23:00:00.000: 56 2023/01/04-22:00:00.000: 57 | 2023/01/04-23:00:00.000: 0.9 2023/01/04-22:00:00.000: 0.91 |
You can retrieve a specific range of timestamp values by using the temporal filters after, before, or after_or_equal. The following example uses after:
SELECT metrics['temperature'] AS temp_versioned
FROM
sensorReadings(after => TIMESTAMP('2023/01/04-23:00:00'),
before => TIMESTAMP('2023/01/05-01:00:00'))
WHERE _key LIKE 'sensorA%'
The query returns the data in this format:
temp_versioned |
---|
{timestamp: '2023/01/05-01:00:00.000', value:56} {timestamp: '2023/01/05-00:00:00.000', value: 55} |
{timestamp: '2023/01/04-23:00:00.000', value:56} |
Query JSON
JSON functions let you manipulate JSON stored as Bigtable values for operational workloads.
For example, you can retrieve the value for the JSON element abc from the latest cell in the session column family, along with the row key, by using the following query.
SELECT _key, JSON_VALUE(session['payload'],'$.abc') AS abc FROM analytics
Escape special characters and reserved words
Bigtable offers high flexibility in naming tables and columns. As a result, in your SQL queries, your table names might need to be escaped due to special characters or reserved words.
For example, the following query is not valid SQL due to the period in the table name.
-- ERROR: Table name format not supported
SELECT * FROM my.table WHERE _key = 'r1'
However, you can resolve this issue by enclosing the identifier in backtick (`) characters.
SELECT * FROM `my.table` WHERE _key = 'r1'
If a SQL reserved keyword is used as an identifier, it can similarly be escaped.
SELECT * FROM `select` WHERE _key = 'r1'
Use SQL with a Bigtable client library
The Bigtable client libraries for Java and Python support querying
data with SQL using the executeQuery
API. The following examples show how to
issue a query and access the data:
Java
To use this feature you must use java-bigtable
version 2.41.0 or later. For
more information on usage, see
executeQuery,
Statement,
and
ResultSet
in the Javadoc.
static void query(BigtableDataClient client) {
  try (ResultSet resultSet =
      client.executeQuery(
          Statement.of(
              "SELECT cf1['bytesCol'] AS bytesCol, CAST(cf2['stringCol'] AS STRING) AS stringCol, cf3 FROM myTable WHERE _key='mykey'"))) {
    while (resultSet.next()) {
      ByteString byteValue = resultSet.getBytes("bytesCol");
      String stringValue = resultSet.getString("stringCol");
      Map<ByteString, ByteString> cf3Value =
          resultSet.getMap("cf3", SqlType.mapOf(SqlType.bytes(), SqlType.bytes()));
      // Do something with the data
    }
  }
}
Python asyncio
To use this feature you must use python-bigtable
version 2.26.0 or later.
from google.cloud.bigtable.data import BigtableDataClientAsync

async def execute_query(project_id, instance_id, table_id):
    async with BigtableDataClientAsync(project=project_id) as client:
        # Select _key as well so it can be read from each returned row.
        query = (
            "SELECT _key, cf1['bytesCol'] AS bytesCol, CAST(cf2['stringCol'] AS STRING) AS stringCol,"
            f" cf3 FROM {table_id} WHERE _key='mykey'"
        )
        # Iterate over the rows returned by the query.
        async for row in await client.execute_query(query, instance_id):
            print(row["_key"], row["bytesCol"], row["stringCol"], row["cf3"])