Spatial Clustering on BigQuery - Best Practices
Most data analysts are familiar with the concept of organizing data into clusters so that it can be queried faster and at a lower cost. The user behavior dictates how the dataset should be clustered: for example, when a user seeks to analyze or visualize geospatial data (a.k.a location data), it is most efficient to cluster on a geospatial column. This practice is known as spatial clustering, and in this blog, we will share best practices for implementing it in BigQuery (hint — let BigQuery do it for you).
BigQuery is a petabyte-scale data warehouse that has many geospatial capabilities and functions. In the following sections, we will describe how BigQuery does spatial clustering out of the box using the S2 indexing system. We will also touch on how to use other spatial indexes like H3 and geohash, and compare the cost savings of different approaches.
How BigQuery does spatial clustering under the hood
Clustering ensures that blocks of data with similar values are colocated in storage, which means that the data is easier to retrieve at query time. It also sorts the blocks of data, so that only the necessary blocks need to be scanned, which reduces cost and processing time. In geospatial terms, this means that when you’re querying a particular region, only the rows within or close to that region are scanned, rather than the whole globe.
All of the optimizations described above will occur automatically in BigQuery if you cluster your tables on a GEOGRAPHY column. It’s as easy as typing
CLUSTER BY [GEOGRAPHY] when creating the table. Only predicate functions (e.g. ST_Intersects, ST_DWithin) leverage clustering, with the exception of ST_DISJOINT. It should also be noted that while BigQuery supports partitioning and clustering on a variety of fields, only clustering is supported on a geospatial field. This is because geometries can be large and could span across partitions, no matter how BigQuery chooses to partition the space. Finally, cluster sizes will range from 100MB to 1GB, so clustering on a table smaller than 100MB will provide no benefit.
When writing to a table that is clustered by GEOGRAPHY, BigQuery shards the data into spatially-compact blocks. For each block, BigQuery computes a bit of metadata called an S2 covering that includes the spatial area of the data contained within. When querying a geography-clustered table using spatial predicates, BigQuery reads the covering and evaluates whether a particular covering can satisfy the filter. BigQuery then prunes the blocks that cannot satisfy the filter. Users are only charged for data from remaining blocks. Note that S2 coverings can overlap, as it is often impossible to divide data into non-overlapping regions.
Fundamentally, BigQuery is using the S2 index to map a geometry into a 64-bit integer, then BigQuery clusters on that integer using existing integer-based clustering mechanisms. In the past, customers have manually implemented an S2 indexing system in BigQuery. This was done prior to BigQuery’s native support of spatial clustering via S2. Using BigQuery's native clustering resulted in a large performance increase, not to mention the added simplicity of not having to manage your own S2 indexes.
Alternative Spatial Indexes
Spatial clustering utilizes a spatial indexing system, or hierarchy, to organize the stored data. The purpose of all spatial indices is to represent this globe we call Earth in numerical terms, allowing us to define a location as a geometric object like a point, polygon or line. There are dozens of spatial indexes, and most databases implement them in their own unique way. Although BigQuery natively uses S2 cells for clustering, other indexes can be manually implemented, such as H3, geohash, or quadkeys. The examples below will involve the following spatial indexes:
S2: The S2 system represents geospatial data as cells on a three dimensional sphere. It is used by Google Maps.
uses quadrilaterals, which are more efficient than hexagons
Higher precision than H3 or geohashing
H3: The H3 system represents geospatial data on overlapping hexagonal grids.
Hexagons are more visually appealing
Convolutions and smoothing algorithms are more efficient than S2
Geohash - Geohash is a public domain system that represents geospatial data on a curved grid.
Length of the Geohash id determines the spatial precision
Fairly poor spatial locality, so clustering does not work as well
Spatial clustering in BQ — S2 vs. Geohash
In most cases for analysis, BigQuery’s built-in spatial clustering will give the best performance with the least effort. But if the data is queried according to other attributes, e.g. by geohash box, a custom indexing is necessary. The method of querying the spatial indexes has implications on the performance, as is illustrated in the example below.
First, you will create a table with random points in longitude and latitude. Use the BigQuery function
st_geohash to generate a geohash id for each point.
st_geogpoint function to transform the latitude and longitude into a GEOGRAPHY, BigQuery’s native geospatial type, which uses S2 cells as the index. Select a collection of around 3,000 points. This should cost around 25MB. If you run the same query on an unclustered table, it would cost 5.77GB (the full table size).
Now you will query by geohash id. BigQuery’s ability to leverage the spatial clustering will depend on whether the BQ SAT solver can prove the cluster of data can be pruned. The queries below are both leveraging the geospatial clustering, costing only 340 MB. Note that if we had clustered the table by the ‘gh’ field (ie geohash id), these queries would cost the same as the one above, around 25MB.
The query below is much less efficient, costing 5.77GB, a full scan of the table. BigQuery cannot prove this condition fails based on the min/max values of the cluster so it must scan the entire table.
As the examples show, the least costly querying option is to use the indexing consistent with the query method — native S2 indexing when querying by geography, string indexing when querying by geohash. When using geohashing, avoid left() or right() functions, as it will cause BigQuery to scan the entire table.
Spatial clustering in BQ with H3
One may also find themselves in a situation where they need to use H3 as a spatial index in BigQuery. It is still possible to leverage the performance benefits of clustering, but as with geohashing, it is important to avoid certain patterns.
Suppose you have a huge table of geography points indexed by H3 cell ID at level 15, which you’ve clustered by
H3_index (note: these functions are supported through the Carto Spatial Extension for BigQuery). You want to find all the points that belong to lower resolution cells, e.g. at level 7. You might write a query like this:
H3_ToParent is a custom function that computes the parent cell ID from a higher resolution index. Since you’ve clustered by the H3 index, you might expect a lower cost, however this query will scan the entire table. This happens because
H3_ToParent involves bit operations, and is too complex for the BigQuery query analyser to understand how the query’s result is related to cluster boundaries. What you should do instead is give BigQuery the range of the H3 cell IDs at the level that the geographies are indexed, like the following example:
H3_CellRangeEnd are custom functions that map the lower-resolution parent ID to the appropriate start and end IDs of the higher resolution cells. Now BigQuery will be able to figure out relevant clusters, reducing the cost and improving the performance of the query.
Spatial clustering is a complex topic that requires specialized knowledge to implement. Using BiqQuery’s native spatial clustering will take most of the work out of your hands. With your geospatial data in BigQuery, you can do amazing spatial analyses like querying the stars, even on large datasets. You can also use BigQuery as a backend for a geospatial application, such as an application that allows customers to explore the climate risk of their assets. Using spatial clustering, and querying your clusters correctly will ensure you get the best performance at the lowest cost.
Acknowledgments: Thanks to Eric Engle and Travis Webb for their help with this post.