Working with BigQuery GIS data

BigQuery GIS allows you to analyze geographic data in BigQuery. Geographic data is also known as geospatial data.

Common types of objects when working with geospatial data include the following:

  • A geometry represents a surface area on the Earth. It is often described using points, lines, polygons, or a collection of points, lines, and polygons. A geometry collection is a geometry that represents the spatial union of all shapes in the collection.
  • A spatial feature represents a logical spatial object. It combines a geometry with additional attributes that are application-specific.
  • A spatial feature collection is a set of spatial features.

In BigQuery, the GEOGRAPHY data type represents a geometry value or geometry collection. To represent spatial features, create a table with a GEOGRAPHY column for the geometry plus additional columns for the attributes. Each row of the table is a spatial feature, and the entire table represents a spatial feature collection.

The GEOGRAPHY data type describes a point set on the Earth's surface. A point set is a set of points, lines, and polygons on the WGS84 reference spheroid, with geodesic edges. You can use the GEOGRAPHY data type by calling one of the standard SQL geography functions.

Loading geospatial data

Single points on Earth can be described by just a longitude, latitude pair. For example, you can load a CSV file that contains longitude and latitude values and then use the ST_GEOGPOINT function to convert them into GEOGRAPHY values.

For more complex geographies, you can load the following geospatial data formats into a GEOGRAPHY column:

  • Well-known text (WKT)
  • Well-known binary (WKB)
  • GeoJSON

Loading WKT or WKB data

WKT is a text format for describing individual geometry shapes using points, lines, polygons with optional holes, or a collection of points, lines, or polygons. WKB is the binary version of the WKT format.

For example, the following defines a point in WKT:

POINT(-121 41)

To describe a spatial feature, WKT is usually embedded in a container file format, such as a CSV file, or in a database table. A file row or a table row usually corresponds to the spatial feature. The whole file or the whole table corresponds to the feature collection. To load WKT data into BigQuery, provide a schema that specifies a GEOGRAPHY column for the geospatial data.

For example, you might have a CSV file that contains the following data:

"POLYGON((-124.49 47.35,-124.49 40.73,-116.49 40.73,-116.49 47.35,-124.49 47.35))",poly1
"POLYGON((-85.6 31.66,-85.6 24.29,-78.22 24.29,-78.22 31.66,-85.6 31.66))",poly2
"POINT(1 2)",point1

You can load this file by running the bq command-line tool load command:

bq load --source_format=CSV \
  --schema="geography:GEOGRAPHY,name:STRING" \
  mydataset.mytable filename1.csv

For more information about loading data in BigQuery, see Introduction to loading data.

You can also convert a WKT text string into a GEOGRAPHY value by using the ST_GeogFromText function.

Loading GeoJSON data

GeoJSON is a JSON-based format for geometries and spatial features. For example, the following defines a point in GeoJSON:

{ "type": "Point", "coordinates": [-121,41] }

GeoJSON data can contain any of the following object types:

  • Geometry objects. A geometry object is a spatial shape, described as a union of points, lines, and polygons with optional holes.
  • Feature objects. A feature object contains a geometry plus additional name/value pairs, whose meaning is application-specific.
  • Feature collections. A feature collection is a set of feature objects.

BigQuery GIS supports loading individual GeoJSON geometry objects embedded in other file types. For example, you can load a CSV file where one of the columns contains a GeoJSON geometry object.

BigQuery GIS does not support loading GeoJSON feature objects, feature collections, or the GeoJSON file format.

To load GeoJSON data into BigQuery, provide a schema that specifies a GEOGRAPHY column for the GeoJSON data. Format the geometry objects as text strings, not as JSON objects, even in the data file is a newline-delimited JSON file.

You can also convert a GeoJSON geometry object into a GEOGRAPHY value by using the ST_GEOGFROMGEOJSON function.

Coordinate systems and edges

In BigQuery GIS, points are positions on the surface of a WGS84 spheroid, expressed as longitude and geodetic latitude. An edge is a spherical geodesic between two endpoints. (That is, edges are the shortest path on the surface of a sphere.)

The WKT format does not provide a coordinate system. When loading WKT data, BigQuery GIS assumes the data uses WGS84 coordinates with spherical edges. Make sure your source data matches that coordinate system, unless the geographies are small enough that the difference between spherical and planar edges can be ignored.

GeoJSON explicitly uses WGS84 coordinates with planar edges. When loading GeoJSON data, BigQuery GIS converts planar edges to spherical edges. BigQuery GIS adds additional points to the line as necessary, so that the converted sequence of edges remains within 10 meters of the original line. This process is known as tessellation or non-uniform densification. You cannot directly control the tessellation process.

To load geographies with spherical edges, use WKT. To load geographies with planar edges, often called geometries, it's simplest to use GeoJSON. However, if your geometry data is already in WKT format, another option is to load the data as a STRING type and then use the ST_GEOGFROMTEXT function to convert to GEOGRAPHY values. Set the planar parameter to TRUE to interpret the data as planar.

When choosing an interchange format, be sure to understand the coordinate system used by your source data. Most systems either explicitly support parsing geography (as opposed to geometry) from WKT, or else they assume planar edges.

Your coordinates should be longitude first, latitude second. If the geography has any long segments or edges then they must be tessellated, because BigQuery GIS interprets them as spherical geodesics, which may not correspond to the coordinate system where your data originated.

Polygon orientation

On a sphere, every polygon has a complementary polygon. For example, a polygon that describes the Earth's continents would have a complementary polygon that describes the Earth's oceans. Because the two polygons are described by the same boundary rings, rules are required to resolve the ambiguity around which of the two polygons is described by a given WKT string.

When you load WKT and WKB strings from files or by using streaming ingestion, BigQuery GIS assumes the polygons in the input are oriented as follows: If you traverse the boundary of the polygon in the order of the input vertices, the interior of the polygon is on the left. BigQuery GIS uses the same rule when exporting geography objects to WKT and WKB strings.

If you use the ST_GeogFromText function to convert a WKT string to a GEOGRAPHY value, the oriented parameter specifies how the function determines the polygon:

  • FALSE: Interpret the input as the polygon with the smaller area. This is the default behavior.

  • TRUE: Use the left-hand orientation rule described previously. This option allows you to load polygons with an area larger than a hemisphere.

Because GeoJSON strings are defined on a planar map, the orientation can be determined without ambiguity, even if the input does not follow the orientation rule defined in the GeoJSON format specification, RFC 7946.

Handling improperly formatted spatial data

When you load spatial data from other tools into BigQuery, you might encounter conversion errors due to invalid WKT or GeoJSON data. For example, an error such as Edge K has duplicate vertex with edge N indicates that the polygon has duplicate vertices (besides the first and last).

To avoid formatting issues, you can use a function that generates standards-compliant output. For example, when you export data from PostGIS, you can use the PostGIS ST_MakeValid function to standardize the output. Alternatively, import your data as text and then convert it by calling ST_GEOGFROMTEXT or ST_GEOGFROMGEOJSON with the make_valid parameter. When make_valid is TRUE, these functions attempt to repair invalid polygons.

To find or to ignore the improperly formatted data, use the SAFE function prefix to output the problematic data. For example, the following query uses the SAFE prefix to retrieve improperly formatted spatial data.

SELECT
  geojson AS bad_geojson
FROM
  mytable
WHERE
  geojson IS NOT NULL
  AND SAFE.ST_GeogFromGeoJson(geojson) IS NULL

Constraints

BigQuery GIS does not support the following features in geospatial formats:

  • Three-dimensional geometries. This includes the "Z" suffix in the WKT format, and the altitude coordinate in the GeoJSON format.
  • Linear reference systems. This includes the "M" suffix in WKT format.
  • WKT geometry objects other than geometry primitives or multipart geometries. In particular, BigQuery GIS supports only Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon, and GeometryCollection.

See ST_GeogFromGeoJson and ST_GeogFromText for constraints specific to GeoJson and WKT input formats.

Transforming BigQuery GIS data

If your table contains separate columns for longitude and latitude, you can transform the values into geographies by using standard SQL geography functions such as ST_GeogPoint. For example, if you have two DOUBLE columns for longitude and latitude, you can create a geography column with the following query:

SELECT
  *,
  ST_GeogPoint(longitude, latitude) AS g
FROM
  mytable

BigQuery can convert WKT and GeoJSON strings to geography types. If your data is in another format such as Shapefiles, use an external tool to convert the data to a supported input file format, such as a CSV file, with GEOGRAPHY columns encoded as WKT or GeoJSON strings.

Partitioning and clustering BigQuery GIS data

You can partition and cluster tables that contain GEOGRAPHY columns. You can use a GEOGRAPHY column as a clustering column, but you cannot use a GEOGRAPHY column as a partitioning column.

If you store GEOGRAPHY data in a table and your queries filter data by using a spatial predicate, ensure that the table is clustered by the GEOGRAPHY column. This typically improves query performance and might reduce cost. A spatial predicate calls a boolean geography function and has a GEOGRAPHY column as one of the arguments. The following sample shows a spatial predicate that uses the ST_DWithin function:

WHERE ST_DWithin(geo, ST_GeogPoint(longitude, latitude), 100)

Using JOINs with spatial data

Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause. For example:

-- how many stations within 1 mile range of each zip code?
SELECT
    zip_code AS zip,
    ANY_VALUE(zip_code_geom) AS polygon,
    COUNT(*) AS bike_stations
FROM
    `bigquery-public-data.new_york.citibike_stations` AS bike_stations,
    `bigquery-public-data.geo_us_boundaries.zip_codes` AS zip_codes
WHERE ST_DWithin(
         zip_codes.zip_code_geom,
         ST_GeogPoint(bike_stations.longitude, bike_stations.latitude),
         1609.34)
GROUP BY zip
ORDER BY bike_stations DESC

Spatial joins perform better when your geography data is persisted. The example above creates the geography values in the query. It is more performant to store the geography values in a BigQuery table.

For example, the following query retrieves longitude, latitude pairs and converts them to geographic points. When you run this query, you specify a new destination table to store the query results:

SELECT
  *,
  ST_GeogPoint(pLongitude, pLatitude) AS p
FROM
  mytable

BigQuery implements optimized spatial JOINs for INNER JOIN and CROSS JOIN operators with the following standard SQL predicate functions:

Spatial joins are not optimized:

  • For LEFT, RIGHT or FULL OUTER joins
  • In cases involving ANTI joins
  • When the spatial predicate is negated

A JOIN that uses the ST_DWithin predicate is optimized only when the distance parameter is a constant expression.

Exporting spatial data

When you export spatial data from BigQuery, GEOGRAPHY column values are always formatted as WKT strings. To export data in GeoJSON format, use the ST_AsGeoJSON function.

If the tools you're using to analyze the exported data do not understand the GEOGRAPHY data type, you can convert the column values to strings using a geographic function such as ST_AsText or ST_AsGeoJSON. BigQuery GIS adds additional points to the line where necessary so that the converted sequence of edges remains within 10 meters of the original geodesic line.

For example, the following query uses ST_AsGeoJSON to convert GeoJSON values to strings.

SELECT
  ST_AsGeoJSON(ST_MakeLine(ST_GeogPoint(1,1), ST_GeogPoint(3,2)))

The resulting data would look like the following:

{ "type": "LineString", "coordinates": [ [1, 1], [1.99977145571783, 1.50022838764041], [2.49981908082299, 1.75018082434274], [3, 2] ] }

The GeoJSON line has two additional points. BigQuery GIS adds these points so that the GeoJSON line closely follows the same path on the ground as the original line.

What's next