Using GeoJSON in BigQuery for geospatial analytics
Solutions Engineering Manager
The first step in most analytical workloads is to ingest the data that you need for your analysis into your data warehouse. For geospatial analysis involving point, line, or polygon data, ingesting data can be complex because geospatial data comes in myriad data formats. Two of the most popular geospatial formats are GeoJSON and GeoJSON-NL (newline-delimited geoJSON).
This blog describes how to use the new GeoJSON-NL load functionality in BigQuery to load data from geoJSON-NL files directly into BigQuery tables. We’ll also cover the geoJSON-NL format and its benefits, show you some simple tools to transform various spatial data formats into GeoJSON-NL, and walk through a hands-on code example.
BigQuery’s geospatial support allows users to import and analyze GEOGRAPHY data types made up of points, lines, and polygons at massive scales right in their data warehouse, instead of specialized geospatial tools. To make it easier for you to use geospatial data with BigQuery, we recently added direct support for GeoJSON-NL files using
bq load and external tables.
What is a GeoJSON-NL file?
GeoJSON-NL stands for newline-delimited GeoJSON. GeoJSON-NL files consist of geoJSON geographic features, with each feature separated by a newline “\n” character. GeoJSON-NL files differ from GeoJSON files in that instead of a single “FeatureCollection” object with multiple “Feature” objects in a GeoJSON file, a newline-delimited GeoJSON file will have only “Feature” object, each on its own unique line.
Geojson format file
The preceding code block is a a geoJSON format file. The following shows an equivalent geoJSON-NL formatted file
An easy way to tell the difference between the two formats is to check if there is a “FeatureCollection” JSON object in the file - then it is GeoJSON. If there is exactly one “Feature” JSON object on each line, then it is GeoJSON-NL.
Why geoJSON-NL as a spatial format?
GeoJSON is a great spatial file format because it works in a broad number of applications across many platforms, and is a subset of JSON. However, the problem with GeoJSON is that all features are stored as a “FeatureCollection” object in a large array. Parsing a large array with large JSON strings is a performance bottleneck. All features stored in the array must be loaded and parsed into memory by a single CPU thread from the source file, leading to ETL jobs that fail due to out-of-memory errors. If there are issues with any one feature in a FeatureCollection, the entire parse & load job may fail.
Newline-delimited geoJSON solves the problems of standard geoJSON files by storing each feature on a unique line instead of in an array, enabling higher performance and efficiency for applications working with a large amount of spatial data, such as streaming IoT (example). Features can be read and written from multiple threads or a distributed process, and each feature can be parsed on it’s own, lowering the memory footprint and allowing more robust schema and geometry validation.
Converting spatial data to geoJSON-NL
You can convert your spatial data to geoJSON-NL format using a variety of open source and proprietary tools. The section below includes code examples for converting geoJSON, shapefiles, GPX, KML, and CSV files to geoJSON-NL. You can use the code examples below using the GCP Cloud Shell (easiest to start), your local development machine (Linux or MacOS based), or as a GCP Cloud Function using a Node.js or Python runtime environment.
- geojson2ndjson - a node.js command line tool for converting geoJSON to GeoJSON-NL.
Other libraries can be combined with
geojson2ndjson for converting spatial formats such as CSV, KML, shapefiles, and GPX into geoJSON-NL for use with BQ:
- shapefile - a node.js command line tool for converting shapefiles to other spatial formats, including geojson-nl using the
- togeojson - a node.js command line tool for converting KML, TCX, and GPX files to geoJSON. Example:
- csv2geojson - a node.js command line tool for converting CSV files to geoJSON. Specify the
--lonparameters to identify the latitude and longitude column header names. Example:
Fiona - a Python spatial package for reading and writing most common spatial data formats for conversion, analytics, and scripting workloads.
GDAL - a powerful command line spatial toolset, which includes the
ogr2ogrcommand for converting between most popular geospatial file formats and more complex geospatial functions such as coordinate system transformations.
- FME - a partner tool that allows ingestion of myriad types and formats. Super useful. Requires a commercial license.
Working with GeoJSON-NL in BigQuery
bq load works with the GCP Cloud SDK installed on your development machine, or via the GCP Cloud Console. The general structure to import a geoJSON-NL file into a bq table using bq load is:
As an example, we can use
bq load with the
--autodetect flag to ingest a geoJSON-NL file containing the single Feature using below, resulting in the following BQ table and schema:
To import a geoJSON-NL format file, specify the source format and json extension:
You can either autodetect or specify the schema of the table imported from a geoJSON-NL file. All data in the “geometry” keys of your features will be loaded into a table column named “geometry” with data type “GEOGRAPHY”. Other table field names and data types will be parsed from the “properties” object and “id” objects.
Alternatively, you can create an external table from a GeoJSON-NL file stored in Bigtable, Cloud Storage, or Google Drive. External tables are useful for “ELT” (extract-load-transform) workloads where you need to load and clean your data in one command, as well as for joining to external data that frequently updates, such as a real-time weather forecast. External tables have a number of limitations, so only use them if appropriate for your use case.
To create an external table from a geoJSON-NL file stored in a cloud storage bucket (docs), adapt the example SQL code below in the BQ console:
Hands-on code example - Airbnb listing analytics
Let’s get started analyzing geoJSON-NL data in BigQuery with an example. If you’d like to follow along, you can log into your BigQuery workspace here, and if you don’t have one yet you can get set up in less than 1 minute and for free with the BigQuery Sandbox.
Let’s say you’re planning a vacation to Hawaii, and you’d like to know which neighborhoods on which islands have the largest selection of listings in your price range. In order to do this analysis, you need listing stats and neighborhood boundary data from Airbnb. Fortunately, Airbnb makes its data available in aggregate under the creative commons license on data hubs like opendatasoft, which we’ll use for this example.
Prerequisites1. Access to BigQuery Sandbox and Cloud Console
Loading the data into BigQuery
2. Create a new GCS bucket with a unique name in your project with all the default settings. For this code example, we’ll use the bucket name “geojson-bq-example”, however note that you’ll need to create a unique bucket name and reference it for your example.
3. Create a dataset in your BQ workspace named “geojson_examples”
4. Open the BigQuery UI and activate Cloud Shell
5.1 There are two source data files:
5.1.a “listings data” file containing Airbnb listings in Hawaii in CSV format with specific metrics such as availability, price, and the listing location as longitude, latitude pairs.
5.1.b “neighborhood data” file in geoJSON format containing the names of each neighborhood in Hawaii along with their polygon geometries.
5.2 First, we’ll get the Airbnb listings data for Hawaii from opendatasoft with cURL, and then upload it to our Cloud Storage Bucket with gsutil.
5.2.a We do not need to transform the listings data CSV into geoJSON-NL format because it already contains “point” data as longitude and latitude data in a single column. We will transform the column to a GEOGRAPHY data type in the following step after we load the file into BQ.
5.2.b Enter the following commands into Cloud Console:
5.3. Then, we’ll get the Hawaii Airbnb neighborhood data in geojson format from opendatasoft using cURL, transform it from geojson to geoJSON-NL format using jq, and upload it to our Cloud Storage Bucket with gsutil.
5.3.a Enter the following commands into Cloud Shell:
6. Now that our source data is staged in Cloud Storage, we can load the data into BigQuery using
6.1 First load “airbnb-hi-neighborhoods.geojsonl” file from cloud storage into a BigQuery table named
geojson_examples.airbnb-hi-neighborhoods. Enter the following command into Cloud Shell:
6.2. Then, load the “airbnb-hi-listings.csv” file from the GCS bucket into a BigQuery table named
geojson_examples.airbnb-hi-listings. Enter the following command into Cloud Shell:
Analyzing the data in BigQuery
Now that our Airbnb data for the state of Hawaii is loaded into BQ, we can analyze it using BQ geospatial functions. First, let’s have a look at the neighborhoods in the lists:
1. Create a table with a native “geography” point column with clustering on the geography column from the “airbnb-hi-locations” table. This increases the speed of geospatial analytics such as a “point in polygon” query using ST_Contains() in the WHERE clause, which we’ll use in the next step to aggregate listing metrics by neighborhood geometry.
Note that the geometry data is in “latitude,longitude” format in the source file, so we reverse the order in the SQL query above when creating a GEOGRAPHY type point with ST_GEOGPOINT.
2. Check the result of the query by creating a visualization in GeoViz.
3. Execute a query to count the number of Airbnb listings in each HI neighborhood using the geographic function “ST_Contains”.
3.1. The “ST_Contains” function performs a spatial join between the geographic features. In this case, we’re checking the “point” of each Airbnb listing to determine if it’s within the geographic “polygon” of a neighborhood in Hawaii. If the point is within the neighborhood polygon, the function will return “true”, else it will return “false”.
3.2. We group the query by “neighborhood” and aggregate and return in descending order the number of listings in the group by clause with a “count” function.
Note the use of the
ANY_VALUE aggregate function to return a column with data type GEOGRAPHY in a query with a GROUP BY clause.
5. Now we can create a query to understand which neighborhood has the most listings, highest average room price, best availability, or most reviews for “entire home/apt” listings on Airbnb in Hawaii and visualize the results in GeoViz.
Creating a Data Studio dashboard
Now we’ll create an interactive dashboard in Data Studio to explore different questions from the data and determine the best neighborhood for our vacation.
1. After running the query from step 5 above, click Explore Data -> Data Studio from the results panel in BigQuery.
2. In the Data Studio GUI for the BigQuery custom SQL data source that just created, create a “Table” chart and drag / drop metrics from the “available fields” column to the “Metric” column.
3. Now we’ll create a filled map chart from the neighborhood polygon geometry. Click “Add a chart” -> “Filled Map” and configure the Fields Location, Geospatial field, and Color metric as per the screenshot below.
4. Now add two filters to the Data Studio dashboard - one for “avg price” and the other for “num listings”. These filters will enable the user to ask questions of the data to find neighborhoods that meet their vacation selection criteria.
5. With this Data Studio dashboard, we can answer our original question - which neighborhoods on which islands have the best selection of Airbnb’s in our price range? Filter the average listing price to suit our budget - below $300 per night and a large volume of Airbnb listing options in the neighborhood (>300 listings available).
The Primary Urban Center on O’ahu, Puna on Hawai’i, and Kapaa-Wallis on Kaua’i neighborhoods all meet our price and popularity criteria - so we’ll focus our Airbnb vacation rental search there.
A new feature in BigQuery allows users to load geoJSON-NL files directly into tables with GEOMETRY columns using
bq load. GeoJSON-NL file support makes it faster and easier to load point, linestring, and polygon spatial data from into your BigQuery analytics workloads.
Try working with geoJSON-NL data in BigQuery today.