Geospatial analytics architecture

This document helps you understand Google Cloud geospatial capabilities and how you can use these capabilities in your geospatial analytics applications. This document is intended for geographic information systems (GIS) professionals, data scientists, and application developers who want to learn how to use the products and services available in Google Cloud to deliver geospatial insights to business stakeholders.

Overview

Google Cloud provides a comprehensive suite of geospatial analytics and machine learning capabilities that can help you develop insights to understand more about the world, your environment, and your business. Geospatial insights that you get from these Google Cloud capabilities can help you make more accurate and sustainable business decisions without the complexity and expense of managing traditional GIS infrastructure.

Geospatial analytics use cases

Many critical business decisions revolve around location data. Insights gleaned from geospatial analytics are applicable across a number of industries, businesses, and markets, as described in the following examples:

  • Assessing environmental risk. Understand the risks posed by environmental conditions by predicting natural disasters like flooding and wildfires, which can help you more effectively anticipate risk and plan for it.
  • Optimizing site selection. Combine proprietary site metrics with publicly available data like traffic patterns and geographic mobility, and then use geospatial analytics to find the optimum locations for your business and to predict financial outcomes.
  • Planning logistics and transport. Better manage fleet operations such as last-mile logistics, analyze data from autonomous vehicles, manage precision railroading, and improve mobility planning by incorporating geospatial data into business decision-making.
  • Understanding and improving soil health and yield. Analyze millions of acres of land to understand soil characteristics and help farmers analyze the interactions among variables that affect crop production.
  • Managing sustainable development. Map economic, environmental, and social conditions to determine focus areas for protecting and preserving the environment.

Geospatial cloud building blocks

Your geospatial analytics architecture can consist of one or more geospatial cloud components, depending on your use case and requirements. Each component provides different capabilities, and these components work together to form a unified, scalable geospatial cloud analytics architecture.

Data is the raw material for delivering geospatial insights. Quality geospatial data is available from a number of public and proprietary sources. Public data sources include BigQuery public datasets, the Earth Engine catalog, and the United States Geological Survey (USGS). Proprietary data sources include internal systems such as SAP and Oracle, and internal GIS tooling such as Esri ArcGIS Server, Carto, and QGIS. You can aggregate data from multiple business systems, such as inventory management, marketing analytics, and supply chain logistics, and then combine that data with geospatial source data and send the results to your geospatial data warehouse.

Depending on a source's data type and destination, you might be able to load geospatial data sources directly into your analytics data warehouse. For example, BigQuery has built-in support for loading newline-delimited GeoJSON files, and Earth Engine has an integrated data catalog with a comprehensive collection of analysis-ready datasets. You can load data in other formats through a geospatial data pipeline that preprocesses the geospatial data and loads it into your enterprise data warehouse in Google Cloud. You can build production-ready data pipelines using Dataflow. Alternatively, you can use a partner solution such as FME Spatial ETL.
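As an illustration of the newline-delimited GeoJSON format mentioned above, the following minimal sketch converts a GeoJSON FeatureCollection into one Feature per line by using only the Python standard library. The function name and the sample data are hypothetical. The sketch covers only the file conversion; loading the result into BigQuery is a separate step that isn't shown.

```python
import json


def to_newline_delimited_geojson(feature_collection: dict) -> str:
    """Serialize each Feature of a FeatureCollection on its own line."""
    if feature_collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    # One compact JSON object per line, with no line breaks inside a feature.
    return "".join(
        json.dumps(feature, separators=(",", ":")) + "\n"
        for feature in feature_collection.get("features", [])
    )


# Hypothetical sample data: two store locations as [longitude, latitude].
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-122.08, 37.42]},
         "properties": {"store_id": "A1"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
         "properties": {"store_id": "B2"}},
    ],
}

ndjson = to_newline_delimited_geojson(sample)
print(ndjson, end="")
```

Each output line is a complete Feature object, so the file can be split and processed line by line.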

The enterprise data warehouse is the core of your geospatial analytics platform. After geospatial data is loaded into your data warehouse, you can start building geospatial applications and insights by using capabilities such as geospatial queries, machine learning, and visualization and reporting, which are described later in this document.

Your architecture then serves as a single system that you can use to store, process, and manage data at scale. The architecture also lets you build and deploy advanced analytics solutions that can produce insights that are not feasible on systems that don't include these features.

Geospatial data types, formats, and coordinate systems

To aggregate your geospatial data into a data warehouse like BigQuery, you must understand the geospatial data formats that you're likely to encounter in internal systems and from public sources.

Data types

Geospatial data types fall into two categories: vector and raster.

Vector data is composed of vertices and line segments, as shown in the following diagram.

Examples of vector images (point, linestring, polygon, multi-polygon, and collections).

Examples of vector data include parcel boundaries, public rights-of-way (roads), and asset locations. Because vector data can be stored in a tabular (row and column) format, geospatial databases such as BigQuery and PostGIS in Cloud SQL excel at storing, indexing, and analyzing vector data.

Raster data is composed of grids of pixels. Examples of raster data include atmospheric measurements and satellite imagery, as shown in the following examples.

Examples of raster images showing aerial photos of geographic areas.

Earth Engine is designed for planetary-scale storage and analysis of raster data. Earth Engine includes the ability to vectorize rasters, which can help you classify regions and understand patterns in raster data. For example, by analyzing atmospheric raster data over time, you can extract vectors that represent prevailing wind currents. You can load each individual raster pixel into BigQuery by using a process called polygonization, which converts each pixel directly to a vector shape.
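To make the idea of polygonization concrete, the following sketch converts each pixel of a small in-memory grid into a square ring of coordinates. It is a simplified illustration, not how Earth Engine or any particular library implements the process. It assumes a north-up raster with square pixels, and the function name, parameters, and sample values are hypothetical.

```python
from typing import Iterator, List, Tuple

Ring = List[Tuple[float, float]]  # closed ring of (x, y) coordinates


def polygonize(grid: List[List[float]], origin_x: float, origin_y: float,
               pixel_size: float) -> Iterator[Tuple[float, Ring]]:
    """Yield (pixel_value, ring) for every pixel of a north-up raster.

    origin_x and origin_y give the top-left corner of the grid.
    """
    for row_idx, row in enumerate(grid):
        for col_idx, value in enumerate(row):
            left = origin_x + col_idx * pixel_size
            right = left + pixel_size
            top = origin_y - row_idx * pixel_size
            bottom = top - pixel_size
            # Counterclockwise ring that ends where it starts.
            ring = [(left, bottom), (right, bottom), (right, top),
                    (left, top), (left, bottom)]
            yield value, ring


# Hypothetical 2x2 elevation grid whose top-left corner is at (0.0, 2.0).
cells = list(polygonize([[10.0, 20.0], [30.0, 40.0]], 0.0, 2.0, 1.0))
for value, ring in cells:
    print(value, ring)
```

Because every pixel becomes its own row of (value, polygon), the row count grows with the raster resolution, which is worth considering before you polygonize large images.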

Geospatial cloud applications often combine both types of data to produce holistic insights that leverage the strengths of data sources from each category. For example, a real-estate application that helps identify new development sites might combine vector data such as parcel boundaries with raster data such as elevation data to minimize flood risk and insurance costs.

Data formats

The following table lists popular geospatial data formats and ways in which they can be used in your analytics platform.

Data source format | Description | Examples
Shapefile | A vector data format that was developed by Esri. It lets you store geometric locations and associate attributes. | Census tract geometries, building footprints
WKT | A human-readable vector data format that's published by OGC. Support for this format is built into BigQuery. | Representation of geometries in CSV files
WKB | A storage-efficient binary equivalent of WKT. Support for this format is built into BigQuery. | Representation of geometries in CSV files and databases
KML | An XML-compatible vector format used by Google Earth and other desktop tools. The format is published by OGC. | 3D building shapes, roads, land features
GeoJSON | An open vector data format that's based on JSON. | Features in web browsers and mobile applications
GeoTIFF | A widely used raster data format. This format lets you map pixels in a TIFF image to geographic coordinates. | Digital elevation models, Landsat
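As a small illustration of the WKT row in the table, the following sketch formats coordinates as WKT text and writes a polygon into a CSV column. The helper names and sample values are hypothetical. Coordinates are assumed to be in (longitude, latitude) order.

```python
import csv
import io
from typing import Sequence, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)


def _fmt(coord: Coord) -> str:
    lon, lat = coord
    # repr() keeps the full precision of each float.
    return f"{lon!r} {lat!r}"


def point_wkt(coord: Coord) -> str:
    return f"POINT({_fmt(coord)})"


def linestring_wkt(coords: Sequence[Coord]) -> str:
    return "LINESTRING(" + ", ".join(_fmt(c) for c in coords) + ")"


def polygon_wkt(ring: Sequence[Coord]) -> str:
    points = list(ring)
    if len(points) < 3:
        raise ValueError("a polygon ring needs at least three points")
    if points[0] != points[-1]:
        points.append(points[0])  # WKT rings must be closed
    return "POLYGON((" + ", ".join(_fmt(c) for c in points) + "))"


# Write one hypothetical parcel to CSV. The csv module quotes the WKT
# value because it contains commas.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["parcel_id", "geometry"])
writer.writerow(["p-1", polygon_wkt([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])])
csv_text = buffer.getvalue()
print(csv_text)
```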

Coordinate reference systems

All geospatial data, regardless of type and format, includes a coordinate reference system that lets geospatial analytics tools such as BigQuery and Earth Engine associate coordinates with a physical location on the earth's surface. There are two basic types of coordinate reference systems: geodesic and planar.

Geodesic data takes the curvature of the earth into account, and uses a coordinate system based on geographic coordinates (longitude and latitude). Geodesic shapes are commonly referred to as geographies. The WGS 84 coordinate reference system that's used by BigQuery is a geodesic coordinate system.
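To show what taking the curvature of the earth into account means in practice, the following sketch computes a great-circle distance with the haversine formula. It assumes a spherical earth with a mean radius of about 6,371 km. This is an approximation for illustration only, and the function name is hypothetical. Analytics tools might use a different earth model and return slightly different values.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean radius of a spherical earth


def great_circle_distance_m(lon1: float, lat1: float,
                            lon2: float, lat2: float) -> float:
    """Haversine distance in meters between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude covers less ground at higher latitudes.
at_equator = great_circle_distance_m(0.0, 0.0, 1.0, 0.0)
at_60_north = great_circle_distance_m(0.0, 60.0, 1.0, 60.0)
print(round(at_equator), round(at_60_north))
```

The same one-degree difference in longitude spans roughly half the distance at 60 degrees north as it does at the equator, which a planar calculation on raw coordinates would miss.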

Planar data is based on a map projection such as Mercator that maps geographic coordinates to a two-dimensional plane. To load planar data into BigQuery, you need to reproject planar data into the WGS 84 coordinate system. You can do this reprojection manually by using your existing GIS tooling, or by using a geospatial cloud data pipeline (see the next section).
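As a rough sketch of what reprojection involves, the following example converts Web Mercator (EPSG:3857) coordinates in meters to WGS 84 longitude and latitude in degrees by using the standard inverse spherical Mercator formulas. Web Mercator is chosen here only as a common example of a planar system, and the function name is hypothetical. For other projections, or for production use, you would typically rely on existing GIS tooling or a projection library rather than hand-written math.

```python
import math
from typing import Tuple

WGS84_SEMI_MAJOR_AXIS_M = 6_378_137.0  # sphere radius used by Web Mercator


def web_mercator_to_wgs84(x: float, y: float) -> Tuple[float, float]:
    """Convert EPSG:3857 meters to (longitude, latitude) in degrees."""
    lon = math.degrees(x / WGS84_SEMI_MAJOR_AXIS_M)
    lat = math.degrees(
        2 * math.atan(math.exp(y / WGS84_SEMI_MAJOR_AXIS_M)) - math.pi / 2
    )
    return lon, lat


# The projected origin maps to longitude 0, latitude 0.
print(web_mercator_to_wgs84(0.0, 0.0))

# The right and top edges of the Web Mercator square.
edge = math.pi * WGS84_SEMI_MAJOR_AXIS_M
print(web_mercator_to_wgs84(edge, edge))
```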

Considerations for building a geospatial cloud data pipeline

As noted, you can load some geospatial data directly into BigQuery and Earth Engine, depending on data type. BigQuery lets you load vector data in the WKT, WKB, and GeoJSON file formats if the data uses the WGS 84 reference system. Earth Engine integrates directly with the data that's available in the Earth Engine catalog and supports loading raster images directly in the GeoTIFF file format.

You might encounter geospatial data that's stored in other formats and that can't be loaded directly into BigQuery. Or the data might be in a coordinate reference system that you must first reproject into the WGS 84 reference system. Similarly, you might encounter data that needs to be preprocessed, simplified, and corrected for errors.

You can load preprocessed geospatial data into BigQuery by building geospatial data pipelines using Dataflow. Dataflow is a managed analytics service that supports streaming and batch processing of data at scale.

To add geospatial processing capabilities to Dataflow, you can use the geobeam Python library, which extends Apache Beam. The library lets you read geospatial data from a variety of sources. The library also helps you process and transform the data and load it into BigQuery to use as your geospatial cloud data warehouse. The geobeam library is open source, so you can modify and extend it to support additional formats and preprocessing tasks.

Using Dataflow and the geobeam library, you can ingest and analyze massive amounts of geospatial data in parallel. The geobeam library works by implementing custom I/O connectors. The geobeam library includes GDAL, PROJ, and other related libraries to make it easier to process geospatial data. For example, geobeam automatically reprojects all input geometries to the WGS 84 coordinate system used by BigQuery to store, cluster, and process spatial data.

The geobeam library follows Apache Beam design patterns, so your spatial pipelines work similarly to non-spatial pipelines. The difference is that you use the geobeam custom FileBasedSource classes to read from spatial source files. You can also use the built-in geobeam transform functions to process your spatial data, and you can implement your own functions.

The following example shows how you can create a pipeline that reads a raster file, polygonizes the raster, reprojects it to WGS 84, and writes the polygons to BigQuery.

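import apache_beam as beam
import geobeam.fn
from geobeam.io import GeotiffSource

# pipeline_options and known_args are assumed to come from your own
# command-line argument parsing; they are not defined in this snippet.
with beam.Pipeline(options=pipeline_options) as p:
  (p
   # Read the GeoTIFF; geobeam polygonizes it and reprojects to WGS 84.
   | beam.io.Read(GeotiffSource(known_args.gcs_url))
   | 'MakeValid' >> beam.Map(geobeam.fn.make_valid)
   | 'FilterInvalid' >> beam.Filter(geobeam.fn.filter_invalid)
   | 'FormatRecords' >> beam.Map(geobeam.fn.format_record,
       known_args.band_column, known_args.band_type)
   | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('DATASET.TABLE'))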
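The pipeline above refers to known_args and pipeline_options without showing where they come from. One possible way to produce them, sketched here with the standard argparse module, is to parse the flags that the pipeline reads and pass everything else on to Apache Beam. The flag names mirror the attributes used in the example, but the defaults and help text are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Split pipeline-specific flags from flags intended for Apache Beam."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gcs_url", required=True,
                        help="Cloud Storage URL of the raster file to read")
    # Hypothetical defaults; choose values that match your raster and table.
    parser.add_argument("--band_column", default="value",
                        help="Name of the output column for the band value")
    parser.add_argument("--band_type", default="float",
                        help="Type to use for the band value")
    # Flags that this parser does not recognize (for example, --runner)
    # are returned separately so that they can be handed to Beam.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, pipeline_args


known_args, pipeline_args = parse_args(
    ["--gcs_url", "gs://example-bucket/dem.tif", "--runner", "DataflowRunner"]
)
print(known_args.gcs_url, pipeline_args)
```

In a real pipeline, you would then typically build the options object from the leftover flags, for example by passing pipeline_args to the Apache Beam PipelineOptions class.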

Geospatial data analysis in BigQuery

When the data is in BigQuery, you can transform, analyze, and model the data. For example, you can query the average elevation of a land parcel by computing the intersection of the parcel geography with elevation geographies (such as polygonized raster pixels) and joining the two tables using standard SQL. BigQuery offers many functions that let you construct new geography values, compute the measurements of geographies, explore the relationship between two geographies, and more. You can do hierarchical geospatial indexing with S2 grid cells using BigQuery S2 functions. In addition, you can use the machine learning features of BigQuery ML to identify patterns in the data, such as creating a k-means machine learning model to cluster geospatial data.
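The average-elevation example could be expressed as a spatial join along the following lines. This sketch only builds the SQL text in Python. The table names and the parcel_id, elevation, and geometry columns are hypothetical, and the query computes a simple unweighted average over every elevation polygon that intersects a parcel.

```python
def build_avg_elevation_query(parcels_table: str, elevation_table: str) -> str:
    """Return SQL that averages elevation over the pixels touching each parcel.

    Table names are interpolated directly, so pass only trusted values.
    """
    return f"""
SELECT
  p.parcel_id,
  AVG(e.elevation) AS avg_elevation
FROM `{parcels_table}` AS p
JOIN `{elevation_table}` AS e
  ON ST_INTERSECTS(p.geometry, e.geometry)
GROUP BY p.parcel_id
""".strip()


# Hypothetical table names.
query = build_avg_elevation_query(
    "my_project.geo.parcels", "my_project.geo.elevation_pixels"
)
print(query)
```

If you need an area-weighted average instead, you would weight each elevation value by the area of its intersection with the parcel, which this sketch does not attempt.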

Geospatial visualization, reports, and deployment

Google Cloud provides several options for visualizing and reporting your spatial data and insights in order to deliver them to users and applications. The methods you use to represent your spatial insights depend on your business requirements and objectives. Not all spatial insights are represented graphically. Many insights are best delivered through an API service like Apigee, or by saving them into an application database like Firestore so that the insights can power features in your user-facing applications.

While you're testing and prototyping your geospatial analyses, you can use BigQuery GeoViz as a way to validate your queries and to generate a visual output from BigQuery. For business intelligence reporting, you can use Looker Studio (formerly Data Studio) or Looker to connect to BigQuery and combine your geospatial visualizations with a wide variety of other report types in order to present a unified view of the insights you need.

You can also build applications that let your users interact with geospatial data and insights and incorporate those insights into your business applications. For example, by using the Google Maps Platform, you can combine geospatial analytics, machine learning, and data from the Maps API into a single map-based application. By using open source libraries like deck.gl, you can include high-performance visualizations and animations to tell map-based stories and better represent your data.
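Map libraries commonly accept GeoJSON as input, so one way to hand query results to a map-based application is to wrap them in a FeatureCollection. The following sketch assumes that each result row is a dictionary whose geometry has already been serialized as a GeoJSON string in the query. The function name, the geometry_geojson key, and the sample row are hypothetical.

```python
import json
from typing import Dict, Iterable


def rows_to_feature_collection(rows: Iterable[Dict],
                               geometry_key: str = "geometry_geojson") -> Dict:
    """Wrap result rows in a GeoJSON FeatureCollection."""
    features = []
    for row in rows:
        # Every column other than the geometry becomes a feature property.
        properties = {k: v for k, v in row.items() if k != geometry_key}
        features.append({
            "type": "Feature",
            "geometry": json.loads(row[geometry_key]),
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical query result with one candidate site.
rows = [
    {"site_id": "s-1", "score": 0.82,
     "geometry_geojson": '{"type": "Point", "coordinates": [-122.08, 37.42]}'},
]
collection = rows_to_feature_collection(rows)
print(json.dumps(collection, indent=2))
```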

Google also has a robust and growing ecosystem of partner offerings that can help you make the most of your geospatial insights. Carto, NGIS, Climate Engine, and others each have specialized capabilities and offerings that you can customize to your industry and business.

Reference architecture

The following diagram shows a reference architecture that illustrates how the geospatial cloud components interact. The architecture has two key components: the geospatial data pipeline and the geospatial analytics platform.

Architecture that shows flow from a data source (Earth Engine or Cloud Storage) through a pipeline based on Dataflow and putting the results in BigQuery.

As the diagram shows, geospatial source data is loaded into Cloud Storage and Earth Engine. From either of these products, the data can be loaded through a Dataflow pipeline that uses geobeam to perform common preprocessing operations such as feature validation and geometry reprojection. Dataflow writes the pipeline output into BigQuery. When the data is in BigQuery, it can be analyzed in place using BigQuery analytics and machine learning, or it can be accessed by other services such as Looker Studio (formerly Data Studio), Looker, Vertex AI, and Apigee.

What's next