
Scaling exactEarth’s satellite vessel-tracking service using GeoMesa on Google Cloud Platform

Google Cloud Bigtable, Google Cloud Dataproc and GeoMesa are the main components in this solution for global geospatial analytics

Automatic Identification Systems (AISs) are used by ships and vessel-tracking services to provide real-time location awareness for safe navigation and routing, as well as for historical analysis, much as air-traffic control systems do for aircraft. The data services company exactEarth uses a constellation of satellites to collect more than 7 million AIS vessel-position reports per day. That number is expected to grow by an order of magnitude when 58 new satellite payloads come online soon.

Over its history, exactEarth built a scalable architecture to give its customers real-time views of AIS vessel positions anywhere across the world's oceans, as well as insight into patterns of maritime activity through historical archives. These services contribute to collision avoidance, cargo tracking, navigation assistance, environmental protection and other applications. (It's worth noting that the data attributes and scalability requirements associated with this workload, such as location and time-series information, are similar to those in many spatial-temporal applications across industries such as retail and manufacturing.)

However, as a forward-thinking technology company, exactEarth is constantly exploring options for scaling its services to the next level. Recently, it partnered with machine-learning and geospatial analytics firm CCRi to help evaluate the viability of Google Cloud Platform (GCP) for managing and analyzing satellite-collected data more efficiently.

Global AIS vessel positions collected by exactEarth

In this post, we’ll summarize the resulting solution, including the role of the open source GeoMesa spatial-temporal framework.

What is GeoMesa?

GeoMesa is a suite of tools for bringing spatial-temporal data, real-time IoT and sensor workloads to the cloud. It can run on a variety of distributed databases including Apache Accumulo, Apache Cassandra, Apache HBase and the original massively scalable, columnar NoSQL database that inspired them all, Google Cloud Bigtable.

Cloud Bigtable is a managed service in the GCP ecosystem. It supports the Apache HBase API for seamless adoption and portability across self-hosted HBase instances or managed Bigtable instances. Bigtable scales to huge volumes of data, while maintaining low-latency access, by separating compute from storage and utilizing the Colossus distributed file system across any number of provisioned server nodes.

As we’ll describe next, the combination of Cloud Bigtable scalability and spatial-temporal analytic functionality in GeoMesa proved to be a key ingredient for exactEarth’s use case.

Solution details

As a first step, CCRi loaded one year of exactEarth data (spanning November 2015 through October 2016) into a GeoMesa instance on top of a 3-node Cloud Bigtable setup; this data amounted to 2.62 billion position reports. Through GeoMesa, users can query the AIS data using a variety of spatial relationships.
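
As a quick sanity check, the size of this archive is consistent with the daily collection rate quoted earlier. The short calculation below uses only the figures stated in this post; the variable names are ours.

```python
from datetime import date

# Figures taken from the text: 2.62 billion reports, November 2015 through October 2016.
total_reports = 2.62e9
days = (date(2016, 11, 1) - date(2015, 11, 1)).days  # the span includes Feb 29, 2016

reports_per_day = total_reports / days
print(f"{days} days, about {reports_per_day / 1e6:.2f} million reports per day")
```

The result is a little over 7 million reports per day, which matches the "more than 7 million" figure above.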

For example, to find the number of unique vessels that passed through the Panama Canal on any given day, the analyst can combine a bounding box with a temporal duration and subsequently observe approximately 40 ships. To drill down to a particular vessel, you'd enter a query with a predicate on the Maritime Mobile Service Identity (MMSI) number of the vessel of interest. In the latter case, GeoMesa's cost-based query optimizer will kick in and delegate the processing of the query to the secondary index table. With GeoMesa, you can also select which attributes get their own indexes and control the amount of data that gets stored in each additional index to trade off performance for storage costs, as needed.
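
To make the two kinds of query concrete, here is a minimal sketch that builds filter strings in ECQL, the GeoTools filter language that GeoMesa data stores accept. The attribute names `geom`, `Time` and `MMSI` follow the SQL later in this post; the bounding-box coordinates, the dates, the sample MMSI value and the assumption that MMSI is stored as a string are all illustrative, not taken from exactEarth's schema.

```python
from datetime import datetime, timezone

def bbox_during_filter(geom_attr, time_attr, bbox, start, end):
    """Build an ECQL filter combining a bounding box with a time window."""
    min_lon, min_lat, max_lon, max_lat = bbox
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601, UTC
    return (
        f"BBOX({geom_attr}, {min_lon}, {min_lat}, {max_lon}, {max_lat}) "
        f"AND {time_attr} DURING {start.strftime(fmt)}/{end.strftime(fmt)}"
    )

def vessel_filter(mmsi_attr, mmsi):
    """Build an ECQL equality predicate on a (string-typed) MMSI attribute."""
    escaped = str(mmsi).replace("'", "''")  # ECQL escapes quotes by doubling them
    return f"{mmsi_attr} = '{escaped}'"

# Rough, illustrative box around the Panama Canal and a one-day window.
canal_day = bbox_during_filter(
    "geom", "Time",
    (-80.0, 8.8, -79.5, 9.4),
    datetime(2016, 3, 1, tzinfo=timezone.utc),
    datetime(2016, 3, 2, tzinfo=timezone.utc),
)
one_vessel = vessel_filter("MMSI", "123456789")  # hypothetical MMSI
print(canal_day)
print(one_vessel)
```

The first filter is the kind that a spatio-temporal index can serve; the second is the kind that would be routed to an attribute index, assuming one has been configured for MMSI.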

After ingesting all the AIS data, CCRi set up a Google Compute Engine instance to run GeoServer, an open-source Open Geospatial Consortium (OGC) service provider that enables access to exactEarth data via standards such as Web Feature Service (WFS) and Web Map Service (WMS). Through these standard protocols, any OGC-compliant mapping client such as OpenLayers, Leaflet or QGIS can consume exactEarth's data. GeoMesa's optimizations for typical requests such as on-demand heat maps can also be triggered via standard styling rules.
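
As an illustration of what such a client request looks like, the sketch below assembles a WFS 2.0.0 GetFeature URL. The `service`, `version`, `request`, `typeNames` and `count` parameters come from the WFS standard, and `cql_filter` is a GeoServer vendor parameter; the host name and the layer name `ais:exactEarth` are hypothetical placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def wfs_getfeature_url(base_url, type_name, cql_filter, max_features=100):
    """Return a WFS 2.0.0 GetFeature URL that asks GeoServer for GeoJSON."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "outputFormat": "application/json",
        "count": max_features,     # cap on the number of returned features
        "cql_filter": cql_filter,  # GeoServer-specific filter parameter
    }
    return f"{base_url}?{urlencode(params)}"

url = wfs_getfeature_url(
    "https://geoserver.example.com/geoserver/wfs",  # hypothetical endpoint
    "ais:exactEarth",                               # hypothetical layer name
    "BBOX(geom, -80.0, 8.8, -79.5, 9.4)",
)
print(url)
```

Any HTTP client can then fetch the URL; the function only builds the request and does not contact a server.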

Heat map of AIS historical data generated by GeoMesa

To handle many user requests, GeoServer can be set up behind a GCP load balancer that will distribute OGC requests across a configurable cluster of instances. Combined with Bigtable’s load balancing of requests across Bigtable server instances, this setup can scale to high volumes of requests while maintaining good performance.

Schematic view of GeoMesa on GCP

GeoMesa and Cloud Dataproc at work

For analysis, GeoMesa provides deep integration with Apache Spark and the Spark SQL query optimizer (Catalyst). Google Cloud Dataproc is the managed service for Apache Hadoop and Spark clusters available via GCP. Because the GeoMesa API is consistent across Cloud Bigtable and Cloud Dataproc, as well as other components like Accumulo and Cassandra, Cloud Dataproc was a useful target architecture for this deployment.

GeoMesa extends Spark SQL to add spatial data types, topological SQL predicates, geometry functions and optimization rules for efficiently loading data out of a GeoMesa backend and into Spark's in-memory execution engine. Cloud Dataproc offers fast data loading and 90-second cluster spin-up time. Used in combination with Jupyter Notebook, GeoMesa and Cloud Dataproc provide an interactive, ad hoc analysis and visualization environment over massive spatio-temporal data sets.

Schematic view of GeoMesa/Cloud Dataproc integration

For example, it takes about 7 to 8 hours to traverse the Panama Canal. An analyst can determine the busiest hours of the day in the canal by executing the following SQL statement and then viewing a graph of the results using Vega-Lite charts.

  SELECT hour, avg(count) AS avgcount
  FROM (
    SELECT hour, day, count(distinct MMSI) AS count
    FROM (
      SELECT MMSI, dayofyear(Time) AS day, hour(Time) AS hour
      FROM exactEarth
      WHERE st_contains(st_geomFromWKT('POLYGON((-79.88 9.18,-79.83...))'), geom)
    )
    GROUP BY day, hour
  )
  GROUP BY hour
  ORDER BY hour
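
To show what this aggregation computes, here is a small plain-Python sketch of the same two steps: count distinct vessels per (day, hour), then average those counts per hour of day. It assumes the reports have already been restricted to the canal polygon; the sample MMSI values and timestamps are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

def busiest_hours(reports):
    """Average number of distinct vessels seen per hour of day.

    reports: iterable of (mmsi, timestamp) pairs inside the area of interest.
    """
    vessels = defaultdict(set)  # (day of year, hour) -> distinct MMSIs
    for mmsi, ts in reports:
        vessels[(ts.timetuple().tm_yday, ts.hour)].add(mmsi)

    counts_by_hour = defaultdict(list)  # hour -> one vessel count per day
    for (_day, hour), ids in vessels.items():
        counts_by_hour[hour].append(len(ids))

    return {h: sum(c) / len(c) for h, c in sorted(counts_by_hour.items())}

sample = [
    ("111", datetime(2016, 3, 1, 9, 5)),
    ("111", datetime(2016, 3, 1, 9, 40)),  # same vessel, same hour: counted once
    ("222", datetime(2016, 3, 1, 9, 15)),
    ("333", datetime(2016, 3, 2, 9, 30)),
    ("111", datetime(2016, 3, 1, 14, 0)),
]
averages = busiest_hours(sample)
print(averages)
```

As in the SQL, an hour with no reports on a given day contributes no row, so the average is taken only over days on which at least one vessel was seen in that hour.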

GeoMesa intercepts the st_contains(st_geomFromWKT($panama), geom) predicate and loads only the relevant data into the Spark DataFrame. From there, the full power and flexibility of Spark is available to the programmer or analyst to compute answers to complex questions like “Which vessels that traveled through the Panama Canal had the longest voyage in the past two months?”

  SELECT
    MMSI, line AS track,
    st_lengthSpheroid(line) AS length
  FROM (
    SELECT
      MMSI,
      collect_list(geom) AS list,
      st_makeLine(collect_list(geom)) AS line
    FROM (
      SELECT MMSI, Time, geom
      FROM exactEarth
      WHERE MMSI IN (<>)
      SORT BY Time ASC
    )
    GROUP BY MMSI
  )
  WHERE size(list) > 1
  ORDER BY length DESC
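
The idea behind this query (order one vessel's positions by time, join them into a line, measure the line) can be sketched in plain Python. Note one deliberate simplification: the sketch measures distance with the haversine formula on a sphere, whereas st_lengthSpheroid, as its name suggests, measures on a spheroid, so the numbers will differ slightly. The sample coordinates and timestamps are invented.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def haversine_km(p, q):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, p)
    lon2, lat2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def track_length_km(reports):
    """Length of one vessel's track.

    reports: iterable of (timestamp, (lon, lat)) pairs, in any order.
    """
    points = [pos for _, pos in sorted(reports, key=lambda r: r[0])]
    # Sum the distances between consecutive positions; a single point gives 0.
    return sum(haversine_km(a, b) for a, b in zip(points, points[1:]))

# Three reports along the equator, deliberately out of time order.
reports = [
    (datetime(2016, 9, 1, 12), (2.0, 0.0)),
    (datetime(2016, 9, 1, 10), (0.0, 0.0)),
    (datetime(2016, 9, 1, 11), (1.0, 0.0)),
]
length = track_length_km(reports)
print(f"{length:.2f} km")
```

Sorting by time before summing matters: taking the same three points in their arrival order would trace the track back on itself and overstate its length.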

To get even more out of this data, CCRi deployed an instance of its web-based, high-volume spatio-temporal visualization tool, Stealth. Static data maps don't do justice to the dynamic nature of spatio-temporal data, which is best appreciated through animation. Stealth can animate millions of records in a standard browser without the need for plugins or WebGL.

Check out the following short demo to get a sense of the data that exactEarth provides, and how visualization can lead to additional insights.


Next steps

As demonstrated here, GCP, and specifically Cloud Bigtable and Cloud Dataproc, proved to be a high-performance, highly interactive backend for exactEarth's high-volume, real-time needs. Furthermore, thanks to GeoMesa's standard API that works across multiple backends, exactEarth can migrate this application between cloud-based and on-premises architectures as it desires.
