BigQuery’s performance and scale mean that everyone gets to play
Daniel J. Lewis
Distinguished Data Scientist, Geotab
Senior Data Scientist - Team Lead, Geotab
Editor’s note: Today, we’re hearing from telematics solutions company Geotab about how BigQuery enables them to democratize data across their entire organization and reduce the complexity of their data pipelines.
Geotab’s telematics devices and an extensive range of integrated sensors and apps record a wealth of raw vehicle data, such as GPS, engine speeds, ambient air temperatures, driving patterns, and weather conditions. With the help of our telematics solutions, our customers gain insights that help them optimize fleet operations, improve safety, and reduce fuel consumption.
BigQuery sits at the heart of our platform as the data warehouse for our entire organization, ingesting data from our vehicle telematics devices and all customer-related data. Essentially, each of the nearly 3 billion raw data records that we collect every day across the organization goes into BigQuery, whatever its purpose.
In this post, we’ll share why we leverage BigQuery to accelerate our analytical insights, and how it’s helped us solve some of our most demanding data challenges.
Democratizing big data with ease
As a company, Geotab manages geospatial data, but the general scalability of our data platform is even more critical for us than specific geospatial features. One of our biggest goals is to democratize the use of data within the company. If someone has an idea to use data to inform some aspect of the business better, they should have the green light to do that whenever they want.
Nearly every employee within our organization has access to BigQuery to run queries related to the projects that they have permission to see. Analysts, VPs, data scientists, and even users who don’t typically work with data have access to the environment to help solve customer issues and improve our product offerings.
While we have petabytes of information, not everything is big—our tables range in size from a few megabytes up to several hundred terabytes. Of course, there are many tricks and techniques for writing performant queries in the BigQuery environment, but most users don’t have to worry about optimization, parallelization, or scalability.
The beauty of the BigQuery environment is that it handles all of that for us behind the scenes. If someone needs insight from data and isn’t a BigQuery expert, we want them to be as comfortable querying those terabytes as they are on smaller tables—and this is where BigQuery excels. A user can write a simple query just as easily on a billion rows as on 100 rows without once thinking about whether BigQuery can handle the load. It’s fast, reliable, and frees up our time to rapidly iterate on product ideation and data exploration.
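As an illustration, a typical exploratory query might look like the sketch below. The table and column names here are hypothetical, not Geotab’s actual schema; the point is that the same statement runs unchanged whether the table holds a hundred rows or a billion.

```sql
-- Hypothetical example: telematics.trip_records and its columns are
-- illustrative. BigQuery parallelizes this automatically; the query
-- reads the same at any table size.
SELECT
  vehicle_id,
  AVG(engine_speed_rpm) AS avg_engine_speed,
  COUNT(*) AS record_count
FROM telematics.trip_records
WHERE DATE(recorded_at) = CURRENT_DATE()
GROUP BY vehicle_id;
```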
Geotab has thousands of dashboards and scheduled queries constantly running to provide insights for various business units across the organization. While we do hit occasional performance and optimization bumps, most of the time, BigQuery races through everything without a hiccup. Also, the fact that BigQuery is optimized for performance on small tables means we can spread our operations and monitoring across the organization without too much thought—20% of the queries we run touch less than 6 MB of data while 50% touch less than 800 MB. That's why it's important that BigQuery excels not only at scale but at throughput for more bite-sized applications.
The confidence we have in BigQuery to handle these loads across so many disparate business units is part of why we continue to push for increasingly more teams to take a data-driven approach to their business objectives.
Reducing the complexity of the geospatial data pipeline
The ability of BigQuery to manage vast amounts of geospatial data has also changed our approach to data science. At the scale we are operating, with tens of petabytes of data, it’s not feasible for us to use anything other than BigQuery.
In the past, using open-source geospatial tools, we would hit limits at volumes of around 2.5 million data points. BigQuery allows us to model over 4 billion data points, which is game-changing. Even basic functions, such as ingesting and managing geospatial polygons, used to be a complex workflow to string together in Python with Dataflow. Now, those geographic data types are handled natively by BigQuery and can be streamed directly into a table.
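A minimal sketch of what this looks like in practice, assuming a hypothetical table of GPS readings (the names are illustrative, not Geotab’s schema): GEOGRAPHY is a native BigQuery column type, so points can be stored and queried directly, with no external Python or Dataflow step to stitch together.

```sql
-- Hypothetical sketch: table and column names are illustrative.
-- GEOGRAPHY is a native type, so geospatial data streams straight
-- into the table and is queried with built-in functions.
CREATE TABLE IF NOT EXISTS telematics.gps_points (
  vehicle_id STRING,
  recorded_at TIMESTAMP,
  location GEOGRAPHY
);

-- Find all readings inside a polygon (e.g., a city boundary).
SELECT vehicle_id, recorded_at
FROM telematics.gps_points
WHERE ST_WITHIN(
  location,
  ST_GEOGFROMTEXT(
    'POLYGON((-79.5 43.6, -79.2 43.6, -79.2 43.8, -79.5 43.8, -79.5 43.6))')
);
```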
Even better, all of the analytics, model building, and algorithm development can happen in that same environment—without ever leaving BigQuery. No other solution we’ve found provides geospatial model building and analytics at this scale in a single environment.
Here’s an example. We have datasets of vehicle movements through intersections. Even just a few years ago, we struggled to run an intersection dataset at scale and had to limit its use to one city at a time. Today, we are processing all the intersection data for the entire world every day without ever leaving BigQuery. Rather than worry about architecting a complex data pipeline across multiple tools, we can focus on what we want to do with the data and the business outcomes we are trying to achieve.
BigQuery is more than a data warehouse
We frequently deal with four or five billion data points in our analytics applications, and BigQuery functions like a data lake. It’s not just our SQL database—it also easily supports all of our unstructured data, such as BLOBs from our CRM systems, GIS data files, and images.
It’s been a fascinating experience to see SQL consuming more and more unstructured data and applying a more relational structure that makes it consumable and familiar to analysts with traditional database management skills.
A great example is BigQuery’s support for JSON functions, which allows us to take hierarchical, non-uniform metadata structures from sources like OpenStreetMap and store them natively in BigQuery with easy access to their descriptive keys and values. As a result, we can hire a wider range of analysts for roles across the business, not just PhD-level data scientists, knowing they can work effectively with the data in BigQuery.
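A hedged sketch of the pattern, with a hypothetical table of OpenStreetMap-style elements (the names are illustrative): BigQuery’s JSON functions pull scalar values out of non-uniform tag metadata without any upfront schema flattening.

```sql
-- Hypothetical sketch: geo.osm_elements and its tags column are
-- illustrative. JSON_VALUE extracts scalar fields by path, so
-- non-uniform metadata stays queryable without a fixed schema.
SELECT
  element_id,
  JSON_VALUE(tags, '$.highway')  AS road_type,
  JSON_VALUE(tags, '$.maxspeed') AS speed_limit
FROM geo.osm_elements
WHERE JSON_VALUE(tags, '$.highway') IS NOT NULL;
```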
Even within our data science team, most of the things that we needed Python to accomplish a few years ago can now be done in SQL. That allows us to spend more time deriving insights rather than managing extended parts of the data pipeline. We also leverage SQL capabilities, such as stored procedures, to run within the data warehouse and churn through billions of data points with a five-second latency.
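A stored procedure of this kind might be sketched as follows—names and schema are hypothetical, but the shape is standard BigQuery procedural SQL: the aggregation runs entirely inside the warehouse, with no data ever leaving BigQuery.

```sql
-- Hypothetical sketch of a BigQuery stored procedure; all table and
-- column names are illustrative.
CREATE OR REPLACE PROCEDURE telematics.refresh_daily_summary(run_date DATE)
BEGIN
  -- Aggregate the day's raw records inside the warehouse.
  CREATE OR REPLACE TABLE telematics.daily_summary AS
  SELECT
    vehicle_id,
    run_date AS summary_date,
    AVG(engine_speed_rpm) AS avg_engine_speed
  FROM telematics.trip_records
  WHERE DATE(recorded_at) = run_date
  GROUP BY vehicle_id;
END;

-- Invoked on a schedule:
CALL telematics.refresh_daily_summary(CURRENT_DATE());
```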
The ease of using SQL to access this data has made it possible to democratize data across our company and give everyone the opportunity to use data to improve outcomes for our customers and develop interesting new applications.
Reimagining innovation with Google Cloud
Over the years, we haven’t stayed with BigQuery because we have to—we want to. Google Cloud is helping us drive the insights that will fuel our future and the future of all organizations looking to raise the bar with data-driven insights and intelligence. BigQuery’s capabilities have continued to evolve along with our needs, with the addition of increasingly complex analytics, data science methodologies, geospatial support, and BQML.
BigQuery offers Geotab an environment with a unique ability to manage, transform, and analyze geospatial data at enormous scale. It also makes it possible to aggregate all kinds of other structured and unstructured data needed for our business into a single source of truth—against which all of our analytics can be performed.