Unifying data and AI to bring unstructured data analytics to BigQuery
Group Product Manager
Senior Staff Software Engineer
Over one third of organizations believe that data analytics and machine learning have the most potential to significantly alter the way they run business over the next 3 to 5 years. However, only 26% of organizations are data driven. One of the biggest reasons for this gap is that a major portion of the data generated today is unstructured, which includes images, documents, and videos. It is estimated to cover roughly up to 80% of all data, which has so far remained untapped by organizations.
One of the goals of Google’s data cloud is to help customers realize value from data of all types and formats. Earlier this year, we announced BigLake, which unifies data lakes and warehouses under a single management framework, enabling you to analyze, search, secure, govern and share unstructured data using BigQuery.
At Next ‘22, we announced the preview of object tables, a new table type in BigQuery that provides a structured record interface for unstructured data stored in Google Cloud Storage. This enables you to directly run analytics and machine learning on images, audio, documents and other file types using existing frameworks like SQL and remote functions natively in BigQuery itself. Object tables also extend our best practices of securing, sharing and governing structured data to unstructured, without needing to learn or deploy new tools.
Directly process unstructured data using BigQuery ML
Object tables contain metadata such as URI (Uniform Resource Identifier), content type, and size that can be queried just like other BigQuery tables. You can then derive inferences using machine learning models on unstructured data with BigQuery ML. As part of preview, you can import open source TensorFlow Hub image models, or your own custom models to annotate the images. Very soon, we plan to enable this for audio, video, text and many other formats, and pre-trained models to enable out-of-the box analysis. Check out this video to learn more and watch a demo.
By analyzing unstructured data natively in BigQuery, businesses can
Eliminate manual effort as pre-processing steps such as tuning image sizes to model requirements are automated
Leverage the simple and familiar SQL interface to quickly gain insights
Save costs by utilizing existing BigQuery slots without needing to provision new forms of compute
Adswerve is a leading Google Marketing, Analytics and Cloud partner on a mission to humanize data. Twiddy & Co. is Adswerve’s client - a vacation rental company in North Carolina. By combining structured and unstructured data, Twiddy and Adswerve used BigQuery ML to analyze images of rental listings and predict the click-through rate, enabling data-driven photo editorial decisions.
"Twiddy now has the capability to use advanced image analysis to stay competitive in an ever changing landscape of vacation rental providers – and can do this using their in-house SQL skills." said Pat Grady, Technology Evangelist, Adswerve
Process unstructured data using remote functions
Customers today use remote functions (UDFs) to process structured data for languages and libraries that are not supported in BigQuery. We are extending this capability to process unstructured data using object tables.
Object tables provide signed URLs to allow remote UDFs running on Cloud Functions or Cloud Run to process the object table content. This is particularly useful for running Google’s pre-trained AI models, including Vision AI, Speech-to-Text, Document AI, open source libraries such as Apache Tika, or deploying your own custom models where performance SLAs are important.
Here’s an example of an object table being created over PDF files that are parsed using an open source library running as a remote UDF.
Extending more BigQuery capabilities to unstructured data
Business intelligence - The results of analyzing unstructured data either directly in BigQuery ML or via UDFs can be combined with your structured data to build unified reports using Looker Studio (at no charge), Looker or any of your preferred BI solutions. This allows you to gain more comprehensive business insights. For example, online retailers can analyze product return rates by correlating them with the images of defective products. Similarly, digital advertisers can correlate ad performance with various attributes of ad creatives to make more informed decisions.
BigQuery search index - Customers are increasingly using the search functionality of BigQuery to power search use cases. These capabilities now extend to unstructured data analytics as well. Whether you use BigQueryML to produce inference on images or use remote UDFs with Doc AI to produce document extraction, the results can now be search indexed and used to support search access patterns.
Here’s an example of search index on data that is parsed from PDF files:
Security and governance - We are extending BigQuery’s row-level security capabilities to help you secure objects in Google Cloud Storage. By securing specific rows in an object table, you can restrict the ability of end users to retrieve the signed URLs of corresponding URIs present in the table. This is a shared responsibility security model, for which administrators need to ensure that end users don’t have direct access to Google Cloud Storage, and use signed URLs from object tables as the only access mechanism.
Here’s an example of a policy for PII images that are secured to be first processed through a blur pipeline:
Soon, Dataplex will support object tables, allowing you to automatically create object tables in BigQuery and manage and govern unstructured data at scale.
Data sharing - You can now use Analytics Hub to share unstructured data with partners, customers and suppliers while not compromising on security and governance. Subscribers can consume the rows of object tables that are shared with them, and use signed URLs for unstructured data objects.
Special thanks to engineering leaders Amir Hormati, Justin Levandoski and Yuri Volobuev for contributing to this post.