Accelerate your data to AI journey with new features in BigQuery ML
Abhinav Khushraj
Product Manager, Google Cloud
Firat Tekiner
Product Management, Google
AI is at a tipping point. We are seeing the impact of AI across more and more industries and use cases. Organizations with varying levels of ML expertise are solving business-critical problems with AI — from creating compelling customer experiences, to optimizing operations, to automating routine tasks, these organizations learn to innovate faster and ultimately, get ahead in the marketplace. However, in many organizations, AI and machine learning systems are often separate and siloed from data warehouses and data lakes. This widens the data to AI gap, limiting data-powered innovation.
At Google Cloud, we have harnessed our years of experience in AI development to make the data-to-AI journey as seamless as possible for our customers. Google’s data cloud simplifies the way teams work with data. Our built-in AI/ML capabilities are designed to meet users where they are, with their current skills. And our infrastructure, governance, and MLOps capabilities help organizations to leverage AI at scale.
In this blog, we’ll share how you can simplify your ML workflows using BigQuery ML and Vertex AI and showcase the latest innovations in BigQuery ML.
Simplify machine learning workflows with BigQuery ML and Vertex AI
Organizations that follow a siloed approach to managing databases, analytics and machine learning often need to move data from one system to another. This leads to data duplication with no single source of truth and makes it difficult to adhere to security and governance requirements.
Additionally, when building ML pipelines, you need to train and deploy your models. Therefore you need to plan your infrastructure for scale. You also need to make sure that your ML models are tuned and optimized to run efficiently on your infrastructure. For example, you may need a large set of kubernetes clusters or access to GPU-based clusters so that you can train your models quickly. This forces organizations to hire highly skilled professionals with deep knowledge of Python, Java and other programming languages.
Google’s data cloud provides a unified data and AI solution to help you overcome these challenges and simplify your machine learning workflows. BigQuery’s serverless, scalable architecture helps you create a powerful single source of truth for your data. BigQuery ML brings machine learning capabilities directly into your data warehouse through a familiar SQL interface. BigQuery ML's native integration with Vertex AI allows you to leverage MLOps tooling to deploy, scale, and manage your models.
BigQuery ML and Vertex AI help accelerate the adoption of AI across your organization.
Easy data management: Manage ML workflows without moving data from BigQuery, eliminating security and governance problems. The ability to manage workflows within your datastore removes a big barrier to ML development and adoption.
Reduce infrastructure management overhead: BigQuery takes advantage of the massive scale of Google's compute and storage infrastructure. You don't need to manage huge clusters or HPC infrastructure to do ML effectively.
Remove skillset barrier: BigQuery ML is SQL based. This allows many model types to be directly available in SQL, such as regression, classification, recommender systems, deep learning, time series, anomaly detection, and more.
Deploy models and operationalize ML workflows: Vertex AI Model Registry makes it easy to deploy BigQuery ML models to a Vertex AI REST endpoint for online or batch predictions. Further, Vertex AI Pipelines automate your ML workflows, helping you reliably go from data ingestion to deploying your model in a way that lets you monitor and understand your ML system.
Get started with BigQuery ML in three steps
Step 1: Bring your data into BigQuery automatically via Pub/Sub in real time or in batch using BigQuery utilities or through one of our partner solutions. In addition, BigQuery can access data that may be residing in open source format such as Parquet/Hudi residing in object storage using BigLake. Learn more about loading data into BigQuery.
Step 2: Train a model by running a simple SQL query (create model) in BigQuery and point to the dataset. BigQuery is highly scalable in terms of compute and storage, whether it is a dataset with 1000 rows or billions of rows. Learn more about model training in BigQuery ML.
Step 3: Start running predictions. Use a simple SQL query to run predictions on the new data. There are a vast number of use cases supported through BigQuery ML such as demand forecasting, anomaly detection or even can be used for predicting new segments for your customer. Check out the list of supported models. Learn more about running predictions, detecting anomalies or predicting demand with forecasting.
Increase impact with new capabilities in BigQuery ML
At Next ‘22, we announced several innovations in BigQuery ML that help you to quickly and easily operationalize ML at scale. To get early access and check out these new capabilities, submit this interest form.
1. Scale with MLOps and pipelines
When you are training a lot of models across your organization, managing models, comparing results, and creating repeatable training processes can be incredibly difficult. New capabilities make it easier to operationalize and scale BigQuery ML models with Vertex AI’s MLOps capabilities.
Vertex AI Model Registry is now GA, providing a central place to manage and govern the deployment of all your models, including BigQuery ML models. You can use Vertex AI Model Registry for version control and ML metadata tracking, model evaluation and validation, deployment and model reporting. Learn more here.
Another capability that further helps operationalize ML at scale is Vertex AI Pipelines, a serverless tool for orchestrating ML tasks so that they can be executed as a single pipeline, rather than manually triggered each task (e.g. train a model, evaluate the model, deploy to an endpoint) separately. We are introducing more than 20 BigQuery ML components to simplify orchestrating BigQuery ML operations. This eliminates the need for developers and ML engineers to write their own custom components to invoke BigQuery ML jobs. Additionally, if you are Data Scientist who prefers running code over SQL, you can now use these operators to train and predict in BigQuery ML.
2. Derive insights from unstructured data
We recently announced the preview of object tables, a new table type in BigQuery that enables you to directly run analytics on unstructured data including images, audio, documents and other file types. Using the same underlying framework, BigQuery ML will now help you to unlock that value from unstructured data. You can now execute SQL on image data and predict results from machine learning models using BigQuery ML. For example, you can import either state of the art TensorFlow vision models (e.g. ImageNet and ResNet 50) or your own models to detect objects, annotate photos, extract text from images.
Learn more here and check out this demo of our customer Adswerve, a leading Google Marketing, Analytics and Cloud partner and their client Twiddy & Co, a vacation rental company in North Carolina, who combined structured and unstructured data using BigQuery ML to analyze images of rental listings and predict the click-through rate, enabling data-driven photo editorial decisions. In this work images attributed to 57% of the final prediction results.
3. Inference Engine
BigQuery ML acts an inference engine that works in a number of ways, including using existing models and can be extended to bring your own model:
BigQuery ML trained models
Imported models of various formats
Remote models (will be available soon)
BigQuery ML supports several models out-of-the-box. However, some customers want to inference with models that are already trained in other platforms. Therefore, we are introducing new capabilities that allow users to import models beyond TensorFlow into BigQuery ML, starting with TFLite and XGBoost.
Alternatively, if your model is too big to import (see current limitations here) or already deployed at an endpoint and you don't have the ability to bring that model into BigQuery, BigQuery ML now allows you to do inference on remote models ( resources that you've trained outside of Vertex AI, or that you've trained using Vertex AI and exported). You can deploy a model on Vertex AI or Cloud Functions and then use BigQuery ML to do prediction.
4. Faster, more powerful feature engineering
Feature preprocessing is one of the most important steps in developing a machine learning model. It consists of the creation of features and the cleaning of the data. Sometimes, the creation of features is also referred to as "feature engineering". In other words, Feature engineering is all about taking data and representing it in ways that model training results in great models. BQML performs automatic feature preprocessing during training, based on the feature data types. This consists of missing value imputation and feature transformations. Besides these, all numerical and categorical features will be CASTed to double and string, respectively, for BQML training and inference. We are taking feature engineering to the next level by introducing several new numerical functions (such as MAX_ABS_SCALER, IMPUTER, ROBUST_SCALER, NORMALIZER) and categorical functions (such as ONE_HOT_ENCODER, LABEL_ENCODER).
BigQuery ML supports two types of feature preprocessing:
Automatic preprocessing. BigQuery ML performs automatic preprocessing during training. For more information, see Automatic feature preprocessing.
Manual preprocessing. BigQuery ML provides the TRANSFORM clause for you to define custom preprocessing using the manual preprocessing functions. You can also use these functions outside the TRANSFORM clause.
Further, when you export BigQuery ML models by registering with Vertex AI Model Registry or manually, transform clauses will also be exported with it. This really simplifies online model deployment to Vertex.
5. Multivariate time series forecasting
Many BigQuery customers use the natively supported ARIMA PLUS model to forecast future demand and plan their business operations. Until now customers could forecast using only a single input variable. For example, to forecast ice cream sales, along with target metrics the past sales, customers could not forecast using external covariates such as weather. With this launch, users can now make more accurate forecasts by taking more than one variable into account through multivariate time series forecasting with ARIMA_PLUS_XREG (ARIMA_PLUS with external regressors (such as weather, location, etc).
Getting Started
Submit this form to try these new capabilities that help you accelerate your data to AI journey with BigQuery ML. Check out this video to learn more about these features and see a demo of how ML on structured and unstructured data can really transform marketing analytics.
Acknowledgements: It was an honor and privilege to work on this with Amir Hormati, Polong Lin, Candice Chen, Mingge Deng, Yan Sun. We further acknowledge Manoj Gunti, Shana Matthews and Neama Dadkhahnikoo for support, work they have done and their inputs.