Introduction to AI and ML in BigQuery

BigQuery ML lets you create and run machine learning (ML) models by using GoogleSQL queries. BigQuery ML models are stored in BigQuery datasets, similar to tables and views. BigQuery ML also lets you access Vertex AI models and Cloud AI APIs to perform artificial intelligence (AI) tasks like text generation or machine translation. Gemini for Google Cloud also provides AI-powered assistance for BigQuery tasks. To see a list of AI-powered features in BigQuery, see Gemini in BigQuery overview.

Usually, performing ML or AI on large datasets requires extensive programming and knowledge of ML frameworks. These requirements restrict solution development to a very small set of people within each company, and they exclude data analysts who understand the data but have limited ML knowledge and programming expertise. However, with BigQuery ML, SQL practitioners can use existing SQL tools and skills to build and evaluate models, and to generate results from LLMs and Cloud AI APIs.

You can work with BigQuery ML capabilities by using the following:

  • The Google Cloud console
  • The bq command-line tool
  • The BigQuery REST API
  • Integrated Colab Enterprise notebooks in BigQuery
  • External tools such as a Jupyter notebook or business intelligence platform

Advantages of BigQuery ML

BigQuery ML offers several advantages over other approaches to using ML or AI with a cloud-based data warehouse:

  • BigQuery ML democratizes the use of ML and AI by empowering data analysts, the primary data warehouse users, to build and run models using existing business intelligence tools and spreadsheets. Predictive analytics can guide business decision-making across the organization.
  • You don't need to program an ML or AI solution using Python or Java. You train models and access AI resources by using SQL—a language that's familiar to data analysts.
  • BigQuery ML increases the speed of model development and innovation by removing the need to move data from the data warehouse. Instead, BigQuery ML brings ML to the data, which offers the following advantages:

    • Reduced complexity because fewer tools are required.
    • Increased speed to production because moving and formatting large amounts of data for Python-based ML frameworks isn't required to train a model in BigQuery.

    For more information, watch the video How to accelerate machine learning development with BigQuery ML.

By using the default settings in the CREATE MODEL statements and the inference functions, you can create and use BigQuery ML models even without much ML knowledge. However, having basic knowledge about the ML development lifecycle, such as feature engineering and model training, helps you optimize both your data and your model to deliver better results. We recommend using the following resources to develop familiarity with ML techniques and processes:

Generative AI and pretrained models

You can use remote models to access Vertex AI models and Cloud AI APIs from BigQuery ML in order to perform the following tasks:

Supported models

A model in BigQuery ML represents what an ML system has learned from training data. The following sections describe the types of models that BigQuery ML supports. For more information about creating reservation assignments for the different types of models, see Assign slots to BigQuery ML workloads.

Internally trained models

The following models are built in to BigQuery ML:

  • Contribution analysis (Preview) is for determining the effect of one or more dimensions on the value for a given metric. For example, seeing the effect of store location and sales date on store revenue. For more information, see Contribution analysis overview.
  • Linear regression is for predicting the value of a numerical metric for new data by using a model trained on similar remote data. Labels are real-valued, meaning they cannot be positive infinity or negative infinity or a NaN (Not a Number).
  • Logistic regression is for the classification of two or more possible values such as whether an input is low-value, medium-value, or high-value. Labels can have up to 50 unique values.
  • K-means clustering is for data segmentation. For example, this model identifies customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or split data for training or evaluation.
  • Matrix factorization is for creating product recommendation systems. You can create product recommendations using historical customer behavior, transactions, and product ratings, and then use those recommendations for personalized customer experiences.
  • Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data. It's commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
  • Time series is for performing time series forecasts. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.

You can perform a dry run on the CREATE MODEL statements for internally trained models to get an estimate of how much data they will process if you run them.

Externally trained models

The following models are external to BigQuery ML and trained in Vertex AI:

  • Deep neural network (DNN) is for creating TensorFlow-based deep neural networks for classification and regression models.
  • Wide & Deep is useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
  • Autoencoder is for creating TensorFlow-based models with the support of sparse data representations. You can use the models in BigQuery ML for tasks such as unsupervised anomaly detection and non-linear dimensionality reduction.
  • Boosted Tree is for creating classification and regression models that are based on XGBoost.
  • Random forest is for constructing multiple learning method decision trees for classification, regression, and other tasks at training time.
  • AutoML is a supervised ML service that builds and deploys classification and regression models on tabular data at high speed and scale.

You can't perform a dry run on the CREATE MODEL statements for externally trained models to get an estimate of how much data they will process if you run them.

Remote models

You can create remote models in BigQuery that use models deployed to Vertex AI. You reference the deployed model by specifying the model's HTTPS endpoint in the remote model's CREATE MODEL statement.

The CREATE MODEL statements for remote models don't process any bytes and don't incur BigQuery charges.

Imported models

BigQuery ML lets you import custom models that are trained outside of BigQuery and then perform prediction within BigQuery. You can import the following models into BigQuery from Cloud Storage:

  • Open Neural Network Exchange (ONNX) is an open standard format for representing ML models. Using ONNX, you can make models that are trained with popular ML frameworks like PyTorch and scikit-learn available in BigQuery ML.
  • TensorFlow is a free, open source software library for ML and artificial intelligence. You can use TensorFlow across a range of tasks, but it has a particular focus on training and inference of deep neural networks. You can load previously trained TensorFlow models into BigQuery as BigQuery ML models and then perform prediction in BigQuery ML.
  • TensorFlow Lite is a light version of TensorFlow for deployment on mobile devices, microcontrollers, and other edge devices. TensorFlow optimizes existing TensorFlow models for reduced model size and faster inference.
  • XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements ML algorithms under the gradient boosting framework.

The CREATE MODEL statements for imported models don't process any bytes and don't incur BigQuery charges.

In BigQuery ML, you can use a model with data from multiple BigQuery Datasets for training and for prediction.

Model selection guide

This decision tree maps ML models to actions that you want to accomplish. Download the model selection decision tree.

BigQuery ML and Vertex AI

BigQuery ML integrates with Vertex AI, which is the end-to-end platform for AI and ML in Google Cloud. You can register your BigQuery ML models to Model Registry in order to deploy these models to endpoints for online prediction. For more information, see the following:

BigQuery ML and Colab Enterprise

You can now use Colab Enterprise notebooks to perform ML workflows in BigQuery. Notebooks let you use SQL, Python, and other popular libraries and languages to accomplish your ML tasks. For more information, see Create notebooks.

Supported regions

BigQuery ML is supported in the same regions as BigQuery. For more information, see BigQuery ML locations.

Pricing

You are charged for the compute resources that you use to train models and to run queries against models. The type of model that you create affects where the model is trained and the pricing that applies to that operation. Queries against models always run in BigQuery and use BigQuery compute pricing. Because remote models make calls to Vertex AI models, queries against remote models also incur charges from Vertex AI.

You are charged for the storage used by trained models, using BigQuery storage pricing.

For more information, see BigQuery ML pricing.

Quotas

In addition to BigQuery ML-specific limits, queries that use BigQuery ML functions and CREATE MODEL statements are subject to the quotas and limits on BigQuery query jobs.

Limitations

What's next