Make predictions with scikit-learn models in ONNX format


This tutorial shows you how to import an Open Neural Network Exchange (ONNX) model that's trained with scikit-learn. You import the model into a BigQuery dataset and use it to make predictions using a SQL query.

ONNX provides a uniform format that is designed to represent models from any machine learning (ML) framework. BigQuery ML support for ONNX lets you do the following:

  • Train a model using your favorite framework.
  • Convert the model into the ONNX model format.
  • Import the ONNX model into BigQuery and make predictions using BigQuery ML.

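Objectives

  • Create and train a model with scikit-learn.
  • Convert the model into ONNX format by using sklearn-onnx.
  • Import the ONNX model into BigQuery by using the CREATE MODEL statement.
  • Make predictions with the imported model by using the ML.PREDICT function.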

Costs

In this document, you use billable components of Google Cloud, including BigQuery and Cloud Storage.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery and Cloud Storage APIs.

    Enable the APIs

  5. Ensure that you have the necessary permissions to perform the tasks in this document.

Required roles

If you create a new project, you're the project owner, and you're granted all of the required Identity and Access Management (IAM) permissions that you need to complete this tutorial.

If you're using an existing project, do the following.

Make sure that you have the following role or roles on the project:

Check for the roles

  1. In the Google Cloud console, go to the IAM page.

    Go to IAM
  2. Select the project.
  3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

  4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

  1. In the Google Cloud console, go to the IAM page.

    Go to IAM
  2. Select the project.
  3. Click Grant access.
  4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

  5. In the Select a role list, select a role.
  6. To grant additional roles, click Add another role and add each additional role.
  7. Click Save.

For more information about IAM permissions in BigQuery, see IAM permissions.

Optional: Train a model and convert it to ONNX format

The following code samples show you how to train a classification model with scikit-learn and how to convert the resulting pipeline into ONNX format. This tutorial uses a prebuilt example model that's stored at gs://cloud-samples-data/bigquery/ml/onnx/pipeline_rf.onnx. You don't have to complete these steps if you're using the sample model.

Train a classification model with scikit-learn

Use the following sample code to create and train a scikit-learn pipeline on the Iris dataset. For instructions about installing and using scikit-learn, see the scikit-learn installation guide.

import numpy
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X = data.data[:, :4]
y = data.target

ind = numpy.arange(X.shape[0])
numpy.random.shuffle(ind)
X = X[ind, :].copy()
y = y[ind].copy()

pipe = Pipeline([('scaler', StandardScaler()),
                ('clr', RandomForestClassifier())])
pipe.fit(X, y)

Convert the pipeline into an ONNX model

Use the following sample code in sklearn-onnx to convert the scikit-learn pipeline into an ONNX model that's named pipeline_rf.onnx.

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Disable zipmap as it is not supported in BigQuery ML.
options = {id(pipe): {'zipmap': False}}

# Define input features. scikit-learn does not store information about the
# training dataset. It is not always possible to retrieve the number of features
# or their types. That's why the function needs another argument called initial_types.
initial_types = [
   ('sepal_length', FloatTensorType([None, 1])),
   ('sepal_width', FloatTensorType([None, 1])),
   ('petal_length', FloatTensorType([None, 1])),
   ('petal_width', FloatTensorType([None, 1])),
]

# Convert the model.
model_onnx = convert_sklearn(
   pipe, 'pipeline_rf', initial_types=initial_types, options=options
)

# Save the converted model to a file.
with open('pipeline_rf.onnx', 'wb') as f:
    f.write(model_onnx.SerializeToString())

Upload the ONNX model to Cloud Storage

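After you save your model, do the following:

  1. Create a Cloud Storage bucket to store the model.
  2. Upload the pipeline_rf.onnx file to your Cloud Storage bucket.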

Create a dataset

Create a BigQuery dataset to store your ML model.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

    The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

  1. Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:

    bq --location=US mk -d \
     --description "BigQuery ML tutorial dataset." \
     bqml_tutorial

    Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.

  2. Confirm that the dataset was created:

    bq ls

API

Call the datasets.insert method with a defined dataset resource.

{
  "datasetReference": {
     "datasetId": "bqml_tutorial"
  }
}
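The request body above sets only the dataset ID. As a minimal sketch, the following Python builds a dataset resource that also sets the location and description used in the Console and bq steps. The helper name build_dataset_resource is hypothetical, and the sketch only constructs the JSON body; authenticating and sending the datasets.insert request are left out.

```python
import json


def build_dataset_resource(dataset_id, location="US", description=None):
    """Return a datasets.insert request body as a plain dict.

    Hypothetical helper for illustration; it does not call the API.
    """
    resource = {
        "datasetReference": {"datasetId": dataset_id},
        # Matches the US multi-region chosen in the Console and bq steps.
        "location": location,
    }
    if description is not None:
        resource["description"] = description
    return resource


body = build_dataset_resource(
    "bqml_tutorial", description="BigQuery ML tutorial dataset."
)
print(json.dumps(body, indent=2))
```

Keeping the body in one helper makes it easier to reuse the same dataset ID and location across the rest of the tutorial.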

Import the ONNX model into BigQuery

The following steps show you how to import the sample ONNX model from Cloud Storage by using a CREATE MODEL statement.

To import the ONNX model into your dataset, select one of the following options:

Console

  1. In the Google Cloud console, go to the BigQuery Studio page.

    Go to BigQuery Studio

  2. In the query editor, enter the following CREATE MODEL statement.

     CREATE OR REPLACE MODEL `bqml_tutorial.imported_onnx_model`
      OPTIONS (MODEL_TYPE='ONNX',
       MODEL_PATH='BUCKET_PATH')

    Replace BUCKET_PATH with the path to the model that you uploaded to Cloud Storage. If you're using the sample model, replace BUCKET_PATH with the following value: gs://cloud-samples-data/bigquery/ml/onnx/pipeline_rf.onnx.

    When the operation is complete, you see a message similar to the following: Successfully created model named imported_onnx_model.

    Your new model appears in the Resources panel. Models are indicated by the model icon. If you select the new model in the Resources panel, information about the model appears adjacent to the Query editor.

bq

  1. Import the ONNX model from Cloud Storage by entering the following CREATE MODEL statement.

    bq query --use_legacy_sql=false \
    'CREATE OR REPLACE MODEL
    `bqml_tutorial.imported_onnx_model`
    OPTIONS
    (MODEL_TYPE="ONNX",
      MODEL_PATH="BUCKET_PATH")'

    Replace BUCKET_PATH with the path to the model that you uploaded to Cloud Storage. If you're using the sample model, replace BUCKET_PATH with the following value: gs://cloud-samples-data/bigquery/ml/onnx/pipeline_rf.onnx.

    When the operation is complete, you see a message similar to the following: Successfully created model named imported_onnx_model.

  2. After you import the model, verify that the model appears in the dataset.

    bq ls bqml_tutorial

    The output is similar to the following:

    tableId               Type
    --------------------- -------
    imported_onnx_model  MODEL

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

Import the model by using the ONNXModel object.

import bigframes
from bigframes.ml.imported import ONNXModel

PROJECT_ID = "PROJECT_ID"  # Replace with your Google Cloud project ID.
bigframes.options.bigquery.project = PROJECT_ID
# You can change the location to one of the valid locations: https://cloud.google.com/bigquery/docs/locations#supported_locations
bigframes.options.bigquery.location = "US"

imported_onnx_model = ONNXModel(
    model_path="gs://cloud-samples-data/bigquery/ml/onnx/pipeline_rf.onnx"
)

For more information about importing ONNX models into BigQuery, including format and storage requirements, see The CREATE MODEL statement for importing ONNX models.
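The CREATE MODEL statements in this section differ only in the value that replaces BUCKET_PATH. As a rough sketch, the following Python builds that statement from a Cloud Storage path. The helper name build_create_model_sql is hypothetical, and the check that the path is a gs:// URI ending in .onnx is a convention of this sketch rather than a documented validation rule.

```python
import re


def build_create_model_sql(model_path, dataset="bqml_tutorial",
                           model_name="imported_onnx_model"):
    """Return a CREATE MODEL statement that imports an ONNX file.

    Hypothetical helper; it only formats SQL text and runs no query.
    """
    # Assumption of this sketch: the model is a gs://BUCKET/OBJECT.onnx URI.
    if not re.fullmatch(r"gs://[^/\s']+/[^\s']+\.onnx", model_path):
        raise ValueError(
            f"Expected a gs://BUCKET/PATH.onnx URI, got: {model_path!r}"
        )
    return (
        f"CREATE OR REPLACE MODEL `{dataset}.{model_name}`\n"
        f"  OPTIONS (MODEL_TYPE='ONNX',\n"
        f"    MODEL_PATH='{model_path}')"
    )


sql = build_create_model_sql(
    "gs://cloud-samples-data/bigquery/ml/onnx/pipeline_rf.onnx"
)
print(sql)
```

Passing the unreplaced placeholder BUCKET_PATH raises a ValueError, which catches the most common copy-and-paste mistake before the query is submitted.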

Make predictions with the imported ONNX model

After importing the ONNX model, you use the ML.PREDICT function to make predictions with the model.

The query in the following steps uses imported_onnx_model to make predictions using input data from the iris table in the ml_datasets public dataset. The ONNX model expects four FLOAT values as input:

  • sepal_length
  • sepal_width
  • petal_length
  • petal_width

These inputs match the initial_types that were defined when you converted the model into ONNX format.

The output includes the label and probabilities columns, along with the columns from the input table. label is the predicted class label. probabilities is an array that contains the predicted probability for each class.
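To illustrate how the two output columns relate, the following sketch picks the most likely class from a probabilities array. The row values are made up for illustration and are not real query output, the helper name most_likely_class is hypothetical, and the sketch assumes that position i in the array corresponds to class i (the Iris targets 0, 1, and 2).

```python
# Hypothetical result row; the numbers are illustrative only.
row = {
    "label": 2,
    "probabilities": [0.01, 0.04, 0.95],
}


def most_likely_class(probabilities):
    """Return (index, probability) for the largest entry in the array."""
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index, probabilities[index]


index, confidence = most_likely_class(row["probabilities"])
# Under the ordering assumption above, index should agree with row["label"].
print(index, confidence)
```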

To make predictions with the imported ONNX model, choose one of the following options:

Console

  1. Go to the BigQuery Studio page.

    Go to BigQuery Studio

  2. In the query editor, enter this query that uses the ML.PREDICT function.

    SELECT *
      FROM ML.PREDICT(MODEL `bqml_tutorial.imported_onnx_model`,
        (
        SELECT * FROM `bigquery-public-data.ml_datasets.iris`
        )
    )

    The query results contain the columns from the iris table along with the label and probabilities columns.

bq

Run the query that uses ML.PREDICT.

bq query --use_legacy_sql=false \
'SELECT *
FROM ML.PREDICT(
MODEL `bqml_tutorial.imported_onnx_model`,
(SELECT * FROM `bigquery-public-data.ml_datasets.iris`))'

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

Use the predict function to run the imported model.

import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.ml_datasets.iris")
predictions = imported_onnx_model.predict(df)
predictions.peek(5)

The result contains the input columns along with the predicted label and probabilities columns.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

Console

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

gcloud

    Delete a Google Cloud project:

    gcloud projects delete PROJECT_ID

Delete individual resources

Alternatively, to remove the individual resources used in this tutorial, do the following:

  1. Delete the imported model.

  2. Optional: Delete the dataset.
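The two cleanup steps above can also be expressed as SQL. As a minimal sketch, the following Python builds the corresponding DROP statements for the names used in this tutorial. The helper name build_cleanup_statements is hypothetical, and the sketch only produces SQL text; you would still run the statements yourself, for example in the query editor.

```python
def build_cleanup_statements(dataset="bqml_tutorial",
                             model_name="imported_onnx_model",
                             delete_dataset=False):
    """Return DROP statements for the tutorial's resources.

    Hypothetical helper; it formats SQL text and deletes nothing itself.
    """
    statements = [f"DROP MODEL IF EXISTS `{dataset}.{model_name}`"]
    if delete_dataset:
        # CASCADE also removes anything that remains in the dataset.
        statements.append(f"DROP SCHEMA IF EXISTS `{dataset}` CASCADE")
    return statements


for statement in build_cleanup_statements(delete_dataset=True):
    print(statement + ";")
```

Deleting the dataset is optional, so delete_dataset defaults to False and only the model is dropped unless you opt in.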

What's next