The ML.TFDV_DESCRIBE function
This document describes the ML.TFDV_DESCRIBE function, which you can use
to generate fine-grained statistics for the columns in a table. For example, you
might want to know statistics for a table of training or serving data
that you plan to use with a machine learning (ML) model. Calling
this function provides the same behavior as calling the TensorFlow
tfdv.generate_statistics_from_csv API.
You can use the data output by this function for such purposes as
feature preprocessing or
model monitoring.
Syntax
ML.TFDV_DESCRIBE(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_histogram_buckets AS num_histogram_buckets]
    [, num_quantiles_histogram_buckets AS num_quantiles_histogram_buckets]
    [, num_values_histogram_buckets AS num_values_histogram_buckets]
    [, num_rank_histogram_buckets AS num_rank_histogram_buckets])
)
Arguments
ML.TFDV_DESCRIBE takes the following arguments:
- project_id: your project ID.
- dataset: a STRING value that specifies the BigQuery dataset that contains the table.
- table: the name of the input table that contains the training or serving data to calculate statistics for.
- query_statement: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the query_statement clause, see GoogleSQL query syntax.
- num_histogram_buckets: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- num_quantiles_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- num_values_histogram_buckets: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- num_rank_histogram_buckets: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
Output
ML.TFDV_DESCRIBE
returns a column named dataset_feature_statistics_list
that contains a TensorFlow
DatasetFeatureStatisticsList
protocol buffer
in JSON format.
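Because the function returns an ordinary column, one way to keep the statistics for later use (for example, model monitoring) is to write them to a table. The following is a minimal sketch, not an official pattern: the destination table mydataset.penguins_stats and the computed_at column are hypothetical names chosen for illustration, and the option value shown is the documented default.

```sql
-- Hypothetical destination table; replace mydataset.penguins_stats with your own.
CREATE OR REPLACE TABLE mydataset.penguins_stats AS
SELECT
  -- Illustrative timestamp column so that later runs can be compared.
  CURRENT_TIMESTAMP() AS computed_at,
  -- The single output column documented for ML.TFDV_DESCRIBE.
  dataset_feature_statistics_list
FROM
  ML.TFDV_DESCRIBE(
    TABLE `bigquery-public-data.ml_datasets.penguins`,
    STRUCT(10 AS num_histogram_buckets)
  );
```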
Example
The following example returns statistics for the penguins
public dataset and
uses 20 buckets for rank histograms for string values:
SELECT *
FROM
  ML.TFDV_DESCRIBE(
    TABLE `bigquery-public-data.ml_datasets.penguins`,
    STRUCT(20 AS num_rank_histogram_buckets)
  );
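The syntax also accepts a query statement in place of a table, and several options can be set in the same STRUCT. The following sketch illustrates that form. It assumes the penguins table has columns named species, island, body_mass_g, and flipper_length_mm. The bucket counts are arbitrary values inside the documented ranges.

```sql
SELECT *
FROM
  ML.TFDV_DESCRIBE(
    (
      -- Column names are assumed from the public penguins table.
      SELECT species, island, body_mass_g, flipper_length_mm
      FROM `bigquery-public-data.ml_datasets.penguins`
      WHERE body_mass_g IS NOT NULL
    ),
    STRUCT(
      20 AS num_histogram_buckets,            -- equal-width histogram, numerical columns
      4 AS num_quantiles_histogram_buckets,   -- quantiles histogram, numerical columns
      10 AS num_rank_histogram_buckets        -- rank histogram, categorical columns
    )
  );
```

Using a query statement restricts the statistics to the selected columns and rows, which can help when only a subset of the data is used for training or serving.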
Limitations
Input data for the ML.TFDV_DESCRIBE
function can only contain columns of the
following data types:
- Numeric types
- STRING
- BOOL
- BYTES
- DATE
- DATETIME
- TIME
- TIMESTAMP
- ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
- STRUCT columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTES
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
- ARRAY columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTES
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
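If a table contains columns of other types, one possible workaround is to pass a query statement that drops or converts those columns before the statistics are computed. The following is a sketch under stated assumptions: the table myproject.mydataset.trips and its columns pickup_location (GEOGRAPHY) and metadata (JSON) are hypothetical and appear only for illustration.

```sql
SELECT *
FROM
  ML.TFDV_DESCRIBE(
    (
      SELECT
        -- Drop the columns whose types are not in the supported list.
        * EXCEPT (pickup_location, metadata),
        -- Convert the GEOGRAPHY value to a STRING (WKT) so it can be profiled as text.
        ST_ASTEXT(pickup_location) AS pickup_location_wkt
      -- Hypothetical table; replace with your own.
      FROM `myproject.mydataset.trips`
    ),
    STRUCT(10 AS num_histogram_buckets)
  );
```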