The ML.TFDV_DESCRIBE function
This document describes the ML.TFDV_DESCRIBE
function, which you can use
to generate fine-grained statistics for the columns in a table. For example, you
might want statistics for a table of training or serving data
that you plan to use with a machine learning (ML) model. Calling
this function provides the same behavior as calling the TensorFlow
tfdv.generate_statistics_from_csv
API.
You can use the data output by this function for such purposes as
feature preprocessing or
model monitoring.
Syntax
ML.TFDV_DESCRIBE(
  { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
  STRUCT(
    [NUM_HISTOGRAM_BUCKETS AS num_histogram_buckets]
    [, NUM_QUANTILES_HISTOGRAM_BUCKETS AS num_quantiles_histogram_buckets]
    [, NUM_VALUES_HISTOGRAM_BUCKETS AS num_values_histogram_buckets]
    [, NUM_RANK_HISTOGRAM_BUCKETS AS num_rank_histogram_buckets])
)
Arguments
ML.TFDV_DESCRIBE takes the following arguments:
- PROJECT_ID: your project ID.
- DATASET: the BigQuery dataset that contains the table.
- TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.
- QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
- NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
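As an illustration of how the optional arguments combine, the following sketch passes a query statement instead of a table and sets three of the bucket counts. The column selection and the specific bucket values are assumptions chosen for illustration; other values within the documented ranges can be substituted.

```sql
-- Sketch: describe two columns of the penguins public dataset.
-- species is a STRING (categorical) column and body_mass_g is numerical,
-- so both the rank and the numerical histogram settings take effect.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT species, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
  ),
  STRUCT(
    20 AS num_histogram_buckets,           -- equal-width buckets, range [1, 1,000]
    5 AS num_quantiles_histogram_buckets,  -- quantile buckets, range [1, 1,000]
    10 AS num_rank_histogram_buckets       -- rank buckets, range [1, 10,000]
  )
);
```

Arguments left out of the STRUCT, such as num_values_histogram_buckets here, keep their default values.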
Output
ML.TFDV_DESCRIBE
returns a column named dataset_feature_statistics_list
that contains a TensorFlow
DatasetFeatureStatisticsList
protocol buffer
in JSON format.
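One possible way to work with the returned JSON is to split it into one row per feature with the GoogleSQL JSON functions. The sketch below assumes that the JSON follows the DatasetFeatureStatisticsList field layout, with a top-level datasets array whose entries each hold a features array; verify the path against the actual output before relying on it.

```sql
-- Sketch: return one row per feature from the statistics JSON.
-- Assumption: the JSON path '$.datasets[0].features' matches the
-- DatasetFeatureStatisticsList layout produced by the function.
WITH stats AS (
  SELECT dataset_feature_statistics_list
  FROM ML.TFDV_DESCRIBE(
    TABLE `bigquery-public-data.ml_datasets.penguins`,
    STRUCT(20 AS num_rank_histogram_buckets)
  )
)
SELECT feature
FROM
  stats,
  UNNEST(
    JSON_QUERY_ARRAY(dataset_feature_statistics_list, '$.datasets[0].features')
  ) AS feature;
```

Each resulting row holds the statistics for a single column, which can then be inspected further with JSON_QUERY or JSON_VALUE.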
Example
The following example returns statistics for the penguins
public dataset and
uses 20 buckets for rank histograms for string values:
SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets)
);
Limitations
Input data for the ML.TFDV_DESCRIBE function can only contain columns of the
following data types:
- Numeric types
- STRING
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP
- ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
- STRUCT columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTE
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
- ARRAY columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTE
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
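When a table has columns of types that are not in this list, such as GEOGRAPHY or JSON, one workaround is to pass a query statement that drops or converts those columns. The following sketch assumes a hypothetical table `mydataset.stores` with a GEOGRAPHY column named location and a JSON column named attributes; both the table and the column names are illustrative only.

```sql
-- Sketch: exclude unsupported columns and convert one to a supported type.
-- `mydataset.stores`, location, and attributes are hypothetical names.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT
      * EXCEPT (location, attributes),      -- drop the unsupported columns
      ST_ASTEXT(location) AS location_wkt   -- keep the geography as a STRING (WKT)
    FROM `mydataset.stores`
  ),
  STRUCT(10 AS num_histogram_buckets)       -- same as the documented default
);
```

Converting a column to STRING makes it a categorical feature in the statistics, so whether to convert or drop it depends on how the statistics will be used.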