The ML.DESCRIBE_DATA function
This document describes the ML.DESCRIBE_DATA
function, which you can use to
generate descriptive statistics for the columns in a table or subquery. For
example, you might want to know statistics for a table of training or serving
data that you plan to use with a machine learning (ML) model. You can use the
data output by this function for such purposes as
feature preprocessing or model
monitoring.
Syntax
ML.DESCRIBE_DATA( { TABLE `project_id.dataset.table` | (query_statement) }, STRUCT( [num_quantiles AS num_quantiles] [, num_array_length_quantiles AS num_array_length_quantiles] [, top_k AS top_k]) )
Arguments
ML.DESCRIBE_DATA
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the table.table
: the name of the input table that contains the training or serving data to calculate statistics for.query_statement
: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of thequery_statement
clause, see GoogleSQL query syntax.num_quantiles
: anINT64
value that specifies the number of quantiles to return for numerical,ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
columns. This affects the number of results shown in thequantiles
output column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. Thenum_quantiles
value must be in the range[1, 100,000]
. The default value is2
.num_array_length_quantiles
: anINT64
value that specifies the number of quantiles to return forARRAY
columns. This affects the number of results shown in thearray_length_quantiles
output column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. Thenum_array_length_quantiles
value must be in the range[1, 100,000]
. The default value is10
.top_k
: anINT64
value that specifies the number of top values to return for categorical andARRAY<categorical>
columns. This affects the number of results shown in thetop_values
output column. The top values are the values that are shown most frequently in the column. Thetop_k
value must be in the range[1, 10,000]
. The default value is1
.
Details
ML.DESCRIBE_DATA
handles input columns as follows:
ARRAY
columns are unnested before statistics are computed on them.ARRAY<STRUCT<INT64, numerical>>
. TheINT64
value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type asARRAY<numerical>
based on the value. The value of the dimension column in the output isMAX(index) + 1
.STRUCT
fields are expanded, and then categorical columns are cast toSTRING
and numerical columns are cast toFLOAT64
.- Columns of the following data types are
cast to
STRING
and return the same statistics asSTRING
columns:BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP
Columns of the following data types are cast toFLOAT64
and return the same statistics asFLOAT64
columns:INT64
NUMERIC
BIGNUMERIC
Output
ML.DESCRIBE_DATA
returns one row for each column in the input data.
ML.DESCRIBE_DATA
output contains the following columns:
name
: aSTRING
column that contains the name of the input column.num_rows
: anINT64
column that contains the total number of rows for the input column.num_nulls
: anINT64
column that returns the number ofNULL
values found in the column.num_zeros
: anINT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the number of0
values found in the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
min
: aSTRING
column that contains theMIN
value for the column.max
: aSTRING
column that contains theMAX
value for the column.mean
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the mean value calculated for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
stdev
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the standard deviation value calculated for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
median
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the median value calculated for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
quantiles
: anARRAY<FLOAT64>
column that contains one of the following values:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the quantiles computed for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
unique
: anINT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns the number of unique values in the input column.
- For numerical,
avg_string_length
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns the average length of the values in the column.
- For numerical,
num_values
: anINT64
column that contains the number of array elements forARRAY
columns, and the number of values in the column for other types of columns.top_values
: aARRAY<STRUCT<STRING, INT64>>
column that contains information about the top values in an input column.top_values
contains the following fields:top_values.value
: aSTRING
field that contains one of the following values:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns one of the top values in the input column.
- For numerical,
top_values.count
: anINT64
field that contains one of the following values:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns the number of times the related top value appears.
- For numerical,
min_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the minimum length of an array in the column. - For other types of input columns, returns
NULL
.
- For
max_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the maximum length of an array in the column. - For other types of input columns, returns
NULL
.
- For
avg_array_length
: aFLOAT64
column that contains one of the following values:- For
ARRAY
input columns, returns the average length of an array in the column. - For other types of input columns, returns
NULL
.
- For
total_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the sum of the size of the arrays in the column. - For other types of input columns, returns
NULL
.
- For
array_length_quantiles
: anARRAY<INT64>
column that contains one of the following values:- For
ARRAY
input columns, returns the quantiles for the array length computed for the column. - For other types of input columns, returns
0
.
- For
dimension
: anINT64
column that contains one of the following:- For
ARRAY<STRUCT<INT64, numerical>>
input columns, returns the dimension computed for the column, which isMAX(index) + 1
for sparse input. - For other types of input columns, returns
NULL
.
- For
Example
The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:
SELECT * FROM ML.DESCRIBE_DATA( TABLE `myproject.mydataset.mytable`, STRUCT(5 AS num_quantiles, 3 AS top_k) );
Limitations
Input data for the ML.DESCRIBE_DATA
function can only contain columns of the
following data types:
- Numeric types
STRING
BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP
ARRAY<STRUCT<INT64, FLOAT64>>
(a sparse tensor)STRUCT
columns that contain any of the following types:- Numeric types
STRING
BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP
ARRAY
columns that contain any of the following types:- Numeric types
STRING
BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP