Automatic feature preprocessing
===============================

*Last updated (UTC): 2025-09-04.*

BigQuery ML performs automatic preprocessing during training by using the
[`CREATE MODEL` statement](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create).
Automatic preprocessing consists of
[missing value imputation](/bigquery/docs/auto-preprocessing#imputation)
and [feature transformations](/bigquery/docs/auto-preprocessing#feature-transform).

For information about feature preprocessing support in BigQuery ML,
see [Feature preprocessing overview](/bigquery/docs/preprocess-overview).

For information about the supported SQL statements and functions for each model
type, see [End-to-end user journey for each model](/bigquery/docs/e2e-journey).

Missing data imputation
-----------------------

In statistics, imputation is used to replace missing data with substituted
values. When you train a model in BigQuery ML, `NULL` values are treated as
missing data. When you predict outcomes in BigQuery ML, missing values can
occur when BigQuery ML encounters a `NULL` value or a previously unseen value.
BigQuery ML handles missing data differently based on the type of data in the
column:

- Numeric columns: `NULL` values are imputed with the mean of the column.
- One-hot or multi-hot encoded columns: an additional category is added for
  `NULL` values.
- `TIMESTAMP` columns: a mix of imputation methods is used, depending on the
  generated value.

Feature transformations
-----------------------

By default, BigQuery ML transforms input features as follows:

- Numeric types are standardized.
- Categorical data is one-hot encoded (except for boosted tree and random
  forest models).
- Arrays are multi-hot encoded.
- `TIMESTAMP` columns are transformed into various components.

### `TIMESTAMP` feature transformation

The following table shows the components extracted from `TIMESTAMP` columns and
the corresponding transformation method.

Category feature encoding
-------------------------

For features that are one-hot encoded, you can specify a different default
encoding method by using the model option `CATEGORY_ENCODING_METHOD`. For
generalized linear (GLM) models, you can set `CATEGORY_ENCODING_METHOD` to one
of the following values:

- [`ONE_HOT_ENCODING`](#one_hot_encoding)
- [`DUMMY_ENCODING`](#dummy_encoding)
- [`LABEL_ENCODING`](#label_encoding)
- [`TARGET_ENCODING`](#target_encoding)

### One-hot encoding

One-hot encoding maps each category that a feature has to its own binary
feature, where `0` represents the absence of the feature and `1` represents its
presence (known as a *dummy variable*). This mapping creates `N` new feature
columns, where `N` is the number of unique categories for the feature across
the training table.

For example, suppose your training table has a feature column called `fruit`
with the categories `Apple`, `Banana`, and `Cranberry`, such as the following:

| Row | fruit     |
|-----|-----------|
| 1   | Apple     |
| 2   | Banana    |
| 3   | Cranberry |

In this case, the `CATEGORY_ENCODING_METHOD='ONE_HOT_ENCODING'` option
transforms the table to an internal representation with one indicator column
per category, similar to the following:

| Row | fruit_Apple | fruit_Banana | fruit_Cranberry |
|-----|-------------|--------------|-----------------|
| 1   | 1           | 0            | 0               |
| 2   | 0           | 1            | 0               |
| 3   | 0           | 0            | 1               |

One-hot encoding is supported by
[linear and logistic regression](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm)
and
[boosted tree](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree)
models.

### Dummy encoding

[Dummy encoding](https://en.wikiversity.org/wiki/Dummy_variable_(statistics)) is
similar to one-hot encoding: a categorical feature is transformed into a set of
placeholder variables. Dummy encoding uses `N-1` placeholder variables instead
of `N` placeholder variables to represent `N` categories for a feature. For
example, if you set `CATEGORY_ENCODING_METHOD` to `'DUMMY_ENCODING'` for the
same `fruit` feature column shown in the preceding one-hot encoding example,
then the table is transformed to an internal representation that omits one of
the three indicator columns.

The category with the most occurrences in the training dataset is dropped.
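To make the difference between the two encodings concrete, here is a minimal Python sketch of their semantics. It is an illustration only, not BigQuery ML's internal implementation; the `fruit_*` column names and the plain-dict row representation are assumptions for readability.

```python
def one_hot_encode(values):
    """Map each category to its own 0/1 indicator column (N columns)."""
    categories = sorted(set(values))
    return [{f"fruit_{c}": int(v == c) for c in categories} for v in values]


def dummy_encode(values, drop):
    """Like one-hot encoding, but drop one category, leaving N-1 columns.

    BigQuery ML drops the most frequent category; in this sketch the
    caller chooses which category to drop.
    """
    categories = sorted(set(values))
    kept = [c for c in categories if c != drop]
    return [{f"fruit_{c}": int(v == c) for c in kept} for v in values]


rows = ["Apple", "Banana", "Apple", "Cranberry"]
print(one_hot_encode(rows)[0])
# {'fruit_Apple': 1, 'fruit_Banana': 0, 'fruit_Cranberry': 0}
print(dummy_encode(rows, drop="Apple")[0])
# {'fruit_Banana': 0, 'fruit_Cranberry': 0}
```

Note that in the dummy-encoded output, a row of all zeros unambiguously identifies the dropped category, which is why `N-1` columns suffice.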
When multiple categories are tied for the most occurrences, a random category
within that set is dropped.

The final set of weights from
[`ML.WEIGHTS`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-weights)
still includes the dropped category, but its weight is always `0.0`. For
[`ML.ADVANCED_WEIGHTS`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-advanced-weights),
the standard error and p-value for the dropped variable are `NaN`.

If `warm_start` is used on a model that was initially trained with
`'DUMMY_ENCODING'`, the same placeholder variable dropped in the first training
run is dropped again. Models cannot change encoding methods between training
runs.

Dummy encoding is supported by
[linear and logistic regression models](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm).

### Label encoding

Label encoding transforms the value of a categorical feature to an `INT64`
value in `[0, <number of categories>]`.

For example, given a book dataset with a categorical column, each distinct
value in the column is mapped to an integer code.

The encoding vocabulary is sorted alphabetically.
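The vocabulary behavior can be sketched in Python. The category names below are hypothetical, and the sketch assumes known categories map to `1..N` with `0` reserved for `NULL` and unseen values, matching the fallback behavior described in this section; it is an illustration, not BigQuery ML's implementation.

```python
def label_encode(values, vocabulary):
    """Encode categories as integers from an alphabetically sorted vocabulary.

    Known categories map to 1..N; None (NULL) and unseen values map to 0.
    """
    mapping = {c: i + 1 for i, c in enumerate(sorted(vocabulary))}
    return [mapping.get(v, 0) for v in values]


vocab = ["novel", "poetry", "drama"]  # hypothetical book categories
print(label_encode(["drama", "novel", None, "essay"], vocab))
# [1, 2, 0, 0]  -> drama=1, novel=2, NULL=0, unseen "essay"=0
```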
`NULL` values and categories that aren't in the vocabulary are encoded to `0`.

Label encoding is supported by
[boosted tree models](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree).

### Target encoding

Target encoding replaces the categorical feature value with the probability of
the target for classification models, or with the expected value of the target
for regression models.

Features that have been target encoded might look similar to the following
example:

```
# Classification model
+------------------------+----------------------+
| original value         | target encoded value |
+------------------------+----------------------+
| (category_1, target_1) | 0.5                  |
| (category_1, target_2) | 0.5                  |
| (category_2, target_1) | 0.0                  |
+------------------------+----------------------+

# Regression model
+------------------------+----------------------+
| original value         | target encoded value |
+------------------------+----------------------+
| (category_1, 2)        | 2.5                  |
| (category_1, 3)        | 2.5                  |
| (category_2, 1)        | 1.5                  |
| (category_2, 2)        | 1.5                  |
+------------------------+----------------------+
```

Target encoding is supported by
[boosted tree models](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree).
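The regression case above can be reproduced with a short sketch: each category is replaced by the mean of the target over the rows in that category. This illustrates the encoding's semantics only, not BigQuery ML's implementation.

```python
from collections import defaultdict


def target_encode(pairs):
    """pairs: list of (category, numeric_target) rows.

    Returns one encoded value per row: the mean target of the row's category.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in pairs:
        sums[cat] += y
        counts[cat] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[cat] for cat, _ in pairs]


rows = [("category_1", 2), ("category_1", 3), ("category_2", 1), ("category_2", 2)]
print(target_encode(rows))
# [2.5, 2.5, 1.5, 1.5]  -- matches the regression table above
```

The classification case works the same way when the target is a 0/1 indicator: the per-category mean is then the empirical probability of the target.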