Class PCA (1.39.0)

PCA(
    n_components: typing.Optional[typing.Union[int, float]] = None,
    *,
    svd_solver: typing.Literal["full", "randomized", "auto"] = "auto"
)

Principal component analysis (PCA).

Examples:

>>> import bigframes.pandas as bpd
>>> from bigframes.ml.decomposition import PCA
>>> bpd.options.display.progress_bar = None
>>> X = bpd.DataFrame({"feat0": [-1, -2, -3, 1, 2, 3], "feat1": [-1, -1, -2, 1, 1, 2]})
>>> pca = PCA(n_components=2).fit(X)
>>> pca.predict(X) # doctest:+SKIP
    principal_component_1  principal_component_2
0              -0.755243               0.157628
1               -1.05405              -0.141179
2              -1.809292               0.016449
3               0.755243              -0.157628
4                1.05405               0.141179
5               1.809292              -0.016449
<BLANKLINE>
[6 rows x 2 columns]
>>> pca.explained_variance_ratio_ # doctest:+SKIP
    principal_component_id  explained_variance_ratio
0                       1                   0.00901
1                       0                   0.99099
<BLANKLINE>
[2 rows x 2 columns]

Parameters
Name	Description
`n_components`	`int, float or None, default None` Number of components to keep. If n_components is not set, all components are kept: n_components = min(n_samples, n_features). If 0 < n_components < 1, the number of components is chosen so that the explained variance is greater than the fraction specified by n_components.
`svd_solver`	`"full", "randomized" or "auto", default "auto"` The solver to use to calculate the principal components. Details: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-pca#pca_solver.
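
The fractional form of `n_components` can be read as "keep the fewest components whose cumulative explained variance ratio exceeds this value". The sketch below illustrates that reading in plain Python. The helper `components_for_variance` is hypothetical and is not part of bigframes; the ratios are the ones shown in the class example.

```python
from typing import Sequence


def components_for_variance(ratios: Sequence[float], threshold: float) -> int:
    """Return the smallest k whose top-k ratios sum to more than `threshold`.

    Hypothetical helper for illustration; not part of bigframes.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    cumulative = 0.0
    # Walk the components from most to least explained variance.
    for k, ratio in enumerate(sorted(ratios, reverse=True), start=1):
        cumulative += ratio
        if cumulative > threshold:
            return k
    # Rounding can leave the total at or just below the threshold; keep everything.
    return len(ratios)


# Ratios taken from the explained_variance_ratio_ output in the class example.
ratios = [0.00901, 0.99099]
print(components_for_variance(ratios, 0.95))   # 1
print(components_for_variance(ratios, 0.995))  # 2
```

With these ratios, a value such as `n_components=0.95` would be expected to keep a single component, since the first one alone explains about 99% of the variance.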

Properties

components_

Principal axes in feature space, representing the directions of maximum variance in the data.

Returns

Type	Description
`bigframes.dataframe.DataFrame`	DataFrame of principal components, containing the following columns: principal_component_id: An integer that identifies the principal component. feature: The column name that contains the feature. numerical_value: If feature is numeric, the value of feature for the principal component that principal_component_id identifies. If feature isn't numeric, the value is NULL. categorical_value: A list of mappings containing information about categorical features. Each mapping contains the following fields: categorical_value.category: The name of each category. categorical_value.value: The value of categorical_value.category for the principal component that principal_component_id identifies. The output contains one row per feature per component.
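
Because the output is in long format (one row per feature per component), it is often convenient to regroup it into one axis per component. The sketch below does this for plain tuples. The row values are made up for illustration, and only the numeric columns described above are modeled.

```python
from collections import defaultdict

# Hypothetical rows shaped like the components_ output for two numeric features:
# (principal_component_id, feature, numerical_value). The values are made up.
rows = [
    (0, "feat0", 0.8),
    (0, "feat1", 0.6),
    (1, "feat0", -0.6),
    (1, "feat1", 0.8),
]

# Group the long-format rows into one {feature: weight} mapping per component.
axes = defaultdict(dict)
for component_id, feature, value in rows:
    axes[component_id][feature] = value

for component_id in sorted(axes):
    print(component_id, axes[component_id])
```

Two features and two components give four rows, which matches the "one row per feature per component" layout.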

explained_variance_

The amount of variance explained by each of the selected components.

Returns
Type	Description
`bigframes.dataframe.DataFrame`	DataFrame containing the following columns: principal_component_id: An integer that identifies the principal component. explained_variance: The factor by which the eigenvector is scaled. Eigenvalue and explained variance are the same concept in PCA.

explained_variance_ratio_

Percentage of variance explained by each of the selected components.

Returns

Type	Description
`bigframes.dataframe.DataFrame`	DataFrame containing the following columns: principal_component_id: An integer that identifies the principal component. explained_variance_ratio: The ratio between the variance (eigenvalue) of that principal component and the total variance, where the total variance is the sum of the variances (eigenvalues) of all of the individual principal components.
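
The ratio described above is a plain normalization of the eigenvalues. A minimal sketch, assuming made-up eigenvalues and a hypothetical helper name:

```python
from typing import List, Sequence


def explained_variance_ratio(eigenvalues: Sequence[float]) -> List[float]:
    """Divide each component's variance (eigenvalue) by the total variance."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return [value / total for value in eigenvalues]


# Made-up eigenvalues for a two-component model.
ratios = explained_variance_ratio([4.0, 1.0])
print(ratios)  # [0.8, 0.2]
```

When all components are kept, the ratios sum to 1.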

Methods

__repr__

__repr__()

Print the estimator's constructor with all non-default parameter values.
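
To make "non-default parameter values" concrete, the sketch below uses a toy class with the same constructor parameters as `PCA`. `ToyEstimator` is a hypothetical stand-in written for this illustration; it does not show how bigframes implements `__repr__`.

```python
import inspect


class ToyEstimator:
    """Hypothetical stand-in for an estimator; not the bigframes implementation."""

    def __init__(self, n_components=None, *, svd_solver="auto"):
        self.n_components = n_components
        self.svd_solver = svd_solver

    def __repr__(self):
        # Compare each attribute with the default declared in __init__
        # and keep only the ones that differ.
        signature = inspect.signature(type(self).__init__)
        changed = [
            f"{name}={getattr(self, name)!r}"
            for name, param in signature.parameters.items()
            if name != "self" and getattr(self, name) != param.default
        ]
        return f"{type(self).__name__}({', '.join(changed)})"


print(repr(ToyEstimator()))                        # ToyEstimator()
print(repr(ToyEstimator(n_components=2)))          # ToyEstimator(n_components=2)
print(repr(ToyEstimator(0.9, svd_solver="full")))  # ToyEstimator(n_components=0.9, svd_solver='full')
```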

detect_anomalies

detect_anomalies(
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    *,
    contamination: float = 0.1
) -> bigframes.dataframe.DataFrame

Detect anomalous data points in the input.

Parameters
Name	Description
`X`	`bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series` Series or DataFrame in which to detect anomalies.
`contamination`	`float, default 0.1` Identifies the proportion of anomalies in the training dataset that are used to create the model. The value must be in the range [0, 0.5].

Returns
Type	Description
`bigframes.dataframe.DataFrame`	DataFrame containing the anomaly detection results.
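
The sketch below illustrates what a contamination proportion means in general: the top fraction of rows, ranked by some anomaly score, is flagged. It is an assumption-laden illustration, not BigQuery ML's actual procedure. The helper `flag_anomalies`, the per-row error scores, and the tie handling are all hypothetical.

```python
import math
from typing import List, Sequence


def flag_anomalies(errors: Sequence[float], contamination: float = 0.1) -> List[bool]:
    """Flag roughly the top `contamination` fraction of rows by error score.

    Hypothetical helper; ties at the cutoff are all flagged.
    """
    if not 0 <= contamination <= 0.5:
        raise ValueError("contamination must be in the range [0, 0.5]")
    n_flagged = math.floor(contamination * len(errors))
    if n_flagged == 0:
        return [False] * len(errors)
    # The cutoff is the n_flagged-th largest error.
    cutoff = sorted(errors, reverse=True)[n_flagged - 1]
    return [error >= cutoff for error in errors]


# Made-up per-row error scores; the fourth row is an obvious outlier.
errors = [0.10, 0.20, 0.15, 3.00, 0.12, 0.18, 0.11, 0.14, 0.16, 0.13]
flags = flag_anomalies(errors, contamination=0.1)
print(flags.index(True), sum(flags))  # 3 1
```

Raising `contamination` lowers the cutoff, so more rows are flagged.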

fit

fit(
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    y: typing.Optional[
        typing.Union[
            bigframes.dataframe.DataFrame,
            bigframes.series.Series,
            pandas.core.frame.DataFrame,
            pandas.core.series.Series,
        ]
    ] = None,
) -> bigframes.ml.base._T

Fit the model according to the given training data.

Parameters
Name	Description
`X`	`bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series` Series or DataFrame of shape (n_samples, n_features). Training vector, where `n_samples` is the number of samples and `n_features` is the number of features.
`y`	`default None` Ignored.

Returns
Type	Description
`PCA`	Fitted estimator.

get_params

get_params(deep: bool = True) -> typing.Dict[str, typing.Any]

Get parameters for this estimator.

Parameter
Name	Description
`deep`	`bool, default True` If True, returns the parameters for this estimator and for any contained subobjects that are estimators.

Returns
Type	Description
`Dictionary`	A dictionary of parameter names mapped to their values.

predict

predict(
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
) -> bigframes.dataframe.DataFrame

Project each sample in X onto the principal components.

Parameter
Name	Description
`X`	`bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series` Series or a DataFrame to predict.

Returns
Type	Description
`bigframes.dataframe.DataFrame`	Predicted DataFrame.

register

register(vertex_ai_model_id: typing.Optional[str] = None) -> bigframes.ml.base._T

Register the model to Vertex AI. After registering, go to the Google Cloud console (https://console.cloud.google.com/vertex-ai/models) to manage the model registry. Refer to https://cloud.google.com/vertex-ai/docs/model-registry/introduction for more options.

Parameter
Name	Description
`vertex_ai_model_id`	`Optional[str], default None` Optional string to use as the model ID in Vertex AI. If not set, defaults to 'bigframes_{bq_model_id}'. The Vertex AI model ID is truncated to 63 characters because of a Vertex AI limit.
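
The default and the 63-character limit described above can be sketched as follows. The helper `resolve_vertex_ai_model_id` is hypothetical, and keeping the first 63 characters is an assumption about how the truncation is applied.

```python
from typing import Optional

VERTEX_ID_MAX_LENGTH = 63  # limit stated in the parameter description


def resolve_vertex_ai_model_id(
    bq_model_id: str, vertex_ai_model_id: Optional[str] = None
) -> str:
    """Hypothetical helper mirroring the documented default and truncation."""
    if vertex_ai_model_id is None:
        vertex_ai_model_id = f"bigframes_{bq_model_id}"
    # Assumption: truncation keeps the leading characters.
    return vertex_ai_model_id[:VERTEX_ID_MAX_LENGTH]


print(resolve_vertex_ai_model_id("my_pca_model"))  # bigframes_my_pca_model
print(len(resolve_vertex_ai_model_id("m" * 100)))  # 63
```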

score

score(X=None, y=None) -> bigframes.dataframe.DataFrame

Calculate evaluation metrics of the model.

Parameters
Name	Description
`X`	`default None` Ignored.
`y`	`default None` Ignored.

Returns
Type	Description
`bigframes.dataframe.DataFrame`	DataFrame that represents model metrics.

to_gbq

to_gbq(model_name: str, replace: bool = False) -> bigframes.ml.decomposition.PCA

Save the model to BigQuery.

Parameters
Name	Description
`model_name`	`str` The name of the model.
`replace`	`bool, default False` Whether to replace the model if it already exists.

Returns
Type	Description
`PCA`	Saved model.