Prepare training data

This page shows you how to prepare your tabular data for training classification and regression models in Vertex AI. The quality of your training data impacts the effectiveness of the models you create.

The following topics are covered:

  1. Data structure requirements
  2. Prepare your import source
  3. Add weights to your training data

By default, Vertex AI uses a random split algorithm to separate your data into three data splits. Vertex AI randomly selects 80% of your data rows for the training set, 10% for the validation set, and 10% for the test set. Alternatively, you can use a manual split or a chronological split, but this requires you to prepare a data split column or a time column. Learn more about data splits.
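If you train with the Vertex AI SDK for Python, the split is controlled by parameters on the training job's run call. The following is a minimal sketch, not a complete training workflow; the project ID, dataset resource name, and column names are hypothetical, and it assumes a classification objective:

    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")  # hypothetical project

    # Reference an existing tabular dataset (hypothetical resource name).
    dataset = aiplatform.TabularDataset(
        "projects/123/locations/us-central1/datasets/456"
    )

    job = aiplatform.AutoMLTabularTrainingJob(
        display_name="tabular-training",
        optimization_prediction_type="classification",
    )

    # Random split (the default): the 80/10/10 fractions shown explicitly.
    model = job.run(
        dataset=dataset,
        target_column="target",           # hypothetical target column
        training_fraction_split=0.8,
        validation_fraction_split=0.1,
        test_fraction_split=0.1,
        budget_milli_node_hours=1000,
    )

    # Manual split: instead of the fractions, pass the name of your data split
    # column, whose values assign each row to TRAIN, VALIDATE, or TEST.
    # model = job.run(
    #     dataset=dataset,
    #     target_column="target",
    #     predefined_split_column_name="data_split",  # hypothetical column
    #     budget_milli_node_hours=1000,
    # )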

Data structure requirements

Your training data must conform to the following basic requirements:

  • Size: The dataset must be 100 GB or smaller.
  • Number of columns: The dataset must have at least 2 and no more than 1,000 columns. It must have a target column and at least one feature column for training the model. Ideally, your training data has many more than two columns. The maximum number of columns includes both feature and non-feature columns.
  • Target column: You must specify a target column. The target column lets Vertex AI associate the training data with the desired result. It must not contain null values and must be either Categorical or Numerical. If it is Categorical, it must have at least 2 and no more than 500 distinct values.
  • Column name format: A column name can include any alphanumeric character or an underscore (_). The column name cannot begin with an underscore.
  • Number of rows: The dataset must have at least 1,000 and no more than 100,000,000 rows. Depending on how many features your dataset has, 1,000 rows might not be enough to train a high-performing model. Learn more.
  • Data format: You must use the appropriate data format (wide or narrow) for your objective. Wide format is generally best, with every row representing one training data item (a product, a person, and so on). Learn how to choose the data format.
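Before you import, you can check most of these requirements programmatically. A minimal sketch using pandas, assuming a local CSV file named training_data.csv and a target column named target (both hypothetical):

    import re

    import pandas as pd

    df = pd.read_csv("training_data.csv")  # hypothetical file

    n_rows, n_cols = df.shape
    assert 2 <= n_cols <= 1_000, "must have between 2 and 1,000 columns"
    assert 1_000 <= n_rows <= 100_000_000, "must have between 1,000 and 100,000,000 rows"

    # The target column must not contain null values.
    target = df["target"]  # hypothetical target column name
    assert target.notna().all(), "target column must not contain nulls"

    # A Categorical target must have 2-500 distinct values.
    if not pd.api.types.is_numeric_dtype(target):
        assert 2 <= target.nunique() <= 500, "categorical target needs 2-500 distinct values"

    # Column names: alphanumeric or underscore, and must not begin with an underscore.
    for name in df.columns:
        assert re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9_]*", name), f"invalid column name: {name}"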

Prepare your import source

You can provide model training data to Vertex AI in two formats:

  • BigQuery tables
  • Comma-separated values (CSV)

Which source you use depends on how your data is stored and on its size and complexity. If your dataset is small and you don't need complex data types, CSV might be easier. For larger datasets, and for datasets that include arrays and structs, you must use BigQuery.

BigQuery

Your BigQuery table or view must conform to the BigQuery location requirements.

If your BigQuery table or view is in a different project than the project where you are creating your Vertex AI dataset, or your BigQuery table or view is backed by an external data source, you might need to add one or more roles to the Vertex AI Service Agent. See Role addition requirements for BigQuery.

You do not need to specify a schema for your BigQuery table. Vertex AI automatically infers the schema for your table when you import your data.

Your BigQuery URI (which specifies the location of your training data) must conform to the following format:

bq://<project_id>.<dataset_id>.<table_id>

The URI cannot contain any other special characters.

For information about BigQuery data types and how they map into Vertex AI, see BigQuery tables. For more information about using BigQuery external data sources, see Introduction to external data sources.
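For example, with the Vertex AI SDK for Python you pass this URI directly when creating a tabular dataset. A minimal sketch, with hypothetical project, dataset, and table names:

    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")

    # The bq:// URI follows the format described above.
    dataset = aiplatform.TabularDataset.create(
        display_name="sales-training-data",
        bq_source="bq://my-project.sales_data.training_table",
    )
    print(dataset.resource_name)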

CSV

CSV files can be in Cloud Storage, or on your local computer. They must conform to the following requirements:

  • The first line of the first file must be a header containing the names of the columns. If the first row of a subsequent file is the same as the header, it is also treated as a header; otherwise, it is treated as data.
  • Column names can include any alphanumeric character or an underscore (_). The column name cannot begin with an underscore.
  • Each file must not be larger than 10 GB.

    You can include multiple files, with a maximum combined size of 100 GB.

  • The delimiter must be a comma (",").

You do not need to specify a schema for your CSV data. Vertex AI automatically infers the schema for your table when you import your data, and uses the header row for column names.

For more information about CSV file format and data types, see CSV files.

If you are importing your data from Cloud Storage, it must be in a bucket that meets the Vertex AI bucket requirements.

If you are importing your data from your local computer, you must have a Cloud Storage bucket that meets the same requirements; your data is uploaded to that bucket before it is imported.
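As with BigQuery, you can create the tabular dataset from CSV files with the Vertex AI SDK for Python. A minimal sketch, with a hypothetical bucket and file path:

    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")

    # gcs_source accepts one or more Cloud Storage paths to CSV files.
    dataset = aiplatform.TabularDataset.create(
        display_name="sales-training-data",
        gcs_source=["gs://my-bucket/data/training_data.csv"],
    )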

Add weights to your training data

By default, Vertex AI weighs each row of your training data equally. For training purposes, no row is considered more important than another.

Sometimes, you might want some rows to have more importance for training. For example, if you are using spending data, you might want the data associated with higher spenders to have a larger impact on the model. If missing a specific outcome is something you particularly want to avoid, then you can weight rows with that outcome more heavily.

You give rows a relative weight by adding a weight column to your dataset. The weight column must be a numeric column with values from 0 through 10,000. Higher values indicate that the row is more important when training the model; a weight of 0 causes the row to be ignored. If you include a weight column, it must contain a value for every row.

Later, when you train your model, you specify this column as the Weight column.
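For example, with the Vertex AI SDK for Python, the weight column is a parameter of the training job's run call. A minimal sketch, assuming hypothetical column names churned (target) and row_weight (weights):

    from google.cloud import aiplatform

    job = aiplatform.AutoMLTabularTrainingJob(
        display_name="weighted-training",
        optimization_prediction_type="classification",
    )
    model = job.run(
        dataset=dataset,             # a previously created TabularDataset
        target_column="churned",     # hypothetical target column
        weight_column="row_weight",  # numeric column with values from 0 to 10,000
        budget_milli_node_hours=1000,
    )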

Custom weighting schemes are used only for training the model; they do not affect the test set used for model evaluation.

What's next