The Vertex AI SDK includes classes that store and read data used to train a model. Each data-related class represents a Vertex AI managed dataset that has structured data, unstructured data, or Vertex AI Feature Store data. After you create a dataset, you use it to train your model.
The following topics provide brief explanations of each data-related class in the Vertex AI SDK. The topic for each class includes a code example that shows how to create an instance of that class. After you create a dataset, you can use its ID to retrieve it:
from google.cloud import aiplatform

dataset = aiplatform.ImageDataset('projects/my-project/locations/my-region/datasets/{DATASET_ID}')
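The resource name passed to the constructor follows a fixed pattern, so it can be assembled from its parts. The following is a minimal sketch: `dataset_resource_name` is a hypothetical helper, not part of the Vertex AI SDK, and the project, region, and dataset ID values are placeholders.

```python
def dataset_resource_name(project: str, location: str, dataset_id: str) -> str:
    """Build a fully qualified dataset resource name (hypothetical helper)."""
    parts = (("project", project), ("location", location), ("dataset_id", dataset_id))
    for label, value in parts:
        # Each segment must be non-empty and must not contain a path separator.
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/': {value!r}")
    return f"projects/{project}/locations/{location}/datasets/{dataset_id}"


# Placeholder values; substitute your own project, region, and dataset ID.
name = dataset_resource_name("my-project", "my-region", "1234567890")
print(name)
```

The resulting string can then be passed to a dataset class constructor, such as the one shown above.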
Structured data classes
The following classes work with structured data, which is organized in rows and columns. Structured data is often used to store numbers, dates, values, and strings.
TabularDataset
Use this class to work with tabular datasets. You can use a CSV file, BigQuery, or a pandas DataFrame to create a tabular dataset. For more information about paging through BigQuery data, see Read data with BigQuery API using pagination. For more information about tabular data, see Tabular data.
The following code shows you how to create a tabular dataset by importing a CSV file.
my_dataset = aiplatform.TabularDataset.create(
    display_name="my-dataset",
    gcs_source=['gs://path/to/my/dataset.csv'],
)
The following code shows you how to create a dataset and import its data in two distinct steps. Because the example passes a text import schema, it uses a TextDataset:
my_dataset = aiplatform.TextDataset.create(
    display_name="my-dataset")
my_dataset.import_data(
    gcs_source=['gs://path/to/my/dataset.csv'],
    import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification,
)
If you create a tabular dataset with a pandas DataFrame, you need to use a BigQuery table to stage the data for Vertex AI:
my_dataset = aiplatform.TabularDataset.create_from_dataframe(
    df_source=my_pandas_dataframe,
    staging_path=f"bq://{bq_dataset_id}.table-unique"
)
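The staging table name in the example above ends in a placeholder that suggests it should be unique. One way to get a fresh name on every run is sketched below. `make_staging_path` is a hypothetical helper, not an SDK function, and the sketch assumes that `bq_dataset_id` has the form `project.dataset`.

```python
import uuid


def make_staging_path(bq_dataset_id: str, prefix: str = "staging") -> str:
    """Return a bq:// staging path with a unique table name (hypothetical helper)."""
    # uuid4().hex is 32 hexadecimal characters, so repeated calls
    # are very unlikely to produce the same table name.
    return f"bq://{bq_dataset_id}.{prefix}_{uuid.uuid4().hex}"


# "my-project.my_dataset" is a placeholder BigQuery dataset ID.
staging_path = make_staging_path("my-project.my_dataset")
print(staging_path)
```

The returned string can be passed as the `staging_path` argument shown above.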
TimeSeriesDataset
Use this class to work with time series datasets. A time series is a dataset that contains data recorded at different time intervals. The dataset includes time and at least one variable that's dependent on time. You use a time series dataset for forecasting. For more information, see Forecasting overview.
You can create a managed time series dataset from CSV files in a Cloud Storage bucket or from a BigQuery table.
The following code shows you how to create a TimeSeriesDataset by importing a CSV data source file that contains the time series data:
my_dataset = aiplatform.TimeSeriesDataset.create(
    display_name="my-dataset",
    gcs_source=['gs://path/to/my/dataset.csv'],
)
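To make the shape of such a CSV file concrete, the following sketch builds a tiny time series in memory with one time column and one time-dependent variable. The column names (`date`, `units_sold`) and the values are hypothetical, and the sketch does not cover every requirement of a forecasting dataset.

```python
import csv
import io

# Hypothetical daily sales figures: a time column plus one variable
# that depends on time.
rows = [
    {"date": "2024-01-01", "units_sold": 120},
    {"date": "2024-01-02", "units_sold": 135},
    {"date": "2024-01-03", "units_sold": 128},
]


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows, ["date", "units_sold"])
print(csv_text)
```

A file with this content would be uploaded to a Cloud Storage bucket and referenced through `gcs_source`.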
The following code shows you how to create a TimeSeriesDataset by importing a BigQuery table that contains the time series data:
my_dataset = aiplatform.TimeSeriesDataset.create(
    display_name="my-dataset",
    bq_source='bq://path/to/my/bigquerydataset.train',
)
Unstructured data classes
The following classes work with unstructured data, which can't be stored in a traditional relational database. It's often stored as audio, text, or video files, or in a NoSQL database.
ImageDataset
Use this class to work with a managed image dataset. To create a managed image dataset, you need a data source file in CSV format and a schema file in YAML format. A schema is optional for a custom model. The CSV file and the schema are accessed in Cloud Storage buckets.
Use image data for the following objectives:
- Single-label classification. For more information, see Prepare image training data for single-label classification.
- Multi-label classification. For more information, see Prepare image training data for multi-label classification.
- Object detection. For more information, see Prepare image training data for object detection.
The following code shows you how to create an image dataset by importing a CSV data source file and a YAML schema file. The schema file you use depends on whether your image dataset is used for single-label classification, multi-label classification, or object detection.
my_dataset = aiplatform.ImageDataset.create(
    display_name="my-image-dataset",
    gcs_source=['gs://path/to/my/image-dataset.csv'],
    import_schema_uri='gs://path/to/my/schema.yaml',
)
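Because the schema depends on the objective, it can help to keep that choice in one place. The sketch below maps each objective listed above to an attribute name on `aiplatform.schema.dataset.ioformat.image`. The attribute names are an assumption about the SDK's schema constants and should be checked against the installed version. `image_schema_attribute` is a hypothetical helper.

```python
# Assumption: these attribute names exist on
# aiplatform.schema.dataset.ioformat.image in the installed SDK.
IMAGE_SCHEMA_ATTRIBUTES = {
    "single-label classification": "single_label_classification",
    "multi-label classification": "multi_label_classification",
    "object detection": "bounding_box",
}


def image_schema_attribute(objective: str) -> str:
    """Return the schema attribute name for an image objective (hypothetical helper)."""
    key = objective.strip().lower()
    if key not in IMAGE_SCHEMA_ATTRIBUTES:
        raise ValueError(f"unsupported image objective: {objective!r}")
    return IMAGE_SCHEMA_ATTRIBUTES[key]


print(image_schema_attribute("Object detection"))
```

With the SDK installed, the schema URI could then be looked up with `getattr(aiplatform.schema.dataset.ioformat.image, image_schema_attribute(objective))` and passed as `import_schema_uri`.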
TextDataset
Use this class to work with a managed text dataset. To create a text dataset, you need a data source in CSV format and a schema in YAML format. A schema is optional for a custom model. The CSV file and the schema are accessed in Cloud Storage buckets.
Use text data for the following objectives:
- Classification. For more information, see Prepare text training data for classification.
- Entity extraction. For more information, see Prepare text training data for entity extraction.
- Sentiment analysis. For more information, see Prepare text training data for sentiment analysis.
The following code shows you how to create a text dataset by importing a CSV data source file and a YAML schema file. The schema file you use depends on whether your text dataset is used for classification, entity extraction, or sentiment analysis.
my_dataset = aiplatform.TextDataset.create(
    display_name="my-text-dataset",
    gcs_source=['gs://path/to/my/text-dataset.csv'],
    import_schema_uri='gs://path/to/my/schema.yaml',
)
VideoDataset
Use this class to work with a managed video dataset. To create a video dataset, you need a CSV data source file and a schema in YAML format. The CSV file and the schema are accessed in Cloud Storage buckets.
Use video data for the following objectives:
- Classification. For more information, see Classification schema files.
- Action recognition. For more information, see Action recognition schema files.
- Object tracking. For more information, see Object tracking schema files.
The following code shows you how to create a dataset to train a video classification model by importing a CSV data source file. The schema file you use depends on whether you use your video dataset for classification, action recognition, or object tracking.
my_dataset = aiplatform.VideoDataset.create(
    gcs_source=['gs://path/to/my/dataset.csv'],
    import_schema_uri=aiplatform.schema.dataset.ioformat.video.classification,
)
Vertex AI Feature Store data classes
Vertex AI Feature Store is a managed service used to store, serve, manage, and share ML features at scale.
Vertex AI Feature Store uses a time series data model composed of three classes that maintain features as they change over time. The three classes are organized hierarchically, from top to bottom: Featurestore, EntityType, and Feature.
For more information about the Vertex AI Feature Store data model, see Data model and resources. To learn about Vertex AI Feature Store data source requirements, see Source data requirements.
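The three-level hierarchy can be pictured with a small in-memory model. The sketch below is a toy illustration only. The `Toy*` classes are hypothetical and are not the SDK classes, they store no feature values, and the IDs are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyFeature:
    feature_id: str
    value_type: str


@dataclass
class ToyEntityType:
    entity_type_id: str
    features: Dict[str, ToyFeature] = field(default_factory=dict)

    def add_feature(self, feature_id: str, value_type: str) -> ToyFeature:
        # Feature IDs only need to be unique within their own entity type.
        if feature_id in self.features:
            raise ValueError(f"duplicate feature: {feature_id}")
        self.features[feature_id] = ToyFeature(feature_id, value_type)
        return self.features[feature_id]


@dataclass
class ToyFeaturestore:
    featurestore_id: str
    entity_types: Dict[str, ToyEntityType] = field(default_factory=dict)

    def add_entity_type(self, entity_type_id: str) -> ToyEntityType:
        # Entity type IDs must be unique within a featurestore.
        if entity_type_id in self.entity_types:
            raise ValueError(f"duplicate entity type: {entity_type_id}")
        self.entity_types[entity_type_id] = ToyEntityType(entity_type_id)
        return self.entity_types[entity_type_id]


store = ToyFeaturestore("music_service")
artist = store.add_entity_type("musical_artist")
user = store.add_entity_type("user")
artist.add_feature("date_of_birth", "STRING")
artist.add_feature("last_name", "STRING")
user.add_feature("last_name", "STRING")  # Same ID in another entity type is allowed.

for entity_type in store.entity_types.values():
    print(entity_type.entity_type_id, sorted(entity_type.features))
```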
The following classes are used with Vertex AI Feature Store data:
Featurestore
The featurestore resource, represented by the Featurestore class, is the top-level class in the Vertex AI Feature Store data model hierarchy. The next level resource in the data model is entity type, which is a collection of semantically related features that you create. The following are some of the Featurestore methods that work with entity types:
Create an entity type
Use the Featurestore.create_entity_type method with an entity_type_id to create an entity type resource. An entity type resource is represented by the EntityType class. The entity_type_id is alphanumeric and must be unique in a featurestore. The following is an example of how you can create an entity type:
# my_featurestore is assumed to be an existing aiplatform.Featurestore instance.
entity_type = my_featurestore.create_entity_type(
    entity_type_id=my_entity_type_name, description=my_entity_type_description
)
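The two constraints on an entity type ID can be checked locally before you call the API. The following is a rough sketch, not the service's actual validation. `check_entity_type_id` is a hypothetical helper, and allowing underscores is an assumption based on IDs such as `musical_artist` that appear in this document.

```python
import re

# Assumption: letters, digits, and underscores are accepted. The service
# may enforce stricter rules (for example, on length or letter case).
_ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def check_entity_type_id(entity_type_id: str, existing_ids: set) -> None:
    """Raise ValueError if the ID is malformed or already used (hypothetical helper)."""
    if not _ID_PATTERN.fullmatch(entity_type_id):
        raise ValueError(f"invalid characters in entity type ID: {entity_type_id!r}")
    if entity_type_id in existing_ids:
        raise ValueError(f"entity type ID already exists: {entity_type_id!r}")


# IDs already present in the (hypothetical) featurestore.
existing = {"user"}
check_entity_type_id("musical_artist", existing)  # Passes silently.
print("musical_artist is acceptable")
```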
Serve entity types
Use one of three Featurestore methods to serve entity data items:
- batch_serve_to_bq serves data to a BigQuery table.
- batch_serve_to_df serves data to a pandas DataFrame.
- batch_serve_to_gcs serves data to a CSV file or a TensorFlow TFRecord file.
EntityType
The EntityType class represents an entity type resource, which is a collection of semantically related features that you define. For example, a music service might have the entity types musical_artist and user. You can use the Featurestore.create_entity_type method or the EntityType.create method to create an entity type. The following code shows how to use EntityType.create:
entity_type = aiplatform.EntityType.create(
    entity_type_id=my_entity_type_name, featurestore_name=featurestore_name
)
Feature
The Feature class represents a feature resource, which is a measurable property or attribute of an entity type. For example, the musical_artist entity type might have features, such as date_of_birth and last_name, to track various properties of musical artists. Features must be unique within an entity type, but don't need to be globally unique.
When you create a Feature, you must specify its value type (for example, BOOL_ARRAY, DOUBLE, DOUBLE_ARRAY, or STRING). The following code shows an example of how to create a feature:
my_feature = aiplatform.Feature.create(
    feature_id='my_feature_id',
    value_type='INT64',
    entity_type_name='my_entity_type_id',
    featurestore_id='my_featurestore_id',
)
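Because the value type is passed as a string, a typo is easy to miss until the request fails. The sketch below normalizes and checks the string locally. `normalize_value_type` is a hypothetical helper, and the set of names is an assumption: it combines the types named above with others believed to be in the Feature value type enum, so verify it against the API reference.

```python
# Assumption: value type names believed to be accepted by the service.
# This list may be incomplete or out of date.
KNOWN_VALUE_TYPES = frozenset({
    "BOOL", "BOOL_ARRAY",
    "DOUBLE", "DOUBLE_ARRAY",
    "INT64", "INT64_ARRAY",
    "STRING", "STRING_ARRAY",
    "BYTES",
})


def normalize_value_type(value_type: str) -> str:
    """Upper-case a value type and reject unknown names (hypothetical helper)."""
    normalized = value_type.strip().upper()
    if normalized not in KNOWN_VALUE_TYPES:
        raise ValueError(f"unknown feature value type: {value_type!r}")
    return normalized


print(normalize_value_type("int64"))
```

The normalized string can then be passed as the `value_type` argument in the example above.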
What's next
- Learn about the Vertex AI SDK.