Source data requirements

Vertex AI Feature Store can ingest data from tables in BigQuery or files in Cloud Storage. Files in Cloud Storage must be in Avro or CSV format.

Each item (or row) must adhere to the following requirements:

  • You must have a column for entity IDs, and the values must be of type STRING. This column contains the entity IDs that the feature values are for.

  • Your source data value types must match the value types of the destination feature in the featurestore. For example, boolean values must be ingested into a feature that is of type BOOL.

  • All columns must have a header, and the header values must be of type STRING. There are no restrictions on header names.

    • For BigQuery tables, the column header is the column name.
    • For Avro, the column header is defined by the Avro schema that is associated with the binary data.
    • For CSV files, the column header is the first row.
  • If you provide a column for feature generation timestamps, use one of the following timestamp formats:

    • For BigQuery tables, timestamps must be in a TIMESTAMP column.
    • For Avro, timestamps must be of type long and logical type timestamp-micros.
    • For CSV files, timestamps must be in the RFC 3339 format.
  • CSV files cannot include array data types. Use Avro or BigQuery instead.

  • For array types, the array cannot contain null values. It can, however, be empty.
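As a concrete illustration of several of the requirements above, the following sketch defines a hypothetical Avro schema for a source file: a string entity ID field, a boolean feature value destined for a feature of type BOOL, and a feature generation timestamp of Avro type long with logical type timestamp-micros. The field names (`user_id`, `is_subscriber`, `update_time`) are assumptions for illustration only; your own column names can be anything.

```python
import json

# Hypothetical Avro schema for a batch-ingestion source file.
# Field names are illustrative; Vertex AI places no restrictions on them.
avro_schema = {
    "type": "record",
    "name": "FeatureSourceRow",
    "fields": [
        # Entity ID column: values must be of type STRING.
        {"name": "user_id", "type": "string"},
        # Feature value column: a boolean, for a destination feature of type BOOL.
        {"name": "is_subscriber", "type": "boolean"},
        # Feature generation timestamp: Avro long with logical type timestamp-micros.
        {
            "name": "update_time",
            "type": {"type": "long", "logicalType": "timestamp-micros"},
        },
    ],
}

print(json.dumps(avro_schema, indent=2))
```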

Feature value timestamps

For batch ingestions, Vertex AI Feature Store requires user-provided timestamps for the ingested feature values. You can specify a particular timestamp for each value or specify the same timestamp for all values:

  • If the timestamps for feature values are different, specify the timestamps in a column in your source data. Each row must have its own timestamp indicating when the feature value was generated. In your ingestion request, you specify the column name to identify the timestamp column.
  • If the timestamp for all feature values is the same, you can specify it as a parameter in your ingestion request. You can also specify the timestamp in a column in your source data, where each row has the same timestamp.
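To sketch the first option, where each row carries its own timestamp, the snippet below builds a small CSV source with a header row, STRING entity IDs, and a per-row RFC 3339 timestamp column. The column names and values are illustrative assumptions, not prescribed by the API; the point is that `isoformat()` on a timezone-aware datetime produces an RFC 3339-compatible string.

```python
import csv
import io
from datetime import datetime, timezone

# Illustrative rows: (entity ID, feature value, generation time).
# Column names and values below are assumptions for this sketch.
rows = [
    ("user_123", "true", datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)),
    ("user_456", "false", datetime(2023, 5, 2, 8, 30, tzinfo=timezone.utc)),
]

buf = io.StringIO()
writer = csv.writer(buf)
# Header row: for CSV files, the column header is the first row.
writer.writerow(["user_id", "is_subscriber", "update_time"])
for entity_id, value, ts in rows:
    # isoformat() on an aware datetime yields an RFC 3339-compatible
    # string, e.g. "2023-05-01T12:00:00+00:00".
    writer.writerow([entity_id, value, ts.isoformat()])

csv_text = buf.getvalue()
print(csv_text)
```

In the ingestion request you would then pass the name of the timestamp column (here, the hypothetical `update_time`) so that each feature value is recorded with its own generation time.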

Data source region

If your source data is in either BigQuery or Cloud Storage, the source dataset or bucket must be in the same region or in the same multi-regional location as your featurestore. For example, a featurestore in us-central1 can ingest data only from Cloud Storage buckets or BigQuery datasets that are in us-central1 or in the US multi-region location. You cannot ingest data from, for example, us-east1. Also, source data from dual-region buckets is not supported.

What's next