Class TabularDataset (1.17.0)

TabularDataset(
    dataset_name: str,
    project: Optional[str] = None,
    location: Optional[str] = None,
    credentials: Optional[google.auth.credentials.Credentials] = None,
)

Managed tabular dataset resource for Vertex AI.

Inheritance

builtins.object > google.cloud.aiplatform.base.VertexAiResourceNoun > builtins.object > google.cloud.aiplatform.base.FutureManager > google.cloud.aiplatform.base.VertexAiResourceNounWithFutureManager > google.cloud.aiplatform.datasets.dataset._Dataset > google.cloud.aiplatform.datasets.column_names_dataset._ColumnNamesDataset > TabularDataset

Methods

TabularDataset

TabularDataset(
    dataset_name: str,
    project: Optional[str] = None,
    location: Optional[str] = None,
    credentials: Optional[google.auth.credentials.Credentials] = None,
)

Retrieves an existing managed dataset given a dataset name or ID.

Parameters
Name Description
dataset_name str

Required. A fully-qualified dataset resource name or dataset ID. Example: "projects/123/locations/us-central1/datasets/456" or "456" when project and location are initialized or passed.

project str

Optional project to retrieve dataset from. If not set, project set in aiplatform.init will be used.

location str

Optional location to retrieve dataset from. If not set, location set in aiplatform.init will be used.

credentials auth_credentials.Credentials

Custom credentials to use to retrieve this Dataset. Overrides credentials set in aiplatform.init.
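
As a rough illustration of the two accepted forms of dataset_name, the sketch below builds a fully-qualified resource name from its parts and then retrieves the dataset. The helper names dataset_resource_name and load_tabular_dataset are hypothetical and not part of the SDK, and the project, location, and ID values are the placeholders from the example above. It assumes the google-cloud-aiplatform package is installed and credentials are configured.

```python
from typing import Optional


def dataset_resource_name(project: str, location: str, dataset_id: str) -> str:
    """Build a fully-qualified dataset resource name (hypothetical helper)."""
    return f"projects/{project}/locations/{location}/datasets/{dataset_id}"


def load_tabular_dataset(
    dataset_name: str,
    project: Optional[str] = None,
    location: Optional[str] = None,
):
    """Retrieve an existing managed tabular dataset (hypothetical wrapper)."""
    # Imported lazily so the helper above works without the SDK installed.
    from google.cloud import aiplatform

    return aiplatform.TabularDataset(
        dataset_name, project=project, location=location
    )


if __name__ == "__main__":
    name = dataset_resource_name("123", "us-central1", "456")
    print(name)
    # The calls below need network access and valid credentials:
    # dataset = load_tabular_dataset(name)
    # dataset = load_tabular_dataset("456", project="123", location="us-central1")
```

Passing the bare ID only works when project and location are supplied here or were set earlier through aiplatform.init.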

create

create(
    display_name: Optional[str] = None,
    gcs_source: Optional[Union[str, Sequence[str]]] = None,
    bq_source: Optional[str] = None,
    project: Optional[str] = None,
    location: Optional[str] = None,
    credentials: Optional[google.auth.credentials.Credentials] = None,
    request_metadata: Optional[Sequence[Tuple[str, str]]] = (),
    labels: Optional[Dict[str, str]] = None,
    encryption_spec_key_name: Optional[str] = None,
    sync: bool = True,
    create_request_timeout: Optional[float] = None,
)

Creates a new tabular dataset.

Parameters
Name Description
display_name str

Optional. The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

gcs_source Union[str, Sequence[str]]

Google Cloud Storage URI(s) of the input file(s). Examples: str: "gs://bucket/file.csv"; Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]

bq_source str

BigQuery URI of the input table. Example: "bq://project.dataset.table_name"

project str

Project to upload this dataset to. Overrides project set in aiplatform.init.

location str

Location to upload this dataset to. Overrides location set in aiplatform.init.

credentials auth_credentials.Credentials

Custom credentials to use to upload this dataset. Overrides credentials set in aiplatform.init.

request_metadata Sequence[Tuple[str, str]]

Strings which should be sent along with the request as metadata.

labels Dict[str, str]

Optional. Labels with user-defined metadata to organize your datasets. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can only contain lowercase letters, numeric characters, underscores, and dashes. International characters are allowed. No more than 64 user labels can be associated with one Dataset (system labels are excluded). See https://goo.gl/xmQnxf for more information and examples of labels. System reserved label keys are prefixed with "aiplatform.googleapis.com/" and are immutable.

encryption_spec_key_name Optional[str]

Optional. The Cloud KMS resource identifier of the customer managed encryption key used to protect the dataset. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created. If set, this Dataset and all sub-resources of this Dataset will be secured by this key. Overrides encryption_spec_key_name set in aiplatform.init.

sync bool

Whether to execute this method synchronously. If False, this method will be executed in a concurrent Future, and any downstream object will be returned immediately and synced when the Future has completed.

create_request_timeout float

Optional. The timeout for the create request in seconds.

Returns
Type Description
tabular_dataset (TabularDataset) Instantiated representation of the managed tabular dataset resource.
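
The label rules listed for create can be checked locally before the request is sent. The following is a minimal sketch under stated assumptions: check_labels and create_tabular_dataset are hypothetical helpers, not SDK functions, and the character test is only an approximation of the documented rule (the service remains the authority on what it accepts). The create call itself requires google-cloud-aiplatform and valid credentials.

```python
from typing import Dict, List, Optional, Sequence, Union

MAX_LABELS = 64        # documented limit on user labels per resource
MAX_LABEL_LENGTH = 64  # documented limit on key/value length


def _allowed(ch: str) -> bool:
    # Approximation: "_", "-", digits, and any letter that is not uppercase
    # (which also admits international characters).
    return ch in "_-" or ch.isdigit() or (ch.isalpha() and not ch.isupper())


def check_labels(labels: Dict[str, str]) -> List[str]:
    """Return the problems found; an empty list means the labels look fine."""
    problems = []
    if len(labels) > MAX_LABELS:
        problems.append(f"too many labels: {len(labels)} > {MAX_LABELS}")
    for key, value in labels.items():
        for kind, text in (("key", key), ("value", value)):
            if len(text) > MAX_LABEL_LENGTH:
                problems.append(f"{kind} {text!r} exceeds {MAX_LABEL_LENGTH} characters")
            if not all(_allowed(ch) for ch in text):
                problems.append(f"{kind} {text!r} contains disallowed characters")
    return problems


def create_tabular_dataset(
    display_name: str,
    gcs_source: Union[str, Sequence[str]],
    labels: Optional[Dict[str, str]] = None,
    sync: bool = True,
):
    """Validate labels locally, then call TabularDataset.create (hypothetical wrapper)."""
    problems = check_labels(labels or {})
    if problems:
        raise ValueError("; ".join(problems))
    # Imported lazily so check_labels works without the SDK installed.
    from google.cloud import aiplatform

    return aiplatform.TabularDataset.create(
        display_name=display_name,
        gcs_source=gcs_source,
        labels=labels,
        sync=sync,
    )


if __name__ == "__main__":
    print(check_labels({"team": "data-eng", "Env": "prod"}))
    # Needs network access and valid credentials:
    # ds = create_tabular_dataset("sales", "gs://bucket/file.csv", {"team": "data-eng"})
```

Checking labels up front is optional; it only surfaces obvious mistakes before a network round trip.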

create_from_dataframe

create_from_dataframe(
    df_source: pd.DataFrame,
    staging_path: str,
    bq_schema: Optional[Union[str, google.cloud.bigquery.schema.SchemaField]] = None,
    display_name: Optional[str] = None,
    project: Optional[str] = None,
    location: Optional[str] = None,
    credentials: Optional[google.auth.credentials.Credentials] = None,
)

Creates a new tabular dataset from a Pandas DataFrame.

Parameters
Name Description
staging_path str

Required. The BigQuery table to stage the data for Vertex. Because Vertex maintains a reference to this source to create the Vertex Dataset, this BigQuery table should not be deleted. Example: bq://my-project.my-dataset.my-table. If the provided BigQuery table doesn't exist, this method will create the table. If the provided BigQuery table already exists, and the schemas of the BigQuery table and your DataFrame match, this method will append the data in your local DataFrame to the table. The location of the provided BigQuery table should conform to the location requirements specified here: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations.

bq_schema Optional[Union[str, bigquery.SchemaField]]

Optional. If not set, BigQuery will autodetect the schema using your DataFrame's column types. If set, BigQuery will use the schema you provide when creating the staging table. For more details, see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema

display_name str

Optional. The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

project str

Optional. Project to upload this dataset to. Overrides project set in aiplatform.init.

location str

Optional. Location to upload this dataset to. Overrides location set in aiplatform.init.

credentials auth_credentials.Credentials

Optional. Custom credentials to use to upload this dataset. Overrides credentials set in aiplatform.init.

df_source pd.DataFrame

Required. Pandas DataFrame containing the source data for ingestion as a TabularDataset. This method will use the data types from the provided DataFrame when creating the dataset.

Returns
Type Description
tabular_dataset (TabularDataset) Instantiated representation of the managed tabular dataset resource.
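
To make the staging_path format concrete, here is a hedged sketch that checks the bq://project.dataset.table shape before handing a DataFrame to create_from_dataframe. The helpers parse_bq_uri and dataset_from_rows are hypothetical names introduced for illustration; the parser assumes exactly three dot-separated parts, which may not cover every valid BigQuery identifier. The upload itself needs pandas, google-cloud-aiplatform, and credentials with BigQuery access.

```python
from typing import Dict, List, Optional, Tuple


def parse_bq_uri(uri: str) -> Tuple[str, str, str]:
    """Split "bq://project.dataset.table" into its parts (hypothetical helper)."""
    prefix = "bq://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a URI starting with {prefix!r}, got {uri!r}")
    parts = uri[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected bq://project.dataset.table, got {uri!r}")
    return parts[0], parts[1], parts[2]


def dataset_from_rows(
    rows: List[Dict[str, object]],
    staging_path: str,
    display_name: Optional[str] = None,
):
    """Build a DataFrame from rows and stage it as a TabularDataset (hypothetical wrapper)."""
    parse_bq_uri(staging_path)  # fail early on a malformed staging path
    # Imported lazily so parse_bq_uri works without these packages installed.
    import pandas as pd
    from google.cloud import aiplatform

    df = pd.DataFrame(rows)
    return aiplatform.TabularDataset.create_from_dataframe(
        df_source=df,
        staging_path=staging_path,
        display_name=display_name,
    )


if __name__ == "__main__":
    print(parse_bq_uri("bq://my-project.my-dataset.my-table"))
    # Needs network access and valid credentials:
    # ds = dataset_from_rows(
    #     [{"age": 31, "label": "yes"}, {"age": 47, "label": "no"}],
    #     "bq://my-project.my-dataset.my-table",
    #     display_name="example",
    # )
```

Because an existing staging table with a matching schema is appended to rather than replaced, calling this repeatedly with the same staging_path accumulates rows in that table.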

import_data

import_data()

Upload data to existing managed dataset.

Parameters
Name Description
gcs_source Union[str, Sequence[str]]

Required. Google Cloud Storage URI(s) of the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. Examples: str: "gs://bucket/file.csv"; Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]

import_schema_uri str

Required. Points to a YAML file stored on Google Cloud Storage describing the import format. Validation will be done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object (https://tinyurl.com/y538mdwt).

data_item_labels Dict

Labels that will be applied to newly imported DataItems. If a DataItem identical to one being imported already exists in the Dataset, these labels are appended to those of the existing DataItem, and if a label with an identical key was imported before, the old label value is overwritten. If two DataItems are identical within the same import data operation, their labels are combined, and if a key collision occurs, one of the values is picked randomly. Two DataItems are considered identical if their content bytes are identical (e.g. image bytes or PDF bytes). These labels will be overridden by Annotation labels specified inside the index file referenced by import_schema_uri, e.g. a JSONL file.

sync bool

Whether to execute this method synchronously. If False, this method will be executed in a concurrent Future, and any downstream object will be returned immediately and synced when the Future has completed.

import_request_timeout float

Optional. The timeout for the import request in seconds.

Returns
Type Description
dataset (Dataset) Instantiated representation of the managed dataset resource.