This page shows you how to create a Vertex AI dataset from your tabular data so you can start training classification and regression models. You can create a dataset using either the Google Cloud console or the Vertex AI API.
Before you begin
Before you can create a Vertex AI dataset from your tabular data, you must first prepare your data. For details, see:
- Prepare tabular training data for classification and regression models
- Best practices for creating tabular training data
Create an empty dataset and associate your prepared data
To create a machine learning model for classification or regression, you must first have a representative collection of data to train with. Use the Google Cloud console or the API to associate your prepared data with the dataset. After you associate your data, you can make modifications and start model training.
Google Cloud console
- In the Google Cloud console, in the Vertex AI section, go to the Datasets page.
- Click Create to open the create dataset details page.
- Modify the Dataset name field to create a descriptive dataset display name.
- Select the Tabular tab.
- Select the Regression/classification objective.
- Select a region from the Region drop-down list.
- If you want to use customer-managed encryption keys (CMEK) with your dataset, open Advanced options and provide your key. (Preview)
- Click Create to create your empty dataset, and advance to the Source tab.
- Choose one of the following options, based on your data source.
CSV files on your computer
- Click Upload CSV files from your computer.
- Click Select files and choose all the local files to upload to a Cloud Storage bucket.
- In the Select a Cloud Storage path section, enter the path to the Cloud Storage bucket or click Browse to choose a bucket location.
CSV files in Cloud Storage
- Click Select CSV files from Cloud Storage.
- In the Select CSV files from Cloud Storage section, enter the path to the Cloud Storage bucket or click Browse to choose the location of your CSV files.
A table or view in BigQuery
- Click Select a table or view from BigQuery.
- Enter the project, dataset, and table IDs for your input file.
- Click Continue.
Your data source is associated with your dataset.
API
When you create a dataset, you also associate it with its data source. The code needed to create a dataset depends on whether the training data resides in Cloud Storage or BigQuery. If the data source resides in a different project, make sure you set up the required permissions.
Creating a dataset with data in Cloud Storage
REST
You use the datasets.create method to create a dataset.
Before using any of the request data, make the following replacements:
- LOCATION: Region where the dataset will be stored. This must be a region that supports dataset resources. For example: us-central1.
- PROJECT: Your project ID.
- DATASET_NAME: Display name for the dataset.
- METADATA_SCHEMA_URI: The URI to the schema file for your objective: gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml
- URI: Paths (URIs) to the Cloud Storage buckets containing the training data. There can be more than one. Each URI has the form: gs://GCSprojectId/bucketName/fileName
- PROJECT_NUMBER: Your project's automatically generated project number.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets
Request JSON body:
{
  "display_name": "DATASET_NAME",
  "metadata_schema_uri": "METADATA_SCHEMA_URI",
  "metadata": {
    "input_config": {
      "gcs_source": {
        "uri": [URI1, URI2, ...]
      }
    }
  }
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateDatasetOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-07-07T21:27:35.964882Z",
      "updateTime": "2020-07-07T21:27:35.964882Z"
    }
  }
}
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
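As a sketch of what a Python sample might look like here, the following uses the Vertex AI SDK for Python (`google-cloud-aiplatform`) to create a tabular dataset from CSV files in Cloud Storage. The project ID, region, display name, and gs:// paths are placeholder assumptions, not values from this page; replace them with your own.

```python
# Sketch: create a tabular dataset from CSV files in Cloud Storage with the
# Vertex AI SDK for Python (pip install google-cloud-aiplatform). The project,
# region, and gs:// paths below are placeholders, not values from this page.


def gcs_csv_source(uris):
    """Lightweight check that every source path is a Cloud Storage URI."""
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URI: {uri}")
    return uris


def create_tabular_dataset_from_gcs(project, location, display_name, gcs_uris):
    """Create a Vertex AI tabular dataset and wait for the create operation."""
    # Imported here so the helper above stays usable without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    dataset = aiplatform.TabularDataset.create(
        display_name=display_name,
        gcs_source=gcs_csv_source(gcs_uris),
    )
    dataset.wait()  # block until the long-running create operation completes
    return dataset


if __name__ == "__main__":
    ds = create_tabular_dataset_from_gcs(
        project="my-project",  # placeholder
        location="us-central1",
        display_name="my-tabular-dataset",
        gcs_uris=["gs://my-bucket/data/train.csv"],  # placeholder
    )
    print(ds.resource_name)
```

As with the REST example, `TabularDataset.create` starts a long-running operation; `wait()` blocks until the dataset resource exists.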
Creating a dataset with data in BigQuery
REST
You use the datasets.create method to create a dataset.
Before using any of the request data, make the following replacements:
- LOCATION: Region where the dataset will be stored. This must be a region that supports dataset resources. For example: us-central1.
- PROJECT: Your project ID.
- DATASET_NAME: Display name for the dataset.
- METADATA_SCHEMA_URI: The URI to the schema file for your objective: gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml
- URI: Path to the BigQuery table containing the training data, in the form: bq://bqprojectId.bqDatasetId.bqTableId
- PROJECT_NUMBER: Your project's automatically generated project number.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets
Request JSON body:
{
  "display_name": "DATASET_NAME",
  "metadata_schema_uri": "METADATA_SCHEMA_URI",
  "metadata": {
    "input_config": {
      "bigquery_source": {
        "uri": "URI"
      }
    }
  }
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT/locations/LOCATION/datasets" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateDatasetOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-07-07T21:27:35.964882Z",
      "updateTime": "2020-07-07T21:27:35.964882Z"
    }
  }
}
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
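As a sketch of what a Python sample might look like here, the following uses the Vertex AI SDK for Python to create a tabular dataset backed by a BigQuery table. The project ID, region, display name, and BigQuery identifiers are placeholder assumptions; replace them with your own.

```python
# Sketch: create a tabular dataset from a BigQuery table with the Vertex AI
# SDK for Python (pip install google-cloud-aiplatform). The project, region,
# and BigQuery identifiers below are placeholders, not values from this page.


def bq_uri(project_id, dataset_id, table_id):
    """Build the bq:// URI form that TabularDataset.create expects."""
    return f"bq://{project_id}.{dataset_id}.{table_id}"


def create_tabular_dataset_from_bq(project, location, display_name, source_uri):
    """Create a Vertex AI tabular dataset backed by a BigQuery table."""
    # Imported here so the helper above stays usable without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    dataset = aiplatform.TabularDataset.create(
        display_name=display_name,
        bq_source=source_uri,  # e.g. "bq://my-project.my_dataset.my_table"
    )
    dataset.wait()  # block until the long-running create operation completes
    return dataset


if __name__ == "__main__":
    ds = create_tabular_dataset_from_bq(
        project="my-project",  # placeholder
        location="us-central1",
        display_name="my-tabular-dataset",
        source_uri=bq_uri("my-project", "my_dataset", "my_table"),  # placeholder
    )
    print(ds.resource_name)
```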
Get operation status
Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
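As an illustration of handling the long-running operation in code (a sketch, not one of this page's official samples), the lower-level GAPIC client returns the operation object directly, so you can log its name for later polling or block on its result. The project, region, and timeout shown are placeholder assumptions.

```python
# Sketch: work with the long-running operation returned by datasets.create
# using the lower-level GAPIC client (pip install google-cloud-aiplatform).
# The project, region, and timeout are placeholders, not values from this page.


def operation_endpoint(location):
    """Regional API endpoint used by the Vertex AI clients."""
    return f"{location}-aiplatform.googleapis.com"


def create_dataset_and_wait(project, location, dataset, timeout=300):
    """Start datasets.create, print the operation name, and wait for it."""
    # Imported here so the helper above stays usable without the SDK installed.
    from google.cloud import aiplatform_v1

    client = aiplatform_v1.DatasetServiceClient(
        client_options={"api_endpoint": operation_endpoint(location)}
    )
    parent = f"projects/{project}/locations/{location}"
    lro = client.create_dataset(parent=parent, dataset=dataset)
    # The operation name can also be saved and polled later, or cancelled.
    print("operation:", lro.operation.name)
    return lro.result(timeout=timeout)  # raises on failure or timeout
```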