Managing datasets

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

  1. Create a dataset and specify whether to allow multiple labels on each item.
  2. Import data items into the dataset.
  3. Label the items.

A project can have multiple datasets, each used to train a separate model. You can list the available datasets and delete the ones you no longer need.

Creating a dataset

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model.

Web UI

The AutoML Video Classification UI enables you to create a new dataset and import items into it from the same page.

  1. Open the AutoML Video Classification UI. The Datasets page shows the status of previously created datasets for the current project. To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.
  2. On the Datasets page, click Create Dataset.
  3. Enter information about the dataset:
    1. Specify a name for this dataset.
    2. Select Video Classification.
    3. Click Create Dataset.
  4. Enter the following information:
    1. Provide the Cloud Storage URI of the CSV file that contains the URIs of your training data (see Prepare data).
      For example, to use the demo data, enter:
      automl-video-demo-data/hmdb_split1.csv

    2. Click Continue to begin importing your data.

The import process can take a while to complete, depending on the number and length of the videos that you've provided.

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • dataset-name: name of the dataset to show in the interface
  • project-number: number of your project
  • location-id: the Cloud region where annotation should take place. Supported cloud regions are: us-east1, us-west1, europe-west1, asia-east1. If no region is specified, a region will be determined based on video file location.

HTTP method and URL:

POST https://automl.googleapis.com/v1beta1/projects/project-number/locations/location-id/datasets

Request JSON body:

{
  "displayName": "dataset-name",
  "videoClassificationDatasetMetadata": {
  }
}
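
If you would rather generate request.json from a script than write it by hand, the body above can be built with Python's standard json module. This is a minimal sketch: the helper name build_create_dataset_body and the "my_dataset" display name are illustrative assumptions, not part of the API.

```python
import json


def build_create_dataset_body(display_name):
    """Return the request body for creating a video classification dataset."""
    if not display_name:
        raise ValueError("display_name must not be empty")
    return {
        "displayName": display_name,
        # The empty metadata object mirrors the request body shown above.
        "videoClassificationDatasetMetadata": {},
    }


if __name__ == "__main__":
    # "my_dataset" is a placeholder; substitute your own dataset-name.
    with open("request.json", "w", encoding="utf-8") as f:
        json.dump(build_create_dataset_body("my_dataset"), f, indent=2)
```

Running the script writes request.json to the current directory, ready to be passed to the curl or PowerShell command.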

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://automl.googleapis.com/v1beta1/projects/project-number/locations/location-id/datasets

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://automl.googleapis.com/v1beta1/projects/project-number/locations/location-id/datasets" | Select-Object -Expand Content

If the request is successful, the AutoML Video Intelligence Classification API returns the name for your operation. The name includes project-number, the number of your project, and operation-id, the ID of the long-running operation created for the request.

Java

import com.google.cloud.automl.v1beta1.AutoMlClient;
import com.google.cloud.automl.v1beta1.Dataset;
import com.google.cloud.automl.v1beta1.LocationName;
import com.google.cloud.automl.v1beta1.VideoClassificationDatasetMetadata;
import java.io.IOException;

class VideoClassificationCreateDataset {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");
      VideoClassificationDatasetMetadata metadata =
          VideoClassificationDatasetMetadata.newBuilder().build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setVideoClassificationDatasetMetadata(metadata)
              .build();

      Dataset createdDataset = client.createDataset(projectLocation, dataset);

      // Display the dataset information.
      System.out.format("Dataset name: %s%n", createdDataset.getName());
      // Other methods require the dataset ID, which you have to parse out of the `name` field.
      // Name form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s%n", datasetId);
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1beta1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      videoClassificationDatasetMetadata: {},
    },
  };

  // Create dataset
  const [response] = await client.createDataset(request);

  console.log(`Dataset name: ${response.name}`);
  // The dataset ID is the last segment of the resource name.
  console.log(`Dataset id: ${response.name.split('/').pop()}`);
}

createDataset();

Python

from google.cloud import automl_v1beta1 as automl


def create_dataset(
    project_id="YOUR_PROJECT_ID", display_name="your_datasets_display_name"
):
    """Create an AutoML video classification dataset."""

    client = automl.AutoMlClient()

    # A resource that represents Google Cloud Platform location.
    project_location = client.location_path(project_id, "us-central1")
    metadata = automl.types.VideoClassificationDatasetMetadata()
    dataset = automl.types.Dataset(
        display_name=display_name,
        video_classification_dataset_metadata=metadata,
    )

    # Create a dataset with the dataset metadata in the region.
    created_dataset = client.create_dataset(project_location, dataset)

    # Display the dataset information
    print("Dataset name: {}".format(created_dataset.name))

    # Other methods require the dataset ID, which you have to parse out
    # of the `name` field.
    # Name form:
    #    `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
    print("Dataset id: {}".format(created_dataset.name.split("/")[-1]))
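
The samples above take the last path segment of the resource name as the dataset ID. If you also want the project and location, or want to reject malformed names early, a stricter parser can help. The following is a sketch that assumes only the name form quoted in the sample comments; parse_dataset_name is a hypothetical helper, not a client library function.

```python
import re

# Name form taken from the sample comments:
# projects/{project_id}/locations/{location_id}/datasets/{dataset_id}
_DATASET_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/datasets/(?P<dataset_id>[^/]+)$"
)


def parse_dataset_name(name):
    """Split a dataset resource name into project, location, and dataset ID."""
    match = _DATASET_NAME.match(name)
    if match is None:
        raise ValueError("not a dataset resource name: {!r}".format(name))
    return match.groupdict()
```

For example, parse_dataset_name("projects/123/locations/us-central1/datasets/VCN4798585402963263488")["dataset_id"] yields the ID that the other methods expect.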

Importing items into a dataset

After you have created a dataset, you can import labeled data from CSV files stored in a Cloud Storage bucket. For details on preparing your data and creating CSV files for import, see Preparing your training data.

You can import items into an empty dataset or import additional items into an existing dataset.

Web UI

Your data is imported when you create your dataset.

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • input-uri: the Cloud Storage URI of the CSV file that you want to import, including the file name. Must start with gs://. For example:
    "inputUris": ["gs://automl-video-demo-data/hmdb_split1.csv"]
  • dataset-id: replace with the dataset identifier for your dataset (not the display name). For example: VCN4798585402963263488
  • project-number: number of your project
  • location-id: the Cloud region where annotation should take place. Supported cloud regions are: us-east1, us-west1, europe-west1, asia-east1. If no region is specified, a region will be determined based on video file location.

HTTP method and URL:

POST https://automl.googleapis.com/v1beta1/projects/project-number/locations/location-id/datasets/dataset-id:importData

Request JSON body:

{
   "inputConfig": {
      "gcsSource": {
         "inputUris": input-uri
      }
   }
}
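
Because every input URI must start with gs://, it can be worth validating the list before sending the request. The sketch below builds the same body from a comma-separated string, as the client library samples on this page do; build_import_body is a hypothetical helper name.

```python
def build_import_body(path):
    """Build the importData request body from comma-separated gs:// URIs."""
    input_uris = [uri.strip() for uri in path.split(",") if uri.strip()]
    if not input_uris:
        raise ValueError("at least one input URI is required")
    for uri in input_uris:
        # The import request only accepts Cloud Storage URIs.
        if not uri.startswith("gs://"):
            raise ValueError("input URI must start with gs://: {!r}".format(uri))
    return {"inputConfig": {"gcsSource": {"inputUris": input_uris}}}
```

Passing "gs://automl-video-demo-data/hmdb_split1.csv" produces a body whose inputUris list contains that single URI; a value without the gs:// prefix raises ValueError instead of failing later on the server.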

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://automl.googleapis.com/v1beta1/projects/project-number/locations/location-id/datasets/dataset-id:importData

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://automl.googleapis.com/v1beta1/projects/project-number/locations/location-id/datasets/dataset-id:importData" | Select-Object -Expand Content

You should receive an operation ID for your import data operation, for example VCN7506374678919774208.

You can use the operation ID to get the status of the task. For an example, see Getting the status of an operation.
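
Checking the status usually means polling until the operation reports that it is done. The following is a generic sketch, not the API's own client: get_operation is a caller-supplied function (hypothetical here) that fetches the operation and returns a dict with the standard long-running operation fields "done" and, on failure, "error". The 45-minute default mirrors the timeout used in the Java sample on this page.

```python
import time


def wait_for_operation(get_operation, timeout_s=2700.0, poll_interval_s=30.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_operation() until it reports done or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        operation = get_operation()
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(
                    "operation failed: {}".format(operation["error"]))
            return operation
        if clock() >= deadline:
            # The import keeps running on the server; only the wait stops.
            raise TimeoutError(
                "operation still running after {} s".format(timeout_s))
        sleep(poll_interval_s)
```

The sleep and clock parameters exist only so the loop can be exercised without real delays; in normal use you would leave them at their defaults.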

Java

import com.google.api.gax.longrunning.OperationFuture;
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.automl.v1beta1.AutoMlClient;
import com.google.cloud.automl.v1beta1.AutoMlSettings;
import com.google.cloud.automl.v1beta1.DatasetName;
import com.google.cloud.automl.v1beta1.GcsSource;
import com.google.cloud.automl.v1beta1.InputConfig;
import com.google.cloud.automl.v1beta1.OperationMetadata;
import com.google.protobuf.Empty;
import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.threeten.bp.Duration;

class ImportDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String datasetId = "YOUR_DATASET_ID";
    String path = "gs://BUCKET_ID/path_to_training_data.csv";
    importDataset(projectId, datasetId, path);
  }

  // Import a dataset
  static void importDataset(String projectId, String datasetId, String path)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    Duration totalTimeout = Duration.ofMinutes(45);
    RetrySettings retrySettings = RetrySettings.newBuilder().setTotalTimeout(totalTimeout).build();
    AutoMlSettings.Builder builder = AutoMlSettings.newBuilder();
    builder.importDataSettings().setRetrySettings(retrySettings).build();
    AutoMlSettings settings = builder.build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create(settings)) {
      // Get the complete path of the dataset.
      DatasetName datasetFullId = DatasetName.of(projectId, "us-central1", datasetId);

      // Get multiple Google Cloud Storage URIs to import data from
      GcsSource gcsSource =
          GcsSource.newBuilder().addAllInputUris(Arrays.asList(path.split(","))).build();

      // Import data from the input URI
      InputConfig inputConfig = InputConfig.newBuilder().setGcsSource(gcsSource).build();
      System.out.println("Processing import...");

      // Start the import job
      OperationFuture<Empty, OperationMetadata> operation = client
          .importDataAsync(datasetFullId, inputConfig);

      System.out.format("Operation name: %s%n", operation.getName());

      // If you want to wait for the operation to finish, adjust the timeout appropriately. The
      // operation will still run if you choose not to wait for it to complete. You can check the
      // status of your operation using the operation's name.
      Empty response = operation.get(45, TimeUnit.MINUTES);
      System.out.format("Dataset imported. %s%n", response);
    } catch (TimeoutException e) {
      System.out.println("The operation's polling period was not long enough.");
      System.out.println("You can use the Operation's name to get the current status.");
      System.out.println("The import job is still running and will complete as expected.");
      throw e;
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const datasetId = 'YOUR_DATASET_ID';
// const path = 'gs://BUCKET_ID/path_to_training_data.csv';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1beta1;

// Instantiates a client
const client = new AutoMlClient();

async function importDataset() {
  // Construct request
  const request = {
    name: client.datasetPath(projectId, location, datasetId),
    inputConfig: {
      gcsSource: {
        inputUris: path.split(','),
      },
    },
  };

  // Import dataset
  console.log('Processing import');
  const [operation] = await client.importData(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();
  console.log(`Dataset imported: ${response}`);
}

importDataset();

Python

from google.cloud import automl_v1beta1 as automl


def import_dataset(
    project_id="YOUR_PROJECT_ID",
    dataset_id="YOUR_DATASET_ID",
    path="gs://YOUR_BUCKET_ID/path/to/data.csv",
):
    """Import a dataset."""
    client = automl.AutoMlClient()
    # Get the full path of the dataset.
    dataset_full_id = client.dataset_path(
        project_id, "us-central1", dataset_id
    )
    # Get the multiple Google Cloud Storage URIs
    input_uris = path.split(",")
    gcs_source = automl.types.GcsSource(input_uris=input_uris)
    input_config = automl.types.InputConfig(gcs_source=gcs_source)
    # Import data from the input URI
    response = client.import_data(dataset_full_id, input_config)

    print("Processing import...")
    print("Data imported. {}".format(response.result()))

Labeling training items

To be useful for training a model, each item in a dataset must have at least one category label assigned to it. AutoML Video Classification ignores items without a category label. You can provide labels for your training items in two ways:

  • Include labels in your CSV file
  • Label your items in the AutoML Video Classification UI

For details about labeling items in your CSV file, see Preparing your training data.
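
Because items without a category label are ignored, you may want to find them before you import. The sketch below assumes you hold your items in memory as (video URI, list of labels) pairs; that representation and the split_by_labeled helper are illustrative assumptions, not the CSV format or an API call.

```python
def split_by_labeled(items):
    """Split (video_uri, labels) pairs into labeled and unlabeled lists."""
    labeled, unlabeled = [], []
    for video_uri, labels in items:
        # Treat empty or whitespace-only labels as missing.
        cleaned = [label.strip() for label in labels if label and label.strip()]
        if cleaned:
            labeled.append((video_uri, cleaned))
        else:
            unlabeled.append((video_uri, cleaned))
    return labeled, unlabeled
```

The two returned lists correspond to the labeled and unlabeled counts that the UI summarizes for a dataset.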

To label items in the AutoML Video Classification UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual items in the dataset along with their labels. The navigation bar along the left summarizes the number of labeled and unlabeled items. It also enables you to filter the item list by label.

To assign labels to unlabeled videos or change video labels, do the following:

  1. On the page for the dataset, click the video that you want to add or change labels for.
  2. On the page for the video, do the following:

    1. Click Add Segment.
    2. Drag the arrows on either side of the video timeline to define the region that you want to label. By default, the entire duration of the video is selected.
    3. From the list of labels, click the labels that you want to apply to the video. The color bar for the label turns solid after you select it.
    4. Click Save.

If you need to add a new label for the dataset, on the page for the dataset, above the list of existing labels, click the three dots next to Filter labels and then click Add new label.

Listing datasets

A project can include numerous datasets. This section describes how to retrieve a list of the available datasets for a project.

Web UI

To see a list of the available datasets using the AutoML Video Classification UI, navigate to the Datasets page.

To see the datasets for a different project, select the project from the drop-down list in the upper right of the title bar.

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • project-number: number of your project
  • location-id: the Cloud region where annotation should take place. Supported cloud regions are: us-east1, us-west1, europe-west1, asia-east1. If no region is specified, a region is determined based on video file location.

HTTP method and URL:

GET https://automl.googleapis.com/v1beta1/projects/project-number/locations/location-id/datasets
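
The list request can also carry optional query parameters such as filter and pageSize. The sketch below only assembles the URL and sends nothing; list_datasets_url is a hypothetical helper, and the filter value in the usage note is an assumption modeled on the dataset metadata field names used elsewhere on this page.

```python
from urllib.parse import urlencode

API_ROOT = "https://automl.googleapis.com/v1beta1"


def list_datasets_url(project_number, location_id, filter_=None, page_size=None):
    """Return the list-datasets URL with optional query parameters."""
    url = "{}/projects/{}/locations/{}/datasets".format(
        API_ROOT, project_number, location_id)
    params = {}
    if filter_:
        params["filter"] = filter_
    if page_size:
        params["pageSize"] = page_size
    # urlencode percent-encodes characters such as ":" and "*".
    return url + ("?" + urlencode(params) if params else "")
```

For example, list_datasets_url("123", "us-central1", filter_="video_classification_dataset_metadata:*") returns the base URL followed by an encoded filter parameter.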

Java

import com.google.cloud.automl.v1beta1.AutoMlClient;
import com.google.cloud.automl.v1beta1.Dataset;
import com.google.cloud.automl.v1beta1.ListDatasetsRequest;
import com.google.cloud.automl.v1beta1.LocationName;
import java.io.IOException;

class ListDatasets {

  static void listDatasets() throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    listDatasets(projectId);
  }

  // List the datasets
  static void listDatasets(String projectId) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");
      ListDatasetsRequest request =
          ListDatasetsRequest.newBuilder().setParent(projectLocation.toString()).build();

      // List all the datasets available in the region.
      System.out.println("List of datasets:");
      for (Dataset dataset : client.listDatasets(request).iterateAll()) {
        // Display the dataset information
        System.out.format("%nDataset name: %s%n", dataset.getName());
        // Other methods require the dataset ID, which you have to parse out of the `name` field.
        // Name form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
        String[] names = dataset.getName().split("/");
        String retrievedDatasetId = names[names.length - 1];
        System.out.format("Dataset id: %s%n", retrievedDatasetId);
        System.out.format("Dataset display name: %s%n", dataset.getDisplayName());
        System.out.println("Dataset create time:");
        System.out.format("\tseconds: %s%n", dataset.getCreateTime().getSeconds());
        System.out.format("\tnanos: %s%n", dataset.getCreateTime().getNanos());

        System.out.format(
            "Video classification dataset metadata: %s%n",
            dataset.getVideoClassificationDatasetMetadata());
      }
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1beta1;

// Instantiates a client
const client = new AutoMlClient();

async function listDatasets() {
  // Construct request
  const request = {
    parent: client.locationPath(projectId, location),
    filter: 'video_classification_dataset_metadata:*',
  };

  const [response] = await client.listDatasets(request);

  console.log('List of datasets:');
  for (const dataset of response) {
    console.log(`Dataset name: ${dataset.name}`);
    console.log(
      `Dataset id: ${
        dataset.name.split('/')[dataset.name.split('/').length - 1]
      }`
    );
    console.log(`Dataset display name: ${dataset.displayName}`);
    console.log('Dataset create time');
    console.log(`\tseconds ${dataset.createTime.seconds}`);
    console.log(`\tnanos ${dataset.createTime.nanos}`);

    console.log(
      `Video classification dataset metadata: ${dataset.videoClassificationDatasetMetadata}`
    );
  }
}

listDatasets();

Python

from google.cloud import automl_v1beta1 as automl


def list_datasets(project_id="YOUR_PROJECT_ID"):
    """List datasets."""
    client = automl.AutoMlClient()
    # A resource that represents Google Cloud Platform location.
    project_location = client.location_path(project_id, "us-central1")

    # List all the datasets available in the region.
    response = client.list_datasets(project_location, "")

    print("List of datasets:")
    for dataset in response:
        print("Dataset name: {}".format(dataset.name))
        print("Dataset id: {}".format(dataset.name.split("/")[-1]))
        print("Dataset display name: {}".format(dataset.display_name))
        print("Dataset create time:")
        print("\tseconds: {}".format(dataset.create_time.seconds))
        print("\tnanos: {}".format(dataset.create_time.nanos))

        print(
            "Video classification dataset metadata: {}".format(
                dataset.video_classification_dataset_metadata
            )
        )

Deleting a dataset

The following code demonstrates how to delete a dataset.

Web UI

  1. Navigate to the Datasets page in the AutoML Video Classification UI.
  2. Click the three-dot menu at the far right of the row that you want to delete and select Delete dataset.
  3. Click Confirm in the confirmation dialog box.

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • dataset-name: the full name of your dataset, from the response when you created the dataset. The full name has the format:
    projects/project-number/locations/location-id/datasets/dataset-id
    • project-number: number of your project
    • location-id: the Cloud region where annotation should take place. Supported cloud regions are: us-east1, us-west1, europe-west1, asia-east1. If no region is specified, a region is determined based on video file location.
    • dataset-id: the ID returned when you created the dataset

HTTP method and URL:

DELETE https://automl.googleapis.com/v1beta1/dataset-name
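
If you only have the individual IDs rather than the full name from the create response, you can assemble dataset-name yourself. The sketch below follows the name format given above and builds the DELETE URL without sending it; both helper names are hypothetical.

```python
API_ROOT = "https://automl.googleapis.com/v1beta1"


def dataset_full_name(project_number, location_id, dataset_id):
    """Compose projects/.../locations/.../datasets/... from its parts."""
    parts = {
        "project-number": project_number,
        "location-id": location_id,
        "dataset-id": dataset_id,
    }
    for label, value in parts.items():
        # A slash inside a part would change the shape of the resource path.
        if not value or "/" in str(value):
            raise ValueError("invalid {}: {!r}".format(label, value))
    return "projects/{}/locations/{}/datasets/{}".format(
        project_number, location_id, dataset_id)


def delete_dataset_url(dataset_name):
    """Return the URL that the DELETE request targets."""
    return "{}/{}".format(API_ROOT, dataset_name)
```

Validating each part first guards against accidentally targeting a different resource path when an ID is empty or contains a slash.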

Java

import com.google.cloud.automl.v1beta1.AutoMlClient;
import com.google.cloud.automl.v1beta1.DatasetName;
import com.google.protobuf.Empty;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class DeleteDataset {

  static void deleteDataset() throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String datasetId = "YOUR_DATASET_ID";
    deleteDataset(projectId, datasetId);
  }

  // Delete a dataset
  static void deleteDataset(String projectId, String datasetId)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // Get the full path of the dataset.
      DatasetName datasetFullId = DatasetName.of(projectId, "us-central1", datasetId);
      Empty response = client.deleteDatasetAsync(datasetFullId).get();
      System.out.format("Dataset deleted. %s%n", response);
    }
  }
}

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const datasetId = 'YOUR_DATASET_ID';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1beta1;

// Instantiates a client
const client = new AutoMlClient();

async function deleteDataset() {
  // Construct request
  const request = {
    name: client.datasetPath(projectId, location, datasetId),
  };

  const [operation] = await client.deleteDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();
  console.log(`Dataset deleted: ${response}`);
}

deleteDataset();

Python

from google.cloud import automl_v1beta1 as automl


def delete_dataset(project_id="YOUR_PROJECT_ID", dataset_id="YOUR_DATASET_ID"):
    """Delete a dataset."""
    client = automl.AutoMlClient()
    # Get the full path of the dataset
    dataset_full_id = client.dataset_path(
        project_id, "us-central1", dataset_id
    )
    response = client.delete_dataset(dataset_full_id)

    print("Dataset deleted. {}".format(response.result()))