Managing datasets

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

  1. Create a dataset and specify whether to allow multiple labels on each item.
  2. Import data items into the dataset.
  3. Label the items.

A project can have multiple datasets, each used to train a separate model. You can get a list of the available datasets and can delete datasets you no longer need.

Creating a dataset

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model.

Web UI

The AutoML Video UI enables you to create a new dataset and import items into it from the same page.

  1. Open the AutoML Video UI. The Datasets page shows the status of previously created datasets for the current project. To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.
  2. On the Datasets page, click Create Dataset.
  3. Specify a name for this dataset and then click Create Dataset.
  4. On the page for your dataset, provide the Cloud Storage URI of the CSV file that contains the URIs of your training data, without the gs:// prefix.
  5. Also on the page for your dataset, click Continue to begin training.

The import process can take a while to complete, depending on the number and length of the videos that you've provided.

Command-line

The following example creates a dataset named test_dataset that supports one label per item (the MULTICLASS classification type). The newly created dataset doesn't contain any data until you import items into it.

Save the "name" of the new dataset (from the response) for use with other operations, such as importing items into your dataset and training a model.

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/datasets \
  -d '{
    "displayName": "test_dataset",
    "videoClassificationDatasetMetadata": {
    },
  }'

You should see output similar to the following:

{
  "name": "projects/drothaus-testing/locations/us-central1/datasets/VCN3940649673949184000",
  "displayName": "test_dataset",
  "createTime": "2018-10-18T21:18:13.975412Z",
  "videoClassificationDatasetMetadata": {}
}
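As noted above, the "name" field is what later operations expect. As a minimal sketch in plain Python (using the sample response shown above), you can parse the response and pull out both the full resource name and the trailing dataset ID:

```python
import json

# Sample create-dataset response, as shown above.
response_text = """
{
  "name": "projects/drothaus-testing/locations/us-central1/datasets/VCN3940649673949184000",
  "displayName": "test_dataset",
  "createTime": "2018-10-18T21:18:13.975412Z",
  "videoClassificationDatasetMetadata": {}
}
"""

response = json.loads(response_text)

# The full resource name is what importData and other operations expect.
dataset_name = response["name"]

# The dataset ID is the last path segment of the full name.
dataset_id = dataset_name.split("/")[-1]

print(dataset_name)
print(dataset_id)  # VCN3940649673949184000
```

The same split-on-slash trick appears in the Java, Node.js, and Python samples below to recover the dataset ID for display.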

Java

/**
 * Demonstrates using the AutoML client to create a dataset.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param datasetName the name of the dataset to be created.
 * @throws IOException
 */
public static void createDataset(String projectId, String computeRegion, String datasetName)
    throws IOException {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // A resource that represents Google Cloud Platform location.
  LocationName projectLocation = LocationName.of(projectId, computeRegion);

  // Set model metadata.
  VideoClassificationDatasetMetadata videoClassificationDatasetMetadata =
      VideoClassificationDatasetMetadata.newBuilder().build();

  // Set dataset name and dataset metadata.
  Dataset myDataset =
      Dataset.newBuilder()
          .setDisplayName(datasetName)
          .setVideoClassificationDatasetMetadata(videoClassificationDatasetMetadata)
          .build();

  // Create a dataset with the dataset metadata in the region.
  Dataset dataset = client.createDataset(projectLocation, myDataset);

  // Display the dataset information.
  System.out.println(String.format("Dataset name: %s", dataset.getName()));
  System.out.println(
      String.format(
          "Dataset Id: %s",
          dataset.getName().split("/")[dataset.getName().split("/").length - 1]));
  System.out.println(String.format("Dataset display name: %s", dataset.getDisplayName()));
  System.out.println("Video classification dataset metadata:");
  System.out.print(String.format("\t%s", dataset.getVideoClassificationDatasetMetadata()));
  System.out.println(String.format("Dataset example count: %d", dataset.getExampleCount()));
  DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
  String createTime =
      dateFormat.format(new java.util.Date(dataset.getCreateTime().getSeconds() * 1000));
  System.out.println(String.format("Dataset create time: %s", createTime));
}

Node.js

const automl = require(`@google-cloud/automl`);
const util = require(`util`);
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to create a dataset.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetName = '[DATASET_NAME]' e.g., "myDataset";

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// Set dataset name and metadata.
const myDataset = {
  displayName: datasetName,
  videoClassificationDatasetMetadata: {},
};

// Create a dataset with the dataset metadata in the region.
client
  .createDataset({parent: projectLocation, dataset: myDataset})
  .then(responses => {
    const dataset = responses[0];

    // Display the dataset information.
    console.log(`Dataset name: ${dataset.name}`);
    console.log(`Dataset Id: ${dataset.name.split(`/`).pop()}`);
    console.log(`Dataset display name: ${dataset.displayName}`);
    console.log(`Dataset example count: ${dataset.exampleCount}`);
    console.log(
      `Video classification dataset metadata: ${util.inspect(
        dataset.videoClassificationDatasetMetadata,
        false,
        null
      )}`
    );
  })
  .catch(err => {
    console.error(err);
  });

Python

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_name = 'DATASET_NAME_HERE'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(project_id, compute_region)

# Set dataset name and metadata of the dataset.
my_dataset = {
    "display_name": dataset_name,
    "video_classification_dataset_metadata": {},
}

# Create a dataset with the dataset metadata in the region.
dataset = client.create_dataset(project_location, my_dataset)

# Display the dataset information.
print("Dataset name: {}".format(dataset.name))
print("Dataset id: {}".format(dataset.name.split("/")[-1]))
print("Dataset display name: {}".format(dataset.display_name))
print("Dataset example count: {}".format(dataset.example_count))
print("Dataset create time:")
print("\tseconds: {}".format(dataset.create_time.seconds))
print("\tnanos: {}".format(dataset.create_time.nanos))

Importing items into a dataset

After you have created a dataset, you can import labeled data from CSV files stored in a Google Cloud Storage bucket. For details on preparing your data and creating CSV files for import, see Preparing your training data.

You can import items into an empty dataset or import additional items into an existing dataset.

Web UI

Your data is imported when you create your dataset.

Command-line

  • Replace dataset-name with the full name of your dataset, from the response when you created the dataset. The full name has the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}

  • Replace bucket-name with the name of the Google Cloud Storage bucket where you stored the CSV file that lists your training data.

  • Replace csv-file-name with the name of that CSV file.

    curl \
      -X POST \
      -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
      -H "Content-Type: application/json" \
      https://automl.googleapis.com/v1beta1/dataset-name:importData \
      -d '{
        "inputConfig": {
          "gcsSource": {
             "inputUris": ["gs://bucket-name/csv-file-name.csv"]
           }
        }
      }'
    

    You should see output similar to the following. You can use the operation ID to get the status of the task. For an example, see Getting the status of an operation.

    {
      "name": "projects/434039606874/locations/us-central1/operations/VCN7506374678919774208",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
        "createTime": "2018-04-27T01:28:36.128120Z",
        "updateTime": "2018-04-27T01:28:36.128150Z",
        "cancellable": true
      }
    }
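The operation's "name" in the response identifies a long-running operation, which you query with a GET against the same v1beta1 endpoint. A minimal sketch of building that status URL from the sample response above (sending the request still requires an Authorization header, as in the other curl examples):

```python
# Operation name from the importData response above.
operation_name = (
    "projects/434039606874/locations/us-central1/operations/VCN7506374678919774208"
)

# Status checks are a GET against the same v1beta1 endpoint, with the
# operation's full resource name as the path.
status_url = "https://automl.googleapis.com/v1beta1/" + operation_name

print(status_url)
```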
    

Java

/**
 * Demonstrates using the AutoML client to import labeled items.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param datasetId the Id of the dataset into which the training content is to be imported.
 * @param path the Google Cloud Storage URIs. Target files must be in AutoML Video Intelligence
 *     Classification CSV format.
 * @throws IOException
 * @throws ExecutionException
 * @throws InterruptedException
 */
public static void importData(
    String projectId, String computeRegion, String datasetId, String path)
    throws IOException, InterruptedException, ExecutionException {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // Get the complete path of the dataset.
  DatasetName datasetFullId = DatasetName.of(projectId, computeRegion, datasetId);

  GcsSource.Builder gcsSource = GcsSource.newBuilder();

  // Get multiple training data files to be imported from gcsSource.
  String[] inputUris = path.split(",");
  for (String inputUri : inputUris) {
    gcsSource.addInputUris(inputUri);
  }

  // Import data from the input URI
  InputConfig inputConfig = InputConfig.newBuilder().setGcsSource(gcsSource).build();
  System.out.println("Processing import...");

  Empty response = client.importDataAsync(datasetFullId, inputConfig).get();
  System.out.println(String.format("Dataset imported. %s", response));
}

Node.js

const automl = require(`@google-cloud/automl`);
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to import labeled items.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetId = '[DATASET_ID]' e.g.,"VCN7209576908164431872";
// const gcsPath = '[GCS_PATH]' e.g., "gs://<bucket-name>/<csv file>",
// `.csv paths in AutoML Video Intelligence Classification CSV format`;

// Get the full path of the dataset.
const datasetFullId = client.datasetPath(projectId, computeRegion, datasetId);

// Get the multiple Google Cloud Storage URIs.
const inputUris = gcsPath.split(`,`);
const inputConfig = {
  gcsSource: {
    inputUris: inputUris,
  },
};

// Import the data from the input URI.
client
  .importData({name: datasetFullId, inputConfig: inputConfig})
  .then(responses => {
    const operation = responses[0];
    console.log(`Processing import...`);
    return operation.promise();
  })
  .then(responses => {
    // The final result of the operation.
    const operationDetails = responses[2];

    // Get the data import details.
    console.log('Data import details:');
    console.log(`\tOperation details:`);
    console.log(`\t\tName: ${operationDetails.name}`);
    console.log(`\t\tDone: ${operationDetails.done}`);
  })
  .catch(err => {
    console.error(err);
  });

Python

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_id = 'DATASET_ID_HERE'
# path = 'gs://path/to/file.csv'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# Get the full path of the dataset.
dataset_full_id = client.dataset_path(
    project_id, compute_region, dataset_id
)

# Get the multiple Google Cloud Storage URIs.
input_uris = path.split(",")
input_config = {"gcs_source": {"input_uris": input_uris}}

# Import data from the input URI.
response = client.import_data(dataset_full_id, input_config)

print("Processing import...")
# synchronous check of operation status.
print("Data imported. {}".format(response.result()))

Labeling training items

To be useful for training a model, each item in a dataset must have at least one category label assigned to it. AutoML Video ignores items without a category label. You can provide labels for your training items in two ways:

  • Include labels in your CSV file
  • Label your items in the AutoML Video UI

For details about labeling items in your CSV file, see Preparing your training data.
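As a rough illustration only (the bucket and file names here are hypothetical, and Preparing your training data is the authoritative reference for the format), a labeled row in a training CSV pairs a video URI with a label and segment start and end times in seconds:

```
gs://my-bucket/videos/cat_video.mp4,cat,0.0,120.0
gs://my-bucket/videos/dog_video.mp4,dog,10.5,35.2
```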

To label items in the AutoML Video UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual items in the dataset along with their labels. The navigation bar along the left summarizes the number of labeled and unlabeled items. It also enables you to filter the item list by label.


To assign labels to unlabeled videos or change video labels, do the following:

  1. On the page for the dataset, click the video that you want to add or change labels for.
  2. On the page for the video, do the following:

    1. Click Add Segment.
    2. Drag the arrows on either side of the video timeline to define the region that you want to label. By default, the entire duration of the video is selected.
    3. From the list of labels, click the labels that you want to apply to the video. The color bar for the label turns solid after you select it.
    4. Click Save.


If you need to add a new label for the dataset, on the page for the dataset, above the list of existing labels, click the three dots next to Filter labels and then click Add new label.

Listing datasets

A project can include numerous datasets. This section describes how to retrieve a list of the available datasets for a project.

Web UI

To see a list of the available datasets using the AutoML Video UI, navigate to the Datasets page.

To see the datasets for a different project, select the project from the drop-down list in the upper right of the title bar.

Command-line

curl \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/datasets

You should see output similar to the following:

{
  "datasets": [
    {
      "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
      "displayName": "test_dataset",
      "createTime": "2018-04-26T18:02:59.825060Z",
      "videoClassificationDatasetMetadata": {
      }
    },
    {
      "name": "projects/434039606874/locations/us-central1/datasets/3104518874390609379",
      "displayName": "test",
      "createTime": "2017-12-16T01:10:38.328280Z",
      "videoClassificationDatasetMetadata": {
      }
    }
  ]
}
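Operations such as importing data and deleting a dataset take a dataset's full resource name rather than its display name. A minimal sketch in plain Python (against the sample response above, truncated to one entry) of looking up the full name by display name:

```python
import json

# Sample list-datasets response, as shown above (truncated to one entry).
response_text = """
{
  "datasets": [
    {
      "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
      "displayName": "test_dataset",
      "createTime": "2018-04-26T18:02:59.825060Z",
      "videoClassificationDatasetMetadata": {}
    }
  ]
}
"""

datasets = json.loads(response_text)["datasets"]

# Map display names to full resource names; importData and delete
# operations expect the full name.
by_display_name = {d["displayName"]: d["name"] for d in datasets}

print(by_display_name["test_dataset"])
```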

Java

/**
 * Demonstrates using the AutoML client to list all datasets.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param filter the Filter expression.
 * @throws IOException
 */
public static void listDatasets(String projectId, String computeRegion, String filter)
    throws IOException {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // A resource that represents Google Cloud Platform location.
  LocationName projectLocation = LocationName.of(projectId, computeRegion);

  // Build the List datasets request
  ListDatasetsRequest request =
      ListDatasetsRequest.newBuilder()
          .setParent(projectLocation.toString())
          .setFilter(filter)
          .build();

  // List all the datasets available in the region by applying filter.
  System.out.println("List of datasets:");
  for (Dataset dataset : client.listDatasets(request).iterateAll()) {
    // Display the dataset information.
    System.out.println(String.format("\nDataset name: %s", dataset.getName()));
    System.out.println(
        String.format(
            "Dataset Id: %s",
            dataset.getName().split("/")[dataset.getName().split("/").length - 1]));
    System.out.println(String.format("Dataset display name: %s", dataset.getDisplayName()));
    System.out.println(
        String.format(
            "Video classification dataset metadata: %s",
            dataset.getVideoClassificationDatasetMetadata()));
    System.out.println(String.format("Dataset example count: %d", dataset.getExampleCount()));
    DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
    String createTime =
        dateFormat.format(new java.util.Date(dataset.getCreateTime().getSeconds() * 1000));
    System.out.println(String.format("Dataset create time: %s", createTime));
  }
}

Node.js

const automl = require(`@google-cloud/automl`);
const util = require(`util`);
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to list all datasets.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const filter_ = '[FILTER_EXPRESSIONS]'
// e.g., "videoClassificationDatasetMetadata:*";

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// List all the datasets available in the region by applying filter.
client
  .listDatasets({parent: projectLocation, filter: filter_})
  .then(responses => {
    const dataset = responses[0];

    // Display the dataset information.
    console.log(`List of datasets:`);
    for (let i = 0; i < dataset.length; i++) {
      console.log(`\nDataset name: ${dataset[i].name}`);
      console.log(`Dataset Id: ${dataset[i].name.split(`/`).pop()}`);
      console.log(`Dataset display name: ${dataset[i].displayName}`);
      console.log(`Dataset example count: ${dataset[i].exampleCount}`);
      console.log(
        `Video classification dataset metadata: ${util.inspect(
          dataset[i].videoClassificationDatasetMetadata,
          false,
          null
        )}`
      );
    }
  })
  .catch(err => {
    console.error(err);
  });

Python

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# filter_ = 'filter expression here'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(project_id, compute_region)

# List all the datasets available in the region by applying filter.
response = client.list_datasets(project_location, filter_)

print("List of datasets:")
for dataset in response:
    # Display the dataset information.
    print("Dataset name: {}".format(dataset.name))
    print("Dataset id: {}".format(dataset.name.split("/")[-1]))
    print("Dataset display name: {}".format(dataset.display_name))
    print("Dataset example count: {}".format(dataset.example_count))
    print("Dataset create time:")
    print("\tseconds: {}".format(dataset.create_time.seconds))
    print("\tnanos: {}".format(dataset.create_time.nanos))

Deleting a dataset

The following examples demonstrate how to delete a dataset.

Web UI

  1. Navigate to the Datasets page in the AutoML Video UI.

  2. Click the three-dot menu at the far right of the row that you want to delete and select Delete dataset.
  3. Click Confirm in the confirmation dialog box.

Command-line

  • Replace dataset-name with the full name of your dataset, from the response when you created the dataset. The full name has the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}
curl -X DELETE \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" https://automl.googleapis.com/v1beta1/dataset-name

You should see output similar to the following:

{
  "name": "projects/434039606874/locations/us-central1/operations/3512013641657611176",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
    "createTime": "2018-05-04T01:45:16.735340Z",
    "updateTime": "2018-05-04T01:45:16.735360Z",
    "cancellable": true
  }
}

Java

/**
 * Demonstrates using the AutoML client to delete a dataset.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param datasetId the Id of the dataset.
 * @throws IOException
 * @throws ExecutionException
 * @throws InterruptedException
 */
public static void deleteDataset(String projectId, String computeRegion, String datasetId)
    throws IOException, InterruptedException, ExecutionException {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // Get the complete path of the dataset.
  DatasetName datasetFullId = DatasetName.of(projectId, computeRegion, datasetId);

  // Delete a dataset.
  Empty response = client.deleteDatasetAsync(datasetFullId).get();

  System.out.println(String.format("Dataset deleted. %s", response));
}

Node.js

const automl = require(`@google-cloud/automl`);
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to delete a dataset.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetId = '[DATASET_ID]' e.g., "VCN7209576908164431872";

// Get the full path of the dataset.
const datasetFullId = client.datasetPath(projectId, computeRegion, datasetId);

// Delete a dataset.
client
  .deleteDataset({name: datasetFullId})
  .then(responses => {
    const operation = responses[0];
    return operation.promise();
  })
  .then(responses => {
    // The final result of the operation.
    const operationDetails = responses[2];

    // Get the dataset delete details.
    console.log('Dataset delete details:');
    console.log(`\tOperation details:`);
    console.log(`\t\tName: ${operationDetails.name}`);
    console.log(`\t\tDone: ${operationDetails.done}`);
  })
  .catch(err => {
    console.error(err);
  });

Python

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_id = 'DATASET_ID_HERE'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# Get the full path of the dataset.
dataset_full_id = client.dataset_path(
    project_id, compute_region, dataset_id
)

# Delete a dataset.
response = client.delete_dataset(dataset_full_id)

# synchronous check of operation status.
print("Dataset deleted. {}".format(response.result()))