Managing datasets

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

  1. Create a dataset and specify whether to allow multiple labels on each item.
  2. Import data items into the dataset.
  3. Label the items.

In many cases, steps 2 and 3 are combined: you import data items with their labels already assigned.

A project can have multiple datasets, each used to train a separate model. You can get a list of the available datasets and can delete datasets you no longer need.

Creating a dataset

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model. When you create a dataset, you specify the type of classification you want your custom model to perform:

  • MULTICLASS assigns a single label to each classified image
  • MULTILABEL allows an image to be assigned multiple labels
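This choice shows up directly in the training CSV described under Preparing your training data: a MULTICLASS row carries exactly one label after the image URI, while a MULTILABEL row may list several. A minimal sketch (the bucket path and label names are invented for illustration):

```python
import csv
import io

# Hypothetical rows in AutoML Vision CSV import format: image URI, then labels.
multiclass_row = ["gs://example-bucket-vcm/img/cat1.jpg", "Cat"]
multilabel_row = ["gs://example-bucket-vcm/img/pet1.jpg", "Cat", "Indoors"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(multiclass_row)
writer.writerow(multilabel_row)

csv_text = buf.getvalue()
print(csv_text)

# Reading it back: everything after the URI is a label.
for uri, *labels in csv.reader(io.StringIO(csv_text)):
    print(uri, "->", labels)
```

A MULTICLASS dataset rejects rows shaped like the second one; a MULTILABEL dataset accepts both.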

Web UI

The AutoML Vision UI enables you to create a new dataset and import items into it from the same page. If you would rather import items later, select Import images later in step 3 below.

  1. Open the AutoML Vision UI.

    The Datasets page shows the status of previously created datasets for the current project.

    Dataset list page

    To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.

  2. Click the New Dataset button in the title bar.

  3. On the Create dataset page, enter a name for the dataset and specify where to find the labeled images to use for training the model.

    You can:

    • Upload a .csv file that contains the training images and their associated category labels from your local computer or from Google Cloud Storage.

    • Upload a collection of images or .zip files that contain the training images from your local computer.

    • Postpone uploading images and labels until later. Use this option for manual labeling through the UI or through the human labeling service.

  4. Specify whether to enable multi-label classification.

    Click the check box if you want the model to assign multiple labels to an image.

  5. Click Create dataset.

    You're returned to the Datasets page; your dataset shows an in-progress animation while your images are imported. Importing takes approximately 10 minutes per 1,000 images, though the actual time varies.

    If the service returns a 405 error, reduce the number of images you're uploading at once. Refresh the page before trying again.

Command-line

The following example creates a dataset named test_dataset that supports one label per item (see MULTICLASS). The newly created dataset doesn't contain any data until you import items into it.

Save the "name" of the new dataset (from the response) for use with other operations, such as importing items into your dataset and training a model.

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/datasets \
  -d '{
    "displayName": "test_dataset",
    "imageClassificationDatasetMetadata": {
      "classificationType": "MULTICLASS"
    }
  }'

You should see output similar to the following:

{
  "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
  "displayName": "test_dataset",
  "createTime": "2018-04-26T18:02:59.825060Z",
  "imageClassificationDatasetMetadata": {
    "classificationType": "MULTICLASS"
  }
}
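The `name` in that response is a full resource path; later calls need either the whole path or just the trailing dataset ID. A small helper (hypothetical, not part of the client library) that splits the path into its components:

```python
def parse_dataset_name(name):
    """Split a dataset resource name into its components.

    Expected shape: projects/{project}/locations/{location}/datasets/{dataset}
    """
    parts = name.split("/")
    if len(parts) != 6 or parts[0] != "projects" or parts[4] != "datasets":
        raise ValueError("unexpected dataset name: {}".format(name))
    return {"project": parts[1], "location": parts[3], "dataset_id": parts[5]}

info = parse_dataset_name(
    "projects/434039606874/locations/us-central1/datasets/356587829854924648"
)
print(info["dataset_id"])  # the trailing ID used in later per-dataset calls
```

This mirrors what the client-library samples below do inline with `name.split("/")[-1]`.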

Python

Before you can run this code example, you must install the Python Client Libraries.

The following example creates a dataset that supports one label per item (see MULTICLASS). The newly created dataset doesn't contain any data until you import items into it.

Save the name of the new dataset (response.name) for use with other operations, such as importing items into your dataset and training a model.

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_name = 'DATASET_NAME_HERE'
# multilabel = True for multilabel or False for multiclass

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(project_id, compute_region)

# Classification type is assigned based on multilabel value.
classification_type = "MULTICLASS"
if multilabel:
    classification_type = "MULTILABEL"

# Specify the image classification type for the dataset.
dataset_metadata = {"classification_type": classification_type}
# Set dataset name and metadata of the dataset.
my_dataset = {
    "display_name": dataset_name,
    "image_classification_dataset_metadata": dataset_metadata,
}

# Create a dataset with the dataset metadata in the region.
dataset = client.create_dataset(project_location, my_dataset)

# Display the dataset information.
print("Dataset name: {}".format(dataset.name))
print("Dataset id: {}".format(dataset.name.split("/")[-1]))
print("Dataset display name: {}".format(dataset.display_name))
print("Image classification dataset metadata:")
print("\t{}".format(dataset.image_classification_dataset_metadata))
print("Dataset example count: {}".format(dataset.example_count))
print("Dataset create time:")
print("\tseconds: {}".format(dataset.create_time.seconds))
print("\tnanos: {}".format(dataset.create_time.nanos))

Java

/**
 * Demonstrates using the AutoML client to create a dataset
 *
 * @param projectId the Google Cloud Project ID.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param datasetName the name of the dataset to be created.
 * @param multiLabel the type of classification problem: false for MULTICLASS (the default),
 *     true for MULTILABEL.
 * @throws IOException on Input/Output errors.
 */
public static void createDataset(
    String projectId, String computeRegion, String datasetName, Boolean multiLabel)
    throws IOException {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // A resource that represents Google Cloud Platform location.
  LocationName projectLocation = LocationName.of(projectId, computeRegion);

  // Classification type assigned based on multiLabel value.
  ClassificationType classificationType =
      multiLabel ? ClassificationType.MULTILABEL : ClassificationType.MULTICLASS;

  // Specify the image classification type for the dataset.
  ImageClassificationDatasetMetadata imageClassificationDatasetMetadata =
      ImageClassificationDatasetMetadata.newBuilder()
          .setClassificationType(classificationType)
          .build();

  // Set dataset with dataset name and set the dataset metadata.
  Dataset myDataset =
      Dataset.newBuilder()
          .setDisplayName(datasetName)
          .setImageClassificationDatasetMetadata(imageClassificationDatasetMetadata)
          .build();

  // Create dataset with the dataset metadata in the region.
  Dataset dataset = client.createDataset(projectLocation, myDataset);

  // Display the dataset information
  System.out.println(String.format("Dataset name: %s", dataset.getName()));
  System.out.println(
      String.format(
          "Dataset id: %s",
          dataset.getName().split("/")[dataset.getName().split("/").length - 1]));
  System.out.println(String.format("Dataset display name: %s", dataset.getDisplayName()));
  System.out.println("Image classification dataset specification:");
  System.out.print(String.format("\t%s", dataset.getImageClassificationDatasetMetadata()));
  System.out.println(String.format("Dataset example count: %d", dataset.getExampleCount()));
  System.out.println("Dataset create time:");
  System.out.println(String.format("\tseconds: %s", dataset.getCreateTime().getSeconds()));
  System.out.println(String.format("\tnanos: %s", dataset.getCreateTime().getNanos()));
}

Node.js

  const automl = require(`@google-cloud/automl`).v1beta1;

  const client = new automl.AutoMlClient();

  /**
   * TODO(developer): Uncomment the following line before running the sample.
   */
  // const projectId = `The GCLOUD_PROJECT string, e.g. "my-gcloud-project"`;
  // const computeRegion = `region-name, e.g. "us-central1"`;
  // const datasetName = `name of the dataset to create, e.g. "myDataset"`;
  // const multiLabel = `type of classification problem, true for multilabel and false for multiclass e.g. "false"`;

  // A resource that represents Google Cloud Platform location.
  const projectLocation = client.locationPath(projectId, computeRegion);

  // Classification type is assigned based on multilabel value.
  let classificationType = `MULTICLASS`;
  if (multiLabel) {
    classificationType = `MULTILABEL`;
  }

  // Specify the image classification type for the dataset.
  const datasetMetadata = {
    classificationType: classificationType,
  };

  // Set dataset name and metadata.
  const myDataset = {
    displayName: datasetName,
    imageClassificationDatasetMetadata: datasetMetadata,
  };

  // Create a dataset with the dataset metadata in the region.
  client
    .createDataset({parent: projectLocation, dataset: myDataset})
    .then(responses => {
      const dataset = responses[0];

      // Display the dataset information.
      console.log(`Dataset name: ${dataset.name}`);
      console.log(`Dataset id: ${dataset.name.split(`/`).pop()}`);
      console.log(`Dataset display name: ${dataset.displayName}`);
      console.log(`Dataset example count: ${dataset.exampleCount}`);
      console.log(`Image Classification type:`);
      console.log(
        `\t ${dataset.imageClassificationDatasetMetadata.classificationType}`
      );
      console.log(`Dataset create time:`);
      console.log(`\tseconds: ${dataset.createTime.seconds}`);
      console.log(`\tnanos: ${dataset.createTime.nanos}`);
    })
    .catch(err => {
      console.error(err);
    });

Importing items into a dataset

After you have created a dataset, you can import item URIs and labels for items from a CSV file stored in a Google Cloud Storage bucket. For details on preparing your data and creating a CSV file for import, see Preparing your training data.

You can import items into an empty dataset or import additional items into an existing dataset.
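Import failures are easier to avoid than to debug, so it can help to sanity-check the CSV locally before uploading it. The checks below are only an illustrative sketch implied by the CSV format (URI first, labels after), not an official validator:

```python
import csv
import io

def check_rows(csv_text):
    """Return (ok, problems) for rows in AutoML Vision CSV format:
    image URI first, zero or more labels after."""
    ok, problems = [], []
    for lineno, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue
        uri, labels = row[0], row[1:]
        if not uri.startswith("gs://"):
            problems.append((lineno, "not a gs:// URI"))
        elif not labels:
            # Unlabeled rows import fine but are ignored during training
            # until a label is assigned.
            problems.append((lineno, "no label"))
        else:
            ok.append(row)
    return ok, problems

sample = "gs://b-vcm/img/a.jpg,Cat\n/local/img/b.jpg,Dog\ngs://b-vcm/img/c.jpg\n"
ok, problems = check_rows(sample)
print(len(ok), "usable rows;", problems)
```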

Web UI

The AutoML Vision UI enables you to create a new dataset and import items into it from the same page; see Creating a dataset. The steps below import items into an existing dataset.

  1. Open the AutoML Vision UI and select the dataset from the Datasets page.

    Dataset list page

  2. On the Images page, click Add items in the title bar and select the import method from the drop-down list.

    You can:

    • Upload a .csv file that contains the training images and their associated category labels from your local computer or from Google Cloud Storage.

    • Upload .txt or .zip files that contain the training images from your local computer.

  3. Select the file(s) to import.

Command-line

  • Replace dataset-name with the full name of your dataset, from the response when you created the dataset. The full name has the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}

  • Replace bucket-name with the name of the Google Cloud Storage bucket where you have stored your CSV file.

  • Replace csv-file-name with the name of your CSV file.

    curl \
      -X POST \
      -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
      -H "Content-Type: application/json" \
      https://automl.googleapis.com/v1beta1/dataset-name:importData \
      -d '{
        "inputConfig": {
          "gcsSource": {
            "inputUris": ["gs://bucket-name-vcm/csv/csv-file-name.csv"]
          }
        }
      }'
    

    You should see output similar to the following. You can use the operation ID to get the status of the task. For an example, see Getting the status of an operation.

    {
      "name": "projects/434039606874/locations/us-central1/operations/1979469554520650937",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
        "createTime": "2018-04-27T01:28:36.128120Z",
        "updateTime": "2018-04-27T01:28:36.128150Z",
        "cancellable": true
      }
    }
    
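The response above is a long-running operation, not a final result; a client typically polls the operation name until `done` is true, backing off between attempts. A minimal sketch of that loop, where `fetch_operation` is a stand-in for the real status call (an operations.get request or a client-library equivalent):

```python
import time

def poll_until_done(fetch_operation, name, initial_delay=1.0, max_delay=60.0):
    """Poll a long-running operation, doubling the delay up to max_delay."""
    delay = initial_delay
    while True:
        op = fetch_operation(name)
        if op.get("done"):
            return op
        time.sleep(delay)
        delay = min(delay * 2, max_delay)

# Stand-in for the real status call: reports done on the third poll.
calls = {"n": 0}
def fake_fetch(name):
    calls["n"] += 1
    return {"name": name, "done": calls["n"] >= 3}

op = poll_until_done(
    fake_fetch,
    "projects/p/locations/us-central1/operations/123",
    initial_delay=0.01,
)
print("finished after", calls["n"], "polls")
```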

Python

Before you can run this code example, you must install the Python Client Libraries.

  • dataset_full_id is the full name of the dataset, with the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}

  • The value for input_uris must be the path to the CSV file in the Google Cloud Storage bucket associated with this project. The format is: gs://{project-id}-vcm/{csv-file-name}.csv

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_id = 'DATASET_ID_HERE'
# path = 'gs://path/to/file.csv'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# Get the full path of the dataset.
dataset_full_id = client.dataset_path(
    project_id, compute_region, dataset_id
)

# Get the multiple Google Cloud Storage URIs.
input_uris = path.split(",")
input_config = {"gcs_source": {"input_uris": input_uris}}

# Import data from the input URI.
response = client.import_data(dataset_full_id, input_config)

print("Processing import...")
# synchronous check of operation status.
print("Data imported. {}".format(response.result()))

Java

/**
 * Demonstrates using the AutoML client to import labeled images.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name.
 * @param datasetId the Id of the dataset to which the training data will be imported.
 * @param path the Google Cloud Storage URIs. Target files must be in AutoML vision CSV format.
 * @throws Exception on AutoML Client errors
 */
public static void importData(
    String projectId, String computeRegion, String datasetId, String path) throws Exception {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // Get the complete path of the dataset.
  DatasetName datasetFullId = DatasetName.of(projectId, computeRegion, datasetId);

  GcsSource.Builder gcsSource = GcsSource.newBuilder();

  // Get multiple training data files to be imported
  String[] inputUris = path.split(",");
  for (String inputUri : inputUris) {
    gcsSource.addInputUris(inputUri);
  }

  // Import data from the input URI
  InputConfig inputConfig = InputConfig.newBuilder().setGcsSource(gcsSource).build();
  System.out.println("Processing import...");
  Empty response = client.importDataAsync(datasetFullId.toString(), inputConfig).get();
  System.out.println(String.format("Dataset imported. %s", response));
}

Node.js

  const automl = require(`@google-cloud/automl`).v1beta1;

  const client = new automl.AutoMlClient();

  /**
   * TODO(developer): Uncomment the following line before running the sample.
   */
  // const projectId = `The GCLOUD_PROJECT string, e.g. "my-gcloud-project"`;
  // const computeRegion = `region-name, e.g. "us-central1"`;
  // const datasetId = `Id of the dataset`;
  // const path = `string or array of .csv paths in AutoML Vision CSV format, e.g. "gs://myproject/traindata.csv"`;

  // Get the full path of the dataset.
  const datasetFullId = client.datasetPath(projectId, computeRegion, datasetId);

  // Get one or more Google Cloud Storage URI(s).
  const inputUris = path.split(`,`);
  const inputConfig = {
    gcsSource: {
      inputUris: inputUris,
    },
  };

  // Import the dataset from the input URI.
  client
    .importData({name: datasetFullId, inputConfig: inputConfig})
    .then(responses => {
      const operation = responses[0];
      console.log(`Processing import...`);
      return operation.promise();
    })
    .then(responses => {
      // The final result of the operation.
      if (responses[2].done) {
        console.log(`Data imported.`);
      }
    })
    .catch(err => {
      console.error(err);
    });

Labeling training items

To be useful for training a model, each item in a dataset must have at least one category label assigned to it. AutoML Vision ignores items without a category label. You can provide labels for your training items in three ways:

  • Include labels in your .csv file
  • Label your items in the AutoML Vision UI
  • Request labeling from a human labeling service.

The AutoML API does not include methods for labeling.

For details about labeling items in your .csv file, see Preparing your training data.

To label items in the AutoML Vision UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual items in the dataset along with their labels. The navigation bar along the left summarizes the number of labeled and unlabeled items and enables you to filter the item list by label.
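The per-label summary the navigation bar shows is easy to reproduce locally from an item-to-labels mapping, which is handy for spotting under-represented labels before training. A quick sketch over made-up data:

```python
from collections import Counter

# Hypothetical dataset items: image URI -> assigned labels (possibly empty).
items = {
    "gs://b-vcm/img/1.jpg": ["Cat"],
    "gs://b-vcm/img/2.jpg": ["Dog"],
    "gs://b-vcm/img/3.jpg": ["Cat", "Indoors"],
    "gs://b-vcm/img/4.jpg": [],  # ignored during training until labeled
}

per_label = Counter(label for labels in items.values() for label in labels)
unlabeled = sum(1 for labels in items.values() if not labels)

for label, count in sorted(per_label.items()):
    print("{}: {}".format(label, count))
print("Unlabeled:", unlabeled)
```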

Images page

To assign labels to unlabeled items or change item labels, select the items you want to update and the label(s) you want to assign to them.

Request labeling through Human Labeling

You can leverage Google's Human Labeling service to label your images. Currently, there is no charge for the service, which is limited to 5000 images per task. Turnaround time is on the order of days and depends on the number of images and the complexity of the labeling.

The requirements for human labeling are:

  • At least 100 unlabeled images in your dataset.
  • Between 2 and 20 labels defined.
  • Descriptions for each label.
  • At least 3 example images per label.
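The checklist above is mechanical enough to verify locally before submitting a task. A sketch that encodes the four requirements (the data structures are invented; the thresholds come from the list above):

```python
def human_labeling_ready(num_unlabeled, labels):
    """Check the human-labeling prerequisites.

    labels: mapping of label name -> {"description": str, "examples": [uris]}
    Returns a list of unmet requirements (empty means ready).
    """
    problems = []
    if num_unlabeled < 100:
        problems.append("need at least 100 unlabeled images")
    if not 2 <= len(labels) <= 20:
        problems.append("need between 2 and 20 labels")
    for name, info in labels.items():
        if not info.get("description"):
            problems.append("label {!r} has no description".format(name))
        if len(info.get("examples", [])) < 3:
            problems.append("label {!r} needs at least 3 example images".format(name))
    return problems

labels = {
    "Cat": {"description": "A domestic cat", "examples": ["a", "b", "c"]},
    "Dog": {"description": "", "examples": ["d", "e"]},
}
print(human_labeling_ready(150, labels))
```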

To request human labeling:

From the Label tab of the dataset import flow:

  1. Define labels. Enter the label names for this dataset. You may also need to add a None_of_the_above label for images that do not match any of the other labels. Providing images that don't match any of your labels and labeling them as None_of_the_above can improve the quality of your model.

  2. Once created, click each label name from the Define labels list and provide a label description and one or more example images. Images can be pulled from your dataset, or uploaded from local disk.

  3. Select the Use human labeling service option. Enter a task name and click Start. The system will begin processing your request.

Once human labeling is complete, a Todo tab is added to your AutoML Vision interface. From this tab, you can approve the labels or take any other required action.

Listing datasets

A project can include numerous datasets. This section describes how to retrieve a list of the available datasets for a project.

Web UI

To see a list of the available datasets using the AutoML Vision UI, click the Datasets link at the top of the left navigation menu.

Dataset list page

To see the datasets for a different project, select the project from the drop-down list in the upper right of the title bar.

Command-line

curl \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/datasets

You should see output similar to the following:

{
  "datasets": [
    {
      "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
      "displayName": "test_dataset",
      "createTime": "2018-04-26T18:02:59.825060Z",
      "imageClassificationDatasetMetadata": {
        "classificationType": "MULTICLASS"
      }
    },
    {
      "name": "projects/434039606874/locations/us-central1/datasets/3104518874390609379",
      "displayName": "test",
      "createTime": "2017-12-16T01:10:38.328280Z",
      "imageClassificationDatasetMetadata": {
        "classificationType": "MULTICLASS"
      }
    }
  ]
}

Python

Before you can run this code example, you must install the Python Client Libraries.

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# filter_ = 'filter expression here'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = client.location_path(project_id, compute_region)

# List all the datasets available in the region by applying filter.
response = client.list_datasets(project_location, filter_)

print("List of datasets:")
for dataset in response:
    # Display the dataset information.
    print("Dataset name: {}".format(dataset.name))
    print("Dataset id: {}".format(dataset.name.split("/")[-1]))
    print("Dataset display name: {}".format(dataset.display_name))
    print("Image classification dataset metadata:")
    print("\t{}".format(dataset.image_classification_dataset_metadata))
    print("Dataset example count: {}".format(dataset.example_count))
    print("Dataset create time:")
    print("\tseconds: {}".format(dataset.create_time.seconds))
    print("\tnanos: {}".format(dataset.create_time.nanos))

Java

/**
 * Demonstrates using the AutoML client to list all datasets.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name.
 * @param filter the Filter expression.
 * @throws IOException on Input/Output errors.
 */
public static void listDatasets(String projectId, String computeRegion, String filter)
    throws IOException {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // A resource that represents Google Cloud Platform location.
  LocationName projectLocation = LocationName.of(projectId, computeRegion);

  // Build the List datasets request
  ListDatasetsRequest request =
      ListDatasetsRequest.newBuilder()
          .setParent(projectLocation.toString())
          .setFilter(filter)
          .build();

  // List all the datasets available in the region by applying the filter.
  System.out.print("List of datasets:");
  for (Dataset dataset : client.listDatasets(request).iterateAll()) {
    // Display the dataset information
    System.out.println(String.format("\nDataset name: %s", dataset.getName()));
    System.out.println(
        String.format(
            "Dataset id: %s",
            dataset.getName().split("/")[dataset.getName().split("/").length - 1]));
    System.out.println(String.format("Dataset display name: %s", dataset.getDisplayName()));
    System.out.println("Image classification dataset specification:");
    System.out.print(String.format("\t%s", dataset.getImageClassificationDatasetMetadata()));
    System.out.println(String.format("Dataset example count: %d", dataset.getExampleCount()));
    System.out.println("Dataset create time:");
    System.out.println(String.format("\tseconds: %s", dataset.getCreateTime().getSeconds()));
    System.out.println(String.format("\tnanos: %s", dataset.getCreateTime().getNanos()));
  }
}

Node.js

  const automl = require(`@google-cloud/automl`).v1beta1;

  const client = new automl.AutoMlClient();
  /**
   * TODO(developer): Uncomment the following line before running the sample.
   */
  // const projectId = `The GCLOUD_PROJECT string, e.g. "my-gcloud-project"`;
  // const computeRegion = `region-name, e.g. "us-central1"`;
  // const filter = `filter expressions, must specify field, e.g. "imageClassificationDatasetMetadata:*"`;

  // A resource that represents Google Cloud Platform location.
  const projectLocation = client.locationPath(projectId, computeRegion);

  // List all the datasets available in the region by applying filter.
  client
    .listDatasets({parent: projectLocation, filter: filter})
    .then(responses => {
      const datasets = responses[0];

      // Display the dataset information.
      console.log(`List of datasets:`);
      datasets.forEach(dataset => {
        console.log(`Dataset name: ${dataset.name}`);
        console.log(`Dataset Id: ${dataset.name.split(`/`).pop()}`);
        console.log(`Dataset display name: ${dataset.displayName}`);
        console.log(`Dataset example count: ${dataset.exampleCount}`);
        console.log(`Image Classification type:`);
        console.log(
          `\t`,
          dataset.imageClassificationDatasetMetadata.classificationType
        );
        console.log(`Dataset create time: `);
        console.log(`\tseconds: ${dataset.createTime.seconds}`);
        console.log(`\tnanos: ${dataset.createTime.nanos}`);
        console.log(`\n`);
      });
    })
    .catch(err => {
      console.error(err);
    });

Deleting a dataset

Web UI

  1. In the AutoML Vision UI, click the Datasets link at the top of the left navigation menu to display the list of available datasets.

    Dataset list page

  2. Click the three-dot menu at the far right of the row you want to delete and select Delete dataset.

  3. Click Delete in the confirmation dialog box.

Command-line

  • Replace dataset-name with the full name of your dataset, from the response when you created the dataset. The full name has the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}
curl -X DELETE \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" https://automl.googleapis.com/v1beta1/dataset-name

You should see output similar to the following:

{
  "name": "projects/434039606874/locations/us-central1/operations/3512013641657611176",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
    "createTime": "2018-05-04T01:45:16.735340Z",
    "updateTime": "2018-05-04T01:45:16.735360Z",
    "cancellable": true
  }
}

Python

Before you can run this code example, you must install the Python Client Libraries.

  • dataset_id is the identifier of the dataset: the final component of the dataset's full name, which has the format projects/{project-id}/locations/us-central1/datasets/{dataset-id}

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# dataset_id = 'DATASET_ID_HERE'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# Get the full path of the dataset.
dataset_full_id = client.dataset_path(
    project_id, compute_region, dataset_id
)

# Delete a dataset.
response = client.delete_dataset(dataset_full_id)

# synchronous check of operation status.
print("Dataset deleted. {}".format(response.result()))

Java

/**
 * Delete a dataset.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name.
 * @param datasetId the Id of the dataset.
 * @throws Exception on AutoML Client errors
 */
public static void deleteDataset(String projectId, String computeRegion, String datasetId)
    throws Exception {
  // Instantiates a client
  AutoMlClient client = AutoMlClient.create();

  // Get the complete path of the dataset.
  DatasetName datasetFullId = DatasetName.of(projectId, computeRegion, datasetId);

  // Delete a dataset.
  Empty response = client.deleteDatasetAsync(datasetFullId).get();

  System.out.println(String.format("Dataset deleted. %s", response));
}

Node.js

  const automl = require(`@google-cloud/automl`).v1beta1;

  const client = new automl.AutoMlClient();

  /**
   * TODO(developer): Uncomment the following line before running the sample.
   */
  // const projectId = `The GCLOUD_PROJECT string, e.g. "my-gcloud-project"`;
  // const computeRegion = `region-name, e.g. "us-central1"`;
  // const datasetId = `Id of the dataset`;

  // Get the full path of the dataset.
  const datasetFullId = client.datasetPath(projectId, computeRegion, datasetId);

  // Delete a dataset.
  client
    .deleteDataset({name: datasetFullId})
    .then(responses => {
      const operation = responses[0];
      return operation.promise();
    })
    .then(responses => {
      // The final result of the operation.
      if (responses[2].done) {
        console.log(`Dataset deleted.`);
      }
    })
    .catch(err => {
      console.error(err);
    });
