Managing datasets

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

  1. Create a dataset and specify whether to allow multiple labels on each item.
  2. Import data items into the dataset.
  3. Label the items.

In many cases, steps 2 and 3 are combined: you import data items with their labels already assigned.

A project can have multiple datasets, each used to train a separate model. You can get a list of the available datasets and can delete datasets you no longer need.

Creating a dataset

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model. When you create a dataset, you specify the type of classification you want your custom model to perform:

  • MULTICLASS assigns a single label to each classified image
  • MULTILABEL allows an image to be assigned multiple labels
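
For example, in the CSV import format described in Preparing your training data, the difference shows up in how many labels a row can carry. The bucket, file names, and labels below are placeholders for illustration only; a MULTICLASS dataset uses rows like the first, while a MULTILABEL dataset can also use rows like the second:

gs://my-bucket-vcm/images/img_001.jpg,dog
gs://my-bucket-vcm/images/img_002.jpg,dog,outdoor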

Web UI

The AutoML Vision UI enables you to create a new dataset and import items into it from the same page. If you would rather import items later, select Import images later at step 3 below.

  1. Open the AutoML Vision UI.

    The Datasets page shows the status of previously created datasets for the current project.

    Dataset list page

    To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.

  2. Click the New Dataset button in the title bar.

  3. On the Create dataset page, enter a name for the dataset and specify where to find the labeled images to use for training the model.

    You can:

    • Upload a .csv file that contains the training images and their associated category labels from your local computer or from Google Cloud Storage.

    • Upload .zip files that contain the training images from your local computer.

    • Postpone uploading images and labels until later. Use this option for manual labeling through the UI or through the human labeling service.

  4. Specify whether to enable multi-label classification.

    Click the check box if you want the model to assign multiple labels to an image.

  5. Click Create dataset.

    You're returned to the Datasets page; your dataset shows an in-progress animation while its images are being imported. Importing takes roughly 10 minutes per 1000 images, though the actual time can be shorter or longer.

    If the service returns a 405 error, reduce the number of images you're uploading at once. You'll need to refresh the page before trying again.

Command-line

The following example creates a dataset named test_dataset that supports one label per item (see MULTICLASS). The newly created dataset doesn't contain any data until you import items into it.

Save the "name" of the new dataset (from the response) for use with other operations, such as importing items into your dataset and training a model.

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/datasets \
  -d '{
    "displayName": "test_dataset",
    "imageClassificationDatasetSpec": {
      "classificationType": "MULTICLASS"
    }
  }'

You should see output similar to the following:

{
  "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
  "displayName": "test_dataset",
  "createTime": "2018-04-26T18:02:59.825060Z",
  "imageClassificationDatasetSpec": {
    "classificationType": "MULTICLASS"
  }
}

Python

Before you can run this code example, you must install the Python Client Libraries.

The following example creates a dataset that supports one label per item (see MULTICLASS). The newly created dataset doesn't contain any data until you import items into it.

Save the name of the new dataset (response.name) for use with other operations, such as importing items into your dataset and training a model.

def automl_create_dataset(dataset_name):
  """ Create an image classification dataset that supports one label per item. """
  dataset_spec = {
    "classification_type": enums.ClassificationType.MULTICLASS
  }
  my_dataset = {
    "display_name": dataset_name,
    "image_classification_dataset_spec": dataset_spec
  }
  # client and parent are created once, outside this function.
  response = client.create_dataset(parent, my_dataset)
  print("\nDataset creation: {}".format(response))
  # Save the full dataset name for later import and training calls.
  dataset_full_id = response.name
  return dataset_full_id
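
The function above assumes that client and parent already exist. A minimal setup sketch, assuming the v1beta1 Python client and a placeholder project ID (exact import paths and helper names may vary with the installed client library version):

from google.cloud import automl_v1beta1 as automl
from google.cloud.automl_v1beta1 import enums

project_id = 'my-project-id'  # placeholder: replace with your project ID
client = automl.AutoMlClient()
# AutoML Vision datasets are created in the us-central1 location.
parent = client.location_path(project_id, 'us-central1')

dataset_full_id = automl_create_dataset('test_dataset')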

Importing items into a dataset

After you have created a dataset, you can import item URIs and labels for items from a CSV file stored in a Google Cloud Storage bucket. For details on preparing your data and creating a CSV file for import, see Preparing your training data.

You can import items into an empty dataset or import additional items into an existing dataset.

Web UI

The AutoML Vision UI enables you to create a new dataset and import items into it from the same page; see Creating a dataset. The steps below import items into an existing dataset.

  1. Open the AutoML Vision UI and select the dataset from the Datasets page.

    Dataset list page

  2. On the Images page, click Add items in the title bar and select the import method from the drop-down list.

    You can:

    • Upload a .csv file that contains the training images and their associated category labels from your local computer or from Google Cloud Storage.

    • Upload .zip files that contain the training images from your local computer.

  3. Select the file(s) to import.

Command-line

  • Replace dataset-name with the full name of your dataset, from the response when you created the dataset. The full name has the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}

  • Replace bucket-name with the name of the Google Cloud Storage bucket where you have stored your CSV file.

  • Replace csv-file-name with the name of your CSV file.

    curl \
      -X POST \
      -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
      -H "Content-Type: application/json" \
      https://automl.googleapis.com/v1beta1/dataset-name:import \
      -d '{
        "inputUris": "gs://bucket-name-vcm/csv/csv-file-name.csv"
      }'
    

    You should see output similar to the following. You can use the operation ID to get the status of the task. For an example, see Getting the status of an operation.

    {
      "name": "projects/434039606874/locations/us-central1/operations/1979469554520650937",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
        "createTime": "2018-04-27T01:28:36.128120Z",
        "updateTime": "2018-04-27T01:28:36.128150Z",
        "cancellable": true
      }
    }
    

Python

Before you can run this code example, you must install the Python Client Libraries.

  • dataset_full_id is the full name of the dataset, with the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}

  • The value for input_uris must be the path to the CSV file in the Google Cloud Storage bucket associated with this project. The format is: gs://{project-id}-vcm/{csv-file-name}.csv

def automl_import(path, dataset_full_id):
  """ Import labeled items from a CSV file in Google Cloud Storage. """
  input_uris = [path]
  operation = client.import_dataset(dataset_full_id, input_uris)
  print('\nProcessing import...')
  # Synchronous check of operation status; result() blocks until the
  # long-running import completes.
  result = operation.result()
  print("\nItems imported: {}".format(result))
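
A hypothetical call, reusing the client setup sketched in the dataset-creation example; the Cloud Storage path below is a placeholder:

# dataset_full_id is the full dataset name returned by automl_create_dataset.
csv_path = 'gs://my-project-id-vcm/csv/my_labels.csv'  # placeholder path
automl_import(csv_path, dataset_full_id)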

Labeling training items

To be useful for training a model, each item in a dataset must have at least one category label assigned to it. AutoML Vision ignores items without a category label. You can provide labels for your training items in three ways:

  • Include labels in your .csv file
  • Label your items in the AutoML Vision UI
  • Request labeling from a human labeling service.

The AutoML API does not include methods for labeling.

For details about labeling items in your .csv file, see Preparing your training data.

To label items in the AutoML Vision UI, select the dataset from the dataset listing page to see its details. The display name of the selected dataset appears in the title bar, and the page lists the individual items in the dataset along with their labels. The navigation bar along the left summarizes the number of labeled and unlabeled items and enables you to filter the item list by label.

Images page

To assign labels to unlabeled items or change item labels, select the items you want to update and the label(s) you want to assign to them.

Requesting human labeling

You can use Google's human labeling service to label your images. Currently there is no charge for the service, which is limited to 5000 images per task. Turnaround time is on the order of days and depends on the number of images and the complexity of the labeling.

The requirements for human labeling are:

  • At least 100 unlabeled images in your dataset.
  • Between 2 and 20 labels defined.
  • Descriptions for each label.
  • At least 3 example images per label.

To request human labeling:

From the Label tab of the dataset import flow:

  1. Define labels. Enter the label names for this dataset. You may also need to add a None_of_the_above label for images that do not match any of the other labels. Providing images that don't match any of your labels and labeling them as None_of_the_above can improve the quality of your model.

  2. After you have created the labels, click each label name in the Define labels list and provide a label description and one or more example images. Images can be pulled from your dataset or uploaded from your local disk.

  3. Select the Use human labeling service option. Enter a task name and click Start. The system will begin processing your request.

Once human labeling is complete, a Todo tab is added to your AutoML Vision interface. From this tab, you can approve the labels or take any other required action.

Listing datasets

A project can include numerous datasets. This section describes how to retrieve a list of the available datasets for a project.

Web UI

To see a list of the available datasets using the AutoML Vision UI, click the Datasets link at the top of the left navigation menu.

Dataset list page

To see the datasets for a different project, select the project from the drop-down list in the upper right of the title bar.

Command-line

curl \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/datasets

You should see output similar to the following:

{
  "datasets": [
    {
      "name": "projects/434039606874/locations/us-central1/datasets/356587829854924648",
      "displayName": "test_dataset",
      "createTime": "2018-04-26T18:02:59.825060Z",
      "imageClassificationDatasetSpec": {
        "classificationType": "MULTICLASS"
      }
    },
    {
      "name": "projects/434039606874/locations/us-central1/datasets/3104518874390609379",
      "displayName": "test",
      "createTime": "2017-12-16T01:10:38.328280Z",
      "imageClassificationDatasetSpec": {
        "classificationType": "MULTICLASS"
      }
    }
  ]
}

Python

Before you can run this code example, you must install the Python Client Libraries.

def automl_list_datasets():
  """ List all datasets in the project. """
  # An empty filter string returns every dataset under the parent location.
  filter_ = ''
  response = client.list_datasets(parent, filter_)
  print("\nList of datasets:")
  for element in response:
    print(element)

You should see output similar to the following:

List of datasets:
name: "projects/434039606874/locations/us-central1/datasets/5518493835572039226"
display_name: "sample_dataset"
image_classification_dataset_spec {
  classification_type: MULTILABEL
}
create_time {
  seconds: 1525398440
  nanos: 185980000
}

name: "projects/434039606874/locations/us-central1/datasets/1089191110950722747"
display_name: "New_testing"
image_classification_dataset_spec {
  classification_type: MULTICLASS
}
examples_count: 2
create_time {
  seconds: 1525380871
  nanos: 174890000
}
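
If you plan to pass a listed dataset to other operations, keep its full resource name (element.name) or just the trailing ID. A small illustrative sketch, assuming the same client setup as in the earlier examples:

# Collect the full name and the numeric ID of each dataset in the project.
for element in client.list_datasets(parent, ''):
  dataset_full_id = element.name
  dataset_id = dataset_full_id.split('/')[-1]
  print(dataset_full_id, dataset_id)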

Deleting a dataset

Web UI

  1. In the AutoML Vision UI, click the Datasets link at the top of the left navigation menu to display the list of available datasets.

    Dataset list page

  2. Click the three-dot menu at the far right of the row you want to delete and select Delete dataset.

  3. Click Delete in the confirmation dialog box.

Command-line

  • Replace dataset-name with the full name of your dataset, from the response when you created the dataset. The full name has the format: projects/{project-id}/locations/us-central1/datasets/{dataset-id}

curl -X DELETE \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/dataset-name

You should see output similar to the following:

{
  "name": "projects/434039606874/locations/us-central1/operations/3512013641657611176",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata",
    "createTime": "2018-05-04T01:45:16.735340Z",
    "updateTime": "2018-05-04T01:45:16.735360Z",
    "cancellable": true
  }
}

Python

Before you can run this code example, you must install the Python Client Libraries.

  • dataset_id is the ID of the dataset: the final segment of the full dataset name, which has the format projects/{project-id}/locations/us-central1/datasets/{dataset-id}

def automl_delete_dataset(dataset_id):
  """ Delete a dataset. """
  # Build the full dataset name from the project ID, location, and dataset ID.
  name = client.dataset_path(project_id, 'us-central1', dataset_id)
  operation = client.delete_dataset(name)
  # Synchronous check of operation status; result() blocks until the
  # deletion completes.
  result = operation.result()
  print('\nDataset deleted: {}'.format(result))
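
A hypothetical call, deriving the dataset ID from the full dataset name saved when the dataset was created:

# The dataset ID is the last segment of the full dataset name.
dataset_id = dataset_full_id.split('/')[-1]
automl_delete_dataset(dataset_id)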
