Creating datasets

A dataset is the collection of data items you want the human labelers to label. It contains representative samples that you want to classify or analyze. A well-labeled dataset can be used to train a custom model.

The main steps for building a dataset are:

  1. Upload the data items to your Google Cloud Platform bucket.
  2. Create a comma-separated values (CSV) file that catalogs the data items, and upload it to the same Google Cloud Platform bucket.
  3. Create a dataset resource.
  4. Import the data items into the dataset resource.

A project can have multiple datasets, each used for a different AI Platform Data Labeling Service request. You can get a list of the available datasets and delete datasets you no longer need. For more information, see the datasets resource page.

Stage the unlabeled data

The first step in creating a dataset is to upload the data items to the Google Cloud Platform bucket for labeling. For instructions on creating a bucket, see Before you begin.

The Data Labeling Service supports labeling of three types of data: images, video, and text. Currently, only datasets in English are supported for labeling.

Create the input CSV file

In addition to the sample data items, you also need to create a comma-separated values (CSV) file that catalogs all of the data. The CSV file can have any filename, must be UTF-8 encoded, and must end with a .csv extension.

For image and video data, each row in the CSV file is the location (in your project's Google Cloud Storage bucket) of one image or video. For example:

gs://my_project_bucket/image1.png
gs://my_project_bucket/image2.png
gs://my_project_bucket/image3.png
gs://my_project_bucket/image4.png

For text data, each row in the CSV file is the storage location of a text file. For example:

gs://my_project_bucket/file1.txt
gs://my_project_bucket/file2.txt
gs://my_project_bucket/file3.txt
gs://my_project_bucket/file4.txt

Each data file should contain the data that you want to label. The content of each data file will be shown to labelers as one labeling question.
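The catalog format described above can also be produced programmatically. The following is a minimal sketch using only the Python standard library; the helper names build_catalog and write_catalog are hypothetical and not part of the Data Labeling Service, and the bucket and object names are the placeholder values from the examples above.

```python
import csv
import io


def build_catalog(bucket, object_names):
    """Return CSV text with one gs:// URI per row for the given objects."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for name in object_names:
        writer.writerow([f"gs://{bucket}/{name}"])
    return buffer.getvalue()


def write_catalog(path, bucket, object_names):
    """Write the catalog as a UTF-8 encoded file whose name ends in .csv."""
    if not path.endswith(".csv"):
        raise ValueError("catalog filename must end with .csv")
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(build_catalog(bucket, object_names))


# Placeholder bucket and object names, matching the examples above.
catalog_text = build_catalog("my_project_bucket", ["image1.png", "image2.png"])
print(catalog_text, end="")
```

The sketch only writes the file locally; uploading it to the bucket is a separate step.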

After you create the CSV file that catalogs the data items, upload it to the same Cloud Storage bucket as the data items.
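Because the CSV file and the data items it lists must share a bucket, it can help to check the catalog before uploading it. The following sketch is one possible check, not an official tool; the function names bucket_of and rows_outside_bucket are hypothetical, and the URIs are illustrative.

```python
def bucket_of(gcs_uri):
    """Return the bucket name of a gs://bucket/object URI."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {gcs_uri!r}")
    bucket, _, object_name = gcs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI must name a bucket and an object: {gcs_uri!r}")
    return bucket


def rows_outside_bucket(csv_uri, catalog_rows):
    """Return the catalog rows whose bucket differs from the CSV file's bucket."""
    expected = bucket_of(csv_uri)
    return [row for row in catalog_rows if bucket_of(row) != expected]


# Illustrative rows: the second one points at a different bucket.
rows = [
    "gs://my_project_bucket/image1.png",
    "gs://other_bucket/image2.png",
]
print(rows_outside_bucket("gs://my_project_bucket/catalog.csv", rows))
```

An empty list means every row lives in the same bucket as the CSV file.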

Create the dataset resource

The next step is to create a dataset resource that will eventually hold the data items. The newly created dataset is empty until you import data items into it at the next step.

Web UI

On the Data Labeling Service UI, you create a dataset and import items into it from the same page.

  1. Open the Data Labeling Service UI.

    The Datasets page shows the status of previously created datasets for the current project.

    To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.

  2. Click the Create button in the title bar.

  3. On the Add a dataset page, enter a name and description for the dataset.

  4. From the Dataset type drop-down list, choose the type of data items you're uploading into this dataset: images, video, or text.

  5. In the CSV file location box, enter the full path to the input CSV file.

    The CSV file must be in the same Google Cloud Storage bucket as the data items it lists.

  6. Click Create.

    You are returned to the Datasets page; your dataset shows an in progress status while your data items are being imported. This process should take approximately 10 minutes per 1000 items, but may take more or less time.

    If the service returns a 405 error, reduce the number of data items you are uploading at once. You need to refresh the page before trying again.
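One way to reduce the number of items uploaded at once is to split a large catalog into several smaller CSV files and import them separately. The sketch below illustrates this, along with a rough time estimate based on the 10 minutes per 1000 items figure above. The function names are hypothetical, and the batch size of 1000 is an arbitrary illustrative value, not a documented service limit.

```python
def split_into_batches(rows, batch_size):
    """Split catalog rows into consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def estimated_import_minutes(item_count, minutes_per_thousand=10):
    """Rough estimate only; the actual import may take more or less time."""
    return item_count * minutes_per_thousand / 1000


# Illustrative catalog of 2500 rows.
rows = [f"gs://my_project_bucket/image{i}.png" for i in range(1, 2501)]
batches = split_into_batches(rows, 1000)
print([len(batch) for batch in batches])    # [1000, 1000, 500]
print(estimated_import_minutes(len(rows)))  # 25.0
```

Each batch can then be written to its own CSV file and imported on its own.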

Command-line

The following example creates a dataset named test_dataset. The newly created dataset does not contain any data until you import items into it.

Save the "name" of the new dataset (from the response) for use with other operations, such as importing items into your dataset.

curl -X POST \
   -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
   -H "Content-Type: application/json" \
   https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets \
   -d '{
     "dataset": {
       "displayName": "test_dataset",
       "description": "dataset for curl commands testing"
     }
   }'

You should see output similar to the following:

{
  "name": "projects/data-labeling-codelab/datasets/5c897e1e_0000_2ab5_9159_94eb2c0b4daa",
  "displayName": "test_dataset",
  "description": "dataset for curl commands testing",
  "createTime": "2019-03-14T03:11:50.926475415Z"
}
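Later operations need the dataset ID, which is the last segment of the "name" field in the response. The following sketch shows one way to extract it from the JSON shown above; dataset_id_from_name is a hypothetical helper, and it assumes the name has the projects/{project_id}/datasets/{dataset_id} shape seen in the sample output.

```python
import json


def dataset_id_from_name(resource_name):
    """Return the final segment of a projects/*/datasets/* resource name."""
    parts = resource_name.split("/")
    if len(parts) != 4 or parts[0] != "projects" or parts[2] != "datasets":
        raise ValueError(f"unexpected dataset resource name: {resource_name!r}")
    return parts[3]


# Sample response copied from the output above.
response_text = """
{
  "name": "projects/data-labeling-codelab/datasets/5c897e1e_0000_2ab5_9159_94eb2c0b4daa",
  "displayName": "test_dataset",
  "description": "dataset for curl commands testing",
  "createTime": "2019-03-14T03:11:50.926475415Z"
}
"""
dataset_name = json.loads(response_text)["name"]
print(dataset_id_from_name(dataset_name))
```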

Python

Before you can run this code example, you must install the Python Client Libraries.

def create_dataset(project_id):
    """Creates a dataset for the given Google Cloud project."""
    from google.cloud import datalabeling_v1beta1 as datalabeling
    client = datalabeling.DataLabelingServiceClient()

    formatted_project_name = client.project_path(project_id)

    dataset = datalabeling.types.Dataset(
        display_name='YOUR_DATASET_SET_DISPLAY_NAME',
        description='YOUR_DESCRIPTION'
    )

    response = client.create_dataset(formatted_project_name, dataset)

    # The format of resource name:
    # projects/{project_id}/datasets/{dataset_id}
    print('The dataset resource name: {}'.format(response.name))
    print('Display name: {}'.format(response.display_name))
    print('Description: {}'.format(response.description))
    print('Create time:')
    print('\tseconds: {}'.format(response.create_time.seconds))
    print('\tnanos: {}\n'.format(response.create_time.nanos))

    return response

Java

Before you can run this code example, you must install the Java Client Libraries.

import com.google.cloud.datalabeling.v1beta1.CreateDatasetRequest;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceClient;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceSettings;
import com.google.cloud.datalabeling.v1beta1.Dataset;
import com.google.cloud.datalabeling.v1beta1.ProjectName;
import java.io.IOException;

class CreateDataset {

  // Create a dataset that is initially empty.
  static void createDataset(String projectId, String datasetName) throws IOException {
    // String projectId = "YOUR_PROJECT_ID";
    // String datasetName = "YOUR_DATASET_DISPLAY_NAME";


    DataLabelingServiceSettings settings = DataLabelingServiceSettings
        .newBuilder()
        .build();
    try (DataLabelingServiceClient dataLabelingServiceClient =
             DataLabelingServiceClient.create(settings)) {
      ProjectName projectName = ProjectName.of(projectId);

      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(datasetName)
              .setDescription("YOUR_DESCRIPTION")
              .build();

      CreateDatasetRequest createDatasetRequest =
          CreateDatasetRequest.newBuilder()
              .setParent(projectName.toString())
              .setDataset(dataset)
              .build();

      Dataset createdDataset = dataLabelingServiceClient.createDataset(createDatasetRequest);

      System.out.format("Name: %s\n", createdDataset.getName());
      System.out.format("DisplayName: %s\n", createdDataset.getDisplayName());
      System.out.format("Description: %s\n", createdDataset.getDescription());
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Import the data items into the dataset

After you have created a dataset, you can import the data items into it using the CSV file.

Web UI

On the Data Labeling Service UI, you can skip this step because the import was already done in the previous step.

Command-line

  • Replace DATASET_ID with the ID of your dataset, from the response when you created the dataset. The ID appears at the end of the full dataset name: projects/{project-id}/datasets/{dataset-id}

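  • Replace CSV_FILE with the full path to the input CSV file. Because the request body is enclosed in single quotes, the shell does not expand ${CSV_FILE}; substitute the path directly in the body.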

    curl -X POST \
       -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
       -H "Content-Type: application/json" \
       https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}:importData \
       -d '{
         "inputConfig": {
           "dataType": "IMAGE",
           "gcsSource": {
             "inputUri": "${CSV_FILE}",
             "mimeType": "text/csv"
           }
         }
       }'
    

    You should see output similar to the following. You can use the operation ID to get the status of the task; see Getting the status of an operation for an example.

    {
      "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.data-labeling.v1beta1.ImportDataOperationMetadata",
        "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c"
      }
    }
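The operation ID is the last segment of the "name" field in this response. The sketch below pulls it out of the sample output, along with the target dataset. The helper summarize_import_response is hypothetical, and the status URL it builds is an assumption that the operation can be fetched at the API root followed by the operation name; check the API reference before relying on it.

```python
import json

# API root taken from the curl examples in this section.
API_ROOT = "https://datalabeling.googleapis.com/v1beta1"


def summarize_import_response(response_text):
    """Extract the operation name, operation ID, and target dataset."""
    response = json.loads(response_text)
    operation_name = response["name"]
    return {
        "operation_name": operation_name,
        "operation_id": operation_name.rsplit("/", 1)[-1],
        "dataset": response.get("metadata", {}).get("dataset"),
        # Assumption: status endpoint is the API root plus the operation name.
        "status_url": f"{API_ROOT}/{operation_name}",
    }


# Sample response copied from the output above.
sample = """
{
  "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.data-labeling.v1beta1.ImportDataOperationMetadata",
    "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c"
  }
}
"""
summary = summarize_import_response(sample)
print(summary["operation_id"])
print(summary["status_url"])
```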
    

Python

Before you can run this code example, you must install the Python Client Libraries.

def import_data(dataset_resource_name, data_type, input_gcs_uri):
    """Imports data to the given Google Cloud project and dataset."""
    from google.cloud import datalabeling_v1beta1 as datalabeling
    client = datalabeling.DataLabelingServiceClient()

    gcs_source = datalabeling.types.GcsSource(
        input_uri=input_gcs_uri, mime_type='text/csv')

    csv_input_config = datalabeling.types.InputConfig(
        data_type=data_type, gcs_source=gcs_source)

    response = client.import_data(dataset_resource_name, csv_input_config)

    result = response.result()

    # The format of resource name:
    # projects/{project_id}/datasets/{dataset_id}
    print('Dataset resource name: {}\n'.format(result.dataset))

    return result

Java

Before you can run this code example, you must install the Java Client Libraries.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceClient;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceSettings;
import com.google.cloud.datalabeling.v1beta1.DataType;
import com.google.cloud.datalabeling.v1beta1.GcsSource;
import com.google.cloud.datalabeling.v1beta1.ImportDataOperationMetadata;
import com.google.cloud.datalabeling.v1beta1.ImportDataOperationResponse;
import com.google.cloud.datalabeling.v1beta1.ImportDataRequest;
import com.google.cloud.datalabeling.v1beta1.InputConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class ImportData {

  // Import data to an existing dataset.
  static void importData(String datasetName, String gcsSourceUri) throws IOException {
    // String datasetName = DataLabelingServiceClient.formatDatasetName(
    //     "YOUR_PROJECT_ID", "YOUR_DATASETS_UUID");
    // String gcsSourceUri = "gs://YOUR_BUCKET_ID/path_to_data";


    DataLabelingServiceSettings settings = DataLabelingServiceSettings
        .newBuilder()
        .build();
    try (DataLabelingServiceClient dataLabelingServiceClient =
             DataLabelingServiceClient.create(settings)) {
      GcsSource gcsSource = GcsSource.newBuilder()
          .setInputUri(gcsSourceUri)
          .setMimeType("text/csv")
          .build();

      InputConfig inputConfig = InputConfig.newBuilder()
          .setDataType(DataType.IMAGE) // DataTypes: AUDIO, IMAGE, VIDEO, TEXT
          .setGcsSource(gcsSource)
          .build();

      ImportDataRequest importDataRequest = ImportDataRequest.newBuilder()
          .setName(datasetName)
          .setInputConfig(inputConfig)
          .build();

      OperationFuture<ImportDataOperationResponse, ImportDataOperationMetadata> operation =
          dataLabelingServiceClient.importDataAsync(importDataRequest);

      ImportDataOperationResponse response = operation.get();

      System.out.format("Imported items: %d\n", response.getImportCount());
    } catch (IOException | InterruptedException | ExecutionException e) {
      e.printStackTrace();
    }
  }
}

View the data items in the dataset

Follow these steps to view the data items in an imported dataset:

  1. Open the Data Labeling Service UI.

    The Datasets page shows the Data Labeling Service datasets for the current project.

  2. In the list of datasets, click the name of the dataset whose items you want to view.

  3. Use the Details tab of the Dataset detail page to view the individual data items included in the dataset.