Creating datasets

A dataset is the collection of data items you want the human labelers to label. It contains representative samples that you want to classify or analyze. A well-labeled dataset can be used to train a custom model.

The main steps for building a dataset are:

  1. Upload the data items to a Cloud Storage bucket.
  2. Create a comma-separated values (CSV) file that catalogs the data items, and upload it to the same Cloud Storage bucket.
  3. Create a dataset resource.
  4. Import the data items into the dataset resource.

A project can have multiple datasets, each used for a different AI Platform Data Labeling Service request. You can get a list of the available datasets and delete datasets you no longer need. For more information, see the datasets resource page.

Stage the unlabeled data

The first step in creating a dataset is to upload the data items into a Cloud Storage bucket for labeling. For information on creating a bucket, see Before you begin.

Data Labeling Service supports labeling of three types of data: images, video, and text. The sections below give details about providing quality data items for each type. Currently, only datasets in English are supported for labeling.

Images

Images must use a supported file type:

  • JPEG
  • PNG

The maximum file size is 30MB for all image labeling cases except for image segmentation. The maximum file size is 10MB for image segmentation labeling.

The maximum dimensions of an image are 1920x1080 pixels.
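The file-type and size limits above can be checked locally before you upload anything. The following is a minimal sketch that uses only the Python standard library. The function name check_image_file and the is_segmentation flag are hypothetical, the file extension is used as a stand-in for the real file type, and reading "MB" as 1024 * 1024 bytes is an assumption. It does not check image dimensions.

```python
import os

# Limits taken from the text above. Interpreting "MB" as 1024 * 1024 bytes
# is an assumption.
MAX_IMAGE_BYTES = 30 * 1024 * 1024
MAX_SEGMENTATION_BYTES = 10 * 1024 * 1024

# Assumed file extensions for the two supported types (JPEG and PNG).
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def check_image_file(filename, size_bytes, is_segmentation=False):
    """Returns a list of problems found; an empty list means none were found."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    limit = MAX_SEGMENTATION_BYTES if is_segmentation else MAX_IMAGE_BYTES
    if size_bytes > limit:
        problems.append(f"file is {size_bytes} bytes; the limit is {limit}")
    return problems
```

For a local file, you could pass os.path.getsize(path) as size_bytes.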

The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution images (such as from a security camera), your training data should be composed of blurry, low-resolution images. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training images.

Training a model works best when there are at most 100 times more images for the most common label than for the least common label. We recommend removing very low frequency labels.
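The 100-to-1 guideline above is simple to check if you already know the intended label for each item. The following sketch is illustrative only: the function name label_imbalance and the max_ratio parameter are hypothetical, and it assumes one label per item.

```python
from collections import Counter


def label_imbalance(labels, max_ratio=100):
    """Returns (ratio, rare_labels) for a list holding one label per item.

    ratio is the count of the most common label divided by the count of the
    least common label. rare_labels lists the labels that the most common
    label outnumbers by more than max_ratio.
    """
    counts = Counter(labels)
    if not counts:
        return 0.0, []
    most_common = max(counts.values())
    least_common = min(counts.values())
    ratio = most_common / least_common
    rare_labels = sorted(
        label for label, count in counts.items() if most_common > max_ratio * count
    )
    return ratio, rare_labels
```

Labels returned in rare_labels are candidates for removal, or for collecting more examples.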

Video

Videos must be in MP4 format, encoded with the H.264, H.265, or MPEG4 codec. The maximum video size is 2GB.

The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.

We recommend about 1000 training videos per label. The minimum per label is 10, or 50 for advanced models. In general, it takes more examples per label to train models with multiple labels per video, and the resulting scores are harder to interpret.
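As a rough illustration of the per-label minimums above, the following sketch lists the labels that fall short. The function name labels_below_minimum and the advanced flag are hypothetical, and one label per video is assumed.

```python
from collections import Counter

# Figures taken from the text above.
MINIMUM_PER_LABEL = 10
MINIMUM_PER_LABEL_ADVANCED = 50


def labels_below_minimum(labels, advanced=False):
    """Returns {label: count} for labels with fewer videos than the minimum."""
    minimum = MINIMUM_PER_LABEL_ADVANCED if advanced else MINIMUM_PER_LABEL
    counts = Counter(labels)
    return {label: count for label, count in counts.items() if count < minimum}
```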

The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing very low frequency labels.

Text

Text files must use the UTF-8 text file encoding format.

Each document must be a separate text file. You cannot provide multiple documents in one text file; for example, you cannot treat each row of a text file as its own document.

The maximum number of characters per text file is 100,000.
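The encoding and length requirements above can be verified before upload. This is a minimal sketch under two assumptions: the function name check_text_document is hypothetical, and "characters" is taken to mean Unicode code points, which is how Python counts the length of a decoded string.

```python
MAX_CHARACTERS = 100_000  # Limit taken from the text above.


def check_text_document(raw_bytes):
    """Returns a list of problems found in the raw bytes of one document."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError as error:
        return [f"not valid UTF-8: {error.reason}"]
    problems = []
    if len(text) > MAX_CHARACTERS:
        problems.append(
            f"document has {len(text)} characters; the limit is {MAX_CHARACTERS}"
        )
    return problems
```

To check a file on disk, read it in binary mode (open(path, "rb")) and pass the bytes to the function.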

Try to make your training data as varied as the data on which predictions will be made. Datasets need to contain different lengths of documents, documents authored by different people, documents that use different wording or style, and so on.

We recommend providing at least 1000 training documents per label. The minimum number of documents per label is 10. However, you can improve the confidence scores from your model by using more examples per label. Better confidence scores are especially helpful when your model returns multiple labels to classify a document.

The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low frequency labels.

Create the input CSV file

In addition to the sample data items, you also need to create a comma-separated values (CSV) file that catalogs all of the data. The CSV file can have any filename, must be UTF-8 encoded, and must end with a .csv extension.

For image and video data, each row in the CSV file is the location (in your project's Google Cloud Storage bucket) of one image or video. For example:

gs://my_project_bucket/image1.png
gs://my_project_bucket/image2.png
gs://my_project_bucket/image3.png
gs://my_project_bucket/image4.png

For text data, each row in the CSV file is the storage location of a text file. For example:

gs://my_project_bucket/file1.txt
gs://my_project_bucket/file2.txt
gs://my_project_bucket/file3.txt
gs://my_project_bucket/file4.txt

Each data file should contain the data that you want to label. The content of each data file will be shown to labelers as one labeling question.

After you create the CSV file that catalogs the data items, upload it to the same Cloud Storage bucket as the data items.
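For more than a handful of items, the CSV file is easier to generate than to write by hand. The following sketch builds the file contents from a list of Cloud Storage URIs. The function name build_catalog_csv is hypothetical, and the single-bucket check is an assumption derived from the requirement that the CSV file and the data items share one bucket.

```python
import csv
import io


def build_catalog_csv(uris):
    """Returns CSV text with one Cloud Storage URI per row."""
    uris = list(uris)
    buckets = set()
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URI: {uri}")
        # The bucket name is the first path segment after the gs:// prefix.
        buckets.add(uri[len("gs://"):].split("/", 1)[0])
    if len(buckets) > 1:
        raise ValueError(f"items span several buckets: {sorted(buckets)}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for uri in uris:
        writer.writerow([uri])
    return buffer.getvalue()
```

Write the returned text to a UTF-8 encoded file with a .csv extension before you upload it.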

Create the dataset resource

The next step is to create a dataset resource that will eventually hold the data items. The newly created dataset is empty until you import data items into it at the next step.

Web UI

In the Data Labeling Service UI, you create a dataset and import items into it from the same page.

  1. Open the Data Labeling Service UI.

    The Datasets page shows the status of previously created datasets for the current project.

    To add a dataset for a different project, select the project from the drop-down list in the upper right of the title bar.

  2. Click the Create button in the title bar.

  3. On the Add a dataset page, enter a name and description for the dataset.

  4. From the Dataset type drop-down list, choose the type of data items you're uploading into this dataset: images, video, or text.

  5. In the CSV file location box, enter the full path to the input CSV file.

    The CSV file must be in the same Google Cloud Storage bucket as the data items it lists.

  6. Click Create.

    You are returned to the Datasets page; your dataset will show an in progress status while your documents are being imported. This process should take approximately 10 minutes per 1000 items, but may take more or less time.

    If the service returns a 405 error, reduce the number of documents you are uploading at once. You need to refresh the page before trying again.
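The approximate import rate given in step 6 (about 10 minutes per 1000 items) gives a rough idea of how long to wait. The helper below is a back-of-the-envelope sketch with a hypothetical name; actual import times may be longer or shorter.

```python
def estimate_import_minutes(item_count, minutes_per_thousand=10):
    """Returns a rough import-time estimate in minutes."""
    return item_count * minutes_per_thousand / 1000
```

For example, estimate_import_minutes(2500) returns 25.0.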

Command-line

The following example creates a dataset named test_dataset. The newly created dataset does not contain any data until you import items into it.

Save the "name" of the new dataset (from the response) for use with other operations, such as importing items into your dataset.

curl -X POST \
   -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
   -H "Content-Type: application/json" \
   https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets \
   -d '{
     "dataset": {
       "displayName": "test_dataset",
       "description": "dataset for curl commands testing"
     }
   }'

You should see output similar to the following:

{
  "name": "projects/data-labeling-codelab/datasets/5c897e1e_0000_2ab5_9159_94eb2c0b4daa",
  "displayName": "test_dataset",
  "description": "dataset for curl commands testing",
  "createTime": "2019-03-14T03:11:50.926475415Z"
}
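Later requests need the dataset ID, which is the last segment of the "name" value in the response. The following sketch extracts it. The function name dataset_id_from_name is hypothetical, and the four-segment name layout is assumed from the sample response above.

```python
def dataset_id_from_name(name):
    """Extracts the dataset ID from a name such as projects/P/datasets/D."""
    parts = name.split("/")
    if len(parts) != 4 or parts[0] != "projects" or parts[2] != "datasets":
        raise ValueError(f"unexpected dataset name: {name}")
    return parts[3]
```

With the sample response above, the function returns 5c897e1e_0000_2ab5_9159_94eb2c0b4daa.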

Python

Before you can run this code example, you must install the Python Client Libraries.

def create_dataset(project_id):
    """Creates a dataset for the given Google Cloud project."""
    from google.cloud import datalabeling_v1beta1 as datalabeling

    client = datalabeling.DataLabelingServiceClient()

    formatted_project_name = f"projects/{project_id}"

    dataset = datalabeling.Dataset(
        display_name="YOUR_DATASET_SET_DISPLAY_NAME", description="YOUR_DESCRIPTION"
    )

    response = client.create_dataset(
        request={"parent": formatted_project_name, "dataset": dataset}
    )

    # The format of resource name:
    # projects/{project_id}/datasets/{dataset_id}
    print("The dataset resource name: {}".format(response.name))
    print("Display name: {}".format(response.display_name))
    print("Description: {}".format(response.description))
    print("Create time:")
    print("\tseconds: {}".format(response.create_time.timestamp_pb().seconds))
    print("\tnanos: {}\n".format(response.create_time.timestamp_pb().nanos))

    return response

Java

Before you can run this code example, you must install the Java Client Libraries.

import com.google.cloud.datalabeling.v1beta1.CreateDatasetRequest;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceClient;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceSettings;
import com.google.cloud.datalabeling.v1beta1.Dataset;
import com.google.cloud.datalabeling.v1beta1.ProjectName;
import java.io.IOException;

class CreateDataset {

  // Create a dataset that is initially empty.
  static void createDataset(String projectId, String datasetName) throws IOException {
    // String projectId = "YOUR_PROJECT_ID";
    // String datasetName = "YOUR_DATASET_DISPLAY_NAME";


    DataLabelingServiceSettings settings =
        DataLabelingServiceSettings.newBuilder()
            .build();
    try (DataLabelingServiceClient dataLabelingServiceClient =
        DataLabelingServiceClient.create(settings)) {
      ProjectName projectName = ProjectName.of(projectId);

      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(datasetName)
              .setDescription("YOUR_DESCRIPTION")
              .build();

      CreateDatasetRequest createDatasetRequest =
          CreateDatasetRequest.newBuilder()
              .setParent(projectName.toString())
              .setDataset(dataset)
              .build();

      Dataset createdDataset = dataLabelingServiceClient.createDataset(createDatasetRequest);

      System.out.format("Name: %s\n", createdDataset.getName());
      System.out.format("DisplayName: %s\n", createdDataset.getDisplayName());
      System.out.format("Description: %s\n", createdDataset.getDescription());
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Import the data items into the dataset

After you have created a dataset, you can import the data items into it using the CSV file.

Web UI

In the Data Labeling Service UI, you can skip this step because the import was already done in the previous step.

Command-line

  • Replace DATASET_ID with the ID of your dataset, from the response when you created the dataset. The ID appears at the end of the full dataset name: projects/{project-id}/datasets/{dataset-id}

  • Replace CSV_FILE with the full path to the input CSV file.

    curl -X POST \
       -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
       -H "Content-Type: application/json" \
       https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}:importData \
       -d '{
         "inputConfig": {
           "dataType": "IMAGE",
           "gcsSource": {
             "inputUri": "'"${CSV_FILE}"'",
             "mimeType": "text/csv"
           }
         }
       }'
    

    You should see output similar to the following. You can use the operation ID to get the status of the task; for an example, see Getting the status of an operation.

    {
      "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.data-labeling.v1beta1.ImportDataOperationMetadata",
        "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c"
      }
    }
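The request body for the importData call can also be generated rather than typed by hand, which rules out JSON syntax slips such as trailing commas. The following sketch mirrors the fields used in the curl command above. The function name build_import_request_body is hypothetical, and the function does not validate the data type.

```python
import json


def build_import_request_body(csv_uri, data_type="IMAGE"):
    """Returns the JSON request body for an importData call as a string."""
    body = {
        "inputConfig": {
            "dataType": data_type,
            "gcsSource": {
                "inputUri": csv_uri,
                "mimeType": "text/csv",
            },
        }
    }
    return json.dumps(body, indent=2)
```

You could save the returned string to a file and pass it to curl with -d @FILENAME.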
    

Python

Before you can run this code example, you must install the Python Client Libraries.

def import_data(dataset_resource_name, data_type, input_gcs_uri):
    """Imports data to the given Google Cloud project and dataset."""
    from google.cloud import datalabeling_v1beta1 as datalabeling

    client = datalabeling.DataLabelingServiceClient()

    gcs_source = datalabeling.GcsSource(input_uri=input_gcs_uri, mime_type="text/csv")

    csv_input_config = datalabeling.InputConfig(
        data_type=data_type, gcs_source=gcs_source
    )

    response = client.import_data(
        request={"name": dataset_resource_name, "input_config": csv_input_config}
    )

    result = response.result()

    # The format of resource name:
    # projects/{project_id}/datasets/{dataset_id}
    print("Dataset resource name: {}\n".format(result.dataset))

    return result

Java

Before you can run this code example, you must install the Java Client Libraries.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceClient;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceSettings;
import com.google.cloud.datalabeling.v1beta1.DataType;
import com.google.cloud.datalabeling.v1beta1.GcsSource;
import com.google.cloud.datalabeling.v1beta1.ImportDataOperationMetadata;
import com.google.cloud.datalabeling.v1beta1.ImportDataOperationResponse;
import com.google.cloud.datalabeling.v1beta1.ImportDataRequest;
import com.google.cloud.datalabeling.v1beta1.InputConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class ImportData {

  // Import data to an existing dataset.
  static void importData(String datasetName, String gcsSourceUri) throws IOException {
    // String datasetName = DataLabelingServiceClient.formatDatasetName(
    //     "YOUR_PROJECT_ID", "YOUR_DATASETS_UUID");
    // String gcsSourceUri = "gs://YOUR_BUCKET_ID/path_to_data";


    DataLabelingServiceSettings settings =
        DataLabelingServiceSettings.newBuilder()
            .build();
    try (DataLabelingServiceClient dataLabelingServiceClient =
        DataLabelingServiceClient.create(settings)) {
      GcsSource gcsSource =
          GcsSource.newBuilder().setInputUri(gcsSourceUri).setMimeType("text/csv").build();

      InputConfig inputConfig =
          InputConfig.newBuilder()
              .setDataType(DataType.IMAGE) // DataTypes: AUDIO, IMAGE, VIDEO, TEXT
              .setGcsSource(gcsSource)
              .build();

      ImportDataRequest importDataRequest =
          ImportDataRequest.newBuilder().setName(datasetName).setInputConfig(inputConfig).build();

      OperationFuture<ImportDataOperationResponse, ImportDataOperationMetadata> operation =
          dataLabelingServiceClient.importDataAsync(importDataRequest);

      ImportDataOperationResponse response = operation.get();

      System.out.format("Imported items: %d\n", response.getImportCount());
    } catch (IOException | InterruptedException | ExecutionException e) {
      e.printStackTrace();
    }
  }
}

View the data items in the dataset

Follow these steps to view the data items in an imported dataset:

  1. Open the Data Labeling Service UI.

    The Datasets page shows Data Labeling Service datasets for the current project.

  2. In the list of datasets, click the name of the dataset whose items you want to view.

  3. Use the Details tab of the Dataset detail page to view the individual data items included in the dataset.