Requesting data labeling

The quality of your training data strongly impacts the effectiveness of the model you create, and by extension, the quality of the predictions returned from that model. The key to high quality training data is ensuring that you have training items that accurately represent the domain you want to make predictions about and that the training items are accurately labeled.

There are three ways to assign labels to your training data items:

  • Add the data items to your dataset with their labels already assigned, for example using a commercially available dataset
  • Assign labels to the data items using the console
  • Request to have human labelers add labels to the data items

Vertex AI data labeling tasks let you work with human labelers to generate highly accurate labels for a collection of data that you can use to train your machine learning models.

To request data labeling by human labelers, you create a data labeling job that provides the human labelers with:

  • The dataset containing the representative data items to label
  • A list of all possible labels to apply to the data items
  • A PDF file containing instructions guiding the human labelers through labeling tasks

Using these resources, the human labelers annotate the items in the dataset according to your instructions. When they are done, you can use the annotation set to train a Vertex AI model or export the labeled data items to use in another machine learning environment.

Create a dataset

You provide the human labelers with the data items to label by creating a dataset and importing data items into it. The data items need not be labeled. The data type (image, video, or text) and objective (for example, classification or object tracking) determine the type of annotations the human labelers apply to the data items.

Provide labels

When you create a data labeling task, you list the set of labels you want the human labelers to use to label your data items. For example, if you want to classify images based on whether they contain a dog or a cat, you create a label set with two labels: "Dog" and "Cat". (Actually, as noted below, you might also want labels for "Neither" and "Both".)

Here are some guidelines for creating a high-quality label set.

  • Make each label's display name a meaningful word, such as "dog", "cat", or "building". Do not use abstract names like "label1" and "label2" or unfamiliar acronyms. The more meaningful the label names, the easier it is for human labelers to apply them accurately and consistently.
  • Make sure the labels are easily distinguishable from one another. For classification tasks where a single label is applied to each data item, try not to use labels whose meanings overlap. For example, don't have labels for "Sports" and "Baseball".
  • For classification tasks, it is usually a good idea to include a label named "other" or "none", to use for data that don't match the other labels. If the only available labels are "dog" and "cat", for example, labelers will have to label every image with one of those labels. Your custom model is typically more robust if you include images other than dogs or cats in its training data.
  • Keep in mind that labelers are most efficient and accurate when you have at most 20 labels defined in the label set. You can include up to 100 labels.
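
The guidelines above can be turned into a quick pre-flight check before you submit a label set. The following is a minimal sketch, not part of Vertex AI: the function name, the pattern used to detect abstract names, and the list of catch-all names are all assumptions made for illustration. Only the limits of 20 and 100 labels come from the guidelines.

```python
import re

MAX_LABELS = 100          # upper limit stated in the guidelines
RECOMMENDED_LABELS = 20   # labelers work best at or below this count

# Hypothetical pattern for placeholder names such as "label1" or "class_2".
ABSTRACT_NAME = re.compile(r"^(label|class|category)[ _-]?\d+$", re.IGNORECASE)
# Hypothetical set of names that count as a catch-all label.
CATCH_ALL_NAMES = {"other", "none", "neither"}


def check_label_set(labels):
    """Return a list of warnings for a proposed label set (empty if none)."""
    warnings = []
    normalized = [label.strip().lower() for label in labels]

    if len(labels) > MAX_LABELS:
        warnings.append(f"{len(labels)} labels exceeds the limit of {MAX_LABELS}")
    elif len(labels) > RECOMMENDED_LABELS:
        warnings.append(
            f"{len(labels)} labels is more than the recommended {RECOMMENDED_LABELS}"
        )

    # Labels that differ only by case or surrounding spaces are easy to confuse.
    duplicates = sorted({name for name in normalized if normalized.count(name) > 1})
    if duplicates:
        warnings.append(f"duplicate labels: {', '.join(duplicates)}")

    abstract = [label for label in labels if ABSTRACT_NAME.match(label.strip())]
    if abstract:
        warnings.append(f"abstract label names: {', '.join(abstract)}")

    if not CATCH_ALL_NAMES & set(normalized):
        warnings.append('no catch-all label such as "other" or "none"')

    return warnings


print(check_label_set(["dog", "cat", "other"]))  # prints []
print(check_label_set(["label1", "label2"]))
```

A check like this cannot tell whether two labels overlap in meaning (for example "Sports" and "Baseball"); that still needs a human review.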

Create instructions

Instructions give the human labelers information about how to apply labels to your data. The instructions should contain sample labeled data and other explicit directions.

Instructions are PDF files. PDF instructions can provide sophisticated directions such as positive and negative examples or descriptions for each case. PDF is also a convenient format for providing instructions for complicated tasks such as image bounding boxes or video object tracking.

Write the instructions, create a PDF file, and save the PDF file in your Google Cloud Storage bucket.
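
Before you reference the instructions in a labeling task, it can help to confirm that the path has the expected shape. The sketch below is one possible check, written with the standard library only. The function name and the example bucket are hypothetical, and the `.pdf` extension test is an assumption for illustration (Cloud Storage itself does not require a file extension).

```python
def parse_instruction_uri(uri):
    """Split a gs://bucket/path/file.pdf URI into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("instruction URI must start with gs://")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError("instruction URI must name a bucket and an object")
    # Assumption: the instructions file is saved with a .pdf extension.
    if not object_name.lower().endswith(".pdf"):
        raise ValueError("instructions must be a PDF file")
    return bucket, object_name


# "my-bucket" is a placeholder bucket name.
print(parse_instruction_uri("gs://my-bucket/labeling/instructions.pdf"))
```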

Designing good instructions

Good instructions are the most important factor in getting good human labeling results. Since you know your use case best, you need to let the human labelers know what you want them to do. Here are some guidelines for creating good instructions:

  • The human labelers do not have your domain knowledge. The distinctions you ask labelers to make should be easy to understand for someone unfamiliar with your use case.

  • Avoid making the instructions too long. It is best if a labeler can review and understand them within 20 minutes.

  • Instructions should describe the concept of the task as well as details about how to label the data. For example, for a bounding box task, describe how you want labelers to draw the bounding box. Should it be a tight box or a loose box? If there are multiple instances of the object, should they draw one big bounding box or multiple smaller boxes?

  • If your instructions have a corresponding label set, they should cover all labels in that set. The label name in the instructions should match the name in the label set.

  • It often takes several iterations to create good instructions. We recommend having the human labelers work on a small dataset first, then adjust your instructions based on how well the labelers' work matches your expectations.

A good instructions file should include the following sections:

  • Label list and description: list all the labels you would like to use and describe the meaning of each label.
  • Examples: For each label, give at least three positive examples and one negative example. These examples should cover different cases.
  • Cover edge cases. Clarify as many edge cases as you can. This reduces the need for the labeler to interpret the label. For example, if you need to draw a bounding box for a person, it is better to clarify:
    • Do you need a box for each person if there are multiple people?
    • Do you need a box if a person is occluded?
    • Do you need a box for a person who is partially shown in the image?
    • Do you need a box for a person in a picture or painting?
  • Describe how to add annotations. For example:
    • For a bounding box, do you need a tight box or loose box?
    • For text entity extraction, where should the entity of interest start and end?
  • Clarification on labels. If two labels are similar or easy to mix up, give examples to clarify the difference.
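
If you draft the instructions from a structured outline before producing the PDF, you can check the outline against the label set. The sketch below assumes a hypothetical outline format (a dictionary keyed by label name with `description`, `positive`, and `negative` entries); it is not a Vertex AI format. The thresholds of three positive examples and one negative example come from the list above.

```python
def check_instruction_outline(label_set, outline):
    """Return a list of problems found when comparing an outline to a label set."""
    problems = []
    for label in label_set:
        entry = outline.get(label)
        if entry is None:
            # Label names in the instructions should match the label set exactly.
            problems.append(f"{label}: not covered in the instructions")
            continue
        if not entry.get("description"):
            problems.append(f"{label}: missing description")
        if len(entry.get("positive", [])) < 3:
            problems.append(f"{label}: fewer than 3 positive examples")
        if len(entry.get("negative", [])) < 1:
            problems.append(f"{label}: no negative example")
    for label in outline:
        if label not in label_set:
            problems.append(f"{label}: in the instructions but not in the label set")
    return problems


# Hypothetical outline with placeholder image file names.
outline = {
    "dog": {
        "description": "Any domestic dog, fully or partially visible.",
        "positive": ["dog1.jpg", "dog2.jpg", "dog3.jpg"],
        "negative": ["wolf.jpg"],
    },
    "Cat": {"description": "Any domestic cat.", "positive": ["cat1.jpg"], "negative": []},
}
for problem in check_instruction_outline(["dog", "cat"], outline):
    print(problem)
```

In this example the outline uses "Cat" while the label set uses "cat", so the check reports both a missing label and an unexpected one.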

The examples below show what the PDF instructions may include. Labelers will review the instructions before they start the task.

[Image: PDF instructions, example 1]

[Image: PDF instructions, example 2]

Create a data labeling task

Web UI

You can request data labeling from two places in the Google Cloud Console:

  • From the dataset detail screen, click Create Labeling Task.
  • From the Labeling tasks list screen, click Create.

The New labeling task pane opens.

  1. Enter a name for the labeling task.

  2. Select the dataset whose items you want to have labeled.

    If you opened the New labeling task pane from the dataset detail screen, you cannot select a different dataset.

  3. Confirm that the objective is correct.

    The Objective box shows the objective for the selected dataset, as determined by its default annotation set. To change the objective, choose a different annotation set.

  4. Choose the annotation set to use for the labeled data.

    The labels applied by the human labelers will be saved to the selected annotation set. You can choose an existing annotation set or create a new one. If you create a new one, you need to provide a name for it.

  5. Specify whether to use active learning.

    Active learning expedites the labeling process by having a human labeler label part of your dataset, then applying machine learning to automatically label the rest.

    NOTE: Active learning is available only for Image classification (Single-label) and Image bounding box objectives.

  6. Click Continue.

  7. Enter the labels for the human labelers to apply, and click Continue when done.

    See Designing a label set for guidelines about creating a high-quality set of labels.

  8. Enter the path to the instructions for the human labelers, and click Continue.

    The instructions must be a PDF file stored in a Google Cloud Storage bucket. See Designing instructions for human labelers for guidelines about creating high-quality instructions.

  9. Choose whether to use Google-managed labelers or Provide your own labelers.

    To provide your own labelers, you need to create labeler groups and manage their activities using the DataCompute Console.

  10. Specify how many labelers you want to review each item.

    By default, one human labeler annotates each data item. However, you can request to have multiple labelers annotate and review each item. Select the number of labelers from the Specialists per data item box.

  11. If you chose to use Google-managed labelers, click the check box to confirm that you have read the pricing guide to understand the cost of labeling.

  12. If you are providing your own labelers, choose the labeler group to use for this labeling task.

    Choose an existing labeler group from the drop-down list, or choose New labeler group and enter a group name and a comma-separated list of email addresses for the group's managers in the text boxes below the drop-down list. Click the check box to grant the specified managers permission to see your data labeling information.

  13. Click Start Task.

    If Start Task is unavailable, review the pages within the New labeling task pane to verify that you've entered all the required information.

You can review the progress of the data labeling task by selecting Labeling tasks in the console. The page shows the status of each requested labeling task. When the Progress column shows 100%, the corresponding dataset is labeled and ready for training a model.

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • PROJECT_ID: Your project ID
  • DISPLAY_NAME: Name for the data labeling job
  • DATASET_ID: ID of the dataset containing the items to label
  • LABELERS: The number of human labelers you want to have review each data item; valid values are 1, 3, and 5.
  • INSTRUCTIONS: The path to the PDF file containing instructions for the human labelers; the file must be in a Google Cloud Storage bucket accessible from your project
  • INPUT_SCHEMA_URI: Path to the schema file for the data item type:
    • Image classification single label:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml
    • Image classification multi-label:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_multi_label_io_format_1.0.0.yaml
    • Image object detection:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/image_bounding_box_io_format_1.0.0.yaml
    • Text classification single-label:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml
    • Text classification multi-label:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml
    • Text entity extraction:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml
    • Text sentiment analysis:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/text_sentiment_io_format_1.0.0.yaml
    • Video classification:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml
    • Video object tracking:
      gs://google-cloud-aiplatform/schema/dataset/ioformat/video_object_tracking_io_format_1.0.0.yaml
  • LABEL_LIST: A comma-separated list of strings, enumerating the labels available to apply to a data item
  • ANNOTATION_SET: The name of the annotation set for the labeled data
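
If you create labeling jobs from a script, a small lookup table can keep the schema paths listed above in one place. This is a sketch only: the dictionary keys and the function name are hypothetical shorthand chosen for this example, while the file names are copied from the list above.

```python
SCHEMA_BASE = "gs://google-cloud-aiplatform/schema/dataset/ioformat/"

# Keys are hypothetical shorthand; values are the schema files listed above.
SCHEMA_FILES = {
    "image_classification_single_label": "image_classification_single_label_io_format_1.0.0.yaml",
    "image_classification_multi_label": "image_classification_multi_label_io_format_1.0.0.yaml",
    "image_object_detection": "image_bounding_box_io_format_1.0.0.yaml",
    "text_classification_single_label": "text_classification_single_label_io_format_1.0.0.yaml",
    "text_classification_multi_label": "text_classification_multi_label_io_format_1.0.0.yaml",
    "text_entity_extraction": "text_extraction_io_format_1.0.0.yaml",
    "text_sentiment_analysis": "text_sentiment_io_format_1.0.0.yaml",
    "video_classification": "video_classification_io_format_1.0.0.yaml",
    "video_object_tracking": "video_object_tracking_io_format_1.0.0.yaml",
}


def inputs_schema_uri(objective):
    """Return the INPUT_SCHEMA_URI value for a shorthand objective name."""
    try:
        return SCHEMA_BASE + SCHEMA_FILES[objective]
    except KeyError:
        known = ", ".join(sorted(SCHEMA_FILES))
        raise ValueError(f"unknown objective {objective!r}; expected one of: {known}") from None


print(inputs_schema_uri("image_object_detection"))
```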

HTTP method and URL:

POST https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/dataLabelingJobs

Request JSON body:

{
   "displayName":"DISPLAY_NAME",
   "datasets":["DATASET_ID"],
   "labelerCount":LABELERS,
   "instructionUri":"INSTRUCTIONS",
   "inputsSchemaUri":"INPUT_SCHEMA_URI",
   "inputs": {
     "annotation_specs": [LABEL_LIST]
   },
   "annotationLabels": {
     "aiplatform.googleapis.com/annotation_set_name": "ANNOTATION_SET"
   }
}
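
As an illustration, the request body can also be assembled and checked in Python before it is sent. The sketch below uses only the standard library and does not call the API. The function name and the example values are hypothetical. It sends `datasets` as a list, matching the array form that the API returns, and it rejects labeler counts other than 1, 3, or 5.

```python
import json

VALID_LABELER_COUNTS = (1, 3, 5)


def build_labeling_job_body(
    display_name,
    dataset,
    labeler_count,
    instruction_uri,
    inputs_schema_uri,
    labels,
    annotation_set,
):
    """Build the JSON body for a data labeling job request as a dict."""
    if labeler_count not in VALID_LABELER_COUNTS:
        raise ValueError("labeler_count must be 1, 3, or 5")
    if not labels:
        raise ValueError("at least one label is required")
    return {
        "displayName": display_name,
        "datasets": [dataset],
        "labelerCount": labeler_count,
        "instructionUri": instruction_uri,
        "inputsSchemaUri": inputs_schema_uri,
        "inputs": {"annotation_specs": list(labels)},
        "annotationLabels": {
            "aiplatform.googleapis.com/annotation_set_name": annotation_set
        },
    }


# All values below are placeholders.
body = build_labeling_job_body(
    display_name="pets-labeling",
    dataset="DATASET_ID",
    labeler_count=3,
    instruction_uri="gs://my-bucket/labeling/instructions.pdf",
    inputs_schema_uri="gs://google-cloud-aiplatform/schema/dataset/ioformat/"
    "image_classification_single_label_io_format_1.0.0.yaml",
    labels=["dog", "cat", "other"],
    annotation_set="pets-annotations",
)
print(json.dumps(body, indent=2))
```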

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/us-central1/dataLabelingJobs/JOB_ID",
  "displayName": "DISPLAY_NAME",
  "datasets": [
    "DATASET_ID"
  ],
  "labelerCount": LABELERS,
  "instructionUri": "INSTRUCTIONS",
  "inputsSchemaUri": "INPUT_SCHEMA_URI",
  "inputs": {
    "annotationSpecs": [
      LABEL_LIST
    ]
  },
  "state": "JOB_STATE_PENDING",
  "labelingProgress": "0",
  "createTime": "2020-05-30T23:13:49.121133Z",
  "updateTime": "2020-05-30T23:13:49.121133Z",
  "savedQuery": {
    "name": "projects/PROJECT_ID/locations/us-central1/datasets/DATASET_ID/savedQueries/ANNOTATION_SET_ID"
  },
  "annotationSpecCount": 2
}

The response is a DataLabelingJob. You can check the progress of the job by monitoring the "labelingProgress" field, whose value is the percentage completed.
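
As a rough sketch of how a script might read that value, the following parses a job response and returns the state and progress. The function name is hypothetical. The sample response shows "labelingProgress" as a string, so the code converts it to an integer.

```python
import json


def labeling_progress(job_json):
    """Return (state, percent_complete) from a DataLabelingJob JSON string."""
    job = json.loads(job_json)
    # Treat a missing field as 0 percent; this default is an assumption.
    progress = int(job.get("labelingProgress", 0))
    return job.get("state"), progress


# Trimmed copy of the sample response, with placeholder values.
sample = """
{
  "name": "projects/PROJECT_ID/locations/us-central1/dataLabelingJobs/JOB_ID",
  "state": "JOB_STATE_PENDING",
  "labelingProgress": "0"
}
"""
state, percent = labeling_progress(sample)
print(f"{state}: {percent}% complete")  # prints JOB_STATE_PENDING: 0% complete
```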

Java

import com.google.cloud.aiplatform.v1.DataLabelingJob;
import com.google.cloud.aiplatform.v1.DatasetName;
import com.google.cloud.aiplatform.v1.JobServiceClient;
import com.google.cloud.aiplatform.v1.JobServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import com.google.type.Money;
import java.io.IOException;
import java.util.Map;

public class CreateDataLabelingJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATA_LABELING_DISPLAY_NAME";
    String datasetId = "YOUR_DATASET_ID";
    String instructionUri =
        "gs://YOUR_GCS_SOURCE_BUCKET/path_to_your_data_labeling_source/file.pdf";
    String inputsSchemaUri = "YOUR_INPUT_SCHEMA_URI";
    String annotationSpec = "YOUR_ANNOTATION_SPEC";
    createDataLabelingJob(
        project, displayName, datasetId, instructionUri, inputsSchemaUri, annotationSpec);
  }

  static void createDataLabelingJob(
      String project,
      String displayName,
      String datasetId,
      String instructionUri,
      String inputsSchemaUri,
      String annotationSpec)
      throws IOException {
    JobServiceSettings jobServiceSettings =
        JobServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (JobServiceClient jobServiceClient = JobServiceClient.create(jobServiceSettings)) {
      String location = "us-central1";
      LocationName locationName = LocationName.of(project, location);

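      // Build the job inputs as a JSON object with an "annotation_specs" array,
      // then parse that JSON into a protobuf Value.
      String jsonString = "{\"annotation_specs\": [ " + annotationSpec + "]}";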
      Value.Builder annotationSpecValue = Value.newBuilder();
      JsonFormat.parser().merge(jsonString, annotationSpecValue);

      DatasetName datasetName = DatasetName.of(project, location, datasetId);
      DataLabelingJob dataLabelingJob =
          DataLabelingJob.newBuilder()
              .setDisplayName(displayName)
              .setLabelerCount(1)
              .setInstructionUri(instructionUri)
              .setInputsSchemaUri(inputsSchemaUri)
              .addDatasets(datasetName.toString())
              .setInputs(annotationSpecValue)
              .putAnnotationLabels(
                  "aiplatform.googleapis.com/annotation_set_name", "my_test_saved_query")
              .build();

      DataLabelingJob dataLabelingJobResponse =
          jobServiceClient.createDataLabelingJob(locationName, dataLabelingJob);

      System.out.println("Create Data Labeling Job Response");
      System.out.format("\tName: %s\n", dataLabelingJobResponse.getName());
      System.out.format("\tDisplay Name: %s\n", dataLabelingJobResponse.getDisplayName());
      System.out.format("\tDatasets: %s\n", dataLabelingJobResponse.getDatasetsList());
      System.out.format("\tLabeler Count: %s\n", dataLabelingJobResponse.getLabelerCount());
      System.out.format("\tInstruction Uri: %s\n", dataLabelingJobResponse.getInstructionUri());
      System.out.format("\tInputs Schema Uri: %s\n", dataLabelingJobResponse.getInputsSchemaUri());
      System.out.format("\tInputs: %s\n", dataLabelingJobResponse.getInputs());
      System.out.format("\tState: %s\n", dataLabelingJobResponse.getState());
      System.out.format("\tLabeling Progress: %s\n", dataLabelingJobResponse.getLabelingProgress());
      System.out.format("\tCreate Time: %s\n", dataLabelingJobResponse.getCreateTime());
      System.out.format("\tUpdate Time: %s\n", dataLabelingJobResponse.getUpdateTime());
      System.out.format("\tLabels: %s\n", dataLabelingJobResponse.getLabelsMap());
      System.out.format(
          "\tSpecialist Pools: %s\n", dataLabelingJobResponse.getSpecialistPoolsList());
      for (Map.Entry<String, String> annotationLabelMap :
          dataLabelingJobResponse.getAnnotationLabelsMap().entrySet()) {
        System.out.println("\tAnnotation Level");
        System.out.format("\t\tkey: %s\n", annotationLabelMap.getKey());
        System.out.format("\t\tvalue: %s\n", annotationLabelMap.getValue());
      }
      Money money = dataLabelingJobResponse.getCurrentSpend();

      System.out.println("\tCurrent Spend");
      System.out.format("\t\tCurrency Code: %s\n", money.getCurrencyCode());
      System.out.format("\t\tUnits: %s\n", money.getUnits());
      System.out.format("\t\tNanos: %s\n", money.getNanos());
    }
  }
}

Python

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value


def create_data_labeling_job_sample(
    project: str,
    display_name: str,
    dataset_name: str,
    instruction_uri: str,
    inputs_schema_uri: str,
    annotation_spec: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    inputs_dict = {"annotation_specs": [annotation_spec]}
    inputs = json_format.ParseDict(inputs_dict, Value())

    data_labeling_job = {
        "display_name": display_name,
        # Full resource name: projects/{project_id}/locations/{location}/datasets/{dataset_id}
        "datasets": [dataset_name],
        # labeler_count must be 1, 3, or 5
        "labeler_count": 1,
        "instruction_uri": instruction_uri,
        "inputs_schema_uri": inputs_schema_uri,
        "inputs": inputs,
        "annotation_labels": {
            "aiplatform.googleapis.com/annotation_set_name": "my_test_saved_query"
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_data_labeling_job(
        parent=parent, data_labeling_job=data_labeling_job
    )
    print("response:", response)

NOTE: The maximum turnaround time for a data labeling job is 63 days. If it is not complete within that time, the job expires and is deleted along with the tasks assigned to labelers.
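
To estimate when a job would expire, you can add 63 days to its creation timestamp. The sketch below assumes that the 63 days are counted from the job's "createTime" value and that the timestamp has microsecond precision, as in the sample response; neither assumption is confirmed by this page. The function name is hypothetical.

```python
from datetime import datetime, timedelta

MAX_TURNAROUND = timedelta(days=63)


def labeling_job_deadline(create_time):
    """Return the estimated expiry time for a job created at create_time."""
    # Replace the trailing "Z" so that older Python versions can parse the value.
    created = datetime.fromisoformat(create_time.replace("Z", "+00:00"))
    return created + MAX_TURNAROUND


deadline = labeling_job_deadline("2020-05-30T23:13:49.121133Z")
print(deadline.isoformat())  # prints 2020-08-01T23:13:49.121133+00:00
```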

What's next