Submitting text labeling requests

AI Platform Data Labeling Service supports three types of text labeling tasks:

  • Classification tasks, where labelers assign one or more labels to each text segment. You can specify how many labelers label each text segment; we recommend five or fewer. Data Labeling Service uses a majority vote to determine the final labels.
  • Classification tasks with sentiment, where labelers assign labels as in a plain text classification task and can also assign a sentiment to each label in the text segment, such as "POSITIVE" or "NEGATIVE". Data Labeling Service collects the sentiment along with the labels.
  • Entity extraction tasks, where labelers are given a list of labels and a text segment (up to 100,000 characters) and, for each label, select the start and end of the text that the label applies to. They can also select "not included". Data Labeling Service collects the indices of the selected text for each label.
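
As a rough illustration of the majority vote mentioned for classification tasks, the following sketch tallies the labels that several labelers assigned to one text segment. The service's actual aggregation logic is not documented here; the function name `majority_labels` and the rule of keeping labels chosen by more than half of the labelers are assumptions made for illustration.

```python
from collections import Counter


def majority_labels(votes):
    """Return labels chosen by more than half of the labelers.

    `votes` holds one collection of labels per labeler. This is a hypothetical
    helper, not the service's implementation.
    """
    if not votes:
        return []
    # Count each label at most once per labeler.
    counts = Counter(label for labels in votes for label in set(labels))
    threshold = len(votes) / 2
    return sorted(label for label, count in counts.items() if count > threshold)


if __name__ == "__main__":
    votes = [["sports"], ["sports", "politics"], ["sports"]]
    print(majority_labels(votes))
```

With an odd number of labelers (three or five), a label can never land on exactly half of the votes under this rule.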

The labeling request is a long-running operation. The response includes the operation ID, which you can use to check the status of the request. When the labeling is complete, the response includes the value "done": true.
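
The polling pattern implied above might look like the following sketch. It assumes a caller-supplied `fetch_operation` function (hypothetical) that returns the operation's JSON as a dict, for example by issuing a GET request on the operation name; only the "done" field comes from the description above.

```python
import time


def wait_until_done(fetch_operation, interval_seconds=30.0, max_attempts=10,
                    sleep=time.sleep):
    """Poll until the operation reports "done": true, or give up.

    `fetch_operation` is a hypothetical callable returning the operation as a
    dict. Returns the final dict, or None if the attempts ran out.
    """
    for attempt in range(max_attempts):
        operation = fetch_operation()
        # An unfinished operation may omit "done", so default to False.
        if operation.get("done", False):
            return operation
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return None
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays; labeling by human labelers can take a long time, so a generous interval is reasonable.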

Note: Only English text is currently supported.

Text classification tasks

Web UI

  1. Open the Data Labeling Service UI.

  2. Select Datasets from the left navigation.

    The Datasets page shows the status of previously created datasets for the current project.

  3. Click the name of the dataset you want to submit for labeling.

    Datasets with status Import complete are available to submit. The Type of data column shows whether the dataset includes images, videos, or text.

  4. On the Dataset detail page, click the Create labeling task button in the title bar.

  5. On the New labeling task page, enter a name and description for the annotated dataset.

    The annotated dataset is the version of this dataset labeled by human labelers.

  6. From the Objective drop-down, select the type of labeling task you want performed on this dataset.

    The drop-down list includes only the objectives available for the type of data in this dataset. If you do not see the objective you want, you have probably selected a dataset that contains a different type of data. Close the New labeling task page and select a different dataset.

  7. From the Label set drop-down, choose the label set you want the labelers to apply to data items in this set.

    The drop-down list includes all label sets associated with this project. You must choose a set.

  8. From the Instruction drop-down, choose the instructions you want to provide to the labelers working with this dataset.

    The drop-down list includes all instructions associated with this project. You must include instructions in the labeling request.

  9. From the labelers per data item drop-down, specify the number of labelers to review each item in the dataset.

    The default is one, but you can request three or five labelers for each item.

  10. Click the check box to confirm that you understand how you will be charged for the labeling.

  11. Click Create.

Command-line

Set the following environment variables:
  1. PROJECT_ID variable to your Google Cloud project ID.
  2. DATASET_ID variable to the ID of your dataset, from the response when you created the dataset. The ID appears at the end of the full dataset name:

    projects/project-id/datasets/dataset-id
  3. INSTRUCTION_RESOURCE_NAME to the name of your instruction resource.
  4. ANNOTATION_SPEC_SET_RESOURCE_NAME to the name of your label set resource.

# The unquoted heredoc lets the shell expand the environment variables
# inside the JSON body; a single-quoted body would send them literally.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  "https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}/text:label" \
  -d @- <<EOF
{
  "basicConfig": {
    "instruction": "${INSTRUCTION_RESOURCE_NAME}",
    "annotatedDatasetDisplayName": "curl_testing_annotated_dataset",
    "labelGroup": "test_label_group",
    "replicaCount": 1
  },
  "feature": "TEXT_CLASSIFICATION",
  "textClassificationConfig": {
    "annotationSpecSet": "${ANNOTATION_SPEC_SET_RESOURCE_NAME}"
  }
}
EOF
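
Because the request body must be valid JSON with the resource names substituted in, it can help to build it programmatically. The sketch below assembles a body equivalent to the one in the curl example from the environment variables listed above and serializes it with the standard `json` module. The helper name `build_label_text_body` is illustrative, and the placeholder fallbacks used when a variable is unset are assumptions.

```python
import json
import os


def build_label_text_body(instruction, annotation_spec_set,
                          feature="TEXT_CLASSIFICATION"):
    """Build the JSON body for a text:label request (illustrative helper)."""
    # Each feature pairs with its own config field in the request.
    config_field = {
        "TEXT_CLASSIFICATION": "textClassificationConfig",
        "TEXT_ENTITY_EXTRACTION": "textEntityExtractionConfig",
    }[feature]
    return {
        "basicConfig": {
            "instruction": instruction,
            "annotatedDatasetDisplayName": "curl_testing_annotated_dataset",
            "labelGroup": "test_label_group",
            "replicaCount": 1,
        },
        "feature": feature,
        config_field: {"annotationSpecSet": annotation_spec_set},
    }


if __name__ == "__main__":
    body = build_label_text_body(
        os.environ.get("INSTRUCTION_RESOURCE_NAME", "INSTRUCTION_RESOURCE_NAME"),
        os.environ.get("ANNOTATION_SPEC_SET_RESOURCE_NAME",
                       "ANNOTATION_SPEC_SET_RESOURCE_NAME"),
    )
    # json.dumps never emits trailing commas, so the output is valid JSON.
    print(json.dumps(body, indent=2))
```

The printed JSON could be saved to a file and passed to curl with `-d @filename`.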

You should see output similar to the following. You can use the operation ID to get the status of the task; see Getting the status of an operation for an example.

{
  "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.data-labeling.v1beta1.LabelTextClassificationOperationMetadata",
    "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c"
  }
}
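
To pull the operation ID out of a response like this one, a small helper can take the last path segment of the "name" field. This is a sketch that assumes the response has already been parsed into a dict; `resource_id` is an illustrative name, not part of any client library.

```python
def resource_id(resource_name):
    """Return the final path segment of a resource name."""
    return resource_name.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    response = {
        "name": "projects/data-labeling-codelab/operations/"
                "5c73dd6b_0000_2b34_a920_883d24fa2064",
        "metadata": {
            "dataset": "projects/data-labeling-codelab/datasets/"
                       "5c73db3d_0000_23e0_a25b_94eb2c119c4c",
        },
    }
    # The operation ID and the dataset ID are both trailing path segments.
    print(resource_id(response["name"]))
    print(resource_id(response["metadata"]["dataset"]))
```

The same helper works for recovering the dataset ID from a full dataset name when setting DATASET_ID.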

Java

Before you can run this code example, you must install the Java Client Libraries.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.datalabeling.v1beta1.AnnotatedDataset;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceClient;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceSettings;
import com.google.cloud.datalabeling.v1beta1.HumanAnnotationConfig;
import com.google.cloud.datalabeling.v1beta1.LabelOperationMetadata;
import com.google.cloud.datalabeling.v1beta1.LabelTextRequest;
import com.google.cloud.datalabeling.v1beta1.LabelTextRequest.Feature;
import com.google.cloud.datalabeling.v1beta1.SentimentConfig;
import com.google.cloud.datalabeling.v1beta1.TextClassificationConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class LabelText {

  // Start a Text Labeling Task
  static void labelText(
      String formattedInstructionName,
      String formattedAnnotationSpecSetName,
      String formattedDatasetName) throws IOException {
    // String formattedInstructionName = DataLabelingServiceClient.formatInstructionName(
    //      "YOUR_PROJECT_ID", "YOUR_INSTRUCTION_UUID");
    // String formattedAnnotationSpecSetName =
    //     DataLabelingServiceClient.formatAnnotationSpecSetName(
    //         "YOUR_PROJECT_ID", "YOUR_ANNOTATION_SPEC_SET_UUID");
    // String formattedDatasetName = DataLabelingServiceClient.formatDatasetName(
    //      "YOUR_PROJECT_ID", "YOUR_DATASET_UUID");


    DataLabelingServiceSettings settings = DataLabelingServiceSettings
        .newBuilder()
        .build();
    try (DataLabelingServiceClient dataLabelingServiceClient =
             DataLabelingServiceClient.create(settings)) {
      HumanAnnotationConfig humanAnnotationConfig =
          HumanAnnotationConfig.newBuilder()
              .setAnnotatedDatasetDisplayName("annotated_displayname")
              .setAnnotatedDatasetDescription("annotated_description")
              .setLanguageCode("en-us")
              .setInstruction(formattedInstructionName)
              .build();

      SentimentConfig sentimentConfig =
          SentimentConfig.newBuilder().setEnableLabelSentimentSelection(false).build();

      TextClassificationConfig textClassificationConfig =
          TextClassificationConfig.newBuilder()
              .setAnnotationSpecSet(formattedAnnotationSpecSetName)
              .setSentimentConfig(sentimentConfig)
              .build();

      LabelTextRequest labelTextRequest =
          LabelTextRequest.newBuilder()
              .setParent(formattedDatasetName)
              .setBasicConfig(humanAnnotationConfig)
              .setTextClassificationConfig(textClassificationConfig)
              .setFeature(Feature.TEXT_CLASSIFICATION)
              .build();

      OperationFuture<AnnotatedDataset, LabelOperationMetadata> operation =
          dataLabelingServiceClient.labelTextAsync(labelTextRequest);

      // You'll want to save this for later to retrieve your completed operation.
      // System.out.format("Operation Name: %s\n", operation.getName());

      // Cancel the operation to avoid charges when testing.
      dataLabelingServiceClient.getOperationsClient().cancelOperation(operation.getName());

    } catch (IOException | InterruptedException | ExecutionException e) {
      e.printStackTrace();
    }
  }
}

Entity extraction tasks

Web UI

  1. Open the Data Labeling Service UI.

  2. Select Datasets from the left navigation.

    The Datasets page shows the status of previously created datasets for the current project.

  3. Click the name of the dataset you want to submit for labeling.

    Datasets with status Import complete are available to submit. The Type of data column shows whether the dataset includes images, videos, or text.

  4. On the Dataset detail page, click the Create labeling task button in the title bar.

  5. On the New labeling task page, enter a name and description for the annotated dataset.

    The annotated dataset is the version of this dataset labeled by human labelers.

  6. From the Objective drop-down, select the type of labeling task you want performed on this dataset.

    The drop-down list includes only the objectives available for the type of data in this dataset. If you do not see the objective you want, you have probably selected a dataset that contains a different type of data. Close the New labeling task page and select a different dataset.

  7. From the Label set drop-down, choose the label set you want the labelers to apply to data items in this set.

    The drop-down list includes all label sets associated with this project. You must choose a set.

  8. From the Instruction drop-down, choose the instructions you want to provide to the labelers working with this dataset.

    The drop-down list includes all instructions associated with this project. You must include instructions in the labeling request.

  9. From the labelers per data item drop-down, specify the number of labelers to review each item in the dataset.

    The default is one, but you can request three or five labelers for each item.

  10. Click the check box to confirm that you understand how you will be charged for the labeling.

  11. Click Create.

Command-line

Set the following environment variables:
  1. PROJECT_ID variable to your Google Cloud project ID.
  2. DATASET_ID variable to the ID of your dataset, from the response when you created the dataset. The ID appears at the end of the full dataset name:

    projects/project-id/datasets/dataset-id
  3. INSTRUCTION_RESOURCE_NAME to the name of your instruction resource.
  4. ANNOTATION_SPEC_SET_RESOURCE_NAME to the name of your label set resource.

# The unquoted heredoc lets the shell expand the environment variables
# inside the JSON body; a single-quoted body would send them literally.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  "https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}/text:label" \
  -d @- <<EOF
{
  "basicConfig": {
    "instruction": "${INSTRUCTION_RESOURCE_NAME}",
    "annotatedDatasetDisplayName": "curl_testing_annotated_dataset",
    "labelGroup": "test_label_group",
    "replicaCount": 1
  },
  "feature": "TEXT_ENTITY_EXTRACTION",
  "textEntityExtractionConfig": {
    "annotationSpecSet": "${ANNOTATION_SPEC_SET_RESOURCE_NAME}"
  }
}
EOF

You should see output similar to the following. You can use the operation ID to get the status of the task; see Getting the status of an operation for an example.

{
  "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.data-labeling.v1beta1.LabelTextEntityExtractionOperationMetadata",
    "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c"
  }
}

Python

Before you can run this code example, you must install the Python Client Libraries.

def label_text(
    dataset_resource_name, instruction_resource_name, annotation_spec_set_resource_name
):
    """Labels a text dataset."""
    from google.cloud import datalabeling_v1beta1 as datalabeling

    client = datalabeling.DataLabelingServiceClient()

    basic_config = datalabeling.HumanAnnotationConfig(
        instruction=instruction_resource_name,
        annotated_dataset_display_name="YOUR_ANNOTATED_DATASET_DISPLAY_NAME",
        label_group="YOUR_LABEL_GROUP",
        replica_count=1,
    )

    feature = datalabeling.LabelTextRequest.Feature.TEXT_ENTITY_EXTRACTION

    config = datalabeling.TextEntityExtractionConfig(
        annotation_spec_set=annotation_spec_set_resource_name
    )

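    response = client.label_text(
        request={
            "parent": dataset_resource_name,
            "basic_config": basic_config,
            "feature": feature,
            # The config field must match the feature: entity extraction
            # uses text_entity_extraction_config.
            "text_entity_extraction_config": config,
        }
    )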

    print("Label_text operation name: {}".format(response.operation.name))
    return response