导出已加标签的数据

在加标签操作完成后,您可以调用 ExportData 将已添加注释的数据集导出到 Google Cloud Storage 存储分区。

ExportData 支持返回 .csv 文件,其中每个注释或数据项对应一行数据。第一个字段表示此行的 ml 使用类别,默认为 UNASSIGNED。ExportData 还支持 jsonl 文件,其中每一行代表一个示例,此示例包含一个数据项和所有注释。以下是每种类型的示例。

图片分类

  • csv 行:

    UNASSIGNED,image_url,label_1,label_2,...

  • json 行:

    {
    "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
    "imagePayload":{
    "mimeType":"IMAGE_PNG",
    "imageUri":"gs://sample_bucket/image.png"
    },
    "annotations":[
    {
         "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
       "annotationValue":{
          "imageClassificationAnnotation":{
           "annotationSpec":{
                "displayName":"tulip",
             }
          }
       }
    }
    ]
    }

图片边界框

  • csv 行:每行都包含一个边界框的相关信息,并使用 x,y 坐标表示每个框角。单个图片的多个框位于单独的行中。行格式为 UNASSIGNED, image_url, label, topleft_x, topleft_y, topright_x, topright_y, bottomright_x, bottomright_y, bottomleft_x, bottomleft_y。topright_x、topright_y、bottomleft_x 和 bottomleft_y 坐标可能是空字符串,因为它们提供冗余的信息。

    UNASSIGNED,image_url,label,0.1,0.1,,,0.3,0.3,,

  • json 行:如果未设置 normalizedVertices 中的坐标,则该字段默认为 0。这也适用于任何基于坐标的注释。

    {
     "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
     "imagePayload":{
        "mimeType":"IMAGE_PNG",
        "imageUri":"gs://sample_bucket/image.png"
     },
     "annotations":[
        {
             "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
           "annotationValue":{
             "image_bounding_poly_annotation": {
              "annotationSpec": {
                "displayName": "tulip"
              },
              "normalizedBoundingPoly": {
              "normalizedVertices": [ {
                  "x": 0.1,
                  "y": 0.2
                }, {
                  "x": 0.9,
                  "y": 0.9
                } ]
              }
           }
        }
      }
     ]
    }

图片边界多边形、定向边界框和折线

  • csv 行:封闭多边形/折线中的每个点由 x,y 点表示,并由两个空的 csv 列分隔。如果折线没有封闭循环,最后一对 x,y 会连接回多边形的第一对 x,y。每行代表一个多边形/一条折线。

    UNASSIGNED,image_url,label,0.1,0.1,,,0.3,0.3,,,0.6,0.6,,...

  • json 行:

    {
    "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
    "imagePayload":{
    "mimeType":"IMAGE_PNG",
    "imageUri":"gs://sample_bucket/image.png"
    },
    "annotations":[
    {
         "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
       "annotationValue":{
         "image_bounding_poly_annotation": {
          "annotationSpec": {
            "displayName": "tulip"
          },
          "normalizedBoundingPoly": {
            "normalizedVertices": [ {
              "x": 0.1,
              "y": 0.1
            }, {
              "x": 0.1,
              "y": 0.2
            }, {
              "x": 0.2,
              "y": 0.3
            }  ]
          }
       }
    }
    }
    ]
    }

图片分割

对于图片分割,仅提供 jsonl 输出。

  • json 行:imageSegmentationAnnotation 中的 imageBytes 字段表示该图片的分割掩码。每个标签(即每只狗和猫)的颜色都显示在 annotationColors 字段中。
    {
    "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
    "imagePayload":{
    "mimeType":"IMAGE_PNG",
    "imageUri":"gs://sample_bucket/image.png"
    },
    "annotations":[
    {
         "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
       "annotationValue":{
         "imageSegmentationAnnotation": {
            "annotationColors": [ {
              "key": "rgb(0,0,255)",
              "value": {
                "display_name": "dog"
              }
            }, {
              "key": "rgb(0,255,0)",
              "value": {
                "display_name": "cat"
              }
            } ],
            "mimeType": "IMAGE_JPEG",
            "imageBytes": "/9j/4AAQSkZJRgABAQAAAQABAAD/2"
       }
    }
    }
    ]
    }

视频分类

  • csv 行:

    UNASSIGNED,video_url,label,segment_start_time,segment_end_time

  • json 行:

    {
    "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
    "videoPayload": {
      "mimeType": "VIDEO_MP4",
      "resolution": {
        width: 720,
        height: 360
      }
      "frameRate": 24
    },
    "annotations": [ {
      "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
      "annotationSource": 3,
      "annotationValue": {
        "videoClassificationAnnotation": {
          "timeSegment": {
            "startTimeOffset": {
              "seconds": 10
            },
            "endTimeOffset": {
              "seconds": 20
            }
          },
          "annotationSpec": {
            "displayName": "dog"
          }
        }
      }
    } ]
    }

视频对象检测

  • csv 行:四个点分别位于左上角、右上角、右下角、左下角。 第二个点和第四个点是可选的。每个点由 x,y 表示。 每行将包含一个边界框。

    UNASSIGNED,video_url,label,timestamp,0.1,0.1,,,0.3,0.3,,

  • json 行:

    {
    "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
    "videoPayload": {
      "mimeType": "VIDEO_MP4",
      "resolution": {
        width: 720,
        height: 360
      }
      "frameRate": 24
    },
    "annotations": [ {
      "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
      "annotationSource": 3,
      "annotationValue": {
        "videoObjectTrackingAnnotation": {
      "annotationSpec": {
        "displayName": "tulip"
      },
      "timeSegment": {
        "startTimeOffset": {
          "seconds": 10
        },
        "endTimeOffset": {
          "seconds": 10
        }
      },
      "objectTrackingFrames": [ {
        "normalizedBoundingPoly": {
          "normalizedVertices": [ {
            "x": 0.2,
            "y": 0.3
          }, {
            "x": 0.9,
            "y": 0.5
          } ]
        },
      }, {
        "normalizedBoundingPoly": {
          "normalizedVertices": [ {
            "x": 0.3,
            "y": 0.3
          }, {
            "x": 0.5,
            "y": 0.7
          } ]
        },
      } ]
    }
    }
    }]}

视频对象跟踪

  • csv 行:四个点分别位于左上角、右上角、右下角、左下角。 第二个点和第四个点是可选的。每个点由 x,y 表示。 每行将包含一个边界框。视频中的每个对象均由非重复的 instance_id 表示。

    UNASSIGNED,video_url,label,instance_id,timestamp,0.1,0.1,,,0.3,0.3,,

  • json 行:

    {
    "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
    "videoPayload": {
      "mimeType": "VIDEO_MP4",
      "resolution": {
        width: 720,
        height: 360
      }
      "frameRate": 24
    },
    "annotations": [ {
      "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
      "annotationSource": 3,
      "annotationValue": {
        "videoObjectTrackingAnnotation": {
      "annotationSpec": {
        "displayName": "tulip"
      },
      "timeSegment": {
        "startTimeOffset": {
          "seconds": 10
        },
        "endTimeOffset": {
          "seconds": 20
        }
      },
      "objectTrackingFrames": [ {
        "normalizedBoundingPoly": {
          "normalizedVertices": [ {
            "x": 0.2,
            "y": 0.3
          }, {
            "x": 0.9,
            "y": 0.5
          } ]
        },
        "timeOffset": {
          "nanos": 1000000
        }
      }, {
        "normalizedBoundingPoly": {
          "normalizedVertices": [ {
            "x": 0.3,
            "y": 0.3
          }, {
            "x": 0.5,
            "y": 0.7
          } ]
        },
        "timeOffset": {
          "nanos": 84000000
        }
      } ]
    }
    }
    }]}

视频事件

  • csv 行:四个点分别位于左上角、右上角、右下角、左下角。第二个点和第四个点是可选的。每个点由 x,y 表示。 每行将包含一个边界框。视频中的每个对象均由非重复的 instance_id 表示。

    UNASSIGNED,video_url,label,segment_start_time,segment_end_time

  • json 行:

    {
    "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
    "videoPayload": {
      "mimeType": "VIDEO_MP4",
      "resolution": {
        width: 720,
        height: 360
      }
      "frameRate": 24
    },
    "annotations": [ {
      "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id",
      "annotationValue": {
        "videoEventAnnotation": {
          "annotationSpec": {
            "displayName": "Callie"
          },
          "timeSegment": {
            "startTimeOffset": {
              "seconds": 123
            },
            "endTimeOffset": {
              "seconds": 150
            }
          }
        }
      }
     } ]
    }
    }
    }]}

文本分类

  • csv 行:

    UNASSIGNED,text_url,label_l

  • json 行:

    {
      "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
      "textPayload": {
        "textContent": "dummy_text_content",
        "textUri": "gs://test_bucket/file.txt",
        "wordCount": 1
      }
      "annotations": [ {
        "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/fake_annotation_id",
        "annotationValue": {
          "textClassificationAnnotation": {
            "annotationSpec": {
              "displayName": "news"
            }
          }
        }
      } ],
    }

文本实体提取

对于文本实体提取,仅提供 jsonl 输出。

  • json 行:
    {
        "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id",
        "textPayload": {
          "textContent": "dummy_text_content",
          "textUri": "gs://test_bucket/file.txt",
          "wordCount": 1
        }
        "annotations": [ {
          "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/fake_annotation_id",
          "annotationValue": {
            "textEntityExtractionAnnotation": {
              "annotationSpec": {
                "displayName": "equations"
              },
              "textSegment": {
                "startOffset": 10,
                "endOffset": 20
              }
            }
          }
        } ],
      }

ExportData 是一项长时间运行的操作。API 将返回操作 ID。您稍后可以使用操作 ID 调用 GetOperation,以便获取其状态。

网页界面

如需使用数据标签服务界面导出已加标签的数据,请按照以下步骤操作。

  1. 在 Google Cloud 控制台中打开数据标签服务界面

    数据集页面会显示之前为当前项目创建的数据集的状态。

  2. 点击您要导出的数据集的名称。系统随即会转到数据集详情页面。

  3. 已加标签的数据集部分中,点击导出状态列中的导出

  4. 导出已加标签的数据集对话框中,输入要用于输出文件的 Cloud Storage 路径,并选择您所需的文件格式。

  5. 点击导出

    数据集详情页面会在导出数据时显示“正在进行”的状态。导出完成后,您可以在指定的 Cloud Storage 路径中找到导出文件。

命令行

设置以下环境变量:

  1. PROJECT_ID 变量设置为您的 Google Cloud 项目 ID。
  2. DATASET_ID 变量设置为您的数据集 ID(来自创建数据集时的响应)。该 ID 显示在完整数据集名称的末尾:

    projects/PROJECT_ID/locations/us-central1/datasets/DATASET_ID
  3. ANNOTATED_DATASET_ID 变量设置为带注释的数据集资源名称的 ID。资源名称采用以下格式:

    projects/PROJECT_ID/locations/us-central1/datasets/DATASET_ID/annotatedDatasets/ANNOTATED_DATASET_ID
  4. STORAGE_URI 变量设置为要存储结果的 Cloud Storage 存储分区的 URI。

对于除图片分割之外的所有注释请求,curl 请求类似于以下代码:

curl -X POST \
   -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
   -H "Content-Type: application/json" \
   https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}:exportData \
   -d '{
     "annotatedDataset": "${ANNOTATED_DATASET_ID}",
     "outputConfig": {
       "gcsDestination": {
           "output_uri": "${STORAGE_URI}",
           "mimeType": "text/csv"
       }
     }
   }'

如需导出图片分割数据,curl 请求应类似于以下代码:

curl -X POST \
   -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
   -H "Content-Type: application/json" \
   https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}:exportData \
   -d '{
     "annotatedDataset": "${ANNOTATED_DATASET_ID}",
     "outputConfig": {
       "gcsFolderDestination": {
         "output_folder_uri": "${STORAGE_URI}"
       }
     }
   }'

您将看到如下所示的输出:

{
  "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.data-labeling.v1beta1.ExportDataOperationResponse",
    "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c"
  }
}

Python

您必须先安装 Python 客户端库,然后才能运行此代码示例。

def export_data(dataset_resource_name, annotated_dataset_resource_name, export_gcs_uri):
    """Exports a dataset from the given Google Cloud project."""
    from google.cloud import datalabeling_v1beta1 as datalabeling

    client = datalabeling.DataLabelingServiceClient()

    gcs_destination = datalabeling.GcsDestination(
        output_uri=export_gcs_uri, mime_type="text/csv"
    )

    output_config = datalabeling.OutputConfig(gcs_destination=gcs_destination)

    response = client.export_data(
        request={
            "name": dataset_resource_name,
            "annotated_dataset": annotated_dataset_resource_name,
            "output_config": output_config,
        }
    )

    print(f"Dataset ID: {response.result().dataset}\n")
    print("Output config:")
    print("\tGcs destination:")
    print(
        "\t\tOutput URI: {}\n".format(
            response.result().output_config.gcs_destination.output_uri
        )
    )

Java

在运行此代码示例之前,您必须先安装 Java 客户端库
import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceClient;
import com.google.cloud.datalabeling.v1beta1.DataLabelingServiceSettings;
import com.google.cloud.datalabeling.v1beta1.ExportDataOperationMetadata;
import com.google.cloud.datalabeling.v1beta1.ExportDataOperationResponse;
import com.google.cloud.datalabeling.v1beta1.ExportDataRequest;
import com.google.cloud.datalabeling.v1beta1.GcsDestination;
import com.google.cloud.datalabeling.v1beta1.LabelStats;
import com.google.cloud.datalabeling.v1beta1.OutputConfig;
import java.io.IOException;
import java.util.Map.Entry;
import java.util.Set;
import java.util.concurrent.ExecutionException;

class ExportData {

  // Export data from an annotated dataset.
  static void exportData(String datasetName, String annotatedDatasetName, String gcsOutputUri)
      throws IOException {
    // String datasetName = DataLabelingServiceClient.formatDatasetName(
    //     "YOUR_PROJECT_ID", "YOUR_DATASETS_UUID");
    // String annotatedDatasetName = DataLabelingServiceClient.formatAnnotatedDatasetName(
    //     "YOUR_PROJECT_ID",
    //     "YOUR_DATASET_UUID",
    //     "YOUR_ANNOTATED_DATASET_UUID");
    // String gcsOutputUri = "gs://YOUR_BUCKET_ID/export_path";


    DataLabelingServiceSettings settings =
        DataLabelingServiceSettings.newBuilder()
            .build();
    try (DataLabelingServiceClient dataLabelingServiceClient =
        DataLabelingServiceClient.create(settings)) {
      GcsDestination gcsDestination =
          GcsDestination.newBuilder().setOutputUri(gcsOutputUri).setMimeType("text/csv").build();

      OutputConfig outputConfig =
          OutputConfig.newBuilder().setGcsDestination(gcsDestination).build();

      ExportDataRequest exportDataRequest =
          ExportDataRequest.newBuilder()
              .setName(datasetName)
              .setOutputConfig(outputConfig)
              .setAnnotatedDataset(annotatedDatasetName)
              .build();

      OperationFuture<ExportDataOperationResponse, ExportDataOperationMetadata> operation =
          dataLabelingServiceClient.exportDataAsync(exportDataRequest);

      ExportDataOperationResponse response = operation.get();

      System.out.format("Exported item count: %d\n", response.getExportCount());
      LabelStats labelStats = response.getLabelStats();
      Set<Entry<String, Long>> entries = labelStats.getExampleCountMap().entrySet();
      for (Entry<String, Long> entry : entries) {
        System.out.format("\tLabel: %s\n", entry.getKey());
        System.out.format("\tCount: %d\n\n", entry.getValue());
      }
    } catch (IOException | InterruptedException | ExecutionException e) {
      e.printStackTrace();
    }
  }
}