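After a labeling operation completes, you can call ExportData to export the annotated dataset to a Google Cloud Storage bucket.
ExportData can produce a .csv file in which each row corresponds to one annotation or one data item. The first field of each row is the ML use category, which defaults to UNASSIGNED. ExportData can also produce a jsonl file in which each line represents one example, consisting of a data item and all of its annotations. The following are examples of each type.
Image classification
csv row: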
UNASSIGNED,image_url,label_1,label_2,...
json line:
{ "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "imagePayload":{ "mimeType":"IMAGE_PNG", "imageUri":"gs://sample_bucket/image.png" }, "annotations":[ { "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationValue":{ "imageClassificationAnnotation":{ "annotationSpec":{ "displayName":"tulip" } } } } ] }
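As a rough sketch of how an exported image classification line might be consumed, the following Python function collects the image URI and the label display names from one jsonl line. The function name classification_labels is hypothetical, and the field names are taken from the example above rather than from a schema.

```python
import json


def classification_labels(jsonl_line):
    """Return (image_uri, [label, ...]) for one exported jsonl line."""
    example = json.loads(jsonl_line)
    uri = example.get("imagePayload", {}).get("imageUri")
    labels = []
    for annotation in example.get("annotations", []):
        value = annotation.get("annotationValue", {})
        spec = value.get("imageClassificationAnnotation", {}).get("annotationSpec", {})
        if "displayName" in spec:
            labels.append(spec["displayName"])
    return uri, labels


# A shortened version of the example line above.
sample = json.dumps({
    "imagePayload": {"mimeType": "IMAGE_PNG", "imageUri": "gs://sample_bucket/image.png"},
    "annotations": [
        {"annotationValue": {"imageClassificationAnnotation": {
            "annotationSpec": {"displayName": "tulip"}}}}
    ],
})
print(classification_labels(sample))
```

Using .get() with defaults keeps the function tolerant of lines that carry no annotations.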
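Image bounding box
csv row: Each row contains information about one bounding box, with each corner of the box given as x,y coordinates. Multiple boxes for a single image appear on separate rows. The row format is:
UNASSIGNED, image_url, label, topleft_x, topleft_y, topright_x, topright_y, bottomright_x, bottomright_y, bottomleft_x, bottomleft_y
The topright_x, topright_y, bottomleft_x, and bottomleft_y coordinates may be empty strings because they provide redundant information. For example:
UNASSIGNED,image_url,label,0.1,0.1,,,0.3,0.3,,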
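The row format above can be read with Python's csv module. The sketch below is illustrative only: the function name parse_bounding_box_row and the returned dictionary keys are hypothetical, and it assumes the box is axis-aligned so that the omitted top right and bottom left corners can be derived from the other two.

```python
import csv
import io


def parse_bounding_box_row(row_text):
    """Parse one bounding box csv row and fill in the omitted corners."""
    # skipinitialspace tolerates "a, b, c" as well as "a,b,c".
    fields = next(csv.reader(io.StringIO(row_text), skipinitialspace=True))
    ml_use, image_url, label = fields[:3]
    coords = [float(v) if v.strip() else None for v in fields[3:11]]
    coords += [None] * (8 - len(coords))  # pad rows that drop trailing empties
    tl_x, tl_y, tr_x, tr_y, br_x, br_y, bl_x, bl_y = coords
    # Assumption: the box is axis-aligned, so the missing corners follow
    # from the top left and bottom right corners.
    tr_x = br_x if tr_x is None else tr_x
    tr_y = tl_y if tr_y is None else tr_y
    bl_x = tl_x if bl_x is None else bl_x
    bl_y = br_y if bl_y is None else bl_y
    return {
        "ml_use": ml_use,
        "image_url": image_url,
        "label": label,
        "top_left": (tl_x, tl_y),
        "top_right": (tr_x, tr_y),
        "bottom_right": (br_x, br_y),
        "bottom_left": (bl_x, bl_y),
    }


box = parse_bounding_box_row("UNASSIGNED,image_url,label,0.1,0.1,,,0.3,0.3,,")
print(box["top_right"], box["bottom_left"])
```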
json line: If a coordinate in normalizedVertices is not set, that field defaults to 0. This also applies to any other coordinate-based annotation.
{ "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "imagePayload":{ "mimeType":"IMAGE_PNG", "imageUri":"gs://sample_bucket/image.png" }, "annotations":[ { "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationValue":{ "image_bounding_poly_annotation": { "annotationSpec": { "displayName": "tulip" }, "normalizedBoundingPoly": { "normalizedVertices": [ { "x": 0.1, "y": 0.2 }, { "x": 0.9, "y": 0.9 } ] } } } } ] }
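Because unset coordinates default to 0, a consumer of the jsonl output should treat a missing x or y as 0 rather than as an error. The following sketch converts normalizedVertices to a pixel box; the function name to_pixel_box is hypothetical, and the image width and height are inputs the caller must supply, since they are not part of the exported image example shown here.

```python
def to_pixel_box(normalized_vertices, width, height):
    """Return (left, top, right, bottom) in pixels for a list of vertices."""
    # A missing "x" or "y" key is read as 0, matching the documented default.
    xs = [vertex.get("x", 0.0) for vertex in normalized_vertices]
    ys = [vertex.get("y", 0.0) for vertex in normalized_vertices]
    return (
        round(min(xs) * width),
        round(min(ys) * height),
        round(max(xs) * width),
        round(max(ys) * height),
    )


# The first vertex omits "x", so it is treated as x = 0.
vertices = [{"y": 0.2}, {"x": 0.9, "y": 0.9}]
print(to_pixel_box(vertices, width=100, height=200))
```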
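Image bounding polygon, oriented bounding box, and polyline
csv row: Each point of the bounding polygon or polyline is given as an x,y pair, and consecutive points are separated by two empty csv columns. A polyline is not a closed loop, whereas for a polygon the last x,y pair connects back to the first x,y pair. Each row represents one polygon or polyline.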
UNASSIGNED,image_url,label,0.1,0.1,,,0.3,0.3,,,0.6,0.6,,...
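One way to read such a row is to drop the empty separator columns and pair up the remaining values. This is a minimal sketch, with a hypothetical function name, that assumes every non-empty column after the label is a coordinate.

```python
import csv
import io


def parse_polygon_row(row_text):
    """Return (label, [(x, y), ...]) for one polygon or polyline csv row."""
    fields = next(csv.reader(io.StringIO(row_text)))
    label = fields[2]
    # Empty columns only separate points, so they can be skipped.
    values = [float(field) for field in fields[3:] if field.strip()]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of coordinate values")
    points = list(zip(values[0::2], values[1::2]))
    return label, points


label, points = parse_polygon_row(
    "UNASSIGNED,image_url,label,0.1,0.1,,,0.3,0.3,,,0.6,0.6,,"
)
print(label, points)
```

Whether the points form a closed polygon or an open polyline is not recorded in the row itself, so the caller needs to know which annotation type was exported.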
json line:
{ "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "imagePayload":{ "mimeType":"IMAGE_PNG", "imageUri":"gs://sample_bucket/image.png" }, "annotations":[ { "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationValue":{ "image_bounding_poly_annotation": { "annotationSpec": { "displayName": "tulip" }, "normalizedBoundingPoly": { "normalizedVertices": [ { "x": 0.1, "y": 0.1 }, { "x": 0.1, "y": 0.2 }, { "x": 0.2, "y": 0.3 } ] } } } } ] }
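Image segmentation
For image segmentation, only jsonl output is available.
- json line: The imageBytes field in imageSegmentationAnnotation holds the segmentation mask for the image. The color for each label (dog and cat in this example) is listed in the annotationColors field.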
{ "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "imagePayload":{ "mimeType":"IMAGE_PNG", "imageUri":"gs://sample_bucket/image.png" }, "annotations":[ { "name":"projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationValue":{ "imageSegmentationAnnotation": { "annotationColors": [ { "key": "rgb(0,0,255)", "value": { "display_name": "dog" } }, { "key": "rgb(0,255,0)", "value": { "display_name": "cat" } } ], "mimeType": "IMAGE_JPEG", "imageBytes": "/9j/4AAQSkZJRgABAQAAAQABAAD/2" } } } ] }
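The sketch below shows one possible way to read the color mapping and the mask from an imageSegmentationAnnotation object. The function names are hypothetical. It assumes that imageBytes is base64-encoded, which is the usual JSON representation of a bytes field, and that color keys always have the rgb(r,g,b) form shown above. The sample uses a four-byte stand-in for the mask because the value in the example line is truncated.

```python
import base64
import re

RGB_PATTERN = re.compile(r"rgb\((\d+),\s*(\d+),\s*(\d+)\)")


def color_map(segmentation):
    """Map (r, g, b) tuples to label names."""
    mapping = {}
    for entry in segmentation.get("annotationColors", []):
        match = RGB_PATTERN.fullmatch(entry["key"])
        if match:
            rgb = tuple(int(channel) for channel in match.groups())
            mapping[rgb] = entry["value"]["display_name"]
    return mapping


def mask_bytes(segmentation):
    """Decode the encoded mask image (assumed to be base64) to raw bytes."""
    return base64.b64decode(segmentation["imageBytes"])


segmentation = {
    "annotationColors": [
        {"key": "rgb(0,0,255)", "value": {"display_name": "dog"}},
        {"key": "rgb(0,255,0)", "value": {"display_name": "cat"}},
    ],
    "mimeType": "IMAGE_JPEG",
    "imageBytes": "/9j/4A==",  # stand-in: only the first four bytes of a JPEG
}
print(color_map(segmentation))
print(mask_bytes(segmentation)[:2])
```

The decoded bytes are an encoded image, so an image library is still needed to turn them into a pixel array before colors can be matched against the map.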
Video classification
csv row:
UNASSIGNED,video_url,label,segment_start_time,segment_end_time
json line:
{ "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "videoPayload": { "mimeType": "VIDEO_MP4", "resolution": { "width": 720, "height": 360 }, "frameRate": 24 }, "annotations": [ { "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationSource": 3, "annotationValue": { "videoClassificationAnnotation": { "timeSegment": { "startTimeOffset": { "seconds": 10 }, "endTimeOffset": { "seconds": 20 } }, "annotationSpec": { "displayName": "dog" } } } } ] }
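Video object detection
csv row: The four points are the top left, top right, bottom right, and bottom left corners, in that order. The second and fourth points are optional. Each point is given as x,y. Each row contains one bounding box.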
UNASSIGNED,video_url,label,timestamp,0.1,0.1,,,0.3,0.3,,
json line:
{ "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "videoPayload": { "mimeType": "VIDEO_MP4", "resolution": { "width": 720, "height": 360 }, "frameRate": 24 }, "annotations": [ { "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationSource": 3, "annotationValue": { "videoObjectTrackingAnnotation": { "annotationSpec": { "displayName": "tulip" }, "timeSegment": { "startTimeOffset": { "seconds": 10 }, "endTimeOffset": { "seconds": 10 } }, "objectTrackingFrames": [ { "normalizedBoundingPoly": { "normalizedVertices": [ { "x": 0.2, "y": 0.3 }, { "x": 0.9, "y": 0.5 } ] } }, { "normalizedBoundingPoly": { "normalizedVertices": [ { "x": 0.3, "y": 0.3 }, { "x": 0.5, "y": 0.7 } ] } } ] } } } ] }
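Video object tracking
csv row: The four points are the top left, top right, bottom right, and bottom left corners, in that order. The second and fourth points are optional. Each point is given as x,y. Each row contains one bounding box. Each object in the video is identified by a unique instance_id.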
UNASSIGNED,video_url,label,instance_id,timestamp,0.1,0.1,,,0.3,0.3,,
json line:
{ "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "videoPayload": { "mimeType": "VIDEO_MP4", "resolution": { "width": 720, "height": 360 }, "frameRate": 24 }, "annotations": [ { "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationSource": 3, "annotationValue": { "videoObjectTrackingAnnotation": { "annotationSpec": { "displayName": "tulip" }, "timeSegment": { "startTimeOffset": { "seconds": 10 }, "endTimeOffset": { "seconds": 20 } }, "objectTrackingFrames": [ { "normalizedBoundingPoly": { "normalizedVertices": [ { "x": 0.2, "y": 0.3 }, { "x": 0.9, "y": 0.5 } ] }, "timeOffset": { "nanos": 1000000 } }, { "normalizedBoundingPoly": { "normalizedVertices": [ { "x": 0.3, "y": 0.3 }, { "x": 0.5, "y": 0.7 } ] }, "timeOffset": { "nanos": 84000000 } } ] } } }]}
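The timeOffset objects in the tracking frames carry seconds and nanos separately, and either part may be absent. As an illustrative sketch with hypothetical function names, the following code turns each frame's offset into seconds as a float, assuming the object form shown in the example above.

```python
def offset_seconds(offset):
    """Convert {"seconds": ..., "nanos": ...} to float seconds."""
    # int() also accepts values that arrive as strings.
    return int(offset.get("seconds", 0)) + int(offset.get("nanos", 0)) / 1e9


def frame_times(annotation_value):
    """Return the time offset, in seconds, of every tracking frame."""
    tracking = annotation_value["videoObjectTrackingAnnotation"]
    return [
        offset_seconds(frame.get("timeOffset", {}))
        for frame in tracking.get("objectTrackingFrames", [])
    ]


annotation_value = {
    "videoObjectTrackingAnnotation": {
        "objectTrackingFrames": [
            {"timeOffset": {"nanos": 1000000}},
            {"timeOffset": {"nanos": 84000000}},
            {"timeOffset": {"seconds": 2, "nanos": 500000000}},
        ]
    }
}
print(frame_times(annotation_value))
```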
Video event
csv row: Each row describes one labeled event, together with the start and end times of its video segment.
UNASSIGNED,video_url,label,segment_start_time,segment_end_time
json line:
{ "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "videoPayload": { "mimeType": "VIDEO_MP4", "resolution": { "width": 720, "height": 360 }, "frameRate": 24 }, "annotations": [ { "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/annotation_id", "annotationValue": { "videoEventAnnotation": { "annotationSpec": { "displayName": "Callie" }, "timeSegment": { "startTimeOffset": { "seconds": 123 }, "endTimeOffset": { "seconds": 150 } } } } } ] }
Text classification
csv row:
UNASSIGNED,text_url,label_1
json line:
{ "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "textPayload": { "textContent": "dummy_text_content", "textUri": "gs://test_bucket/file.txt", "wordCount": 1 }, "annotations": [ { "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/fake_annotation_id", "annotationValue": { "textClassificationAnnotation": { "annotationSpec": { "displayName": "news" } } } } ] }
Text entity extraction
For text entity extraction, only jsonl output is available.
- json line:
{ "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id", "textPayload": { "textContent": "dummy_text_content", "textUri": "gs://test_bucket/file.txt", "wordCount": 1 }, "annotations": [ { "name": "projects/project_id/datasets/dataset_id/annotatedDatasets/annotated_dataset_id/examples/example_id/annotations/fake_annotation_id", "annotationValue": { "textEntityExtractionAnnotation": { "annotationSpec": { "displayName": "equations" }, "textSegment": { "startOffset": 10, "endOffset": 20 } } } } ] }
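To recover the annotated text itself, the offsets in textSegment can be applied to textContent. The sketch below is hedged on one point that the example does not settle: it assumes startOffset and endOffset are character offsets and that endOffset is exclusive. The function name entity_spans and the sample text are made up for illustration.

```python
def entity_spans(example):
    """Return [(label, text_span), ...] for one parsed jsonl example."""
    text = example["textPayload"]["textContent"]
    spans = []
    for annotation in example.get("annotations", []):
        entity = annotation.get("annotationValue", {}).get(
            "textEntityExtractionAnnotation"
        )
        if not entity:
            continue
        segment = entity["textSegment"]
        # Assumption: character offsets, with endOffset exclusive.
        start = int(segment.get("startOffset", 0))
        end = int(segment["endOffset"])
        spans.append((entity["annotationSpec"]["displayName"], text[start:end]))
    return spans


example = {
    "textPayload": {"textContent": "see also: E = mc^2 !"},
    "annotations": [
        {"annotationValue": {"textEntityExtractionAnnotation": {
            "annotationSpec": {"displayName": "equations"},
            "textSegment": {"startOffset": 10, "endOffset": 18},
        }}}
    ],
}
print(entity_spans(example))
```

If the offsets turn out to be byte offsets or to use an inclusive end, only the slicing line needs to change.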
ExportData is a long-running operation. The API returns an operation ID. You can later call GetOperation with that operation ID to get the operation's status.
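A polling loop for a long-running operation might look like the following sketch. The get_operation argument is a hypothetical callable that wraps whatever GetOperation call you use and returns the operation as a dictionary; the done and error fields are those of a standard long-running operation resource. The fake callable at the end only stands in for a real API call so that the loop can be run locally.

```python
import time


def wait_for_operation(get_operation, operation_name,
                       interval_seconds=5.0, max_polls=60):
    """Poll until the operation reports done, then return it."""
    for _ in range(max_polls):
        operation = get_operation(operation_name)
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"export failed: {operation['error']}")
            return operation
        time.sleep(interval_seconds)
    raise TimeoutError(f"{operation_name} did not finish after {max_polls} polls")


# Stand-in for a real GetOperation call: reports done on the third poll.
_responses = iter([{}, {}, {"done": True, "response": {}}])


def fake_get_operation(name):
    return next(_responses)


result = wait_for_operation(
    fake_get_operation,
    "projects/project_id/operations/operation_id",
    interval_seconds=0,
)
print(result["done"])
```

Passing the fetch function in as a parameter keeps the loop independent of any particular client library or HTTP stack.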
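Web UI
To export labeled data using the Data Labeling Service UI, follow these steps.
Open the Data Labeling Service UI in the Google Cloud console.
The Datasets page shows the status of the datasets previously created for the current project.
Click the name of the dataset you want to export. The Dataset detail page opens.
In the Labeled datasets section, click Export in the Export status column.
In the Export labeled dataset dialog, enter the Cloud Storage path to use for the output file and select the file format you want.
Click Export.
The Dataset detail page shows an in-progress status while the data is being exported. After the export completes, you can find the export file in the specified Cloud Storage path.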
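Command line
Set the following environment variables:
- Set the PROJECT_ID variable to your Google Cloud project ID.
- Set the DATASET_ID variable to the ID of your dataset, taken from the response you received when you created the dataset. The ID appears at the end of the full dataset name: projects/PROJECT_ID/locations/us-central1/datasets/DATASET_ID
- Set the ANNOTATED_DATASET_ID variable to the ID from the annotated dataset's resource name. The resource name has the following format: projects/PROJECT_ID/locations/us-central1/datasets/DATASET_ID/annotatedDatasets/ANNOTATED_DATASET_ID
- Set the STORAGE_URI variable to the URI of the Cloud Storage bucket where you want to store the results.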
For all annotation requests except image segmentation, the curl request looks similar to the following:
# The JSON body is double-quoted so that the shell expands the variables.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  "https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}:exportData" \
  -d "{
    \"annotatedDataset\": \"${ANNOTATED_DATASET_ID}\",
    \"outputConfig\": {
      \"gcsDestination\": {
        \"output_uri\": \"${STORAGE_URI}\",
        \"mimeType\": \"text/csv\"
      }
    }
  }"
To export image segmentation data, the curl request should look similar to the following:
# The JSON body is double-quoted so that the shell expands the variables.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  "https://datalabeling.googleapis.com/v1beta1/projects/${PROJECT_ID}/datasets/${DATASET_ID}:exportData" \
  -d "{
    \"annotatedDataset\": \"${ANNOTATED_DATASET_ID}\",
    \"outputConfig\": {
      \"gcsFolderDestination\": {
        \"output_folder_uri\": \"${STORAGE_URI}\"
      }
    }
  }"
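The two request bodies can also be assembled programmatically, which avoids shell quoting altogether. The following sketch only builds the URL and JSON body and does not send anything. The function name build_export_request and its parameters are hypothetical; the endpoint and field names are copied from the curl requests above.

```python
import json

API_ROOT = "https://datalabeling.googleapis.com/v1beta1"


def build_export_request(project_id, dataset_id, annotated_dataset_id,
                         storage_uri, image_segmentation=False):
    """Return (url, json_body) for an exportData call."""
    url = f"{API_ROOT}/projects/{project_id}/datasets/{dataset_id}:exportData"
    if image_segmentation:
        # Image segmentation exports go to a folder destination.
        output_config = {
            "gcsFolderDestination": {"output_folder_uri": storage_uri}
        }
    else:
        output_config = {
            "gcsDestination": {"output_uri": storage_uri, "mimeType": "text/csv"}
        }
    body = {"annotatedDataset": annotated_dataset_id, "outputConfig": output_config}
    return url, json.dumps(body)


url, body = build_export_request(
    "project_id", "dataset_id", "annotated_dataset_id", "gs://sample_bucket/out.csv"
)
print(url)
print(body)
```

The returned pair can be handed to any HTTP client together with the same Authorization and Content-Type headers used in the curl requests.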
You should see output similar to the following:
{ "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064", "metadata": { "@type": "type.googleapis.com/google.cloud.data-labeling.v1beta1.ExportDataOperationResponse", "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c" } }
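To follow up on the export, the operation name in this response is what you would keep for a later status check. As a small sketch with a hypothetical function name, the following code extracts both the full operation name and its trailing ID from the response text.

```python
import json


def operation_reference(response_text):
    """Return (operation_name, operation_id) from an exportData response."""
    name = json.loads(response_text)["name"]
    # The ID is the last path segment of the operation name.
    return name, name.rsplit("/", 1)[-1]


response_text = json.dumps({
    "name": "projects/data-labeling-codelab/operations/5c73dd6b_0000_2b34_a920_883d24fa2064",
    "metadata": {
        "dataset": "projects/data-labeling-codelab/datasets/5c73db3d_0000_23e0_a25b_94eb2c119c4c"
    },
})
print(operation_reference(response_text))
```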