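Use managed datasets

This page shows how to use Vertex AI managed datasets to train custom models. Managed datasets offer the following benefits:

Manage your datasets in a central location.
Create labels and multiple annotation sets.
Create tasks for human labeling by using integrated data labeling.
Track lineage to models for governance and iterative development.
Compare model performance by training AutoML and custom models on the same dataset.
Generate data statistics and visualizations.
Automatically split data into training, test, and validation sets.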

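Before you begin

Before you can use a managed dataset in your training application, you must create the dataset. Create the dataset and the training pipeline that you use for training in the same region. That region must also be one where Dataset resources are available.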

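Access a dataset from your training application

When you create a custom training pipeline, you can specify that your training application uses a Vertex AI dataset.

At runtime, Vertex AI passes metadata about your dataset to your training application by setting the following environment variables in your training container.

AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values are jsonl, csv, or bigquery.
AIP_TRAINING_DATA_URI: The BigQuery URI of your training data or the Cloud Storage URI of your training data files.
AIP_VALIDATION_DATA_URI: The BigQuery URI of your validation data or the Cloud Storage URI of your validation data files.
AIP_TEST_DATA_URI: The BigQuery URI of your test data or the Cloud Storage URI of your test data files.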
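
As a minimal sketch of how a training application might read these variables, the following Python collects them into one record. The names DatasetInfo and read_dataset_info are hypothetical helpers introduced for illustration; only the four environment variable names and the three format values come from this page.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DatasetInfo(NamedTuple):
    """Hypothetical container for the dataset metadata variables."""
    data_format: Optional[str]
    training_uri: Optional[str]
    validation_uri: Optional[str]
    test_uri: Optional[str]


def read_dataset_info(env: Mapping[str, str] = os.environ) -> DatasetInfo:
    """Read the AIP_* dataset variables; missing variables become None."""
    data_format = env.get("AIP_DATA_FORMAT")
    # The page lists exactly these three values for AIP_DATA_FORMAT.
    if data_format is not None and data_format not in {"jsonl", "csv", "bigquery"}:
        raise ValueError(f"Unexpected AIP_DATA_FORMAT: {data_format!r}")
    return DatasetInfo(
        data_format=data_format,
        training_uri=env.get("AIP_TRAINING_DATA_URI"),
        validation_uri=env.get("AIP_VALIDATION_DATA_URI"),
        test_uri=env.get("AIP_TEST_DATA_URI"),
    )
```

Passing the environment as a parameter keeps the function easy to exercise outside a training container, for example with a plain dictionary.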

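If the AIP_DATA_FORMAT of your dataset is jsonl or csv, the data URI values are Cloud Storage URIs, such as gs://bucket_name/path/training-*. To keep each data file relatively small, Vertex AI splits your dataset into multiple files. Because your training, validation, or test data can be split across multiple files, the URIs are provided in wildcard format.

Learn more about downloading objects by using the Cloud Storage code samples.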
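
The wildcard handling can be sketched with the standard library alone. The helper names below are hypothetical, and the object names in the usage note are made up; in a real application the list of object names would come from a Cloud Storage listing call, which is not shown here.

```python
import fnmatch
from typing import Iterable, List, Tuple


def split_gcs_pattern(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/training-*' into (bucket, object pattern)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    bucket, _, pattern = uri[len("gs://"):].partition("/")
    return bucket, pattern


def list_prefix(pattern: str) -> str:
    """Return the literal part of the pattern before the first '*'.

    This prefix is what you would typically pass to a listing call to
    narrow the results before matching the full pattern.
    """
    return pattern.split("*", 1)[0]


def matching_objects(pattern: str, object_names: Iterable[str]) -> List[str]:
    """Keep only the object names that match the wildcard pattern."""
    return sorted(n for n in object_names if fnmatch.fnmatchcase(n, pattern))
```

For example, split_gcs_pattern("gs://bucket_name/path/training-*") gives the bucket bucket_name and the pattern path/training-*, and matching_objects then filters a listing down to the training shards.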

If your AIP_DATA_FORMAT is bigquery, the data URI values are BigQuery URIs, such as bq://project.dataset.table.

Learn more about paging through BigQuery data.
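
Before querying, a training application usually needs the three parts of the BigQuery URI. The following is a small sketch; BigQueryTable and parse_bq_uri are hypothetical names, and splitting from the right is an assumption made so that a project ID containing dots would still parse.

```python
from typing import NamedTuple


class BigQueryTable(NamedTuple):
    """Hypothetical holder for the parts of a bq:// URI."""
    project: str
    dataset: str
    table: str


def parse_bq_uri(uri: str) -> BigQueryTable:
    """Parse 'bq://project.dataset.table' into its three components."""
    if not uri.startswith("bq://"):
        raise ValueError(f"Not a BigQuery URI: {uri!r}")
    # Split from the right: the last two parts are dataset and table.
    parts = uri[len("bq://"):].rsplit(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected bq://project.dataset.table, got {uri!r}")
    return BigQueryTable(*parts)
```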

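Dataset format

The following sections describe how Vertex AI formats your data when it passes a dataset to your training application.

Image datasets

Image datasets are passed to your training application in JSON Lines format. See the subsection for your dataset's objective to learn how Vertex AI formats the dataset.

Single-label classification

Vertex AI uses the following publicly accessible schema when exporting a single-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.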



{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotation": {
    "displayName": "LABEL",
    "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
   },
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

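Field notes:

imageGcsUri: The Cloud Storage URI of this image.
annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines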



{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotation": {"displayName": "daisy"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotation": {"displayName": "dandelion"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotation": {"displayName": "roses"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotation": {"displayName": "sunflowers"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotation": {"displayName": "tulips"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
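
One way a training application might consume these lines is to group (image URI, label) pairs by their ml_use value. This is only a sketch: group_by_ml_use is a hypothetical helper, and the "unassigned" fallback for items without an ml_use label is an assumption, not documented behavior.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def group_by_ml_use(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (imageGcsUri, label) pairs by the item's ml_use value."""
    splits: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        item = json.loads(line)
        # Assumption: items without an ml_use label go to "unassigned".
        ml_use = item.get("dataItemResourceLabels", {}).get(ML_USE_KEY, "unassigned")
        label = item["classificationAnnotation"]["displayName"]
        splits[ml_use].append((item["imageGcsUri"], label))
    return dict(splits)
```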

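Multi-label classification

Vertex AI uses the following publicly accessible schema when exporting a multi-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_multi_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.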


{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotations": [
    {
      "displayName": "LABEL1",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "flower_type"
      }
    },
    {
      "displayName": "LABEL2",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "image_shot_type"
      }
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

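Field notes:

imageGcsUri: The Cloud Storage URI of this image.
annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines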



{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotations": [{"displayName": "dandelion"}, {"displayName": "medium_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotations": [{"displayName": "roses"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotations": [{"displayName": "sunflowers"}, {"displayName": "closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotations": [{"displayName": "tulips"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
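
For multi-label training, each item's labels are often turned into a multi-hot vector. The sketch below assumes you supply the label vocabulary yourself; multi_hot is a hypothetical helper, and labels outside the vocabulary are silently ignored by design.

```python
import json
from typing import List, Sequence


def multi_hot(line: str, vocabulary: Sequence[str]) -> List[int]:
    """Encode the displayName values of one JSON line as a 0/1 vector.

    The vector has one position per vocabulary entry, in vocabulary order.
    """
    item = json.loads(line)
    names = {a["displayName"] for a in item.get("classificationAnnotations", [])}
    return [1 if label in names else 0 for label in vocabulary]
```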

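Object detection

Vertex AI uses the following publicly accessible schema when exporting an object detection dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_bounding_box_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.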



{
  "imageGcsUri": "gs://bucket/filename.ext",
  "boundingBoxAnnotations": [
    {
      "displayName": "OBJECT1_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
    },
    {
      "displayName": "OBJECT2_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX"
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "test/train/validation"
  }
}

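Field notes:

imageGcsUri: The Cloud Storage URI of this image.
annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines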



{"imageGcsUri": "gs://bucket/filename1.jpeg", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.3", "yMin": "0.3", "xMax": "0.7", "yMax": "0.6"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.8", "yMin": "0.2", "xMax": "1.0", "yMax": "0.4"},{"displayName": "Salad", "xMin": "0.0", "yMin": "0.0", "xMax": "1.0", "yMax": "1.0"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png", "boundingBoxAnnotations": [{"displayName": "Baked goods", "xMin": "0.5", "yMin": "0.7", "xMax": "0.8", "yMax": "0.8"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.tiff", "boundingBoxAnnotations": [{"displayName": "Salad", "xMin": "0.1", "yMin": "0.2", "xMax": "0.8", "yMax": "0.9"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
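
The coordinates in these examples are strings between 0.0 and 1.0. Assuming they are fractions of the image width and height (an assumption based on the sample values, not stated on this page), a training application could convert them to pixel boxes as in this sketch. to_pixel_boxes is a hypothetical helper, and the image size must come from the image itself.

```python
import json
from typing import List, Tuple

PixelBox = Tuple[str, int, int, int, int]


def to_pixel_boxes(line: str, width: int, height: int) -> List[PixelBox]:
    """Convert string-valued box coordinates to integer pixel coordinates.

    Assumes xMin/xMax are fractions of the width and yMin/yMax are
    fractions of the height. Returns (label, x_min, y_min, x_max, y_max).
    """
    item = json.loads(line)
    boxes: List[PixelBox] = []
    for ann in item.get("boundingBoxAnnotations", []):
        boxes.append((
            ann["displayName"],
            round(float(ann["xMin"]) * width),
            round(float(ann["yMin"]) * height),
            round(float(ann["xMax"]) * width),
            round(float(ann["yMax"]) * height),
        ))
    return boxes
```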

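Tabular datasets

Vertex AI passes tabular data to your training application in CSV format or as a URI to a BigQuery table or view. For more information about the data source format and requirements, see Preparing your import source. For more information about the dataset schema, refer to the dataset in the Google Cloud console.

Text datasets

Text datasets are passed to your training application in JSON Lines format. See the subsection for your dataset's objective to learn how Vertex AI formats the dataset.

Single-label classification

Vertex AI uses the following publicly accessible schema when exporting a single-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.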

{
  "classificationAnnotation": {
    "displayName": "label"
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotation": {
    "displayName": "label2"
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
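
As the two items above show, the text of an item arrives either inline (textContent) or as a Cloud Storage reference (textGcsUri). A reader can branch on which key is present, as in this sketch. text_source is a hypothetical helper, and fetching the referenced file is left out.

```python
import json
from typing import Tuple


def text_source(line: str) -> Tuple[str, str, str]:
    """Return (label, kind, value) for one exported JSON line.

    kind is "inline" when the text is embedded in the item and "gcs"
    when the item points to a file in Cloud Storage.
    """
    item = json.loads(line)
    label = item["classificationAnnotation"]["displayName"]
    if "textContent" in item:
        return label, "inline", item["textContent"]
    if "textGcsUri" in item:
        return label, "gcs", item["textGcsUri"]
    raise KeyError("item has neither textContent nor textGcsUri")
```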

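Multi-label classification

Vertex AI uses the following publicly accessible schema when exporting a multi-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.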

{
  "classificationAnnotations": [{
    "displayName": "label1"
    },{
    "displayName": "label2"
  }],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotations": [{
    "displayName": "label2"
    },{
    "displayName": "label3"
  }],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}

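Entity extraction

Vertex AI uses the following publicly accessible schema when exporting an entity extraction dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.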

{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textContent": "inline_text",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textGcsUri": "gcs_uri_to_file",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
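
To illustrate how the offsets relate to the text, the sketch below slices each annotated segment out of an inline item. It assumes that startOffset and endOffset are zero-based character indexes and that endOffset is exclusive; verify this against the schema file before relying on it. extract_entities is a hypothetical helper, and items that use textGcsUri would need the file content loaded first.

```python
import json
from typing import List, Tuple


def extract_entities(line: str) -> List[Tuple[str, str]]:
    """Return (label, segment text) pairs for an item with inline text.

    Assumption: offsets are zero-based character indexes and the
    character at endOffset is not part of the segment.
    """
    item = json.loads(line)
    text = item["textContent"]
    return [
        (ann["displayName"], text[ann["startOffset"]:ann["endOffset"]])
        for ann in item.get("textSegmentAnnotations", [])
    ]
```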

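Sentiment analysis

Vertex AI uses the following publicly accessible schema when exporting a sentiment analysis dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_text_sentiment_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.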

{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
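
Because each item carries both sentiment and sentimentMax, one possible preprocessing step is to rescale the score to the range 0 to 1 so that datasets with different maximums are comparable. This is a sketch under the assumption that sentiment ranges from 0 to sentimentMax; normalized_sentiment is a hypothetical helper.

```python
import json


def normalized_sentiment(line: str) -> float:
    """Rescale an item's sentiment score to the range [0, 1].

    Assumption: sentiment is a number between 0 and sentimentMax.
    """
    annotation = json.loads(line)["sentimentAnnotation"]
    sentiment = annotation["sentiment"]
    sentiment_max = annotation["sentimentMax"]
    if sentiment_max <= 0:
        raise ValueError("sentimentMax must be positive")
    if not 0 <= sentiment <= sentiment_max:
        raise ValueError("sentiment must be between 0 and sentimentMax")
    return sentiment / sentiment_max
```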

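Video datasets

Video datasets are passed to your training application in JSON Lines format. See the subsection for your dataset's objective to learn how Vertex AI formats the dataset.

Action recognition

Vertex AI uses the following publicly accessible schema when exporting an action recognition dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.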



{
  "videoGcsUri": "gs://bucket/filename.ext",
  "timeSegments": [{
    "startTime": "start_time_of_fully_annotated_segment",
    "endTime": "end_time_of_segment"}],
  "timeSegmentAnnotations": [{
    "displayName": "LABEL",
    "startTime": "start_time_of_segment",
    "endTime": "end_time_of_segment"
  }],
  "dataItemResourceLabels": {
    "ml_use": "train|test"
  }
}

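Note: The time segments here are used to calculate the timestamps of the actions. The startTime and endTime of timeSegmentAnnotations can be equal, in which case they correspond to the key frame of the action.

Example JSON Lines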



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}}
...
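
The time values in these lines are strings such as "1.0s". A sketch of parsing them into seconds and flagging key-frame annotations (where startTime equals endTime) follows. parse_seconds and action_segments are hypothetical helpers, and the trailing "s" format is inferred from the examples above.

```python
import json
from typing import List, Tuple


def parse_seconds(value: str) -> float:
    """Convert a string such as '12.0s' to a float number of seconds."""
    if not value.endswith("s"):
        raise ValueError(f"Expected a value ending in 's', got {value!r}")
    return float(value[:-1])


def action_segments(line: str) -> List[Tuple[str, float, float, bool]]:
    """Return (label, start, end, is_key_frame) for each annotation."""
    item = json.loads(line)
    result = []
    for ann in item.get("timeSegmentAnnotations", []):
        start = parse_seconds(ann["startTime"])
        end = parse_seconds(ann["endTime"])
        # Equal start and end mark a single key frame of the action.
        result.append((ann["displayName"], start, end, start == end))
    return result
```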

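Classification

Vertex AI uses the following publicly accessible schema when exporting a video classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.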



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"timeSegmentAnnotations": [{
		"displayName": "LABEL",
		"startTime": "start_time_of_segment",
		"endTime": "end_time_of_segment"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSON Lines - Video classification



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

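Object tracking

Vertex AI uses the following publicly accessible schema when exporting an object tracking dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example contains line breaks for readability.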



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"TemporalBoundingBoxAnnotations": [{
		"displayName": "LABEL",
		"xMin": "leftmost_coordinate_of_the_bounding box",
		"xMax": "rightmost_coordinate_of_the_bounding box",
		"yMin": "topmost_coordinate_of_the_bounding box",
		"yMax": "bottommost_coordinate_of_the_bounding box",
		"timeOffset": "timeframe_object-detected",
		"instanceId": "instance_of_object",
		"annotationResourceLabels": "resource_labels"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSON Lines



{"videoGcsUri": "gs://demo-data/video1.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...
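
For tracking, a training application typically wants the boxes of each object instance in time order. The sketch below groups annotations by instance and sorts them by time offset. Because the format description and the example lines on this page spell the keys differently (camelCase and snake_case), the code accepts either spelling as a defensive assumption; tracks_by_instance and _first are hypothetical helpers.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Tuple

Box = Tuple[float, float, float, float]


def _first(mapping: Mapping[str, Any], *keys: str) -> Any:
    """Return the value of the first key that is present in mapping."""
    for key in keys:
        if key in mapping:
            return mapping[key]
    raise KeyError(f"none of {keys} found")


def tracks_by_instance(line: str) -> Dict[str, List[Tuple[float, Box]]]:
    """Map each instance ID to its (time offset, box) pairs, sorted by time."""
    item = json.loads(line)
    annotations = _first(
        item, "TemporalBoundingBoxAnnotations", "temporal_bounding_box_annotations"
    )
    tracks: Dict[str, List[Tuple[float, Box]]] = defaultdict(list)
    for ann in annotations:
        instance = _first(ann, "instanceId", "instance_id")
        # Offsets look like "4.000000s"; drop the unit before converting.
        offset = float(_first(ann, "timeOffset", "time_offset").rstrip("s"))
        box = tuple(float(ann[k]) for k in ("xMin", "yMin", "xMax", "yMax"))
        tracks[instance].append((offset, box))
    for frames in tracks.values():
        frames.sort()
    return dict(tracks)
```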

What's next

Learn how to use a managed dataset for custom training by creating a training pipeline.