Create a Dataproc Spark-enabled instance

This page describes how to create a Vertex AI Workbench instance with Dataproc Spark support. It also describes the benefits of the Dataproc JupyterLab extension and gives an overview of how to use the extension with Serverless for Apache Spark and Dataproc on Compute Engine.

Overview of the Dataproc JupyterLab extension

Vertex AI Workbench instances have the Dataproc JupyterLab extension preinstalled, as of version M113 and later.

The Dataproc JupyterLab extension provides two ways to run Apache Spark notebook jobs: Dataproc clusters and Google Cloud Serverless for Apache Spark.

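Dataproc clusters include a rich set of features that give you control over the infrastructure that Spark runs on. You choose the size and configuration of your Spark cluster, so you can customize and control your environment. This approach suits complex workloads, long-running jobs, and fine-grained resource management.
Serverless for Apache Spark removes infrastructure concerns. You submit your Spark jobs, and Google handles the provisioning, scaling, and optimization of resources behind the scenes. This serverless approach offers a cost-efficient option for data science and ML workloads.

Both options let you use Spark for data processing and analysis. The choice between Dataproc clusters and Serverless for Apache Spark depends on your specific workload requirements, the level of control that you need, and your resource usage patterns.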

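Benefits of using Serverless for Apache Spark for data science and ML workloads include:

No cluster management: You don't need to provision, configure, or manage Spark clusters. This saves you time and resources.
Autoscaling: Serverless for Apache Spark automatically scales up and down based on the workload, so you pay only for the resources that you use.
High performance: Serverless for Apache Spark is optimized for performance and takes advantage of Google Cloud's infrastructure.
Integration with other Google Cloud technologies: Serverless for Apache Spark integrates with other Google Cloud products, such as BigQuery and Dataplex Universal Catalog.

For more information, see the Google Cloud Serverless for Apache Spark documentation.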

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Enable the Cloud Resource Manager, Dataproc, and Notebooks APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs
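
If you prefer the command line to the console, the same three APIs can probably be enabled with gcloud services enable. The following is a minimal sketch rather than an official procedure: the function name enable_required_apis is hypothetical, and the service names are assumed to be the standard endpoints for the Cloud Resource Manager, Dataproc, and Notebooks APIs.

```shell
# enable_required_apis PROJECT_ID
# Hypothetical helper: enables the three APIs this page requires.
# The service endpoint names are assumptions based on the API names above.
enable_required_apis() {
  project_id="$1"
  if [ -z "$project_id" ]; then
    echo "usage: enable_required_apis PROJECT_ID" >&2
    return 2
  fi
  gcloud services enable \
    cloudresourcemanager.googleapis.com \
    dataproc.googleapis.com \
    notebooks.googleapis.com \
    --project="$project_id"
}
```

For example, enable_required_apis my-project would enable the APIs on a project whose ID is my-project (a placeholder). The caller still needs the Service Usage Admin role described above.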

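Required roles

To ensure that the service account has the necessary permissions to run a notebook file on a Serverless for Apache Spark cluster or a Dataproc cluster, ask your administrator to grant the service account the following IAM roles:

Dataproc Worker (roles/dataproc.worker) on your project
Dataproc Editor (roles/dataproc.editor) on the cluster, for the dataproc.clusters.use permission

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run a notebook file on a Serverless for Apache Spark cluster or a Dataproc cluster. To see the exact permissions that are required, see the Required permissions section.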

Required permissions

The following permissions are required to run a notebook file on a Serverless for Apache Spark cluster or a Dataproc cluster:

dataproc.agents.create
dataproc.agents.delete
dataproc.agents.get
dataproc.agents.update
dataproc.tasks.lease
dataproc.tasks.listInvalidatedLeases
dataproc.tasks.reportStatus
dataproc.clusters.use

Your administrator might also be able to grant the service account these permissions with custom roles or other predefined roles.
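
An administrator could grant the two predefined roles from the command line with gcloud projects add-iam-policy-binding. This is a sketch under stated assumptions: the function name grant_dataproc_roles and its arguments are hypothetical, and it grants Dataproc Editor at the project level for simplicity, which is broader than the cluster-level grant that this page describes.

```shell
# grant_dataproc_roles PROJECT_ID SERVICE_ACCOUNT_EMAIL
# Hypothetical helper: binds both predefined roles to the service account.
# Assumption: both roles are granted on the project, not on a single cluster.
grant_dataproc_roles() {
  project_id="$1"
  sa_email="$2"
  for role in roles/dataproc.worker roles/dataproc.editor; do
    gcloud projects add-iam-policy-binding "$project_id" \
      --member="serviceAccount:${sa_email}" \
      --role="$role" || return 1
  done
}
```

If project-wide Dataproc Editor access is too broad for your environment, a custom role that contains only the permissions listed above is the narrower alternative.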

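Create an instance with Dataproc enabled

To create a Vertex AI Workbench instance with Dataproc enabled, do the following:

In the Google Cloud console, go to the Instances page.

Go to Instances
Click Create new.
In the New instance dialog, click Advanced options.
In the Create instance dialog, in the Details section, make sure that Enable Dataproc Serverless Interactive Sessions is selected.
Make sure that Workbench type is set to Instance.
In the Environment section, make sure that you use the latest version or a version numbered M113 or higher.
Click Create.

Vertex AI Workbench creates an instance and automatically starts it. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link.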

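Open JupyterLab

Next to your instance's name, click Open JupyterLab.

The JupyterLab Launcher tab opens in your browser. By default, it contains sections for Serverless for Apache Spark Notebooks and Dataproc Jobs and Sessions. If there are Jupyter-ready clusters in the selected project and region, there is also a section called Dataproc Cluster Notebooks.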

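Use the extension with Serverless for Apache Spark

Serverless for Apache Spark runtime templates that are in the same region and project as your Vertex AI Workbench instance appear in the Serverless for Apache Spark Notebooks section of the JupyterLab Launcher tab.

To create a runtime template, see Create a Serverless for Apache Spark runtime template.

To open a new serverless Spark notebook, click a runtime template. It takes about a minute for the remote Spark kernel to start. After the kernel starts, you can start coding.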

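Use the extension with Dataproc on Compute Engine

If you created a Dataproc on Compute Engine Jupyter cluster, the Launcher tab has a Dataproc Cluster Notebooks section.

Four cards appear for each Jupyter-ready Dataproc cluster that you have access to in that region and project.

To change the region and project, do the following:

Select Settings > Cloud Dataproc Settings.
On the Setup Config tab, under Project Info, change the Project ID and Region, and then click Save.

These changes don't take effect until you restart JupyterLab.
To restart JupyterLab, select File > Shut Down, and then click Open JupyterLab on the Vertex AI Workbench instances page.

To create a new notebook, click a card. After the remote kernel on the Dataproc cluster starts, you can start writing your code and then run it on your cluster.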

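Manage Dataproc on an instance using the gcloud CLI and the API

This section describes ways to manage Dataproc on a Vertex AI Workbench instance.

Change the region of your Dataproc cluster

The default kernels of a Vertex AI Workbench instance, such as Python and TensorFlow, are local kernels that run on the instance's VM. On a Dataproc Spark-enabled Vertex AI Workbench instance, your notebook runs on a Dataproc cluster through a remote kernel. Because the remote kernel runs on a service outside of your instance's VM, it can access any Dataproc cluster within the same project.

By default, Vertex AI Workbench uses Dataproc clusters within the same region as your instance, but you can change the Dataproc region as long as the Component Gateway and the optional Jupyter component are enabled on the Dataproc cluster.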

Test access

The Dataproc JupyterLab extension is enabled by default for Vertex AI Workbench instances. To test access to Dataproc, you can check access to your instance's remote kernels by sending the following curl request to the kernels.googleusercontent.com domain:

curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://PROJECT_ID-dot-REGION.kernels.googleusercontent.com/api/kernelspecs | jq .

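If the curl command fails, check that the following are true:

Your DNS entries are configured correctly.
A cluster is available in the same project. If none exists, you need to create one.
Your cluster has both the Component Gateway and the optional Jupyter component enabled.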
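
The access test above can be wrapped in a small script so that it is easy to rerun after each fix. This is a sketch only: the function names kernelspecs_url and check_kernel_access are hypothetical, and the URL pattern is taken directly from the curl command on this page with PROJECT_ID and REGION passed as arguments.

```shell
# kernelspecs_url PROJECT_ID REGION
# Builds the kernelspecs endpoint URL using the pattern shown on this page.
kernelspecs_url() {
  printf 'https://%s-dot-%s.kernels.googleusercontent.com/api/kernelspecs\n' "$1" "$2"
}

# check_kernel_access PROJECT_ID REGION
# Hypothetical helper: returns 0 if the endpoint answers with a success status.
check_kernel_access() {
  url="$(kernelspecs_url "$1" "$2")"
  token="$(gcloud auth print-access-token)" || return 1
  # --fail makes curl exit with a non-zero status on HTTP error responses.
  curl --silent --fail --output /dev/null \
    -H "Authorization: Bearer ${token}" "$url"
}
```

A call such as check_kernel_access PROJECT_ID REGION && echo "remote kernels reachable" prints the message only when the request succeeds. A non-zero exit status doesn't say which of the three conditions failed, so the checklist above still applies.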

Turn off Dataproc

Vertex AI Workbench instances are created with Dataproc enabled by default. You can create a Vertex AI Workbench instance with Dataproc turned off by setting the disable-mixer metadata key to true.

gcloud workbench instances create INSTANCE_NAME --metadata=disable-mixer=true

Enable Dataproc

You can enable Dataproc on a stopped Vertex AI Workbench instance by updating the metadata value.

gcloud workbench instances update INSTANCE_NAME --metadata=disable-mixer=false
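
Because the disable-mixer key is phrased negatively, it is easy to set the wrong value: true turns Dataproc off and false turns it on. The following sketch hides that inversion behind a hypothetical helper named set_dataproc_enabled. It reuses the update command shown above; using update to turn Dataproc off on an existing instance is an assumption, because this page shows update only for enabling. Like the commands on this page, it relies on your gcloud configuration for the instance location.

```shell
# set_dataproc_enabled INSTANCE_NAME enable|disable
# Hypothetical helper: maps a positive intent onto the negative metadata key.
set_dataproc_enabled() {
  instance="$1"
  case "$2" in
    enable)  value=false ;;  # disable-mixer=false means Dataproc is on
    disable) value=true ;;   # disable-mixer=true means Dataproc is off
    *)
      echo "usage: set_dataproc_enabled INSTANCE_NAME enable|disable" >&2
      return 2
      ;;
  esac
  gcloud workbench instances update "$instance" \
    --metadata="disable-mixer=${value}"
}
```

For example, set_dataproc_enabled INSTANCE_NAME enable runs the same command as the one shown above.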

Manage Dataproc using Terraform

In Terraform, Dataproc for Vertex AI Workbench instances is managed through the disable-mixer key in the metadata field. To enable Dataproc, set the disable-mixer metadata key to false. To turn off Dataproc, set the disable-mixer metadata key to true.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.

resource "google_workbench_instance" "default" {
  name     = "workbench-instance-example"
  location = "us-central1-a"

  gce_setup {
    machine_type = "n1-standard-1"
    vm_image {
      project = "cloud-notebooks-managed"
      family  = "workbench-instances"
    }
    metadata = {
      # "false" keeps Dataproc enabled; set to "true" to turn it off.
      disable-mixer = "false"
    }
  }
}

Troubleshoot

To diagnose and resolve issues related to creating a Dataproc Spark-enabled instance, see Troubleshooting Vertex AI Workbench.

What's next

For more information about the Dataproc JupyterLab extension, see Use JupyterLab extension to develop serverless Spark workloads.
To learn more about Serverless for Apache Spark, see the Serverless for Apache Spark documentation.
Learn how to run Serverless for Apache Spark workloads without provisioning and managing clusters.
For more information about using Spark with Google Cloud products and services, see Spark on Google Cloud.
Browse the available Dataproc templates on GitHub.
Learn about Serverless Spark through the serverless-spark-workshop on GitHub.
Read the Apache Spark documentation.