Run an Apache Spark batch workload

Learn how to use Serverless for Apache Spark to submit a batch workload on Dataproc-managed compute infrastructure that scales resources automatically as needed.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Submit a Spark batch workload

You can use the Google Cloud console, the Google Cloud CLI, or the Serverless for Apache Spark API to create and submit a Serverless for Apache Spark batch workload.

Console

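In the Google Cloud console, go to Dataproc Batches.
Click Create.
Submit a Spark batch workload that computes the approximate value of pi by selecting and filling in the following fields:
- Batch info:
  - Batch ID: Specify an ID for your batch workload. This value must be 4-63 lowercase characters. Valid characters are /[a-z][0-9]-/.
  - Region: Select the region where your workload will run.
- Container:
  - Batch type: Spark
  - Runtime version: The default runtime version is selected. You can optionally specify a non-default Serverless for Apache Spark runtime version.
  - Main class: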
```
org.apache.spark.examples.SparkPi
```
  - Jar files (this file is pre-installed in the Serverless for Apache Spark execution environment):
```
file:///usr/lib/spark/examples/jars/spark-examples.jar
```
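  - Arguments: 1000.
- Execution configuration: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. The service account must have the Dataproc Worker role.
- Network configuration: Select a subnetwork in the session region. Serverless for Apache Spark enables Private Google Access (PGA) on the specified subnet. For network connectivity requirements, see Google Cloud Serverless for Apache Spark network configuration.
- Properties: Enter the Key (property name) and Value of each supported Spark property that you want to set on your Spark batch workload. Note: Unlike Dataproc on Compute Engine cluster properties, Serverless for Apache Spark workload properties don't include a spark: prefix.
- Other options:
  - You can configure the batch workload to use an external self-managed Hive Metastore.
  - You can use a Persistent History Server (PHS). The PHS must be located in the region where you run batch workloads.
Click Submit to run the Spark batch workload.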
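
The batch ID rule in the fields above can be checked before submitting. The following Python sketch is illustrative only: it assumes the pattern /[a-z][0-9]-/ means "lowercase letters, digits, and hyphens" and combines that with the 4-63 character length rule. The function name is hypothetical, and the service may enforce additional rules that are not stated here.

```python
import re

# Assumption: "/[a-z][0-9]-/" is read as a character set (lowercase letters,
# digits, hyphens), combined with the 4-63 character length rule.
_BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the length and character rules."""
    return _BATCH_ID_PATTERN.fullmatch(batch_id) is not None


# Print the verdict for a few candidate IDs.
for candidate in ["spark-pi-001", "pi", "Spark_Pi"]:
    print(candidate, is_valid_batch_id(candidate))
```

Checking the ID locally only catches obvious mistakes early; the console remains the authority on whether an ID is accepted.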

gcloud

To submit a Spark batch workload that computes the approximate value of pi, run the following gcloud CLI gcloud dataproc batches submit spark command locally in a terminal window or in Cloud Shell.

gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000

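Replace the following:

REGION: Specify the region where your workload will run.
Other options: You can add gcloud dataproc batches submit spark flags to specify other workload options and Spark properties.
- --version: You can specify a non-default Serverless for Apache Spark runtime version.
- --jars: The example JAR file is pre-installed in the Spark execution environment. The 1000 command argument passed to the SparkPi workload specifies 1000 iterations of the pi estimation logic (workload input arguments are included after the "--").
- --subnet: You can add this flag to specify the name of a subnet in the session region. If you don't specify a subnet, Serverless for Apache Spark selects the default subnet in the session region. Serverless for Apache Spark enables Private Google Access (PGA) on the subnet. For network connectivity requirements, see Google Cloud Serverless for Apache Spark network configuration.
- --properties: You can add this flag to enter supported Spark properties for your Spark batch workload to use.
- --deps-bucket: You can add this flag to specify a Cloud Storage bucket where Serverless for Apache Spark uploads workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path or the bucket name. Serverless for Apache Spark uploads the local files to a /dependencies folder in the bucket before running the batch workload. Note: This flag is required if your batch workload references files on your local machine.
- --ttl: You can add the --ttl flag to specify the duration of the batch lifetime. When the workload exceeds this duration, it is unconditionally terminated without waiting for ongoing work to finish. Specify the duration using an s, m, h, or d (seconds, minutes, hours, or days) suffix. The minimum value is 10 minutes (10m), and the maximum value is 14 days (14d).
  - 1.1 or 2.0 runtime batches: If --ttl is not specified for a 1.1 or 2.0 runtime batch workload, the workload runs until it exits naturally (or runs forever if it does not exit).
  - 2.1+ runtime batches: If --ttl is not specified for a 2.1 or later runtime batch workload, it defaults to 4h.
- --service-account: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. The service account must have the Dataproc Worker role.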
- Hive metastore: The following command configures a batch workload to use an external self-managed Hive Metastore using a standard Spark configuration.
```
gcloud dataproc batches submit spark \
    --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=METASTORE_URI,spark.hive.metastore.warehouse.dir=WAREHOUSE_DIR \
    other args ...
```
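- Persistent History Server:
  1. The following command creates a PHS on a single-node Dataproc cluster. The PHS must be located in the region where you run batch workloads, and the Cloud Storage bucket-name must exist.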
```
gcloud dataproc clusters create PHS_CLUSTER_NAME \
    --region=REGION \
    --single-node \
    --enable-component-gateway \
    --properties=spark:spark.history.fs.logDirectory=gs://bucket-name/phs/*/spark-job-history
```
  2. Submit a batch workload, specifying your running Persistent History Server.
```
gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --history-server-cluster=projects/project-id/regions/region/clusters/PHS-cluster-name \
    -- 1000
```
- Runtime version: Use the --version flag to specify the Serverless for Apache Spark runtime version for the workload.
```
gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --version=VERSION \
    -- 1000
```
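
The --ttl rules described in the options above (suffix format, 10m minimum, 14d maximum, and the version-dependent default) can be expressed as a small check. This Python sketch is an illustration under assumptions: it accepts only a single integer followed by one suffix, which may be stricter than what the gcloud CLI accepts, and the function names are hypothetical.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
MIN_TTL_SECONDS = 10 * 60      # 10m minimum
MAX_TTL_SECONDS = 14 * 86400   # 14d maximum


def parse_ttl(ttl: str) -> int:
    """Convert a duration such as '90m' to seconds and check the bounds."""
    # Assumption: one integer followed by one suffix (s, m, h, or d).
    match = re.fullmatch(r"(\d+)([smhd])", ttl)
    if match is None:
        raise ValueError(f"expected a number followed by s, m, h, or d: {ttl!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if not MIN_TTL_SECONDS <= seconds <= MAX_TTL_SECONDS:
        raise ValueError(f"ttl must be between 10m and 14d: {ttl!r}")
    return seconds


def effective_ttl_seconds(runtime_version: str, ttl: Optional[str] = None) -> Optional[int]:
    """Return the TTL in seconds, or None when the workload has no time limit."""
    if ttl is not None:
        return parse_ttl(ttl)
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    # 2.1+ runtimes default to 4h; 1.1 and 2.0 run until the workload exits.
    return 4 * 3600 if (major, minor) >= (2, 1) else None
```

For example, effective_ttl_seconds("2.3") yields the 4h default in seconds, while effective_ttl_seconds("2.0") returns None because those runtimes have no default limit.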

API

This section shows how to create a batch workload that computes the approximate value of pi by using the Serverless for Apache Spark batches.create method.

Before using any of the request data, make the following replacements:

project-id: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
region: The Compute Engine region where Google Cloud Serverless for Apache Spark runs the workload.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches

Request JSON body:

{
  "sparkBatch":{
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ],
    "mainClass":"org.apache.spark.examples.SparkPi"
  }
}

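To send your request, use one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: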

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches"

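PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: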

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches" | Select-Object -Expand Content
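
The same request can also be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names are hypothetical, the token is obtained with the same gcloud auth print-access-token command used above, and error handling is omitted.

```python
import json
import subprocess
import urllib.request


def build_batch_request(project_id: str, region: str, access_token: str) -> urllib.request.Request:
    """Build the POST request that creates the SparkPi batch workload."""
    body = {
        "sparkBatch": {
            "args": ["1000"],
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "mainClass": "org.apache.spark.examples.SparkPi",
        }
    }
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/batches"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


def submit_batch(project_id: str, region: str) -> dict:
    """Send the request; assumes the gcloud CLI is installed and logged in."""
    token = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    request = build_batch_request(project_id, region, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the URL, headers, and body easy to inspect without calling the API.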

You should receive a JSON response similar to the following:

{
  "name":"projects/project-id/locations/region/batches/batch-id",
  "uuid":"uuid",
  "createTime":"2021-07-22T17:03:46.393957Z",
  "sparkBatch":{
    "mainClass":"org.apache.spark.examples.SparkPi",
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ]
  },
  "runtimeInfo":{
    "outputUri":"gs://dataproc-.../driveroutput"
  },
  "state":"SUCCEEDED",
  "stateTime":"2021-07-22T17:06:30.301789Z",
  "creator":"account-email-address",
  "runtimeConfig":{
    "version":"2.3",
    "properties":{
      "spark:spark.executor.instances":"2",
      "spark:spark.driver.cores":"2",
      "spark:spark.executor.cores":"2",
      "spark:spark.app.name":"projects/project-id/locations/region/batches/batch-id"
    }
  },
  "environmentConfig":{
    "peripheralsConfig":{
      "sparkHistoryServerConfig":{
      }
    }
  },
  "operation":"projects/project-id/regions/region/operation-id"
}
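
A response like the one above can be reduced to the few fields that are usually of interest. The following Python sketch is illustrative: the function name and the set of terminal states are assumptions (check the Batch resource reference for the authoritative list of states), and it reads only keys that appear in the sample response.

```python
import json

# Assumption: these are treated as final batch states in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}
_PREFIX = "spark:"


def summarize_batch(response_text: str) -> dict:
    """Extract the batch ID, state, output URI, and Spark properties."""
    batch = json.loads(response_text)
    properties = batch.get("runtimeConfig", {}).get("properties", {})
    return {
        # The batch ID is the last segment of the resource name.
        "batch_id": batch["name"].rsplit("/", 1)[-1],
        "state": batch["state"],
        "finished": batch["state"] in TERMINAL_STATES,
        "output_uri": batch.get("runtimeInfo", {}).get("outputUri"),
        # Keys in the sample response carry a "spark:" prefix; strip it.
        "spark_properties": {
            (key[len(_PREFIX):] if key.startswith(_PREFIX) else key): value
            for key, value in properties.items()
        },
    }
```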

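Estimate workload costs

Serverless for Apache Spark workloads consume Data Compute Unit (DCU) and shuffle storage resources. For an example that outputs Dataproc UsageMetrics to estimate workload resource consumption and costs, see Serverless for Apache Spark pricing.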
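
As a rough sketch of the arithmetic involved, the following Python function converts usage figures into DCU-hours and shuffle-storage GB-hours and multiplies them by rates that you supply. The field names milliDcuSeconds and shuffleStorageGbSeconds are assumed to match the UsageMetrics output, and both rate parameters are hypothetical placeholders, not actual prices. Take real rates and their billing units from the pricing page.

```python
def estimate_cost(usage: dict, dcu_hour_rate: float, shuffle_gb_hour_rate: float) -> dict:
    """Estimate cost from usage figures and caller-supplied (hypothetical) rates."""
    # Assumption: values may arrive as strings in JSON, so convert with int().
    dcu_hours = int(usage.get("milliDcuSeconds", 0)) / 1000 / 3600
    shuffle_gb_hours = int(usage.get("shuffleStorageGbSeconds", 0)) / 3600
    return {
        "dcu_hours": dcu_hours,
        "shuffle_gb_hours": shuffle_gb_hours,
        "estimated_cost": dcu_hours * dcu_hour_rate
        + shuffle_gb_hours * shuffle_gb_hour_rate,
    }
```

If the published shuffle storage rate uses a different billing unit than GB-hours, convert it before passing it in.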

What's next

Learn about the following: