Run an Apache Spark batch workload

Learn how to use Dataproc Serverless to submit a batch workload on a Dataproc-managed compute infrastructure that scales resources as needed.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Make sure the regional VPC subnet where you will run your workload has Private Google Access enabled. For more information, see Dataproc Serverless for Spark network configuration.

Submit a Spark batch workload

You can use the Google Cloud console, the Google Cloud CLI, or the Dataproc Serverless API to create and submit a Dataproc Serverless for Spark batch workload.

Console

In the Google Cloud console, go to Dataproc Batches.
Click Create.
Submit a Spark batch workload that computes the approximate value of pi by selecting and filling in the following fields:
- Batch Info:
  - Batch ID: Specify an ID for your batch workload. This value must be 4-63 lowercase characters. Valid characters are /[a-z][0-9]-/.
  - Region: Select a region where your workload will run.
- Container:
  - Batch type: Spark.
  - Runtime version: The default runtime version is selected. You can optionally specify a non-default Dataproc Serverless runtime version.
  - Main class:
```
org.apache.spark.examples.SparkPi
```
  - Jar files (this file is pre-installed in the Dataproc Serverless Spark execution environment).
```
file:///usr/lib/spark/examples/jars/spark-examples.jar
```
  - Arguments: 1000.
- Execution Configuration: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. Your service account must have the Dataproc Worker role.
- Network configuration: The VPC subnetwork that executes Dataproc Serverless for Spark workloads must be enabled for Private Google Access (PGA) and meet the other requirements listed in Dataproc Serverless for Spark network configuration.
  
  The Primary network and subnetwork selectors list networks with subnets in the selected workload region that have Private Google Access enabled. Select a network and subnet from the list. If no networks and subnets are listed, you can enable Private Google Access on a VPC subnet in the currently selected workload region or change the workload region to a region with a listed PGA-enabled subnet, and then select that network and subnet.
- Properties: Enter the Key (property name) and Value of supported Spark properties to set on your Spark batch workload. Note: Unlike Dataproc on Compute Engine cluster properties, Dataproc Serverless for Spark workload properties don't include a spark: prefix.
- Other options:
  - You can configure the batch workload to use an external self-managed Hive Metastore.
  - You can use a Persistent History Server (PHS). The PHS must be located in the region where you run batch workloads.
Click Submit to run the Spark batch workload.
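
The Batch ID rule above (4-63 characters, each a lowercase letter, digit, or hyphen) can be checked before you submit. The following is a minimal sketch; the function name `is_valid_batch_id` is hypothetical, and the pattern encodes only the rule stated on this page, so the service may apply additional checks that it does not capture.

```python
import re

# Pattern taken from the rule stated above: 4-63 characters, each a lowercase
# letter, a digit, or a hyphen. This is an assumption based on that wording;
# the service may enforce further constraints.
BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the documented length and character rule."""
    return BATCH_ID_PATTERN.fullmatch(batch_id) is not None
```

For example, `is_valid_batch_id("spark-pi-001")` passes, while `"abc"` (too short) and `"Spark-Pi"` (uppercase letters) do not.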

gcloud

To submit a Spark batch workload that computes the approximate value of pi, run the following gcloud dataproc batches submit spark command locally in a terminal window or in Cloud Shell.

gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000

Notes:

REGION: Specify the region where your workload will run.
Subnetwork: The VPC subnetwork that executes Dataproc Serverless for Spark workloads must be enabled for Private Google Access and meet the other requirements listed in Dataproc Serverless for Spark network configuration. If the default network's subnet for the region specified in the gcloud dataproc batches submit command is not enabled for Private Google Access, you must do one of the following:
- Enable the default network's subnet for the region for Private Google Access, or
- Use the --subnet=SUBNET_URI flag to specify a subnet that has Private Google Access enabled. You can run the gcloud compute networks describe NETWORK_NAME command to list the URIs of subnets in a network.
--jars: The example JAR file is pre-installed in the Spark execution environment. The 1000 command argument passed to the SparkPi workload specifies 1000 iterations of the pi estimation logic (workload input arguments are included after the "-- ").
--properties: You can add this flag to enter supported Spark properties for your Spark batch workload to use.
--deps-bucket: You can add this flag to specify a Cloud Storage bucket where Dataproc Serverless will upload workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path or bucket name, for example, "mybucketname". Dataproc Serverless for Spark uploads the local file(s) to a /dependencies folder in the bucket before running the batch workload. Note: This flag is required if your batch workload references files on your local machine.
--ttl: You can add the --ttl flag to specify the duration of the batch lifetime. When the workload exceeds this duration, it is unconditionally terminated without waiting for ongoing work to finish. Specify the duration using a s, m, h, or d (seconds, minutes, hours, or days) suffix. The minimum value is 10 minutes (10m), and the maximum value is 14 days (14d).
- 1.1 or 2.0 runtime batches: If --ttl is not specified for a 1.1 or 2.0 runtime batch workload, the workload is allowed to run until it exits naturally (or run forever if it does not exit).
- 2.1+ runtime batches: If --ttl is not specified for a 2.1 or later runtime batch workload, it defaults to 4h.
--service-account: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. Your service account must have the Dataproc Worker role.
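
The --ttl bounds described above can be checked locally before a value is passed to the command. The following is a rough sketch, not the gcloud CLI's own parser: the name `parse_ttl` is hypothetical, and it accepts only a single integer followed by one of the s, m, h, or d suffixes named above.

```python
import re

# Seconds per unit for the suffixes named in the --ttl note.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
MIN_TTL_SECONDS = 10 * 60      # documented minimum: 10m
MAX_TTL_SECONDS = 14 * 86400   # documented maximum: 14d


def parse_ttl(ttl: str) -> int:
    """Convert a value such as '4h' to seconds, enforcing the documented bounds."""
    match = re.fullmatch(r"(\d+)([smhd])", ttl)
    if match is None:
        raise ValueError(f"expected <integer><s|m|h|d>, got {ttl!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if not MIN_TTL_SECONDS <= seconds <= MAX_TTL_SECONDS:
        raise ValueError(f"ttl {ttl!r} is outside the 10m-14d range")
    return seconds
```

Values outside the range, such as `5m` or `15d`, raise `ValueError` instead of being sent to the service.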

Other options: You can add gcloud dataproc batches submit spark flags to specify other workload options and Spark properties.

Hive Metastore: The following command configures a batch workload to use an external self-managed Hive Metastore using a standard Spark configuration.

gcloud dataproc batches submit spark \
    --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=METASTORE_URI,spark.hive.metastore.warehouse.dir=WAREHOUSE_DIR \
    other args ...
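
The comma-separated value passed to --properties in the command above can be assembled from a dictionary, which is easier to review than one long string. This is an illustrative sketch: `format_properties` is a hypothetical helper, and METASTORE_URI and WAREHOUSE_DIR remain placeholders that you replace with your own values.

```python
def format_properties(properties: dict) -> str:
    """Join properties into the comma-separated key=value form used by --properties."""
    for key, value in properties.items():
        # A comma inside a key or value would be read as a list separator,
        # so this sketch rejects it instead of trying to escape it.
        if "," in key or "," in str(value):
            raise ValueError(f"comma in property {key!r} would break the list")
    return ",".join(f"{key}={value}" for key, value in properties.items())


hive_properties = {
    "spark.sql.catalogImplementation": "hive",
    "spark.hive.metastore.uris": "METASTORE_URI",            # placeholder
    "spark.hive.metastore.warehouse.dir": "WAREHOUSE_DIR",   # placeholder
}
properties_flag = "--properties=" + format_properties(hive_properties)
```

Note that the property names carry no spark: prefix, consistent with the Dataproc Serverless convention described earlier on this page.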

Persistent History Server:

The following command creates a PHS on a single-node Dataproc cluster. The PHS must be located in the region where you run batch workloads, and the Cloud Storage bucket (bucket-name) must exist.

gcloud dataproc clusters create PHS_CLUSTER_NAME \
    --region=REGION \
    --single-node \
    --enable-component-gateway \
    --properties=spark:spark.history.fs.logDirectory=gs://bucket-name/phs/*/spark-job-history

Submit a batch workload, specifying your running Persistent History Server.

gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --history-server-cluster=projects/project-id/regions/region/clusters/PHS-cluster-name \
    -- 1000

Runtime version: Use the --version flag to specify the Dataproc Serverless runtime version for the workload.

gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --version=VERSION \
    -- 1000
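
The command variants in this section differ only in their optional flags, so they can be generated from one helper. The following sketch builds the argument list without running it; `build_submit_command` is a hypothetical name, and it covers only flags that appear on this page (--version, --subnet, --ttl, --history-server-cluster).

```python
import shlex
from typing import List, Optional, Sequence


def build_submit_command(
    region: str,
    workload_args: Sequence[str] = ("1000",),
    version: Optional[str] = None,
    subnet: Optional[str] = None,
    ttl: Optional[str] = None,
    history_server_cluster: Optional[str] = None,
) -> List[str]:
    """Return the SparkPi submit command as an argument list (nothing is executed)."""
    command = [
        "gcloud", "dataproc", "batches", "submit", "spark",
        f"--region={region}",
        "--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar",
        "--class=org.apache.spark.examples.SparkPi",
    ]
    optional_flags = {
        "--version": version,
        "--subnet": subnet,
        "--ttl": ttl,
        "--history-server-cluster": history_server_cluster,
    }
    # Add only the flags that were given a value.
    command += [f"{flag}={value}" for flag, value in optional_flags.items() if value]
    # Everything after the bare "--" is passed to the workload itself.
    command += ["--", *workload_args]
    return command


command = build_submit_command("REGION", version="VERSION")
print(shlex.join(command))  # shell-quoted form, suitable for review before running
```

After reviewing the printed command, you could pass the list to `subprocess.run(command, check=True)` on a machine where the gcloud CLI is installed and authenticated.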

API

This section shows how to create a batch workload to compute the approximate value of pi using the Dataproc Serverless for Spark batches.create API.

Before using any of the request data, make the following replacements:

project-id: A Google Cloud project ID.
region: A Compute Engine region where Dataproc Serverless will run the workload.

Notes:

RuntimeConfig.containerImage: You can specify a custom container image using the Docker image naming format: {hostname}/{project-id}/{image}:{tag}, for example, "gcr.io/my-project-id/my-image:1.0.1". Note: You must host your custom container on Container Registry.
ExecutionConfig.subnetworkUri: The VPC subnetwork that executes Dataproc Serverless for Spark workloads must be enabled for Private Google Access and meet the other requirements listed in Dataproc Serverless for Spark network configuration. If the default network's subnet for the specified region is not enabled for Private Google Access, you must do one of the following:
1. Enable the default network's subnet for the region for Private Google Access, or
2. Use the ExecutionConfig.subnetworkUri field to specify a subnet that has Private Google Access enabled. You can run the gcloud compute networks describe [NETWORK_NAME] command to list the URIs of subnets in a network.
sparkBatch.jarFileUris: The example jar file is pre-installed in the Spark execution environment. The "1000" sparkBatch.args is passed to the SparkPi workload, and specifies 1000 iterations of the pi estimation logic.
RuntimeConfig.properties: You can use this field to enter supported Spark properties for your Spark batch workload to use.
ExecutionConfig.serviceAccount: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. Your service account must have the Dataproc Worker role.
EnvironmentConfig.ttl: You can use this field to specify the duration of the batch lifetime. When the workload exceeds this duration, it is unconditionally terminated without waiting for ongoing work to finish. Specify the duration as the JSON representation for Duration. The minimum value is 10 minutes, and the maximum value is 14 days.
- 1.1 or 2.0 runtime batches: If ttl is not specified for a 1.1 or 2.0 runtime batch workload, the workload is allowed to run until it exits naturally (or run forever if it does not exit).
- 2.1+ runtime batches: If ttl is not specified for a 2.1 or later runtime batch workload, it defaults to 4 hours.
Other options:
- Configure the batch workload to use an external self-managed Hive Metastore.
- Use a Persistent History Server (PHS). The PHS must be located in the region where you run batch workloads.
- Use the RuntimeConfig.version field as part of the batches.create request to specify a non-default Dataproc Serverless runtime version.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches

Request JSON body:

{
  "sparkBatch":{
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ],
    "mainClass":"org.apache.spark.examples.SparkPi"
  }
}
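
If you prefer to generate the request rather than write it by hand, the URL and body above can be produced programmatically. The following is a sketch under stated assumptions: `batches_url` and `spark_pi_request_body` are hypothetical helpers, and the placement of `version` and `properties` under `runtimeConfig` is inferred from the RuntimeConfig field names in the notes and from the `runtimeConfig` object in the sample response on this page.

```python
import json
from typing import Optional


def batches_url(project_id: str, region: str) -> str:
    """Build the batches.create URL shown above for a project and region."""
    return (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/batches"
    )


def spark_pi_request_body(
    version: Optional[str] = None, properties: Optional[dict] = None
) -> dict:
    """Return the SparkPi request body, with optional runtimeConfig fields."""
    body = {
        "sparkBatch": {
            "args": ["1000"],
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "mainClass": "org.apache.spark.examples.SparkPi",
        }
    }
    runtime_config = {}
    if version:
        runtime_config["version"] = version
    if properties:
        runtime_config["properties"] = dict(properties)
    # Assumption: runtimeConfig is omitted entirely when nothing is set,
    # which reproduces the minimal request body shown above.
    if runtime_config:
        body["runtimeConfig"] = runtime_config
    return body


print(json.dumps(spark_pi_request_body(), indent=2))
```

Called with no arguments, the helper yields the same three sparkBatch fields as the request body above; you could write the `json.dumps` output to request.json for use with the commands that follow.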

To send your request, choose one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches" | Select-Object -Expand Content
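
The same request can also be sent from a script. The following sketch mirrors the curl command using only the Python standard library; the function names are hypothetical, and it assumes the gcloud CLI is on your PATH and already authenticated (on Windows, you might need to invoke gcloud through a shell instead).

```python
import json
import subprocess
import urllib.request


def get_access_token() -> str:
    """Return an access token by running gcloud auth print-access-token."""
    result = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def build_create_batch_request(
    project_id: str, region: str, token: str, body: dict
) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl command issues."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/batches"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical call sequence would load request.json into a dictionary, pass it with `get_access_token()` to `build_create_batch_request`, and then call `send` on the result.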

You should receive a JSON response similar to the following:

{
  "name":"projects/project-id/locations/region/batches/batch-id",
  "uuid":"uuid",
  "createTime":"2021-07-22T17:03:46.393957Z",
  "sparkBatch":{
    "mainClass":"org.apache.spark.examples.SparkPi",
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ]
  },
  "runtimeInfo":{
    "outputUri":"gs://dataproc-.../driveroutput"
  },
  "state":"SUCCEEDED",
  "stateTime":"2021-07-22T17:06:30.301789Z",
  "creator":"account-email-address",
  "runtimeConfig":{
    "properties":{
      "spark:spark.executor.instances":"2",
      "spark:spark.driver.cores":"2",
      "spark:spark.executor.cores":"2",
      "spark:spark.app.name":"projects/project-id/locations/region/batches/batch-id"
    }
  },
  "environmentConfig":{
    "peripheralsConfig":{
      "sparkHistoryServerConfig":{
      }
    }
  },
  "operation":"projects/project-id/regions/region/operation-id"
}
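
Most of the response is an echo of the request; the fields you usually act on are the batch name, its state, and the driver output location. The following sketch pulls those three fields out of a response like the one above. `summarize_batch` is a hypothetical helper, and it returns None for any field that is absent instead of failing.

```python
import json


def summarize_batch(response_text: str) -> dict:
    """Extract the name, state, and driver output URI from a batch response."""
    batch = json.loads(response_text)
    return {
        "name": batch.get("name"),
        "state": batch.get("state"),
        # runtimeInfo may be missing before the workload has produced output.
        "output_uri": batch.get("runtimeInfo", {}).get("outputUri"),
    }


sample = """
{
  "name": "projects/project-id/locations/region/batches/batch-id",
  "runtimeInfo": {"outputUri": "gs://dataproc-.../driveroutput"},
  "state": "SUCCEEDED"
}
"""
summary = summarize_batch(sample)
```

Here `summary["state"]` holds the `SUCCEEDED` value from the sample, and `summary["output_uri"]` holds the Cloud Storage location of the driver output.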

Estimate workload costs

Dataproc Serverless for Spark workloads consume Data Compute Unit (DCU) and shuffle storage resources. For an example that outputs Dataproc UsageMetrics to estimate workload resource consumption and costs, see Dataproc Serverless pricing.

What's next

Learn about: