Run an Apache Spark batch workload

Setup

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  4. Enable the Dataproc API.

    Enable the API

Submit a Spark batch workload

Console

Go to Dataproc Batches in the Google Cloud console. Click CREATE to open the Create batch page.

Select and fill in the following fields on the page to submit a Spark batch workload that computes the approximate value of pi:

  • Batch Info
    • Batch ID: Specify an ID for your batch workload. This value must be 4-63 characters long. Valid characters are lowercase letters (a-z), digits (0-9), and hyphens (-).
    • Region: Select a region where your workload will run.
  • Container
    • Batch type: Spark
    • Main class:
      org.apache.spark.examples.SparkPi
    • Jar files:
      file:///usr/lib/spark/examples/jars/spark-examples.jar
    • Arguments: 1000
  • Execution Configuration: You can specify a service account to use to run your workload. If you do not specify a service account, the workload runs under the Compute Engine default service account.
  • Network configuration: The VPC subnetwork that executes Serverless Spark workloads must meet the requirements listed in Dataproc Serverless for Spark network configuration. The subnetwork list displays the subnets in your selected network that are enabled for Private Google Access.
  • Properties: Enter the Key (property name) and Value of any supported Spark properties you want your Spark batch workload to use. Note: Unlike Dataproc on Compute Engine cluster properties, Dataproc Serverless for Spark workload properties do not include a "spark:" prefix.
  • Other options

Click SUBMIT to run the Spark batch workload.
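
The Batch ID rule listed above can be checked before you open the form. The following is a minimal Python sketch, assuming the rule is exactly "4-63 characters drawn from lowercase letters, digits, and hyphens"; the function name is_valid_batch_id is hypothetical, and the service may apply further checks that are not modeled here.

```python
import re

# Assumption: a Batch ID is 4-63 characters, each a lowercase letter,
# a digit, or a hyphen. Any stricter server-side rule is not covered.
BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the length and character rule."""
    return BATCH_ID_PATTERN.fullmatch(batch_id) is not None


if __name__ == "__main__":
    for candidate in ["spark-pi-001", "Pi", "abc"]:
        print(candidate, is_valid_batch_id(candidate))
```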

gcloud

To submit a Spark batch workload that computes the approximate value of pi, run the following gcloud CLI command, gcloud dataproc batches submit spark, locally in a terminal window or in Cloud Shell.

gcloud dataproc batches submit spark \
    --region=region \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000
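
If you submit batches from a script, it can help to assemble the command as an argument list instead of a shell string. The following Python sketch only builds and prints the command shown above; the helper name build_submit_command and the sample region us-central1 are hypothetical, and the optional subnet and properties parameters map to the --subnet and --properties flags described on this page.

```python
import shlex
from typing import Dict, List, Optional


def build_submit_command(
    region: str,
    jar: str,
    main_class: str,
    args: List[str],
    subnet: Optional[str] = None,
    properties: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argv list for `gcloud dataproc batches submit spark`."""
    cmd = [
        "gcloud", "dataproc", "batches", "submit", "spark",
        f"--region={region}",
        f"--jars={jar}",
        f"--class={main_class}",
    ]
    if subnet:
        cmd.append(f"--subnet={subnet}")
    if properties:
        for key in properties:
            # Serverless workload properties are written without a "spark:" prefix.
            if key.startswith("spark:"):
                raise ValueError(f"remove the 'spark:' prefix from {key!r}")
        pairs = ",".join(f"{k}={v}" for k, v in properties.items())
        cmd.append(f"--properties={pairs}")
    # Workload input arguments follow the "--" separator.
    cmd.append("--")
    cmd.extend(args)
    return cmd


command = build_submit_command(
    region="us-central1",  # placeholder region
    jar="file:///usr/lib/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    args=["1000"],
)
print(shlex.join(command))
```

To actually submit the workload, the list could be passed to subprocess.run(command, check=True) on a machine where the gcloud CLI is installed and authenticated.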

Notes:

  • Subnetwork: The VPC subnetwork that executes Serverless Spark workloads must meet the requirements listed in Dataproc Serverless for Spark network configuration. If the default network's subnet for the region specified in the gcloud dataproc batches submit command is not enabled for Private Google Access, you must do one of the following:
    1. Enable the default network's subnet for the region for Private Google Access, or
    2. Use the --subnet=[SUBNET_URI] flag in the command to specify a subnet that has Private Google Access enabled. You can run the gcloud compute networks describe [NETWORK_NAME] command to list the URIs of subnets in a network.
  • --jars: The example jar file is pre-installed in the Spark execution environment. The 1000 command argument passed to the SparkPi workload specifies 1000 iterations of the pi estimation logic (workload input arguments are included after the "-- ").
  • --properties: You can add the --properties flag to enter any supported Spark properties you want your Spark batch workload to use.
  • --deps-bucket: You can add this flag to specify a Cloud Storage bucket where Dataproc Serverless for Spark will upload workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path/name only, for example, "mybucketname". Exception: If your batch workload references files on your local machine, the --deps-bucket flag is required; Dataproc Serverless for Spark will upload the local file(s) to a /dependencies folder in the bucket before running the batch workload.
  • --container-image: You can specify a custom container image using the Docker image naming format: {hostname}/{project-id}/{image}:{tag}, for example, gcr.io/my-project-id/my-image:1.0.1. You must host your custom container on Container Registry.
  • Other options:
    • You can add other optional command flags. For example, the following command configures the batch workload to use an external self-managed Hive Metastore using a standard Spark configuration:
      gcloud beta dataproc batches submit \
          --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=METASTORE_URI,spark.hive.metastore.warehouse.dir=WAREHOUSE_DIR \
          other args ...
                 
      See gcloud dataproc batches submit for supported command flags.
    • Use a Persistent History Server.
      1. Create a Persistent History Server (PHS) on a single-node Dataproc cluster. Note: The Cloud Storage bucket named by bucket-name must already exist.
        gcloud dataproc clusters create PHS-cluster-name \
            --region=region \
            --single-node \
            --enable-component-gateway \
            --properties=spark:spark.history.fs.logDirectory=gs://bucket-name/phs/*/spark-job-history
                     
      2. Submit a batch workload, specifying your running Persistent History Server.
        gcloud dataproc batches submit spark \
            --region=region \
            --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
            --class=org.apache.spark.examples.SparkPi \
            --history-server-cluster=projects/project-id/regions/region/clusters/PHS-cluster-name \
            -- 1000
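
The two Persistent History Server commands above share the same project, region, and cluster name, so a script can derive both values from one place. The following Python sketch formats the --history-server-cluster resource name and the log directory property exactly as they appear in those commands; the function names and the sample values (my-project, us-central1, my-phs, my-bucket) are hypothetical placeholders.

```python
def history_server_cluster(project_id: str, region: str, cluster_name: str) -> str:
    """Format the value passed to --history-server-cluster."""
    return f"projects/{project_id}/regions/{region}/clusters/{cluster_name}"


def history_log_directory_property(bucket_name: str) -> str:
    """Format the cluster property that points the PHS at the batch history logs."""
    return (
        "spark:spark.history.fs.logDirectory="
        f"gs://{bucket_name}/phs/*/spark-job-history"
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own project, region, cluster, and bucket.
    print(history_server_cluster("my-project", "us-central1", "my-phs"))
    print(history_log_directory_property("my-bucket"))
```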
                      

REST & CMD LINE

This section shows how to create a batch workload to compute the approximate value of pi using the Dataproc Serverless for Spark batches.create API.

Before using any of the request data, make the following replacements:

  • project-id: Google Cloud project ID
  • region: the region where the workload will run
  • Notes:
    • Custom-container-image: Specify the custom container image using the Docker image naming format: {hostname}/{project-id}/{image}:{tag}, for example, "gcr.io/my-project-id/my-image:1.0.1". Note: You must host your custom container on Container Registry.
    • Subnetwork: If the default network's subnet for the specified region is not enabled for Private Google Access, you must do one of the following:
      1. Enable the default network's subnet for the region for Private Google Access, or
      2. Use the ExecutionConfig.subnetworkUri field to specify a subnet that has Private Google Access enabled. You can run the gcloud compute networks describe [NETWORK_NAME] command to list the URIs of subnets in a network.
    • sparkBatch.jarFileUris: The example jar file is pre-installed in the Spark execution environment. The "1000" sparkBatch.args is passed to the SparkPi workload, and specifies 1000 iterations of the pi estimation logic.
    • Spark properties: You can use the RuntimeConfig.properties field to enter any supported Spark properties you want your Spark batch workload to use.
    • Other options:

    HTTP method and URL:

    POST https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches

    Request JSON body:

    {
      "sparkBatch":{
        "args":[
          "1000"
        ],
        "jarFileUris":[
          "file:///usr/lib/spark/examples/jars/spark-examples.jar"
        ],
        "mainClass":"org.apache.spark.examples.SparkPi"
      }
    }
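
The request above can also be assembled programmatically. The following Python sketch builds the POST request with the standard library but does not send it. It assumes the API accepts an OAuth 2.0 access token in an Authorization: Bearer header (for example, a token printed by gcloud auth print-access-token); the function name build_batch_request and the sample project, region, and token values are hypothetical.

```python
import json
import urllib.request


def build_batch_request(project_id: str, region: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) the batches.create request for the SparkPi example."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/batches"
    )
    body = {
        "sparkBatch": {
            "args": ["1000"],
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "mainClass": "org.apache.spark.examples.SparkPi",
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_batch_request("my-project", "us-central1", "PLACEHOLDER_TOKEN")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen(request) would send it; that step is left out here so the sketch runs without credentials.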
    

    You should receive a JSON response similar to the following:

    {
      "name":"projects/project-id/locations/region/batches/batch-id",
      "uuid":"uuid",
      "createTime":"2021-07-22T17:03:46.393957Z",
      "sparkBatch":{
        "mainClass":"org.apache.spark.examples.SparkPi",
        "args":[
          "1000"
        ],
        "jarFileUris":[
          "file:///usr/lib/spark/examples/jars/spark-examples.jar"
        ]
      },
      "runtimeInfo":{
        "outputUri":"gs://dataproc-.../driveroutput"
      },
      "state":"SUCCEEDED",
      "stateTime":"2021-07-22T17:06:30.301789Z",
      "creator":"account-email-address",
      "runtimeConfig":{
        "properties":{
          "spark:spark.executor.instances":"2",
          "spark:spark.driver.cores":"2",
          "spark:spark.executor.cores":"2",
          "spark:spark.app.name":"projects/project-id/locations/region/batches/batch-id"
        }
      },
      "environmentConfig":{
        "peripheralsConfig":{
          "sparkHistoryServerConfig":{
          }
        }
      },
      "operation":"projects/project-id/regions/region/operation-id"
    }
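
A script that receives a response like the one above usually needs only a few fields. The following Python sketch reads the state, the driver output URI, and the time elapsed between createTime and stateTime. The sample input is a trimmed copy of the response shown above, with the elided output URI left as is; the function names are hypothetical.

```python
import json
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_batch(response: dict) -> dict:
    """Extract the state, output URI, and seconds from creation to last state change."""
    created = parse_timestamp(response["createTime"])
    state_changed = parse_timestamp(response["stateTime"])
    return {
        "state": response["state"],
        "output_uri": response.get("runtimeInfo", {}).get("outputUri"),
        "seconds_to_state": (state_changed - created).total_seconds(),
    }


# Trimmed copy of the sample response; the output URI is elided in the source.
sample = json.loads("""
{
  "createTime": "2021-07-22T17:03:46.393957Z",
  "runtimeInfo": {"outputUri": "gs://dataproc-.../driveroutput"},
  "state": "SUCCEEDED",
  "stateTime": "2021-07-22T17:06:30.301789Z"
}
""")

summary = summarize_batch(sample)
print(summary)
```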