NAME
    gcloud dataproc jobs submit pyspark - submit a PySpark job to a cluster
SYNOPSIS
    gcloud dataproc jobs submit pyspark PY_FILE
        (--cluster=CLUSTER | --cluster-labels=[KEY=VALUE,…])
        [--archives=[ARCHIVE,…]] [--async] [--bucket=BUCKET]
        [--driver-log-levels=[PACKAGE=LEVEL,…]]
        [--driver-required-memory-mb=DRIVER_REQUIRED_MEMORY_MB]
        [--driver-required-vcores=DRIVER_REQUIRED_VCORES] [--files=[FILE,…]]
        [--jars=[JAR,…]] [--labels=[KEY=VALUE,…]]
        [--max-failures-per-hour=MAX_FAILURES_PER_HOUR]
        [--max-failures-total=MAX_FAILURES_TOTAL]
        [--properties=[PROPERTY=VALUE,…]] [--properties-file=PROPERTIES_FILE]
        [--py-files=[PY_FILE,…]] [--region=REGION] [GCLOUD_WIDE_FLAG …]
        [-- JOB_ARGS …]
DESCRIPTION
    Submit a PySpark job to a cluster.
EXAMPLES
    To submit a PySpark job with a local script and custom flags, run:

        gcloud dataproc jobs submit pyspark --cluster=my-cluster my_script.py -- --custom-flag

    To submit a PySpark job that runs a script already present on the cluster, run:

        gcloud dataproc jobs submit pyspark --cluster=my-cluster file:///usr/lib/spark/examples/src/main/python/pi.py -- 100
POSITIONAL ARGUMENTS
    PY_FILE
        Main .py file to run as the driver.

    [-- JOB_ARGS …]
        Arguments to pass to the driver. The '--' argument must be specified
        between gcloud-specific args on the left and JOB_ARGS on the right.
REQUIRED FLAGS
    Exactly one of these must be specified:

    --cluster=CLUSTER
        The Dataproc cluster to submit the job to.

    --cluster-labels=[KEY=VALUE,…]
        Labels of the Dataproc cluster on which to place the job, given as a
        list of KEY=VALUE pairs. Keys must start with a lowercase character
        and contain only hyphens (-), underscores (_), lowercase characters,
        and numbers. Values must contain only hyphens (-), underscores (_),
        lowercase characters, and numbers.
OPTIONAL FLAGS
    --archives=[ARCHIVE,…]
        Comma separated list of archives to be extracted into the working
        directory of each executor. Must be one of the following file
        formats: .zip, .tar, .tar.gz, or .tgz.
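        For example, to ship a bundled dependency archive to each executor
        (the Cloud Storage path is illustrative):

            gcloud dataproc jobs submit pyspark --cluster=my-cluster --archives=gs://my-bucket/deps.tar.gz my_script.py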
    --async
        Return immediately, without waiting for the operation in progress to
        complete.

    --bucket=BUCKET
        The Cloud Storage bucket to stage files in. Defaults to the cluster's
        configured bucket.
    --driver-log-levels=[PACKAGE=LEVEL,…]
        List of key value pairs to configure driver logging, where key is a
        package and value is the log4j log level. For example:
        root=FATAL,com.example=INFO
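        A full invocation might quiet most driver output while keeping Spark
        internals verbose (the packages and levels shown are illustrative):

            gcloud dataproc jobs submit pyspark --cluster=my-cluster --driver-log-levels=root=WARN,org.apache.spark=DEBUG my_script.py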
    --driver-required-memory-mb=DRIVER_REQUIRED_MEMORY_MB
        The memory allocation requested by the job driver in megabytes (MB)
        for execution on the driver node group (used only by clusters with a
        driver node group).

    --driver-required-vcores=DRIVER_REQUIRED_VCORES
        The vCPU allocation requested by the job driver for execution on the
        driver node group (used only by clusters with a driver node group).
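        For example, assuming the target cluster was created with a driver
        node group (the sizes shown are illustrative):

            gcloud dataproc jobs submit pyspark --cluster=my-cluster --driver-required-memory-mb=2048 --driver-required-vcores=2 my_script.py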
    --files=[FILE,…]
        Comma separated list of files to be placed in the working directory
        of both the app driver and executors.

    --jars=[JAR,…]
        Comma separated list of jar files to be provided to the executor and
        driver classpaths.
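        For example, to make a config file and a JAR of JVM dependencies
        available to the job (both paths are illustrative):

            gcloud dataproc jobs submit pyspark --cluster=my-cluster --files=gs://my-bucket/config.yaml --jars=gs://my-bucket/udfs.jar my_script.py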
    --labels=[KEY=VALUE,…]
        List of label KEY=VALUE pairs to add. Keys must start with a
        lowercase character and contain only hyphens (-), underscores (_),
        lowercase characters, and numbers. Values must contain only hyphens
        (-), underscores (_), lowercase characters, and numbers.

    --max-failures-per-hour=MAX_FAILURES_PER_HOUR
        Specifies the maximum number of times a job can be restarted per hour
        in the event of failure. Default is 0 (no retries after job failure).

    --max-failures-total=MAX_FAILURES_TOTAL
        Specifies the maximum total number of times a job can be restarted
        after the job fails. Default is 0 (no retries after job failure).
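        For example, to let a long-running job survive transient failures
        (the retry limits shown are illustrative):

            gcloud dataproc jobs submit pyspark --cluster=my-cluster --max-failures-per-hour=5 --max-failures-total=10 my_script.py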
    --properties=[PROPERTY=VALUE,…]
        List of key value pairs to configure PySpark. For a list of available
        properties, see:
        https://spark.apache.org/docs/latest/configuration.html#available-properties.
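        For example, to size Spark executors at submit time (the values shown
        are illustrative; spark.executor.memory and spark.executor.cores are
        standard Spark properties):

            gcloud dataproc jobs submit pyspark --cluster=my-cluster --properties=spark.executor.memory=4g,spark.executor.cores=2 my_script.py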
    --properties-file=PROPERTIES_FILE
        Path to a local file or a file in a Cloud Storage bucket containing
        configuration properties for the job. The client machine running this
        command must have read permission to the file.

        Specify properties in the form of property=value in the text file.
        For example:

            # Properties to set for the job:
            key1=value1
            key2=value2

            # Comment out properties not used.
            # key3=value3

        If a property is set in both --properties and --properties-file, the
        value defined in --properties takes precedence.

    --py-files=[PY_FILE,…]
        Comma separated list of Python files to be provided to the job. Must
        be one of the following file formats: .py, .zip, or .egg.
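        For example, to ship helper modules alongside the main script (the
        file names are illustrative):

            gcloud dataproc jobs submit pyspark --cluster=my-cluster --py-files=helpers.py,libs.zip my_script.py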
    --region=REGION
        Dataproc region to use. Each Dataproc region constitutes an
        independent resource namespace constrained to deploying instances
        into Compute Engine zones inside the region. Overrides the default
        dataproc/region property value for this command invocation.
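        Rather than passing --region on every invocation, the default can be
        set once via the dataproc/region property (us-central1 is
        illustrative):

            gcloud config set dataproc/region us-central1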
GCLOUD WIDE FLAGS
    These flags are available to all commands: --access-token-file,
    --account, --billing-project, --configuration, --flags-file, --flatten,
    --format, --help, --impersonate-service-account, --log-http, --project,
    --quiet, --trace-token, --user-output-enabled, --verbosity.

    Run $ gcloud help for details.

NOTES
    These variants are also available:

        gcloud alpha dataproc jobs submit pyspark

        gcloud beta dataproc jobs submit pyspark