You can specify a custom container image to use with Dataproc on GKE. Your custom container image must use one of the Dataproc on GKE base Spark images.
Use a custom container image
To use a Dataproc on GKE custom container image, set the spark.kubernetes.container.image property when you create a Dataproc on GKE virtual cluster or submit a Spark job to the cluster.
- gcloud CLI cluster creation example:
gcloud dataproc clusters gke create "${DP_CLUSTER}" \
    --properties=spark:spark.kubernetes.container.image=custom-image \
    ... other args ...
- gcloud CLI job submit example:
gcloud dataproc jobs submit spark \
    --properties=spark.kubernetes.container.image=custom-image \
    ... other args ...
Custom container image requirements and settings
Base images
You can use Docker tools to build a customized Docker image based on one of the published Dataproc on GKE base Spark images.
Container user
Dataproc on GKE runs Spark containers as the Linux spark user with UID 1099 and GID 1099. Use this UID and GID for filesystem permissions. For example, if you add a jar file at /opt/spark/jars/my-lib.jar in the image as a workload dependency, you must give the spark user read permission to the file.
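One way to satisfy the read-permission requirement is to fix the file mode in the build context before the file is copied into the image. The following is a minimal sketch, not a required procedure: the staging directory and jar name are hypothetical, and it assumes the file mode is carried into the image when the file is copied.

```shell
# Sketch: make a jar readable by every user, including the spark user
# (UID/GID 1099), before it is copied into the image.
# The staging directory and jar name are hypothetical placeholders.
staging_dir="$(mktemp -d)"
jar="${staging_dir}/my-lib.jar"
touch "${jar}"

# Start from an owner-only mode to show the case that needs fixing.
chmod 600 "${jar}"

# Grant read permission to user, group, and others.
chmod a+r "${jar}"

# Show the resulting mode.
ls -l "${jar}"
```

An alternative is to set ownership at copy time, as the sample Dockerfile on this page does with COPY --chown=spark:spark.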
Components
Java: The JAVA_HOME environment variable points to the location of the Java installation. The current default value is /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64, which is subject to change (see the Dataproc release notes for updated information).
- If you customize the Java environment, make sure that JAVA_HOME is set to the correct location and PATH includes the path to binaries.
Python: Dataproc on GKE base Spark images have Miniconda3 installed at /opt/conda. CONDA_HOME points to this location, ${CONDA_HOME}/bin is included in PATH, and PYSPARK_PYTHON is set to ${CONDA_HOME}/python.
- If you customize Conda, make sure that CONDA_HOME points to the Conda home directory, ${CONDA_HOME}/bin is included in PATH, and PYSPARK_PYTHON is set to ${CONDA_HOME}/python.
- You can install, remove, and update packages in the default base environment, or create a new environment, but it is strongly recommended that the environment include all packages installed in the base environment of the base container image.
- If you add Python modules, such as a Python script with utility functions, to the container image, include the module directories in PYTHONPATH.
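After customizing Conda, a quick sanity check inside the container can confirm that the PATH requirement still holds. The sketch below is illustrative only: path_contains is a hypothetical helper, and /opt/conda is used as a fallback only because it is the base image default described above.

```shell
# Return 0 if directory $1 appears as an entry in the colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical check: fall back to the base image's default Conda location
# when CONDA_HOME is not set in the current shell.
conda_home="${CONDA_HOME:-/opt/conda}"
if path_contains "${conda_home}/bin" "${PATH}"; then
  echo "OK: ${conda_home}/bin is in PATH"
else
  echo "WARNING: ${conda_home}/bin is not in PATH"
fi
```

The same helper could be reused to check that the Java binaries directory is in PATH after customizing JAVA_HOME.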
Spark: Spark is installed in /usr/lib/spark, and SPARK_HOME points to this location. Spark cannot be customized. If it is changed, the container image will be rejected or fail to operate correctly.
Jobs: You can customize Spark job dependencies. SPARK_EXTRA_CLASSPATH defines the extra classpath for Spark JVM processes. Recommendation: put jars under /opt/spark/jars, and set SPARK_EXTRA_CLASSPATH to /opt/spark/jars/*.
If you embed the job jar in the image, the recommended directory is /opt/spark/job. When you submit the job, you can reference it with a local path, for example, file:///opt/spark/job/my-spark-job.jar.
Cloud Storage connector: The Cloud Storage connector is installed at /usr/lib/spark/jars.
Utilities: The procps and tini utility packages are required to run Spark. These utilities are included in the base Spark images, so custom images do not need to re-install them.
Entrypoint: Dataproc on GKE ignores any changes made to the ENTRYPOINT and CMD primitives in the container image.
Initialization scripts: You can add an optional initialization script at /opt/init-script.sh. An initialization script can download files from Cloud Storage, start a proxy within the container, call other scripts, and perform other startup tasks.
The entrypoint script calls the initialization script with all command line args ($@) before starting the Spark driver, Spark executor, and other processes. The initialization script can select the type of Spark process based on the first arg ($1): possible values include spark-submit for driver containers, and executor for executor containers.
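The dispatch on the first argument can be sketched as follows. This is a hypothetical /opt/init-script.sh, not a required layout: the function name and the echoed messages are placeholders for real startup tasks, and only the spark-submit and executor values come from the description above.

```shell
#!/bin/bash
# Hypothetical /opt/init-script.sh: branch on the Spark process type.
# The entrypoint passes its command line args, so $1 identifies the process.

init_for_process() {
  case "${1:-}" in
    spark-submit)
      # Driver-only setup would go here.
      echo "init: driver container"
      ;;
    executor)
      # Executor-only setup would go here.
      echo "init: executor container"
      ;;
    *)
      # Any other process type: run only the common setup.
      echo "init: other process type '${1:-}'"
      ;;
  esac
}

init_for_process "$@"
```

Keeping a catch-all branch means the script does not fail if the entrypoint passes a first argument other than the two values listed.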
Configs: Spark configs are located under /etc/spark/conf. The SPARK_CONF_DIR environment variable points to this location.
Don't customize Spark configs in the container image. Instead, submit any properties via the Dataproc on GKE API for the following reasons:
- Some properties, such as executor memory size, are determined at runtime, not at container image build time; they must be injected by Dataproc on GKE.
- Dataproc on GKE places restrictions on the properties supplied by users.
Dataproc on GKE mounts configs from configMap into /etc/spark/conf in the container, overriding settings embedded in the image.
Base Spark images
Dataproc supports the following base Spark container images:
- Spark 2.4: ${REGION}-docker.pkg.dev/cloud-dataproc/spark/dataproc_1.5
- Spark 3.1: ${REGION}-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.0
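The ${REGION} placeholder in these paths is filled in with an Artifact Registry region. As a small illustration, the following sketch expands the Spark 3.1 path; the region value and the latest tag are assumptions taken from the sample Dockerfile on this page, and the variable names are hypothetical.

```shell
# Assumption: us-central1 is the example region used by the sample Dockerfile.
REGION="us-central1"

# Base image name for Spark 3.1, from the list above.
SPARK_BASE="dataproc_2.0"

BASE_IMAGE="${REGION}-docker.pkg.dev/cloud-dataproc/spark/${SPARK_BASE}"

# Print the FROM line that a Dockerfile would start with.
echo "FROM ${BASE_IMAGE}:latest"
```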
Sample custom container image build
Sample Dockerfile
FROM us-central1-docker.pkg.dev/cloud-dataproc/spark/dataproc_2.0:latest
# Change to root temporarily so that it has permissions to create dirs and copy
# files.
USER root
# Add a BigQuery connector jar.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
&& chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
COPY --chown=spark:spark \
spark-bigquery-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}"
# Install Cloud Storage client Conda package.
RUN "${CONDA_HOME}/bin/conda" install google-cloud-storage
# Add a custom Python file.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"
# Add an init script.
COPY --chown=spark:spark init-script.sh /opt/init-script.sh
# (Optional) Set user back to `spark`.
USER spark
Build the container image
Run the following commands in the Dockerfile directory:
- Set the image (example: us-central1-docker.pkg.dev/my-project/spark/spark-test-image:latest) and change to the build directory.
IMAGE=custom-container-image
BUILD_DIR=$(mktemp -d)
cd "${BUILD_DIR}"
- Download the BigQuery connector.
gcloud storage cp \
    gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .
- Create a Python example file.
cat >test_util.py <<'EOF'
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF
- Create an example init script.
cat >init-script.sh <<EOF
echo "hello world" >/tmp/init-script.out
EOF
- Build and push the image.
docker build -t "${IMAGE}" . && docker push "${IMAGE}"
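Because the sample Dockerfile copies three files from the build context, a missing file only surfaces partway through the build. A pre-flight check along the following lines may help catch that earlier. It is a sketch only: check_build_context is a hypothetical helper, and the file list is taken from the COPY instructions in the sample Dockerfile.

```shell
# Hypothetical helper: verify that directory $1 holds the Dockerfile and
# every file the sample Dockerfile copies. Prints each missing file and
# returns non-zero if any is absent.
check_build_context() {
  dir="$1"
  missing=0
  for f in Dockerfile \
           spark-bigquery-with-dependencies_2.12-0.22.2.jar \
           test_util.py \
           init-script.sh; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing: ${f}"
      missing=1
    fi
  done
  return "${missing}"
}

# Usage: check the current directory before running docker build.
check_build_context . || echo "Build context is incomplete."
```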