Customize your Spark job runtime environment with Docker on YARN

The Dataproc Docker on YARN feature lets you create and use a Docker image to customize your Spark job runtime environment. The image can include customizations to Java, Python, and R dependencies, and to your job JAR.

Limitations

This feature is not available or supported with:

Dataproc image versions earlier than 2.0.49 (not available in 1.5 images)
MapReduce jobs (only Spark jobs are supported)
Spark client mode (only Spark cluster mode is supported)
Kerberos clusters: cluster creation fails if you create a cluster with Docker on YARN and Kerberos enabled.
Customizations of JDK, Hadoop, and Spark: the host JDK, Hadoop, and Spark are used, not your customizations.
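
The minimum image version listed above can be checked before you rely on the feature. The following is a minimal sketch, not part of Dataproc tooling: `version_at_least` is a hypothetical helper, and it assumes GNU `sort -V` is available and that image versions look like `2.0.49` or `2.0.49-debian10`.

```shell
# Hypothetical helper: succeeds when VERSION ($1) is at least MINIMUM ($2).
# Strips an OS suffix such as "-debian10" before comparing, and relies on
# GNU `sort -V` for version ordering.
version_at_least() {
  version="${1%%-*}"
  minimum="$2"
  # The lower of the two versions sorts first; if that is the minimum,
  # then VERSION >= MINIMUM.
  lowest="$(printf '%s\n%s\n' "${minimum}" "${version}" | sort -V | head -n 1)"
  [ "${lowest}" = "${minimum}" ]
}

# Example use (CLUSTER_NAME and REGION are placeholders):
#   image_version="$(gcloud dataproc clusters describe CLUSTER_NAME \
#       --region=REGION --format='value(config.softwareConfig.imageVersion)')"
#   version_at_least "${image_version}" 2.0.49 \
#     || echo "Docker on YARN is not supported on this image" >&2
```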

Create a Docker image

The first step in customizing your Spark environment is building a Docker image.

Dockerfile

You can use the following Dockerfile as an example, making changes and additions to meet your needs.

FROM debian:10-slim

# Suppress interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Required: Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Optional: Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY *.jar "${SPARK_EXTRA_JARS_DIR}"

# Optional: Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PYSPARK_DRIVER_PYTHON=${CONDA_HOME}/bin/python

ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py39_4.10.3-Linux-x86_64.sh .
RUN bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Optional: Install Conda packages.
#
# The following packages are installed in the default image. It is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      fastavro \
      fastparquet \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-bigtable \
      google-cloud-container \
      google-cloud-datacatalog \
      google-cloud-dataproc \
      google-cloud-datastore \
      google-cloud-language \
      google-cloud-logging \
      google-cloud-monitoring \
      google-cloud-pubsub \
      google-cloud-redis \
      google-cloud-spanner \
      google-cloud-speech \
      google-cloud-storage \
      google-cloud-texttospeech \
      google-cloud-translate \
      google-cloud-vision \
      koalas \
      matplotlib \
      nltk \
      numba \
      numpy \
      openblas \
      orc \
      pandas \
      pyarrow \
      pysal \
      pytables \
      python \
      regex \
      requests \
      rtree \
      scikit-image \
      scikit-learn \
      scipy \
      seaborn \
      sqlalchemy \
      sympy \
      virtualenv

# Optional: Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Required: Create the 'yarn_docker_user' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 yarn_docker_user
RUN useradd -u 1099 -g 1099 -d /home/yarn_docker_user -m yarn_docker_user
USER yarn_docker_user
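
Because the Dockerfile requires the `yarn_docker_user` user and group to have ID 1099 and a home directory, it can be worth checking a built image before pushing it. The sketch below is one possible check, not an official one: `check_yarn_docker_user` is a hypothetical helper that validates a single passwd-format line, and the commented usage assumes `getent` is present in the image (it is part of the base Debian system).

```shell
# Hypothetical helper: reads one passwd(5)-format line on stdin and succeeds
# only if it describes yarn_docker_user with UID 1099, GID 1099, and a
# non-empty home directory.
check_yarn_docker_user() {
  IFS=: read -r name passwd uid gid gecos home shell || [ -n "${name}" ] || return 1
  [ "${name}" = "yarn_docker_user" ] \
    && [ "${uid}" = "1099" ] \
    && [ "${gid}" = "1099" ] \
    && [ -n "${home}" ]
}

# Example use against a locally built image (IMAGE is your image URI):
#   docker run --rm "${IMAGE}" getent passwd yarn_docker_user | check_yarn_docker_user \
#     || echo "yarn_docker_user is missing or has the wrong UID/GID" >&2
```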

Build and push the image

The following commands build and push the example Docker image. You can change them to match your customizations.

# Increase the version number when there is a change to avoid referencing
# a cached older image. Avoid reusing the version number, including the default
# `latest` version.
IMAGE=gcr.io/my-project/my-image:1.0.1

# Download the BigQuery connector.
gcloud storage cp \
  gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .

# Download the Miniconda3 installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh

# Python module example:
cat >test_util.py <<EOF
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
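
The comments in the script above advise against reusing a version number, including the default `latest` tag. A small guard can enforce that before the build step. This is an illustrative sketch; `validate_image_tag` is a hypothetical helper, not a Docker or gcloud command.

```shell
# Hypothetical helper: fails when the image reference has no tag or uses
# the `latest` tag, so that every build gets an explicit version.
validate_image_tag() {
  image="$1"
  # Only look for a tag in the last path segment, so that a registry port
  # such as "localhost:5000" is not mistaken for a tag.
  last_segment="${image##*/}"
  case "${last_segment}" in
    *:*) tag="${last_segment##*:}" ;;
    *)   tag="" ;;
  esac
  if [ -z "${tag}" ] || [ "${tag}" = "latest" ]; then
    echo "Refusing image without an explicit, non-latest tag: ${image}" >&2
    return 1
  fi
}

# Example use before building:
#   validate_image_tag "${IMAGE}" && docker build -t "${IMAGE}" .
```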

Create a Dataproc cluster

After creating a Docker image that customizes your Spark environment, create a Dataproc cluster that uses the image when running Spark jobs.

gcloud

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=DP_IMAGE \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=DOCKER_IMAGE \
    other flags

Replace the following:

CLUSTER_NAME: the cluster name.
REGION: the cluster region.
DP_IMAGE: the Dataproc image version, which must be 2.0.49 or later (--image-version=2.0 uses a qualified minor version later than 2.0.49).
--optional-components=DOCKER: enables the Docker component on the cluster.
--properties flag:
- dataproc:yarn.docker.enable=true: required property that enables the Dataproc Docker on YARN feature.
- dataproc:yarn.docker.image: optional property that specifies your DOCKER_IMAGE using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}.
  Example:
```
dataproc:yarn.docker.image=gcr.io/project-id/image:1.0.1
```
  Requirement: you must host your Docker image on Container Registry or Artifact Registry. Dataproc cannot fetch containers from other registries.
  
  Recommendation: add this property when you create your cluster to cache your Docker image and avoid YARN timeouts later when you submit a job that uses the image.
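
The registry requirement above can be checked from the image URI alone. The following sketch assumes the usual hostnames: `gcr.io` and its regional variants (for example `us.gcr.io`) for Container Registry, and `LOCATION-docker.pkg.dev` for Artifact Registry. `is_supported_registry` is a hypothetical helper and only inspects the hostname; it does not confirm that the image exists.

```shell
# Hypothetical helper: succeeds when the image URI points at a Container
# Registry or Artifact Registry hostname.
is_supported_registry() {
  # The hostname is everything before the first '/'.
  host="${1%%/*}"
  case "${host}" in
    gcr.io|*.gcr.io|*-docker.pkg.dev) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use:
#   is_supported_registry "gcr.io/project-id/image:1.0.1" \
#     || echo "Dataproc cannot pull from this registry" >&2
```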

When dataproc:yarn.docker.enable is set to true, Dataproc updates Hadoop and Spark configurations to enable the Docker on YARN feature on the cluster. For example, spark.submit.deployMode is set to cluster, and spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS and spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS are set to mount directories from the host into the container.
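
To confirm which values Dataproc wrote, you can read the Spark defaults file on a cluster node. The sketch below is an assumption-laden example: `spark_conf_value` is a hypothetical helper, and `/etc/spark/conf/spark-defaults.conf` is assumed to be the location of the Spark defaults file on the node. It accepts both `key value` and `key=value` lines.

```shell
# Hypothetical helper: prints the last value set for KEY ($1) in a
# spark-defaults style file ($2). Prints nothing if the key is absent.
spark_conf_value() {
  awk -v key="$1" '
    /^[ \t]*#/ { next }                      # Skip comment lines.
    {
      line = $0
      sub(/^[ \t]+/, "", line)               # Trim leading whitespace.
      if (index(line, key) != 1) next        # Line must start with the key.
      rest = substr(line, length(key) + 1)
      if (rest !~ /^[ \t=]/) next            # Reject longer keys with the same prefix.
      sub(/^[ \t]*=?[ \t]*/, "", rest)       # Drop the separator.
      value = rest
      found = 1
    }
    END { if (found) print value }
  ' "$2"
}

# Example use on a cluster node (file path is an assumption):
#   spark_conf_value spark.submit.deployMode /etc/spark/conf/spark-defaults.conf
```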

Submit a Spark job to the cluster

After creating a Dataproc cluster, submit a Spark job that uses your Docker image to the cluster. The example in this section submits a PySpark job to the cluster.

Set job properties:

# Set the Docker image URI.
IMAGE=gcr.io/my-project/my-image:1.0.1  # Example value; replace with your image URI.

# Required: Use `#` as the delimiter for properties to avoid conflicts.
JOB_PROPERTIES='^#^'

# Required: Set Spark properties with the Docker image.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"

# Optional: Add custom jars to Spark classpath. Don't set these properties if
# there are no customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.driver.extraClassPath=/opt/spark/jars/*"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executor.extraClassPath=/opt/spark/jars/*"

# Optional: Set custom PySpark Python path only if there are customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.python=/opt/miniconda3/bin/python"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.driver.python=/opt/miniconda3/bin/python"

# Optional: Set custom Python module path only if there are customizations.
# Since the `PYTHONPATH` environment variable defined in the Dockerfile is
# overridden by Spark, it must be set as a job property.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.PYTHONPATH=/opt/python/packages"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.PYTHONPATH=/opt/python/packages"
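
The `^#^` prefix used above tells gcloud to split the property list on `#` instead of the default comma. As an alternative to repeated string concatenation, the same value can be assembled by a small function. This is a sketch only: `build_job_properties` is a hypothetical helper, and it emits the list without a leading empty entry (`^#^key=value#key=value`).

```shell
# Hypothetical helper: joins its KEY=VALUE arguments with '#' and prefixes
# the result with '^#^' so gcloud uses '#' as the list delimiter.
build_job_properties() {
  props='^#^'
  first=1
  for kv in "$@"; do
    if [ "${first}" -eq 1 ]; then
      props="${props}${kv}"
      first=0
    else
      props="${props}#${kv}"
    fi
  done
  printf '%s\n' "${props}"
}

# Example use with the two required properties:
#   JOB_PROPERTIES="$(build_job_properties \
#     "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}" \
#     "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}")"
```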

Notes:

See related property information in Launching Applications Using Docker Containers.

gcloud

Submit the job to the cluster.

gcloud dataproc jobs submit pyspark PYFILE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties="${JOB_PROPERTIES}"

Replace the following:

PYFILE: the file path to your PySpark job file. It can be a local file path or the URI of the file in Cloud Storage (gs://BUCKET_NAME/PySpark filename).
CLUSTER_NAME: the cluster name.
REGION: the cluster region.
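
If you need a PYFILE to try the setup with, the following sketch writes a small PySpark job that imports the `test_util` module copied into the example image and reports which Python interpreter the driver and an executor use. The function name `write_example_job`, the output file name, and the app name are all illustrative assumptions; the job only works if the optional Python settings from the job properties are applied.

```shell
# Hypothetical helper: writes an example PySpark job to the given path.
write_example_job() {
  cat >"$1" <<'EOF'
import sys

from pyspark.sql import SparkSession

import test_util  # Provided by the image under /opt/python/packages.


def executor_python(_):
    # Runs on an executor and reports its interpreter path.
    import sys
    return sys.executable


spark = SparkSession.builder.appName("docker-on-yarn-check").getOrCreate()
test_util.hello("driver")
print("driver python:", sys.executable)
print("executor python:",
      spark.sparkContext.parallelize([0], 1).map(executor_python).collect())
spark.stop()
EOF
}

# Example use:
#   write_example_job docker_on_yarn_check.py
#   Then pass docker_on_yarn_check.py as PYFILE.
```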
