Configure the cluster's Python environment

PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.
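For example, a job can fail fast with a clear error when the cluster's interpreter or dependencies do not match what the code expects. The following is a minimal sketch (the pandas dependency and the version floor are illustrative, not part of the Dataproc samples):

import sys

# Sketch: abort early if the cluster's interpreter or dependencies do not
# match what this job was written against.
if sys.version_info < (2, 7):
    sys.exit('This job requires Python 2.7 or later; found {}.'.format(sys.version))
try:
    import pandas  # assumed to be installed by an initialization action
except ImportError:
    sys.exit('pandas is not available in the cluster Python environment.')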

Checking interpreter version and modules

The following check_python_env.py sample program checks the Linux user running the job, the Python interpreter, and available modules.

import getpass
import sys
import imp

# Report the Linux user that runs the job, the interpreter path and version,
# and the location of each module name passed as a job argument.
print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
    print(imp.find_module(package))

Run the program

gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    -- pandas scipy

Sample output

This job is running as "root".
('/usr/bin/python', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))
(None, '/usr/local/lib/python2.7/dist-packages/pandas', ('', '', 5))
(None, '/usr/local/lib/python2.7/dist-packages/scipy', ('', '', 5))
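Note that the imp module is deprecated on Python 3 interpreters, which are the default on newer image versions described below. A roughly equivalent check using importlib is sketched here (its output is a ModuleSpec rather than the imp tuples shown above):

import getpass
import sys
import importlib.util

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
    # find_spec() returns None if the module cannot be imported.
    print(importlib.util.find_spec(package))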

Dataproc image Python environments

Dataproc image versions 1.0, 1.1 and 1.2

In 1.0, 1.1, and 1.2 image version clusters, the default Python interpreter is Python 2.7, located on the VM instance at /usr/bin/python2.7. You can install Python packages with the pip install initialization action (specifying package versions, as shown in the next example, is optional but recommended).

Example

gcloud dataproc clusters create my-cluster \
    --image-version 1.2 \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/pip-install.sh
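To confirm that the installed packages are importable from Spark executors, not just on the master node, you can submit a small PySpark job such as the following sketch (the file name verify_packages.py and the packages checked are illustrative):

# verify_packages.py: import the pip-installed packages inside an executor
# task and report their versions back to the driver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('verify-packages').getOrCreate()

def executor_versions(_):
    import pandas
    import scipy
    return 'pandas {} / scipy {}'.format(pandas.__version__, scipy.__version__)

print(spark.sparkContext.parallelize([0], 1).map(executor_versions).collect())
spark.stop()

Submit it the same way as check_python_env.py, for example with gcloud dataproc jobs submit pyspark verify_packages.py --cluster=my-cluster.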

If you need a different Python interpreter, you can use the conda initialization action to install Miniconda with Conda and PIP packages.

Example

The next command creates a 1.2 image version Dataproc cluster and installs the latest version of Miniconda3 with Conda and PIP packages:

gcloud dataproc clusters create my-cluster \
    --image-version 1.2 \
    --metadata 'MINICONDA_VARIANT=3' \
    --metadata 'MINICONDA_VERSION=latest' \
    --metadata 'CONDA_PACKAGES=scipy=1.0.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh

Dataproc image version 1.3

In 1.3 image version clusters, the default Python interpreter is Python 2.7, installed on the VM instance at /usr/bin/python2.7. You can install Python packages with the pip install initialization action.

Example

gcloud dataproc clusters create my-cluster \
    --image-version 1.3 \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/pip-install.sh

If you need a different Python interpreter, you can create a cluster with the Anaconda optional component, and use the conda and pip install initialization actions to install Conda and PIP packages.

Example

The next command creates a 1.3 image version cluster and installs Anaconda2 along with Conda and PIP packages:

gcloud beta dataproc clusters create my-cluster \
    --image-version 1.3 \
    --optional-components=ANACONDA \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/conda-install.sh,gs://dataproc-initialization-actions/python/pip-install.sh

When the Anaconda optional component is installed, the default Python interpreter for PySpark jobs is /opt/conda/anaconda/bin/python2.7.

To use Python 3 as the default interpreter on a 1.3 cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda3 along with Conda and PIP packages on the cluster.

Example

gcloud dataproc clusters create my-cluster \
    --image-version 1.3 \
    --metadata 'MINICONDA_VARIANT=3' \
    --metadata 'MINICONDA_VERSION=latest' \
    --metadata 'CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh

Dataproc image version 1.4

Miniconda3 is installed on Dataproc 1.4 clusters. The default Python interpreter is Python 3.6, located on the VM instance at /opt/conda/miniconda3/bin/python3.6. Python 2.7 is also available on image version 1.4 clusters at /usr/bin/python2.7.

You can install Conda and PIP packages with the conda initialization action and the pip install initialization action.

Example

gcloud dataproc clusters create my-cluster \
    --image-version preview \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/conda-install.sh,gs://dataproc-initialization-actions/python/pip-install.sh

To install Anaconda3 instead of Miniconda3, choose the Anaconda optional component, and install Conda and PIP packages with the conda initialization action and the pip install initialization action.

Example

gcloud beta dataproc clusters create my-cluster \
    --image-version=preview \
    --optional-components=ANACONDA \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/conda-install.sh,gs://dataproc-initialization-actions/python/pip-install.sh

When you install the Anaconda3 optional component, Miniconda3 is removed from the cluster, and /opt/conda/anaconda3/bin/python3.6 becomes the default Python interpreter for PySpark jobs. You cannot change the Python interpreter version of the optional component.

To use Python 2.7 as the default interpreter on a 1.4 cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda2 along with Conda and PIP packages on the cluster.

Example

gcloud dataproc clusters create my-cluster \
    --image-version preview \
    --metadata 'MINICONDA_VARIANT=2' \
    --metadata 'MINICONDA_VERSION=latest' \
    --metadata 'CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh

Choosing a Python interpreter for a job

If multiple Python interpreters are installed on your cluster, the system runs /etc/profile.d/effective-python.sh, which exports the PYSPARK_PYTHON environment variable to choose the default Python interpreter for your PySpark jobs. If you need a non-default Python interpreter for a PySpark job, set the spark.pyspark.python and spark.pyspark.driver.python properties to the required Python version (for example, "python2.7" or "python3.6") when you submit the job to your cluster.

Example

gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --properties \
"spark.pyspark.python=python2.7,spark.pyspark.driver.python=python2.7"

Python with sudo

If you SSH into a cluster node that has Miniconda or Anaconda installed, sudo python --version can display a different Python version than python --version. This difference occurs because sudo uses the default system Python at /usr/bin/python and does not execute /etc/profile.d/effective-python.sh to initialize the Python environment. For a consistent experience when using sudo, locate the Python path set in /etc/profile.d/effective-python.sh, then run the env command to set PATH to that Python path. Here is a 1.4 cluster example:

sudo env PATH=/opt/conda/default/bin:${PATH} python --version