Configure the cluster's Python environment

PySpark jobs on Cloud Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.

Checking interpreter version and modules

The following check_python_env.py sample program checks the Python interpreter version and available modules.

import sys
import imp

# Print the interpreter path and version, then locate each module named
# on the command line (for example, pandas and scipy).
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
  print(imp.find_module(package))

Run the program

gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    -- pandas scipy

Sample output

('/usr/bin/python', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))
(None, '/usr/local/lib/python2.7/dist-packages/pandas', ('', '', 5))
(None, '/usr/local/lib/python2.7/dist-packages/scipy', ('', '', 5))
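
The check_python_env.py program reports only the driver environment. To also confirm which interpreter the executors use, you can run a small PySpark job such as the following sketch (the file name check_executor_env.py is hypothetical, and the sketch assumes the cluster's default PySpark configuration), submitting it the same way as check_python_env.py.

import sys

from pyspark import SparkContext

sc = SparkContext()

# Ask each partition for its interpreter path and version, then de-duplicate.
executor_envs = (
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(lambda _: (sys.executable, tuple(sys.version_info[:3])))
      .distinct()
      .collect()
)

print('Driver:', sys.executable, tuple(sys.version_info[:3]))
for executable, version in executor_envs:
    print('Executor:', executable, version)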

Cloud Dataproc image Python environments

Cloud Dataproc image versions 1.0, 1.1, and 1.2

In 1.0, 1.1, and 1.2 image version clusters, the default Python interpreter is Python 2.7, located on the master instance at /usr/bin/python2.7. You can install Python packages with the pip install initialization action (specifying package versions, as shown in the next example, is optional but recommended).

Example

gcloud dataproc clusters create my-cluster \
    --image-version 1.2 \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/pip-install.sh

If you need a different Python interpreter, you can use the conda initialization action to install Miniconda with Conda and PIP packages.

Example

The next command creates a 1.2 image version Cloud Dataproc cluster and installs the latest version of Miniconda3 with Conda and PIP packages:

gcloud dataproc clusters create my-cluster \
    --image-version 1.2 \
    --metadata 'MINICONDA_VARIANT=3' \
    --metadata 'MINICONDA_VERSION=latest' \
    --metadata 'CONDA_PACKAGES=scipy=1.0.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh

Cloud Dataproc image version 1.3

In 1.3 image version clusters, the default Python interpreter is Python 2.7, installed on the master instance at /usr/bin/python2.7. You can install Python packages with the pip install initialization action.

Example

gcloud dataproc clusters create my-cluster \
    --image-version 1.3 \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/pip-install.sh

If you need a different Python interpreter, you can create a cluster with the Anaconda optional component, and use the pip install initialization action to install Conda and PIP packages.

Example

The next command creates a 1.3 image version cluster and installs Anaconda2 along with Conda and PIP packages:

gcloud beta dataproc clusters create my-cluster \
    --image-version 1.3 \
    --optional-components=ANACONDA \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/conda-install.sh,gs://dataproc-initialization-actions/python/pip-install.sh

When you install the Anaconda optional component, the default Python interpreter for PySpark jobs is /opt/conda/anaconda/bin/python2.7.

To use Python 3 as the default interpreter on a 1.3 cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda3 and the Conda and PIP packages on the cluster.

Example

gcloud dataproc clusters create my-cluster \
    --image-version 1.3 \
    --metadata 'MINICONDA_VARIANT=3' \
    --metadata 'MINICONDA_VERSION=latest' \
    --metadata 'CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh

Cloud Dataproc image version 1.4 preview

Miniconda3 is installed on Cloud Dataproc 1.4 clusters. The default Python interpreter is Python 3.6, located on the master instance at /opt/conda/miniconda3/bin/python3.6. Python 2.7 is also available on image version 1.4 clusters at /usr/bin/python2.7.

You can install Conda and PIP packages with the conda initialization action and the pip install initialization action.

Example

gcloud dataproc clusters create my-cluster \
    --image-version preview \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/conda-install.sh,gs://dataproc-initialization-actions/python/pip-install.sh

To install Anaconda3 instead of Miniconda3, choose the Anaconda optional component, and install Conda and PIP packages with the conda initialization action and the pip install initialization action.

Example

gcloud beta dataproc clusters create my-cluster \
    --image-version=preview \
    --optional-components=ANACONDA \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/conda-install.sh,gs://dataproc-initialization-actions/python/pip-install.sh

When you install the Anaconda3 optional component, Miniconda3 is removed from the cluster, and /opt/conda/anaconda3/bin/python3.6 becomes the default Python interpreter for PySpark jobs. You cannot change the Python interpreter version of the optional component.

To use Python 2.7 as the default interpreter on a 1.4 cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda2 and the Conda and PIP packages on the cluster.

Example

gcloud dataproc clusters create my-cluster \
    --image-version preview \
    --metadata 'MINICONDA_VARIANT=2' \
    --metadata 'MINICONDA_VERSION=latest' \
    --metadata 'CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions \
    gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh

Choosing a Python interpreter for a job

If you install multiple Python interpreters on your cluster and a PySpark job needs a non-default interpreter, set the spark.pyspark.python and spark.pyspark.driver.python properties to the name of the required Python executable (for example, "python2.7" or "python3.6") when you submit the job.

Example

gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --properties="spark.pyspark.python=python2.7,spark.pyspark.driver.python=python2.7"