PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.
Checking interpreter version and modules
The following check_python_env.py
sample program checks the Linux user running
the job, the Python interpreter, and available modules.
import getpass
import sys
import imp

# Report the Linux user running the job and the interpreter in use.
print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
# Check that each module named on the command line can be located.
for package in sys.argv[1:]:
  print(imp.find_module(package))
Run the program
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=region \
    -- pandas scipy
Sample output
This job is running as "root".
('/usr/bin/python', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))
(None, '/usr/local/lib/python2.7/dist-packages/pandas', ('', '', 5))
(None, '/usr/local/lib/python2.7/dist-packages/scipy', ('', '', 5))
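Note that the imp module used in check_python_env.py is deprecated in Python 3. On clusters whose default interpreter is Python 3 (1.4+ image versions), you can perform the same check with importlib; the following is a minimal sketch, not the original sample program:

import getpass
import importlib.util
import sys

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
  # find_spec returns a ModuleSpec with the package location, or None if the package is not found.
  spec = importlib.util.find_spec(package)
  print(package, spec.origin if spec else 'not found')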
Dataproc image Python environments
Dataproc image versions 1.0, 1.1, and 1.2
In 1.0, 1.1, and 1.2 image version clusters, the default Python interpreter is Python 2.7, located on the VM instance at /usr/bin/python2.7. You can install Python packages through the pip install initialization action (specifying package versions, as shown in the next example, is optional but recommended).
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.2 \
    --region=${REGION} \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
If you need a different Python interpreter, you can use the conda initialization action to install Miniconda with Conda and PIP packages.
Example
The next command creates a 1.2 image version Dataproc cluster and installs the latest version of Miniconda3 with Conda and PIP packages:
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.2 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=3' \
    --metadata='MINICONDA_VERSION=latest' \
    --metadata='CONDA_PACKAGES=scipy=1.0.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
Dataproc image version 1.3
In 1.3 image version clusters, the default Python interpreter is Python 2.7, installed on the VM instance at /usr/bin/python2.7. You can install Python packages with the pip install initialization action.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.3 \
    --region=${REGION} \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
If you need a different Python interpreter, you can create a cluster with the Anaconda optional component, and use the conda install and pip install initialization actions to install Conda and PIP packages.
Example
The next command creates a 1.3 image version cluster and installs Anaconda2 along with Conda and PIP packages:
REGION=region
gcloud beta dataproc clusters create my-cluster \
    --image-version=1.3 \
    --region=${REGION} \
    --optional-components=ANACONDA \
    --metadata='CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
When you install the Anaconda optional component, the default Python interpreter for PySpark jobs is /opt/conda/anaconda/bin/python2.7.
To use Python 3 as the default interpreter on a 1.3 cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda3 and the Conda and PIP packages on the cluster.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.3 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=3' \
    --metadata='MINICONDA_VERSION=latest' \
    --metadata='CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
Dataproc image version 1.4+
Miniconda3 is installed on Dataproc 1.4+ clusters. The default Python interpreter is Python 3.6 for Dataproc 1.4 and Python 3.7 for Dataproc 1.5, located on the VM instance at /opt/conda/miniconda3/bin/python3.6 and /opt/conda/miniconda3/bin/python3.7, respectively. Python 2.7 is also available at /usr/bin/python2.7.
You can install Conda and PIP packages with the conda initialization action and the pip install initialization action.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=${REGION} \
    --metadata='CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
To install Anaconda3 instead of Miniconda3, choose the Anaconda optional component, and install Conda and PIP packages with the conda initialization action and the pip install initialization action.
Example
REGION=region
gcloud beta dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=${REGION} \
    --optional-components=ANACONDA \
    --metadata='CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
When you install the Anaconda3 optional component, Miniconda3 is removed
from the cluster, and /opt/conda/anaconda3/bin/python3.6
becomes the
default Python interpreter for PySpark jobs. You cannot change the Python
interpreter version of the optional component.
To use Python 2.7 as the default interpreter on a 1.4+ cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda2 and the Conda and PIP packages on the cluster.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=2' \
    --metadata='MINICONDA_VERSION=latest' \
    --metadata='CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
Choosing a Python interpreter for a job
If multiple Python interpreters are installed on your cluster, the system runs /etc/profile.d/effective-python.sh, which exports the PYSPARK_PYTHON environment variable to choose the default Python interpreter for your PySpark jobs. If you need a non-default Python interpreter for a PySpark job, set the spark.pyspark.python and spark.pyspark.driver.python properties to the required Python version (for example, "python2.7" or "python3.6") when you submit the job to your cluster.
Example
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=region \
    --properties="spark.pyspark.python=python2.7,spark.pyspark.driver.python=python2.7"
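If you want a non-default interpreter to apply to all jobs on a cluster rather than to a single job, the same Spark properties can typically be set as cluster defaults at cluster creation time by passing them to the --properties flag with the spark: file prefix, which writes them to spark-defaults.conf. A sketch, assuming a cluster where python2.7 is installed:

gcloud dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=region \
    --properties='spark:spark.pyspark.python=python2.7,spark:spark.pyspark.driver.python=python2.7'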
Python with sudo
If you SSH into a cluster node that has Miniconda or Anaconda installed and run sudo python --version, the displayed Python version can differ from the version displayed by python --version. This difference occurs because sudo uses the default system Python /usr/bin/python and does not execute /etc/profile.d/effective-python.sh to initialize the Python environment.
For a consistent experience when using sudo, locate the Python path set in /etc/profile.d/effective-python.sh, then run the env command to set the PATH to this Python path. Here is a 1.4 cluster example:
sudo env PATH=/opt/conda/default/bin:${PATH} python --version
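Alternatively, you can bypass PATH and invoke the Conda interpreter directly by its full path under sudo (using the same 1.4 cluster path as above):

sudo /opt/conda/default/bin/python --version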