PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.
Checking interpreter version and modules
The following check_python_env.py
sample program checks the Linux user running
the job, the Python interpreter, and available modules.
import getpass
import sys
import imp

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
    print(imp.find_module(package))
Run the program
REGION=region
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=${REGION} \
    -- pandas scipy
Sample output
This job is running as "root".
('/usr/bin/python', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))
(None, '/usr/local/lib/python2.7/dist-packages/pandas', ('', '', 5))
(None, '/usr/local/lib/python2.7/dist-packages/scipy', ('', '', 5))
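Note that the imp module used in the sample is deprecated in Python 3 and removed in Python 3.12. If your cluster's default interpreter is Python 3, the same check can be written with importlib; the following is a sketch, not part of the Dataproc samples:

import getpass
import importlib.util
import sys

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)

# Report where each requested package would be imported from, or None if it is absent.
for package in sys.argv[1:]:
    spec = importlib.util.find_spec(package)
    print(package, spec.origin if spec else None)

Submit it the same way as check_python_env.py, passing package names after the -- separator.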
Dataproc image Python environments
Dataproc image version 1.5
Miniconda3 is installed on Dataproc 1.5 clusters.
The default interpreter is Python 3.7, located on the VM instance at
/opt/conda/miniconda3/bin/python3.7. Python 2.7 is also
available at /usr/bin/python2.7.
You can install Conda and PIP packages in the base
environment or
set up your own Conda environment on the cluster using
Conda-related cluster properties.
To install Anaconda3 instead of Miniconda3, choose the
Anaconda optional component,
and install Conda and PIP packages in the base
environment or set up your
own Conda environment on the cluster using
Conda-related cluster properties.
Example
REGION=region
gcloud beta dataproc clusters create my-cluster \
    --image-version=1.5 \
    --region=${REGION} \
    --optional-components=ANACONDA \
    --properties=^#^dataproc:conda.packages='pytorch==1.0.1,visions==0.7.1'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
When you install the Anaconda3 optional component, Miniconda3 is removed
from the cluster and /opt/conda/anaconda3/bin/python3.6
becomes the
default Python interpreter for PySpark jobs. You cannot change the Python
interpreter version of the optional component.
To use Python 2.7 as the default interpreter on 1.5 clusters,
do not use the
Anaconda optional component
when creating the cluster. Instead, use the
Conda initialization action
to install Miniconda2 and use
Conda-related cluster properties
to install Conda and PIP packages in the base
environment or set up your
own Conda environment on the cluster.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.5 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=2' \
    --metadata='MINICONDA_VERSION=latest' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh \
    --properties=^#^dataproc:conda.packages='pytorch==1.0.1,visions==0.7.1'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
Dataproc image version 2.0
Miniconda3 is installed on Dataproc 2.0 clusters.
The default Python interpreter is Python 3.8, located on the VM instance at
/opt/conda/miniconda3/bin/python3.8
. Python 2.7 is also
available at /usr/bin/python2.7
.
You can install Conda and PIP packages in the base
environment or set up your
own Conda environment on the cluster using
Conda-related cluster properties.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=2.0 \
    --region=${REGION} \
    --properties=^#^dataproc:conda.packages='pytorch==1.7.1,coverage==5.5'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
Note: Anaconda is not available for Dataproc 2.0 clusters.
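Because the image 2.0 base environment runs Python 3.8, a job can also verify that the pinned package versions were installed by using importlib.metadata from the standard library. The following is a minimal sketch, not part of the Dataproc samples:

import importlib.metadata
import sys

print('Interpreter:', sys.executable)

# Print the installed version of each distribution name passed as a job argument.
for package in sys.argv[1:]:
    try:
        print(package, importlib.metadata.version(package))
    except importlib.metadata.PackageNotFoundError:
        print(package, 'is not installed')

Submit it like check_python_env.py, passing distribution names (for example, coverage or datasets) after the -- separator.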
Choosing a Python interpreter for a job
If multiple Python interpreters are installed on your cluster, the system
runs /etc/profile.d/effective-python.sh
, which exports the PYSPARK_PYTHON
environment variable to choose the default Python interpreter for your PySpark
jobs. If you need a non-default Python interpreter for a PySpark job, set the
spark.pyspark.python and spark.pyspark.driver.python properties to the
required interpreter (for example, "python2.7" or "python3.6") when you
submit the job to your cluster.
Example
REGION=region
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=${REGION} \
    --properties="spark.pyspark.python=python2.7,spark.pyspark.driver.python=python2.7"
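The spark.pyspark.python property applies to PySpark workers (and to the driver, unless spark.pyspark.driver.python overrides it), so it can be useful to confirm which interpreter each side resolved. The following sketch does that from inside a job; the file name interpreter_check.py is illustrative only:

import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('interpreter-check').getOrCreate()

# Interpreter running the driver (spark.pyspark.driver.python, if set).
print('Driver interpreter:', sys.executable)

# Interpreters running the executors (spark.pyspark.python).
executor_interpreters = (
    spark.sparkContext
    .parallelize(range(2), 2)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print('Executor interpreters:', executor_interpreters)

spark.stop()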
Python with sudo
If you SSH into a cluster node that has Miniconda or Anaconda installed and
run sudo python --version, the Python version displayed can differ from the
version displayed by python --version.
This version difference can occur because sudo
uses the default system Python
/usr/bin/python
, and does not execute
/etc/profile.d/effective-python.sh
to initialize the Python environment.
For a consistent experience when using sudo
, locate the Python path set in
/etc/profile.d/effective-python.sh
, then run the env
command to set the PATH
to this Python path. Here is a 1.5 cluster example:
sudo env PATH=/opt/conda/default/bin:${PATH} python --version
Using Conda-related cluster properties
You can customize the Conda environment during cluster creation using Conda-related cluster properties.
There are two mutually exclusive ways to customize the Conda environment when you create a Dataproc cluster:
Use the dataproc:conda.env.config.uri cluster property to create and activate a new Conda environment on the cluster, or
Use the dataproc:conda.packages and dataproc:pip.packages cluster properties to add Conda and PIP packages, respectively, to the Conda base environment on the cluster.
Conda-related cluster properties
dataproc:conda.env.config.uri: The absolute path to a Conda environment YAML config file located in Cloud Storage. This file will be used to create and activate a new Conda environment on the cluster.
Example:
Get or create a Conda environment.yaml config file. You can manually create the file, use an existing file, or export an existing Conda environment into an environment.yaml file as shown below.
conda env export --name=env-name > environment.yaml
Copy the config file to your Cloud Storage bucket.
gsutil cp environment.yaml gs://bucket-name/environment.yaml
Create the cluster and point to your environment config file in Cloud Storage.
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --properties='dataproc:conda.env.config.uri=gs://bucket-name/environment.yaml' \
    ... other flags ...
dataproc:conda.packages: A list of Conda packages with specific versions to be installed in the base environment, formatted as pkg1==v1,pkg2==v2.... If Conda fails to resolve conflicts with existing packages in the base environment, the conflicting packages will not be installed.
Notes:
The dataproc:conda.packages and dataproc:pip.packages cluster properties cannot be used with the dataproc:conda.env.config.uri cluster property.
When specifying multiple packages (separated by a comma), you must specify an alternate delimiter character (see cluster property Formatting). The following example specifies "#" as the delimiter character to pass multiple, comma-separated package names to the dataproc:conda.packages property.
Example:
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --properties='^#^dataproc:conda.packages=pytorch==1.7.1,coverage==5.5' \
    ... other flags ...
dataproc:pip.packages: A list of pip packages with specific versions to be installed in the base environment, formatted as pkg1==v1,pkg2==v2.... Pip will upgrade existing dependencies only if required. Conflicts can cause the environment to be inconsistent.
Notes:
The dataproc:pip.packages and dataproc:conda.packages cluster properties cannot be used with the dataproc:conda.env.config.uri cluster property.
When specifying multiple packages (separated by a comma), you must specify an alternate delimiter character (see cluster property Formatting). The following example specifies "#" as the delimiter character to pass multiple, comma-separated package names to the dataproc:pip.packages property.
Example:
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --properties='^#^dataproc:pip.packages=tokenizers==0.10.1,datasets==1.4.1' \
    ... other flags ...
You can use both dataproc:conda.packages and dataproc:pip.packages when creating a cluster.
Example:
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --image-version=1.5 \
    --metadata='MINICONDA_VARIANT=2' \
    --metadata='MINICONDA_VERSION=latest' \
    --properties=^#^dataproc:conda.packages='pytorch==1.7.1,coverage==5.5'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.4.1' \
    ... other flags ...