PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter version and dependencies.
Check interpreter version and modules
The following check_python_env.py sample program checks the Linux user running the job, the Python interpreter, and available modules.
import getpass
import sys
import imp

# Report the Linux user the job runs as.
print('This job is running as "{}".'.format(getpass.getuser()))

# Report the interpreter path and version.
print(sys.executable, sys.version_info)

# Check that each module passed as a job argument can be found.
for package in sys.argv[1:]:
    print(imp.find_module(package))
Run the program
REGION=region
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=${REGION} \
    -- pandas scipy
Sample output
This job is running as "root".
('/usr/bin/python', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))
(None, '/usr/local/lib/python2.7/dist-packages/pandas', ('', '', 5))
(None, '/usr/local/lib/python2.7/dist-packages/scipy', ('', '', 5))
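The sample output above comes from a Python 2 interpreter. The imp module used in the sample program is deprecated on Python 3 and removed in recent releases; if your cluster's default interpreter is Python 3, you can run the same check with importlib instead. The following variant is a sketch, not part of the sample program above:

import getpass
import importlib.util
import sys

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)

# Print the install location of each module passed as a job argument,
# or "not found" if the module cannot be located.
for package in sys.argv[1:]:
    spec = importlib.util.find_spec(package)
    print(package, spec.origin if spec else 'not found')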
Dataproc image Python environments
The following sections describe the Python environments for different Dataproc image version clusters.
Dataproc image version 1.5
Miniconda3 is installed on Dataproc 1.5 clusters. The default interpreter is Python 3.7, which is located on the VM instance at /opt/conda/miniconda3/bin/python3.7. Python 2.7 is also available at /usr/bin/python2.7.
You can install conda and pip packages in the base environment or set up your own conda environment on the cluster using conda-related cluster properties.
To use Python 2.7 as the default interpreter on 1.5 clusters, don't use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda2, and use conda-related cluster properties to install conda and pip packages in the base environment or set up your own conda environment on the cluster.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.5 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=2' \
    --metadata='MINICONDA_VERSION=latest' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh \
    --properties=^#^dataproc:conda.packages='pytorch==1.0.1,visions==0.7.1'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
Dataproc image version 2.x
conda is installed on Dataproc 2.x clusters. The default Python3 interpreter is located on the VM instance under /opt/conda/default/bin. The Python version included in each Dataproc image version is documented in the Dataproc image version lists. The non-default Python interpreter from the OS is available under /usr/bin/.
You can install conda and pip packages in the base environment or set up your own conda environment on the cluster using conda-related cluster properties.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=2.0 \
    --region=${REGION} \
    --properties=^#^dataproc:conda.packages='pytorch==1.7.1,coverage==5.5'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.5.0'
Note: Anaconda is not available for Dataproc 2.0 clusters.
Choosing a Python interpreter for a job
If multiple Python interpreters are installed on your cluster, the system runs /etc/profile.d/effective-python.sh, which exports the PYSPARK_PYTHON environment variable to choose the default Python interpreter for your PySpark jobs. If you need a non-default Python interpreter for a PySpark job, set the spark.pyspark.python and spark.pyspark.driver.python properties to the required Python version number (for example, "python2.7" or "python3.6") when you submit the job to your cluster.
Example
REGION=region
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=${REGION} \
    --properties="spark.pyspark.python=python2.7,spark.pyspark.driver.python=python2.7"
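To confirm which interpreter the driver and the executors actually use at runtime, you can print sys.executable on both sides from inside a job. The following is a minimal sketch (a hypothetical check_interpreters.py, not one of the Dataproc samples); it assumes a standard SparkSession is available:

import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('interpreter-check').getOrCreate()

# Interpreter running the driver process.
print('driver interpreter: {}'.format(sys.executable))

# Interpreter(s) running the executor processes (normally a single path).
executor_interpreters = (
    spark.sparkContext
    .parallelize(range(8))
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print('executor interpreters: {}'.format(executor_interpreters))

spark.stop()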
Python with sudo
If you SSH into a cluster node that has Miniconda or Anaconda installed, when you run sudo python --version, the displayed Python version can be different from the version displayed by python --version. This version difference can occur because sudo uses the default system Python, /usr/bin/python, and does not execute /etc/profile.d/effective-python.sh to initialize the Python environment.
For a consistent experience when using sudo, locate the Python path set in /etc/profile.d/effective-python.sh, then run the env command to set the PATH to this Python path. Here is a 1.5 cluster example:
sudo env PATH=/opt/conda/default/bin:${PATH} python --version
Use conda-related cluster properties
You can customize the conda environment during cluster creation using conda-related cluster properties.
There are two mutually exclusive ways to customize the conda environment when you create a Dataproc cluster:

Use the dataproc:conda.env.config.uri cluster property to create and activate a new conda environment on the cluster, or

Use the dataproc:conda.packages and dataproc:pip.packages cluster properties to add conda and pip packages, respectively, to the conda base environment on the cluster.
conda-related cluster properties
dataproc:conda.env.config.uri: The absolute path to a conda environment YAML config file located in Cloud Storage. This file will be used to create and activate a new conda environment on the cluster.

Example:

Get or create a conda environment.yaml config file. You can manually create the file, use an existing file, or export an existing conda environment into an environment.yaml file as shown below.

conda env export --name=env-name > environment.yaml
Copy the config file to your Cloud Storage bucket.
gcloud storage cp environment.yaml gs://bucket-name/environment.yaml
Create the cluster and point to your environment config file in Cloud Storage.
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --properties='dataproc:conda.env.config.uri=gs://bucket-name/environment.yaml' \
    ... other flags ...
dataproc:conda.packages: A list of conda packages with specific versions to be installed in the base environment, formatted as pkg1==v1,pkg2==v2.... If conda fails to resolve conflicts with existing packages in the base environment, the conflicting packages will not be installed.

Notes:

The dataproc:conda.packages and dataproc:pip.packages cluster properties cannot be used with the dataproc:conda.env.config.uri cluster property.

When specifying multiple packages (separated by a comma), you must specify an alternate delimiter character (see cluster property Formatting). The following example specifies "#" as the delimiter character to pass multiple, comma-separated package names to the dataproc:conda.packages property.
Example:
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --properties='^#^dataproc:conda.packages=pytorch==1.7.1,coverage==5.5' \
    ... other flags ...
dataproc:pip.packages: A list of pip packages with specific versions to be installed in the base environment, formatted as pkg1==v1,pkg2==v2.... pip will upgrade existing dependencies only if required. Conflicts can cause the environment to be inconsistent.

Notes:

The dataproc:pip.packages and dataproc:conda.packages cluster properties cannot be used with the dataproc:conda.env.config.uri cluster property.

When specifying multiple packages (separated by a comma), you must specify an alternate delimiter character (see cluster property Formatting). The following example specifies "#" as the delimiter character to pass multiple, comma-separated package names to the dataproc:pip.packages property.
Example:
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --properties='^#^dataproc:pip.packages=tokenizers==0.10.1,datasets==1.4.1' \
    ... other flags ...
You can use both dataproc:conda.packages and dataproc:pip.packages when creating a cluster.

Example:
REGION=region
gcloud dataproc clusters create cluster-name \
    --region=${REGION} \
    --image-version=1.5 \
    --metadata='MINICONDA_VARIANT=2' \
    --metadata='MINICONDA_VERSION=latest' \
    --properties=^#^dataproc:conda.packages='pytorch==1.7.1,coverage==5.5'#dataproc:pip.packages='tokenizers==0.10.1,datasets==1.4.1' \
    ... other flags ...