PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.
Checking interpreter version and modules
The following check_python_env.py
sample program checks the Linux user running
the job, the Python interpreter, and available modules.
import getpass
import sys
import imp

# Report the Linux user running the job and the interpreter in use.
print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
# Check that each module named on the command line can be located.
for package in sys.argv[1:]:
  print(imp.find_module(package))
Run the program
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=region \
    -- pandas scipy
Sample output
This job is running as "root".
('/usr/bin/python', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))
(None, '/usr/local/lib/python2.7/dist-packages/pandas', ('', '', 5))
(None, '/usr/local/lib/python2.7/dist-packages/scipy', ('', '', 5))
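Note that the imp module used in check_python_env.py is deprecated in Python 3. On clusters whose default interpreter is Python 3 (1.4+ image versions), you can perform the same check with importlib; the following is a minimal sketch, not the original sample program:

import getpass
import importlib.util
import sys

print('This job is running as "{}".'.format(getpass.getuser()))
print(sys.executable, sys.version_info)
for package in sys.argv[1:]:
  # find_spec returns a ModuleSpec with the package location, or None if the package is not found.
  spec = importlib.util.find_spec(package)
  print(package, spec.origin if spec else 'not found')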
Dataproc image Python environments
Dataproc image versions 1.0, 1.1, and 1.2
In 1.0, 1.1, and 1.2 image version clusters, the default Python interpreter is Python 2.7, located on the VM instance at /usr/bin/python2.7. You can install Python packages through the pip install initialization action (specifying package versions, as shown in the next example, is optional but recommended).
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.2 \
    --region=${REGION} \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
If you need a different Python interpreter, you can use the conda initialization action to install Miniconda with Conda and PIP packages.
Example
The next command creates a 1.2 image version Dataproc cluster and installs the latest version of Miniconda3 with Conda and PIP packages:
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.2 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=3' \
    --metadata='MINICONDA_VERSION=latest' \
    --metadata='CONDA_PACKAGES=scipy=1.0.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
Dataproc image version 1.3
In 1.3 image version clusters, the default Python interpreter is Python 2.7, installed on the VM instance at /usr/bin/python2.7. You can install Python packages with the pip install initialization action.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.3 \
    --region=${REGION} \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
If you need a different Python interpreter, you can create a cluster with the Anaconda optional component, and use the conda install and pip install initialization actions to install Conda and PIP packages.
Example
The next command creates a 1.3 image version cluster and installs Anaconda2 along with Conda and PIP packages:
REGION=region
gcloud beta dataproc clusters create my-cluster \
    --image-version=1.3 \
    --region=${REGION} \
    --optional-components=ANACONDA \
    --metadata='CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
When you install the Anaconda optional component, the default Python interpreter for PySpark jobs is /opt/conda/anaconda/bin/python2.7.
To use Python 3 as the default interpreter on a 1.3 cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda3 and the Conda and PIP packages on the cluster.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.3 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=3' \
    --metadata='MINICONDA_VERSION=latest' \
    --metadata='CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
Dataproc image version 1.4+
Miniconda3 is installed on Dataproc 1.4+ clusters. The default Python interpreter is Python 3.6 for Dataproc 1.4 and Python 3.7 for Dataproc 1.5, located on the VM instance at /opt/conda/miniconda3/bin/python3.6 and /opt/conda/miniconda3/bin/python3.7, respectively. Python 2.7 is also available at /usr/bin/python2.7.
You can install Conda and PIP packages with the conda initialization action and the pip install initialization action.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=${REGION} \
    --metadata='CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
To install Anaconda3 instead of Miniconda3, choose the Anaconda optional component, and install Conda and PIP packages with the conda initialization action and the pip install initialization action.
Example
REGION=region
gcloud beta dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=${REGION} \
    --optional-components=ANACONDA \
    --metadata='CONDA_PACKAGES=scipy=1.1.0 tensorflow' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
When you install the Anaconda3 optional component, Miniconda3 is removed
from the cluster, and /opt/conda/anaconda3/bin/python3.6
becomes the
default Python interpreter for PySpark jobs. You cannot change the Python
interpreter version of the optional component.
To use Python 2.7 as the default interpreter on a 1.4+ cluster, do not use the Anaconda optional component when creating the cluster. Instead, use the conda initialization action to install Miniconda2 and the Conda and PIP packages on the cluster.
Example
REGION=region
gcloud dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=${REGION} \
    --metadata='MINICONDA_VARIANT=2' \
    --metadata='MINICONDA_VERSION=latest' \
    --metadata='CONDA_PACKAGES=scipy=1.0.0 tensorflow=1.12.0' \
    --metadata='PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh
Choosing a Python interpreter for a job
If multiple Python interpreters are installed on your cluster, the system runs /etc/profile.d/effective-python.sh, which exports the PYSPARK_PYTHON environment variable to choose the default Python interpreter for your PySpark jobs. If you need a non-default Python interpreter for a PySpark job, set the spark.pyspark.python and spark.pyspark.driver.python properties to the required Python version (for example, "python2.7" or "python3.6") when you submit the job to your cluster.
Example
gcloud dataproc jobs submit pyspark check_python_env.py \
    --cluster=my-cluster \
    --region=region \
    --properties="spark.pyspark.python=python2.7,spark.pyspark.driver.python=python2.7"
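If you want a non-default interpreter to apply to all jobs on a cluster rather than to a single job, the same Spark properties can typically be set as cluster defaults at cluster creation time by passing them to the --properties flag with the spark: file prefix, which writes them to spark-defaults.conf. A sketch, assuming a cluster where python2.7 is installed:

gcloud dataproc clusters create my-cluster \
    --image-version=1.4 \
    --region=region \
    --properties='spark:spark.pyspark.python=python2.7,spark:spark.pyspark.driver.python=python2.7'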
Python with sudo
If you SSH into a cluster node that has Miniconda or Anaconda installed and run sudo python --version, the displayed Python version can differ from the version displayed by python --version. This difference occurs because sudo uses the default system Python /usr/bin/python and does not execute /etc/profile.d/effective-python.sh to initialize the Python environment.
For a consistent experience when using sudo, locate the Python path set in /etc/profile.d/effective-python.sh, then run the env command to set the PATH to this Python path. Here is a 1.4 cluster example:
sudo env PATH=/opt/conda/default/bin:${PATH} python --version
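Alternatively, you can bypass PATH and invoke the Conda interpreter directly by its full path under sudo (using the same 1.4 cluster path as above):

sudo /opt/conda/default/bin/python --version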