Managing Pipeline Dependencies (Python)

This tutorial shows how to make the local packages your Dataflow pipeline depends on available remotely.

When you run your pipeline locally, the packages it depends on are available because they're installed on your local machine. However, when you run your pipeline remotely on the Cloud Dataflow Service, you must take a few extra steps to make these dependencies available on the remote worker machines.

IMPORTANT: The Google Cloud virtual machines (that is, the workers) used for pipeline execution have a standard Python 2.7 distribution installed on them. If your code relies only on packages in the standard Python distribution, then you don't need to do anything mentioned on this page.

Each of the sections below refers to a different source that your package may have been installed from, and provides instructions for how to make that type of package available remotely.

PyPI Dependencies

If your pipeline uses public packages from the Python Package Index, make these packages available remotely by performing the following steps:

Note: If your PyPI package depends on a non-Python package (such as a package that requires installation on Linux using the apt-get install command), see the PyPI Dependencies with Non-Python Dependencies section instead.

  1. Find out which packages you have installed on your machine. Run the following command:
    pip freeze > requirements.txt

    This will create a requirements.txt file that lists all packages installed on your machine, regardless of where they were installed from.

  2. In the requirements.txt file, leave only the packages that were installed from PyPI and are used by your workflow source. Delete the entries for all other packages, which are irrelevant to your code.
  3. Run your pipeline with the following command-line option:
    --requirements_file requirements.txt
    This will stage the requirements.txt file to the staging location you defined.

When the workers are spun up, they look at the staging location to find out what needs to be installed, and install all packages listed in the requirements.txt file. Because of this, it's very important that you delete non-PyPI packages from the requirements.txt file, as described in step 2. If you don't, the workers will attempt to install packages from sources that are unknown to them, which results in an error.
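The filtering in step 2 can be sketched programmatically. This is only an illustration: the package names below are hypothetical, and USED_PACKAGES stands in for whatever your workflow source actually imports.

```python
# Sketch of step 2: keep only the pip-freeze lines for PyPI packages that the
# workflow source actually uses. All package names here are hypothetical.

USED_PACKAGES = {'numpy', 'requests'}  # packages imported by the workflow

def filter_requirements(freeze_lines):
    """Return only the lines whose package name appears in USED_PACKAGES."""
    kept = []
    for line in freeze_lines:
        name = line.split('==')[0].strip().lower()
        if name in USED_PACKAGES:
            kept.append(line)
    return kept

freeze_output = [
    'numpy==1.11.0',
    'requests==2.10.0',
    'unused-tool==3.2',  # irrelevant to the workflow: delete this entry
]
print('\n'.join(filter_requirements(freeze_output)))
```

Writing the filtered list back to requirements.txt leaves you with only the entries the workers can actually install from PyPI.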

Local or non-PyPI Dependencies

If your pipeline uses packages that are not available publicly, such as packages that you've downloaded from a GitHub repo, make these packages available remotely by performing the following steps:

  1. Identify which packages you have installed on your machine that are not public. Run the following command:
    pip freeze

    This will list all packages that have been installed on your machine, regardless of where they were installed from.

  2. Run your pipeline with the following command-line option:
    --extra_package /path/to/package/package-name
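To spot the non-public packages in step 1, you can scan the pip freeze output for entries that reference a VCS checkout or a local path rather than a PyPI release. This is a hedged sketch: the sample lines are hypothetical, and the exact format of such entries varies by pip version.

```python
# Sketch: flag pip-freeze entries that were not installed from PyPI, such as
# editable installs ("-e ...") or VCS/local references. These are the packages
# you need to pass to the pipeline via --extra_package.

def non_pypi_entries(freeze_lines):
    """Return freeze lines that reference a non-PyPI source."""
    markers = ('git+', 'hg+', 'svn+', 'file://')
    return [line for line in freeze_lines
            if line.startswith('-e ') or any(m in line for m in markers)]

sample = [
    'numpy==1.11.0',                                      # from PyPI: fine
    '-e git+https://github.com/example/pkg.git#egg=pkg',  # hypothetical non-PyPI entry
]
print(non_pypi_entries(sample))
```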

Multiple File Dependencies

Often, your pipeline code spans multiple files. To run your project remotely, you need to group these files as a Python package. When the workers are spun up, they look for this kind of package in the staging location and install it. To group your files as a Python package and make it available remotely, perform the following steps:

  1. Create a setup.py file for your project. The following is a minimal setup.py file; in practice, your file will often contain additional options. Note that it must import setuptools before calling setup:
        import setuptools

        setuptools.setup(
          name='PACKAGE-NAME',
          version='PACKAGE-VERSION',
          install_requires=[],
          packages=setuptools.find_packages(),
        )
  2. Structure your project so that in the root directory you have the setup.py file, the main workflow file, and a directory with the rest of the files, as follows:
        root_dir/
          setup.py
          main.py
          other_files_dir/
    See Juliaset for an example that follows this required project structure.
  3. Run your pipeline with the following command-line option:
    --setup_file /path/to/setup.py

Note: If you created a requirements.txt file AND your project spans multiple files, you can get rid of the requirements.txt file and instead, add its packages to the install_requires field of the setup call (in step 1).
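A sketch of this alternative, with the PyPI packages that would otherwise go in requirements.txt listed in install_requires instead. The project name, package names, and versions here are hypothetical:

```python
import setuptools

setuptools.setup(
    name='my-pipeline-package',   # hypothetical project name
    version='0.0.1',
    install_requires=[
        'numpy==1.11.0',          # moved here from requirements.txt
        'requests==2.10.0',
    ],
    packages=setuptools.find_packages(),
)
```

With this setup.py, you pass only --setup_file and omit --requirements_file.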

Non-Python Dependencies

If your pipeline uses non-Python packages, such as packages that require installation (on Linux) using the apt-get install command, you need to modify your setup.py file to execute these install commands. To do this, modify your setup.py file as described in this example, and perform the following steps:

  1. Structure your project so that in the root directory you have the setup.py file, the main workflow file, and a directory with the rest of the files, as follows:
        root_dir/
          setup.py
          main.py
          other_files_dir/
    See Juliaset for an example that follows this required project structure.
  2. Run your pipeline with the following command-line option:
    --setup_file /path/to/setup.py
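A modified setup.py of this kind essentially shells out to run each install command during package setup. The following is a minimal sketch of such a helper, under the assumption that the commands are given as argv lists; the structure of the real linked example may differ.

```python
import subprocess

def run_custom_command(command):
    """Run one install command (an argv list) and fail loudly on error."""
    print('Running command: %s' % ' '.join(command))
    p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = p.communicate()
    if p.returncode != 0:
        raise RuntimeError('Command %s failed: %s' % (command, output))
    return output

# Hypothetical usage; on a worker this would run e.g. apt-get install commands.
run_custom_command(['echo', 'install step finished'])
```

On the workers, a helper like this would be invoked once per command from within the setuptools build steps.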

PyPI Dependencies with Non-Python Dependencies

If your pipeline uses a PyPI package that depends on non-Python dependencies during package installation, you need to modify your setup.py file as described in this example and perform the following steps:

  1. Add the installation commands (e.g. the apt-get install commands) for the non-Python dependencies to the list of CUSTOM_COMMANDS in your setup.py file.
  2. Add ['pip', 'install', '<your PyPI package>'] to the list of CUSTOM_COMMANDS in your setup.py file.
  3. Structure your project so that in the root directory you have the setup.py file, the main workflow file, and a directory with the rest of the files, as follows:
        root_dir/
          setup.py
          main.py
          other_files_dir/
    See Juliaset for an example that follows this required project structure.
  4. Run your pipeline with the following command-line option:
    --setup_file /path/to/setup.py

Note: Since custom commands are executed after dependencies for your workflow are installed (by pip), you should omit this PyPI package dependency from the pipeline's requirements.txt file and from the install_requires parameter in the setuptools.setup() call of your setup.py file.
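Put together, the custom-command list for steps 1 and 2 might look like the following. The apt package and PyPI package names are hypothetical stand-ins; substitute your actual dependencies.

```python
# Sketch of the CUSTOM_COMMANDS list in a modified setup.py. The real example
# linked above defines a list like this and runs each entry in order.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'libsndfile1'],  # hypothetical non-Python dependency
    ['pip', 'install', 'pysoundfile'],                      # hypothetical PyPI package that needs it
]

for command in CUSTOM_COMMANDS:
    print(' '.join(command))
```

Because the pip install happens in CUSTOM_COMMANDS, the package deliberately does not appear in install_requires or requirements.txt, as the note above explains.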
