Quickstart Using Python

This page shows you how to set up your Python development environment, get the Dataflow SDK for Python, and run an example pipeline using the Google Cloud Platform Console.

Before you begin

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

  4. Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Google Cloud Storage, Google Cloud Storage JSON, BigQuery, and Google Cloud Resource Manager APIs.

  5. Install the Cloud SDK.
  6. Create a Cloud Storage bucket:
    1. Open the Cloud Storage Browser.
    2. Click Create Bucket.
    3. Enter a unique Name for your bucket.
      • Do not include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
    4. Choose Multi-Regional for Storage class.
    5. Choose United States for Location.
  7. Authenticate with the Cloud Platform. Run the following command to get Application Default Credentials.
    gcloud auth application-default login
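
One way to confirm that step 7 produced credentials is to look for the credentials file. The following is a minimal sketch, assuming the commonly documented default locations (~/.config/gcloud on Linux and macOS, %APPDATA%\gcloud on Windows) and the GOOGLE_APPLICATION_CREDENTIALS override. The helper name adc_candidate_path is hypothetical and is not part of the Cloud SDK or the Dataflow SDK.

```python
import ntpath
import os
import posixpath

ADC_FILENAME = "application_default_credentials.json"


def adc_candidate_path(environ=None, os_name=None):
    """Return the path where Application Default Credentials are expected.

    Hypothetical helper for illustration. It only computes a path and
    does not validate the credentials themselves.
    """
    environ = os.environ if environ is None else environ
    os_name = os.name if os_name is None else os_name

    # An explicit override takes precedence over the default location.
    explicit = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        return explicit

    if os_name == "nt":
        # Assumption: on Windows the file lives under %APPDATA%\gcloud.
        return ntpath.join(environ.get("APPDATA", ""), "gcloud", ADC_FILENAME)

    # Assumption: elsewhere the file lives under ~/.config/gcloud.
    return posixpath.join(
        environ.get("HOME", ""), ".config", "gcloud", ADC_FILENAME)


if __name__ == "__main__":
    path = adc_candidate_path()
    print("%s exists: %s" % (path, os.path.exists(path)))
```

If the file is missing, rerunning the gcloud command from step 7 is the simplest fix. Custom Cloud SDK configuration directories are not handled by this sketch.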

Install pip and the Dataflow SDK

  1. The Dataflow SDK for Python requires Python version 2.7. Check that you are using Python version 2.7 by running:
    python --version
  2. Install pip, Python's package manager. Check if you already have pip installed by running:
    pip --version
    After installation, check that you have pip version 7.0.0 or newer. To update pip, run the following command:
    pip install -U pip

    If you don't have a command prompt readily available, you can use Google Cloud Shell. It has Python's package manager already installed, so you can skip this setup step.

  3. Cython is not required, but if it is installed, you must have version 0.26.1 or newer. Check your Cython version by running pip show cython.
  4. This step is optional but highly recommended. Install virtualenv and create a Python virtual environment for initial experiments:
    1. If you do not have virtualenv version 13.1.0 or newer, install it by running:
      pip install --upgrade virtualenv
    2. A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, create a directory and run:
      virtualenv /path/to/directory
    3. A virtual environment needs to be activated for each shell that will use it. Activating it sets some environment variables that point to the virtual environment’s directories. To activate a virtual environment in Bash, run:
      . /path/to/directory/bin/activate
      This command sources the script bin/activate under the virtual environment directory you created.
      For instructions using other shells, see the virtualenv documentation.
  5. Install the latest version of the Dataflow SDK for Python by running the following command from a virtual environment:
    pip install google-cloud-dataflow
  6. Run the wordcount.py example locally by running the following command:
    python -m apache_beam.examples.wordcount --output OUTPUT_FILE

    You installed google-cloud-dataflow but are running WordCount from the apache_beam package. This works because the Dataflow SDK is a distribution of Apache Beam.

    You may see a message similar to the following:

    INFO:root:Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
    INFO:oauth2client.client:Attempting refresh to obtain initial access_token
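
The minimum versions named in the steps above (Python 2.7, pip 7.0.0, virtualenv 13.1.0, Cython 0.26.1) can also be compared in a short script. This is a minimal sketch: the names MINIMUMS, parse_version, meets_minimum, and is_supported_python are hypothetical, and the comparison only looks at the leading dotted numbers of a version string, so pre-release suffixes are ignored.

```python
import re

# Minimum versions stated in the installation steps.
MINIMUMS = {
    "pip": "7.0.0",
    "virtualenv": "13.1.0",
    "Cython": "0.26.1",
}


def parse_version(text):
    """Turn a string such as '9.0.1' into the tuple (9, 0, 1).

    Only the leading dotted-number prefix is used, so '0.26.1rc2'
    is read as (0, 26, 1).
    """
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError("no leading version number in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    have, need = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that '7.0' and '7.0.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def is_supported_python(version_info):
    """Return True for Python 2.7, the version this SDK release requires."""
    return tuple(version_info[:2]) == (2, 7)
```

For example, passing the version printed by pip --version together with MINIMUMS["pip"] to meets_minimum indicates whether pip needs an upgrade, and is_supported_python(sys.version_info) checks the running interpreter.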

Run an Example Pipeline Remotely

  1. Set the PROJECT environment variable to your GCP project ID. Set the BUCKET environment variable to the bucket you created in step 6 of the Before you begin section above.
    PROJECT=<your GCP project ID>
    BUCKET=gs://<bucket name chosen in step 6>
  2. Run the wordcount.py example remotely:
    python -m apache_beam.examples.wordcount \
      --project $PROJECT \
      --runner DataflowRunner \
      --staging_location $BUCKET/staging \
      --temp_location $BUCKET/temp \
      --output $BUCKET/results/output
  3. Check that your job succeeded:

    1. Open the Cloud Dataflow Monitoring UI in the Google Cloud Platform Console.

      You should see your wordcount job with a status of Running at first, and then Succeeded:

      [Screenshot: Cloud Dataflow Jobs]
    2. Open the Cloud Storage Browser in the Google Cloud Platform Console.

      In your bucket, you should see the results and staging directories:

      [Screenshot: Cloud Storage bucket]

      In the results directory, you should see the output files that your job created:

      [Screenshot: Output files]
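
The remote command in step 2 derives three Cloud Storage paths from a single bucket. The sketch below shows that derivation in Python; wordcount_args is a hypothetical helper name, and the flags are exactly the ones used in the command above.

```python
def wordcount_args(project, bucket):
    """Build the pipeline arguments from step 2 for a project and bucket.

    Hypothetical helper for illustration. The bucket may be given as
    'my-bucket' or as 'gs://my-bucket'.
    """
    root = bucket.rstrip("/")
    if not root.startswith("gs://"):
        root = "gs://" + root

    return [
        "--project", project,
        "--runner", "DataflowRunner",
        # Staged SDK files and temporary files share the same bucket.
        "--staging_location", root + "/staging",
        "--temp_location", root + "/temp",
        # Output files are written with this prefix under results/.
        "--output", root + "/results/output",
    ]
```

The returned list could be appended to ["python", "-m", "apache_beam.examples.wordcount"] and passed to subprocess.check_call to launch the same job from a script.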

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  1. Open the Cloud Storage browser in the Google Cloud Platform Console.
  2. Select the checkbox next to the bucket that you created.
  3. Click DELETE.
  4. Click Delete to permanently delete the bucket and its contents.

Apache Beam is a trademark of The Apache Software Foundation or its affiliates in the United States and/or other countries.
