Quickstart Using Python

This page shows you how to set up your Python development environment, get the Dataflow SDK for Python, and run an example pipeline using the Google Cloud Platform Console.

Before you begin

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

  3. Enable billing for your project.

  4. Enable the Google Dataflow, Compute Engine, Stackdriver Logging, Google Cloud Storage, Google Cloud Storage JSON, and BigQuery APIs.

  5. Install the Cloud SDK.
  6. Create a Cloud Storage bucket:
    1. Open the Cloud Storage Browser.
    2. Click Create Bucket.
    3. Enter a unique Name for your bucket.
      • Do not include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
    4. Choose Multi-Regional for Storage class.
    5. Choose United States for Location.
  7. Authenticate with the Cloud Platform. Run gcloud auth login with your user account.
        gcloud auth login 'user@example.com'

Install pip and the Dataflow SDK

  1. The Dataflow SDK for Python requires Python version 2.7. Verify that you are using Python version 2.7 by running python --version.
  2. Install pip, Python's package manager, if you don't already have it. Check whether pip is installed by running pip --version. Verify that you have pip version 7.0.0 or newer; to update pip, run the following command:
      pip install -U pip

    If you don't have a command prompt readily available, you can use Google Cloud Shell. It has Python's package manager already installed, so you can skip this setup step.

  3. Cython is not required, but if it is installed, you must have version 0.23.2 or newer. Check your Cython version by running pip show cython.
  4. We recommend using a Python virtual environment for initial experiments. If you do not have virtualenv version 13.1.0 or newer, install it by running:
    pip install --upgrade virtualenv
  5. Install the latest version of the Dataflow SDK for Python by running the following command from a virtual environment:
    pip install google-cloud-dataflow
  6. Run the wordcount.py example locally with the following command:
    python -m apache_beam.examples.wordcount --output OUTPUT_FILE

    You installed google-cloud-dataflow but run WordCount with apache_beam. This is because Cloud Dataflow is a distribution of Apache Beam.

    You may see a message similar to the following:

      INFO:root:Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
      INFO:oauth2client.client:Attempting refresh to obtain initial access_token
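
Conceptually, the wordcount example splits input text into words and counts how many times each word appears. The following plain-Python sketch illustrates that logic outside of Beam; the sample text here is made up:

```python
import re
from collections import Counter

def count_words(text):
    # Tokenize on runs of letters (and apostrophes), then tally each word.
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)

# Made-up sample input; the real example reads text files from disk or Cloud Storage.
counts = count_words("the quick brown fox jumps over the lazy dog the end")
print(counts["the"])  # 3
```

In the Beam version, the same tokenize-and-count steps are expressed as pipeline transforms, which is what lets the Dataflow service run them in parallel.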

Run an Example Pipeline Remotely

  1. Set the environment variable BUCKET to the bucket you created in step 6 of the Before You Begin section above. The command in the next step also references your project ID, so set PROJECT as well.
      BUCKET=gs://<bucket name chosen in step 6>
      PROJECT=<your Cloud Platform project ID>
  2. Run the wordcount.py example remotely:
    python -m apache_beam.examples.wordcount \
      --project $PROJECT \
      --runner DataflowRunner \
      --staging_location $BUCKET/staging \
      --temp_location $BUCKET/temp \
      --output $BUCKET/results/output
  3. Check that your job succeeded:

    1. Open the Cloud Dataflow Monitoring UI in the Google Cloud Platform Console.

      You should see your wordcount job with a status of Running at first, and then Succeeded.
    2. Open the Cloud Storage Browser in the Google Cloud Platform Console.

      In your bucket, you should see the results and staging directories.

      In the results directory, you should see the output files that your job created.
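
The flags passed to the remote run in step 2 can also be assembled in Python and handed to subprocess. A minimal sketch, where the project ID and bucket name are hypothetical placeholders:

```python
# Placeholders: substitute your own project ID and the bucket
# you created in the Before You Begin section.
project = "my-project-id"
bucket = "gs://my-bucket"

args = [
    "python", "-m", "apache_beam.examples.wordcount",
    "--project", project,
    "--runner", "DataflowRunner",
    "--staging_location", bucket + "/staging",
    "--temp_location", bucket + "/temp",
    "--output", bucket + "/results/output",
]
# subprocess.check_call(args) would launch the job; it is left out here
# because it requires Cloud credentials and a real bucket.
print(" ".join(args))
```

This is only a convenience for scripting repeated runs; the shell command in step 2 does the same thing.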

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  1. Open the Cloud Storage browser in the Google Cloud Platform Console.
  2. Select the checkbox next to the bucket that you created.
  3. Click DELETE.
  4. Click Delete to permanently delete the bucket and its contents.
