Quickstart Using Python

This page shows you how to set up your Python development environment, get the Cloud Dataflow SDK for Python, run an example pipeline, and check its results in the Google Cloud Platform Console.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Google Cloud Storage, Google Cloud Storage JSON, BigQuery, Google Cloud Pub/Sub, Google Cloud Datastore, and Google Cloud Resource Manager APIs.

    Enable the APIs

  5. Set up authentication:
    1. Go to the Create service account key page in the GCP Console.

    2. From the Service account drop-down list, select New service account.
    3. Enter a name into the Service account name field.
    4. From the Role drop-down list, select Project > Owner.

      Note: The Role field authorizes your service account to access resources. You can view and change this field later using GCP Console. If you are developing a production application, specify more granular permissions than Project > Owner. For more information, see granting roles to service accounts.
    5. Click Create. A JSON file that contains your key downloads to your computer.
  6. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.

  7. Create a Cloud Storage bucket:
    1. In the GCP Console, go to the Cloud Storage browser.


    2. Click Create bucket.
    3. In the Create bucket dialog, specify the following attributes:
      • Name: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.
      • Storage class: Multi-Regional
      • Location: A location where bucket data will be stored.
    4. Click Create.
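
Before moving on, it can help to confirm that the GOOGLE_APPLICATION_CREDENTIALS variable from step 6 is visible to your current shell session. The following is a minimal sketch using only the Python standard library. The function name check_credentials is hypothetical and not part of any Google SDK, and the check only confirms that the variable points at a readable JSON file, not that the key is valid or has the right role.

```python
import json
import os


def check_credentials(env=None):
    """Return the key file path named by GOOGLE_APPLICATION_CREDENTIALS.

    Raises ValueError if the variable is unset, the file is missing,
    or the file does not contain a JSON object.
    """
    # `env` is a hypothetical parameter so the check can be exercised
    # without touching the real environment.
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise ValueError("no file found at %s" % path)
    with open(path) as handle:
        key = json.load(handle)  # raises ValueError on malformed JSON
    if not isinstance(key, dict):
        raise ValueError("%s does not contain a JSON object" % path)
    return path
```

Calling check_credentials() with no arguments reads the real environment. If it raises, set the variable again in the current session as described in step 6.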

Install pip and the Cloud Dataflow SDK

  1. The Cloud Dataflow SDK for Python requires Python version 2.7. Check that you are using Python version 2.7 by running:
    python --version
  2. Install pip, Python's package manager. Check if you already have pip installed by running:
    pip --version
    After installation, check that you have pip version 7.0.0 or newer. To update pip, run the following command:
    pip install -U pip

    If you don't have a command prompt readily available, you can use Google Cloud Shell. It has Python's package manager already installed, so you can skip this setup step.

  3. Cython is not required, but if it is installed, you must have version 0.26.1 or newer. Check your Cython version by running pip show cython.
  4. This step is optional but highly recommended. Install virtualenv and create a Python virtual environment for initial experiments:
    1. If you do not have virtualenv version 13.1.0 or newer, install it by running:
      pip install --upgrade virtualenv
    2. A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, create a directory and run:
      virtualenv /path/to/directory
    3. A virtual environment needs to be activated for each shell that will use it. Activating it sets some environment variables that point to the virtual environment’s directories. To activate a virtual environment in Bash, run:
      . /path/to/directory/bin/activate
      This command sources the script bin/activate under the virtual environment directory you created.
      For instructions using other shells, see the virtualenv documentation.
  5. Install the latest version of the Cloud Dataflow SDK for Python by running the following command from a virtual environment:
    pip install google-cloud-dataflow
  6. You can read more about using Python on Google Cloud Platform on the Setting Up a Python Development Environment page.
  7. Run the wordcount.py example locally by running the following command:
    python -m apache_beam.examples.wordcount --output OUTPUT_FILE

    You installed google-cloud-dataflow but are executing WordCount with apache_beam. This works because the Cloud Dataflow SDK is a distribution of Apache Beam.

    You may see a message similar to the following:

    INFO:root:Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
    INFO:oauth2client.client:Attempting refresh to obtain initial access_token
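
The version requirements in the steps above (Python 2.7, pip 7.0.0 or newer, Cython 0.26.1 or newer if installed, virtualenv 13.1.0 or newer) can also be compared programmatically. The following is a rough sketch using only the standard library. The helper names parse_version and meets_minimum are hypothetical, and the sample version strings in the usage note are illustrative rather than captured output.

```python
import re


def parse_version(text):
    """Extract the first dotted version number in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(reported, minimum):
    """Return True if the version in `reported` is at least `minimum`."""
    # Tuples compare element by element, so (9, 0, 1) >= (7, 0, 0).
    return parse_version(reported) >= parse_version(minimum)


# Minimum versions quoted in the installation steps above.
MINIMUMS = {"pip": "7.0.0", "cython": "0.26.1", "virtualenv": "13.1.0"}
```

For example, meets_minimum("pip 9.0.1", MINIMUMS["pip"]) is true, while meets_minimum("Cython 0.25.2", MINIMUMS["cython"]) is false. To check the Python requirement, compare parse_version(...)[:2] against (2, 7).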

Run an Example Pipeline Remotely

  1. Set the PROJECT environment variable to your GCP project ID. Set the BUCKET environment variable to the bucket you created in step 7 of the Before you begin section above.
    PROJECT=<your GCP project ID>
    BUCKET=gs://<bucket name chosen in step 7>
  2. Run the wordcount.py example remotely:
    python -m apache_beam.examples.wordcount \
      --project $PROJECT \
      --runner DataflowRunner \
      --staging_location $BUCKET/staging \
      --temp_location $BUCKET/temp \
      --output $BUCKET/results/output
  3. Check that your job succeeded:

    1. Open the Cloud Dataflow Web UI.

      You should see your wordcount job with a status of Running at first, and then Succeeded:

      Cloud Dataflow Jobs
    2. Open the Cloud Storage Browser in the Google Cloud Platform Console.

      In your bucket, you should see the results and staging directories:

      Cloud Storage bucket

      In the results directory, you should see the output files that your job created:

      Output files
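
If you prefer to launch the remote run from a script rather than typing the command, the argument list from step 2 can be assembled in Python. This is a minimal sketch under stated assumptions: the function names wordcount_args and run_wordcount are hypothetical, the flags are exactly those shown in step 2, and the gs:// check is a convenience added here rather than something the SDK requires you to do yourself.

```python
import subprocess


def wordcount_args(project, bucket):
    """Build the step 2 command line as a list of arguments.

    `project` is the GCP project ID and `bucket` is the bucket URL,
    for example the values of the PROJECT and BUCKET variables.
    """
    if not bucket.startswith("gs://"):
        raise ValueError("bucket must start with gs://")
    bucket = bucket.rstrip("/")  # avoid a double slash in the paths
    return [
        "python", "-m", "apache_beam.examples.wordcount",
        "--project", project,
        "--runner", "DataflowRunner",
        "--staging_location", bucket + "/staging",
        "--temp_location", bucket + "/temp",
        "--output", bucket + "/results/output",
    ]


def run_wordcount(project, bucket):
    """Run the example remotely; raises CalledProcessError on failure."""
    subprocess.check_call(wordcount_args(project, bucket))
```

Passing the arguments as a list avoids shell quoting issues. run_wordcount is defined but not called here, because calling it submits a real job to your project.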

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  1. In the GCP Console, go to the Cloud Storage browser.


  2. Click the checkbox next to the bucket you want to delete.
  3. Click the Delete button at the top of the page to delete the bucket.

Apache Beam is a trademark of The Apache Software Foundation or its affiliates in the United States and/or other countries.