Quickstart Using Python

This page shows you how to set up your Python development environment, get the Apache Beam SDK for Python, and run an example pipeline using the Google Cloud Platform Console.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Google Cloud Storage, Google Cloud Storage JSON, BigQuery, Google Cloud Pub/Sub, Google Cloud Datastore, and Google Cloud Resource Manager APIs.

    Enable the APIs

  5. Set up authentication:
    1. In the GCP Console, go to the Create service account key page.

      Go to the Create Service Account Key page
    2. From the Service account drop-down list, select New service account.
    3. In the Service account name field, enter a name.
    4. From the Role drop-down list, select Project > Owner.

      Note: The Role field authorizes your service account to access resources. You can view and change this field later by using GCP Console. If you are developing a production app, specify more granular permissions than Project > Owner. For more information, see granting roles to service accounts.
    5. Click Create. A JSON file that contains your key downloads to your computer.
  6. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.
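Before running a pipeline, you can confirm from Python that the variable is set and the key file parses. This is an optional sketch (the helper name is ours, not part of any SDK); the field names checked are standard fields of a service account JSON key:

```python
import json
import os

def check_credentials():
    """Confirm GOOGLE_APPLICATION_CREDENTIALS points at a parseable key file."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise EnvironmentError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path) as f:
        key = json.load(f)  # a service account key is a JSON document
    # Standard fields present in a service account key file.
    missing = {"type", "project_id", "client_email"} - set(key)
    if missing:
        raise ValueError("key file is missing fields: %s" % sorted(missing))
    return key["project_id"]
```

If the variable is unset or the file is malformed, this fails immediately with a clear error instead of a later, harder-to-read authentication failure.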

  7. Create a Cloud Storage bucket:
    1. In the GCP Console, go to the Cloud Storage browser.

      Go to the Cloud Storage browser

    2. Click Create bucket.
    3. In the Create bucket dialog, specify the following attributes:
      • Name: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.
      • Storage class: Multi-Regional
      • Location: Where bucket data will be stored.
    4. Click Create.
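Because the bucket namespace is global, name collisions and invalid characters are common first-time errors. A rough client-side check, sketched here with a hypothetical helper (it covers the common rules — length, lowercase letters, digits, dashes, underscores, dots — not every restriction), can catch obvious mistakes before you submit the form:

```python
import re

# Common Cloud Storage bucket-name rules (a partial, client-side check):
# 3-63 characters; lowercase letters, digits, dashes, underscores, dots;
# must begin and end with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")

def is_plausible_bucket_name(name):
    """Return True if `name` passes the basic bucket-name format rules."""
    return bool(_BUCKET_RE.match(name))
```

A name passing this check can still be rejected if another project already owns it, since uniqueness is only enforced server-side.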

Set up your environment

  1. The Apache Beam SDK for Python requires Python version 2.7.x. Check that you have version 2.7.x by running:
    python --version
  2. Install pip, Python's package manager. Check that you have version 7.0.0 or newer by running:
    pip --version
    If you do not have pip version 7.0.0 or newer, run the following command to install it. This command might require administrative privileges.
    pip install -U pip

    If you don't have a command prompt readily available, you can use Google Cloud Shell. It has Python's package manager already installed, so you can skip this setup step.

  3. Cython is not required, but if it is installed, you must have version 0.26.1 or newer. Check your Cython version by running pip show cython.
  4. It is recommended that you use a Python virtual environment for initial experiments. If you do not have virtualenv version 13.1.0 or newer, run the following command to install it. This command might require administrative privileges.
    pip install --upgrade virtualenv
    1. A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, create a directory and run:
      virtualenv /path/to/directory
    2. A virtual environment needs to be activated for each shell that will use it. Activating it sets some environment variables that point to the virtual environment’s directories. To activate a virtual environment in Bash, run:
      . /path/to/directory/bin/activate
      This command sources the script bin/activate under the virtual environment directory you created.

      For instructions using other shells, see the virtualenv documentation.
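If you are unsure whether the current shell has an activated virtual environment, the interpreter itself can tell you. A small sketch (the helper name is ours): virtualenv sets sys.real_prefix, while the stdlib venv module instead makes sys.base_prefix differ from sys.prefix:

```python
import sys

def in_virtualenv():
    """Return True when running inside a virtualenv or venv environment."""
    # Classic virtualenv sets sys.real_prefix; the stdlib venv module
    # makes sys.base_prefix differ from sys.prefix instead.
    return (hasattr(sys, "real_prefix")
            or getattr(sys, "base_prefix", sys.prefix) != sys.prefix)
```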

Get the Apache Beam SDK

Install the latest Apache Beam SDK for Python from PyPI:

    pip install apache-beam[gcp]
  
You can read more about using Python on Google Cloud Platform on the Setting Up a Python Development Environment page.

Run WordCount Locally

Run WordCount locally by running the following command:

    python -m apache_beam.examples.wordcount --output OUTPUT_FILE
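Conceptually, the example tokenizes the input text, counts each word, and writes one `word: count` line per word. The core logic can be sketched in plain Python, without Beam (the real example's tokenizer and pipeline structure differ, but the transformation is the same):

```python
import re
from collections import Counter

def word_counts(text):
    """Count words, splitting on non-word characters as WordCount does."""
    words = re.findall(r"[\w']+", text)
    return Counter(words)

def format_counts(counts):
    """Render counts as 'word: count' lines, sorted for readability."""
    return ["%s: %d" % (word, n) for word, n in sorted(counts.items())]
```

In the Beam version, each of these steps becomes a pipeline transform, which is what lets the same logic run distributed on the Cloud Dataflow service below.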

Run WordCount on the Cloud Dataflow service

Run WordCount on the Cloud Dataflow service:

    python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
                                             --output gs://<your-gcs-bucket>/counts \
                                             --runner DataflowRunner \
                                             --project your-gcp-project \
                                             --temp_location gs://<your-gcs-bucket>/tmp/

View your results

  1. Open the Cloud Dataflow Web UI.
    Go to the Cloud Dataflow Web UI

    You should see your wordcount job with a status of Running at first, and then Succeeded:

    Cloud Dataflow Jobs
  2. Open the Cloud Storage Browser in the Google Cloud Platform Console.
    Go to the Cloud Storage browser

    In your bucket, you should see the results and staging directories:

    Cloud Storage bucket

    In the results directory, you should see the output files that your job created:

    Output files
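Because the `--output` value is a prefix, the job writes several shard files named like `counts-00000-of-00003` (Beam's text sink convention). A small sketch (the helper and the file list are illustrative) that collects and orders the shards for one prefix:

```python
import re

# Beam's text sink writes sharded output named "<prefix>-SSSSS-of-NNNNN".
_SHARD_RE = re.compile(r"^(?P<prefix>.*)-(?P<index>\d{5})-of-(?P<count>\d{5})$")

def ordered_shards(filenames, prefix="counts"):
    """Return the shard files for `prefix`, sorted by shard index."""
    shards = []
    for name in filenames:
        m = _SHARD_RE.match(name)
        if m and m.group("prefix") == prefix:
            shards.append((int(m.group("index")), name))
    return [name for _, name in sorted(shards)]
```

Concatenating the shards in this order reproduces the full result set; the shard count depends on how the service parallelized the write.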

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  1. In the GCP Console, go to the Cloud Storage browser.

    Go to the Cloud Storage browser

  2. Click the checkbox next to the bucket you want to delete.
  3. Click the Delete button at the top of the page to delete the bucket.

What's next

Apache Beam is a trademark of The Apache Software Foundation or its affiliates in the United States and/or other countries.