Count words with Cloud Dataflow and Python

Author(s): @jscud, Published: 2019-07-31


Take the interactive version of this tutorial, which runs in the Google Cloud Platform (GCP) Console:

Open in GCP Console

Introduction

In this tutorial, you'll learn the basics of the Cloud Dataflow service by running a simple example pipeline using Python.

Cloud Dataflow pipelines are either batch (processing bounded input like a file or database table) or streaming (processing unbounded input from a source like Cloud Pub/Sub). The example in this tutorial is a batch pipeline that counts words in a collection of Shakespeare's works.

Before you start, you'll need to check for prerequisites in your GCP project and perform initial setup.

Project setup

GCP organizes resources into projects. This allows you to collect all of the related resources for a single application in one place.

Begin by creating a new project or selecting an existing project for this tutorial.

For details, see Creating a project.

Set up Cloud Dataflow

To use Cloud Dataflow, enable the Cloud Dataflow APIs and open Cloud Shell.

Enable Cloud APIs

Cloud Dataflow processes data in many GCP data stores and messaging services, including BigQuery, Cloud Storage, and Cloud Pub/Sub. To use these services, you must first enable their APIs.

Use the following link to enable the APIs:

https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com,dataflow,cloudresourcemanager.googleapis.com,logging,storage_component,storage_api,bigquery,pubsub
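If you prefer to work from the command line, you can enable the same services from Cloud Shell with the gcloud CLI. This is an equivalent sketch; the service names below correspond to the APIs listed in the link above:

gcloud services enable compute.googleapis.com dataflow.googleapis.com \
  cloudresourcemanager.googleapis.com logging.googleapis.com \
  storage-component.googleapis.com storage-api.googleapis.com \
  bigquery.googleapis.com pubsub.googleapis.com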

Open Cloud Shell

In this tutorial, you do much of your work in Cloud Shell, a command-line environment built into the GCP Console.

Open Cloud Shell by clicking the Activate Cloud Shell button in the navigation bar in the upper-right corner of the console.

Install Cloud Dataflow samples

Cloud Dataflow runs jobs written using the Apache Beam SDK. To submit jobs to the Cloud Dataflow service using Python, your development environment requires Python, the Google Cloud SDK, and the Apache Beam SDK for Python. Additionally, Cloud Dataflow uses pip, Python's package manager, to manage SDK dependencies.

This tutorial uses a Cloud Shell environment that has Python and pip installed. If you prefer, you can do this tutorial on your local machine.

Download the samples and the Apache Beam SDK for Python

To write a Cloud Dataflow job with Python, you first need to install the Apache Beam SDK for Python from the Python Package Index (PyPI).

When you run this command in Cloud Shell, pip will download and install the appropriate version of the Apache Beam SDK:

pip install --user --quiet apache-beam[gcp]
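To verify the installation, you can print the installed SDK version (an optional check; the apache_beam package exposes a __version__ attribute):

python -c "import apache_beam; print(apache_beam.__version__)"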

Set up a Cloud Storage bucket

Cloud Dataflow uses Cloud Storage buckets to store output data and cache your pipeline code.

Run gsutil mb

In Cloud Shell, use the command gsutil mb to create a Cloud Storage bucket:

gsutil mb gs://[PROJECT_ID]

[PROJECT_ID] is your GCP project ID. The command creates a Cloud Storage bucket whose name matches your project ID.
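To confirm that the bucket exists, you can list it from Cloud Shell (an optional check, using the same [PROJECT_ID] placeholder):

gsutil ls -b gs://[PROJECT_ID]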

For more information about the gsutil tool, see the documentation.

Create and launch a pipeline

In Cloud Dataflow, data processing work is represented by a pipeline. A pipeline reads input data, performs transformations on that data, and then produces output data. A pipeline's transformations might include filtering, grouping, comparing, or joining data.
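To make the pipeline model concrete, here is a minimal word-count sketch written with the Apache Beam Python SDK. The transform labels and the local output path are illustrative; the prebuilt apache_beam.examples.wordcount module that you run in the next step implements essentially the same logic.

import apache_beam as beam

# Minimal word-count sketch. With no runner specified, Beam uses the local
# DirectRunner; the input is a public sample file and the output path is local.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
        | 'Split' >> beam.FlatMap(lambda line: line.split())
        | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
        | 'CountPerWord' >> beam.CombinePerKey(sum)
        | 'Format' >> beam.Map(lambda word_count: '%s: %d' % word_count)
        | 'Write' >> beam.io.WriteToText('wordcount-output')
    )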

Launch your pipeline on the Cloud Dataflow service

Use Python to launch your pipeline on the Cloud Dataflow service. The running pipeline is referred to as a job.

python -m apache_beam.examples.wordcount \
  --project [PROJECT_ID] \
  --runner DataflowRunner \
  --temp_location gs://[PROJECT_ID]/temp \
  --output gs://[PROJECT_ID]/results/output \
  --job_name dataflow-intro
  • [PROJECT_ID] is your GCP project ID, which is also the name of the Cloud Storage bucket that you created earlier.
  • --output is the Cloud Storage path prefix where the WordCount example writes the job results.
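After the command returns, you can optionally confirm that the job was submitted by listing your Dataflow jobs from Cloud Shell:

gcloud dataflow jobs list --status=active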

Your job is running

Congratulations! Your pipeline code is now staged in the storage bucket that you created earlier, and Compute Engine instances are being created to run it. Cloud Dataflow splits up your input file so that your data can be processed by multiple machines in parallel.

Monitor your job

In this section, you check the progress of your pipeline on the Dataflow page in the GCP Console.

Go to the Dataflow page

Open the Navigation menu in the upper-left corner of the console, and then select Dataflow.

Select your job

Click the job name to view the job details.

Explore pipeline details and metrics

Explore the pipeline on the left and the job information on the right. To see detailed job status, click Logs at the top of the page.

Click a step in the pipeline to view its metrics.

As your job finishes, you'll see the job status change, and the Compute Engine instances used by the job will stop automatically.

View your output

Now that your job has run, you can explore the output files in Cloud Storage.

Go to the Cloud Storage page

Open the Navigation menu in the upper-left corner of the console, select Storage, and then click Browser.

Go to the storage bucket

In the list of buckets, select the bucket that you created earlier.

The bucket contains output and temp folders. Dataflow saves the output in shards, so your bucket will contain several output files.

The temp folder is for staging binaries needed by the workers and for temporary files needed by the job execution.
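You can also read the results directly from Cloud Shell; for example, assuming the bucket name and output prefix used when you launched the job:

gsutil cat gs://[PROJECT_ID]/results/output* | head -20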

Clean up

To prevent being charged for Cloud Storage usage, delete the bucket you created.

  1. Click the Buckets link to go back to the bucket browser.

  2. Check the box next to the bucket that you created.

  3. Click the Delete button at the top of the GCP Console, and confirm the deletion.
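Alternatively, you can delete the bucket and all of its contents from Cloud Shell (assuming the [PROJECT_ID] bucket name used earlier):

gsutil rm -r gs://[PROJECT_ID]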

Conclusion

Here's what you can do next:

Set up your local development environment so that you can build and run Cloud Dataflow pipelines from your own machine.

