This page describes how to create a Dataflow project and run an example pipeline from within Eclipse.
The Tools for Eclipse plugin works only with the Dataflow SDK distribution versions 2.0.0 to 2.5.0. The Dataflow Eclipse plugin does not work with the Apache Beam SDK distribution.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
- Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Google Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs.
- Install and initialize the Cloud SDK.
- Ensure you have installed Eclipse IDE version 4.7 or later.
- Ensure you have installed the Java Development Kit (JDK) version 1.8 or later.
- Ensure you have installed the latest version of the Dataflow plugin.
- If you have not done so already, follow the Dataflow Quickstart to install the plugin.
- Or, select Help > Check for Updates to update your plugin to the latest version.
Create a Dataflow project in Eclipse
To create a new project, use the New Project wizard to generate a template application that you can use as the start for your own application.
If you don't have an application, you can run the WordCount sample app to complete the rest of these procedures.
- Select File -> New -> Project.
- In the Google Cloud Platform directory, select Cloud Dataflow Java Project.
- Enter the Group ID.
- Enter the Artifact ID.
- Select the Project Template. For the WordCount sample, select Example pipelines.
- Select the Project Dataflow Version. For the WordCount sample, select 2.5.0.
- Enter the Package name. For the WordCount sample, enter com.google.cloud.dataflow.examples.
- Click Next.
Configure execution options
You should now see the Set Default Dataflow Run Options dialog.
- Select the account associated with your Google Cloud project or add a new account. To add a new account:
- Select Add a new account... in the Account drop-down menu.
- A new browser window opens so that you can complete the sign-in process.
- Enter your Cloud Platform Project ID.
- Select a Cloud Storage Staging Location or create a new staging location. To create a new staging location:
- Enter a unique name for the Cloud Storage Staging Location. The location name must include the bucket name and a folder. Objects are created in your Cloud Storage bucket inside the specified folder. Do not include sensitive information in the bucket name because the bucket namespace is global and publicly visible.
- Bucket names must contain only lowercase letters, numbers, dashes (-), underscores (_), and dots (.). Names containing dots require verification.
- Bucket names must start and end with a number or letter.
- Bucket names must contain 3 to 63 characters. Names containing dots can contain up to 222 characters, but each dot-separated component can be no longer than 63 characters.
- Bucket names cannot be represented as an IP address in dotted-decimal notation (for example, 192.168.5.4).
- Bucket names cannot begin with the "goog" prefix.
- Bucket names cannot contain "google" or close misspellings, such as "g00gle".
- Click Create Bucket.
- Click Browse to navigate to your service account key.
- Click Finish.
Also, for DNS compliance and future compatibility, you should not use underscores (_) or have a period adjacent to another period or dash. For example, "..", "-.", and ".-" are not valid in DNS names.
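The bucket naming rules above can be checked before you click Create Bucket. The following is a minimal sketch, not part of the plugin or of any Google library: the class name `BucketNameCheck` and its methods are hypothetical, and the check for "close misspellings" covers only the two spellings named in the rules, so Cloud Storage may still reject a name that this sketch accepts.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a local pre-check of the bucket naming rules listed above.
final class BucketNameCheck {
    // Only lowercase letters, numbers, dashes, underscores, and dots are allowed.
    private static final Pattern ALLOWED = Pattern.compile("[a-z0-9._-]+");
    // Four dot-separated groups of digits, which looks like a dotted-decimal IP address.
    private static final Pattern IP_LIKE = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    static boolean isValid(String name) {
        if (name == null) {
            return false;
        }
        // 3 to 63 characters, or up to 222 when the name contains dots.
        int maxLength = name.indexOf('.') >= 0 ? 222 : 63;
        if (name.length() < 3 || name.length() > maxLength) {
            return false;
        }
        if (!ALLOWED.matcher(name).matches()) {
            return false;
        }
        // Names must start and end with a number or letter.
        if (!isLetterOrDigit(name.charAt(0))
                || !isLetterOrDigit(name.charAt(name.length() - 1))) {
            return false;
        }
        // Each dot-separated component can be no longer than 63 characters.
        for (String component : name.split("\\.", -1)) {
            if (component.length() > 63) {
                return false;
            }
        }
        if (IP_LIKE.matcher(name).matches()) {
            return false;
        }
        if (name.startsWith("goog")) {
            return false;
        }
        // Assumption: only the two spellings named in the rules are checked here.
        return !name.contains("google") && !name.contains("g00gle");
    }

    // The DNS recommendation: no underscores, and no dot next to a dot or dash.
    static boolean isDnsFriendly(String name) {
        return !name.contains("_") && !name.contains("..")
                && !name.contains("-.") && !name.contains(".-");
    }

    private static boolean isLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        for (String name : new String[] {"my-dataflow-bucket", "192.168.5.4", "my_bucket"}) {
            System.out.println(name + ": valid=" + isValid(name)
                    + ", dnsFriendly=" + isDnsFriendly(name));
        }
    }
}
```

`isValid` covers the required rules, while `isDnsFriendly` is kept separate because the DNS guidance is a recommendation rather than a hard requirement.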
Run the WordCount example pipeline on the Dataflow service
After creating your Dataflow project, you can create pipelines that run on the Dataflow service. As an example, you can run the WordCount sample pipeline.
- Select Run -> Run Configurations.
- In the left menu, select Dataflow Pipeline.
- Click New Launch Configuration.
- Click the Main tab.
- Click Browse to select your Dataflow project.
- Click Search... and select the WordCount Main Type.
- Click the Pipeline Arguments tab.
- Select the DataflowRunner runner.
- Click the Arguments tab.
- In the Program arguments field, set the output to your Cloud Storage Staging Location. The staging location must be a folder; you can't stage pipeline jobs from a bucket's root directory.
- Click Run.
- When the job finishes, among other output, you should see the following
line in the Eclipse console:
Submitted job: <job_id>
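The Program arguments step above requires a staging location that is a folder rather than a bucket's root. The following is a minimal sketch of that check, assuming the location is written as a `gs://<bucket>/<folder>` path and that the WordCount example reads its destination from an `--output` argument. The class name `StagingLocation` and its methods are hypothetical and are not part of the plugin.

```java
// Hypothetical helper: checks a staging location and builds the program argument.
final class StagingLocation {
    private static final String PREFIX = "gs://";

    // Returns true when the location names both a bucket and a folder inside it.
    static boolean hasFolder(String location) {
        if (location == null || !location.startsWith(PREFIX)) {
            return false;
        }
        String rest = location.substring(PREFIX.length());
        int slash = rest.indexOf('/');
        if (slash <= 0) {
            // Either the bucket name is empty or there is no folder part at all.
            return false;
        }
        return !stripTrailingSlashes(rest.substring(slash + 1)).isEmpty();
    }

    // Builds the text for the Program arguments field.
    // Assumption: the pipeline accepts its destination as --output=<location>.
    static String outputArgument(String location) {
        if (!hasFolder(location)) {
            throw new IllegalArgumentException(
                    "Staging location must be a folder, not a bucket root: " + location);
        }
        return "--output=" + stripTrailingSlashes(location);
    }

    private static String stripTrailingSlashes(String value) {
        String result = value;
        while (result.endsWith("/")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // The bucket and folder names here are placeholders.
        System.out.println(outputArgument("gs://my-dataflow-bucket/staging"));
    }
}
```

Rejecting a bucket root locally gives a clearer message than waiting for the job submission to fail.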
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this quickstart, follow these steps.
- Open the Cloud Storage browser in the Google Cloud Console.
- Select the checkbox next to the bucket that you created.
- Click DELETE.
- Click Delete to confirm that you want to permanently delete the bucket and its contents.