Setting up Cloud Dataflow in Eclipse

This page describes how to create a Dataflow project and run an example pipeline from within Eclipse.

The Dataflow Eclipse plugin works only with the Dataflow SDK distribution, versions 2.0.0 through 2.5.0; it does not work with the Apache Beam SDK distribution.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Ensure you have installed Eclipse IDE version 4.7 or later.
  7. Ensure you have installed the Java Development Kit (JDK) version 1.8 or later.
  8. Ensure you have installed the latest version of the Cloud Tools for Eclipse plugin.
    1. If you have not done so already, follow the Cloud Tools for Eclipse Quickstart to install the plugin.
    2. Or, select Help > Check for Updates to update your plugin to the latest version.
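
If you prefer the command line, the APIs in step 4 can also be enabled with the Cloud SDK's gcloud tool. The service names below are the standard API identifiers for the products listed in step 4; my-project is a placeholder for your project ID:

```shell
# Enable the APIs required by this quickstart (run after `gcloud init`).
gcloud services enable \
    dataflow.googleapis.com \
    compute.googleapis.com \
    logging.googleapis.com \
    storage-component.googleapis.com \
    storage-api.googleapis.com \
    bigquery.googleapis.com \
    pubsub.googleapis.com \
    datastore.googleapis.com \
    cloudresourcemanager.googleapis.com \
    --project=my-project
```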

Create a Dataflow project in Eclipse

To create a project, use the New Project wizard to generate a template application that you can use as the starting point for your own application.

If you don't have an application, you can run the WordCount sample app to complete the rest of these procedures.

  1. Select File > New > Project.
  2. In the Google Cloud Platform directory, select Cloud Dataflow Java Project.
  3. Enter the Group ID.
  4. Enter the Artifact ID.
  5. Select the Project Template. For the WordCount sample, select Example pipelines.
  6. Select the Project Dataflow Version. For the WordCount sample, select 2.5.0.
  7. Enter the Package name.
  8. Click Next.
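
To give a sense of what the Example pipelines template produces, the following is a condensed sketch of a Beam 2.x WordCount pipeline. It is illustrative rather than the exact generated file; the input path is the public Beam sample dataset, and the output bucket is a placeholder:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    // Pipeline options (runner, project, staging location) come from the
    // command-line arguments configured in the run configuration.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from(
            "gs://apache-beam-samples/shakespeare/kinglear.txt"))
        // Split each line into words (empty tokens are left in for brevity;
        // the full example filters them out).
        .apply("ExtractWords", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply("CountWords", Count.perElement())
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        // Placeholder output location; replace with your staging folder.
        .apply("WriteCounts", TextIO.write().to("gs://my-bucket/staging/output"));

    p.run().waitUntilFinish();
  }
}
```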

Configure execution options

You should now see the Set Default Cloud Tools for Eclipse Run Options dialog.

  1. Select the account associated with your Google Cloud project or add a new account. To add a new account:
    1. Select Add a new account... in the Account drop-down menu.
    2. A new browser window opens to complete the sign-in process.
  2. Enter your Google Cloud Platform project ID.
  3. Select a Cloud Storage staging location, or create one. To create a staging location:
    1. Enter a unique name for the Cloud Storage staging location. The location name must include the bucket name and a folder; objects are created in your Cloud Storage bucket inside the specified folder. Do not include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
    2. Click Create Bucket.
  4. Click Browse to navigate to your service account key.
  5. Click Finish.
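
The values collected in this dialog correspond to standard Dataflow pipeline options. As a rough illustration, running the same pipeline from the command line would pass them as arguments like the following, where my-project and my-bucket are placeholders:

```
--project=my-project
--stagingLocation=gs://my-bucket/staging
--runner=DataflowRunner
```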

Run the WordCount example pipeline on the Dataflow service

After creating your Cloud Tools for Eclipse project, you can create pipelines that run on the Dataflow service. As an example, you can run the WordCount sample pipeline.

  1. Select Run > Run Configurations.
  2. In the left menu, select Dataflow Pipeline.
  3. Click New Launch Configuration.
  4. Click the Main tab.
  5. Click Browse to select your Dataflow project.
  6. Click Search... and select the WordCount Main Type.
  7. Click the Pipeline Arguments tab.
  8. Select the DataflowRunner runner.
  9. Click the Arguments tab.
  10. In the Program arguments field, set the --output option to your Cloud Storage staging location. The staging location must be a folder; you can't stage pipeline jobs from a bucket's root directory.
  11. Click Run.
  12. When the job finishes, you should see, among other output, the following line in the Eclipse console:
    Submitted job: <job_id>
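
For example, the Program arguments field for the WordCount sample might contain a single --output option pointing at a folder in your staging bucket; the bucket and folder names here are placeholders:

```
--output=gs://my-bucket/staging/output
```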

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this quickstart, follow these steps.

  1. Open the Cloud Storage browser in the Google Cloud Console.
  2. Select the checkbox next to the bucket that you created.
  3. Click DELETE.
  4. Click Delete to confirm that you want to permanently delete the bucket and its contents.
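
The same cleanup can be done from the command line, assuming the Cloud SDK's gsutil tool is installed; my-bucket is a placeholder for the bucket you created:

```shell
# Permanently delete the staging bucket and everything in it.
gsutil rm -r gs://my-bucket
```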

What's next