Quickstart Using Java and Apache Maven

This page shows you how to set up your Google Cloud Platform project to use Cloud Dataflow, create a Maven project with the Cloud Dataflow SDK and examples, and run an example pipeline using the Google Cloud Platform Console.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Google Cloud Storage, Google Cloud Storage JSON, BigQuery, Google Cloud Pub/Sub, Google Cloud Datastore, and Google Cloud Resource Manager APIs.

    Enable the APIs

  5. Set up authentication:
    1. Go to the Create service account key page in the GCP Console.

    2. From the Service account drop-down list, select New service account.
    3. Enter a name into the Service account name field.
    4. From the Role drop-down list, select Project > Owner.

      Note: The Role field authorizes your service account to access resources. You can view and change this field later using the GCP Console. If you are developing a production application, specify more granular permissions than Project > Owner. For more information, see granting roles to service accounts.
    5. Click Create. A JSON file that contains your key downloads to your computer.
  6. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key.

  7. Create a Cloud Storage bucket:
    1. In the GCP Console, go to the Cloud Storage browser.


    2. Click Create bucket.
    3. In the Create bucket dialog, specify the following attributes:
      • Name: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.
      • Storage class: Multi-Regional
      • Location: The location where the bucket data will be stored.
    4. Click Create.
  8. Download and install the Java Development Kit (JDK) version 1.7 or later. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
  9. Download and install Apache Maven by following Maven's installation guide for your specific operating system.
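The environment setup in steps 6, 8, and 9 can be checked from a terminal. The following is a minimal sketch, not part of the official setup: the key path is a hypothetical location that you should replace with the path of the JSON file you downloaded, and check_prereq is a helper name invented for this example.

```shell
# Hypothetical key location: replace with the path of your downloaded JSON key.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/dataflow-quickstart.json"

# Run a test command and report the prerequisite as "ok" or "missing".
# $1 is a label; the remaining arguments are the command to run.
check_prereq() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $label"
  else
    echo "missing: $label"
  fi
}

check_prereq "service account key file" test -f "$GOOGLE_APPLICATION_CREDENTIALS"
check_prereq "JAVA_HOME points to a JDK" test -x "${JAVA_HOME:-/nonexistent}/bin/javac"
check_prereq "mvn on PATH" command -v mvn
```

Any line reported as missing points at the step to revisit. The export only lasts for the current shell session, so add it to your shell profile if you want it to persist.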

Create a Maven Project that contains the Cloud Dataflow SDK for Java and Examples

  1. Create a Maven project containing the Cloud Dataflow SDK for Java using the Maven Archetype Plugin. Run the mvn archetype:generate command in your shell or terminal as follows:

    Java: SDK 2.x

      mvn archetype:generate \
          -DarchetypeArtifactId=google-cloud-dataflow-java-archetypes-examples \
          -DarchetypeGroupId=com.google.cloud.dataflow \
          -DarchetypeVersion=2.4.0 \
          -DgroupId=com.example \
          -DartifactId=first-dataflow \
          -Dversion="0.1" \
          -DinteractiveMode=false \
          -Dpackage=com.example
    
  2. After running the command, you should see a new directory called first-dataflow under your current directory. first-dataflow contains a Maven project that includes the Cloud Dataflow SDK for Java and example pipelines.

  3. Change to the first-dataflow/ directory.
  4. Build and run the example pipeline locally with the direct runner by running the mvn compile exec:java command in your shell or terminal window. For the --output argument, specify a local file path.

    Java: SDK 2.x

      mvn compile exec:java \
          -Dexec.mainClass=com.example.WordCount \
          -Dexec.args="--output=./output/"
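
If the local run fails because the main class cannot be found, it can help to confirm that the archetype step produced the expected project. The sketch below assumes Maven's standard source layout, so that the class com.example.WordCount lives at src/main/java/com/example/WordCount.java; verify_project is a helper name invented for this example. Run it from the directory that contains first-dataflow.

```shell
# Check that a directory looks like the generated quickstart project.
# Prints "ok" when both expected files exist; otherwise names the first missing file.
verify_project() {
  dir=$1
  for f in pom.xml src/main/java/com/example/WordCount.java; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      return 1
    fi
  done
  echo "ok"
}

verify_project first-dataflow || echo "Run the mvn archetype:generate step first."
```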
    

Run an Example Pipeline on the Cloud Dataflow Service

  1. Build and run the Cloud Dataflow example pipeline called WordCount on the Cloud Dataflow managed service by using the same command but different arguments. For the --project argument, you'll need to specify the Project ID for the GCP project that you created. For the --stagingLocation and --output arguments, you'll need to specify the name of the Cloud Storage bucket you created as part of the path.

    For example, if your GCP Project ID is my-cloud-project and your Cloud Storage bucket name is my-wordcount-storage-bucket, enter the following command to run the WordCount pipeline:

    Java: SDK 2.x

      mvn compile exec:java \
          -Dexec.mainClass=com.example.WordCount \
          -Dexec.args="--project=my-cloud-project \
          --stagingLocation=gs://my-wordcount-storage-bucket/staging/ \
          --output=gs://my-wordcount-storage-bucket/output \
          --runner=DataflowRunner"
    
  2. Make sure that your job succeeded:

    1. Open the Cloud Dataflow Monitoring UI in the Google Cloud Platform Console.

      You should see your wordcount job in the Cloud Dataflow Jobs list with a status of Running at first, and then Succeeded.
    2. Open the Cloud Storage Browser in the Google Cloud Platform Console.

      In your bucket, you should see the output files and staging files that your job created.
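Because the bucket name appears twice in the pipeline arguments, it can be convenient to set the Project ID and bucket name once and build the argument string from them. The following is a sketch under that assumption: the variable names and the build_exec_args helper are invented for this example, and the values are the example names used in this quickstart. The script only prints the resulting command so that you can review it before running it.

```shell
# Example values from this quickstart: replace with your own Project ID and bucket name.
PROJECT_ID="my-cloud-project"
BUCKET="my-wordcount-storage-bucket"

# Build the pipeline arguments from a Project ID ($1) and a bucket name ($2).
build_exec_args() {
  printf '%s' "--project=$1 --stagingLocation=gs://$2/staging/ --output=gs://$2/output --runner=DataflowRunner"
}

EXEC_ARGS=$(build_exec_args "$PROJECT_ID" "$BUCKET")

# Print the full command for review instead of running it.
echo "mvn compile exec:java -Dexec.mainClass=com.example.WordCount -Dexec.args=\"$EXEC_ARGS\""
```

To run the pipeline, copy the printed command into your terminal from the first-dataflow/ directory.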

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  1. In the GCP Console, go to the Cloud Storage browser.


  2. Click the checkbox next to the bucket you want to delete.
  3. Click the Delete button at the top of the page to delete the bucket.
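If you have the Cloud SDK installed, the bucket can also be deleted from a terminal instead of the Console. This is an optional sketch, assuming the gsutil tool is on your PATH: gsutil rm -r removes the bucket together with all of its objects. The delete_bucket helper and the DRY_RUN variable are invented for this example. By default the function only prints the command, and it deletes the bucket only when DRY_RUN is set to 0.

```shell
# Example bucket name from this quickstart: replace with your own.
BUCKET="my-wordcount-storage-bucket"

# Print the delete command by default; run it only when DRY_RUN=0 and gsutil is installed.
delete_bucket() {
  if [ "${DRY_RUN:-1}" = "0" ] && command -v gsutil >/dev/null 2>&1; then
    gsutil rm -r "gs://$1"
  else
    echo "would run: gsutil rm -r gs://$1"
  fi
}

delete_bucket "$BUCKET"
```

Deleting a bucket is permanent, so check the printed bucket name before setting DRY_RUN=0.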
