Quickstart Using Java and Apache Maven

This page shows you how to set up your Google Cloud Platform project, create a Maven project with the Apache Beam SDK, and run an example pipeline on the Cloud Dataflow service.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  4. Enable the Cloud Dataflow, Compute Engine, Stackdriver Logging, Google Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs.

    Enable the APIs

  5. Set up authentication:
    1. In the GCP Console, go to the Create service account key page.

      Go to the Create Service Account Key page
    2. From the Service account list, select New service account.
    3. In the Service account name field, enter a name.
    4. From the Role list, select Project > Owner.

      Note: The Role field authorizes your service account to access resources. You can view and change this field later by using the GCP Console. If you are developing a production app, specify more granular permissions than Project > Owner. For more information, see granting roles to service accounts.
    5. Click Create. A JSON file that contains your key downloads to your computer.
  6. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.

  7. Create a Cloud Storage bucket:
    1. In the GCP Console, go to the Cloud Storage Browser page.

      Go to the Cloud Storage Browser page

    2. Click Create bucket.
    3. In the Create bucket dialog, specify the following attributes:
      • Name: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.
      • Default storage class: Standard
      • Location: A location where bucket data will be stored.
    4. Click Create.
  8. Download and install the Java Development Kit (JDK) version 8. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
  9. Download and install Apache Maven by following Maven's installation guide for your specific operating system.
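The authentication step above can be sketched in a shell session. The key path below is a hypothetical example; substitute the actual location of the JSON file you downloaded:

```shell
# Hypothetical path to the downloaded service account key; use your own path.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-dataflow-sa.json"

# The variable lasts only for this shell session; re-export it in new
# sessions, or add the line to your shell profile (e.g. ~/.bashrc).
echo "$GOOGLE_APPLICATION_CREDENTIALS"
```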

Get the WordCount code

The Apache Beam SDK is an open source programming model for data pipelines. You define these pipelines with an Apache Beam program and can choose a runner, such as Cloud Dataflow, to execute your pipeline.

Create a Maven project containing the Apache Beam SDK's WordCount examples, using the Maven Archetype Plugin. Run the mvn archetype:generate command in your shell or terminal as follows:

$ mvn archetype:generate \
      -DarchetypeGroupId=org.apache.beam \
      -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
      -DarchetypeVersion=2.16.0 \
      -DgroupId=org.example \
      -DartifactId=word-count-beam \
      -Dversion="0.1" \
      -Dpackage=org.apache.beam.examples

After running the command, you should see a new directory called word-count-beam under your current directory. word-count-beam contains a simple pom.xml file and a series of example pipelines that count words in text files.

$ cd word-count-beam/

$ ls
pom.xml	src

$ ls src/main/java/org/apache/beam/examples/
DebuggingWordCount.java	WindowedWordCount.java	common
MinimalWordCount.java	WordCount.java

For a detailed introduction to the Apache Beam concepts used in these examples, see the WordCount Example Walkthrough. Here, we’ll just focus on executing WordCount.java.

Run WordCount locally

Run WordCount locally by running the following command from your word-count-beam directory:

$ mvn compile exec:java \
      -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--output=counts"
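WordCount writes its results as `word: count` lines. As a quick illustration of that output format (not the Beam pipeline itself), the same counting can be approximated with standard shell tools:

```shell
# Approximate WordCount's output format with standard tools:
# split on whitespace, count occurrences, print "word: count".
printf 'to be or not to be\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2 ": " $1}'
# Prints:
# be: 2
# not: 1
# or: 1
# to: 2
```

The real pipeline writes lines like these into sharded files in your working directory, with names like counts-00000-of-0000N.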

Run WordCount on the Cloud Dataflow service

Build and run WordCount on the Cloud Dataflow service:

  • For the --project argument, specify the Project ID for the GCP project you created.
  • For the --stagingLocation and --output arguments, specify the name of the Cloud Storage bucket you created as part of the path.

$ mvn -Pdataflow-runner compile exec:java \
      -Dexec.mainClass=org.apache.beam.examples.WordCount \
      -Dexec.args="--project=<PROJECT_ID> \
      --stagingLocation=gs://<STORAGE_BUCKET>/staging/ \
      --output=gs://<STORAGE_BUCKET>/output \
      --runner=DataflowRunner"
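Both Cloud Storage arguments point into the bucket you created earlier. A small sketch of how the two paths are composed, using a hypothetical bucket name:

```shell
# Hypothetical bucket name; substitute the bucket you created earlier.
BUCKET="my-wordcount-bucket"
STAGING_LOCATION="gs://${BUCKET}/staging/"  # where Dataflow stages pipeline jars
OUTPUT_PREFIX="gs://${BUCKET}/output"       # prefix for the sharded result files
echo "$STAGING_LOCATION"
echo "$OUTPUT_PREFIX"
```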

View your results

  1. Open the Cloud Dataflow Web UI.
    Go to the Cloud Dataflow Web UI

    You should see your wordcount job with a status of Running at first, and then Succeeded:

    Cloud Dataflow Jobs
  2. Open the Cloud Storage Browser in the Google Cloud Platform Console.
    Go to the Cloud Storage browser

    In your bucket, you should see the output files and staging files that your job created:

    Cloud Storage bucket

Clean up

To avoid incurring charges to your GCP account for the resources used in this quickstart, follow these steps.

  1. In the GCP Console, go to the Cloud Storage Browser page.

    Go to the Cloud Storage Browser page

  2. Click the checkbox for the bucket you want to delete.
  3. Click Delete to delete the bucket.
