Hadoop MapReduce job with Bigtable
This example uses Hadoop to perform a simple MapReduce job that
counts the number of times a word appears in a text file. The job
uses Bigtable to store the results of the MapReduce operation. The code for
this example is in the GitHub repository
GoogleCloudPlatform/cloud-bigtable-examples, in the directory
java/dataproc-wordcount.
Set up authentication
To use the Java samples on this page in a local development environment, install and initialize the gcloud CLI, and then set up Application Default Credentials with your user credentials.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:

  gcloud init

- If you're using a local shell, then create local authentication credentials for your user account:

  gcloud auth application-default login

  You don't need to do this if you're using Cloud Shell.
For more information, see Set up authentication for a local development environment.
Overview of the code sample
The code sample provides a simple command-line interface that takes one or more
text files and a table name as input, finds all of the words that appear in the
files, and counts how many times each word appears. The MapReduce logic appears
in the WordCountHBase class.
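The command-line entry point is essentially a standard Hadoop driver that wires the mapper and reducer (sketched in the sections below) into a Job. The following is a minimal sketch of that wiring, assuming HBase's TableMapReduceUtil helper; the class name WordCountDriver and the argument layout are illustrative, and the repository's actual entry point is the WordCountHBase class.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // args[0] is the input text file or directory; args[1] is the output table name.
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "wordcount-hbase");
    job.setJarByClass(WordCountDriver.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(TokenizerMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Configure the job so the reducer's Put mutations go to the named table.
    TableMapReduceUtil.initTableReducerJob(args[1], MyTableReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```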
First, a mapper tokenizes the text file's contents and generates key-value
pairs, where the key is a word from the text file and the value is 1.
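A minimal sketch of such a mapper, assuming Hadoop's standard Mapper API and HBase's ImmutableBytesWritable key type; the class and field names here are illustrative rather than copied from the repository:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
    extends Mapper<Object, Text, ImmutableBytesWritable, IntWritable> {

  // Reused for every emitted pair; each word counts as one occurrence.
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line into whitespace-delimited words and emit (word, 1).
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      ImmutableBytesWritable word =
          new ImmutableBytesWritable(Bytes.toBytes(tokenizer.nextToken()));
      context.write(word, ONE);
    }
  }
}
```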
A reducer then sums the values for each key and writes the results to a
Bigtable table that you specified. Each row key is a word from the
text file. Each row contains a cf:count column, which holds the number of
times the row key appears in the text file.
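A sketch consistent with that description, using HBase's TableReducer API so that each word becomes one Put mutation; the column family cf and qualifier count follow the cf:count column named above, and the class name is illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;

public class MyTableReducer
    extends TableReducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {

  // Matches the cf:count column described above.
  private static final byte[] COLUMN_FAMILY = Bytes.toBytes("cf");
  private static final byte[] COUNT_QUALIFIER = Bytes.toBytes("count");

  @Override
  public void reduce(ImmutableBytesWritable word, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    // Sum the 1s emitted by the mapper for this word.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    // Write one row per word: the row key is the word itself.
    Put put = new Put(word.get());
    put.addColumn(COLUMN_FAMILY, COUNT_QUALIFIER, Bytes.toBytes(sum));
    context.write(null, put);
  }
}
```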