Hadoop on Google Cloud Platform

BigQuery Connector for Hadoop

You can use the Google BigQuery connector to enable programmatic read and write access to BigQuery from Hadoop. This is ideal for processing data that you have already stored in BigQuery. The connector does not expose command-line access to BigQuery.

The BigQuery connector for Hadoop is a Java library that enables Hadoop to process data from BigQuery using abstracted versions of the Hadoop InputFormat and OutputFormat classes.
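As a minimal sketch of what this looks like in practice, the driver below configures a job to read a BigQuery table through the connector's input format. The BigQueryConfiguration and GsonBigQueryInputFormat classes come from the connector library; the project ID, bucket name, and input table are placeholder values for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;

public class BigQueryJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Project that owns the BigQuery table, and a Cloud Storage bucket
    // the connector can use for temporary export files (placeholders).
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project");
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "foo-bucket");

    // Fully qualified table to read: [projectId]:[datasetId].[tableId].
    BigQueryConfiguration.configureBigQueryInput(
        conf, "publicdata:samples.shakespeare");

    Job job = Job.getInstance(conf, "bigquery-read");
    job.setInputFormatClass(GsonBigQueryInputFormat.class);
    // ... set mapper, reducer, and output format as usual, then run:
    // job.waitForCompletion(true);
  }
}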

Pricing considerations

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job. After the Hadoop job completes successfully, the data is deleted from Cloud Storage. You are charged for this storage according to Cloud Storage pricing. Because a job that fails or is interrupted can leave temporary files behind, check your Cloud Storage account and remove unneeded temporary files to avoid excess charges. By downloading the BigQuery connector for Hadoop, you acknowledge and accept these additional terms.

When you use the connector, you are also charged any associated BigQuery usage fees.

Getting the connector

There are several ways to get the BigQuery connector. The best method depends on whether you are creating a new cluster or adding the connector to an existing one.

New clusters

There are two simple ways to create Spark and Hadoop clusters on Google Cloud Platform. The BigQuery connector is automatically installed with each method.

  1. Google Cloud Dataproc - Helps you create fast, reliable, and cost-effective managed Spark and Hadoop clusters. You can use the BigQuery connector to read and write data from Spark and Hadoop to BigQuery. Since Google Cloud Dataproc is a managed service, it is the easiest way to create a Spark and Hadoop cluster.
  2. Command line setup scripts - Create Spark and Hadoop clusters with a set of command line utilities. The BigQuery connector is automatically included with these scripts.

Existing clusters

If you have an existing cluster, you can download the connector directly. The connector is also available as part of the Google Cloud Platform bigdata-interop project on GitHub.
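After downloading the jar, make it available to your jobs, typically by placing it on the Hadoop classpath on each node of the cluster; see the bigdata-interop project documentation for installation details.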

Using the connector

For details on the connector's classes and methods, download the BigQuery connector Javadoc reference.

Using command line scripts

To configure and enable BigQuery access, modify bigquery_env.sh, then pass it to bdutil with the --env_var_files flag when you deploy the cluster:

./bdutil --bucket foo-bucket -n 5 -P my-cluster --env_var_files bigquery_env.sh deploy
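In this example, foo-bucket is the Cloud Storage bucket used by the deployment, -n 5 requests five worker nodes, and -P my-cluster sets the name prefix for the cluster; replace these placeholder values with your own.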

Learn more about writing a MapReduce job with the connector and running a MapReduce job with the connector.
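As a sketch of the mapper side, the class below shows how records arrive from GsonBigQueryInputFormat: each BigQuery row is delivered as a com.google.gson.JsonObject value. The word column is assumed to exist in the input table (it does in the publicdata:samples.shakespeare sample used above).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.google.gson.JsonObject;

// Each input record is one BigQuery row, presented as a JsonObject.
public class WordMapper
    extends Mapper<LongWritable, JsonObject, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, JsonObject row, Context context)
      throws IOException, InterruptedException {
    // Read the "word" column from the row and emit (word, 1).
    word.set(row.get("word").getAsString());
    context.write(word, ONE);
  }
}

A standard reducer that sums the counts completes the word count; results can be written with any ordinary Hadoop OutputFormat, or back to BigQuery through the connector's output classes.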