BigQuery Connector for Spark and Hadoop

The Google BigQuery connector gives Spark and Hadoop jobs programmatic read/write access to BigQuery, which makes it a convenient way to process data that is stored in BigQuery. The connector is used from code only; no command-line access is exposed.

The BigQuery connector for Hadoop is a Java library that enables Hadoop to process data from BigQuery using abstracted versions of the Hadoop InputFormat and OutputFormat classes.

Pricing considerations

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job. After the Hadoop job successfully completes, the data is deleted from Cloud Storage. You are charged for storage according to Cloud Storage pricing. To avoid excess charges, check your Cloud Storage account and make sure to remove unneeded temporary files. By downloading the BigQuery connector for Hadoop, you acknowledge and accept these additional terms.
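The stage, process, and delete-on-success behavior described above can be modeled locally to show why temporary files may be left behind. The following Java sketch is an illustration only, not the connector's implementation: a temporary local directory stands in for the Cloud Storage bucket, and the class name, method names, and file name (TempExportLifecycle, runWithStaging, cleanUp, shard-0.json) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class TempExportLifecycle {

    /**
     * Stages rows in a temporary directory (a stand-in for the bucket), runs
     * the job on the staged copy, and deletes the staged data only if the job
     * reports success. Returns the staging directory for inspection.
     */
    static Path runWithStaging(List<String> rows, Predicate<List<String>> job)
            throws IOException {
        Path stagingDir = Files.createTempDirectory("bq-export-");
        Path shard = stagingDir.resolve("shard-0.json"); // hypothetical file name
        Files.write(shard, rows);                        // the "download" step

        boolean succeeded;
        try {
            succeeded = job.test(Files.readAllLines(shard));
        } catch (RuntimeException e) {
            succeeded = false;                           // a crashed job counts as failed
        }
        if (succeeded) {
            cleanUp(stagingDir);                         // automatic cleanup on success
        }
        return stagingDir;
    }

    /** Manual cleanup: removes every staged file, then the directory itself. */
    static void cleanUp(Path stagingDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(stagingDir)) {
            for (Path file : files) {
                Files.delete(file);
            }
        }
        Files.delete(stagingDir);
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = Arrays.asList("{\"word\":\"a\"}", "{\"word\":\"b\"}");

        Path afterSuccess = runWithStaging(rows, staged -> true);
        Path afterFailure = runWithStaging(rows, staged -> false);

        System.out.println("left after success: " + Files.exists(afterSuccess)); // false
        System.out.println("left after failure: " + Files.exists(afterFailure)); // true

        cleanUp(afterFailure); // what you would otherwise do by hand in the bucket
    }
}
```

In this model the failed run is the case that needs attention: its staged data stays in place and keeps accruing storage charges until someone removes it.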

When using the connector, you will also be charged for any associated BigQuery usage fees.
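To put the Cloud Storage charge in perspective, the following Java sketch estimates the cost of temporary export data as a function of how long it stays in the bucket. It assumes that storage is billed in proportion to the time the data is kept and that a month has 730 hours. The class and method names are hypothetical, and the rate used in main is a placeholder, not an actual Cloud Storage price; take the real figure from the current pricing page.

```java
public class TempExportCost {

    // Assumed average number of hours in a billing month.
    static final double HOURS_PER_MONTH = 730.0;

    /**
     * Estimates the storage charge for temporary data, assuming the charge is
     * prorated by the time the data is stored.
     *
     * @param gigabytes      size of the temporary export
     * @param hoursStored    how long the data stays in the bucket
     * @param usdPerGbMonth  storage rate; the caller must supply the real price
     */
    static double estimateUsd(double gigabytes, double hoursStored, double usdPerGbMonth) {
        if (gigabytes < 0 || hoursStored < 0 || usdPerGbMonth < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        return gigabytes * usdPerGbMonth * (hoursStored / HOURS_PER_MONTH);
    }

    public static void main(String[] args) {
        double placeholderRate = 0.02; // hypothetical rate, NOT a real price

        // Same export size: removed after a 2-hour job vs. forgotten for a month.
        double shortLived = estimateUsd(500, 2, placeholderRate);
        double forgotten = estimateUsd(500, HOURS_PER_MONTH, placeholderRate);

        System.out.printf("removed after job: $%.4f%n", shortLived);
        System.out.printf("left for a month:  $%.4f%n", forgotten);
    }
}
```

Under these assumptions the estimate grows linearly with storage time, so temporary files left over from a failed job cost far more than those from a job that cleans up after itself.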

Getting the connector

There are two ways to get the BigQuery connector. Which one to use depends on whether you are creating a new cluster or using an existing one.

New clusters

There are two simple ways to create Spark and Hadoop clusters on Google Cloud Platform. The BigQuery connector is automatically installed with each method.

  1. Google Cloud Dataproc: Use Cloud Dataproc to create fast, reliable, and cost-effective managed Spark and Hadoop clusters. You can then use the BigQuery connector to read BigQuery data into Spark and Hadoop jobs and to write results back to BigQuery. Because Cloud Dataproc is a managed service, it is the easiest way to create a Spark and Hadoop cluster.
  2. Command-line setup scripts: Create Spark and Hadoop clusters with a command-line script. The BigQuery connector and a configuration script are included with the bdutil package. To create a Spark/Hadoop cluster and install the BigQuery connector with bdutil, do the following (see Command Line Deployment for more information):
    1. Edit the file to apply your specific project and base configuration details.
    2. Create a new cluster with the BigQuery connector installed by running:
      ./bdutil --bucket foo-bucket -n 5 -P my-cluster --env_var_files deploy

Existing clusters

If you have an existing cluster, you can download the BigQuery connector for Hadoop 1.x or the BigQuery connector for Hadoop 2.x.

The connector is also available as part of the Google Cloud Platform bigdata-interop project on GitHub.

Using the connector

To get started quickly using the connectors, see the following examples:

For more detailed information, you can download the BigQuery connector Javadoc reference.
