You can use the Google BigQuery connector to enable programmatic read/write access to BigQuery. This is ideal for processing data that you've already stored in BigQuery; no command-line access to BigQuery is exposed.
The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job, and deletes that data from Cloud Storage after the job completes successfully. You are charged for this storage according to Cloud Storage pricing, so check your Cloud Storage account and remove any unneeded temporary files to avoid excess charges. By downloading the BigQuery connector for Hadoop, you acknowledge and accept these additional terms.
When using the connector, you are also charged any associated BigQuery usage fees.
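If a job fails before cleanup runs, temporary export files can linger in your bucket. The sketch below shows one way to sweep them out. The bucket name and the `hadoop/tmp/bigquery/` prefix are assumptions (verify the temporary path your own cluster uses before deleting anything), and the `gsutil` commands are echoed rather than executed so the sketch is a safe dry run; drop the `echo` to actually list and delete the files.

```shell
#!/bin/sh
# Sketch: remove leftover BigQuery connector export files from Cloud Storage.
# BUCKET and the hadoop/tmp/bigquery/ prefix are assumptions -- confirm the
# temporary path your cluster is configured with before deleting anything.
set -eu

BUCKET="foo-bucket"                               # hypothetical bucket name
TMP_PREFIX="gs://${BUCKET}/hadoop/tmp/bigquery/"  # assumed temp-file location

# Echoed for a safe dry run; remove the leading "echo" on each line to
# actually inspect and then delete the temporary files.
echo gsutil ls "$TMP_PREFIX"
echo gsutil -m rm -r "$TMP_PREFIX"
```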
Getting the connector
There are several ways to get the BigQuery connector. The best choice depends on whether you are creating a new cluster or adding the connector to an existing one.
There are two simple ways to create Spark and Hadoop clusters on Google Cloud Platform. The BigQuery connector is automatically installed with each method.
- Google Cloud Dataproc - A managed service for creating fast, reliable, and cost-effective Spark and Hadoop clusters. You can use the BigQuery connector to read and write data between Spark or Hadoop and BigQuery. Because Google Cloud Dataproc is a managed service, it is the easiest way to create a Spark and Hadoop cluster.
- Command line setup scripts - Create Spark and Hadoop clusters with a set of command line utilities. The BigQuery connector is automatically included with these scripts.
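As a concrete illustration of the Cloud Dataproc route, a cluster with the connector preinstalled can be created with a single `gcloud` command. The cluster name, region, and worker count below are placeholders, and exact flag names can vary by Cloud SDK version; the command is echoed so the sketch runs as a dry run even without the SDK installed.

```shell
#!/bin/sh
# Sketch: create a managed Cloud Dataproc cluster, which ships with the
# BigQuery connector preinstalled. CLUSTER_NAME, REGION, and NUM_WORKERS
# are hypothetical values -- substitute your own.
set -eu

CLUSTER_NAME="my-cluster"
REGION="us-central1"
NUM_WORKERS=5

# Echoed for a safe dry run; drop the "echo" to actually create the cluster.
echo gcloud dataproc clusters create "$CLUSTER_NAME" \
  --region "$REGION" \
  --num-workers "$NUM_WORKERS"
```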
Using the connector
For details on the connector's API, see the BigQuery connector Javadoc reference.
Using command line scripts
To configure and enable BigQuery access, modify bigquery_env.sh and load it with bdutil:
./bdutil --bucket foo-bucket -n 5 -P my-cluster --env_var_files bigquery_env.sh deploy
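Once the cluster is deployed, a job typically points the connector at its input table and a working bucket through Hadoop configuration properties. The sketch below assumes the connector's `mapred.bq.*` property keys; the jar name and project id are placeholders, and the sample table is BigQuery's public `publicdata:samples.shakespeare` dataset. The command is echoed as a dry run; drop the `echo` to submit the job from a cluster node.

```shell
#!/bin/sh
# Sketch: run a Hadoop job that reads its input from BigQuery through the
# connector. wordcount.jar and PROJECT are hypothetical; the mapred.bq.*
# keys are assumed from the connector's configuration naming.
set -eu

PROJECT="my-project"   # hypothetical project id billed for BigQuery usage
BUCKET="foo-bucket"    # bucket the connector uses for temporary exports

# Echoed for a safe dry run; remove the "echo" to actually submit the job.
echo hadoop jar wordcount.jar \
  -D mapred.bq.project.id="$PROJECT" \
  -D mapred.bq.gcs.bucket="$BUCKET" \
  -D mapred.bq.input.project.id=publicdata \
  -D mapred.bq.input.dataset.id=samples \
  -D mapred.bq.input.table.id=shakespeare
```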