You can use the BigQuery connector to enable programmatic read/write access to BigQuery. It is well suited to processing data that is stored in BigQuery from Hadoop or Spark jobs. No command-line access is exposed. The BigQuery connector is a Java library that enables Hadoop to process data from BigQuery using abstracted versions of the Apache Hadoop InputFormat and OutputFormat classes.
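As a rough illustration of how the InputFormat abstraction can be used from PySpark, the sketch below builds a Hadoop configuration for a BigQuery read and passes it to `SparkContext.newAPIHadoopRDD`. The `mapred.bq.*` keys and the `JsonTextBigQueryInputFormat` class name are taken from the connector's published PySpark sample and should be checked against the connector version you run. The helper names `bigquery_input_conf` and `read_table` are hypothetical, and all project, bucket, and table values are placeholders.

```python
def bigquery_input_conf(project, bucket, temp_path, table_project, dataset, table):
    """Build the Hadoop configuration dict for reading one BigQuery table.

    Key names are assumed from the connector's PySpark sample.
    """
    return {
        # Project that is billed for the job, and the staging bucket.
        "mapred.bq.project.id": project,
        "mapred.bq.gcs.bucket": bucket,
        # Cloud Storage path where the table export is staged.
        "mapred.bq.temp.gcs.path": temp_path,
        # Fully qualified input table.
        "mapred.bq.input.project.id": table_project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }


def read_table(sc, conf):
    """Return an RDD of (row id, JSON string) pairs.

    `sc` is expected to be a live pyspark.SparkContext on a cluster
    where the connector jar is on the classpath.
    """
    return sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )
```

On a cluster, each record's second element can then be parsed with `json.loads` to obtain a row as a Python dict.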
When using the connector, you will also be charged for any associated BigQuery usage fees. Additionally, the BigQuery connector downloads data into a Cloud Storage bucket before running a Hadoop job. After the Hadoop job successfully completes, the data is deleted from Cloud Storage. You are charged for storage according to Cloud Storage pricing. To avoid excess charges, check your Cloud Storage account and make sure to remove unneeded temporary files.
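One way to remove leftover temporary files is to delete every object under the staging prefix once a job has finished or failed. The following is a minimal sketch, assuming a client object that behaves like `google.cloud.storage.Client` (its `list_blobs` method and the `delete` method on each blob). The function name `delete_temp_files` and the bucket and prefix values are hypothetical.

```python
def delete_temp_files(client, bucket_name, prefix):
    """Delete every object under gs://<bucket_name>/<prefix>.

    `client` is assumed to behave like google.cloud.storage.Client.
    Returns the names of the deleted objects.
    """
    if not prefix:
        # An empty prefix would match the whole bucket, so refuse it.
        raise ValueError("prefix must not be empty")

    deleted = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        blob.delete()
        deleted.append(blob.name)
    return deleted
```

With the `google-cloud-storage` package installed, a call might look like `delete_temp_files(storage.Client(), "my-bucket", "hadoop/tmp/bigquery/my-job")`, where the bucket and prefix are placeholders for the location your job actually used.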
Getting the connector
Cloud Dataproc clusters
The BigQuery connector is installed by default on all Cloud Dataproc 1.0-1.2 cluster nodes under
It's available in both Spark and PySpark environments.
Because the BigQuery connector is not installed by default in Cloud Dataproc 1.3 and higher, use it in one of the following ways:
- install the BigQuery connector using an initialization action
- specify the BigQuery connector in the jars parameter when submitting a job
- include the BigQuery connector classes in the application's jar-with-dependencies
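The first two options can be sketched as `gcloud` invocations assembled in Python. This is an illustration under stated assumptions, not the exact commands from this page: the function names are hypothetical, and the script name, cluster name, region, initialization action URI, and connector jar URI are placeholders that you must replace with real locations.

```python
def create_cluster_args(cluster, region, init_action_uri):
    """Arguments for creating a cluster that runs an initialization action.

    `init_action_uri` is a placeholder for the Cloud Storage URI of a
    script that installs the connector.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--initialization-actions={init_action_uri}",
    ]


def submit_pyspark_job_args(script, cluster, region, connector_jar_uri):
    """Arguments for submitting a PySpark job with the connector jar attached.

    `connector_jar_uri` is a placeholder for wherever the jar is stored.
    """
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--jars={connector_jar_uri}",
    ]
```

Either list can be passed to `subprocess.run(args, check=True)` on a machine where the Cloud SDK is installed and authenticated.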
Other Spark/Hadoop clusters
Using the connector
To get started quickly using the BigQuery connector, see the following examples:
The BigQuery connector requires Java 8.
Apache Maven Dependency Information
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>bigquery-connector</artifactId>
  <version>insert "hadoopX-X.X.X" connector version number here</version>
</dependency>