Hive to BigQuery template
Use the Dataproc Serverless Hive to BigQuery template to extract data from Hive to BigQuery.
Use the template
Run the template using the gcloud CLI or Dataproc API.
gcloud
Before using any of the command data below, make the following replacements:
- PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
- REGION: Required. Compute Engine region.
- TEMPLATE_VERSION: Required. Specify
latest
for the latest template version, or the date of a specific version, for example,2023-03-17_v0.1.0-beta
(visit gs://dataproc-templates-binaries or rungcloud storage ls gs://dataproc-templates-binaries
to list available template versions). - SUBNET: Optional. If a subnet is not specified, the subnet
in the specified REGION in the
default
network is selected.Example:
projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
- HOST and PORT: Required.
Hostname or IP address and port of the source Hive database host.
Example:
10.0.0.33:9083
- DATASET: Required. BigQuery output dataset name.
- TABLE: Required. BigQuery output table name.
- QUERY: Required. Query for extracting data from Hive.
- TEMP_BUCKET: Required. Cloud Storage bucket name (specify the name of the bucket only). This bucket is used for BigQuery loading.
- MODE: Required. Write mode for BigQuery output.
Options:
Append
,Overwrite
,Ignore
, orErrorIfExists
. - TEMPVIEW and TEMPVIEW_SQL_QUERY: Optional. You can use these two optional parameters to apply a Spark SQL transformation while loading data into BigQuery. TEMPVIEW is the temporary view name, and TEMPVIEW_SQL_QUERY is the query statement. TEMPVIEW and the table name in TEMPVIEW_SQL_QUERY must match.
- SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
- PROPERTY and PROPERTY_VALUE:
Optional. Comma-separated list of
Spark property=
value
pairs. - LABEL and LABEL_VALUE:
Optional. Comma-separated list of
label
=value
pairs. - LOG_LEVEL: Optional. Level of logging. Can be one of
ALL
,DEBUG
,ERROR
,FATAL
,INFO
,OFF
,TRACE
, orWARN
. Default:INFO
. -
KMS_KEY: Optional. Cloud Key Management Service to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.
Example:
projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
Execute the following command:
Linux, macOS, or Cloud Shell
gcloud dataproc batches submit spark \ --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \ --version="1.2" \ --project="PROJECT_ID" \ --region="REGION" \ --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" \ --subnet="SUBNET" \ --service-account="SERVICE_ACCOUNT" \ --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" \ --kms-key="KMS_KEY" \ --labels="LABEL=LABEL_VALUE" \ -- --template HIVETOBIGQUERY \ --templateProperty log.level="LOG_LEVEL" \ --templateProperty hivetobq.bigquery.location="PROJECT_ID.DATASET.TABLE" \ --templateProperty hivetobq.sql="QUERY" \ --templateProperty hivetobq.temp.gcs.bucket="TEMP_BUCKET" \ --templateProperty hivetobq.write.mode="MODE" \ --templateProperty hivetobq.temp.table="TEMPVIEW" \ --templateProperty hivetobq.temp.query="TEMPVIEW_SQL_QUERY"
Windows (PowerShell)
gcloud dataproc batches submit spark ` --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ` --version="1.2" ` --project="PROJECT_ID" ` --region="REGION" ` --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ` --subnet="SUBNET" ` --service-account="SERVICE_ACCOUNT" ` --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" ` --kms-key="KMS_KEY" ` --labels="LABEL=LABEL_VALUE" ` -- --template HIVETOBIGQUERY ` --templateProperty log.level="LOG_LEVEL" ` --templateProperty hivetobq.bigquery.location="PROJECT_ID.DATASET.TABLE" ` --templateProperty hivetobq.sql="QUERY" ` --templateProperty hivetobq.temp.gcs.bucket="TEMP_BUCKET" ` --templateProperty hivetobq.write.mode="MODE" ` --templateProperty hivetobq.temp.table="TEMPVIEW" ` --templateProperty hivetobq.temp.query="TEMPVIEW_SQL_QUERY"
Windows (cmd.exe)
gcloud dataproc batches submit spark ^ --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^ --version="1.2" ^ --project="PROJECT_ID" ^ --region="REGION" ^ --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ^ --subnet="SUBNET" ^ --service-account="SERVICE_ACCOUNT" ^ --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" ^ --kms-key="KMS_KEY" ^ --labels="LABEL=LABEL_VALUE" ^ -- --template HIVETOBIGQUERY ^ --templateProperty log.level="LOG_LEVEL" ^ --templateProperty hivetobq.bigquery.location="PROJECT_ID.DATASET.TABLE" ^ --templateProperty hivetobq.sql="QUERY" ^ --templateProperty hivetobq.temp.gcs.bucket="TEMP_BUCKET" ^ --templateProperty hivetobq.write.mode="MODE" ^ --templateProperty hivetobq.temp.table="TEMPVIEW" ^ --templateProperty hivetobq.temp.query="TEMPVIEW_SQL_QUERY"
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
- REGION: Required. Compute Engine region.
- TEMPLATE_VERSION: Required. Specify
latest
for the latest template version, or the date of a specific version, for example,2023-03-17_v0.1.0-beta
(visit gs://dataproc-templates-binaries or rungcloud storage ls gs://dataproc-templates-binaries
to list available template versions). - SUBNET: Optional. If a subnet is not specified, the subnet
in the specified REGION in the
default
network is selected.Example:
projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
- HOST and PORT: Required.
Hostname or IP address and port of the source Hive database host.
Example:
10.0.0.33:9083
- DATASET: Required. BigQuery output dataset name.
- TABLE: Required. BigQuery output table name.
- QUERY: Required. Query for extracting data from Hive.
- TEMP_BUCKET: Required. Cloud Storage bucket name (specify the name of the bucket only). This bucket is used for BigQuery loading.
- MODE: Required. Write mode for BigQuery output.
Options:
Append
,Overwrite
,Ignore
, orErrorIfExists
. - TEMPVIEW and TEMPVIEW_SQL_QUERY: Optional. You can use these two optional parameters to apply a Spark SQL transformation while loading data into BigQuery. TEMPVIEW is the temporary view name, and TEMPVIEW_SQL_QUERY is the query statement. TEMPVIEW and the table name in TEMPVIEW_SQL_QUERY must match.
- SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
- PROPERTY and PROPERTY_VALUE:
Optional. Comma-separated list of
Spark property=
value
pairs. - LABEL and LABEL_VALUE:
Optional. Comma-separated list of
label
=value
pairs. - LOG_LEVEL: Optional. Level of logging. Can be one of
ALL
,DEBUG
,ERROR
,FATAL
,INFO
,OFF
,TRACE
, orWARN
. Default:INFO
. -
KMS_KEY: Optional. Cloud Key Management Service to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.
Example:
projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches
Request JSON body:
{ "environmentConfig":{ "executionConfig":{ "subnetworkUri":"SUBNET", "kmsKey": "KMS_KEY", "serviceAccount": "SERVICE_ACCOUNT" } }, "labels": { "LABEL": "LABEL_VALUE" }, "runtimeConfig": { "version": "1.2", "properties": { "spark.hadoop.hive.metastore.uris":"thrift://HOST:PORT", "PROPERTY": "PROPERTY_VALUE" } }, "sparkBatch":{ "mainClass":"com.google.cloud.dataproc.templates.main.DataProcTemplate", "args":[ "--template","HIVETOBIGQUERY", "--templateProperty","log.level=LOG_LEVEL", "--templateProperty","hivetobq.bigquery.location=PROJECT_ID.DATASET.TABLE", "--templateProperty","hivetobq.sql=QUERY", "--templateProperty","hivetobq.temp.gcs.bucket=TEMP_BUCKET", "--templateProperty","hivetobq.write.mode=MODE", "--templateProperty","hivetobq.temp.table=TEMPVIEW", "--templateProperty","hivetobq.temp.query=TEMPVIEW_SQL_QUERY" ], "jarFileUris":[ "gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ] } }
To send your request, expand one of these options:
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata", "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID", "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583", "createTime": "2023-02-24T03:31:03.440329Z", "operationType": "BATCH", "description": "Batch" } }