This document describes how you deploy the architecture in Use Apache Hive on Dataproc.
This document is intended for cloud architects and data engineers who are interested in deploying Apache Hive on Dataproc and the Hive Metastore in Cloud SQL.
Architecture
In this deployment guide, you deploy all compute and storage services in the same Google Cloud region to minimize network latency and network transport costs.
The following diagram shows the lifecycle of a Hive query.
In the diagram, the Hive client submits a query, which is processed, fetched, and returned. Processing takes place in the Hive server. The data is requested and returned from a Hive warehouse stored in a regional bucket in Cloud Storage.
Objectives
- Create a MySQL instance on Cloud SQL for the Hive metastore.
- Deploy Hive servers on Dataproc.
- Install the Cloud SQL Proxy on the Dataproc cluster instances.
- Upload Hive data to Cloud Storage.
- Run Hive queries on multiple Dataproc clusters.
Costs
This deployment uses the following billable components of Google Cloud:
- Dataproc
- Cloud Storage
- Cloud SQL
You can use the pricing calculator to generate a cost estimate based on your projected usage.
New Google Cloud users might be eligible for a free trial.
When you finish this deployment, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Make sure that billing is enabled for your Google Cloud project. Learn how to check if billing is enabled on a project.
Initialize the environment
Start a Cloud Shell instance:
In Cloud Shell, set the default Compute Engine zone to the zone where you are going to create your Dataproc clusters.
export PROJECT=$(gcloud info --format='value(config.project)')
export REGION=REGION
export ZONE=ZONE
gcloud config set compute/zone ${ZONE}
Replace the following:
- REGION: The region where you want to create the cluster, such as us-central1.
- ZONE: The zone where you want to create the cluster, such as us-central1-a.
Enable the Dataproc and Cloud SQL Admin APIs by running this command in Cloud Shell:
gcloud services enable dataproc.googleapis.com sqladmin.googleapis.com
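Optionally, you can confirm that both APIs are enabled before continuing. The following check is only an illustrative sketch, not part of the original steps:

# List enabled services and keep only the Dataproc and Cloud SQL Admin entries (illustrative check).
gcloud services list --enabled | grep -E 'dataproc.googleapis.com|sqladmin.googleapis.com'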
(Optional) Creating the warehouse bucket
If you don't have a Cloud Storage bucket to store Hive data, create a warehouse bucket by running the following commands in Cloud Shell, replacing BUCKET_NAME with a unique bucket name:
export WAREHOUSE_BUCKET=BUCKET_NAME
gcloud storage buckets create gs://${WAREHOUSE_BUCKET} --location=${REGION}
Creating the Cloud SQL instance
In this section, you create a new Cloud SQL instance that you later use to host the Hive metastore.
In Cloud Shell, create a new Cloud SQL instance:
gcloud sql instances create hive-metastore \
    --database-version="MYSQL_5_7" \
    --activation-policy=ALWAYS \
    --zone ${ZONE}
This command might take a few minutes to complete.
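If you want to confirm that the instance is ready before moving on, one option (a minimal optional check, not part of the original steps) is to query its state, which should report RUNNABLE:

# Show the current state of the Cloud SQL instance; RUNNABLE means it is ready to accept connections.
gcloud sql instances describe hive-metastore --format='value(state)'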
Creating a Dataproc cluster
Create the first Dataproc cluster, replacing CLUSTER_NAME with a name such as hive-cluster:
gcloud dataproc clusters create CLUSTER_NAME \
    --scopes sql-admin \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \
    --properties "hive:hive.metastore.warehouse.dir=gs://${WAREHOUSE_BUCKET}/datasets" \
    --metadata "hive-metastore-instance=${PROJECT}:${REGION}:hive-metastore" \
    --metadata "enable-cloud-sql-proxy-on-workers=false"
Notes:
- You provide the sql-admin access scope to allow cluster instances to access the Cloud SQL Admin API.
- You put your initialization action in a script that you store in a Cloud Storage bucket, and you reference that bucket with the --initialization-actions flag. See Initialization actions - Important considerations and guidelines for more information.
- You provide the URI to the Hive warehouse bucket in the hive:hive.metastore.warehouse.dir property. This configures the Hive servers to read from and write to the correct location. This property must contain at least one directory (for example, gs://my-bucket/my-directory); Hive will not work properly if this property is set to a bucket name without a directory (for example, gs://my-bucket).
- You specify enable-cloud-sql-proxy-on-workers=false to ensure that the Cloud SQL Proxy only runs on master nodes, which is sufficient for the Hive metastore service to function and avoids unnecessary load on Cloud SQL.
- You provide the Cloud SQL Proxy initialization action that Dataproc automatically runs on all cluster instances. The action does the following:
  - Installs the Cloud SQL Proxy.
  - Establishes a secure connection to the Cloud SQL instance specified in the hive-metastore-instance metadata parameter.
  - Creates the hive user and the Hive metastore's database.
You can see the full code for the Cloud SQL Proxy initialization action on GitHub.
This deployment uses a Cloud SQL instance with a public IP address. If you instead use an instance with only a private IP address, you can force the proxy to use the private IP address by passing the --metadata "use-cloud-sql-private-ip=true" parameter, as shown in the sketch that follows.
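For example, assuming a private-IP-only Cloud SQL instance, the cluster creation command shown earlier would gain one extra metadata entry; this is only an illustrative sketch, and all other flags stay the same:

# Same as the earlier cluster creation command, plus the private IP metadata flag (illustrative).
gcloud dataproc clusters create CLUSTER_NAME \
    --scopes sql-admin \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \
    --properties "hive:hive.metastore.warehouse.dir=gs://${WAREHOUSE_BUCKET}/datasets" \
    --metadata "hive-metastore-instance=${PROJECT}:${REGION}:hive-metastore" \
    --metadata "enable-cloud-sql-proxy-on-workers=false" \
    --metadata "use-cloud-sql-private-ip=true"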
Creating a Hive table
In this section, you upload a sample dataset to your warehouse bucket, create a new Hive table, and run some HiveQL queries on that dataset.
Copy the sample dataset to your warehouse bucket:
gcloud storage cp gs://hive-solution/part-00000.parquet \
    gs://${WAREHOUSE_BUCKET}/datasets/transactions/part-00000.parquet
The sample dataset is compressed in the Parquet format and contains thousands of fictitious bank transaction records with three columns: date, amount, and transaction type.
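To confirm that the copy succeeded, you can list the destination path; this quick check is optional and not part of the original steps:

# List the uploaded Parquet file in the warehouse bucket.
gcloud storage ls gs://${WAREHOUSE_BUCKET}/datasets/transactions/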
Create an external Hive table for the dataset:
gcloud dataproc jobs submit hive \
    --cluster CLUSTER_NAME \
    --region ${REGION} \
    --execute "
      CREATE EXTERNAL TABLE transactions
      (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
      STORED AS PARQUET
      LOCATION 'gs://${WAREHOUSE_BUCKET}/datasets/transactions';"
Running Hive queries
You can use different tools inside Dataproc to run Hive queries. In this section, you learn how to perform queries using the following tools:
- Dataproc's Hive jobs API.
- Beeline, a popular command line client that is based on SQLLine.
- SparkSQL, Apache Spark's API for querying structured data.
In each section, you run a sample query.
Querying Hive with the Dataproc Jobs API
Run the following simple HiveQL query to verify that the Parquet file is correctly linked to the Hive table:
gcloud dataproc jobs submit hive \
    --cluster CLUSTER_NAME \
    --region ${REGION} \
    --execute "
      SELECT *
      FROM transactions
      LIMIT 10;"
The output includes the following:
+-----------------+--------------------+------------------+
| submissiondate  | transactionamount  | transactiontype  |
+-----------------+--------------------+------------------+
| 2017-12-03      | 1167.39            | debit            |
| 2017-09-23      | 2567.87            | debit            |
| 2017-12-22      | 1074.73            | credit           |
| 2018-01-21      | 5718.58            | debit            |
| 2017-10-21      | 333.26             | debit            |
| 2017-09-12      | 2439.62            | debit            |
| 2017-08-06      | 5885.08            | debit            |
| 2017-12-05      | 7353.92            | authorization    |
| 2017-09-12      | 4710.29            | authorization    |
| 2018-01-05      | 9115.27            | debit            |
+-----------------+--------------------+------------------+
Querying Hive with Beeline
Open an SSH session with the Dataproc cluster's master instance (CLUSTER_NAME-m):
gcloud compute ssh CLUSTER_NAME-m
In the master instance's command prompt, open a Beeline session:
beeline -u "jdbc:hive2://localhost:10000"
Notes:
- You can also reference the master instance's name as the host instead of localhost:
  beeline -u "jdbc:hive2://CLUSTER_NAME-m:10000"
- If you were using the high-availability mode with 3 masters, you would have to use the following command instead:
  beeline -u "jdbc:hive2://CLUSTER_NAME-m-0:2181,CLUSTER_NAME-m-1:2181,CLUSTER_NAME-m-2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
When the Beeline prompt appears, run the following HiveQL query:
SELECT TransactionType, AVG(TransactionAmount) AS AverageAmount
FROM transactions
WHERE SubmissionDate = '2017-12-22'
GROUP BY TransactionType;
The output includes the following:
+------------------+--------------------+
| transactiontype  | averageamount      |
+------------------+--------------------+
| authorization    | 4890.092525252529  |
| credit           | 4863.769269565219  |
| debit            | 4982.781458176331  |
+------------------+--------------------+
Close the Beeline session:
!quit
Close the SSH connection:
exit
Querying Hive with SparkSQL
Open an SSH session with the Dataproc cluster's master instance:
gcloud compute ssh CLUSTER_NAME-m
In the master instance's command prompt, open a new PySpark shell session:
pyspark
When the PySpark shell prompt appears, type the following Python code:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql("""
    SELECT SubmissionDate, AVG(TransactionAmount) as AvgDebit
    FROM transactions
    WHERE TransactionType = 'debit'
    GROUP BY SubmissionDate
    HAVING SubmissionDate >= '2017-10-01' AND SubmissionDate < '2017-10-06'
    ORDER BY SubmissionDate
""").show()
The output includes the following:
+-----------------+--------------------+
| submissiondate  | avgdebit           |
+-----------------+--------------------+
| 2017-10-01      | 4963.114920399849  |
| 2017-10-02      | 5021.493300510582  |
| 2017-10-03      | 4982.382279569891  |
| 2017-10-04      | 4873.302702503676  |
| 2017-10-05      | 4967.696333583777  |
+-----------------+--------------------+
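Note that HiveContext is deprecated in recent Spark versions. On Dataproc images that ship Spark 2 or later, the PySpark shell typically exposes a Hive-enabled SparkSession named spark, so the same query can usually be run without creating a HiveContext. The following is a sketch under that assumption:

# Assumes the built-in, Hive-enabled SparkSession exposed as `spark` in the Dataproc PySpark shell.
spark.sql("""
    SELECT SubmissionDate, AVG(TransactionAmount) AS AvgDebit
    FROM transactions
    WHERE TransactionType = 'debit'
    GROUP BY SubmissionDate
    HAVING SubmissionDate >= '2017-10-01' AND SubmissionDate < '2017-10-06'
    ORDER BY SubmissionDate
""").show()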
Close the PySpark session:
exit()
Close the SSH connection:
exit
Inspecting the Hive metastore
You now verify that the Hive metastore in Cloud SQL contains information about the transactions table.
In Cloud Shell, start a new MySQL session on the Cloud SQL instance:
gcloud sql connect hive-metastore --user=root
When you're prompted for the root user password, do not type anything and just press the RETURN key. For the sake of simplicity in this deployment, you did not set any password for the root user. For information about setting a password to further protect the metastore database, refer to the Cloud SQL documentation. The Cloud SQL Proxy initialization action also provides a mechanism for protecting passwords through encryption. For more information, see the action's code repository.
In the MySQL command prompt, make hive_metastore the default database for the rest of the session:
USE hive_metastore;
Verify that the warehouse bucket's location is recorded in the metastore:
SELECT DB_LOCATION_URI FROM DBS;
The output looks like this:
+-------------------------------------+
| DB_LOCATION_URI                     |
+-------------------------------------+
| gs://[WAREHOUSE_BUCKET]/datasets    |
+-------------------------------------+
Verify that the table is correctly referenced in the metastore:
SELECT TBL_NAME, TBL_TYPE FROM TBLS;
The output looks like this:
+--------------+----------------+
| TBL_NAME     | TBL_TYPE       |
+--------------+----------------+
| transactions | EXTERNAL_TABLE |
+--------------+----------------+
Verify that the table's columns are also correctly referenced:
SELECT COLUMN_NAME, TYPE_NAME
FROM COLUMNS_V2 c, TBLS t
WHERE c.CD_ID = t.SD_ID AND t.TBL_NAME = 'transactions';
The output looks like this:
+-------------------+-----------+
| COLUMN_NAME       | TYPE_NAME |
+-------------------+-----------+
| submissiondate    | date      |
| transactionamount | double    |
| transactiontype   | string    |
+-------------------+-----------+
Verify that the input format and location are also correctly referenced:
SELECT INPUT_FORMAT, LOCATION
FROM SDS s, TBLS t
WHERE s.SD_ID = t.SD_ID AND t.TBL_NAME = 'transactions';
The output looks like this:
+---------------------------------------------------------------+------------------------------------------------+
| INPUT_FORMAT                                                   | LOCATION                                       |
+---------------------------------------------------------------+------------------------------------------------+
| org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  | gs://[WAREHOUSE_BUCKET]/datasets/transactions  |
+---------------------------------------------------------------+------------------------------------------------+
Close the MySQL session:
exit
Creating another Dataproc cluster
In this section, you create another Dataproc cluster to verify that the Hive data and Hive metastore can be shared across multiple clusters.
Create a new Dataproc cluster:
gcloud dataproc clusters create other-CLUSTER_NAME \
    --scopes cloud-platform \
    --image-version 2.0 \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \
    --properties "hive:hive.metastore.warehouse.dir=gs://${WAREHOUSE_BUCKET}/datasets" \
    --metadata "hive-metastore-instance=${PROJECT}:${REGION}:hive-metastore" \
    --metadata "enable-cloud-sql-proxy-on-workers=false"
Verify that the new cluster can access the data:
gcloud dataproc jobs submit hive \
    --cluster other-CLUSTER_NAME \
    --region ${REGION} \
    --execute "
      SELECT TransactionType, COUNT(TransactionType) as Count
      FROM transactions
      WHERE SubmissionDate = '2017-08-22'
      GROUP BY TransactionType;"
The output includes the following:
+------------------+--------+
| transactiontype  | count  |
+------------------+--------+
| authorization    | 696    |
| credit           | 1722   |
| debit            | 2599   |
+------------------+--------+
Congratulations, you've completed the steps in the deployment.
Clean up
The following sections explain how you can avoid future charges for your Google Cloud project and the Apache Hive and Dataproc resources that you used in this deployment.
Delete the Google Cloud project
To avoid incurring charges to your Google Cloud account for the resources used in this deployment, you can delete the Google Cloud project.
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Deleting individual resources
Run the following commands in Cloud Shell to delete individual resources instead of deleting the whole project:
gcloud dataproc clusters delete CLUSTER_NAME --region ${REGION} --quiet
gcloud dataproc clusters delete other-CLUSTER_NAME --region ${REGION} --quiet
gcloud sql instances delete hive-metastore --quiet
gcloud storage rm gs://${WAREHOUSE_BUCKET}/datasets --recursive
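If you also want to remove the warehouse bucket itself, and assuming you no longer need it for anything else, you can delete it after its contents are removed:

# Delete the now-empty warehouse bucket (only if you no longer need it).
gcloud storage buckets delete gs://${WAREHOUSE_BUCKET}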
What's next
- Try BigQuery, Google's serverless, highly scalable, low-cost enterprise data warehouse.
- Check out this guide on migrating Hadoop workloads to Google Cloud.
- Check out this initialization action for more details on how to use Hive HCatalog on Dataproc.
- Learn how to configure Cloud SQL for high availability to increase service reliability.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.