Use Apache Hive on Dataproc

Last reviewed 2023-05-08 UTC

This reference architecture shows how to use Apache Hive on Dataproc in an efficient and flexible way by storing Hive data in Cloud Storage and hosting the Hive metastore in a MySQL database on Cloud SQL.

This document is intended for cloud architects and data engineers who are interested in deploying Apache Hive on Dataproc and the Hive Metastore in Cloud SQL.

Architecture

The following diagram shows the lifecycle of a Hive query.

Diagram of a single-region architecture.

In the diagram, the lifecycle of a Hive query follows these steps:

  1. The Hive client submits a query to a Hive server that runs in an ephemeral Dataproc cluster.
  2. The Hive server processes the query and requests metadata from the metastore service.
  3. The metastore service fetches Hive metadata from Cloud SQL through the Cloud SQL Proxy.
  4. The Hive server loads data from the Hive warehouse located in a regional bucket in Cloud Storage.
  5. The Hive server returns the result to the client.

Design alternatives

The following section presents a potential design alternative for this architecture.

Multi-regional architecture

Consider using a multi-regional architecture if you need to run Hive servers in different geographic regions. In that case, you should create separate Dataproc clusters that are dedicated to hosting the metastore service and that reside in the same region as the Cloud SQL instance.

The metastore service can sometimes send high volumes of requests to the MySQL database, so it is critical to keep the metastore service geographically close to the MySQL database in order to minimize impact on performance. In comparison, the Hive server typically sends far fewer requests to the metastore service. Therefore, it can be more acceptable for the Hive server and the metastore service to reside in different regions despite the increased latency.

The metastore service can run only on Dataproc master nodes, not on worker nodes. Dataproc enforces a minimum of two worker nodes in standard clusters and in high-availability clusters.

To avoid wasting resources on unused worker nodes, you can create a single-node cluster for the metastore service instead. To achieve high availability, you can create multiple single-node clusters.

The Cloud SQL proxy needs to be installed only on the metastore service clusters, because only the metastore service clusters need to directly connect to the Cloud SQL instance. The Hive servers then point to the metastore service clusters by setting the hive.metastore.uris property to the comma-separated list of URIs. For example:

thrift://metastore1:9083,thrift://metastore2:9083

You can also consider using a dual-region or multi-region bucket if the Hive data needs to be accessed from Hive servers that are located in multiple locations. The choice between different bucket location types depends on your use case. You must balance latency, availability, and costs.

The following diagram shows an example of a multi-regional architecture.

Diagram of a multi-region Hive architecture.

The multi-regional scenario is slightly more complex, but it is also more robust. The deployment guide for this reference architecture uses the single-region scenario.

Advantages of separating compute and storage

Separating compute and storage resources offers some advantages:

  • Flexibility and agility: You can tailor cluster configurations for specific Hive workloads and scale each cluster independently up and down as needed.
  • Cost savings: You can spin up an ephemeral cluster when you need to run a Hive job and then delete it when the job completes. The resources that your jobs require are active only when they're being used, so you pay only for what you use. You can also use preemptible VMs for noncritical data processing or to create very large clusters at a lower total cost.
  • Resilience: For simplicity, this reference architecture uses only one master instance. To increase resilience in production workloads, you should consider creating a cluster with three master instances by using Dataproc's high availability mode.

Cost optimization

This reference architecture and deployment use the following billable components of Google Cloud:

  • Dataproc
  • Cloud Storage
  • Cloud SQL

You can use the pricing calculator to generate a cost estimate based on your projected usage.

New Google Cloud users might be eligible for a free trial.

Deployment

To deploy this architecture, see Deploy Apache Hive on Dataproc.

What's next