Deploy Apache Hive on Dataproc

Last reviewed 2023-05-08 UTC

This document describes how you deploy the architecture in Use Apache Hive on Dataproc.

This document is intended for cloud architects and data engineers who are interested in deploying Apache Hive on Dataproc and the Hive Metastore in Cloud SQL.

Architecture

In this deployment guide you deploy all compute and storage services in the same Google Cloud region to minimize network latency and network transport costs.

The following diagram shows the lifecycle of a Hive query.

Diagram of a single-region architecture.

In the diagram, the Hive client submits a query, which is processed, fetched, and returned. Processing takes place in the Hive server. The data is requested from, and returned by, a Hive warehouse stored in a regional bucket in Cloud Storage.

Objectives

  • Create a MySQL instance on Cloud SQL for the Hive metastore.
  • Deploy Hive servers on Dataproc.
  • Install the Cloud SQL Proxy on the Dataproc cluster instances.
  • Upload Hive data to Cloud Storage.
  • Run Hive queries on multiple Dataproc clusters.

Costs

This deployment uses the following billable components of Google Cloud:

  • Dataproc
  • Cloud Storage
  • Cloud SQL

You can use the pricing calculator to generate a cost estimate based on your projected usage.

New Google Cloud users might be eligible for a free trial.

When you finish this deployment, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project. Learn how to check if billing is enabled on a project.

Initialize the environment

  1. Start a Cloud Shell instance:

    Go to Cloud Shell

  2. In Cloud Shell, set environment variables for your project, region, and zone, and set the default Compute Engine zone to the zone where you are going to create your Dataproc clusters.

    export PROJECT=$(gcloud info --format='value(config.project)')
    export REGION=REGION
    export ZONE=ZONE
    gcloud config set compute/zone ${ZONE}

    Replace the following:

    • REGION: The region where you want to create the cluster, such as us-central1.
    • ZONE: The zone where you want to create the cluster, such as us-central1-a.
  3. Enable the Dataproc and Cloud SQL Admin APIs by running this command in Cloud Shell:

    gcloud services enable dataproc.googleapis.com sqladmin.googleapis.com
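
If you want to confirm that both APIs are active before you continue, you can list the enabled services. This is an optional check, and the filter pattern below is only an illustration:

gcloud services list --enabled | grep -E 'dataproc|sqladmin'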

(Optional) Creating the warehouse bucket

If you don't have a Cloud Storage bucket to store Hive data, create a warehouse bucket by running the following commands in Cloud Shell, replacing BUCKET_NAME with a unique bucket name:

export WAREHOUSE_BUCKET=BUCKET_NAME
gsutil mb -l ${REGION} gs://${WAREHOUSE_BUCKET}
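
Optionally, you can check that the bucket was created in the intended region by inspecting its metadata. This is a quick sanity check, not a required step:

gsutil ls -L -b gs://${WAREHOUSE_BUCKET} | grep -i location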

Creating the Cloud SQL instance

In this section, you create a new Cloud SQL instance that will later be used to host the Hive metastore.

In Cloud Shell, create a new Cloud SQL instance:

gcloud sql instances create hive-metastore \
    --database-version="MYSQL_5_7" \
    --activation-policy=ALWAYS \
    --zone ${ZONE}

This command might take a few minutes to complete.
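
Optionally, you can confirm that the instance is ready before you create the cluster. The following is a minimal check, assuming the instance name hive-metastore used above:

gcloud sql instances describe hive-metastore --format='value(state)'

The instance is ready when the command prints RUNNABLE.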

Creating a Dataproc cluster

Create the first Dataproc cluster, replacing CLUSTER_NAME with a name such as hive-cluster:

gcloud dataproc clusters create CLUSTER_NAME \
    --scopes sql-admin \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \
    --properties "hive:hive.metastore.warehouse.dir=gs://${WAREHOUSE_BUCKET}/datasets" \
    --metadata "hive-metastore-instance=${PROJECT}:${REGION}:hive-metastore" \
    --metadata "enable-cloud-sql-proxy-on-workers=false"

Notes:

  • You provide the sql-admin access scope to allow cluster instances to access the Cloud SQL Admin API.
  • You reference the initialization action script with the --initialization-actions flag; the script is hosted in a Cloud Storage bucket that matches the cluster's region. See Initialization actions - Important considerations and guidelines for more information.
  • You provide the URI to the Hive warehouse bucket in the hive:hive.metastore.warehouse.dir property. This configures the Hive servers to read from and write to the correct location. This property must contain at least one directory (for example, gs://my-bucket/my-directory); Hive will not work properly if this property is set to a bucket name without a directory (for example, gs://my-bucket).
  • You specify enable-cloud-sql-proxy-on-workers=false to ensure that the Cloud SQL Proxy only runs on master nodes, which is sufficient for the Hive metastore service to function and avoids unnecessary load on Cloud SQL.
  • You provide the Cloud SQL Proxy initialization action that Dataproc automatically runs on all cluster instances. The action does the following:

    • Installs the Cloud SQL Proxy.
    • Establishes a secure connection to the Cloud SQL instance specified in the hive-metastore-instance metadata parameter.
    • Creates the hive user and the Hive metastore's database.

    You can see the full code for the Cloud SQL Proxy initialization action on GitHub.

  • This deployment uses a Cloud SQL instance with a public IP address. If you instead use an instance that has only a private IP address, you can force the proxy to use the private IP address by passing the --metadata "use-cloud-sql-private-ip=true" parameter.
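
After the cluster is created, you can optionally confirm that the Hive warehouse property and the metastore metadata were applied. The following is a sketch; the exact field paths may vary with your gcloud version:

gcloud dataproc clusters describe CLUSTER_NAME \
    --region ${REGION} \
    --format="yaml(config.softwareConfig.properties, config.gceClusterConfig.metadata)"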

Creating a Hive table

In this section, you upload a sample dataset to your warehouse bucket, create a new Hive table, and run some HiveQL queries on that dataset.

  1. Copy the sample dataset to your warehouse bucket:

    gsutil cp gs://hive-solution/part-00000.parquet \
    gs://${WAREHOUSE_BUCKET}/datasets/transactions/part-00000.parquet

    The sample dataset is compressed in the Parquet format and contains thousands of fictitious bank transaction records with three columns: date, amount, and transaction type.

  2. Create an external Hive table for the dataset:

    gcloud dataproc jobs submit hive \
        --cluster CLUSTER_NAME \
        --region ${REGION} \
        --execute "
          CREATE EXTERNAL TABLE transactions
          (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
          STORED AS PARQUET
          LOCATION 'gs://${WAREHOUSE_BUCKET}/datasets/transactions';"
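
To confirm that the table was registered, you can optionally run a quick metadata query through the same Jobs API. This is only a sanity check:

gcloud dataproc jobs submit hive \
    --cluster CLUSTER_NAME \
    --region ${REGION} \
    --execute "SHOW TABLES; DESCRIBE transactions;"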

Running Hive queries

You can use different tools inside Dataproc to run Hive queries. In this section, you learn how to perform queries using the following tools:

  • The Dataproc Jobs API
  • Beeline
  • SparkSQL

In each section, you run a sample query.

Querying Hive with the Dataproc Jobs API

Run the following simple HiveQL query to verify that the Parquet file is correctly linked to the Hive table:

gcloud dataproc jobs submit hive \
    --cluster CLUSTER_NAME \
    --region ${REGION} \
    --execute "
      SELECT *
      FROM transactions
      LIMIT 10;"

The output includes the following:

+-----------------+--------------------+------------------+
| submissiondate  | transactionamount  | transactiontype  |
+-----------------+--------------------+------------------+
| 2017-12-03      | 1167.39            | debit            |
| 2017-09-23      | 2567.87            | debit            |
| 2017-12-22      | 1074.73            | credit           |
| 2018-01-21      | 5718.58            | debit            |
| 2017-10-21      | 333.26             | debit            |
| 2017-09-12      | 2439.62            | debit            |
| 2017-08-06      | 5885.08            | debit            |
| 2017-12-05      | 7353.92            | authorization    |
| 2017-09-12      | 4710.29            | authorization    |
| 2018-01-05      | 9115.27            | debit            |
+-----------------+--------------------+------------------+
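
If you prefer not to block your shell while a job runs, you can submit the same kind of query asynchronously and wait for it later. The following is a sketch; replace JOB_ID with the job ID that the submit command prints:

gcloud dataproc jobs submit hive \
    --cluster CLUSTER_NAME \
    --region ${REGION} \
    --async \
    --execute "SELECT COUNT(*) FROM transactions;"
gcloud dataproc jobs wait JOB_ID --region ${REGION}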

Querying Hive with Beeline

  1. Open an SSH session with the Dataproc cluster's master instance (CLUSTER_NAME-m):

    gcloud compute ssh CLUSTER_NAME-m
  2. In the master instance's command prompt, open a Beeline session:

    beeline -u "jdbc:hive2://localhost:10000"

    Notes:

    • You can also reference the master instance's name as the host instead of localhost:

      beeline -u "jdbc:hive2://CLUSTER_NAME-m:10000"
    • If you are using high-availability mode with three masters, use the following command instead:

      beeline -u "jdbc:hive2://CLUSTER_NAME-m-0:2181,CLUSTER_NAME-m-1:2181,CLUSTER_NAME-m-2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
  3. When the Beeline prompt appears, run the following HiveQL query:

    SELECT TransactionType, AVG(TransactionAmount) AS AverageAmount
    FROM transactions
    WHERE SubmissionDate = '2017-12-22'
    GROUP BY TransactionType;

    The output includes the following:

    +------------------+--------------------+
    | transactiontype  |   averageamount    |
    +------------------+--------------------+
    | authorization    | 4890.092525252529  |
    | credit           | 4863.769269565219  |
    | debit            | 4982.781458176331  |
    +------------------+--------------------+
  4. Close the Beeline session:

    !quit
  5. Close the SSH connection:

    exit
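
If you prefer to run a query non-interactively, Beeline also accepts an inline statement. The following is a sketch that you could run from the master instance instead of opening an interactive session:

beeline -u "jdbc:hive2://localhost:10000" \
    -e "SELECT COUNT(*) FROM transactions;"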

Querying Hive with SparkSQL

  1. Open an SSH session with the Dataproc cluster's master instance:

    gcloud compute ssh CLUSTER_NAME-m
  2. In the master instance's command prompt, open a new PySpark shell session:

    pyspark
  3. When the PySpark shell prompt appears, type the following Python code:

    from pyspark.sql import HiveContext
    hc = HiveContext(sc)
    hc.sql("""
    SELECT SubmissionDate, AVG(TransactionAmount) as AvgDebit
    FROM transactions
    WHERE TransactionType = 'debit'
    GROUP BY SubmissionDate
    HAVING SubmissionDate >= '2017-10-01' AND SubmissionDate < '2017-10-06'
    ORDER BY SubmissionDate
    """).show()

    The output includes the following:

    +-----------------+--------------------+
    | submissiondate  |      avgdebit      |
    +-----------------+--------------------+
    | 2017-10-01      | 4963.114920399849  |
    | 2017-10-02      | 5021.493300510582  |
    | 2017-10-03      | 4982.382279569891  |
    | 2017-10-04      | 4873.302702503676  |
    | 2017-10-05      | 4967.696333583777  |
    +-----------------+--------------------+
  4. Close the PySpark session:

    exit()
  5. Close the SSH connection:

    exit
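
You can also run the same kind of query without an SSH session by submitting it as a Spark SQL job from Cloud Shell. The following is a sketch, assuming the cluster that you created earlier:

gcloud dataproc jobs submit spark-sql \
    --cluster CLUSTER_NAME \
    --region ${REGION} \
    --execute "
      SELECT TransactionType, AVG(TransactionAmount) AS AverageAmount
      FROM transactions
      GROUP BY TransactionType;"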

Inspecting the Hive metastore

You now verify that the Hive metastore in Cloud SQL contains information about the transactions table.

  1. In Cloud Shell, start a new MySQL session on the Cloud SQL instance:

    gcloud sql connect hive-metastore --user=root

    When you're prompted for the root user password, press RETURN without typing anything. For the sake of simplicity in this deployment, you did not set a password for the root user. For information about setting a password to further protect the metastore database, refer to the Cloud SQL documentation; a minimal example also follows this procedure. The Cloud SQL Proxy initialization action additionally provides a mechanism for protecting passwords through encryption. For more information, see the action's code repository.

  2. In the MySQL command prompt, make hive_metastore the default database for the rest of the session:

    USE hive_metastore;
  3. Verify that the warehouse bucket's location is recorded in the metastore:

    SELECT DB_LOCATION_URI FROM DBS;

    The output looks like this:

    +-------------------------------------+
    | DB_LOCATION_URI                     |
    +-------------------------------------+
    | gs://[WAREHOUSE_BUCKET]/datasets   |
    +-------------------------------------+
  4. Verify that the table is correctly referenced in the metastore:

    SELECT TBL_NAME, TBL_TYPE FROM TBLS;

    The output looks like this:

    +--------------+----------------+
    | TBL_NAME     | TBL_TYPE       |
    +--------------+----------------+
    | transactions | EXTERNAL_TABLE |
    +--------------+----------------+
  5. Verify that the table's columns are also correctly referenced:

    SELECT COLUMN_NAME, TYPE_NAME
    FROM COLUMNS_V2 c, TBLS t
    WHERE c.CD_ID = t.SD_ID AND t.TBL_NAME = 'transactions';

    The output looks like this:

    +-------------------+-----------+
    | COLUMN_NAME       | TYPE_NAME |
    +-------------------+-----------+
    | submissiondate    | date      |
    | transactionamount | double    |
    | transactiontype   | string    |
    +-------------------+-----------+
  6. Verify that the input format and location are also correctly referenced:

    SELECT INPUT_FORMAT, LOCATION
    FROM SDS s, TBLS t
    WHERE s.SD_ID = t.SD_ID AND t.TBL_NAME = 'transactions';

    The output looks like this:

    +---------------------------------------------------------------+------------------------------------------------+
    | INPUT_FORMAT                                                  | LOCATION                                       |
    +---------------------------------------------------------------+------------------------------------------------+
    | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | gs://[WAREHOUSE_BUCKET]/datasets/transactions |
    +---------------------------------------------------------------+------------------------------------------------+
    
  7. Close the MySQL session:

    exit
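
As noted earlier, the root user has no password in this deployment. If you want to protect the metastore database, one option is to set a password on the Cloud SQL instance. The following is a minimal sketch; choose your own password when prompted:

gcloud sql users set-password root \
    --host=% \
    --instance=hive-metastore \
    --prompt-for-password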

Creating another Dataproc cluster

In this section, you create another Dataproc cluster to verify that the Hive data and Hive metastore can be shared across multiple clusters.

  1. Create a new Dataproc cluster:

    gcloud dataproc clusters create other-CLUSTER_NAME \
        --scopes cloud-platform \
        --image-version 2.0 \
        --region ${REGION} \
        --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \
        --properties "hive:hive.metastore.warehouse.dir=gs://${WAREHOUSE_BUCKET}/datasets" \
        --metadata "hive-metastore-instance=${PROJECT}:${REGION}:hive-metastore"\
        --metadata "enable-cloud-sql-proxy-on-workers=false"
  2. Verify that the new cluster can access the data:

    gcloud dataproc jobs submit hive \
        --cluster other-CLUSTER_NAME \
        --region ${REGION} \
        --execute "
          SELECT TransactionType, COUNT(TransactionType) as Count
          FROM transactions
          WHERE SubmissionDate = '2017-08-22'
          GROUP BY TransactionType;"

    The output includes the following:

    +------------------+--------+
    | transactiontype  | count  |
    +------------------+--------+
    | authorization    | 696    |
    | credit           | 1722   |
    | debit            | 2599   |
    +------------------+--------+
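
Both clusters now share the same warehouse bucket and the same Cloud SQL metastore. If you want to see them side by side, you can list the clusters in the region; this is only an optional check:

gcloud dataproc clusters list --region ${REGION}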

Congratulations, you've completed the steps in the deployment.

Clean up

The following sections explain how you can avoid future charges for your Google Cloud project and the Apache Hive and Dataproc resources that you used in this deployment.

Delete the Google Cloud project

To avoid incurring charges to your Google Cloud account for the resources used in this deployment, you can delete the Google Cloud project.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting individual resources

Run the following commands in Cloud Shell to delete individual resources instead of deleting the whole project:

gcloud dataproc clusters delete CLUSTER_NAME --region ${REGION} --quiet
gcloud dataproc clusters delete other-CLUSTER_NAME --region ${REGION} --quiet
gcloud sql instances delete hive-metastore --quiet
gsutil rm -r gs://${WAREHOUSE_BUCKET}/datasets
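
If you created the warehouse bucket only for this deployment and it no longer contains other data, you can also remove the bucket itself. This step is optional and assumes that the bucket is now empty:

gsutil rb gs://${WAREHOUSE_BUCKET}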

What's next