Use transactional tables with Dataproc Metastore

Transactions with ACID semantics is supported by Apache Hive metastores in Dataproc Metastore. For more information, see Hive Transactions. These transactions are enabled by default on Hive 3.

Configurations

You must set server and client side configurations in order to enable transaction support.

Server side configurations

The following server side configurations are set by default during the creation of the service by Dataproc Metastore. You can choose to override these by entering Key and Value overrides under Metastore config overrides.

  • metastore.compactor.initiator.on — Whether to run the initiator and cleaner threads on the Dataproc Metastore service.

    Set to true to enable the initiator.

  • metastore.compactor.worker.threads — The number of compactor worker threads to run on the Dataproc Metastore.

    Set to a positive number to enable the compactor. Setting this to a higher number may affect the performance of the service, especially if you're on Developer tier. If this number needs to be tweaked, we recommend using a lower value, such as 8.

  • hive.metastore.event.db.notification.api.auth — Whether the Dataproc Metastore service should authorize against database notification related APIs.

    Set to false. If set to true, then only the superusers in proxy settings have permission. See Metastore notification API security for more information on superuser proxy privilege.

Client side configurations

Client side configurations are set in the Hive client as described in Validate transactions.

  • hive.support.concurrency — Set to true to support insert, update, and delete transactions.

  • hive.exec.dynamic.partition.mode — In strict mode, you must specify at least one static partition in case all partitions are accidentally overwritten. In nonstrict mode, all partitions are allowed to be dynamic.

    Set to nonstrict to support insert, update, and delete transactions.

  • hive.txn.manager — Set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.

Validate transactions

You can validate Hive transactions using a Dataproc cluster that uses a Dataproc Metastore service on Hive 3.

You must create the Dataproc cluster in the same project as the Dataproc Metastore service and with Hive 3. The Dataproc 2.0 images, 2.0-ubuntu18 and 2.0-debian10, support Hive 3 and transactions. You can use the flag --image-version to set the 2.0 image. For example:

gcloud dataproc clusters create DATAPROC_CLUSTER_ID \
   --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/SERVICE \
   --region=REGION \
   --image-version 2.0-debian10

The following instructions demonstrate how to validate transactions in your Dataproc Metastore service that is used by a Dataproc cluster.

  1. SSH into the Dataproc cluster. You can do this from either a browser or from the command line.

  2. Run the command hive to open the Hive client:

    $> hive
    
  3. Set up the client side configurations to enable Hive ACID support for transactions in the hive client session:

    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.support.concurrency=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    
  4. Create a transactional table to insert and update into. The following is an example.

    1. Create a transaction table:

      create table student (id int, name string, age int)
      STORED AS ORC TBLPROPERTIES ('transactional' = 'true');
      
    2. Check if the table is transactional:

      describe formatted <tableName>;
      

      A list of the table properties are printed. A transactional table has transactional=true in its table parameters.

    3. Insert data into the table:

      INSERT INTO student VALUES
      (1, 'Alice', 10),
      (2, 'Bob', 10),
      (3, 'Charlie', 10);
      
      1. Observe the delta folder created under the student directory in the warehouse directory of the service. Multiple delta folders are created if you run multiple insert or update statements.
    4. View which compactions are running and their statuses. Hive metastore runs a thread called initiator every five minutes to check for tables which are due for compaction and requests compaction for those tables.

      show compactions;
      
      1. To start a manual compaction (either minor or major):

          ALTER TABLE student COMPACT 'minor';
          ALTER TABLE student COMPACT 'major';
        

What's next