Use hierarchical namespace enabled buckets for Hadoop workloads

This page describes how to use hierarchical namespace enabled buckets for Hadoop workloads.

Overview

When using a Cloud Storage bucket with hierarchical namespace enabled, you can configure the Cloud Storage connector to use the rename folder operation for workloads such as Hadoop, Spark, and Hive.

In a bucket without hierarchical namespace, a rename operation in Hadoop, Spark, and Hive involves multiple object copy and delete jobs, which impacts performance and consistency. Renaming a folder through the Cloud Storage connector optimizes performance and ensures consistency when handling folders with a large number of objects.
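
For example, with folder operations enabled, a rename like the following completes as a single folder rename instead of per-object copies and deletes (the bucket and paths shown are placeholders):

    hadoop fs -mv gs://my-hns-bucket/logs/2024 gs://my-hns-bucket/archive/2024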

Dataproc

You can use the Google Cloud CLI to create a Dataproc cluster and enable the Cloud Storage connector to perform the folder operations.

  1. Install or update the Cloud Storage connector to version 2.2.23 or later (excluding version 3.0.0).

  2. Create a Dataproc cluster using the following command:

      gcloud dataproc clusters create CLUSTER_NAME \
          --properties=core:fs.gs.hierarchical.namespace.folders.enable=true,core:fs.gs.http.read-timeout=30000

    Where:

    • CLUSTER_NAME is the name of the cluster. For example, my-cluster.
    • fs.gs.hierarchical.namespace.folders.enable enables the Cloud Storage connector to use folder operations on hierarchical namespace enabled buckets. The bucket itself must be created with hierarchical namespace enabled, as sketched after this list.
    • fs.gs.http.read-timeout is the maximum time allowed, in milliseconds, to read data from an established connection. This setting is optional.
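
A hierarchical namespace can only be enabled when a bucket is created, so these steps assume such a bucket already exists. A minimal sketch of creating one with the gcloud CLI (the bucket name and location are placeholders):

    gcloud storage buckets create gs://my-hns-bucket \
        --location=us-central1 \
        --uniform-bucket-level-access \
        --enable-hierarchical-namespace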

Self-managed Hadoop

You can enable the Cloud Storage connector on your self-managed Hadoop cluster to perform the folder operations.

  1. Install or update the Cloud Storage connector to version 2.2.23 or later (excluding version 3.0.0).

  2. Add the following to the core-site.xml configuration file:

        <property>
          <name>fs.gs.hierarchical.namespace.folders.enable</name>
          <value>true</value>
        </property>
        <property>
          <name>fs.gs.http.read-timeout</name>
          <value>30000</value>
        </property>
      

    Where:

    • fs.gs.hierarchical.namespace.folders.enable enables the Cloud Storage connector to use folder operations on hierarchical namespace enabled buckets; see the smoke test after this list.
    • fs.gs.http.read-timeout is the maximum time allowed, in milliseconds, to read data from an established connection. This setting is optional.
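
As a quick smoke test of this configuration, you can create and rename a folder through the connector. A sketch assuming gs://my-hns-bucket is a hierarchical namespace enabled bucket (placeholder name):

    # Create a folder with a nested path.
    hadoop fs -mkdir -p gs://my-hns-bucket/foo/nested
    # With folder operations enabled, this completes as a single folder rename.
    hadoop fs -mv gs://my-hns-bucket/foo gs://my-hns-bucket/bar
    hadoop fs -ls gs://my-hns-bucket/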

Compatibility with Cloud Storage connector version 3.0.0 or versions older than 2.2.23

Using Cloud Storage connector version 3.0.0, using a version older than 2.2.23, or disabling folder operations for hierarchical namespace can lead to the following limitations:

  • Inefficient folder renames: Folder rename operations in Hadoop happen using object-level copy and delete operations, which are slower and less efficient than the dedicated rename folder operation.

  • Accumulation of empty folders: Folder resources are not deleted automatically, leading to the accumulation of empty folders in your bucket. Accumulation of empty folders can have the following impact:

    • Increased storage costs if the folders are not deleted explicitly.
    • Slower list operations and an increased risk of list operation timeouts.

  • Compatibility issues: Mixing older and newer connector versions, or enabling and disabling folder operations, can lead to compatibility issues when renaming folders. Consider the following scenario, which uses a combination of connector versions:

    1. Use a Cloud Storage connector version older than 2.2.23 to perform the following tasks:

      1. Write objects under the folder foo/.
      2. Rename the folder foo/ to bar/. The rename operation copies and deletes the objects under foo/ but does not delete the empty foo/ folder.
    2. Use Cloud Storage connector version 2.2.23, with folder operations enabled, to rename the folder bar/ to foo/.

    The connector version 2.2.23, with folder operations enabled, detects the existing foo/ folder, which causes the rename operation to fail. The older connector version did not delete the foo/ folder because folder operations were disabled.
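
    A sketch of this failure with Hadoop shell commands, assuming gs://my-hns-bucket is a hierarchical namespace enabled bucket (placeholder name):

      # Connector older than 2.2.23 (no folder operations):
      hadoop fs -mkdir gs://my-hns-bucket/foo
      hadoop fs -mv gs://my-hns-bucket/foo gs://my-hns-bucket/bar
      # Objects under foo/ are copied and deleted, but the empty foo/ folder remains.

      # Connector 2.2.23 with folder operations enabled:
      hadoop fs -mv gs://my-hns-bucket/bar gs://my-hns-bucket/foo
      # Fails because the leftover foo/ folder already exists.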
