This page describes how to use buckets with hierarchical namespace enabled for Hadoop workloads.
Overview
When using a Cloud Storage bucket with hierarchical namespace, you can configure the Cloud Storage connector to use the rename folder operation for workloads such as Hadoop, Spark, and Hive.
In a bucket without hierarchical namespace, a rename operation in Hadoop, Spark, or Hive involves multiple object copy and delete operations, which hurts performance and consistency. Renaming a folder using the Cloud Storage connector improves performance and ensures consistency when handling folders that contain a large number of objects.
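The difference between the two rename paths can be illustrated with a toy in-memory model. This is only a sketch: the function names and the dictionary-based "buckets" are hypothetical, nothing here calls Cloud Storage or the connector, and the folder rename is modeled as a single operation purely for illustration.

```python
# Toy in-memory model. It does not call Cloud Storage or the connector.

def rename_by_copy_and_delete(objects: dict[str, bytes], src: str, dst: str) -> int:
    """Rename a prefix object by object; return the number of operations issued."""
    operations = 0
    # Snapshot the matching names first so the dict is not mutated while iterating.
    for name in [n for n in objects if n.startswith(src)]:
        objects[dst + name[len(src):]] = objects[name]  # copy to the new name
        del objects[name]                               # delete the old object
        operations += 2
    return operations


def rename_folder(folders: dict[str, dict[str, bytes]], src: str, dst: str) -> int:
    """Rename a folder resource in one step; return the number of operations issued."""
    folders[dst] = folders.pop(src)
    return 1


flat = {f"logs/part-{i}": b"x" for i in range(1000)}
print(rename_by_copy_and_delete(flat, "logs/", "archive/"))  # 2000

hierarchical = {"logs/": {f"part-{i}": b"x" for i in range(1000)}}
print(rename_folder(hierarchical, "logs/", "archive/"))  # 1
```

In the object-level path, a failure partway through the loop leaves some objects under the old prefix and some under the new one, which is the consistency concern described above.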
Dataproc
You can use the Google Cloud CLI to create a Dataproc cluster and enable the Cloud Storage connector to perform folder operations.
Install or update the Cloud Storage connector version 2.2.23 or later (excluding version 3.0.0).
Create a Dataproc cluster using the following command:
gcloud dataproc clusters create CLUSTER_NAME --properties=core:fs.gs.hierarchical.namespace.folders.enable=true,core:fs.gs.http.read-timeout=30000
Where:
- CLUSTER_NAME is the name of the cluster. For example, my-cluster.
- fs.gs.hierarchical.namespace.folders.enable enables folder operations on a bucket with hierarchical namespace.
- fs.gs.http.read-timeout is the maximum time allowed, in milliseconds, to read data from an established connection. This is an optional setting.
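If you generate the command from a script, the properties must end up in a single comma-separated argument with no whitespace, otherwise the shell splits the value into separate arguments. The following is a minimal sketch of that assembly; the helper name build_create_command is hypothetical, and the script only builds and prints the command rather than running it.

```python
import shlex


def build_create_command(cluster_name: str, properties: dict[str, str]) -> list[str]:
    """Build the argument list for the cluster creation command shown above."""
    # Join the properties with commas only, so the whole set stays one argument.
    joined = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--properties={joined}",
    ]


command = build_create_command(
    "my-cluster",  # example cluster name from this page
    {
        "core:fs.gs.hierarchical.namespace.folders.enable": "true",
        "core:fs.gs.http.read-timeout": "30000",  # optional setting
    },
)
print(shlex.join(command))
```

Passing the list to subprocess.run instead of printing it would avoid shell quoting issues entirely, assuming the Google Cloud CLI is installed and authenticated on the machine.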
Self-managed Hadoop
You can enable the Cloud Storage connector on your self-managed Hadoop cluster to perform folder operations.
Install or update the Cloud Storage connector version 2.2.23 or later (excluding version 3.0.0).
Add the following to the core-site.xml configuration file:
<property>
  <name>fs.gs.hierarchical.namespace.folders.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.http.read-timeout</name>
  <value>30000</value>
</property>
Where:
- fs.gs.hierarchical.namespace.folders.enable enables folder operations on a bucket with hierarchical namespace.
- fs.gs.http.read-timeout is the maximum time allowed, in milliseconds, to read data from an established connection. This is an optional setting.
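When core-site.xml is managed by a script, the two properties can be added or updated programmatically. The sketch below uses Python's standard xml.etree.ElementTree module (ET.indent requires Python 3.9 or later); the helper name set_properties is hypothetical, and the example works on an in-memory string rather than a real file.

```python
import xml.etree.ElementTree as ET


def set_properties(core_site_xml: str, settings: dict[str, str]) -> str:
    """Return core-site.xml content with the given properties added or updated."""
    root = ET.fromstring(core_site_xml)  # root element is <configuration>
    existing = {p.findtext("name"): p for p in root.findall("property")}
    for name, value in settings.items():
        prop = existing.get(name)
        if prop is None:
            # Property not present yet: append a new <property> element.
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
        value_element = prop.find("value")
        if value_element is None:
            value_element = ET.SubElement(prop, "value")
        value_element.text = value
    ET.indent(root)  # pretty-print the output
    return ET.tostring(root, encoding="unicode")


updated = set_properties(
    "<configuration></configuration>",
    {
        "fs.gs.hierarchical.namespace.folders.enable": "true",
        "fs.gs.http.read-timeout": "30000",  # optional setting
    },
)
print(updated)
```

Note that this sketch does not preserve comments or the XML declaration from the input, so review the result before replacing an existing configuration file.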
Compatibility with Cloud Storage connector version 3.0.0 or versions older than 2.2.23
Using the Cloud Storage connector version 3.0.0 or a version older than 2.2.23, or disabling folder operations for hierarchical namespace, can lead to the following limitations:
- Inefficient folder renames: Folder rename operations in Hadoop happen using object-level copy and delete operations, which is slower and less efficient than the dedicated rename folder operation.
- Accumulation of empty folders: Folder resources are not deleted automatically, leading to the accumulation of empty folders in your bucket. Accumulation of empty folders can have the following impact:
  - Increase storage costs if not deleted explicitly.
  - Slow down the list operations and increase the risk of list operation timeouts.
- Compatibility issues: Mixing the usage of older and newer connector versions, or enabling and disabling folder operations, can lead to compatibility issues when renaming folders. Consider the following scenario, which uses a combination of connector versions:
  1. Use the Cloud Storage connector version older than 2.2.23 to perform the following tasks:
     - Write objects under the folder foo/.
     - Rename the folder foo/ to bar/. The rename operation copies and deletes the objects under foo/ but does not delete the empty foo/ folder.
  2. Use the Cloud Storage connector version 2.2.23 with the folder operations settings enabled to rename the folder bar/ to foo/.

  The connector version 2.2.23, with the folder operation enabled, detects the existing foo/ folder, causing the rename operation to fail. The older connector version did not delete the foo/ folder because the folder operation was disabled.
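The mixed-version scenario above can be reproduced with a toy model. This is a sketch under stated assumptions, not the connector's implementation: ToyBucket, uses_folder_operations, and rename are hypothetical names, the bucket is an in-memory pair of sets, and the model assumes that writing an object makes its parent folder exist as a folder resource.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBucket:
    objects: set[str] = field(default_factory=set)  # object names
    folders: set[str] = field(default_factory=set)  # folder resources

    def write(self, name: str) -> None:
        self.objects.add(name)
        # Model assumption: the parent folder exists once an object is written.
        self.folders.add(name.rsplit("/", 1)[0] + "/")


def uses_folder_operations(version: tuple[int, int, int], enabled: bool) -> bool:
    """Version rule from this page: 2.2.23 or later, excluding 3.0.0."""
    return enabled and version >= (2, 2, 23) and version != (3, 0, 0)


def rename(bucket: ToyBucket, src: str, dst: str,
           version: tuple[int, int, int], enabled: bool) -> None:
    if uses_folder_operations(version, enabled):
        # Folder-level rename: fails if the destination folder already exists.
        if dst in bucket.folders:
            raise FileExistsError(f"destination folder {dst} already exists")
        bucket.folders.discard(src)
        bucket.folders.add(dst)
    else:
        # Object-level rename: the source folder resource is left behind.
        bucket.folders.add(dst)
    moved = {name for name in bucket.objects if name.startswith(src)}
    bucket.objects -= moved
    bucket.objects |= {dst + name[len(src):] for name in moved}


bucket = ToyBucket()
bucket.write("foo/data.txt")
rename(bucket, "foo/", "bar/", (2, 2, 22), enabled=False)  # leaves empty foo/
try:
    rename(bucket, "bar/", "foo/", (2, 2, 23), enabled=True)
except FileExistsError as error:
    print("rename failed:", error)
```

In this model, deleting the leftover foo/ folder before the second rename lets it succeed, which mirrors the point that empty folders left by object-level renames have to be deleted explicitly.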