When you're copying or moving data between distinct storage systems, such as multiple Apache Hadoop Distributed File System (HDFS) clusters or between HDFS and Cloud Storage, it's a good idea to perform some type of validation to guarantee data integrity. This validation is essential to ensure that data wasn't altered during transfer.
While various mechanisms already ensure point-to-point data integrity in transit (such as TLS for all communication with Cloud Storage), explicit end-to-end data integrity validation adds protection for cases that may go undetected by typical in-transit mechanisms. This can help you detect potential data corruption caused, for example, by noisy network links, memory errors on server computers and routers along the path, or software bugs (such as in a library that customers use).
For Cloud Storage, this validation happens automatically client-side with Google Cloud CLI commands like cp and rsync. Those commands compute local file checksums, which are then validated against the checksums computed by Cloud Storage at the end of each operation. If the checksums don't match, the gcloud CLI deletes the invalid copies and prints a warning message. This mismatch rarely happens, and if it does, you can retry the operation.
Now there's also a way to automatically perform end-to-end, client-side validation in Apache Hadoop across heterogeneous Hadoop-compatible file systems like HDFS and Cloud Storage. This article describes how the new feature lets you efficiently and accurately compare file checksums.
How HDFS performs file checksums
HDFS uses CRC32C, a 32-bit cyclic redundancy check (CRC) based on the Castagnoli polynomial, to maintain data integrity in different contexts:
- At rest, Hadoop DataNodes continuously verify data against stored CRCs to detect and repair bit-rot.
- In transit, the DataNodes send known CRCs along with the corresponding bulk data, and HDFS client libraries cooperatively compute per-chunk CRCs to compare against the CRCs received from the DataNodes.
- For HDFS administrative purposes, block-level checksums are used for low-level manual integrity checks of individual block files on DataNodes.
- For arbitrary application-layer use cases, the FileSystem interface defines getFileChecksum, and the HDFS implementation uses its stored fine-grained CRCs to define a file-level checksum (see the sketch after this list).
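As a minimal sketch of that last point, the following code calls getFileChecksum through the Hadoop FileSystem API and prints the result. The path and class name are hypothetical placeholders, and the checksum you get back depends on the configuration discussed in the rest of this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.StringUtils;

public class ShowFileChecksum {
  public static void main(String[] args) throws Exception {
    // Hypothetical path; substitute a file that exists in your cluster.
    Path path = new Path("hdfs:///user/bob/data.bin");

    FileSystem fs = path.getFileSystem(new Configuration());

    // getFileChecksum is part of the FileSystem interface; the HDFS
    // implementation derives its answer from the chunk CRCs it already stores.
    FileChecksum checksum = fs.getFileChecksum(path);
    System.out.println(
        checksum.getAlgorithmName() + " " + StringUtils.byteToHexString(checksum.getBytes()));
  }
}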
For most day-to-day uses, the CRCs are used transparently with respect to the
application layer. The only CRCs used are the per-chunk CRC32Cs, which are
already precomputed and stored in metadata files alongside block data. The chunk
size is defined by dfs.bytes-per-checksum
and has a default value of 512 bytes.
Shortcomings of Hadoop's default file checksum type
By default when using Hadoop, all API-exposed checksums take the form of an MD5 of a concatenation of chunk CRC32Cs, either at the block level through the low-level DataTransferProtocol, or at the file level through the top-level FileSystem interface. A file-level checksum is defined as the MD5 of the concatenation of all the block checksums, each of which is an MD5 of a concatenation of chunk CRCs, and is therefore referred to as an MD5MD5CRC32FileChecksum. This is effectively an on-demand, three-layer Merkle tree.
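To make that layering concrete, here is a rough, self-contained sketch of the idea. This is not HDFS code, and it is not byte-for-byte identical to what HDFS computes (it ignores details such as CRC byte ordering and block metadata framing); it only illustrates the three layers: chunk CRC32Cs, a per-block MD5 over the concatenated chunk CRCs, and a file-level MD5 over the concatenated block MD5s. The chunk and block sizes are deliberately tiny.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.zip.CRC32C;  // java.util.zip.CRC32C is available since Java 9

public class Md5Md5CrcSketch {
  // Illustrative sizes only; HDFS defaults are 512 bytes per chunk and 128 MB per block.
  static final int CHUNK_SIZE = 8;
  static final int BLOCK_SIZE = 32;

  public static void main(String[] args) throws Exception {
    byte[] file = "some pretend file contents for the sketch".getBytes(StandardCharsets.UTF_8);

    MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
    for (int blockStart = 0; blockStart < file.length; blockStart += BLOCK_SIZE) {
      int blockEnd = Math.min(blockStart + BLOCK_SIZE, file.length);

      // Layer 1 -> 2: MD5 over the concatenation of this block's chunk CRC32Cs.
      MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
      for (int chunkStart = blockStart; chunkStart < blockEnd; chunkStart += CHUNK_SIZE) {
        int chunkEnd = Math.min(chunkStart + CHUNK_SIZE, blockEnd);
        CRC32C crc = new CRC32C();
        crc.update(file, chunkStart, chunkEnd - chunkStart);
        int value = (int) crc.getValue();
        blockMd5.update(new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value});
      }

      // Layer 2 -> 3: MD5 over the concatenation of all block-level MD5s.
      fileMd5.update(blockMd5.digest());
    }

    StringBuilder hex = new StringBuilder();
    for (byte b : fileMd5.digest()) hex.append(String.format("%02x", b));
    System.out.println("file-level checksum (sketch): " + hex);
  }
}

Note that changing CHUNK_SIZE or BLOCK_SIZE in this sketch changes the final digest, which is exactly the sensitivity described next.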
This definition of the file-level checksum is sensitive to the implementation and data-layout details of HDFS, namely the chunk size (default 512 bytes) and the block size (default 128 MB). So this default file checksum isn't suitable in any of the following situations:
- Two different copies of the same files in HDFS, but with different per-file block sizes configured.
- Two different instances of HDFS with different block or chunk sizes configured.
- A combination of HDFS and non-HDFS Hadoop-compatible file systems (HCFS) such as Cloud Storage.
The following diagram shows how the same file can end up with different checksums depending on the file system's configuration:
You can display the default checksum for a file in HDFS by using the Hadoop fs -checksum command:
hadoop fs -checksum hdfs:///user/bob/data.bin
In an HDFS cluster that has a block size of 64 MB (dfs.block.size=67108864), this command displays a result like the following:
hdfs:///user/bob/data.bin MD5-of-131072MD5-of-512CRC32C 000002000000000000020000e9378baa1b8599e50cca212ccec2f8b7
For the same file in another cluster that has a block size of 128 MB (dfs.block.size=134217728), you see a different output:
hdfs:///user/bob/data.bin MD5-of-0MD5-of-512CRC32C 000002000000000000000000d3a7bae0b5200ed6707803a3911959f9
You can see in these examples that the two checksums differ for the same file.
How Hadoop's new composite CRC file checksum works
A new checksum type, tracked in HDFS-13056, was released in Apache Hadoop 3.1.1 to address these shortcomings. The new type, configured by dfs.checksum.combine.mode=COMPOSITE_CRC, defines new composite block CRCs and composite file CRCs as the mathematically composed CRC across the stored chunk CRCs. Instead of taking an MD5 of the component CRCs, this type calculates a single CRC that represents the entire block or file and is independent of the lower-level granularity of chunk CRCs.
CRC composition has many benefits: it is efficient; it allows the resulting checksums to be completely chunk/block agnostic; and it allows comparison between striped and replicated files, between different HDFS instances, and between HDFS and other external storage systems. (You can learn more details about the CRC algorithm in this PDF download.)
The following diagram provides a look at how a file's checksum is consistent after transfer across heterogeneous file system configurations:
This feature is minimally invasive: it can be added in place to be compatible with existing block metadata, and doesn't need to change the normal path of chunk verification. This also means even large preexisting HDFS deployments can adopt this feature to retroactively sync data. For more details, you can download the full design PDF document.
Using the new composite CRC checksum type
To use the new composite CRC checksum type within Hadoop, set the dfs.checksum.combine.mode property to COMPOSITE_CRC (instead of the default value MD5MD5CRC). When a file is copied from one location to another, the chunk-level checksum type (that is, the property dfs.checksum.type, which defaults to CRC32C) must also match in both locations.
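If you call the FileSystem API from your own code rather than through the hadoop fs command, the same properties can be set on the client Configuration. The following is a minimal sketch under that assumption; the path and class name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.StringUtils;

public class CompositeCrcChecksum {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Request composite CRC file checksums instead of the default MD5MD5CRC.
    conf.set("dfs.checksum.combine.mode", "COMPOSITE_CRC");
    // Chunk-level checksum type; CRC32C is already the default for HDFS.
    conf.set("dfs.checksum.type", "CRC32C");

    Path path = new Path("hdfs:///user/bob/data.bin");  // placeholder path
    FileSystem fs = path.getFileSystem(conf);
    FileChecksum checksum = fs.getFileChecksum(path);
    System.out.println(
        checksum.getAlgorithmName() + " " + StringUtils.byteToHexString(checksum.getBytes()));
  }
}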
You can display the new checksum type for a file in HDFS by passing the -Ddfs.checksum.combine.mode=COMPOSITE_CRC argument to the Hadoop fs -checksum command:
hadoop fs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum hdfs:///user/bob/data.bin
Regardless of the block size configuration of the HDFS cluster, you see the same output, like the following:
hdfs:///user/bob/data.bin COMPOSITE-CRC32C c517d290
For Cloud Storage, you must also explicitly set the Cloud Storage connector property fs.gs.checksum.type to CRC32C. This property otherwise defaults to NONE, causing file checksums to be disabled by default. This default behavior by the Cloud Storage connector is a preventive measure to avoid an issue with distcp, where an exception is raised if the checksum types don't match, instead of failing gracefully. The command looks like the following:
hadoop fs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -Dfs.gs.checksum.type=CRC32C -checksum gs://[BUCKET]/user/bob/data.bin
The command displays the same output as in the previous example on HDFS:
gs://[BUCKET]/user/bob/data.bin COMPOSITE-CRC32C c517d290
You can see that the composite CRC checksums returned by the previous commands all match, regardless of block size, as well as between HDFS and Cloud Storage. By using composite CRC checksums, you can now guarantee that data integrity is preserved when transferring files between all types of Hadoop cluster configurations.
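You can perform the same end-to-end comparison from your own code by fetching both checksums through the FileSystem API and comparing them. The sketch below assumes the Cloud Storage connector is installed on the client; the class name and paths are placeholders, including the [BUCKET] placeholder used in the examples above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.Path;

public class ValidateCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.checksum.combine.mode", "COMPOSITE_CRC");  // composite CRC on the HDFS side
    conf.set("fs.gs.checksum.type", "CRC32C");               // enable CRC32C in the Cloud Storage connector

    Path src = new Path("hdfs:///user/bob/data.bin");        // placeholder source
    Path dst = new Path("gs://[BUCKET]/user/bob/data.bin");  // replace [BUCKET] with your bucket

    FileChecksum srcChecksum = src.getFileSystem(conf).getFileChecksum(src);
    FileChecksum dstChecksum = dst.getFileSystem(conf).getFileChecksum(dst);

    // FileChecksum.equals() compares the algorithm name and the checksum bytes.
    if (srcChecksum.equals(dstChecksum)) {
      System.out.println("Checksums match; the copy is intact.");
    } else {
      System.out.println("Checksum mismatch between " + src + " and " + dst);
    }
  }
}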
If you are running distcp, as in the following example, the validation is performed automatically:
hadoop distcp -Ddfs.checksum.combine.mode=COMPOSITE_CRC -Dfs.gs.checksum.type=CRC32C hdfs:///user/bob/* gs://[BUCKET]/user/bob/
If distcp detects a file checksum mismatch between the source and destination during the copy, then the operation will fail and return a warning.
Accessing the feature
The new composite CRC checksum feature is available in Apache Hadoop 3.1.1 (see release notes), and backports to versions 2.7, 2.8 and 2.9 are in the works. It has been included by default in subminor versions of Cloud Dataproc 1.3 since late 2018.