Use Cloud Storage with big data

Cloud Storage is a key part of storing and working with big data on Google Cloud. For example, you can use Cloud Storage to load data into BigQuery, hold staging files and temporary data for Dataflow pipelines, and integrate with Dataproc, so you can run Apache Hadoop or Apache Spark jobs directly on your data in Cloud Storage.

This page describes how to use the gsutil command-line tool to accomplish big data tasks, such as copying large files or copying many files in parallel. For an introduction to gsutil, see the gsutil quickstart guide.

Before you begin

To get the most out of the examples shown on this page, you'll need to complete the following (if you haven't yet):

Copying many files to a bucket

If you have a large number of files to upload, you can use the gsutil -m option to perform a parallel (multi-threaded/multi-processing) copy. To recursively copy subdirectories, use the -R flag of the cp command. For example, to copy files including subdirectories from a local directory named top-level-dir to a bucket, you can use:

gsutil -m cp -R top-level-dir gs://example-bucket

You can use wildcards to match a specific set of names for an operation. For example, to copy only files that start with image:

gsutil -m cp -R top-level-dir/subdir/image* gs://example-bucket

You can remove files using the same wildcard:

gsutil -m rm gs://example-bucket/top-level-dir/subdir/image*

In addition to copying local files to the cloud and vice versa, you can also copy in the cloud, for example:

gsutil -m cp gs://example-bucket/top-level-dir/subdir/** gs://example-bucket/top-level-dir/subdir/subdir2

gsutil automatically detects that you're copying multiple files and creates them in a new directory named subdir2.
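The difference between the `*` and `**` wildcards used above can be illustrated with a small model. The following Python sketch is a simplified approximation, not gsutil's implementation: it assumes that `*` matches within a single directory level while `**` matches across directory boundaries, and it ignores other wildcard forms and the effect of the -R flag. The function names and sample object names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern containing only * and ** into a regex (simplified model)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # ** may cross directory boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # * stays within one directory level
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def match_names(pattern, names):
    """Return the names that the whole pattern matches, in input order."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Hypothetical object names used only for illustration.
names = [
    "top-level-dir/subdir/image1.png",
    "top-level-dir/subdir/image2.png",
    "top-level-dir/subdir/notes.txt",
    "top-level-dir/subdir/nested/image3.png",
]

single = match_names("top-level-dir/subdir/image*", names)
double = match_names("top-level-dir/subdir/**", names)
```

In this model, `single` contains only the two image files directly under subdir, while `double` contains every name under subdir, including the nested one.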

Synchronizing a local directory

If you want to synchronize a local directory with a bucket or vice versa, use the gsutil rsync command. For example, to make gs://example-bucket match the contents of the local directory local-dir, you can use:

gsutil -m rsync -r local-dir gs://example-bucket

If you use the rsync -d flag, it signals gsutil to delete files at the destination (gs://example-bucket in the command above) that aren't present at the source (local-dir). You can also synchronize between two buckets.
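To make the effect of the -d flag concrete, the following Python sketch plans a synchronization from two listings. It is a simplified model under stated assumptions, not gsutil's algorithm: each side is represented as a hypothetical dictionary that maps a file name to a checksum string, and a file is copied whenever its checksum is missing or different at the destination.

```python
def plan_sync(source, destination, delete_extra=False):
    """Return (to_copy, to_delete) for making destination match source.

    source and destination map file names to checksum strings.
    delete_extra models the rsync -d flag.
    """
    # Copy anything missing from the destination or whose checksum differs.
    to_copy = sorted(
        name for name, checksum in source.items()
        if destination.get(name) != checksum
    )
    # Without delete_extra, files found only at the destination are left alone.
    to_delete = []
    if delete_extra:
        to_delete = sorted(name for name in destination if name not in source)
    return to_copy, to_delete


# Hypothetical listings used only for illustration.
local_dir = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}
bucket = {"a.txt": "h1", "b.txt": "old", "stale.txt": "h9"}

without_d = plan_sync(local_dir, bucket)
with_d = plan_sync(local_dir, bucket, delete_extra=True)
```

Here only the plan computed with `delete_extra=True` schedules stale.txt for deletion, which is why -d deserves care when the destination holds data that exists nowhere else.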

Copying large files to a bucket

In general, when working with big data, once your data is in the cloud it should stay there. Once your data is in Google's cloud, it's very fast to transfer it to other services like Compute Engine. Also, egress from buckets to Google Cloud services in the same location is free. For more information, see Network Pricing.

To copy a large local file to a bucket, use:

gsutil cp local-file gs://example-bucket

To copy a large file from an existing bucket (e.g., Cloud Storage public data), use:

gsutil cp gs://example-source-bucket/file gs://example-destination-bucket

gsutil takes full advantage of Google Cloud Storage resumable upload and download features. For large files this is particularly important because the likelihood of a network failure at your ISP increases with the size of the data being transferred. By resuming an upload based on how many bytes the server actually received, gsutil avoids unnecessarily resending bytes and ensures that the upload can eventually be completed. The same logic is applied for downloads based on the size of the local file.
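The resume logic described above can be sketched as a small simulation. This is an illustrative model only, not the Cloud Storage resumable upload protocol or gsutil's code: `FakeServer`, its methods, and the chunk size are all hypothetical, and the "network failure" is simulated by storing half a chunk before raising an error.

```python
class FakeServer:
    """Hypothetical stand-in for the receiving side of a resumable upload."""

    def __init__(self, failing_calls):
        self.received = bytearray()
        self._failing_calls = set(failing_calls)
        self._calls = 0

    def committed_offset(self):
        # Number of bytes the server has actually stored so far.
        return len(self.received)

    def write(self, offset, chunk):
        if offset != len(self.received):
            raise ValueError("chunk does not start at the committed offset")
        self._calls += 1
        if self._calls in self._failing_calls:
            # Simulate a connection that drops after half the chunk arrived.
            self.received += chunk[: len(chunk) // 2]
            raise ConnectionError("simulated network failure")
        self.received += chunk


def resumable_upload(data, server, chunk_size=4):
    """Upload data, resuming from the server's committed offset after failures."""
    interruptions = 0
    while server.committed_offset() < len(data):
        # Always restart from what the server reports, not from what was sent.
        offset = server.committed_offset()
        try:
            server.write(offset, data[offset:offset + chunk_size])
        except ConnectionError:
            interruptions += 1
    return interruptions


payload = b"0123456789abcdef"
server = FakeServer(failing_calls={2, 3})
interruptions = resumable_upload(payload, server)
```

Because each retry starts at the offset the server reports, bytes that already arrived before a failure are not sent again, and the upload still completes after the two simulated interruptions.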

If gsutil cp does not give you the performance you need when uploading large files, you can consider configuring parallel composite uploads.

Configuring a bucket

Typical big data tasks that require configuring a bucket include moving data to a different storage class, configuring log access, configuring object versioning, and setting up a lifecycle rule.

You can list a bucket's configuration details with gsutil ls -L -b:

gsutil ls -L -b gs://example-bucket

In the output, notice the bucket configuration information, most of which is also configurable via gsutil:

  • CORS: controls Cross-Origin-Resource-Sharing settings for a bucket.
  • Logging: allows you to log bucket usage.
  • Website: allows objects in the bucket to act as web pages or be used as static assets in a website.
  • Versioning: causes deletes on objects in the bucket to create noncurrent versions.
  • Storage Class: allows you to set the storage class during bucket creation.
  • Lifecycle: allows periodic operations to run on the bucket - the most common is stale object deletion.

For example, if you want to keep files in a particular bucket for only one day, you can set up a lifecycle rule for the bucket with:

echo '{ "rule": [{ "action": {"type": "Delete"}, "condition": {"age": 1}}]}' > lifecycle_config.json
gsutil lifecycle set lifecycle_config.json gs://example-bucket

Now, any objects in your bucket older than a day will automatically get deleted from this bucket. You can verify the configuration you just set with the gsutil lifecycle command (other configuration commands work in a similar fashion):

gsutil lifecycle get gs://example-bucket
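As an alternative to typing the JSON by hand in an echo command, the lifecycle configuration can be generated programmatically. The following Python sketch builds the same structure as the example above; the helper name `delete_after_days` is hypothetical, and the file name simply mirrors the one used in the commands.

```python
import json


def delete_after_days(days):
    """Build a lifecycle configuration that deletes objects older than `days` days."""
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": days},
            }
        ]
    }


config = delete_after_days(1)
config_text = json.dumps(config, indent=2)

# To use the result with `gsutil lifecycle set`, write it to a file, for example:
# with open("lifecycle_config.json", "w") as f:
#     f.write(config_text)
```

Generating the file this way avoids quoting mistakes in the shell and makes it easy to change the age condition.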

Sharing data in a bucket

When working with big data, you will likely work on files collaboratively and you'll need to be able to give access to specific people or groups. Identity and Access Management policies define who can access your files and what they're allowed to do. You can view a bucket's IAM policy using the gsutil iam command:

gsutil iam get gs://example-bucket

The response to the command shows principals, which are accounts that are granted access to your bucket, and roles, which are groups of permissions granted to the principals.
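To see at a glance which roles each principal holds, the JSON policy can be inverted so that it is keyed by principal. The Python sketch below assumes the policy has a top-level `bindings` list whose entries carry `role` and `members` fields. The sample policy, including its project and group names, is made up for illustration and is not real command output.

```python
import json
from collections import defaultdict

# Hypothetical policy text, shaped like the JSON printed by `gsutil iam get`.
sample_policy = """
{
  "bindings": [
    {
      "members": ["projectOwner:example-project"],
      "role": "roles/storage.legacyBucketOwner"
    },
    {
      "members": ["allUsers", "group:example-group@example.com"],
      "role": "roles/storage.objectViewer"
    }
  ],
  "etag": "CAE="
}
"""


def roles_by_principal(policy_json):
    """Map each principal to the list of roles it is granted."""
    policy = json.loads(policy_json)
    result = defaultdict(list)
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            result[member].append(binding["role"])
    return dict(result)


grants = roles_by_principal(sample_policy)
```

With the sample above, `grants["allUsers"]` lists the object viewer role, which is a quick way to spot a bucket that is publicly readable.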

Three common scenarios for sharing data are sharing publicly, sharing with a group, and sharing with a person:

  • Sharing publicly: For a bucket whose contents are meant to be listed and read by anyone on the Internet, you can configure the IAM policy using the 'AllUsers' designation:

    gsutil iam ch AllUsers:objectViewer gs://example-bucket

  • Sharing with a group: For collaborators who do not have access to your other Google Cloud resources, we recommend that you create a Google group and then add the Google group to the bucket. For example, to give access to the gs-announce Google Group, you can configure an IAM policy such as the following, where GROUP_EMAIL stands for the group's email address and objectViewer is the same read-only role used above:

    gsutil iam ch group:GROUP_EMAIL:objectViewer gs://example-bucket

    For more information, see Using a Group to Control Access to Objects.

  • Sharing with one person: For many collaborators, use a group to give access in bulk. For one person, you can grant read access as follows, where USER_EMAIL stands for the person's email address:

    gsutil iam ch user:USER_EMAIL:objectViewer gs://example-bucket

Cleaning up a bucket

You can quickly clean a bucket by removing all of its objects with the following command:

gsutil -m rm gs://example-bucket/**

Working with checksums

When performing copies, the gsutil cp and gsutil rsync commands validate that the checksum of the source file matches the checksum of the destination file. In the rare event that checksums do not match, gsutil will delete the invalid copy and print a warning message. For more information, see command line checksum validation.

You can also use gsutil to get the checksum of an object in a bucket or calculate the checksum of a local file. For example, suppose you copy a Cloud Life Sciences public data file to your working bucket with:

gsutil -m cp gs://genomics-public-data/1000-genomes/vcf/ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf gs://example-bucket

Now, you can get the checksums of both the public bucket version of the file and your version of the file in your bucket to ensure they match:

gsutil ls -L gs://example-bucket/ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf
gsutil ls -L gs://genomics-public-data/1000-genomes/vcf/ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf

Now, suppose your data is in a file at a local data center and you copied it into Cloud Storage. You can use gsutil hash to get the checksum of your local file and then compare that with the checksum of the file you copied to a bucket. To get the checksum of a local file use:

gsutil hash local-file

MD5 values

For non-composite objects, running gsutil ls -L on an object in a bucket returns output like the following:

        Creation time:          Thu, 26 Mar 2015 20:11:51 GMT
        Content-Length:         102400000
        Content-Type:           text/plain
        Hash (crc32c):          FTiauw==
        Hash (md5):             daHmCObxxQdY9P7lp9jj0A==
        ETag:                   CPjo7ILqxsQCEAE=
        Generation:             1427400711419000
        Metageneration:         1
        ACL:            [

Running gsutil hash on a local file returns output like the following:

Hashing     100MBfile.txt:
Hashes [base64] for 100MBfile.txt:
        Hash (crc32c):          FTiauw==
        Hash (md5):             daHmCObxxQdY9P7lp9jj0A==

Both outputs have a CRC32c and MD5 value. There is no MD5 value for composite objects, such as those created from parallel composite uploads.
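The base64 values shown above can also be reproduced without gsutil, which helps when comparing a local file against the hashes reported for an object. The following Python sketch is an independent illustration, not gsutil's implementation. It assumes the CRC32C value is base64-encoded from its 4-byte big-endian form, and it uses a slow pure-Python CRC32C suitable only for demonstration. All function names are hypothetical.

```python
import base64
import hashlib
import io


def _make_crc32c_table():
    # Table for the reflected Castagnoli polynomial used by CRC32C.
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data, crc=0):
    """Return the CRC32C of data, continuing from a previous crc if given."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def hash_stream(stream, chunk_size=1 << 20):
    """Return base64 CRC32C and MD5 strings for a binary stream."""
    md5 = hashlib.md5()
    crc = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        crc = crc32c(chunk, crc)
    return {
        "crc32c": base64.b64encode(crc.to_bytes(4, "big")).decode("ascii"),
        "md5": base64.b64encode(md5.digest()).decode("ascii"),
    }


def hash_file(path):
    """Hash a local file in chunks so large files need not fit in memory."""
    with open(path, "rb") as f:
        return hash_stream(f)


# In-memory demonstration; for a real file, call hash_file("local-file").
demo = hash_stream(io.BytesIO(b"123456789"))
```

For composite objects, only the `crc32c` entry has a counterpart to compare against, because those objects have no MD5 value.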

What's next