Cloud Storage is a key part of storing and working with Big Data on Google Cloud. For example, you can use Cloud Storage to load data into BigQuery, hold staging files and temporary data for Dataflow pipelines, and integrate with Dataproc, so you can run Apache Hadoop or Apache Spark jobs directly on your data in Cloud Storage.
This page describes how to use the gsutil command-line tool to accomplish big data tasks, such as copying large files or copying many files in parallel. For an introduction to gsutil, see the gsutil quickstart guide.
Before you begin
To get the most out of the examples shown on this page, you'll need to complete the following (if you haven't yet):
- Install Python version 3.7.
- Install gsutil.
- Configure gsutil to access protected data.
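If you installed gsutil as a standalone tool, one way to complete the last step is the gsutil config command, which walks you through authentication and writes a .boto configuration file (a minimal sketch; if you installed gsutil as part of the Google Cloud CLI, gcloud init handles credentials instead):
gsutil config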
Copying many files to a bucket
If you have a large number of files to upload, you can use the gsutil -m option to perform a parallel (multi-threaded/multi-processing) copy. To recursively copy subdirectories, use the -R flag of the cp command.
For example, to copy files, including subdirectories, from a local directory named top-level-dir to a bucket, you can use:
gsutil -m cp -R top-level-dir gs://example-bucket
You can use wildcards to match a specific set of names for an operation. For example, to copy only files that start with image:
gsutil -m cp -R top-level-dir/subdir/image* gs://example-bucket
You can remove files using the same wildcard:
gsutil -m rm gs://example-bucket/top-level-dir/subdir/image*
In addition to copying local files to the cloud and vice versa, you can also copy in the cloud, for example:
gsutil -m cp gs://example-bucket/top-level-dir/subdir/** gs://example-bucket/top-level-dir/subdir/subdir2
gsutil automatically detects that you're copying multiple files and creates them in a new directory named subdir2.
Synchronizing a local directory
If you want to synchronize a local directory with a bucket or vice versa, you can do that with the gsutil rsync command. For example, to make gs://example-bucket match the contents of the local directory local-dir, you can use:
gsutil -m rsync -r local-dir gs://example-bucket
If you use the rsync -d flag, it signals gsutil to delete files at the destination (gs://example-bucket in the command above) that aren't present at the source (local-dir). You can also synchronize between two buckets.
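For example, to make a second bucket mirror the first (a sketch; both bucket names are placeholders):
gsutil -m rsync -r gs://example-bucket gs://example-backup-bucket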
Copying large files to a bucket
In general, when working with big data, once your data is in the cloud it should stay there: transferring it to other Google Cloud services such as Compute Engine is fast, and egress from buckets to Google Cloud services in the same location is free. For more information, see Network Pricing.
To copy a large local file to a bucket, use:
gsutil cp local-file gs://example-bucket
To copy a large file from an existing bucket (e.g., Cloud Storage public data), use:
gsutil cp gs://example-source-bucket/file gs://example-destination-bucket
gsutil takes full advantage of Google Cloud Storage resumable upload and download features. For large files this is particularly important because the likelihood of a network failure at your ISP increases with the size of the data being transferred. By resuming an upload based on how many bytes the server actually received, gsutil avoids unnecessarily resending bytes and ensures that the upload can eventually be completed. The same logic is applied for downloads based on the size of the local file.
If gsutil cp does not give you the performance you need when uploading large files, you can consider configuring parallel composite uploads.
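For example, one way to do this for a single command is to override the composite upload threshold in your boto configuration from the command line (a sketch; 150M is just an illustrative threshold value):
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp local-file gs://example-bucket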
Configuring a bucket
Typical big data tasks where you will want to configure a bucket include moving data to a different storage class, configuring log access, configuring object versioning, and setting up a lifecycle rule.
You can list a bucket's configuration details with gsutil ls -L -b:
gsutil ls -L -b gs://example-bucket
In the output, notice the bucket configuration information, most of which is also configurable via gsutil:
- CORS: controls Cross-Origin-Resource-Sharing settings for a bucket.
- Logging: allows you to log bucket usage.
- Website: allows objects in the bucket to act as web pages or be used as static assets in a website.
- Versioning: causes deletes on objects in the bucket to create noncurrent versions.
- Storage Class: allows you to set the storage class during bucket creation.
- Lifecycle: allows periodic operations to run on the bucket - the most common is stale object deletion.
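For example, to enable one of these settings, such as object versioning, you can run the following (a minimal sketch; example-bucket is a placeholder):
gsutil versioning set on gs://example-bucket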
For example, suppose you want to keep files in a particular bucket for just one day. You can set up a lifecycle rule for the bucket with:
echo '{ "rule": [{ "action": {"type": "Delete"}, "condition": {"age": 1}}]}' > lifecycle_config.json
gsutil lifecycle set lifecycle_config.json gs://example-bucket
Now, any objects in your bucket older than a day are automatically deleted. You can verify the configuration you just set with the gsutil lifecycle command (other configuration commands work in a similar fashion):
gsutil lifecycle get gs://example-bucket
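For instance, you can inspect a bucket's CORS configuration in the same way (a sketch using the same example bucket):
gsutil cors get gs://example-bucket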
Sharing data in a bucket
When working with big data, you will likely work on files collaboratively, and you'll need to be able to give access to specific people or groups. Identity and Access Management (IAM) policies define who can access your files and what they're allowed to do. You can view a bucket's IAM policy using the gsutil iam command:
gsutil iam get gs://example-bucket
The response to the command shows principals, which are accounts that are granted access to your bucket, and roles, which are groups of permissions granted to the principals.
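The exact bindings depend on your project, but an abridged response looks something like the following (the project name and etag value are placeholders):
{
  "bindings": [
    {
      "members": [
        "projectEditor:example-project",
        "projectOwner:example-project"
      ],
      "role": "roles/storage.legacyBucketOwner"
    }
  ],
  "etag": "CAE="
}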
Three common scenarios for sharing data are sharing publicly, sharing with a group, and sharing with a person:
Sharing publicly: For a bucket whose contents are meant to be listed and read by anyone on the Internet, you can configure the IAM policy using the allUsers designation:
gsutil iam ch allUsers:objectViewer gs://example-bucket
Sharing with a group: For collaborators who do not have access to your other Google Cloud resources, we recommend that you create a Google group and then add the Google group to the bucket. For example, to give access to the gs-announce Google Group, you can configure the following IAM policy:
gsutil iam ch group:gs-announce@googlegroups.com:objectViewer gs://example-bucket
For more information, see Using a Group to Control Access to Objects.
Sharing with one person: For many collaborators, use a group to give access in bulk. For one person, you can grant read access as follows:
gsutil iam ch user:liz@gmail.com:objectViewer gs://example-bucket
Cleaning up a bucket
You can quickly delete all of the objects in a bucket with the following command:
gsutil -m rm gs://example-bucket/**
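If you also want to delete the now-empty bucket itself, you can remove the bucket and any remaining objects in a single step (a sketch; this permanently deletes example-bucket):
gsutil rm -r gs://example-bucket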
Working with checksums
When performing copies, the gsutil cp and gsutil rsync commands validate that the checksum of the source file matches the checksum of the destination file. In the rare event that checksums do not match, gsutil will delete the invalid copy and print a warning message. For more information, see command line checksum validation.
You can also use gsutil to get the checksum of a file in a bucket or calculate the checksum of a local object. For example, suppose you copy a Cloud Life Sciences public data file to your working bucket with:
gsutil -m cp gs://genomics-public-data/1000-genomes/vcf/ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf gs://example-bucket
Now, you can get the checksums of both the public bucket version of the file and your version of the file in your bucket to ensure they match:
gsutil ls -L gs://example-bucket/ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf
gsutil ls -L gs://genomics-public-data/1000-genomes/vcf/ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf
Now, suppose your data is in a file at a local data center and you copied it into Cloud Storage. You can use gsutil hash to get the checksum of your local file and then compare that with the checksum of the file you copied to a bucket. To get the checksum of a local file, use:
gsutil hash local-file
MD5 values
For non-composite objects, running gsutil ls -L on an object in a bucket returns output like the following:
gs://example-bucket/100MBfile.txt:
    Creation time:    Thu, 26 Mar 2015 20:11:51 GMT
    Content-Length:   102400000
    Content-Type:     text/plain
    Hash (crc32c):    FTiauw==
    Hash (md5):       daHmCObxxQdY9P7lp9jj0A==
    ETag:             CPjo7ILqxsQCEAE=
    Generation:       1427400711419000
    Metageneration:   1
    ACL:              [
    ....
Running gsutil hash on a local file returns output like the following:
Hashing 100MBfile.txt:
Hashes [base64] for 100MBfile.txt:
    Hash (crc32c):    FTiauw==
    Hash (md5):       daHmCObxxQdY9P7lp9jj0A==
Both outputs have a CRC32c and MD5 value. There is no MD5 value for composite objects, such as those created from parallel composite uploads.
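If you are working with composite objects, or only need the CRC32c value, you can restrict gsutil hash to CRC32c (a sketch using the example file above):
gsutil hash -c 100MBfile.txt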