Best Practices for Google Cloud Storage

Introduction

This page contains a summary of best practices drawn from other pages in the Google Cloud Storage documentation. You can use the best practices listed here as a quick reference for what to keep in mind when building an application that uses Google Cloud Storage. These best practices should be applied when launching a commercial application, as described in the Launch Checklist for Google Cloud Storage.

If you are just starting out with Google Cloud Storage, this page may not be the best place to start, because it does not teach you the basics of how to use Google Cloud Storage. If you are a new user, we suggest that you start with Getting Started: Using the Cloud Platform Console or Getting Started: Using the gsutil Tool.

Naming

  • The bucket namespace is global and publicly visible. Every bucket name must be unique across the entire Google Cloud Storage namespace. For more information, see Bucket and Object Naming Guidelines.

  • If you need a lot of buckets, use GUIDs or an equivalent for bucket names, put retry logic in your code to handle name collisions, and keep a list to cross-reference your buckets. Another option is to use domain-named buckets and manage the bucket names as sub-domains.
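    If you take the GUID approach, the collision-handling logic can be small. The following is a minimal sketch using the Python google-cloud-storage client; the name prefix and retry count are placeholder choices:

      import uuid

      from google.api_core.exceptions import Conflict
      from google.cloud import storage

      def create_unique_bucket(client, prefix="app", max_attempts=5):
          # GUID-based names are unique in practice, but retry on a
          # 409 Conflict in case a generated name is already taken.
          for _ in range(max_attempts):
              name = "%s-%s" % (prefix, uuid.uuid4().hex)
              try:
                  bucket = client.create_bucket(name)
                  # Record the name in your cross-reference list here.
                  return bucket
              except Conflict:
                  continue
          raise RuntimeError("no free bucket name after %d attempts" % max_attempts)

      client = storage.Client()
      bucket = create_unique_bucket(client)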

  • Don't use user IDs, email addresses, project names, project numbers, or any personally identifiable information (PII) in bucket or object names, because anyone can probe for the existence of a bucket or object and use the 403 Forbidden, 404 Not Found, and 409 Conflict responses to confirm whether a given name exists. Also, URLs often end up in caches, browser history, proxy logs, shortcuts, and other locations where the name can be read easily.

  • Bucket names should conform to standard DNS naming conventions, because a bucket name can appear in a DNS record as part of a CNAME redirect. For details on bucket name requirements, see Bucket Name Requirements.

  • Forward slashes in object names have no special meaning to Cloud Storage, as there is no native directory support. Because of this, deeply nested, directory-like structures using slash delimiters are possible, but listing them won't have the performance of a native filesystem listing deeply nested sub-directories.
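    If you do use slash-delimited names, you can still get a single level of directory-like listing by combining a prefix with a delimiter. A minimal sketch with the Python client library (the bucket and prefix names are placeholders):

      from google.cloud import storage

      client = storage.Client()
      bucket = client.bucket("my-bucket")
      # delimiter="/" returns only one "directory" level; deeper objects
      # are rolled up into the iterator's prefixes collection.
      blobs = bucket.list_blobs(prefix="logs/2024/", delimiter="/")
      for blob in blobs:
          print("object:", blob.name)
      for subdir in blobs.prefixes:  # populated once the iterator is consumed
          print("subdir:", subdir)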

  • Object listings are eventually consistent. For example, if you create a new object and immediately list the objects in the bucket, the object you just wrote may not be in the list. For more information, see Consistency.

Traffic

  • Perform a back-of-the-envelope estimation of the amount of traffic that will be sent to Google Cloud Storage. Specifically, think about:

    • Operations per second. How many operations per second do you expect, both for buckets and objects, and for create, update, and delete operations?

    • Bandwidth. How much data will be sent, over what time frame?

    • Cache control. Specifying the Cache-Control header on objects improves read latency for hot (frequently accessed) objects. For gsutil, see working with metadata; for the Cloud Platform Console, see setting object metadata; for the JSON API, see cache-control as a property or header; for the XML API, see Cache-Control.
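      As one illustration, here is a hedged sketch of setting the header with the Python client library (the bucket and object names are placeholders):

        from google.cloud import storage

        client = storage.Client()
        blob = client.bucket("my-bucket").blob("hot-object.json")
        # Allow shared caches to serve this object for up to an hour.
        blob.cache_control = "public, max-age=3600"
        blob.patch()  # sends only the changed metadata field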

  • Design your application to minimize spikes in traffic. If there are clients of your application doing updates, spread them out throughout the day.

  • While Google Cloud Storage has no upper bound on the request rate, for the best performance when scaling to high request rates, follow the Request Rate and Access Distribution Guidelines.

  • If you receive an error, retry using exponential backoff to avoid making problems worse during large traffic bursts.
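    A minimal, library-agnostic sketch of truncated exponential backoff with jitter; the retryable exception types here are stand-ins, and in practice you would match HTTP 429 and 5xx responses:

      import random
      import time

      RETRYABLE = (ConnectionError, TimeoutError)  # stand-ins for 429/5xx errors

      def with_backoff(request_fn, max_retries=5):
          # Wait roughly 1s, 2s, 4s, ... capped at 32s, plus random jitter
          # so many clients don't retry in lockstep.
          for attempt in range(max_retries):
              try:
                  return request_fn()
              except RETRYABLE:
                  time.sleep(min(2 ** attempt, 32) + random.random())
          return request_fn()  # final attempt; let any error propagate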

  • Understand the performance level customers will expect from your application. This information will help you choose a storage option and region when creating new buckets.

Buckets & objects

  • There is a per-project rate limit to bucket creation and deletion of approximately 1 operation every 2 seconds, so plan on fewer buckets and more objects in most cases. For example, a common design choice is to use one bucket per user of your project to make permission management straightforward when objects are created and accessed using end user credentials. However, if you're designing a system that adds many users per second or where objects are created using robot credentials, then design for many users in one bucket (with appropriate ACLs) so that the bucket creation rate limit doesn't become a bottleneck.

  • Highly available applications should not depend on bucket creation or deletion in their critical path. Bucket names are part of a centralized, global namespace: any dependency on this namespace creates a single point of failure for your application. Because of this, and because of the one-operation-per-two-seconds limit mentioned above, the recommended practice for highly available services on Cloud Storage is to pre-create all the buckets they need.

  • There is an update limit on each object of once per second, so rapid writes to a single object won’t scale. There is no limit to writes across multiple objects. For more information, see Object immutability in Key Terms.

  • There is no limit to reads of an object, and performance is much better for publicly cacheable objects. If many clients poll an object for the latest data and you are tempted to disable caching so they always see fresh content, consider instead setting Cache-Control to public with a max-age of 15-60 seconds. Most applications can tolerate a minute of spread, and the cache hit rate will improve performance drastically.

  • There is a per-project rate limit on object composition of approximately 200 components composed per second, so plan your use of composition accordingly.
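    For reference, a compose call with the Python client library looks roughly like this (names are placeholders), and each source object counts against the component rate limit:

      from google.cloud import storage

      client = storage.Client()
      bucket = client.bucket("my-bucket")
      parts = [bucket.blob("upload-part-1"), bucket.blob("upload-part-2")]
      # Server-side concatenation: no data is downloaded or re-uploaded.
      bucket.blob("combined-object").compose(parts)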

  • If you are concerned that your application software or users might erroneously delete or overwrite data at some point, you can protect that data by enabling object versioning on your buckets. Doing so increases storage costs, which can be partially mitigated by configuring Lifecycle Management to delete older object versions.
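    A sketch of enabling both together, assuming a recent version of the Python client library and a placeholder retention choice of three newer versions:

      from google.cloud import storage

      client = storage.Client()
      bucket = client.get_bucket("my-bucket")
      bucket.versioning_enabled = True
      # Cap versioning costs: delete a noncurrent version once three
      # newer versions of the same object exist.
      bucket.add_lifecycle_delete_rule(number_of_newer_versions=3)
      bucket.patch()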

Regions & data storage options

  • Data that will be served at a high rate with high availability should use the Multi-Regional Storage or Regional Storage class. These classes provide the best availability with the trade-off of a higher price.

  • Data that will be infrequently accessed and can tolerate slightly lower availability can be stored using the Nearline Storage or Coldline Storage class.

  • Store your data in a region closest to your application's users. For instance, for EU data you might choose an EU bucket, and for US data you might choose a US bucket. For more information, see Bucket Locations.
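    A bucket's location is chosen at creation time, along with its default storage class. A sketch using a recent version of the Python client library (the bucket name is a placeholder):

      from google.cloud import storage

      client = storage.Client()
      bucket = storage.Bucket(client, name="my-eu-bucket")
      bucket.storage_class = "NEARLINE"  # or MULTI_REGIONAL, REGIONAL, COLDLINE
      client.create_bucket(bucket, location="EU")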

  • Keep compliance requirements in mind when choosing a location for user data. Are there legal requirements governing where your users' data may be stored?

Security, ACLs, and access control

  • The first and foremost precaution is: Never share your credentials. Each user should have distinct credentials.

  • Always use TLS (HTTPS) to transport your data when you can. This ensures that your credentials as well as your data are protected as you transport data over the network. For example, to access the Google Cloud Storage API, you should use https://storage.googleapis.com.

  • Make sure that you use an HTTPS library that validates server certificates. A lack of server certificate validation makes your application vulnerable to man-in-the-middle attacks and other attacks. Be aware that HTTPS libraries shipped with certain commonly used implementation languages do not, by default, verify server certificates. For example, Python before version 3.2 had no complete built-in support for server certificate validation, and you needed third-party wrapper libraries to ensure your application validated server certificates. Boto includes code that validates server certificates by default.
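    In modern Python (3.4 and later) the standard library verifies server certificates by default; a minimal sketch that makes the verifying SSL context explicit:

      import http.client
      import ssl

      # create_default_context() loads the system CA store and enables
      # hostname checking; the TLS handshake fails if the certificate
      # is invalid, instead of silently connecting.
      context = ssl.create_default_context()
      conn = http.client.HTTPSConnection("storage.googleapis.com", context=context)
      conn.request("HEAD", "/")
      print(conn.getresponse().status)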

  • When applications no longer need access to your data, you should revoke their authentication credentials. For Google services and APIs, you can do this by logging into your Google Account and clicking on Authorizing applications and sites. On the next page, you can revoke access for applications by clicking Revoke Access next to the application.

  • When you print out HTTP protocol details, your authentication credentials, such as OAuth 2.0 tokens, are visible in the headers. If you need to post protocol details to a message board or need to supply HTTP protocol details for troubleshooting, make sure that you sanitize or revoke any credentials that appear as part of the output.

  • Make sure that you securely store your credentials. This can be done differently depending on your environment and where you store your credentials. For example, if you store your credentials in a configuration file, make sure that you set appropriate permissions on that file to prevent unwanted access. If you are using Google App Engine, consider using StorageByKeyName to store your credentials.

  • Google Cloud Storage requests refer to buckets and objects by their names. As a result, even though ACLs prevent unauthorized third parties from operating on buckets or objects, a third party can attempt requests with bucket or object names and determine their existence by observing the error responses. Information in bucket or object names can be leaked this way. If you are concerned about the privacy of your bucket or object names, you should take appropriate precautions, such as:

    • Choosing bucket and object names that are difficult to guess. For example, a bucket named mybucket-GTbyTuL3 is random enough that unauthorized third parties cannot feasibly guess it or enumerate other bucket names from it.

    • Avoiding use of sensitive information as part of bucket or object names. For example, instead of naming your bucket mysecretproject-prodbucket, name it somemeaninglesscodename-prod. In some applications, you may want to keep sensitive metadata in custom Google Cloud Storage headers such as x-goog-meta, rather than encoding the metadata in object names.

  • Use groups in preference to explicitly listing large numbers of users. Not only does this scale better, it also provides a very efficient way to update the access control for a large number of objects all at once. Lastly, it's cheaper, as you don't need to make a request per object to change the ACLs.

  • Before adding objects to a bucket, verify that the default object ACLs meet your requirements. This can save you a lot of time updating ACLs on individual objects.
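    A sketch combining these two recommendations with the Python client library: grant read access to a group (rather than to individual users) through the bucket's default object ACL, so that future objects inherit it (the bucket name and group address are placeholders):

      from google.cloud import storage

      client = storage.Client()
      bucket = client.get_bucket("my-bucket")
      # Objects created after this change inherit the group grant.
      bucket.default_object_acl.group("readers@example.com").grant_read()
      bucket.default_object_acl.save()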

  • Bucket and object ACLs are independent of each other, which means that the ACLs on a bucket do not affect the ACLs on objects inside that bucket. It is possible for a user without permissions for a bucket to have permissions for an object inside the bucket. For example, you can create a bucket such that only GroupA is granted permission to list the objects in the bucket, but then upload an object into that bucket that allows GroupB READ access to the object. GroupB will be able to read the object, but will not be able to view the contents of the bucket or perform bucket-related tasks.

  • The Google Cloud Storage access control system includes the ability to specify that objects are publicly readable. Make sure you intend for any objects you write with this permission to be public. Once "published", data on the Internet can be copied to many places, so it's effectively impossible to regain read control over an object written with this permission.

  • The Google Cloud Storage access control system includes the ability to specify that buckets are publicly writable. While configuring a bucket this way can be convenient for various purposes, we recommend against using this permission: it can be abused for distributing illegal content, viruses, and other malware, and the bucket owner is legally and financially responsible for the content stored in their buckets.

    If you need to make content available securely to users who don't have Google accounts we recommend you use signed URLs. For example, with signed URLs you can provide a link to an object and your application's customers do not need to authenticate with Google Cloud Storage to access the object. When you create a signed URL you control the type (read, write, delete) and duration of access.
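    A sketch of generating a signed URL with the Python client library, assuming the client is authenticated as a service account that can sign (the names and duration are placeholders):

      from datetime import timedelta
      from google.cloud import storage

      client = storage.Client()
      blob = client.bucket("my-bucket").blob("report.pdf")
      # Anyone holding this URL can read the object for the next 30
      # minutes; no Google account is required.
      url = blob.generate_signed_url(expiration=timedelta(minutes=30), method="GET")
      print(url)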

  • If you use gsutil, see these additional recommendations.

Uploading data to Cloud Storage

  • If you have a content publishing pipeline that needs to validate that the expected objects are present in the cloud, listing the bucket immediately after uploading and/or deleting objects can be problematic due to bucket listing eventual consistency. We recommend instead that you build a manifest of objects you expect to be present in the cloud by listing your local repository, and then fetch the metadata for each object to confirm it is present (and, for extra assurance, that it has the expected checksum). Reading object metadata is strongly consistent, so doing it this way will avoid the bucket listing eventual consistency problem noted above. One way to do this is to script calls to the gsutil stat command without using a wildcard.
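    The same manifest check can be scripted with the Python client library instead of gsutil. In this sketch the manifest mapping is a placeholder; get_blob performs a strongly consistent metadata read and returns None for missing objects:

      import base64
      import hashlib
      from google.cloud import storage

      def local_md5_b64(path):
          # Cloud Storage reports an object's MD5 as a base64-encoded digest.
          with open(path, "rb") as f:
              return base64.b64encode(hashlib.md5(f.read()).digest()).decode()

      client = storage.Client()
      bucket = client.bucket("my-bucket")
      manifest = {"reports/a.csv": "/local/reports/a.csv"}  # object -> local path
      for object_name, path in manifest.items():
          blob = bucket.get_blob(object_name)  # strongly consistent metadata read
          if blob is None:
              print("missing:", object_name)
          elif blob.md5_hash != local_md5_b64(path):
              print("checksum mismatch:", object_name)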

  • If you use XMLHttpRequest (XHR) callbacks to get progress updates, do not close and re-open the connection if you detect that progress has stalled. Doing so creates a bad positive feedback loop during times of network congestion. When the network is congested, XHR callbacks can get backlogged behind the acknowledgement (ACK/NACK) activity from the upload stream, and closing and reopening the connection when this happens uses more network capacity at exactly the time when you can least afford it.

  • For upload traffic, we recommend setting reasonably long timeouts. For a good end-user experience, you can set a client-side timer that updates the client status window with a message (e.g., "network congestion") when your application hasn't received an XHR callback for a long time. Don't just close the connection and try again when this happens.

  • If you use Google Compute Engine instances with processes that POST to Cloud Storage to initiate a resumable upload, then you should use Compute Engine instances in the same locations as your Cloud Storage buckets. You can then use a geo IP service to pick the Compute Engine region to which you route customer requests, which will help keep traffic localized to a geo-region.

  • For resumable uploads, the resumable session should stay in the region in which it was created. Doing so reduces cross-region traffic that arises when reading and writing the session state, improving resumable upload performance.

  • Avoid breaking a transfer into smaller chunks if possible and instead upload the entire content in a single chunk. Avoiding chunking removes fixed latency costs and improves throughput, as well as reducing QPS against Google Cloud Storage.

    Situations where you should consider uploading in chunks include when your source data is being generated dynamically, your clients have request size limitations (which is true for many browsers), or your clients are unable to stream bytes in a single request without first loading the full request into memory. If your clients receive an error, they can query the server for the commit offset and resume uploading remaining bytes from that offset.
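    With the Python client library, leaving the blob's chunk_size unset makes the client upload the content in a single request where it can; a minimal sketch (file and object names are placeholders):

      from google.cloud import storage

      client = storage.Client()
      blob = client.bucket("my-bucket").blob("videos/clip.mp4")
      # chunk_size is left unset, so the client avoids splitting the
      # upload into a series of smaller chunked requests.
      blob.upload_from_filename("clip.mp4")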

Website hosting

The Cross-Origin Resource Sharing (CORS) topic describes how to allow scripts hosted on other websites to access static resources stored in a Google Cloud Storage bucket. The converse scenario is when you allow scripts hosted in Google Cloud Storage to access static resources hosted on a website external to Cloud Storage. In the latter scenario, the website is serving CORS headers so that content on storage.googleapis.com is allowed access. It is recommended that you dedicate a specific bucket for this data access. For example, it is better to have the website serve the CORS header Access-Control-Allow-Origin: https://mybucket.storage.googleapis.com instead of Access-Control-Allow-Origin: https://storage.googleapis.com. This approach prevents your site from inadvertently over-exposing static resources to all of storage.googleapis.com.
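For the first scenario (scripts on other websites reading static resources from your bucket), the bucket itself carries the CORS policy. A hedged sketch of setting one with the Python client library; the origin, methods, and headers are placeholder choices:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-bucket")
    bucket.cors = [{
        "origin": ["https://app.example.com"],  # only this site may read
        "method": ["GET"],
        "responseHeader": ["Content-Type"],
        "maxAgeSeconds": 3600,
    }]
    bucket.patch()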
