Reference architectures for reliable deployment of Cloud EKM services

Authors: Adit Sinha, Sid Telang

Acknowledgments: Alex Lurye, Lauren Hughes

Last update: March 2022

This page contains a set of recommendations for Google Cloud customers requiring a highly available External Key Manager (EKM) service deployment integrated with Cloud EKM.

Using Cloud EKM and the associated EKM service involves an explicit risk tradeoff for customers between cloud workload reliability and data protection controls. Encrypting data-at-rest in the cloud with off-cloud encryption keys adds new failure modes that may result in the inaccessibility or even loss of data stored in Google Cloud services. To address these risks, it is paramount to incorporate high availability and fault tolerance into the design of the EKM service.

These recommendations are aimed at those who develop and operate the EKM service. If you are a customer of a supported partner, you might share some of these responsibilities with the partner, depending on the design of their product and how it integrates with Cloud EKM.

Background and terminology

External key manager solutions, and specifically Google Cloud's Cloud EKM, allow cloud customers to use non-cloud-resident key material to control access to their data stored in supported Google Cloud services.

Cloud EKM introduces the ability to create and manage Cloud KMS key resources with the EXTERNAL and EXTERNAL_VPC protection levels. Keys with the EXTERNAL and EXTERNAL_VPC protection levels are stored and managed in an external key management system. These Cloud KMS resources, like Cloud KMS keys of other protection levels, can be used to encrypt data-at-rest in supported Google Cloud services using customer-managed encryption keys (CMEK). Every cryptographic operation requested on such a Cloud KMS resource results in a cryptographic operation on the external key requested by Cloud KMS. The success of the former operation critically depends on that of the latter.

Cloud KMS requests operations on external keys using a special-purpose API that integrates with the external key management system. Throughout this document we refer to a service that provides this API as an EKM service.

If an EKM service becomes unavailable, reads and writes from the data planes for integrated Google Cloud services may fail. These failures surface in a similar way as failures do when the dependent Cloud KMS key is in an unusable state, for example, when it is disabled. The end-user facing error message describes in detail the source of the error and a course of action. Furthermore, Cloud KMS data access audit logs persist a record of these error messages together with descriptive error types that can be programmatically consumed. More information can be found in the Cloud EKM error reference documentation.
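
Because these error records can be consumed programmatically, you can build monitoring on top of the audit logs. The following Go sketch is a minimal illustration, not an authoritative integration: it lists recent Cloud KMS data access audit log entries with error severity using the Cloud Logging logadmin client. The project ID and the log filter are assumptions to adapt to your environment and to the error types listed in the Cloud EKM error reference.

    package main

    import (
        "context"
        "fmt"
        "log"

        "cloud.google.com/go/logging/logadmin"
        "google.golang.org/api/iterator"
    )

    func main() {
        ctx := context.Background()
        projectID := "my-project" // assumption: replace with your project ID

        client, err := logadmin.NewClient(ctx, projectID)
        if err != nil {
            log.Fatalf("logadmin.NewClient: %v", err)
        }
        defer client.Close()

        // Assumed filter: Cloud KMS data access audit logs that recorded an error.
        // Refine it with the specific error types from the Cloud EKM error reference.
        filter := `logName="projects/` + projectID + `/logs/cloudaudit.googleapis.com%2Fdata_access"` +
            ` AND protoPayload.serviceName="cloudkms.googleapis.com" AND severity>=ERROR`

        it := client.Entries(ctx, logadmin.Filter(filter), logadmin.NewestFirst())
        for i := 0; i < 10; i++ {
            entry, err := it.Next()
            if err == iterator.Done {
                break
            }
            if err != nil {
                log.Fatalf("reading log entries: %v", err)
            }
            fmt.Printf("%s %s\n", entry.Timestamp.Format(time.RFC3339), entry.Severity)
        }
    }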

Guiding principles

Google's Site Reliability Engineering book illustrates a number of high-level principles to guide the development and maintenance of reliable systems. In this section, we highlight some of these principles in the context of how your EKM service integrates with Google Cloud. We apply these principles using three reference architectures, and place primary importance on the following three high-level reliability objectives:

  • Low latency, reliable network connectivity
  • High availability
  • Fast failure detection and mitigation

For each objective, we highlight factors that affect reliability and provide recommendations for taking them into account in your EKM service's architecture.

Low-latency, reliable network connectivity

Cloud KMS connects to EKM services via either a Virtual Private Cloud (VPC) or the public internet. VPC solutions will often use hybrid connectivity to host the EKM service in an on-premises datacenter. The connection between Google Cloud and the datacenter must likewise be fast and reliable. When using the public internet, stable, uninterrupted reachability and fast, reliable DNS resolution are of the utmost importance. From the point of view of Google Cloud, any interruption manifests as unavailability of the EKM service and the potential inability to access EKM-protected data.

When a Google Cloud service's data plane communicates with the EKM service, each EKM service-bound call has a defined timeout period (currently 150 ms). The timeout is measured from the Cloud KMS service in the Google Cloud location of the Cloud KMS key. If the Google Cloud location is a multi-region, then the timeout begins in the region where Cloud KMS receives the request, which is typically where the operation on the dependent CMEK-protected data resource occurred. This timeout is adequate for an EKM service that serves requests near the Google Cloud region from which they originate.

Note that this is a much shorter timeout than the 10-second timeout that is common across Google Cloud APIs. The reduced timeout helps prevent cascading failures in downstream services that depend on the external key. Tail latency issues that might merely cause a poor user experience in higher-level applications can instead manifest as failed accesses to the external key, resulting in the failure of the higher-level logical operation.
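
As an illustration of budgeting for this deadline, the following Go sketch shows how an EKM service front end might enforce an internal timeout slightly below the 150 ms budget so that it returns a definitive error rather than letting Cloud KMS time out. The /v0/wrap path, port, and response body are illustrative assumptions, not the actual Cloud EKM API contract.

    package main

    import (
        "context"
        "log"
        "net/http"
        "time"
    )

    // cloudKMSTimeout mirrors the per-call deadline Cloud KMS applies to EKM requests.
    const cloudKMSTimeout = 150 * time.Millisecond

    // withDeadline gives each request an internal budget below Cloud KMS's deadline,
    // leaving headroom for network transit, so the service fails fast instead of
    // letting the caller time out.
    func withDeadline(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), cloudKMSTimeout-30*time.Millisecond)
            defer cancel()
            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }

    func main() {
        mux := http.NewServeMux()
        // Illustrative wrap endpoint; real request routing and payloads will differ.
        mux.HandleFunc("/v0/wrap", func(w http.ResponseWriter, r *http.Request) {
            select {
            case <-time.After(10 * time.Millisecond): // placeholder for the HSM call
                w.Write([]byte(`{"wrapped_blob": "..."}`))
            case <-r.Context().Done():
                http.Error(w, "internal deadline exceeded", http.StatusServiceUnavailable)
            }
        })
        log.Fatal(http.ListenAndServe(":8443", withDeadline(mux)))
    }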

Recommendations

  • Minimize latency of round-trip communication with Cloud KMS.

    Configure EKM services to serve requests as geographically near as possible to the Google Cloud locations corresponding to the Cloud KMS keys configured to use the EKM service. For more information, see the best practices guide for choosing Google Cloud regions on the basis of latency considerations and the documentation on where Google Cloud regions and zones are located. A simple latency probe, sketched after this list, can help validate observed round-trip times from the serving region.

  • Use Cloud Interconnect when possible.

    Cloud Interconnect creates a highly available, low-latency connection between Google Cloud and your datacenter via VPC and eliminates dependencies on the public internet.

  • Deploy Google Cloud networking solutions in the region closest to the EKM service, when necessary.

    Ideally, Cloud KMS keys should be in the region nearest to the EKM service. If there is a Google Cloud region that is closer to the EKM service than the region holding the Cloud KMS keys, use Google Cloud networking solutions, such as Cloud VPN, in the region closest to the EKM service. This ensures that network traffic uses Google infrastructure when possible and reduces dependence on the public internet.

  • Use Premium Tier networks for cases where EKM traffic transits through the internet.

    Premium Tier routes traffic through the internet using Google's infrastructure where possible to improve reliability and reduce latency.
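
To validate the latency recommendation above, you can measure the round-trip time to your EKM service from a VM in the Google Cloud region that serves your Cloud KMS keys. The following Go sketch is a minimal probe; the endpoint URL is a placeholder, and a production check would also track percentiles over a longer window.

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        const target = "https://ekm.example.com/healthz" // placeholder EKM endpoint
        // Use the same budget Cloud KMS applies so slow probes surface as errors.
        client := &http.Client{Timeout: 150 * time.Millisecond}

        var worst time.Duration
        for i := 0; i < 20; i++ {
            start := time.Now()
            resp, err := client.Get(target)
            elapsed := time.Since(start)
            if err != nil {
                fmt.Printf("probe %d: error after %v: %v\n", i, elapsed, err)
                continue
            }
            resp.Body.Close()
            if elapsed > worst {
                worst = elapsed
            }
            fmt.Printf("probe %d: %v (status %d)\n", i, elapsed, resp.StatusCode)
        }
        fmt.Printf("worst observed round trip: %v\n", worst)
    }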

High availability

In general, the reliability of a system is determined by that of its least reliable component. The existence of a single point of failure in the EKM service will reduce the availability of dependent Google Cloud resources to that of the single point of failure. Such points of failure may live in critical dependencies of the EKM service as well as the underlying compute and network infrastructure.

Recommendations

  • Deploy replicas across independent failure domains.

    We recommend deploying at least 2 replicas of the EKM service. EKM services intended for use with multiregional Google Cloud locations should be deployed in at least 2 separate geographic locations, with at least 2 replicas in each.

    Take special care to ensure not only that each replica is an independent, fully functional data plane of the EKM service, but also that cross-replica failure vectors are well understood, minimized, and hardened against. For example,

    • production mutations, including server binary and configuration pushes, should modify only one replica at a time, and should be carried out under supervision, with tested rollbacks readily available.
    • cross-replica failure modes from the underlying infrastructure should be well understood and minimized. For example, ensure that replicas depend on independent and redundant power feeds.
  • Replicas should be resilient to single machine outages.

    Each replica of the service itself should comprise at least 3 appliances, machines, or VM hosts. This allows the system to continue serving traffic while one machine is down for an update and another has an unexpected outage (N+2 provisioning).

  • Limit the "blast radius" of control plane issues.

    The control plane of the EKM service (for example, key creation and deletion) typically requires configuration or data to be replicated across replicas. These operations are generally more complex because they require synchronization and affect all replicas, so issues can quickly propagate to affect the entire system. Some strategies to reduce the impact of such issues:

    • Control propagation speed: By default, changes should propagate as slowly as is acceptable for usability and security. However, there might be exceptions that allow some changes, such as permitting access to a key, to propagate quickly so that a user can undo a mistake.

    • Shard the system: If many users share the EKM, partition them into logical shards that are completely independent, so that issues triggered by a user in one shard cannot affect users in other shards.

    • Preview the effect of changes: If possible, allow users to see the effect of changes before applying them. For example, when modifying a key access policy, the EKM could confirm the number of recent requests that would have been rejected under the new policy.

    • Data canarying: First push data to a small subset of the system, and only push everywhere if that subset remains healthy.

  • Implement holistic health checks.

    Create health checks that measure whether the full system is functioning. For example, health checks that only validate network connectivity will not be helpful in responding to many application-level issues. Ideally, the health check closely mirrors the dependencies exercised by real traffic; a sketch of such a check appears after this list.

  • Set up failover across replicas.

    Set up load balancing in your EKM service components so that it consumes these health checks, actively drains traffic away from unhealthy replicas, and safely fails over to healthy replicas.

  • Include safety mechanisms to manage overload and avoid cascading failures.

    Systems may become overloaded for a variety of reasons. For example, when some replicas become unhealthy, traffic redirected to the healthy replicas could overload them. When faced with more requests than it can serve, the system should attempt to serve what it can safely and quickly, while rejecting excess traffic. This chapter from the Site Reliability Engineering book contains more details and recommendations for preventing overload.

  • Ensure a robust durability story.

    Data in Google Cloud that is encrypted with an external key in the EKM service is unrecoverable without the external key. Therefore key durability is one of the central design requirements of the EKM service. The EKM service should securely back up redundant copies of key material in multiple physical locations. High value keys, such as roots of trust, should have additional protection measures, such as offline backups. Deletion mechanisms should proceed slowly enough to allow for recovery in cases of accidents and bugs. See this chapter from the Site Reliability Engineering book for more information.

  • Reference: Building Secure and Reliable Systems.

    The EKM service is both a security-critical and reliability-critical system. The above book delves into the interesting interplay between security and reliability concerns and provides guiding principles for designing a service that needs to address both.
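
As an example of the health-check recommendation above, the following Go sketch exposes a /healthz endpoint that exercises an end-to-end key operation rather than only checking connectivity. The checkKeyOperation function is a hypothetical placeholder for wrapping and unwrapping a test payload with a dedicated health-check key.

    package main

    import (
        "context"
        "errors"
        "log"
        "net/http"
        "time"
    )

    // checkKeyOperation is a hypothetical placeholder: a real implementation would wrap
    // a fixed plaintext with a dedicated health-check key, unwrap it, and compare the
    // result, exercising the same HSM, storage, and policy dependencies as real traffic.
    func checkKeyOperation(ctx context.Context) error {
        select {
        case <-ctx.Done():
            return errors.New("health-check key operation timed out")
        case <-time.After(5 * time.Millisecond): // stand-in for the real operation
            return nil
        }
    }

    func healthHandler(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 100*time.Millisecond)
        defer cancel()
        if err := checkKeyOperation(ctx); err != nil {
            // Load balancers consuming this endpoint drain traffic away from the replica.
            http.Error(w, "unhealthy: "+err.Error(), http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/healthz", healthHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }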

Fast failure detection and mitigation

For every minute the EKM service suffers an outage, dependent Google Cloud resources may be inaccessible, which can further increase the likelihood of a cascading failure of other dependent components of your infrastructure.

Recommendations

  • Instrument the EKM service to report metrics that signal reliability-threatening incidents.

    Examples of important metrics include response error rates and response latencies. A sketch of such instrumentation appears after this list.

  • Set up operational practices for timely notification and mitigation of incidents.

    The effectiveness of such efforts should be quantified by tracking Mean Time To Detect (MTTD) and Mean Time To Restore (MTTR) metrics, and defining objectives measured in terms of these metrics. This chapter in the Site Reliability Engineering book contains recommendations on tracking outages. Once these metrics are available, you can find patterns and deficiencies in the current processes and systems for responding to incidents and address them.
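
As a concrete example of the instrumentation recommendation above, the following Go sketch records request latencies and error counts with the Prometheus client library. The metric names, label values, and /v0/wrap handler are illustrative assumptions; any comparable monitoring stack works equally well.

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        requestLatency = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "ekm_request_duration_seconds",
                Help:    "Latency of EKM API requests.",
                Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.15, 0.25, 0.5},
            },
            []string{"method"},
        )
        requestErrors = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "ekm_request_errors_total",
                Help: "EKM API requests that returned an error status.",
            },
            []string{"method", "code"},
        )
    )

    // statusRecorder captures the response status so errors can be counted.
    type statusRecorder struct {
        http.ResponseWriter
        status int
    }

    func (s *statusRecorder) WriteHeader(code int) {
        s.status = code
        s.ResponseWriter.WriteHeader(code)
    }

    // instrument wraps a handler to record latency and error metrics per method.
    func instrument(method string, next http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
            start := time.Now()
            next(rec, r)
            requestLatency.WithLabelValues(method).Observe(time.Since(start).Seconds())
            if rec.status >= 400 {
                requestErrors.WithLabelValues(method, fmt.Sprint(rec.status)).Inc()
            }
        }
    }

    func main() {
        prometheus.MustRegister(requestLatency, requestErrors)
        http.Handle("/metrics", promhttp.Handler())
        // Illustrative wrap endpoint; replace with the real EKM request handlers.
        http.HandleFunc("/v0/wrap", instrument("wrap", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
        }))
        log.Fatal(http.ListenAndServe(":8443", nil))
    }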

Reference architectures

The following architectures describe a few ways to deploy the EKM service using Google Cloud networking and load balancing products. Each architecture may be deployed either by a customer or by a partner who operates the service for multiple customers, depending on the desired operational model. The architectures are only pertinent to those who are building and operating the EKM service.

Direct connection over Cloud VPN or Cloud Interconnect

Figure: Direct connection to EKM

In the above architecture, Cloud EKM accesses the EKM service located in Oregon through hybrid connectivity in the us-west1 region without any intermediate load balancing in Google Cloud.

When possible, the Cloud EKM to EKM service connection should be deployed using the 99.9% availability configuration for single-region applications. If the connection to the on-premises datacenter uses the public internet, you should use HA VPN.

The primary advantage of this architecture is that there are no intermediate hops in Google Cloud, which reduces latency and potential bottlenecks. Using this architecture with an EKM service that is hosted across multiple datacenters requires having the load balancer in all datacenters use the same (anycast) IP address. A disadvantage is that load balancing and failover among datacenters are based solely on route availability, which means that such decisions are made based only on the state of the network rather than the overall state of the EKM deployment.

Load balanced in a VPC in Google Cloud

Figure: Load balanced in a VPC

In the above architecture, Cloud EKM accesses the EKM service replicated between Oregon and California through hybrid connectivity with layers of intermediate load balancing in the us-west1 region in Google Cloud.

The internal passthrough Network Load Balancer provides a single IP address to which to send traffic using virtual networking. It fails over to the backup datacenter based on active health checks of the backends.

The VM instance group is necessary to proxy traffic, because the internal load balancer cannot route traffic directly to on-premises backends. One strategy to deploy the load balancer proxies is to run an Nginx Docker image from Cloud Marketplace in the instance groups. Nginx can be used as a TCP load balancer.

Since this approach uses load balancers in Google Cloud, the architecture could be deployed without an on-premises load balancer. The Google Cloud load balancers can connect directly to instances of the EKM service and balance load among them. Eliminating the on-premises load balancer results in simpler configuration but reduces the flexibility available in the EKM service. For example, an on-premises L7 load balancer could automatically retry requests if one EKM instance returns an error.

Load balanced from public internet in Google Cloud

Figure: Load balanced from the public internet

In the above architecture, the EKM has replicas in on-premises sites in California and Virginia. Each backend is represented in Google Cloud using a hybrid connectivity network endpoint group (NEG). The deployment uses an external proxy Network Load Balancer to forward traffic directly to one of the replicas. Unlike the other approaches, which rely on VPC networking, the TCP proxy has a public IP address, and traffic comes from the public internet.

Each hybrid connectivity NEG may contain multiple IP addresses, which allows the TCP Proxy Load Balancer to balance traffic directly to instances of the EKM service. An additional load balancer in the on-premises datacenter is not necessary.

The TCP proxy load balancer is not tied to a specific region. It can direct incoming traffic to the nearest healthy region, which makes it suitable for multiregional Cloud KMS keys. However, the load balancer does not allow configuration of primary and failover backends. Traffic would be distributed evenly across multiple backends in a region.

Comparison

  • Direct connection
    • Public internet or VPC: VPC
    • Load balancer in Google Cloud: No
    • On-premises load balancer required: Yes
    • Supports multiregional Cloud KMS locations: No
    • Recommended for: High-throughput applications where the EKM service runs in a single site.

  • Load balanced in a VPC
    • Public internet or VPC: VPC
    • Load balancer in Google Cloud: Yes
    • On-premises load balancer required: No
    • Supports multiregional Cloud KMS locations: No
    • Recommended for: Most EKM services where you deploy your own EKM.

  • Load balanced from public internet
    • Public internet or VPC: Internet
    • Load balancer in Google Cloud: Yes
    • On-premises load balancer required: No
    • Supports multiregional Cloud KMS locations: Yes
    • Recommended for: When multiregional Cloud KMS keys are required.

  • Fully managed EKM provided by partner
    • Public internet or VPC: Internet
    • Load balancer in Google Cloud: No
    • On-premises load balancer required: Yes (managed by partner)
    • Supports multiregional Cloud KMS locations: Yes
    • Recommended for: Customers who are able to use a partner's EKM instead of deploying their own.