This guide describes the reference architecture used to deploy Google Distributed Cloud (software only) on bare metal. This guide is intended for platform administrators who want to use GKE Enterprise on a bare metal platform in a highly available, geographically redundant configuration. To best understand this guide, you should be familiar with basic GKE Enterprise concepts, as outlined in the GKE Enterprise technical overview. You should also have a basic understanding of Kubernetes concepts and Google Kubernetes Engine (GKE), as described in Learn Kubernetes Basics and the GKE documentation.
This guide has a GitHub source repository that includes scripts that you can use to deploy the architecture described. This guide also describes the architectural components and the scripts and modules that are used to create those components. We recommend that you use these files as templates to create modules that follow your organization's best practices and policies.
Architecture model
In the GKE Enterprise Architecture Foundations guide, the platform architecture is described in layers. The resources at each layer provide a specific set of functions. These resources are owned and managed by one or more personas. As shown in the following diagram, the GKE Enterprise platform architecture for bare metal consists of the following layers and resources:
- Infrastructure: This layer includes storage, compute, and networking, handled with on-premises constructs.
- Data management: For the purposes of this guide, the data management layer requires a SQL database that is managed outside of the Kubernetes clusters being deployed.
- Container management: This layer uses GKE clusters.
- Service management: This layer uses Cloud Service Mesh.
- Policy management: This layer uses Config Sync and Policy Controller.
- Application management: This layer uses Cloud Build and Cloud Source Repositories.
- Observability: This layer uses Google Cloud Observability and Cloud Service Mesh dashboards.
Each of these layers is repeated across the stack for different lifecycle environments, such as development, staging, and production.
The following sections only include additional information that is specific to bare metal deployments. They build upon their respective sections in the GKE Enterprise Architecture Foundations guide. We recommend that you review the guide as you read this article.
Networking
For more information about network requirements, see Network requirements.
For Google Distributed Cloud load balancers, there are two options available: bundled and manual.
In bundled mode, L4 load balancing software is deployed during cluster creation. The load balancer processes can run on a dedicated pool of worker nodes, or on the same nodes as the control plane. To advertise virtual IP addresses (VIPs), this load balancer has two options:
- Address Resolution Protocol (ARP): Requires layer 2 connectivity between the nodes running the load balancer.
- Border Gateway Protocol (BGP): Uses peering to interconnect your cluster network, which is an autonomous system, with another autonomous system, like an external network.
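The following is a minimal sketch of what the bundled load balancer section of a cluster configuration file can look like, assuming ARP-based VIP advertisement. The cluster name, VIPs, and address pool are placeholders; adapt them to your environment.

```yaml
apiVersion: baremetal.cluster.gke.io/v1
kind: Cluster
metadata:
  name: user-cluster-1            # placeholder cluster name
  namespace: cluster-user-cluster-1
spec:
  loadBalancer:
    mode: bundled
    vips:
      controlPlaneVIP: 10.0.0.8   # VIP for the Kubernetes API server
      ingressVIP: 10.0.0.2        # VIP for ingress; must fall inside an address pool
    addressPools:
    - name: pool1
      addresses:
      - 10.0.0.1-10.0.0.4         # pool that LoadBalancer services draw from
```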
In manual mode, you configure your own load balancing solutions for control plane and data plane traffic. There are many hardware and software options available for external load balancers. You must set up the external load balancer for the control plane before creating a bare metal cluster. The external control plane load balancer can also be used for data plane traffic, or you can set up a separate load balancer for the data plane. To determine node availability, the load balancer must be able to distribute traffic to a pool of nodes based on a configurable readiness check.
For more information about load balancers for bare metal deployments, see Overview of load balancers.
Cluster architecture
Google Distributed Cloud supports multiple deployment models on bare metal, catering to different availability, isolation, and resource footprint needs. These deployment models are discussed in Choosing a deployment model.
Identity management
Google Distributed Cloud uses the GKE Identity Service to integrate with identity providers. It supports OpenID Connect (OIDC) and Lightweight Directory Access Protocol (LDAP). For applications and services, Cloud Service Mesh can be used with various identity solutions.
For more information about identity management, see Identity management with OIDC in Google Distributed Cloud, Authenticating with OIDC, or Set up GKE Identity Service with LDAP.
Security and policy management
For Google Distributed Cloud security and policy management, we recommend using Config Sync and Policy Controller. Policy Controller lets you create and enforce policies across your clusters. Config Sync evaluates changes and applies them to all clusters to achieve the appropriate state.
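For illustration, the following sketch shows a Policy Controller constraint that requires an `env` label on every namespace. It assumes that Policy Controller's default constraint template library, which provides `K8sRequiredLabels`, is enabled; the constraint name and label key are placeholders.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: namespaces-must-have-env   # placeholder constraint name
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels:
    - key: env                     # every Namespace must carry an "env" label
```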
Services
When you use Google Distributed Cloud's bundled mode for load balancing in bare metal deployments, you can create `LoadBalancer`-type services. When you create these services, Google Distributed Cloud assigns an IP address from the configured load balancer IP address pool to the service. The `LoadBalancer` service type is used to expose the Kubernetes service outside of the cluster for north-south traffic.
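As a sketch, a `LoadBalancer`-type service looks like any other Kubernetes service; only the `type` field differs. The service and selector names below are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend           # placeholder service name
spec:
  type: LoadBalancer       # bundled mode assigns an IP address from the pool
  selector:
    app: frontend
  ports:
  - port: 80               # port exposed on the assigned load balancer IP
    targetPort: 8080       # port the Pods listen on
```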
When using Google Distributed Cloud, an `IngressGateway` is also created in the cluster by default. You can't create `LoadBalancer`-type services for Google Distributed Cloud in manual mode. Instead, you can either create an `Ingress` object that uses the `IngressGateway`, or create `NodePort`-type services and manually configure your external load balancer to use the Kubernetes service as a backend.
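A minimal sketch of the `NodePort` approach follows; the names and port numbers are placeholders. The external load balancer is then configured, outside the cluster, to forward traffic to the chosen node port on a pool of worker nodes.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend           # placeholder service name
spec:
  type: NodePort
  selector:
    app: frontend
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080        # must be in the cluster's NodePort range (default 30000-32767)
```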
For Service Management, also referred to as east-west traffic, we recommend using Cloud Service Mesh. Cloud Service Mesh is based on Istio open APIs and provides uniform observability, authentication, encryption, fine-grained traffic controls, and other features and functions. For more information about Service Management, see Cloud Service Mesh.
Persistence and state management
Google Distributed Cloud on bare metal is largely dependent on existing infrastructure for ephemeral storage, volume storage, and PersistentVolume storage. Ephemeral data uses the local disk resources on the node where the Kubernetes Pod is scheduled. For persistent data, GKE Enterprise is compatible with the Container Storage Interface (CSI), an open-standard API that many storage vendors support. For production storage, we recommend installing a CSI driver from a GKE Enterprise Ready storage partner. For a full list of GKE Enterprise Ready storage partners, see GKE Enterprise Ready storage partners.
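After a partner CSI driver is installed, workloads request storage through ordinary `PersistentVolumeClaim` objects. In the following sketch, the `csi-vendor-fast` storage class is hypothetical; use the class that your vendor's driver creates.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: csi-vendor-fast   # hypothetical class provided by a CSI driver
  resources:
    requests:
      storage: 10Gi
```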
For more information about storage, see Configuring storage.
Databases
Google Distributed Cloud doesn't provide additional database-specific capabilities beyond the standard capabilities of the GKE Enterprise platform. Most databases run on an external data management system. Workloads on the GKE Enterprise platform can also be configured to connect to any accessible external databases.
Observability
Google Cloud Observability collects logs and monitoring metrics for Google Distributed Cloud clusters in a way that is similar to the collection and monitoring policies of GKE clusters. By default, the cluster logs and the system component metrics are sent to Cloud Monitoring.

To have Google Cloud Observability collect application logs and metrics, enable the `clusterOperations.enableApplication` option in the cluster configuration YAML file.
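The following sketch shows where that option sits in the cluster configuration file; the project ID and location are placeholders.

```yaml
spec:
  clusterOperations:
    projectID: fleet-prod      # placeholder fleet host project
    location: us-central1      # placeholder Google Cloud region
    enableApplication: true    # also collect application logs and metrics
```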
For more information about observability, see Configuring logging and monitoring.
Use case: Cymbal Bank deployment
For this guide, the Cymbal Bank/Bank of Anthos application is used to simulate the planning, platform deployment, and application deployment process for Google Distributed Cloud on bare metal.
The remainder of this document consists of three sections. The Planning section outlines the decisions made based on the options discussed in the architecture model sections. The Platform deployment section discusses the scripts and modules that are provided by a source repository to deploy the GKE Enterprise platform. Finally, in the Application deployment section, the Cymbal Bank application is deployed on the platform.
This Google Distributed Cloud guide can be used to deploy to self-managed hosts or Compute Engine instances. By using Google Cloud resources, anyone can complete this guide without needing access to physical hardware. The use of Compute Engine instances is for demonstration purposes only; don't use these instances for production workloads. When access to physical hardware is available and the same IP address ranges are used, you can use the provided source repository as-is. If the environment differs from what is outlined in the Planning section, you can modify the scripts and modules to accommodate the differences. The associated source repository contains instructions for both the physical hardware and the Compute Engine instance scenarios.
Planning
The following sections detail the architectural decisions made while planning and designing the platform for the deployment of the Cymbal Bank application on Google Distributed Cloud. These sections focus on a production environment. To build lower environments like development or staging, you can use similar steps.
Google Cloud projects
When you create projects in Google Cloud for Google Distributed Cloud, a fleet host project is required. Additional projects are recommended for each environment or business function. This project configuration lets you organize resources based on the persona that interacts with the resource.
The following subsections discuss the recommended project types and the personas associated with them.
Hub project
The hub project `hub-prod` is for the network administrator persona. This project is where the on-premises data center is connected to Google Cloud using your selected form of hybrid connectivity. For more information about hybrid connectivity options, see Google Cloud Connectivity.
Fleet host project
The fleet host project `fleet-prod` is for the platform administrator persona. This project is where the Google Distributed Cloud clusters are registered. This project is also where the platform-related Google Cloud resources reside. These resources include Google Cloud Observability, Cloud Source Repositories, and others. A given Google Cloud project can have only a single fleet (or no fleets) associated with it. This restriction reinforces using Google Cloud projects to provide stronger isolation between resources that aren't governed or consumed together.
Application or team project
The application or team project `app-banking-prod` is for the developer persona. This project is where application-specific or team-specific Google Cloud resources reside. The project includes everything except GKE clusters. Depending on the number of teams or applications, there might be multiple instances of this project type. Creating separate projects for different teams lets you separately manage IAM, billing, and quota for each team.
Networking
Each Google Distributed Cloud cluster requires the following IP address subnets:
- Node IP addresses
- Kubernetes Pod IP addresses
- Kubernetes service/cluster IP addresses
- Load balancer IP addresses (bundled mode)
To use the same non-routable IP address ranges for the Kubernetes Pod and service subnets in each cluster, select an island mode network model. In this configuration, Pods can directly talk to each other inside a cluster, but can't be reached directly from outside a cluster (using Pod IP addresses). This configuration forms an island within the network that isn't connected to the external network. The clusters form a full node-to-node mesh across the cluster nodes within the island, letting Pods directly reach other Pods within the cluster.
IP address allocation
| Cluster | Node subnet | Pod subnet | Service subnet | Load balancer addresses |
|---|---|---|---|---|
| metal-admin-dc1-000-prod | 10.185.0.0/24 | 192.168.0.0/16 | 10.96.0.0/12 | N/A |
| metal-user-dc1a-000-prod | 10.185.1.0/24 | 192.168.0.0/16 | 10.96.0.0/12 | 10.185.1.3-10.185.1.10 |
| metal-user-dc1b-000-prod | 10.185.2.0/24 | 192.168.0.0/16 | 10.96.0.0/12 | 10.185.2.3-10.185.2.10 |
| metal-admin-dc2-000-prod | 10.195.0.0/24 | 192.168.0.0/16 | 10.96.0.0/12 | N/A |
| metal-user-dc2a-000-prod | 10.195.1.0/24 | 192.168.0.0/16 | 10.96.0.0/12 | 10.195.1.3-10.195.1.10 |
| metal-user-dc2b-000-prod | 10.195.2.0/24 | 192.168.0.0/16 | 10.96.0.0/12 | 10.195.2.3-10.195.2.10 |
In island mode, it's important to ensure that the IP address subnets chosen for the Kubernetes Pods and services aren't in use or routable from the node network.
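As a sketch, the `clusterNetwork` section for any of the user clusters in the preceding table can look like the following. In island mode, the same Pod and service CIDRs are reused in every cluster because they aren't routable outside the cluster.

```yaml
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16   # Pod subnet, reused across clusters in island mode
    services:
      cidrBlocks:
      - 10.96.0.0/12     # service subnet, reused across clusters in island mode
```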
Network requirements
To provide an integrated load balancer for Google Distributed Cloud that doesn't require external configuration, use the bundled load balancer mode in each cluster. When workloads create `LoadBalancer` services, an IP address is assigned to each service from the load balancer address pool.
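As a sketch, the address pool for `metal-user-dc1a-000-prod` from the earlier allocation table can be expressed in the cluster configuration like this:

```yaml
spec:
  loadBalancer:
    mode: bundled
    addressPools:
    - name: pool1
      addresses:
      - 10.185.1.3-10.185.1.10   # range reserved for LoadBalancer services
```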
To read detailed information about the bundled load balancer's requirements and configuration, see Overview of load balancers and Configuring bundled load balancing.
Cluster architecture
For a production environment, we recommend the admin and user cluster deployment model, with one admin cluster and two user clusters in each geographical location, to achieve the greatest redundancy and fault tolerance for Google Distributed Cloud.
We recommend using a minimum of four user clusters for each production environment. Use two geographically redundant locations that each contain two fault-tolerant clusters. Each fault-tolerant cluster has redundant hardware and redundant network connections. Decreasing the number of clusters reduces either the redundancy or the fault tolerance of the architecture.
To help ensure high availability, the control plane for each cluster uses three nodes. With a minimum of three worker nodes per user cluster, you can distribute workloads across those nodes to lower the impact if a node goes offline. The number and sizing of worker nodes is largely dependent on the type and number of workloads that run in the cluster. The recommended sizing for each of the nodes is discussed in Configuring hardware for Google Distributed Cloud.
The following table describes the recommended node sizing for CPU cores, memory, and local disk storage in this use case.
| Node type | CPUs/vCPUs | Memory | Storage |
|---|---|---|---|
| Control plane | 8 | 32 GiB | 256 GiB |
| Worker | 8 | 64 GiB | 512 GiB |
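As a sketch, a highly available control plane is expressed by listing three node addresses in the cluster configuration; the addresses below are placeholders drawn from a cluster's node subnet.

```yaml
spec:
  controlPlane:
    nodePoolSpec:
      nodes:
      - address: 10.185.1.11   # placeholder control plane node addresses
      - address: 10.185.1.12
      - address: 10.185.1.13
```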
For more information about machine prerequisites and sizing, see Cluster node machine prerequisites.
Identity management
For identity management, we recommend an integration with OIDC through GKE Identity Service. In the examples provided in the source repository, local authentication is used to simplify the requirements. If OIDC is available, you can modify the example to use it. For more information, see Identity management with OIDC in Google Distributed Cloud.
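If OIDC is used, GKE Identity Service is typically configured through its `ClientConfig` resource. The following is a rough sketch only; the provider values are placeholders, and the exact fields depend on your Google Distributed Cloud version, so follow the linked documentation for the authoritative schema.

```yaml
apiVersion: authentication.gke.io/v2alpha1
kind: ClientConfig
metadata:
  name: default
  namespace: kube-public
spec:
  authentication:
  - name: oidc
    oidc:
      clientID: CLIENT_ID                   # placeholder
      issuerURI: https://idp.example.com    # placeholder identity provider
      kubectlRedirectURI: http://localhost:9879/callback
      userClaim: email
```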
Security and policy management
In the Cymbal Bank use case, Config Sync and Policy Controller are used for policy management. A Cloud Source Repositories repository is created to store the configuration data that Config Sync uses. The `ConfigManagement` operator, which is used to install and manage Config Sync and Policy Controller, needs read-only access to the configuration source repository. To grant that access, use an acceptable form of authentication. In this example, a Google service account is used.
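A sketch of the resulting `ConfigManagement` resource follows; the repository URL and service account email are placeholders, and the named Google service account needs read access to the repository.

```yaml
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  sourceFormat: unstructured
  git:
    syncRepo: https://source.developers.google.com/p/fleet-prod/r/config-repo  # placeholder
    syncBranch: main
    secretType: gcpserviceaccount
    gcpServiceAccountEmail: config-sync-sa@fleet-prod.iam.gserviceaccount.com  # placeholder
  policyController:
    enabled: true                  # also install and manage Policy Controller
```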
Services
For Service Management in this use case, Cloud Service Mesh is used to provide a base on which distributed services are built. By default, an `IngressGateway` is also created in the cluster, which handles standard Kubernetes `Ingress` objects.
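A standard `Ingress` object that the default `IngressGateway` can handle looks like the following sketch; the host and service names are placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
spec:
  rules:
  - host: frontend.example.com   # placeholder host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend       # placeholder backend service
            port:
              number: 80
```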
Persistence and state management
Because storage is largely dependent on existing infrastructure, this use case doesn't require persistent storage. In other cases, however, we suggest using storage options from GKE Enterprise Ready storage partners. If a CSI storage option is available, it can be installed on the cluster using the vendor-provided instructions. For proof of concept and advanced use cases, you can use local volumes. However, for most use cases, we don't recommend using local volumes in production environments.
Databases
Many stateful applications on Google Distributed Cloud use databases as their persistence store. A stateful database application needs access to a database to provide its business logic to clients. Google Distributed Cloud places no restrictions on the type of datastore used, so data-storage decisions should be made by the developer or by the associated data management teams. Because different applications might require different datastores, those datastores can also be used without limitation. Databases can be managed in-cluster, on-premises, or in the cloud.
The Cymbal Bank application is a stateful application that accesses two PostgreSQL databases. Database access is configured through environment variables. The PostgreSQL database needs to be accessible from the nodes running the workloads, even if the database is managed externally from the cluster. In this example, the application accesses an existing, external PostgreSQL database. While the application runs on the platform, the database is managed externally. As such, the database isn't part of the GKE Enterprise platform. If a PostgreSQL database is available, use it. If not, create and use a Cloud SQL database for the Cymbal Bank application.
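As a sketch, database access can be supplied to the workloads through a ConfigMap that is referenced as environment variables. The name, key, and connection string below are hypothetical; in production, the credentials belong in a Secret rather than a ConfigMap.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: accounts-db-config   # hypothetical name
data:
  # Hypothetical connection string for an external PostgreSQL or Cloud SQL instance.
  ACCOUNTS_DB_URI: postgresql://accounts-user:PASSWORD@10.185.0.50:5432/accounts-db
```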
Observability
Each cluster in the Cymbal Bank use case is configured to have Google Cloud Observability collect logs and metrics for both the system components and applications. Several Cloud Monitoring dashboards are created during installation, and you can view them from the Monitoring dashboards page in the Google Cloud console. For more information about observability, see Configuring logging and monitoring, and How Logging and Monitoring for Google Distributed Cloud works.
Platform deployment
For more information, see the Deploy the Platform section of the documentation in the GitHub source repository.
Application deployment
For more information, see the Deploy the Application section of the documentation in the GitHub source repository.
What's next
- Read more about Cloud Service Mesh, Config Sync, and Policy Controller.
- Look at some of the other GKE Enterprise reference architectures.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.