GKE Enterprise reference architecture: Google Distributed Cloud (software only) on bare metal

Last reviewed 2023-08-15 UTC

This guide describes the reference architecture used to deploy Google Distributed Cloud (software only) on bare metal. This guide is intended for platform administrators who want to use GKE Enterprise on a bare metal platform in a highly available, geographically redundant configuration. To best understand this guide, you should be familiar with basic GKE Enterprise concepts, as outlined in the GKE Enterprise technical overview. You should also have a basic understanding of Kubernetes concepts and Google Kubernetes Engine (GKE), as described in Learn Kubernetes Basics and the GKE documentation.

This guide has a GitHub source repository that includes scripts that you can use to deploy the architecture described. This guide also describes the architectural components that accompany the scripts and modules that are used to create those components. We recommend that you use these files as templates to create modules that use your organization's best practices and policies.

Architecture model

In the GKE Enterprise Architecture Foundations guide, the platform architecture is described in layers. The resources at each layer provide a specific set of functions. These resources are owned and managed by one or more personas. As shown in the following diagram, the GKE Enterprise platform architecture for bare metal consists of the following layers and resources:

A Google Distributed Cloud mental model that shows the layers that the document discusses.

Infrastructure: This layer includes storage, compute, and networking, handled with on-premises constructs.
Data management: For the purposes of this guide, the data management layer requires a SQL database that is managed outside of the Kubernetes clusters being deployed.
Container management layer: This layer uses GKE clusters.
Service Management layer: This layer uses Cloud Service Mesh.
Policy Management layer: This layer uses Config Sync and Policy Controller.
Application management layer: This layer uses Cloud Build and Cloud Source Repositories.
Observability layer: This layer uses Google Cloud Observability and Cloud Service Mesh dashboards.

Each of these layers is repeated across the stack for different lifecycle environments, such as development, staging, and production.

The following sections only include additional information that is specific to bare metal deployments. They build upon their respective sections in the GKE Enterprise Architecture Foundations guide. We recommend that you review the guide as you read this article.

Networking

For more information about network requirements, see Network requirements.

For Google Distributed Cloud load balancers there are two options available: bundled and manual.

In bundled mode, L4 load balancing software is deployed during cluster creation. The load balancer processes can run on a dedicated pool of worker nodes, or on the same nodes as the control plane. To advertise virtual IP addresses (VIPs), this load balancer has two options:

Address Resolution Protocol (ARP): Requires layer 2 connectivity between the nodes running the load balancer.
Border Gateway Protocol (BGP): Uses peering to interconnect your cluster network, which is an autonomous system, with another autonomous system, like an external network.

In manual mode, you configure your own load balancing solutions for control plane and data plane traffic. There are many hardware and software options available for external load balancers. You must set up the external load balancer for the control plane before creating a bare metal cluster. The external control plane load balancer can also be used for data plane traffic, or you can set up a separate load balancer for the data plane. To determine availability, the load balancer must be able to distribute traffic to a pool of nodes based on a configurable readiness check.
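The readiness-based distribution described for manual mode can be illustrated with a minimal Python sketch. It is not how any particular load balancer product works: the node addresses and the stubbed readiness results are hypothetical, and a real external load balancer would run its own configurable health check (for example, a TCP or HTTPS probe) instead of the lookup table used here.

```python
from itertools import cycle
from typing import Callable, Iterable, Iterator


def ready_backends(nodes: Iterable[str], is_ready: Callable[[str], bool]) -> list[str]:
    """Return only the nodes that currently pass the readiness check."""
    return [node for node in nodes if is_ready(node)]


def round_robin(nodes: Iterable[str], is_ready: Callable[[str], bool]) -> Iterator[str]:
    """Cycle over the nodes that passed the readiness check at call time."""
    backends = ready_backends(nodes, is_ready)
    if not backends:
        raise RuntimeError("no backend passed the readiness check")
    return cycle(backends)


# Hypothetical node addresses and stubbed readiness results (assumptions for
# illustration only; they don't come from the guide).
NODES = ["10.185.1.11", "10.185.1.12", "10.185.1.13"]
READY = {"10.185.1.11": True, "10.185.1.12": False, "10.185.1.13": True}

picker = round_robin(NODES, lambda node: READY[node])
first_four = [next(picker) for _ in range(4)]
print(first_four)
```

In this sketch, the node that fails the check never receives traffic. A production load balancer would also re-evaluate readiness periodically rather than once.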

For more information about load balancers for bare metal deployments, see Overview of load balancers.

Cluster architecture

Google Distributed Cloud supports multiple deployment models on bare metal, catering to different availability, isolation, and resource footprint needs. These deployment models are discussed in Choosing a deployment model.

Identity management

Google Distributed Cloud uses the GKE Identity Service to integrate with identity providers. It supports OpenID Connect (OIDC) and Lightweight Directory Access Protocol (LDAP). For applications and services, Cloud Service Mesh can be used with various identity solutions.

For more information about identity management, see Identity management with OIDC in Google Distributed Cloud, Authenticating with OIDC, or Set up GKE Identity Service with LDAP.

Security and policy management

For Google Distributed Cloud security and policy management, we recommend using Config Sync and Policy Controller. Policy Controller lets you create and enforce policies across your clusters. Config Sync evaluates changes and applies them to all clusters to achieve the appropriate state.

Services

When you use Google Distributed Cloud's bundled mode for load balancing in bare metal deployments, you can create LoadBalancer-type services. When you create these services, Google Distributed Cloud assigns an IP address from the configured load balancer IP address pool to the service. The LoadBalancer service type is used to expose the Kubernetes service outside of the cluster for north-south traffic. When using Google Distributed Cloud, an IngressGateway is also created in the cluster by default. You can't create LoadBalancer-type services for Google Distributed Cloud in manual mode. Instead, you can either create an Ingress object that uses the IngressGateway or create NodePort-type services and manually configure your external load balancer to use the Kubernetes service as a backend.

For Service Management, also referred to as east-west traffic, we recommend using Cloud Service Mesh. Cloud Service Mesh is based on Istio open APIs and provides uniform observability, authentication, encryption, fine-grained traffic controls, and other features and functions. For more information about Service Management, see Cloud Service Mesh.

Persistence and state management

Google Distributed Cloud on bare metal is largely dependent on existing infrastructure for ephemeral storage, volume storage, and PersistentVolume storage. Ephemeral data uses the local disk resources on the node where the Kubernetes Pod is scheduled. For persistent data, GKE Enterprise is compatible with the Container Storage Interface (CSI), an open-standard API that many storage vendors support. For production storage, we recommend installing a CSI driver from a GKE Enterprise Ready storage partner. For the full listing of GKE Enterprise Ready storage partners, see GKE Enterprise Ready storage partners.

For more information about storage, see Configuring storage.

Databases

Google Distributed Cloud doesn't provide additional database-specific capabilities beyond the standard capabilities of the GKE Enterprise platform. Most databases run on an external data management system. Workloads on the GKE Enterprise platform can also be configured to connect to any accessible external databases.

Observability

Google Cloud Observability collects logs and monitoring metrics for Google Distributed Cloud clusters in a way that is similar to the collection and monitoring policies of GKE clusters. By default, the cluster logs are sent to Cloud Logging and the system component metrics are sent to Cloud Monitoring. To have Google Cloud Observability collect application logs and metrics, enable the clusterOperations.enableApplication option in the cluster configuration YAML file.

For more information about observability, see Configuring logging and monitoring.

Use case: Cymbal Bank deployment

For this guide, the Cymbal Bank/Bank of Anthos application is used to simulate the planning, platform deployment, and application deployment process for Google Distributed Cloud on bare metal.

The remainder of this document consists of three sections. The Planning section outlines the decisions made based on the options discussed in the architecture model sections. The Platform deployment section discusses the scripts and modules that are provided by a source repository to deploy the GKE Enterprise platform. Finally, in the Application deployment section, the Cymbal Bank application is deployed on the platform.

This Google Distributed Cloud guide can be used to deploy to self-managed hosts or Compute Engine instances. By using Google Cloud resources, anyone can complete this guide without needing access to physical hardware. The use of Compute Engine instances is for demonstration purposes only. Do NOT use these instances for production workloads. When access to physical hardware is available and the same IP address ranges are used, you can use the provided source repository as-is. If the environment differs from what is outlined in the Planning section, you can modify the scripts and modules to accommodate the differences. The associated source repository contains instructions for both the physical hardware and the Compute Engine instance scenarios.

Planning

The following sections detail the architectural decisions made while planning and designing the platform for the deployment of the Cymbal Bank application on Google Distributed Cloud. These sections focus on a production environment. To build lower environments like development or staging, you can use similar steps.

Google Cloud projects

When creating projects in Google Cloud for Google Distributed Cloud, a fleet host project is required. Additional projects are recommended for each environment or business function. This project configuration lets you organize resources based on the persona that is interacting with the resource.

The following subsections discuss the recommended project types and the personas associated with them.

Hub project

The hub project hub-prod is for the network administrator persona. This project is where the on-premises data center is connected to Google Cloud using your selected form of hybrid connectivity. For more information about hybrid connectivity options, see Google Cloud Connectivity.

Fleet host project

The fleet host project fleet-prod is for the platform administrator persona. The project is where the Google Distributed Cloud clusters are registered. This project is also where the platform-related Google Cloud resources reside. These resources include Google Cloud Observability, Cloud Source Repositories, and others. A given Google Cloud project can only have a single fleet (or no fleets) associated with it. This restriction reinforces using Google Cloud projects to provide stronger isolation between resources that are not governed or consumed together.

Application or team project

The application or team project app-banking-prod is for the developer persona. This project is where application-specific or team-specific Google Cloud resources reside. The project includes everything except GKE clusters. Depending on the number of teams or applications, there might be multiple instances of this project type. Creating separate projects for different teams lets you separately manage IAM, billing, and quota for each team.

Networking

Each Google Distributed Cloud cluster requires the following IP address subnets:

Node IP addresses
Kubernetes Pod IP addresses
Kubernetes service/cluster IP addresses
Load balancer IP addresses (bundled mode)

To use the same non-routable IP address ranges for the Kubernetes Pod and service subnets in each cluster, select an island mode network model. In this configuration, Pods can directly talk to each other inside a cluster, but can't be reached directly from outside a cluster (using Pod IP addresses). This configuration forms an island within the network that isn't connected to the external network. The clusters form a full node-to-node mesh across the cluster nodes within the island, letting the Pod directly reach other Pods within the cluster.

IP address allocation

Cluster	Node	Pod	Services	Load balancer
metal-admin-dc1-000-prod	10.185.0.0/24	192.168.0.0/16	10.96.0.0/12	N/A
metal-user-dc1a-000-prod	10.185.1.0/24	192.168.0.0/16	10.96.0.0/12	10.185.1.3-10.185.1.10
metal-user-dc1b-000-prod	10.185.2.0/24	192.168.0.0/16	10.96.0.0/12	10.185.2.3-10.185.2.10
metal-admin-dc2-000-prod	10.195.0.0/24	192.168.0.0/16	10.96.0.0/12	N/A
metal-user-dc2a-000-prod	10.195.1.0/24	192.168.0.0/16	10.96.0.0/12	10.195.1.3-10.195.1.10
metal-user-dc2b-000-prod	10.195.2.0/24	192.168.0.0/16	10.96.0.0/12	10.195.2.3-10.195.2.10

In island mode, it's important to ensure that the IP address subnets chosen for the Kubernetes Pods and services aren't in use or routable from the node network.
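As a minimal sketch of how this allocation could be sanity-checked before deployment, the following Python snippet uses the standard `ipaddress` module with the values from the preceding table. The `check_plan` helper is hypothetical and isn't part of the source repository. It only checks for overlap with the node subnets listed here, and it assumes that each load balancer range should sit inside its cluster's node subnet, as it does in the table. It can't tell you whether a range is routable elsewhere in your network.

```python
import ipaddress

# Pod and service ranges from the IP address allocation table. They repeat in
# every cluster because island mode keeps them local to each cluster.
POD_CIDR = "192.168.0.0/16"
SERVICE_CIDR = "10.96.0.0/12"

# cluster name -> (node subnet, load balancer range or None for admin clusters)
CLUSTERS = {
    "metal-admin-dc1-000-prod": ("10.185.0.0/24", None),
    "metal-user-dc1a-000-prod": ("10.185.1.0/24", ("10.185.1.3", "10.185.1.10")),
    "metal-user-dc1b-000-prod": ("10.185.2.0/24", ("10.185.2.3", "10.185.2.10")),
    "metal-admin-dc2-000-prod": ("10.195.0.0/24", None),
    "metal-user-dc2a-000-prod": ("10.195.1.0/24", ("10.195.1.3", "10.195.1.10")),
    "metal-user-dc2b-000-prod": ("10.195.2.0/24", ("10.195.2.3", "10.195.2.10")),
}


def check_plan(clusters, pod_cidr=POD_CIDR, service_cidr=SERVICE_CIDR):
    """Return a list of human-readable problems; an empty list means none found."""
    island = {
        "Pod": ipaddress.ip_network(pod_cidr),
        "service": ipaddress.ip_network(service_cidr),
    }
    problems = []
    if island["Pod"].overlaps(island["service"]):
        problems.append("Pod and service ranges overlap")
    seen = {}
    for name, (node_cidr, lb_range) in clusters.items():
        node_net = ipaddress.ip_network(node_cidr)
        # Island-mode ranges must not collide with any node subnet.
        for label, net in island.items():
            if node_net.overlaps(net):
                problems.append(f"{name}: node subnet overlaps the {label} range")
        # Node subnets must be unique across clusters.
        for other, other_net in seen.items():
            if node_net.overlaps(other_net):
                problems.append(f"{name}: node subnet overlaps {other}")
        seen[name] = node_net
        # Assumption: the load balancer range lies inside the node subnet.
        if lb_range is not None:
            first, last = (ipaddress.ip_address(a) for a in lb_range)
            if first > last or first not in node_net or last not in node_net:
                problems.append(f"{name}: load balancer range is outside the node subnet")
    return problems


print(check_plan(CLUSTERS) or "no conflicts found")
```

If your environment uses different ranges, you can edit the constants and rerun the check before you modify the deployment scripts.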

Network requirements

To provide an integrated load balancer for Google Distributed Cloud that doesn't require configuration, use the bundled load balancer mode in each cluster. When workloads create LoadBalancer-type services, an IP address is assigned from the load balancer pool.

To read detailed information about the bundled load balancer's requirements and configuration, see Overview of load balancers and Configuring bundled load balancing.

Cluster architecture

For a production environment, we recommend using an admin and user cluster deployment model with an admin cluster and two user clusters in each geographical location to achieve the greatest redundancy and fault tolerance for Google Distributed Cloud.

We recommend using a minimum of four user clusters for each production environment. Use two geographically redundant locations that each contain two fault-tolerant clusters. Each fault-tolerant cluster has redundant hardware and redundant network connections. Decreasing the number of clusters reduces either the redundancy or the fault tolerance of the architecture.

To help ensure high availability, the control plane for each cluster uses three nodes. With a minimum of three worker nodes per user cluster, you can distribute workloads across those nodes to lower the impact if a node goes offline. The number and sizing of worker nodes are largely dependent on the type and number of workloads that run in the cluster. The recommended sizing for each of the nodes is discussed in Configuring hardware for Google Distributed Cloud.

The following table describes the recommended node sizing for CPU cores, memory, and local disk storage in this use case.

Node sizing
Node type	CPUs/vCPUs	Memory	Storage
Control plane	8 core	32 GiB	256 GiB
Worker	8 core	64 GiB	512 GiB

For more information about machine prerequisites and sizing, see Cluster node machine prerequisites.
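To give a rough sense of the minimum hardware footprint that these recommendations imply, the following Python sketch multiplies out the cluster counts and the node sizing table. It is an estimate under stated assumptions, not an official sizing tool: it assumes that admin clusters run only their three control plane nodes, that the bundled load balancer shares the control plane nodes instead of using a dedicated node pool, and that each user cluster runs exactly the minimum of three worker nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSize:
    cpus: int
    memory_gib: int
    storage_gib: int


# Values from the node sizing table.
CONTROL_PLANE = NodeSize(cpus=8, memory_gib=32, storage_gib=256)
WORKER = NodeSize(cpus=8, memory_gib=64, storage_gib=512)

# Values from the cluster architecture recommendations.
LOCATIONS = 2
ADMIN_CLUSTERS_PER_LOCATION = 1
USER_CLUSTERS_PER_LOCATION = 2
CONTROL_PLANE_NODES_PER_CLUSTER = 3
WORKER_NODES_PER_USER_CLUSTER = 3  # the stated minimum


def footprint() -> dict:
    """Return cluster, node, and resource totals for one production environment."""
    user_clusters = LOCATIONS * USER_CLUSTERS_PER_LOCATION
    clusters = user_clusters + LOCATIONS * ADMIN_CLUSTERS_PER_LOCATION
    cp_nodes = clusters * CONTROL_PLANE_NODES_PER_CLUSTER
    # Assumption: only user clusters run worker nodes.
    worker_nodes = user_clusters * WORKER_NODES_PER_USER_CLUSTER
    return {
        "clusters": clusters,
        "control_plane_nodes": cp_nodes,
        "worker_nodes": worker_nodes,
        "cpus": cp_nodes * CONTROL_PLANE.cpus + worker_nodes * WORKER.cpus,
        "memory_gib": cp_nodes * CONTROL_PLANE.memory_gib
        + worker_nodes * WORKER.memory_gib,
        "storage_gib": cp_nodes * CONTROL_PLANE.storage_gib
        + worker_nodes * WORKER.storage_gib,
    }


totals = footprint()
for key, value in totals.items():
    print(f"{key}: {value}")
```

Under these assumptions, the sketch reports 6 clusters, 18 control plane nodes, and 12 worker nodes, for a total of 240 cores, 1344 GiB of memory, and 10752 GiB of local storage. Real deployments typically need more worker nodes, depending on the workloads.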

Identity management

For identity management, we recommend an integration with OIDC through GKE Identity Service. In the examples provided in the source repository, local authentication is used to simplify the requirements. If OIDC is available, you can modify the example to use it. For more information, see Identity management with OIDC in Google Distributed Cloud.

Security and policy management

In the Cymbal Bank use case, Config Sync and Policy Controller are used for policy management. A Cloud Source Repositories repository is created to store the configuration data that Config Sync uses. The ConfigManagement operator, which is used to install and manage Config Sync and Policy Controller, needs read-only access to the configuration source repository. To grant that access, use a form of acceptable authentication. In this example, a Google service account is used.

Services

For Service Management in this use case, Cloud Service Mesh is used to provide a base on which distributed services are built. By default, an IngressGateway is also created in the cluster, which handles standard Kubernetes Ingress objects.

Persistence and state management

Because persistent storage is largely dependent on existing infrastructure, this use case doesn't require it. In other cases, however, we suggest using storage options from GKE Enterprise Ready storage partners. If a CSI storage option is available, it can be installed on the cluster using the vendor-provided instructions. For proof of concept and advanced use cases, you can use local volumes. However, for most use cases, we don't recommend using local volumes in production environments.

Databases

Many stateful applications on Google Distributed Cloud use databases as their persistence store. A stateful database application needs access to a database to provide its business logic to clients. There are no restrictions on the type of datastore used by Google Distributed Cloud. Data-storage decisions, therefore, should be made by the developer or by associated data management teams. Because different applications might require different datastores, those datastores can be used without limitation as well. Databases can be managed in-cluster, on-premises, or even in the cloud.

The Cymbal Bank application is a stateful application that accesses two PostgreSQL databases. Database access is configured through environment variables. The PostgreSQL database needs to be accessible from the nodes running the workloads, even if the database is managed externally from the cluster. In this example, the application accesses an existing, external PostgreSQL database. While the application runs on the platform, the database is managed externally. As such, the database isn't part of the GKE Enterprise platform. If a PostgreSQL database is available, use it. If not, create and use a Cloud SQL database for the Cymbal Bank application.
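The following Python sketch illustrates the general pattern of configuring database access through environment variables. It is not the Cymbal Bank implementation: the `postgres_uri` helper, the `ACCOUNTS_DB` and `LEDGER_DB` prefixes, the variable suffixes, and the host names are all hypothetical and chosen only for illustration. The sketch assumes that each of the two databases is described by its own set of variables and that the default PostgreSQL port 5432 applies when no port is set.

```python
import os
from typing import Mapping
from urllib.parse import quote


def postgres_uri(env: Mapping[str, str], prefix: str) -> str:
    """Build a PostgreSQL connection URI from variables that share a prefix."""
    host = env[f"{prefix}_HOST"]
    port = env.get(f"{prefix}_PORT", "5432")  # default PostgreSQL port
    name = env[f"{prefix}_NAME"]
    # Percent-encode credentials so characters such as "@" or "/" stay valid.
    user = quote(env[f"{prefix}_USER"], safe="")
    password = quote(env[f"{prefix}_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Hypothetical values; a real workload would read os.environ instead.
example_env = {
    "ACCOUNTS_DB_HOST": "db.example.internal",
    "ACCOUNTS_DB_NAME": "accounts",
    "ACCOUNTS_DB_USER": "app",
    "ACCOUNTS_DB_PASSWORD": "p@ss/word",
    "LEDGER_DB_HOST": "db.example.internal",
    "LEDGER_DB_PORT": "5433",
    "LEDGER_DB_NAME": "ledger",
    "LEDGER_DB_USER": "app",
    "LEDGER_DB_PASSWORD": "secret",
}

accounts_uri = postgres_uri(example_env, "ACCOUNTS_DB")
ledger_uri = postgres_uri(example_env, "LEDGER_DB")
print(accounts_uri)
print(ledger_uri)
```

Because the connection details live outside the application code, the same workload can point at an existing on-premises PostgreSQL database or at a Cloud SQL database without being rebuilt.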

Observability

Each cluster in the Cymbal Bank use case is configured to have Google Cloud Observability collect logs and metrics for both the system components and applications. The Google Cloud console installer creates several Cloud Monitoring dashboards, which you can view from the Monitoring dashboards page. For more information about observability, see Configuring logging and monitoring and How Logging and Monitoring for Google Distributed Cloud works.

Platform deployment

For more information, see the Deploy the Platform section of the documentation in the GitHub source repository.

Application deployment

For more information, see the Deploy the Application section of the documentation in the GitHub source repository.

What's next

Read more about Cloud Service Mesh, Config Sync, and Policy Controller.
Look at some of the other GKE Enterprise reference architectures.
Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.