Set up Anthos Service Mesh for multiple GKE clusters using Terraform

Anthos Service Mesh is a managed service mesh for Google Kubernetes Engine (GKE) clusters. Anthos Service Mesh allows GKE clusters to use a single logical service mesh, so that pods can communicate across clusters securely and services can share a single Virtual Private Cloud (VPC). 

Using Anthos Service Mesh requires GKE clusters and firewall rules. If private clusters are used, access to the GKE control plane must also be granted. Infrastructure-as-code (IaC) makes bootstrapping Anthos Service Mesh significantly easier. In this blog post, we explain the new features of Anthos Service Mesh and show how to implement it across two private GKE clusters using Terraform. We also provide automation scripts that give a guided tour of setting up a cloud environment.

For those who want to get started immediately, there is a Git repo with complete source code and README instructions. There are also bonus sections at the end on mesh traffic security scanning and external databases.

Supported version

The supported versions are Anthos Service Mesh 1.7 and 1.8. For more information on Anthos Service Mesh versions, please check the Anthos Service Mesh release notes.

Fig 3.1 - Anthos Service Mesh version release notes

Shared VPCs

Anthos Service Mesh 1.8 can be used for a single shared VPC, even across multiple projects. Please consult the documentation on Anthos Service Mesh 1.8 multi-cluster support for complete details:

Fig 3.2 - Anthos Service Mesh multi-cluster support

SSL/TLS termination

TLS termination for external requests is supported with Anthos Service Mesh 1.8. Doing so requires modifying the Anthos Service Mesh setup files.

You can set up Anthos Service Mesh using the install_asm script. A custom istio-operator.yaml file can be used by running install_asm with the --custom_overlay option.

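To control which external services workloads in Istio (i.e., Anthos Service Mesh) can reach, change the egress policy to REGISTRY_ONLY. In this mode, outbound traffic is blocked for any host that is not defined in the mesh's service registry. Please see the blocking-by-default Istio documentation for more details.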
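
As an illustration of what such an overlay might contain, the following minimal Python sketch builds an IstioOperator document that sets meshConfig.outboundTrafficPolicy.mode to REGISTRY_ONLY, plus a ServiceEntry that registers one external host so it stays reachable. The host api.example.com and the resource name example-api are hypothetical placeholders, and the overlay used in the repo may well differ. The documents are printed as JSON, which YAML parsers generally accept, to avoid a third-party YAML dependency.

```python
import json


def build_overlay() -> dict:
    """IstioOperator overlay that blocks egress to hosts missing from the mesh registry."""
    return {
        "apiVersion": "install.istio.io/v1alpha1",
        "kind": "IstioOperator",
        "spec": {
            "meshConfig": {
                "outboundTrafficPolicy": {"mode": "REGISTRY_ONLY"},
            },
        },
    }


def build_service_entry(name: str, host: str, port: int = 443) -> dict:
    """ServiceEntry that registers one external host so sidecars may reach it."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "ServiceEntry",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "location": "MESH_EXTERNAL",
            "resolution": "DNS",
            "ports": [{"number": port, "name": "tls", "protocol": "TLS"}],
        },
    }


if __name__ == "__main__":
    # Hypothetical names; replace with the external hosts your workloads need.
    print(json.dumps(build_overlay(), indent=2))
    print(json.dumps(build_service_entry("example-api", "api.example.com"), indent=2))
```

The first document could be saved to a file and passed to install_asm with the --custom_overlay option; the ServiceEntry would be applied to the cluster separately.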

For TLS termination of requests to Prisma Cloud (Twistlock), please see the section on Prisma Cloud below.

Security

Anthos Service Mesh has inherent security features (and limitations), as described in the security overview documentation. Additionally, please follow the GKE best practices for security.

NOTE: Anthos Service Mesh inherently implements Istio security best practices, such as namespaces and limited service accounts. Workload identity is an optional GKE-specific service account, limited to a namespace.

The Istio ingress gateway needs to be secured manually. Please see the Secure Gateways Istio documentation for more details.
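
For orientation, here is a minimal Python sketch of the kind of Gateway resource that the Secure Gateways documentation describes: an HTTPS server on port 443 that terminates TLS (mode SIMPLE) using a certificate stored in a Kubernetes secret. The gateway name, host and secret name are hypothetical, and the secret is assumed to already exist in the ingress gateway's namespace.

```python
import json


def build_secure_gateway(name: str, host: str, credential_name: str) -> dict:
    """Istio Gateway that terminates TLS at the ingress gateway for one host."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Gateway",
        "metadata": {"name": name},
        "spec": {
            # Binds to the default Istio ingress gateway deployment.
            "selector": {"istio": "ingressgateway"},
            "servers": [
                {
                    "port": {"number": 443, "name": "https", "protocol": "HTTPS"},
                    # SIMPLE = standard one-way TLS; the certificate comes from
                    # the Kubernetes secret named by credentialName.
                    "tls": {"mode": "SIMPLE", "credentialName": credential_name},
                    "hosts": [host],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names used for illustration only.
    gateway = build_secure_gateway("shop-gateway", "shop.example.com", "shop-tls-cert")
    print(json.dumps(gateway, indent=2))
```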

For security scanning of GKE cluster ingress, please see the section on Prisma Cloud below.

Container workload security

GKE cluster network policies allow you to define workload access across pods and namespaces. This is built on top of the Kubernetes NetworkPolicy API. There is also a helpful tutorial on configuring GKE network policies for applications.
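
To make the idea concrete, the sketch below builds a Kubernetes NetworkPolicy manifest in Python. It allows ingress to a set of target pods only from pods carrying a given label, on a single TCP port. The namespace, labels and port are hypothetical, and the policy only takes effect on clusters where network policy enforcement is enabled.

```python
import json
from typing import Dict


def build_network_policy(
    name: str,
    namespace: str,
    target_labels: Dict[str, str],
    allowed_labels: Dict[str, str],
    port: int,
) -> dict:
    """NetworkPolicy admitting ingress to target pods only from allowed pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy applies to.
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Only pods in the same namespace with these labels may connect.
                    "from": [{"podSelector": {"matchLabels": allowed_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical example: only "frontend" pods may reach "backend" pods on 8080.
    policy = build_network_policy(
        "backend-allow-frontend", "demo", {"app": "backend"}, {"app": "frontend"}, 8080
    )
    print(json.dumps(policy, indent=2))
```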

There are detailed steps for securing container workloads in GKE. This involves a layered approach to node security, pod/container security contexts and pod security policies. In addition, Google Cloud's Container-Optimized OS (both cos and cos_containerd) applies the default Docker AppArmor security policies to all containers started by Kubernetes.

Container runtime (Containerd)

We recommend using the cos_containerd runtime for GKE clusters using Anthos Service Mesh. The current Docker container runtime is being phased out of GKE. Adopting cos_containerd now avoids having to migrate later.

Using Containerd as the container runtime still allows developers to use Docker to build containers. However, the following can cause conflicts when migrating from Docker to Containerd:

  • running privileged Pods executing Docker commands

  • running scripts on nodes outside of Kubernetes infrastructure (for example, using ssh to troubleshoot issues)

  • using third-party tools that perform such similarly privileged operations

  • using tooling that was configured to react to Docker-specific log messages in your monitoring system

To avoid such conflicts, we recommend a canary deployment of your clusters with cos_containerd. You can find instructions for canary deployments in the above-linked migration documentation.
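
One possible way to stage such a canary, sketched here as an assumption rather than the documented procedure, is to add a small node pool that uses the cos_containerd image next to the existing pools and move a few workloads onto it first. The Python helper below only assembles the gcloud command line; the cluster, zone and pool names are hypothetical.

```python
import shlex
from typing import List


def canary_node_pool_command(
    cluster: str,
    zone: str,
    pool: str = "canary-containerd",
    num_nodes: int = 1,
) -> List[str]:
    """Build a gcloud command that creates a small cos_containerd node pool."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--zone={zone}",
        "--image-type=COS_CONTAINERD",
        f"--num-nodes={num_nodes}",
    ]


if __name__ == "__main__":
    # Hypothetical cluster and zone; print the command instead of running it.
    print(shlex.join(canary_node_pool_command("my-cluster", "us-central1-a")))
```

Printing the command rather than executing it keeps the sketch safe to run; the same argument list could be handed to subprocess.run once the values have been reviewed.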

Security scanning with Prisma Cloud (formerly Twistlock)

To do a security scan of the pod traffic on Anthos Service Mesh, you can use Palo Alto Networks’ Prisma Cloud (formerly Twistlock), a cloud security posture management (CSPM) and cloud workload protection platform (CWPP) that provides multi-cloud visibility and threat detection. Please consult the Prisma Cloud admin guide (latest as of January 7, 2021) for more details.

Prisma Cloud setup

For setup instructions, please see the Twistlock folder README file in the anthos-service-mesh-multicluster source code repository. The table below contains links to the official Prisma Cloud setup documentation.

Table 4.1 - Prisma Versions

TLS termination

Prisma Cloud TLS requests are terminated at the Prisma Cloud console. When a request comes from Prisma Cloud SaaS to a Twistlock container, the API call is also terminated with a TLS certificate.

External databases with Google Cloud SQL for PostgreSQL

Many organizations wish to establish external database connectivity to their Anthos Service Mesh environment. One common example uses Google Cloud SQL for PostgreSQL (Cloud SQL).

Cloud SQL is external to GKE, thus requiring GKE to do SSL termination for external services. With Anthos Service Mesh, you can use an Istio ingress gateway, which allows SSL passthrough, so that the server certificates can reside in a container. However, this approach is problematic for many PostgreSQL databases.

PostgreSQL uses application-level protocol negotiation for SSL connections, whereas the Istio proxy currently uses TCP-level protocol negotiation. As a result, the Istio proxy sidecar errors out during the SSL handshake when it tries to auto-encrypt the connection with PostgreSQL. Fortunately, Cloud SQL can itself host a sidecar for TLS termination.
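
The mismatch can be seen at the byte level. In the PostgreSQL wire protocol, a client that wants SSL first sends a plaintext 8-byte SSLRequest message and waits for a one-byte reply before any TLS handshake starts, whereas a TLS connection negotiated at the TCP level begins directly with a handshake record (first byte 0x16). The Python sketch below is a simplified illustration of that difference, not a description of how the Istio proxy is implemented; the helper names are hypothetical.

```python
import struct

# SSLRequest code from the PostgreSQL frontend/backend protocol: (1234 << 16) | 5679.
SSL_REQUEST_CODE = 80877103
# Content type of a TLS handshake record, which carries the ClientHello.
TLS_HANDSHAKE_RECORD = 0x16


def postgres_ssl_request() -> bytes:
    """Plaintext message a PostgreSQL client sends before starting TLS."""
    # Two big-endian 32-bit integers: message length (8) and the request code.
    return struct.pack("!II", 8, SSL_REQUEST_CODE)


def looks_like_tls_client_hello(first_bytes: bytes) -> bool:
    """True if the stream starts the way a direct TLS handshake would."""
    return len(first_bytes) > 0 and first_bytes[0] == TLS_HANDSHAKE_RECORD


if __name__ == "__main__":
    request = postgres_ssl_request()
    print(request.hex())
    print(looks_like_tls_client_hello(request))
```

A peer that expects TLS from the very first byte therefore sees something that is not a ClientHello, which is consistent with the handshake failure described above.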

For setup instructions, please see the postgres folder README file in the anthos-service-mesh-multicluster source code repository.

Towards federated clusters

Anthos Service Mesh 1.7 and 1.8 can now federate multiple GKE clusters. Taken as "managed Istio" in a single VPC, this container orchestration model takes GKE to its full potential, and can be configured using tools like Terraform and shell scripts that are available in the anthos-service-mesh-multicluster Git repo.

If you have not already tried out the sample code, please navigate to the Git repo and do so. This is a good next step, as the README files are detailed and instructive. Learning by doing is an effective way to understand Anthos Service Mesh. In addition, the Terraform code uses the latest Google Cloud modules, giving you valuable tools for your toolbox.

We encourage you to make contributions to the Git repo, using Google Cloud Professional Services’ contributing instructions.


NOTE: As of November 12, 2020, Anthos Service Mesh, Mesh CA and the Anthos Service Mesh dashboards in Google Cloud Console are available for any GKE customer and do not require the purchase of Anthos. See pricing for details.
[1] Prisma Cloud SaaS Version Administrator's Guide
[2] Twistlock Reference Architecture