Securing apps for Googlers using Anthos Service Mesh
David Challoner
Site Reliability Engineer
Anthony Bushong
Developer Relations Engineer
Hi there! I'm David Challoner from Access Site Reliability Engineering (SRE), here with Anthony Bushong from Developer Relations to talk about how Corp Eng is adopting Anthos Service Mesh internally at Google.
Corp Eng is Google's take on "Enterprise IT". A big part of the Corp Eng mission is running the first- and third-party software that powers internal business processes - from legal and finance to floor planning and even the app hosting our cafe menus - all with the same security and production standards as any of Google's first-party applications.
Googlers need to access these applications, which sometimes then need to access other applications or other Google Cloud services. This traffic can cross different trust boundaries which can trigger different policies.
Access SRE runs the systems that mediate this access, and we implemented Anthos Service Mesh as part of our solution to secure the way Googlers access these applications.
But why?
As you can probably tell, the applications Corp Eng is responsible for have disparate requirements. This often means that certain applications are tied to particular infrastructure for legal, business, or technical reasons - which can be challenging when those infrastructures work and operate differently.
Enter Anthos. Google Cloud built Anthos to provide a consistent platform interface unifying the experience of working with apps on these varying underlying infrastructures, with the Kubernetes API at its foundation.
So when searching for the right tool to build a common authorization framework to mediate access to Corp Eng services, we turned to Anthos - specifically Anthos Service Mesh, powered by the open-source project Istio. Whether these services were deployed in Google Cloud, in Corp Eng data centers, or at the edge onsite at actual Google campuses, Anthos Service Mesh delivered a consistent means for us to program secure connectivity.
To frame the impact ASM had on our organization, it's helpful to introduce the roles of the folks who manage and use it:
Figure 1 - Anthos Service Mesh empowers multiple people across different roles to connect services securely
For security stakeholders, ASM provides an extensible policy enforcement point running next to each application, capable of provisioning a certificate based on the identity of the workload and enforcing mandatory, fine-grained, application-aware access controls.
For platform operators, ASM is delivered as a managed product, which reduces operational overhead by providing out-of-the-box release channels, maintenance windows, and a published Service Level Objective (SLO).
For service owners, ASM enables the decoupling of their applications from networking concerns, while also providing features like rate limiting, load shedding, request tracing, monitoring, and more. Features like these were typically only available for applications that ran on Borg, Google's first-party cluster manager that ultimately inspired the creation of Kubernetes.
In sum, we were able to secure access to a plethora of different services with minimal operational overhead, all while providing service owners granular traffic control.
Let's see what this looks like in practice!
The architecture
Figure 2 - High-level architecture for Corp Eng services and Anthos
In this flow, user access first reaches the Google Cloud Global Load Balancer [1], configured with Identity-aware Proxy (IAP) and Cloud Armor. IAP is the publicly available implementation of Google's internal philosophy of BeyondCorp, providing an authentication layer that works from untrusted networks without the need for a VPN.
Once a user is authenticated, their request then flows to the Ingress Gateway provided by Anthos Service Mesh [2]. This provides additional checks that traffic flows to services only when the request has come through IAP, while also enforcing mutual TLS (mTLS) between the Anthos Service Mesh Gateway and the Corp services owned by various teams.
Finally, additional policies are enforced by the sidecar running in every single service Pod [3]. Policies are pulled from source control using Anthos Config Management [4], and are propagated to all sidecars by the managed control plane provided by Anthos Service Mesh [5].
Managing the mesh
If you're not familiar with how Istio works, it follows the pattern of a control plane and a data plane. We talked a little bit about the data plane - it is made up of the sidecar containers running alongside all of our service Pods. The control plane, however, is what's responsible for updating these sidecars with the policies we want to enforce:
Figure 3 - High-level architecture for Istio
Thus, it is critical for us to ensure that the control plane is healthy. This is where Anthos Service Mesh gives our platform owners a huge advantage with its support for a fully-managed control plane. To provision cloud resources, like many other companies, our organization uses Terraform, the popular open-source infrastructure as code project. This gave us a declarative and familiar means for provisioning the Anthos Service Mesh control plane.
First, you enable the managed control plane feature for GKE by creating the google_gke_hub_feature resource below using Terraform. Keep in mind that at publication time, this is only available via the google-beta provider in Terraform.
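A minimal sketch of what that resource can look like, assuming a placeholder fleet host project; the feature name and location follow the google-beta provider documentation at the time of writing:

```hcl
# Enables the managed Anthos Service Mesh feature for the fleet.
# The project ID is a placeholder for your fleet host project.
resource "google_gke_hub_feature" "servicemesh" {
  provider = google-beta

  name     = "servicemesh"
  location = "global"
  project  = "corp-eng-example-project" # placeholder
}
```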
Once created, we then provision a ControlPlaneRevision custom resource in a GKE cluster to spin up a managed control plane for ASM in that cluster:
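The custom resource looks roughly like the following sketch; the asm-managed name and regular channel are the documented defaults rather than necessarily the exact values we use:

```yaml
# Requests a Google-managed ASM control plane on the regular release channel.
apiVersion: mesh.cloud.google.com/v1beta1
kind: ControlPlaneRevision
metadata:
  name: asm-managed        # name corresponds to the chosen release channel
  namespace: istio-system
spec:
  type: managed_service
  channel: regular
```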
Using this custom resource, we are able to set the release channel for the ASM managed control plane. This allows our platform team to define the pace of upgrades in accordance with our needs.
In addition to managing the control plane, ASM also provides management functionality around the data plane to ensure each sidecar Envoy is kept up to date with the latest security updates and is compatible with the control plane - one less thing for service operators to worry about. It does this using Kubernetes Mutating Admission Webhooks and Namespace labels to modify our Pod workload definitions to inject the appropriate sidecar proxy version.
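For illustration, opting a namespace into managed sidecar injection is as simple as labeling it with the managed control plane revision; the namespace name below is hypothetical:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: corp-app                  # hypothetical service namespace
  labels:
    istio.io/rev: asm-managed     # mutating webhook injects the matching sidecar version
```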
Syncing mandatory access policies
With the core Anthos Service Mesh components in place, our security practitioners can define consistent, mandatory security policies for every single GKE cluster, using Istio APIs.
For example, one policy enforces strict mTLS between Pods using automatically provisioned workload identity certificates. Earlier, we talked about how mTLS is enforced between the Anthos Service Mesh Gateway and our services; that same policy enforces mTLS between all Pods in our clusters.
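Expressed with Istio's PeerAuthentication API, a mesh-wide version of this policy looks roughly like the following sketch applied in the root namespace:

```yaml
# Requires mTLS for all workload-to-workload traffic in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT
```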
Another policy we implement is denying all egress traffic by default, requiring service teams to explicitly declare their outbound dependencies. The following is an example of using an Istio Service Entry to allow granular access to a specific external service - in this case, Google. This helps prevent unintended access to external services.
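A sketch of such a ServiceEntry, assuming egress is otherwise blocked (for example, with the mesh's outbound traffic policy set to registry-only); the namespace is hypothetical:

```yaml
# Declares www.google.com as an allowed external dependency over TLS.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-google
  namespace: corp-app        # hypothetical service namespace
spec:
  hosts:
  - www.google.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS
```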
These policies are automatically synced to all service mesh namespaces in each cluster using Anthos Config Management. By using our internal source control system as a source of truth, Anthos Config Management can sync and reconcile policies across all of our GKE clusters, ensuring that these policies are in place for every single one of our services. You can find more details about our implementation of Anthos Config Management here.
With this in place, our team plans on eventually migrating away from security automation that operates solely based on explicit IP, port and protocol policies.
Integration with Identity-aware Proxy
The publicly available version of the BeyondCorp proxy used by Corp Eng is called Identity-aware Proxy (IAP), which offers an integration with Anthos Service Mesh. IAP allows you to authenticate users trying to access your services and apply Context-Aware Access policies. This integration comes with two main benefits:
Ensuring that user traffic to services in the service mesh comes only through Identity-aware Proxy
Enforcing Context-Aware Access (CAA) trust levels for devices, defined by multiple device signals we collect
Identity-aware Proxy allows us to capture this information in a Request Context Token (RCToken), a JSON Web Token (JWT) created by IAP that can be verified by ASM. IAP inserts this JWT into the Ingress-Authorization header. With an Istio Authorization Policy similar to the one below, any request without this JWT is denied:
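A sketch of what this could look like at the ingress gateway, pairing a RequestAuthentication that validates the RCToken with a deny rule for requests that carry no authenticated principal. The issuer and JWKS URI shown are IAP's published values at the time of writing; verify them against the current documentation:

```yaml
# Validates the IAP RCToken carried in the ingress-authorization header...
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: iap-rctoken
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
  - issuer: "https://cloud.google.com/iap"
    jwksUri: "https://www.gstatic.com/iap/verify/public_key-jwk"
    fromHeaders:
    - name: ingress-authorization
---
# ...and denies any request at the gateway without a valid request principal.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-missing-rctoken
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]
```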
Here is an example policy that requires a fullyTrustedDevice access level - this might be a device in your organization that is known to be corporate-owned, fully updated, and running an IT-approved configuration:
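A hypothetical version of such a policy, assuming the RCToken exposes the granted access levels in a nested google.access_levels claim and using a placeholder access policy number and namespace:

```yaml
# Allows requests only when the caller's device satisfies the
# fullyTrustedDevice access level (policy number and namespace are placeholders).
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-fully-trusted-device
  namespace: corp-app
spec:
  action: ALLOW
  rules:
  - when:
    - key: request.auth.claims[google][access_levels]
      values: ["accessPolicies/123456789012/accessLevels/fullyTrustedDevice"]
```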
This allows our security team not only to secure service-to-service communication and outbound calls from services, but also to require that incoming requests come from authenticated users on trusted devices.
Enabling service teams
As SREs, one of our priorities is ensuring that service-level indicators (SLIs), SLOs, and service-level agreements (SLAs) exist for services. Anthos Service Mesh helps us empower service owners to do this for their services, as it exposes horizontal request metrics like latency and availability for all services in the mesh.
Before Anthos Service Mesh, each application had to export these metrics separately (if at all). With ASM, service owners can easily define their service's SLOs in the Cloud Console or via Terraform using these horizontally exported metrics. This then allows us to integrate SLOs into our higher-level service definitions so we can enable SLO monitoring and alerting by default. See the SRE book for more details on SLOs and error budgets.
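As a rough sketch, an availability SLO defined in Terraform against the mesh-generated service metrics could look like the following, using the standard Cloud Monitoring google_monitoring_slo resource; the service ID is a placeholder:

```hcl
# A 99.9% availability SLO over a rolling 28-day window, built on the
# request metrics the mesh exports for every service.
resource "google_monitoring_slo" "availability" {
  # Placeholder: the ID of the ASM/Istio service as registered in Cloud Monitoring.
  service      = "corp-app-service-id"
  display_name = "99.9% availability over 28 days"

  goal                = 0.999
  rolling_period_days = 28

  basic_sli {
    availability {
      enabled = true
    }
  }
}
```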
The takeaway
ASM is a powerful tool that enterprises can use to modernize their IT infrastructure. It provides:
A shared environment-agnostic enforcement point to manage security policy
A unified way to provision identities and describe application dependencies
This also enables operational capabilities such as distributed tracing and incremental canary rollouts - capabilities that were previously difficult to find in the typical enterprise application landscape.
Because it can be adopted incrementally and composed with existing authorization systems to close gaps, barriers to adoption are low - and we recommend you start evaluating it today!