From open source to managed services: Maisons du Monde’s service mesh journey
Michaël Lopez
Cloud Product Manager, Maisons du Monde
Guillaume Marceau
Lead SRE, Maisons du Monde
Maisons du Monde is the European leader in inspiring and affordable homes. As a brand characterized by openness and dialogue, it unites its 7.5 million customers around desirable and sustainable lifestyles. Atmospheres for the home across multiple styles can be found in its constantly renewed range of furniture and decor. With optimism, creativity, commitment and proximity as its core values, the brand is based on a high-performance, omni-channel model. With its digitalization of sales and customer service, nothing can stand in the way of this love brand and its company purpose: “To inspire people to be open to the world, so that together we can create unique, welcoming and sustainable places to live.”

With 360 stores across France, Italy, Spain, Belgium, Luxembourg, Germany, Switzerland, and Portugal, our systems and website cater to traffic from across Europe. Our team of Operations Engineers and Site Reliability Engineers (SREs) manages the MDM website and ensures the availability of our services. Our APIs and systems run omnichannel services such as orders, users, and carriers. Our cloud migration aims to increase visibility into our network environment, maximize our engineering resources, and provide our customers with a faster, more efficient website experience.
1.0 Migration challenges
In 2018, when Maisons du Monde made the decision to modernize our infrastructure, we knew containerization was our best option. The flexibility and scalability of Kubernetes, continuously optimized with Google Kubernetes Engine (GKE), enables our technical teams to use real-time data to improve our customer’s experience while providing a higher quality of service.
As we began our migration to GKE, we started exploring the full power of microservices in Kubernetes, something our limited on-premises environment could not offer. We quickly realized we needed a service mesh: a networking layer that helps our developers secure, connect, and monitor our new microservices so they can communicate and share data. Placing a service mesh on top of our GKE cluster allows us to accelerate innovation in three key areas:
- Traffic Management
- Security
- Observability
We began by deeply assessing a service mesh’s role in our network environment, from improving our security posture to providing advanced routing, and aligning our priorities with the company’s business goals.
1.1 Why we need to solve these challenges
While the migration to Kubernetes and containers gave us flexibility and agility, it also meant regaining visibility into, and control over, certain aspects of our network environment. The first key feature we focused on is traffic management. Why? Because it returns ownership of an application’s traffic management to the development teams. Both Istio and Anthos Service Mesh (ASM) come with CRDs (Custom Resource Definitions) that extend GKE objects. With those new CRDs, we split load balancing management, owned by the Ops team, from traffic and routing management, which is specific to the application and owned by the development team. With these changes, traffic management, previously handled by advanced load balancing tools, became fully customizable.
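As a sketch of that ownership split, with hypothetical hostnames and service names, the Ops team might own a shared Gateway object while a development team owns the VirtualService bound to it:

```yaml
# Owned by the Ops team: binds the shared ingress gateway to a host.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: shop-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: shop-tls-cert   # hypothetical TLS secret
    hosts:
    - "shop.example.com"
---
# Owned by the development team: routing rules for their own service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
  namespace: orders
spec:
  hosts:
  - "shop.example.com"
  gateways:
  - istio-system/shop-gateway
  http:
  - match:
    - uri:
        prefix: /orders
    route:
    - destination:
        host: orders.orders.svc.cluster.local
        port:
          number: 8080
```

The development team can then change routing, retries, or traffic splits in their VirtualService without touching the load balancer or the Gateway the Ops team maintains.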
2.0 Why Istio
We began by evaluating several tools that could fit our needs:
- GCE Ingress Controller
- Nginx Ingress Controller
- Traefik
- Istio
Our goal was to provide our teams with modern and manageable tools. We wanted to solve some of the Kubernetes pain points while offering our developers new features, and additional tools for testing, security enhancement and advanced traffic management.
With our requirements top of mind, we drafted a series of criteria that the tool needed to meet:
- A unique endpoint (Ingress) for multiple services and namespaces
- Network observability inside the cluster, between services, and beyond
- An active product and developer community
- Traffic management inside and outside the cluster
- Security policies between services
- Solution/setup cost
After some POCs and analyses, Istio emerged as the leading candidate, at least at the start.
2.1 Deploying Istio on Google Cloud
From day one, Istio’s service mesh offering met a number of our requirements. Monitoring each release, we began selecting features that could bring the most value to our teams. Here are the most valuable features we used at Maisons du Monde:
- Dedicated application URLs for isolated feature development
- Network gateway management mapped to a (unique) global or internal load balancer: a single gateway (ingress or egress) for multiple services
- JWT token verification from IAP authentication
- Communication control between services
- Egress TLS origination
- Fault injection (testing in progress)
- Canary deployment (testing in progress)
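To give a feel for the fault-injection feature we have been testing, here is a minimal sketch, with a hypothetical `carriers` service, of a VirtualService that delays a share of requests and aborts another share, letting us observe how dependent services behave under failure:

```yaml
# Inject artificial failures into traffic to the carriers service
# (service name and percentages are illustrative).
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: carriers-fault-test
  namespace: carriers
spec:
  hosts:
  - carriers
  http:
  - fault:
      delay:
        percentage:
          value: 10.0     # delay 10% of requests
        fixedDelay: 5s
      abort:
        percentage:
          value: 5.0      # fail 5% of requests
        httpStatus: 503
    route:
    - destination:
        host: carriers
```

Because the fault lives in a mesh object rather than application code, it can be applied and removed without redeploying the service.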
With Istio deployed in our network environment, we had granular controls over traffic management and security posture, ensuring that our applications and services were available and observable. Yet, despite the ability to enforce consistent policies and automate load balancing, our Istio deployment was missing some critical requirements for our modernization.
2.2 What challenges ASM solves over Istio
While Istio allows us to observe and secure our applications, it requires a significant investment from our engineers to maintain and update the platform. Because security remains a core pillar of our upgrade, our engineers had to constantly maintain Istio to preserve visibility into our network. That constant upgrade effort, however, conflicted with one of the primary goals of our cloud modernization: maximizing the efficiency of our engineering teams.
We considered Anthos Service Mesh (ASM) at the beginning of our journey, but at the time it was part of a package of features and functionalities that we felt were unnecessary for our environment, so we chose the open-source alternative, Istio. However, when Google began offering ASM as a standalone product, we jumped at the opportunity to prioritize the investment in our developers and divert their resources into accelerating innovation for our customers.
Thanks to ASM’s wide-ranging functionalities, our developer team no longer spends long hours manually configuring settings and optimizing clusters. ASM now provides:
- Managed version channels with auto-update
- A control plane (istiod) managed outside the cluster and monitored by Google
- A UI managed by Google (the ASM UI and Google Trace, which other GCP products such as load balancers also use)
- mTLS managed via Mesh CA
- GCP support
- Inter-cluster and/or cross-project connection and visibility
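To illustrate the managed-mTLS point: the same Istio PeerAuthentication API applies under ASM, with workload certificates issued by Mesh CA instead of a cluster-local CA. A mesh-wide STRICT policy, declared in the root namespace (assumed here to be the default `istio-system`), is enough to require mTLS everywhere:

```yaml
# Mesh-wide policy: only mTLS traffic is accepted between sidecars.
# Certificates behind this policy are issued and rotated by Mesh CA.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```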
3.0 ASM migration journey: From Istio 1.4 to ASM Managed control plane
When beginning our migration to the cloud, we deployed Istio v1.4 with plans to upgrade as new releases came out. Due to limited SRE resources, upgrading and optimizing Istio took time, delaying our ability to ship new features and harden our network security posture. While the first upgrades were painful, we began implementing canary upgrades for subsequent rollouts.
Starting with Istio v1.6, the upgrade process became easier: the community worked hard to simplify the rollout workload and avoid downtime or service interruptions. Everything was smooth until the upgrade from v1.10 to v1.11. We had been using ASM’s Terraform module to handle ASM deployment in our clusters, and when we wanted to upgrade to v1.11, the module changed to support only the managed control plane. Although we would have preferred to move to v1.11 first and activate the managed control plane afterwards, step by step, we decided to handle both changes at once.
3.1 Road to ASM Managed
Moving from Istio v1.9 to ASM 1.9 was seamless. However, it’s important to use ASM’s asmcli script to set up the additional ASM components; asmcli is a fork of the istioctl script that supports the ASM components managed by Google.
During our first attempt to integrate asmcli, we faced challenges because we didn’t want to use a managed control plane and were instead replicating our old istioctl workflow. We realized that to gain the full benefit of ASM, we needed to modernize our workloads to work natively with the features supported by Anthos Service Mesh.
Here are the steps to migrate from Istio to ASM Managed today:
- Upgrade your ASM installation to v1.10
- Migrate to ASM v1.11 with the managed control plane enabled
The migration process works smoothly and requires adjusting some configurations and your chosen release channel.
Before your migration, please note that the EnvoyFilter object is incompatible with ASM’s managed control plane. If you use it, you should find an alternative.
3.2 The day-to-day differences of a Managed Control Plane
The big advantage of a managed control plane is that istiod no longer runs inside the cluster. It’s now a hidden service fully managed by Google Cloud, including installation, maintenance, and upgrades.
Thanks to ASM Managed, our SRE engineers no longer need to devote their limited time and resources to preparing and optimizing our environment for an Istio update. In fact, ASM is already accelerating our ability to assess and update applications and services. With three channels available on ASM, we stagger our development and testing into segmented clusters:
- Rapid (latest Istio release)
- Regular (one release behind latest)
- Stable (two releases behind latest)
To maximize productivity, we use the Regular channel on our development cluster, while the Stable channel is used for quality assurance. This segmentation lets us test a recent version on a sandboxed cluster, manage releases that need specific configurations, and identify potential compatibility issues prior to release. When it’s time to roll out the upgrade, we know that ASM will handle orchestration and updates across the environment.
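Under the managed control plane, the channel a workload follows is selected with a revision label on its namespace; at the time of writing the revisions are `asm-managed-rapid`, `asm-managed` (Regular), and `asm-managed-stable`. A hypothetical development namespace pinned to the Regular channel might look like:

```yaml
# Sidecars injected into this namespace follow the Regular channel.
apiVersion: v1
kind: Namespace
metadata:
  name: orders-dev
  labels:
    istio.io/rev: asm-managed
```

Changing channels is then a matter of relabeling the namespace and restarting the workloads so their sidecars are re-injected.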
The one regret we have is losing visibility into the status of the Istio proxies (istioctl proxy-status); this command was useful for identifying problems inside the mesh when some Envoy proxies failed. To compensate, we use Prometheus metrics on each Envoy, but that adds some delay when debugging. The ASM Managed UI is continuously adding new features, and its seamless integration into the Google Cloud interface provides an out-of-the-box experience that makes cluster management a delight.
Recently, we had to secure exposed API endpoints for our partners. We wanted to use IAP authentication and ASM objects to secure and control which partners have access to sensitive data. Because our applications span cloud and on-premises environments, maintaining a hardened security posture is a challenge. To limit exposure to potential threat actors, we decided to restrict access to a single endpoint. With a single endpoint, requests pass through Istio objects regardless of whether the backend lives on-premises or in the GKE cluster. We verify the JWT token, check that the requester has the correct permissions to access the Istio object, and forward the request to the appropriate endpoint (on-premises or GKE cluster). The cherry on top of our sundae: all communications inside the mesh are secured by design, thanks to mutual TLS that is fully managed with ASM.
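A sketch of that JWT check at the gateway, with a hypothetical `/partner-api/` path prefix, could combine a RequestAuthentication (validating the JWT that IAP places in the `x-goog-iap-jwt-assertion` header) with an AuthorizationPolicy that rejects partner-API requests lacking a validated principal:

```yaml
# Validate JWTs minted by IAP at the ingress gateway.
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: partner-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
  - issuer: "https://cloud.google.com/iap"
    jwksUri: "https://www.gstatic.com/iap/verify/public_key-jwk"
    fromHeaders:
    - name: x-goog-iap-jwt-assertion
---
# Deny partner-API requests that carry no validated JWT principal.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: partner-api-require-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]
    to:
    - operation:
        paths: ["/partner-api/*"]
```

Finer-grained checks (which partner may reach which route) can then be layered on with additional AuthorizationPolicy rules matching JWT claims.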
Our migration experience has been a journey of twists and turns. While our implementation of Istio was successful, we were devoting too many engineering resources to routine maintenance and compatibility issues. When we assessed the cost/benefit of open source versus a managed service, it became apparent that the choice was between innovative development and managing our own environment. With ASM Managed, we know that Google Cloud can administer our ASM, so we can dedicate our developer resources to creative and innovative projects that provide value to our customers.
Looking back on our migration, we finally found our way to the path that was right for us, but it’s clear that there were bumps in the road and detours. Some key things to consider for your migration:
Utilize canary deployments when possible. While Kubernetes natively supports canary implementations, it can still be challenging to update and support rollouts in a heavy-traffic environment without a service mesh.
An active support and developer community are critical information, innovation, and technical acceleration resources. Contributing and engaging with your community is necessary to build connections with subject matter experts and establish your thought leadership brand.
Planning your engineering resources is critical at every step, and it’s never too early to reassess when you get new data. The time and well-being of your engineers can have severe effects on morale and productivity, so make sure not to waste them on projects that can be automated or managed.
As we continue the subsequent phases of our modernization with Google Cloud, we’re grateful for the close partnership and collaboration they’ve shown us. Innovation is at the heart of Maisons du Monde, from our workshops to the sales floor, and thanks to ASM Managed, it’s at the heart of our cloud environment.