DevOps Awards winner Kakao Mobility on balancing speed and stability
Chief Technology Officer, Kakao Mobility
Kakao Mobility is the leading mobility service provider in South Korea, providing taxi services, turn-by-turn directions, public transport, parking space searching, and real-time traffic information to more than 30M users in South Korea. In this blog post, we’re highlighting Kakao Mobility for the DevOps achievements that earned them the ‘Optimizing for speed without sacrificing stability’ award in the 2022 DevOps Awards. If you want to learn more about the winners and how they used DORA metrics and practices to grow their businesses, start here.
Kakao Mobility customers rely heavily on our service, especially during the rush-hour commute, so we aim to meet a 100% uptime service level agreement (SLA). Because we operate a wide range of services, from real-time traffic monitoring and payment systems to driver booking, we need cloud infrastructure that can scale on demand as we continue to expand. At the time, the developer experience was stress-inducing because their environment lacked the scalability and flexibility of the cloud. When teams tried to deploy new services and features, scaling continually caused delays and bottlenecks, leading to frustrating app performance for users.
The unpredictable nature of city traffic, from fluctuating commute times to unforeseen accidents, meant that we often faced spikes in user traffic that hampered the customer experience. Furthermore, we provide our services via APIs to third-party platforms so they can integrate our wide range of traffic and mobility services into native app experiences. To keep these offerings secure, we also wanted to harden our API resiliency while maintaining our 100% SLA target for customers. Recognizing that any reduction in availability and responsiveness can lead to distrust from our consumer base and a loss in profits, we needed a fault-tolerant, resilient system that could scale on demand.
To provide users with a highly available app experience as we expand our market share and countries of operation and offer new features and services to customers, we needed to focus on reliability. Because reliability is a critical factor in ensuring that our commuting customers arrive at work on time, we are pursuing a multi-cloud strategy to maintain availability even in the worst of scenarios.
There are three main objectives of our migration strategy:
Decouple and modernize the application into containers and microservices on Kubernetes
Improve deployment speed while relieving developer burden
Improve the availability, reliability, and performance of the service
Speed + stability
Our multi-step migration began with modernizing our environment to build a multi-cloud, hybrid-cloud architecture. Over the course of 2021 and 2022, our teams refactored the application into microservices, migrated workloads to Google Cloud, and began using Anthos Service Mesh (ASM) as an API orchestration platform. The elasticity and scalability of cloud resources give the team a cost-effective, reliable solution.
Throughout 2022, we successfully migrated our flagship application to Google Cloud, implementing it on Google Kubernetes Engine (GKE) clusters to maintain scalability and reliability. Working closely with Google Cloud, our DevOps team has been modernizing its conventional services to support the new multi-cloud strategy spanning on-premises and Google Cloud.
Using Anthos Service Mesh, we deploy gateways to control the ingress and egress of traffic throughout the application, and separate the resources across several GKE clusters. APIs play a critical role in our ability to provide a reliable and scalable service to users, including in third-party, offline applications.
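To illustrate the gateway pattern described above, here is a minimal, hypothetical sketch of prefix-based routing, the kind of decision an ASM ingress gateway makes declaratively when it steers API traffic to separate GKE clusters. The route prefixes and cluster names are illustrative assumptions, not Kakao Mobility's actual configuration.

```python
# Hypothetical routing table: API path prefixes mapped to backend
# clusters, mimicking how an ingress gateway separates traffic
# across several GKE clusters. All names are illustrative.
ROUTES = {
    "/navi": "gke-routing-cluster",     # turn-by-turn directions
    "/taxi": "gke-dispatch-cluster",    # taxi booking
    "/parking": "gke-parking-cluster",  # parking space search
}

def route_request(path: str) -> str:
    """Return the backend cluster whose route prefix matches the path."""
    for prefix, cluster in ROUTES.items():
        if path.startswith(prefix):
            return cluster
    return "gke-default-cluster"  # catch-all for unmatched paths
```

In a real mesh this mapping lives in routing resources rather than application code, which is what lets operators change it without redeploying services.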
For major seasonal events where we expect significant increases in traffic, such as Korean traditional holidays, the Christmas holidays, and New Year's Eve, Google Cloud's Event Management Service (EMS) helps ensure reliability and availability. Working closely with the Google Cloud team not only helps us reinforce infrastructure to maintain stability throughout the entire seasonal event; we also run simulations and tabletop exercises to prepare the teams for any eventuality.
Thanks to Anthos Service Mesh, Kakao Mobility has modernized our IT environment and enhanced cloud security, with plans to adopt more Google Cloud services in the future.
Pod scaling time is no longer included in deployment time. The deployment manager does not need to wait for pods to start; the deployment is carried out purely through traffic control. The deployment manager can send as much traffic as needed to the new version and verify the change quickly. If a hotfix deployment is required, it can be completed by switching traffic within 10 seconds! In one real case where an urgent deployment was required, it took us only about 10 minutes to provision a new version of the pods (including the nodes), and the service processed traffic stably throughout those 10 minutes.
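The traffic switching described above relies on weight-based routing: the mesh shifts request percentages between the old and new version, so a cutover is just a weight change. The following is a minimal sketch of that mechanism under assumed version names, not the mesh's actual implementation.

```python
import random

# Minimal sketch of weight-based traffic splitting, the mechanism a
# service mesh uses to shift requests between versions. Version names
# ("v1", "v2") are illustrative assumptions.
def split_traffic(weights: dict, rand=random.random) -> str:
    """Return the version that should receive this request,
    chosen in proportion to the configured weights."""
    total = sum(weights.values())
    r = rand() * total
    cumulative = 0
    for version, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # guard against floating-point edge cases
```

Flipping the weights from `{"v1": 100, "v2": 0}` to `{"v1": 0, "v2": 100}` switches all traffic at once, which is why a hotfix cutover can complete in seconds without waiting for pod startup.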
Developer teams can also provide security-enhanced access to their APIs while whitelisting traffic with Anthos Service Mesh, so latency doesn't spike during commute times. The development team in charge of the entire service was operating two microservices, but has grown that to nine since implementing ASM, a 4.5x increase.
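Conceptually, the whitelisting above admits a request only when the caller's identity appears on a known-good list. Here is a minimal sketch of that check under assumed caller names; in ASM this is expressed as mesh authorization policy rather than application code.

```python
# Hypothetical allowlist of third-party caller identities permitted to
# reach the APIs. Identities are illustrative assumptions.
ALLOWED_CALLERS = {"partner-app-a", "partner-app-b"}

def is_authorized(caller_id: str, allowlist=ALLOWED_CALLERS) -> bool:
    """Admit the request only if the caller is on the allowlist."""
    return caller_id in allowlist
```

Rejecting unknown callers at the mesh edge keeps unauthorized traffic from ever consuming backend capacity, which also protects latency during peak commute hours.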
None of this would have been possible without the DORA research that taught us how to focus on continuous improvement within our organization. While this project was not always a linear path to success, failing helped us better understand what to focus on in the future so we can serve our customers more effectively and efficiently.