Hakuhodo Technologies: The transformative impact of SRE

August 12, 2024

Yoshimasa Suzuki

CCoE Team Leader, Hakuhodo Technologies Inc.

Takumi Kondo

CCoE Team Tech Lead, Hakuhodo Technologies Inc.

Hakuhodo Technologies, a specialized technology company of the Hakuhodo DY Group — one of Japan’s leading advertising and media holding companies — is dedicated to enhancing our software development process to deliver new value and experiences to society and consumers through the integration of marketing and technology.

Our IT Infrastructure Team at Hakuhodo Technologies operates cross-functionally, ensuring the stable operation of the public cloud that supports the diverse services within the Hakuhodo DY Group. We also provide expertise and operational support for public cloud initiatives.

Our value is to excel in the cloud and infrastructure domain, exhibiting a strong sense of ownership, and embracing the challenge of creating new value.

Background and challenges

The infrastructure team is tasked with developing and operating the application infrastructure tailored to each internal organization and service, in addition to managing shared infrastructure resources.

Following the principles of platform engineering and site reliability engineering (SRE), each team within the organization has adopted elements of SRE, including the implementation of post-mortems and the development of observability mechanisms. However, we encountered two primary challenges:

As the infrastructure expanded, the team grew rapidly, bringing in new members from diverse backgrounds. This made it necessary to clarify and standardize tasks, build a shared understanding of our current situation, and align on our goals.
We mainly communicate with the app team through a ticket-based system. In addition to expanding our workforce, we have also introduced remote working. As a result, team members may not be as well-acquainted as before. This lack of familiarity could potentially cause small misunderstandings that can escalate quickly.

As our systems and organization expand, we believe that strengthening common understanding and cooperative relationships within and between the infrastructure team and the application team is essential for sustainable business growth. This has become a core element of our strategy.

We believe that fostering an SRE mindset among both infrastructure and application team members and creating a culture based on that common understanding is essential to solving the issues above. To achieve this, we decided to implement the "SRE Core" program by Google Cloud Consulting, which serves as the first step in adopting SRE practices.

Change

First, through the "SRE Core" program, we revitalized communication between the application and infrastructure teams, which had previously had limited interaction. For example, some aspects of the program required information that was challenging for infrastructure members to gather and understand on their own, making cooperation with the application team essential.

Our critical user journey (CUJ), a core SRE concept, was established based on the business requirements of the app and the behavior of actual users. This information is typically managed by the app team, which frequently communicates with the business side. This time, we collaborated with the application team to create a CUJ, set service level indicators (SLIs) and service level objectives (SLOs) with their associated error budgets, perform risk analysis, and design the other elements necessary for SRE.
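To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal sketch in Python. It is illustrative only and is not the tooling used in the program: the function name `evaluate_slo`, the `SloReport` type, the 99.9% target, and the request counts are all hypothetical, and the SLI is assumed to be a simple ratio of good events to total events over a measurement window.

```python
from dataclasses import dataclass
from fractions import Fraction
import math


@dataclass(frozen=True)
class SloReport:
    sli: float             # observed fraction of good events in the window
    error_budget: int      # number of bad events the SLO tolerates in the window
    budget_remaining: int  # tolerated bad events not yet consumed (negative if overspent)
    slo_met: bool          # True when the observed SLI is at or above the target


def evaluate_slo(good_events: int, total_events: int, slo_target: str) -> SloReport:
    """Evaluate a ratio-based SLI against an SLO target such as "0.999".

    The target is passed as a decimal string so it can be converted to an
    exact Fraction, which avoids floating-point rounding in the budget.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    target = Fraction(slo_target)
    if not 0 < target <= 1:
        raise ValueError("slo_target must be in the range (0, 1]")

    bad_events = total_events - good_events
    # The error budget is the share of events that may fail without breaching the SLO.
    error_budget = math.floor(total_events * (1 - target))
    sli = Fraction(good_events, total_events)
    return SloReport(
        sli=float(sli),
        error_budget=error_budget,
        budget_remaining=error_budget - bad_events,
        slo_met=sli >= target,
    )


if __name__ == "__main__":
    # Hypothetical window: 100,000 requests, 60 of which failed, against a 99.9% SLO.
    print(evaluate_slo(good_events=99_940, total_events=100_000, slo_target="0.999"))
```

In this hypothetical window, a 99.9% target over 100,000 requests tolerates 100 failures, so 60 failures leave 40 in the budget. A remaining budget near zero or below it is the kind of signal the infrastructure and application teams can review together when deciding whether to prioritize reliability work.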

This collaborative work and shared understanding served as a starting point. We continued to build a closer working relationship even after the program ended, with infrastructure members now participating in sprint meetings that had previously been held only for the app team.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Hakuhodo_-_Next_Tokyo.max-2200x2200.png

Additionally, as an infrastructure team, we systematically learned when and why SRE activities are necessary, allowing us to reflect on and strengthen our SRE efforts that had been partially implemented.

For example, we came to understand that the purpose of postmortems is not only to prevent the recurrence of incidents but also to gain insights from the differences in perspective between team members. Learning this changed our team's mindset. We now act on improvements immediately: we have formalized the postmortem process, made it clear that action items are tracked as tickets, and started sharing postmortem minutes, which were previously kept internal, with the app team.

We also reaffirmed the importance of observability to consistently review and improve our current system. Regular meetings between the infrastructure and application teams allow us to jointly check metrics, which in turn helps maintain application performance and prevent potential issues.

By elevating our previously partial SRE activities and integrating individual initiatives, the infrastructure team created an organizational activity cycle that has earned more trust. This enhanced cycle is now being integrated into our existing operational workflows.

Future plans

With the experience gained through the SRE Core program, the infrastructure team looks forward to expanding collaboration with application and business teams and increasing proactive activities. Currently, we are starting with collaborations on select applications, but we aim to use these success stories to broaden similar initiatives across the organization.

It is important to remember that each app has different team members, business partners, environments, and cultures, so SRE activities must be tailored to each unique situation. We aim to harmonize and apply the content learned in this program with the understanding that SRE activities are not the goal, but are elements that support the goals of the apps and the business.

Additionally, our company has a Cloud Center of Excellence (CCoE) team dedicated to cross-organizational activities. The CCoE manages a portal site for company-wide information dissemination and a community platform for developers to connect. We plan to share the insights we've gained through these channels with other teams within our group companies. As the CCoE's internal activities mature, we also intend to share our knowledge and experiences externally.

Through these initiatives, we hope that members across the company, beyond the CCoE and infrastructure organizations, will take psychological safety into consideration in their discussions and actions.

Supplement: Regarding psychological safety

At our company, we have a diverse workforce with varying years of experience and perspectives. We believe that ensuring psychological safety is essential for achieving high performance.

When psychological safety is lacking, for instance when the person delivering bad news is blamed, reports tend to become superficial and do not lead to substantive discussions.

Psychological barriers can cause similar problems. For example, when a task known only to experienced employees is left out of the instructions, others may be afraid to ask for clarification, and the omission goes uncorrected.

In a situation where psychological safety is ensured, we focus on systems rather than individuals, viewing problems as opportunities. For example, if errors occur due to manual work, the manual process itself is seen as the issue. Similarly, if a system failure with no prior similar case arises, it is considered an opportunity to gain new knowledge.

By adopting this mindset, fear is removed from the equation, allowing for unbiased discussions and work.

This allows every employee to perform at their best, regardless of their years of experience. Of course, this is not something that a single person can achieve. Making it a reality requires the whole team or organization to recognize its importance.
