How Psyonix wins with better logging
When your peak concurrent users grow 5x nearly overnight, whether your operations can support that growth can make or break your success. Rocket League, a popular online multiplayer game created by Psyonix, is described as arcade-style soccer and vehicular mayhem. In the summer of 2020, the game maker decided to switch the game's business model from an upfront purchase to a free to play model.
For the free to play transition, they predicted a substantial increase in the number of concurrent users, up to 5x, which was great for the business and a challenge for their IT team. To meet this challenge, the Psyonix team transformed key aspects of their operations to create a scalable, high-quality gaming experience for their players.
From one week to one day to spin up a new environment
Industries such as gaming, media, healthcare and retail face a common operational challenge: a highly variable, and sometimes hard to predict, number of peak concurrent users of their services.
Dan Schoenblum, Principal Engineer, and Matt Sanders, Lead Online Services Engineer, lead the team at Psyonix that makes the gameplay arenas come to life with players. On any given day, the number of gamers requesting to enter the soccer-style matches can vary greatly based on factors such as time of day, holidays, special promotional activities and more. To deal with the high variance, the team turned to Google Kubernetes Engine (GKE) to help them scale their environments quickly based on demand.
Matt recalls, “It used to take us a week to spin up a new environment, including all of the customized configurations required by our services. Since we started using GKE, we can do it in a day. There’s an API and command line tool for everything. It helps us automate a lot of that setup with Ansible and Terraform and that results in fewer misconfigurations.”
But scaling up quickly was not the only value that the team at Psyonix recognized with GKE.
“As a developer, Google Cloud is friendly to use and it is easy to find what I am looking for. When you’re using GKE, the operational overhead of managing Kubernetes is eliminated. Logging and monitoring are already part of the ecosystem, so it’s easy to access your logs. Aside from the log metrics, there are a number of system metrics that are packaged together along with the app logs. So as a developer things just feel more integrated and that makes it easy to use,” said Dan.
An increased number of gamers brings the need to troubleshoot more efficiently and to ensure that critical operations are reliable. For Psyonix, the key to meeting those needs was a sharper focus on logging data.
Making the most out of logs data
Developers, operations, and SRE teams use logs – point in time snapshots of systems and applications – to get to the bottom of technical problems. With a large increase in the number of gamers and infrastructure to support them, the volume of logs made Psyonix’s manual searches nearly impossible. They turned to labeling and Logs Explorer, capabilities of Cloud Logging, to add helpful metadata, then analyze it.
Search a full day’s worth of logs in just seconds
The operations team for Psyonix’s player matching service is tasked with getting to the bottom of the toughest issues escalated from front-line support. They are often dealing with vague information since players will not have an in-depth understanding of Psyonix’s infrastructure or services architecture.
Dan explains, “Our players described problems like: they were unable to join a match, or there was a glitch and it happened yesterday sometime. In our old system it was impractical to troubleshoot these issues unless we had a very good idea of what type of problem it was and exactly when it occurred. We did not have an effective way to search through our logs data.”
Enter labeling. Labels let the team tag log entries with fields such as Player ID, so that those fields are indexed and searchable in Cloud Logging. Now they can query the logs associated with an individual player within seconds and begin the troubleshooting process. Dan continued:
“This is a level of visibility we would not have been able to achieve manually. Labels and Cloud Logging allow us to troubleshoot more effectively as our user base grows. Our goal was to do more with the same amount of resources to deliver a great gaming experience, and we achieved it.”
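As a rough illustration of the labeling idea, the sketch below writes one structured JSON log line with a player label and builds a matching Logs Explorer filter. It is not Psyonix's code: the function names and the `player_id` label key are hypothetical, and it assumes the logs are collected by a Google logging agent (as on GKE) that lifts the `logging.googleapis.com/labels` field into the log entry's labels.

```python
import json
import sys


def log_with_labels(message, severity="INFO", stream=sys.stdout, **labels):
    """Write one structured JSON log line that carries searchable labels."""
    entry = {
        "severity": severity,
        "message": message,
        # Assumption: the collecting agent moves this special field into
        # the labels of the resulting log entry. Label values are strings.
        "logging.googleapis.com/labels": {k: str(v) for k, v in labels.items()},
    }
    line = json.dumps(entry)
    stream.write(line + "\n")
    return line


def player_filter(player_id, start, end):
    """Build a Logging query language filter for one player and time window."""
    return (
        f'labels.player_id="{player_id}" '
        f'AND timestamp>="{start}" AND timestamp<"{end}"'
    )


# Hypothetical usage: tag a failed join, then build the query for "yesterday".
log_with_labels("join match failed", severity="ERROR", player_id=12345)
query = player_filter("12345", "2020-09-23T00:00:00Z", "2020-09-24T00:00:00Z")
```

With a filter like this, a vague report such as "it happened yesterday sometime" becomes a query over one player's entries for a single day.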
Seconds count for critical gameplay and revenue
Switching from a prepaid revenue model to a free to play model means that your customers' experience, and their ability to buy upgrades whenever they want them, will be at the forefront of your business success. The Psyonix team uses Cloud Monitoring dashboards and proactive alerting to stay ahead of issues.
Matt describes the day that they transitioned to free to play: “When we did our free to play launch, we ran load tests and built a model of expected behavior, tracking all of it in Cloud Monitoring dashboards. Then when we actually went live, we were able to track the real metrics against the modeled metrics to verify that things were behaving as we expected. On launch day, pretty much everyone on the team had a bunch of these dashboards up that were being powered by log metrics.”
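The comparison Matt describes, real metrics against modeled metrics, comes down to simple arithmetic. The sketch below is only an illustration in plain Python, not the team's dashboards: the function names, the sample numbers, and the 20% tolerance are all assumptions, and the modeled values are assumed to be nonzero.

```python
def relative_deviation(observed, modeled):
    """Return (observed - modeled) / modeled for each paired sample."""
    if len(observed) != len(modeled):
        raise ValueError("series must have the same length")
    return [(o - m) / m for o, m in zip(observed, modeled)]


def out_of_tolerance(observed, modeled, tolerance=0.2):
    """Return the sample indices where the deviation exceeds the tolerance."""
    return [
        i
        for i, deviation in enumerate(relative_deviation(observed, modeled))
        if abs(deviation) > tolerance
    ]


# Hypothetical numbers: modeled vs. observed requests per minute.
modeled = [100, 200, 400]
observed = [110, 190, 520]
suspect_samples = out_of_tolerance(observed, modeled)
```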
Psyonix also uses custom metrics, created from logs data, to monitor and alert on critical aspects of their business. If a metric falls outside its normal operating thresholds, the team gets an alert and can investigate immediately. The team configured alerts to notify them of decreases in important events such as:
Percentage of successful login attempts
Percentage of successful matchmaking attempts
Percentage of successful in-game purchase events
A common best practice for monitoring is: “measure what is important for your business.” With the metrics mentioned above, and others, they are ensuring that the needs of their players and needs of their business are being monitored at all times.
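To make the success-percentage metrics concrete, here is a minimal sketch of the calculation in plain Python. In the setup described above, this work is done by metrics created from logs and by alerting in Cloud Monitoring; the event names, record shape, and thresholds below are hypothetical.

```python
def success_percentage(events, event_type):
    """Percentage of events of the given type whose outcome is 'success'."""
    matching = [e for e in events if e["event"] == event_type]
    if not matching:
        return None  # no samples, so there is nothing to compute
    succeeded = sum(1 for e in matching if e["outcome"] == "success")
    return 100.0 * succeeded / len(matching)


def breached(events, thresholds):
    """Return {event_type: percentage} for each metric below its minimum."""
    alerts = {}
    for event_type, minimum in thresholds.items():
        pct = success_percentage(events, event_type)
        if pct is not None and pct < minimum:
            alerts[event_type] = pct
    return alerts


# Hypothetical log-derived events and alert thresholds (in percent).
events = [
    {"event": "login", "outcome": "success"},
    {"event": "login", "outcome": "success"},
    {"event": "login", "outcome": "success"},
    {"event": "login", "outcome": "failure"},
    {"event": "matchmaking", "outcome": "success"},
    {"event": "matchmaking", "outcome": "success"},
]
thresholds = {"login": 90.0, "matchmaking": 95.0, "purchase": 99.0}
alerts = breached(events, thresholds)
```

A metric with no samples is skipped rather than treated as a failure; whether missing data should itself raise an alert is a separate design decision.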
What’s next for Psyonix
Labeling, custom metrics, dashboards and alerts were just the beginning for the Psyonix team. Their next big project will be creating a common library of tags that are logged as part of requests. This methodology is termed “canonical logging” and it helps with standardization of metrics. When the same type of information is logged with every request you can build more robust metrics, analyze patterns more effectively and parse the data faster for purposes such as ticketing systems. As Matt points out, “when hunting down logs for troubleshooting, you might have to go back into the code to see how the developer labeled a particular log. With canonical logging, all of that is standard, so logs are more useful.”
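One way to picture canonical logging is a single record type that every request fills in the same way. The sketch below is an assumption about what such a shared library might look like, not Psyonix's design: the class name and every field name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CanonicalLogLine:
    """The standard set of fields logged once for every request."""

    request_id: str
    player_id: str
    service: str
    operation: str
    outcome: str
    duration_ms: float
    # Service-specific details go here so the top-level keys stay uniform.
    extra: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize with sorted keys so every line has the same layout."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical usage at the end of a matchmaking request.
line = CanonicalLogLine(
    request_id="req-1",
    player_id="12345",
    service="matchmaking",
    operation="join_match",
    outcome="success",
    duration_ms=12.5,
).to_json()
```

Because every service emits the same keys, a metric or a query written for one service works unchanged for the others.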
Start your journey with GKE and Cloud Logging
Google Cloud automatically ingests system logs and metrics, as well as logs and metrics from configured applications, into Cloud Logging and Cloud Monitoring to make the developer, operator and SRE experience faster and more productive. If you are already using GKE, explore the GKE Dashboard (right now, with no setup required!) for a consolidated view of metrics, logs, alerts, events, and SLOs.
Visit Google Cloud for gaming to learn more about building great player experiences and empowering your developers with better infrastructure and data insights.