Building multiplayer Google Doodle games with GKE, Open Match and Agones
Jacob Howcroft
Google Doodles, Senior Software Engineer
Mark Mandel
Google Cloud, Developer Advocate
Google makes its own games for the Search homepage using Kubernetes, managed game services and open source software.
Introduction
As anyone who visits the Google Search homepage may realise, there is often a surprising, creative and delightful change in our logo that sits above the Search bar. They’re called “Google Doodles.”
In recent years, we’ve seen them become not only interactive, but also multiplayer, letting anyone browsing the Google homepage play a game with one or more other people browsing the same page. This begs the question: what technologies are used behind the scenes to power these games for a global audience of Google users? And are those technologies available to the reader to build similar experiences?
We’re joined by Jacob Howcroft, who has been a software engineer on the Google Doodle team for the past seven years, and has been leading the team’s journey into multiplayer Google Doodle experiences.
What is a Google Doodle?
Jacob: A Google “Doodle” is a spontaneous change to the Google logo that celebrates diverse holidays, people, places, and things that have impacted and shaped local and global culture. The original Doodles were static illustrations, but now we also create games and other interactive experiences.
What is a multiplayer Google Doodle?
Jacob: It’s an interactive experience that connects players together, so that they can play with their friends or other random users online. Anyone who visits google.com in a browser or on a mobile device sees the Google logo transformed with a “play” button, and clicking it takes them to a multiplayer game.
What multiplayer Google Doodles did you recently launch? How did they go?
Jacob: Our first multiplayer Doodle was The Great Ghoul Duel, for Halloween 2018. Since it was one of our most popular Doodles ever, we launched The Great Ghoul Duel 2 on Halloween in 2022. We also launched Celebrating Pétanque, a multiplayer game celebrating the French sport, in July 2022.
Millions of people played the Pétanque and Halloween Doodles, and we also saw that a lot of them played multiple rounds, especially for Halloween. "The Great Ghoul Duel" has a fan community on Twitter and Discord that has remained active since 2018, and of course they were really happy to see the sequel.
What are the challenges of launching a multiplayer game for the Google Homepage?
Jacob: The biggest challenge is getting ready for scale. Every day, there are billions of searches on Google, and we’re making a game available to those users, so we have to handle a huge number of players right at launch. Additionally, these games launch all over the world in a matter of hours, so we have to scale up really fast. Since the Doodle is a surprise, we don’t have an opportunity to test in a beta release. This means we need to perform very rigorous QA, load testing, and internal playtesting.
How many players do you get on a multiplayer Google Doodle?
Jacob: When a Doodle is live on our homepage, our users play many millions of times. After the Doodle’s run on the homepage, we regularly see thousands of games played each day on our google.com/doodles archive.
How does a browser based game differ from other types of multiplayer games?
Jacob: The main difference for browser-based games, written in JavaScript or TypeScript, is that they use the WebSocket protocol, whereas mobile, console, or PC games would typically use UDP. The challenge this creates is the requirement for each GameServer to have a certificate for HTTPS / WebSocket Secure (WSS). Since the game runs on an https page, all the WebSocket connections need to be secure as well. Doing this also requires DNS addresses for the game servers, as opposed to connecting directly to an IP and port, as a more traditional UDP multiplayer game would.
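For readers less familiar with the browser side, here is a minimal sketch of what the client half of such a secure WebSocket connection looks like; the game server URL is a placeholder in the same format as the routes described later in this article.

```typescript
// Minimal sketch of the client side of a WSS connection; the URL is a
// placeholder in the same format used for the routes described later on.
const gameServerUrl = 'wss://cluster.game.cloud.doodles.goog/gameserver-abc12';

const socket = new WebSocket(gameServerUrl);
socket.binaryType = 'arraybuffer'; // game messages are binary (e.g. protobufs)

socket.addEventListener('open', () => {
  console.log('Connected to the game server over WSS');
});

socket.addEventListener('message', (event: MessageEvent<ArrayBuffer>) => {
  // Decode and handle a server-to-client message here.
});

socket.addEventListener('close', (event) => {
  console.log(`Connection closed with code ${event.code}`);
});
```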
Can you tell us how you built these multiplayer Doodle games?
Jacob: We used a number of Google Cloud products, and took advantage of freely available game infrastructure and distributed systems open source projects. For the game itself, we use many TypeScript tools and libraries that we’ve created over the years for Doodles, as well as some open source engines like PIXI.
What Google Cloud Products did you use?
Jacob: Google Kubernetes Engine (GKE) runs almost everything.
We utilise Ingress for External HTTP(S) Load Balancing to handle global load balancing for the matchmaker and for the game server clusters.
Memorystore for Redis is used to configure routing for game servers, and to store matchmaking tickets for our matchmaker.
Google Managed Certificates allows for the automatic creation and rotation of SSL certificates for the matchmakers and game servers. This was really helpful, since it turned certificate handling into zero work.
Cloud Logging, Cloud Monitoring and Logs-based Metrics allowed us to have a set of dashboards to monitor production, looking at metrics like game server status, number of matchmaking tickets, etc.
Cloud Run and Cloud Endpoints power the multicluster game server selection system, which chooses appropriate game server instances for our players from a set of GKE clusters and marks those game servers as “Allocated” - basically that they have players on them, and therefore shouldn’t be interrupted. We can go deeper into this later, but the multicluster allocator system helped a lot for scaling.
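To illustrate the allocation step, here is a minimal sketch of a backend requesting a GameServer allocation, assuming the allocation service exposes Agones' standard REST allocation endpoint; the URL, authentication handling, and label selector are illustrative placeholders rather than the Doodle team's actual configuration.

```typescript
// Minimal sketch of requesting a GameServer allocation, assuming the
// allocation service exposes Agones' standard REST allocation endpoint.
// The URL, authentication, and label selector are illustrative placeholders.
interface AllocationResult {
  gameServerName: string;
  address: string;
  ports: { name: string; port: number }[];
}

async function allocateGameServer(): Promise<AllocationResult> {
  const response = await fetch('https://allocator.example-doodles.dev/gameserverallocation', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      namespace: 'default',
      // Only consider GameServers labelled for this game.
      gameServerSelectors: [{ matchLabels: { game: 'halloween' } }],
    }),
  });
  if (!response.ok) {
    throw new Error(`Allocation failed with status ${response.status}`);
  }
  return (await response.json()) as AllocationResult;
}
```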
Why did you choose to use open source projects for the backend?
Jacob: We are mostly a frontend team - we build content for the web, and the frontend servers. When it comes to deeper backend tech like Google Cloud and Kubernetes, it isn’t our expertise. But the open source systems available bridge this skills gap, and let us work outside our comfort zone to create these more complex experiences.
Which open source projects did you take advantage of?
Jacob: Our first multiplayer Doodle back in 2018 took advantage of Open Match to matchmake our players together to play a game, and it worked fantastically, so we have continued to use that in our newer multiplayer game stack.
Agones is installed on our GKE clusters to host and scale our game servers and make sure they run uninterrupted while players are playing the game. Agones provides us with the GameServer and Fleet custom resource definitions we can use to declare the types and number of game server processes we are running in our GKE clusters. It also gives us the ability that we just talked about to select and allocate game servers, and mark them ready for player connection. We run multiple GKE and Agones clusters around the world to ensure that we have game servers near all our potential players (which is all of the users of Google.com!).
Also, not only were these terrific solutions, but since Google Cloud founded both Open Match and Agones and continues to maintain these projects with help from their respective communities, we knew we were in good hands and had people we could ask for help (and we did!).
We also use Traefik proxy to coordinate the websocket connections for matchmade players to specific Agones GameServer Pods hosted on our GKE clusters, so that players can play their games together.
For development and operational tools, we use Terraform for provisioning our infrastructure, and Helm and Helmfile for coordinating deployments of the open source tooling and our custom components into all of our GKE clusters.
How much custom code did you end up writing?
Jacob: Not as much as you would think!
Open Match requires you to write several custom components for match functions and integration with external systems like Agones, but thanks to the interoperability of gRPC, we could write all of them in Node.js even though Open Match is written in Go. We wrote our websocket based game servers in Node.js as well. This had the advantage of allowing us to share code between the client and server and speed up our development time.
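As a rough sketch of what an Open Match match function in Node.js/TypeScript can look like (not the Doodle team's actual code), the following pairs tickets into two-player matches and streams the proposals back over gRPC; querying Open Match's QueryService for tickets is stubbed out, and the proto location is an assumption.

```typescript
// Rough sketch of an Open Match match function in Node.js/TypeScript.
// Assumes Open Match's matchfunction.proto (and its imports) are available
// locally; the proto path and the ticket query are illustrative stubs.
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

const packageDefinition = protoLoader.loadSync('matchfunction.proto', {
  keepCase: true,
  includeDirs: ['open-match/api'], // hypothetical location of the Open Match protos
});
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Hypothetical helper: a real match function would stream tickets for each
// pool from Open Match's QueryService (QueryTickets). Stubbed out here.
async function fetchPoolTickets(pool: any): Promise<any[]> {
  return [];
}

// Run is a server-streaming RPC: receive a MatchProfile, propose Matches.
async function run(call: grpc.ServerWritableStream<any, any>): Promise<void> {
  const profile = call.request.profile;
  const pools: any[] = profile.pools ?? [];
  const tickets = (await Promise.all(pools.map(fetchPoolTickets))).flat();

  // Pair tickets into two-player matches and stream each proposal back.
  for (let i = 0; i + 1 < tickets.length; i += 2) {
    call.write({
      proposal: {
        match_id: `${profile.name}-${i / 2}`,
        match_profile: profile.name,
        match_function: 'doodle-pair-matcher',
        tickets: [tickets[i], tickets[i + 1]],
      },
    });
  }
  call.end();
}

const server = new grpc.Server();
server.addService(proto.openmatch.MatchFunction.service, { Run: run });
server.bindAsync('0.0.0.0:50502', grpc.ServerCredentials.createInsecure(), () => {
  server.start(); // no-op on recent @grpc/grpc-js versions, required on older ones
});
```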
We also wrote a custom Kubernetes controller to configure Traefik routes. Traefik can be set up to detect configurations from Redis by writing to certain key formats. Our controller watched for Agones GameServer events, and wrote the name, pod IP, and port to Memorystore whenever one occurred. Traefik dynamically updates its routing so that we can access the GameServer at wss://<project subdomain>.cloud.doodles.goog/<game server name>.
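The sketch below shows the general shape of such a controller in TypeScript: watch Agones GameServer events with the Kubernetes client and mirror routing information into Redis in Traefik's key-value provider format. The key layout, namespace, and the use of status.address and status.ports are simplifications; the real controller wrote the pod IP as described above.

```typescript
// Sketch of a controller that mirrors Agones GameServer state into Redis for
// Traefik's Redis (key-value) provider. The key layout follows Traefik's KV
// provider conventions, and status.address/ports stand in for the pod IP and
// port the real controller wrote; treat the details as illustrative.
import * as k8s from '@kubernetes/client-node';
import Redis from 'ioredis';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const redis = new Redis({ host: process.env.REDIS_HOST ?? '127.0.0.1' });

async function upsertRoute(gs: any): Promise<void> {
  const name: string = gs.metadata.name;
  const address: string | undefined = gs.status?.address;
  const port: number | undefined = gs.status?.ports?.[0]?.port;
  if (!address || !port) return; // GameServer not yet addressable

  // Route wss://<host>/<game server name> to this GameServer.
  await redis.mset({
    [`traefik/http/routers/${name}/rule`]: `PathPrefix(\`/${name}\`)`,
    [`traefik/http/routers/${name}/service`]: name,
    [`traefik/http/services/${name}/loadbalancer/servers/0/url`]: `http://${address}:${port}`,
  });
}

async function removeRoute(gs: any): Promise<void> {
  const name: string = gs.metadata.name;
  await redis.del(
    `traefik/http/routers/${name}/rule`,
    `traefik/http/routers/${name}/service`,
    `traefik/http/services/${name}/loadbalancer/servers/0/url`,
  );
}

// Watch Agones GameServer events and keep the Traefik routes in sync.
const watch = new k8s.Watch(kc);
void watch.watch(
  '/apis/agones.dev/v1/namespaces/default/gameservers',
  {},
  (type, gs) => {
    if (type === 'ADDED' || type === 'MODIFIED') void upsertRoute(gs);
    if (type === 'DELETED') void removeRoute(gs);
  },
  (err) => console.error('GameServer watch ended', err),
);
```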
Can you step us through your architecture for matchmaking and assigning a game server to a group of players?
Jacob: First, players connect to a global Multi Cluster Ingress HTTP(S) load balancer, which routes players to their closest matchmaker frontend, hosted in globally distributed GKE clusters. We usually have a handful of these clusters, with one or two per continent.
Through the matchmaker frontend, a Ticket is created in Open Match, which is matched with other players.
When the Agones Fleet starts new GameServers, our custom Agones-Traefik controller writes their Pod IPs to a Memorystore for Redis instance. Traefik Proxy reads from this instance to set up routes in the cluster as we mentioned before - this way, players will connect to GameServer “foo” by URL, e.g. “wss://cluster.game.cloud.doodles.goog/foo”.
The Open Match Director component, which we wrote in NodeJS, receives a Match and uses the Cloud Run-powered Agones multicluster Allocation service to allocate a GameServer. Each Matchmaker cluster allocates from one or more GameServer clusters.
When doing our load tests, we noticed that the optimal GameServer Fleet (a grouping of GameServers) size is mostly limited by the Kubernetes control plane. So the allocation rate, the amount of game server activity, and your control plane size determine the size you can handle. Our game sessions are usually pretty short, and all that turnover creates a lot of load on the Kubernetes API.
We originally had an instance of our Open Match matchmaker and Agones GameServer Fleet in each cluster. However, we found that there was an optimal size for our GameServer clusters, around 6,000 Agones GameServers, so a one-to-many architecture for Matchmaker and GameServer clusters allowed us to scale higher, since more clusters equaled more overall throughput.
Once a GameServer allocation succeeds, the matchmaker replies to the players with the url of the GameServer. Players connect to it through a separate Ingress, and Traefik Proxy routes the request to the appropriate GameServer.
At the end of a game session, we don’t shut down the game server; instead, we reset it and move it back to the “Ready” state through the Agones SDK, so it can be allocated for a new game. This saves us some load on the Kubernetes API and the Traefik proxy by reducing GameServer turnover.
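Returning a game server to “Ready” is done through the Agones SDK; the following is a minimal sketch using the Agones Node.js SDK, with the game-specific cleanup stubbed out.

```typescript
// Minimal sketch of a game server lifecycle using the Agones Node.js SDK:
// mark the server Ready at startup, keep health pings flowing, and return to
// Ready (rather than shutting down) when a session ends. Game-specific reset
// logic is a stub.
import AgonesSDK from '@google-cloud/agones-sdk';

const agones = new AgonesSDK();

function resetGameState(): void {
  // Hypothetical game-specific cleanup: clear scores, player slots, timers, etc.
}

async function main(): Promise<void> {
  await agones.connect();
  await agones.ready(); // this GameServer can now be allocated for a match

  // Keep Agones health checks satisfied while the process is running.
  setInterval(() => agones.health(), 2000);
}

// Called when a match finishes: reset and re-enter the Ready pool to
// reduce GameServer turnover, as described above.
export async function onGameOver(): Promise<void> {
  resetGameState();
  await agones.ready();
}

void main();
```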
We usually advocate for a single instance of Open Match to support a global player base - why have an instance in each region?
Jacob: While we know that Open Match can handle a global player base, we used multiple Open Match instances in different regions, so that we could rely on the Load Balancing to route traffic to the nearest Matchmaker. Since we had such a massive pool of players, we didn’t need to worry about sharding our player base - there were plenty for each Matchmaker.
Since we didn’t have the option of beta testing before launch, having more instances of Open Match around the world reduced launch day risks and allowed us to horizontally scale Open Match instances and the Agones clusters behind them if we had to.
Can you go into more detail on the websocket based communications? That sounds quite tricky.
Jacob: WebSocket games require WebSocket Secure (WSS), which means that every GameServer needs to be addressed by URL and have an SSL certificate. Managing this for dynamic Fleets of thousands of GameServers required a novel solution.
For the Halloween 2018 Doodle, we statically generated thousands of DNS entries routing to GameServers by subdomain. However, this approach stopped us from taking advantage of all the dynamic, scalable features of Agones and GKE.
So for our new setup, we found Traefik Proxy. We chose this one because routes can be configured dynamically with various providers, one of them being Redis. So, we wrote a Kubernetes controller that watches GameServers, and writes the configurations to Redis. We were concerned about how a proxy would affect players’ ping, but with some load testing, we found that it didn’t have a noticeable impact. This setup allowed us to dynamically scale the Fleet and have the new GameServers available for connection as soon as they could spin up without having to manipulate DNS entries.
We also wrote a WebSocket library, using Google Closure WebSockets on the client and ws on the servers. For each game, we define an API using protocol buffers - a client-to-server message and a server-to-client message. Usually, this message contains a oneof field for various sub message types for different actions within the game. The library uses TypeScript generics to make sure that the game code sends and receives messages using this API.
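The message shapes and wrapper below are a hypothetical illustration of that pattern, not the actual Doodle library: a client-to-server and a server-to-client message, each with a oneof field, and a generic wrapper that ties one send type and one receive type to a connection.

```typescript
// Hypothetical illustration of the pattern (not the actual Doodle library).
// An example proto shape, shown as a comment:
//
//   message ClientMessage {
//     oneof action {
//       JoinRequest join = 1;
//       PlayerInput input = 2;
//     }
//   }
//   message ServerMessage {
//     oneof event {
//       GameStart start = 1;
//       StateUpdate state = 2;
//     }
//   }
interface Codec<T> {
  encode(message: T): Uint8Array;
  decode(bytes: Uint8Array): T;
}

class TypedSocket<SendT, ReceiveT> {
  constructor(
    private readonly socket: WebSocket,
    private readonly sendCodec: Codec<SendT>,
    receiveCodec: Codec<ReceiveT>,
    onMessage: (message: ReceiveT) => void,
  ) {
    socket.binaryType = 'arraybuffer';
    socket.addEventListener('message', (event: MessageEvent<ArrayBuffer>) => {
      onMessage(receiveCodec.decode(new Uint8Array(event.data)));
    });
  }

  // The compiler rejects attempts to send anything other than SendT.
  send(message: SendT): void {
    this.socket.send(this.sendCodec.encode(message));
  }
}
```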
Ensuring a game is fun is one of the most important aspects of any game - how did you test for that?
Jacob: We test for fun in a few ways. The first one is a weekly team playtest; the whole Doodle team gathers together to play the latest demos and share feedback and ideas. The second way is internal user tests, or “Cafe tests”. We find some random Googlers, or set up a stand in a micro-kitchen, and get their feedback in real time. Googlers are usually pretty honest, so we get some good results this way. Finally, we will set up playtests with external users from a variety of backgrounds and tech familiarities, which provides very valuable and diverse perspectives that inform the final product.
Given you are on the front page of Google, how do you load test for that kind of traffic?
Jacob: The first thing, from an early stage, is to make a headless game client. This means that the game can be played by a script, without interacting with any UI. Then, we wrote a Node-based load testing client that simulated a user.
Using Kubernetes Jobs, we would deploy thousands of these fake users to GKE, and point them at a cluster. We start with a small number of users, and if the cluster handles them properly, then we step it up, and repeat this process until something breaks down.
Then, we would debug and fix the issue, and start over. Using this process, we were able to find the optimal Fleet size for our Agones clusters, which was about 6,000 GameServers. Note that this number depends on your Kubernetes provider, the size of your matches, the duration of your games, and other factors, so it isn’t a one-size-fits-all.
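A headless load-test bot of this kind can be quite small. The sketch below (with a hypothetical matchmaker endpoint, response shape, and message format) requests a match, connects to the returned GameServer URL over WebSocket, and sends fake inputs for a couple of minutes.

```typescript
// Sketch of a headless load-test bot; the matchmaker endpoint, response shape,
// message format, and timings are all hypothetical.
import WebSocket from 'ws';

async function runBot(matchmakerUrl: string): Promise<void> {
  // Ask the matchmaker for a game; assume it replies with the GameServer's wss URL.
  const res = await fetch(`${matchmakerUrl}/match`, { method: 'POST' });
  const { gameServerUrl } = (await res.json()) as { gameServerUrl: string };

  const socket = new WebSocket(gameServerUrl);

  socket.on('open', () => {
    // Send a fake input every 100ms to simulate a real player.
    const timer = setInterval(() => {
      socket.send(JSON.stringify({ input: { x: Math.random(), y: Math.random() } }));
    }, 100);
    // End the "session" after two minutes.
    setTimeout(() => {
      clearInterval(timer);
      socket.close();
    }, 2 * 60 * 1000);
  });

  socket.on('close', () => console.log('bot session finished'));
}

// A Kubernetes Job runs many copies of this process in parallel against a cluster.
void runBot(process.env.MATCHMAKER_URL ?? 'https://matchmaker.example-doodles.dev');
```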
How did you estimate the amount of traffic to load test for?
Jacob: We looked at metrics from our past games, and used that to estimate the potential traffic for this one. However, it can vary a ton depending on a lot of different factors, so we have to aim as high as we can. Even still, we ended up deploying additional clusters during launch because the peak traffic was so high.
When you were ready to deploy to production, how did you do this?
Jacob: Once we had determined the approximate scale via load testing and analysing past Doodle performance, we would define the entire Fleet in Terraform and Helmfile, and deploy it. The Doodle begins rolling out based on local time around the world, so we plan for traffic to increase in large steps over about 24 hours.
Did you autoscale any of your resources, or just keep a static infrastructure?
Jacob: We enabled node autoscaling on our GKE node pools, except for some cases where we wanted to avoid scale-down in order to make operational maintenance easier. Since we had long-running WebSocket connections, we had to reduce the number of pods being moved around as nodes turned over. We generally didn’t use autoscaling for our Kubernetes Deployments, because we knew we would be receiving a lot of traffic in a short time; we preferred to pre-scale to high levels and then manually adjust throughout the short launch window. We also didn’t use Agones Fleet autoscaling, as we ran at 100% capacity for the entire game launch; however, we took advantage of being able to easily resize the Agones Fleets by running kubectl scale fleet halloween --replicas=N, and since we use node autoscaling, the cluster would automatically resize to fit the Fleet.
How many GKE clusters did you have, and how did you manage them?
Jacob: For the most recent game (Halloween 2022), we had 30 clusters. On average, we had 3-5 GameServer clusters per matchmaker cluster. Using Terraform and Helm, and taking advantage of their loop constructs, was crucial to managing all these clusters. It turned configs that could have been hundreds of lines into a few dozen.
You mentioned using Cloud Monitoring, what did you actively monitor?
Jacob: We monitored a variety of metrics across the system, but to highlight some of the most important ones:
We had dashboards for Agones Fleet status showing how many GameServers were in each state, especially Ready vs Allocated. This let us know when a Fleet was reaching its capacity. Since the launch period is only about 48 hours, we didn’t set up alerts and just stayed on call throughout. However, we set up some alerts based on error rates for when the game runs on the google.com/doodles archive.
We also monitored game server and matchmaker connections. Especially for the matchmaker, we monitored the rate of successful matches made. This was important as it was our number one metric of the user experience - failing to find a match was the least fun thing that could happen for the user. Another important metric is how long users spend waiting for enough players to join a match. Luckily we didn’t have to worry about this one, because we had so much traffic.
We tracked our own custom “Game Over” events. Looking at the running total of games completed was really exciting! We did this by creating a chart summing our custom log-based metrics for things such as matchmaker connections, successes and failures in the Open Match components we wrote, game server connections, etc.
And finally, we utilised all the built-in metrics from Agones and Open Match to give us an overview of the entire platform health.
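As an illustration of the “Game Over” logs-based metric mentioned above: on GKE, structured JSON written to stdout is ingested by Cloud Logging as jsonPayload, so a logs-based metric can count entries matching a field such as event="game_over". The field names below are illustrative, not the Doodle team's actual schema.

```typescript
// Illustrative "Game Over" event logging: on GKE, JSON written to stdout is
// ingested by Cloud Logging as jsonPayload, so a logs-based metric can count
// entries where, say, jsonPayload.event == "game_over". Field names are
// illustrative, not the Doodle team's actual schema.
function logGameOver(matchId: string, playerCount: number, durationSeconds: number): void {
  console.log(
    JSON.stringify({
      severity: 'INFO',
      event: 'game_over',
      matchId,
      playerCount,
      durationSeconds,
    }),
  );
}

logGameOver('match-1234', 8, 95);
```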
What would you have done differently if you had to do it again?
Jacob: One lesson we learned was that we should have obtained more than double the amount of quota to handle our expected traffic. The system was designed so that it was really easy to bring new clusters online during launch, which we made use of and was really helpful in scaling up at the last minute. However, we didn’t have enough quota in some regions to do a full blue-green deployment (running an entire deployment of a new version before switching over from an older version). That meant we had to incur some downtime when making infrastructure changes during launch. This can be avoided by having plenty of headroom in your quotas. So our recommendation is to work with your Google Cloud customer account representatives to ensure that you have a full launch review with quota requirements well ahead of your launch date.
Another thing that was difficult was getting the entire matchmaking and game server orchestration system to spill over gracefully to another region when it had no more Agones GameServers available in its own region. This meant that during launch, we had to make sure each Open Match instance was backed by enough GameServers that we wouldn’t run out, given the peak load we expected and encountered. We avoided cascading failures during launch, but the tradeoff was having to manually scale up the busiest clusters.
To explain, users connect to the matchmaking frontend and request to join a match. They wait until there are enough players, and then an allocation request is sent to Agones. But if all Agones clusters are full, it will reply with an error. At this point, there might already be thousands of players attempting to join a match, and all of those matchmaking requests will fail, since there is nowhere for them to go (remember, we run at 100% capacity). Additionally, new users are still connecting to the matchmaking system, and all those requests will fail as well, leading to cascading failure for the matchmaking cluster. Ideally, the matchmaking system would stop accepting new matchmaking requests early if it knows that its Fleet is above a certain Allocated threshold.
To better handle this in the future, we need the frontend to have up-to-date data on the Fleets behind each Open Match instance, so it can start telling clients to use a different instance before it becomes overloaded. If you wait until the Allocator Service reports that the Fleet is exhausted, you may already have a problem.
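One way to express that kind of early backpressure is sketched below: the matchmaker frontend keeps a cached view of Fleet capacity and stops accepting new tickets once the allocated ratio crosses a threshold. How the Fleet status is obtained is stubbed out, since that is exactly the up-to-date Fleet data discussed above.

```typescript
// Sketch of early backpressure in the matchmaker frontend: keep a cached view
// of Fleet capacity and stop accepting new tickets once the allocated ratio
// crosses a threshold. How the Fleet status is obtained is stubbed out.
interface FleetStatus {
  readyReplicas: number;
  allocatedReplicas: number;
}

const ALLOCATED_THRESHOLD = 0.9; // stop taking new tickets at 90% allocated

let cachedStatus: FleetStatus = { readyReplicas: 0, allocatedReplicas: 0 };

// Hypothetical: poll the Agones Fleet objects (or an aggregated metric) for
// up-to-date Ready/Allocated counts across the backing GameServer clusters.
async function fetchFleetStatus(): Promise<FleetStatus> {
  return cachedStatus;
}

setInterval(async () => {
  cachedStatus = await fetchFleetStatus();
}, 5000);

export function shouldAcceptNewTicket(): boolean {
  const total = cachedStatus.readyReplicas + cachedStatus.allocatedReplicas;
  if (total === 0) return false;
  return cachedStatus.allocatedReplicas / total < ALLOCATED_THRESHOLD;
}

// In the request handler: if (!shouldAcceptNewTicket()) respond with 503 or
// tell the client to try a different regional matchmaker instance.
```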
We are also considering switching to Multi Cluster Gateways, which would allow us to configure rate limits. Then we could just compute our max QPS for a cluster based on its maximum capacity, set the rate limit accordingly, and let GCP handle the spillover. It would be a less dynamic solution, but it would be very simple to implement.
Conclusion
Thank you to Jacob for such a detailed description of how the Google Doodle team develops, hosts, scales and orchestrates their multiplayer Google Doodle games.
The combination of Google Cloud - particularly Google Kubernetes Engine - and the open source Google Cloud Gaming solutions allowed the Google Doodle team to successfully run successive multiplayer Doodles at Google Search scale - a truly impressive feat!
It’s also pretty amazing to see that such a comprehensive multiplayer game platform could be built with relatively little custom code, when utilising open source solutions for multiplayer games from Google Cloud.
If any of this sounds interesting and you want to learn more about any of the solutions listed above, you can go to:
Agones: https://agones.dev
Open Match: https://open-match.dev
Google Kubernetes Engine: https://cloud.google.com/kubernetes-engine
And finally, if you want to play/see any of the over 5000 Google Doodles we’ve launched around the globe throughout the decades, you can do so at google.com/doodles!