Building a real-time game that has the potential to receive global media attention and host a large number of simultaneous players can be a daunting challenge. Typically, it requires integrating various hardware, software, and platforms to satisfy common requirements of such a game, such as:
- Scalability to handle a large number of simultaneous users
- High availability to be tolerant of a datacenter-wide downtime
- Low latency to process player actions within a few hundred milliseconds
The easiest way to overcome these challenges is to learn from an existing production service. In this paper, we walk through World Wide Maze, a game that has solved these challenges by utilizing the Google Cloud Platform and cutting-edge web technologies.
Chrome Experiments: World Wide Maze
Chrome Experiments is a website created by Google to showcase creative web applications built with the latest HTML5 technologies. One of these is an interactive 3D maze game called World Wide Maze (WWM), developed by the Chrome team in cooperation with third-party partners.
In WWM, players specify a web site, and a 3D maze is generated from the visual layout of the web page. The player uses an Android/iOS device as a game controller, tilting the device to roll a ball through the maze.
One of the unique features of WWM is that all real-time interactions are orchestrated by Google Cloud Platform; specifically, Google Compute Engine and Google App Engine. To create a playable game experience, low latency between controller input and game screen is a key requirement.
When it was released, WWM was featured by many major online media outlets and attracted a surge of users. This resulted in a huge traffic spike (as shown in Figure 2), with 6,600 simultaneous users across the world, 7,000 page views per minute, and 1,000 requests per second. Because the system was designed to leverage the massive scalability of Google Cloud Platform, it handled the spike by smoothly scaling out the Compute Engine and App Engine instances.
One of the biggest challenges in building a real-time game is handling the persistent connections between clients and servers. In WWM, the real-time communication between the game controller and the game screen must complete within 200 milliseconds. This round trip includes:
- The request from the controller received by the server
- The request routed to the screen from the server
- The maze rendered on the screen based on the request
This paper focuses on how WWM was designed and developed to overcome these challenges by utilizing unique features of Google Cloud Platform.
Orchestrating Google Compute Engine Instances with Google App Engine
The architecture of the World Wide Maze utilizes many components of the Google Cloud Platform, as highlighted in Figure 3:
- Game Screen
- Game Controller
- Web Frontend Server (App Engine)
- WebSocket Server (Compute Engine)
  - Node.js + Socket.IO (handles WebSocket communication)
- Stage Builder (Compute Engine)
  - PhantomJS (responsible for game stage rendering)
  - OpenCV (responsible for game stage image processing)
- Database Server (Compute Engine)
  - MySQL (handles request queuing)
The client side has two components: the Game Screen (Chrome browser on PC/Mac/Linux that displays the game screen) and the Game Controller (Chrome on Android/iOS). Both client side components are HTML5 applications running on Chrome.
On the server side, WWM uses a combination of App Engine and Compute Engine. Compute Engine accepts WebSocket connections that are used to implement the bi-directional low latency communication between the controller and the screen. Compute Engine is also used to host the database servers and to build game stages. App Engine accepts HTTP requests, orchestrates Compute Engine instances, and connects the clients with available Compute Engine instances.
One of the interesting aspects of the architecture is how the strengths of both App Engine and Compute Engine are combined to create a scalable solution for real-time gaming. It is important to understand the various design considerations for each platform in order to "place the right person in the right job" when designing the architecture.
Using Google App Engine for Web Frontend
App Engine is used for the Web Frontends and connects Game Screens with available Compute Engine instances. App Engine takes HTTP requests, serves static files to the clients, and dispatches each game session to available WebSocket servers.
The biggest advantages of App Engine are its automatic scalability and availability. App Engine applications must follow the design guidelines related to each unique App Engine runtime environment and various App Engine service APIs. The runtime and APIs are carefully designed to abstract datacenters as one massive parallel computer that powers your application and isolates it from single machine failures. As a result of these powerful features, developers can access a platform that is highly tolerant to events such as huge traffic spikes, rapid change in the number of users and services, and datacenter-wide downtime.
By using App Engine as a load distributor, WWM was able to handle peak traffic, which reached 1,000 requests per second as shown in Figure 5. This was made possible by using the auto-scaling features of App Engine to smoothly dispatch real-time game sessions to the Compute Engine instances.
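The dispatch step can be pictured as a least-loaded selection over the pool of WebSocket servers. The sketch below is illustrative only, not WWM's actual frontend code; the server list, host names, and the 360-connection limit per instance (taken from the load-test figures later in this paper) are stand-ins for state the real frontend would track.

```javascript
// Minimal sketch of frontend dispatch: pick the least-loaded WebSocket
// server for a new game session. The server objects are hypothetical
// stand-ins for per-instance state the real frontend would maintain.
function pickWebSocketServer(servers) {
  // servers: [{ host, activeSessions, maxSessions }]
  const available = servers.filter(s => s.activeSessions < s.maxSessions);
  if (available.length === 0) return null; // every server is at capacity
  // Choose the server with the fewest active sessions.
  return available.reduce((best, s) =>
    s.activeSessions < best.activeSessions ? s : best);
}

const servers = [
  { host: 'ws1.example.com', activeSessions: 300, maxSessions: 360 },
  { host: 'ws2.example.com', activeSessions: 120, maxSessions: 360 },
];
console.log(pickWebSocketServer(servers).host); // ws2.example.com
```

Once a server is chosen, the frontend hands its address to both the Game Screen and the Game Controller so they can open WebSocket connections to the same instance.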
Using Compute Engine for WebSocket Server
Initially, the developers considered using only App Engine to implement the whole game service, and evaluated whether the Channel API and Socket API of App Engine could support the real-time communication requirements. However, they found that those APIs were not suited to implementing a bi-directional connection that can transmit messages in less than 200 milliseconds, so they decided to combine App Engine with Compute Engine and implement the connection with WebSocket servers.
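At its core, such a WebSocket server is a session registry that pairs each Game Screen with its Game Controller and forwards tilt messages between them. The sketch below models that routing logic in plain Node.js, independent of Socket.IO; the session IDs and message shape are assumptions for illustration, not WWM's actual protocol.

```javascript
// Sketch of the pairing/routing logic inside the WebSocket server:
// each game session joins one screen socket with one controller socket,
// and controller input is forwarded to the paired screen.
class SessionRegistry {
  constructor() {
    this.sessions = new Map(); // sessionId -> { screen, controller }
  }
  registerScreen(sessionId, screenSocket) {
    const s = this.sessions.get(sessionId) || {};
    s.screen = screenSocket;
    this.sessions.set(sessionId, s);
  }
  registerController(sessionId, controllerSocket) {
    const s = this.sessions.get(sessionId) || {};
    s.controller = controllerSocket;
    this.sessions.set(sessionId, s);
  }
  // Forward a tilt event from the controller to the paired screen.
  forwardTilt(sessionId, tilt) {
    const s = this.sessions.get(sessionId);
    if (!s || !s.screen) return false; // no screen to deliver to
    s.screen.send(JSON.stringify({ type: 'tilt', tilt }));
    return true;
  }
}

// Usage with a fake socket standing in for a real WebSocket connection:
const registry = new SessionRegistry();
const fakeScreen = { messages: [], send(m) { this.messages.push(m); } };
registry.registerScreen('abc123', fakeScreen);
registry.forwardTilt('abc123', { x: 0.2, y: -0.1 });
```

In the real system Socket.IO would own the sockets and reconnection handling; the registry only decides where each message goes, which is what keeps the controller-to-screen hop fast.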
Using Compute Engine for Stage Builder
Another interesting feature of WWM is that any website can be turned into a playable stage by using an on-demand rendering system.
Stage generation is completed by the Stage Builder, which runs on Compute Engine. Figure 5 shows how the stage generation process works:
The steps for stage generation are as follows:
- Game Screen sends the user specified URL of web site to WebSocket Server via WebSocket connection.
- WebSocket Server adds a rendering request message on the Database Server (MySQL) queue. The message is fetched by one of the Stage Builders.
- Stage Builder renders an image of the web site with PhantomJS. The positions of constructs such as img tags are generated as a JSON document.
- Stage Builder generates maze structure data by using OpenCV image processing and the construct position data.
- Stage Builder returns a JSON document, which includes the maze structure and images, to the Game Screen.
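The hand-off in step 2 is a classic work queue: the WebSocket Server enqueues a rendering request, and exactly one Stage Builder claims it. WWM keeps this queue in MySQL; the in-memory sketch below only illustrates the claim semantics (the row shape and status values are assumptions for illustration).

```javascript
// Sketch of the rendering-request queue between the WebSocket Server
// and the Stage Builders. The real queue lives in MySQL; this
// in-memory version only shows the claim semantics: each pending
// request is handed to exactly one builder.
class RenderQueue {
  constructor() {
    this.rows = [];
    this.nextId = 1;
  }
  enqueue(url) {
    this.rows.push({ id: this.nextId++, url, status: 'pending' });
  }
  // A Stage Builder claims the oldest pending request and marks it as
  // taken -- roughly what an atomic UPDATE ... WHERE status = 'pending'
  // LIMIT 1 accomplishes in SQL.
  claim(builderId) {
    const row = this.rows.find(r => r.status === 'pending');
    if (!row) return null; // nothing to do
    row.status = 'claimed';
    row.builder = builderId;
    return row;
  }
}

const queue = new RenderQueue();
queue.enqueue('https://example.com');
const job = queue.claim('builder-1');
// job.url === 'https://example.com'; a second claim finds nothing pending
```

Using the database as the queue keeps the WebSocket Server stateless with respect to rendering work: any Stage Builder instance can pick up any request, which is what lets the builder pool scale out independently.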
The screen image rendered by PhantomJS is then processed by OpenCV to generate the maze structure. This processing includes removing unnecessary areas, creating islands, connecting them with bridges, and placing a goal.
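The first image-processing step can be pictured as thresholding the rendered page into walkable and blocked cells. The toy function below is a drastically simplified stand-in for the OpenCV pipeline; the threshold value and grid representation are assumptions, and the real system does far more (island detection, bridges, goal placement).

```javascript
// Toy stand-in for the OpenCV step: threshold a grayscale bitmap of
// the rendered page into a walkable/blocked grid a maze can be built
// on. Bright cells (page background) become floor; dark cells become
// gaps. The 128 cutoff is an arbitrary illustrative choice.
function toMazeGrid(gray, threshold = 128) {
  // gray: 2D array of 0..255 luminance values
  return gray.map(row => row.map(v => (v >= threshold ? 1 : 0)));
}

const page = [
  [255, 255, 30],
  [40, 200, 220],
];
console.log(JSON.stringify(toMazeGrid(page))); // [[1,1,0],[0,1,1]]
```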
Challenges in Building a Cluster with Compute Engine
During development, there were two major challenges in building a cluster of Compute Engine instances. The first was determining the scalability of the Stage Builder. The second was designing fail-over for Compute Engine instances across different zones.
Load Testing of World Wide Maze
To determine the scalability of the Stage Builder, the developers conducted extensive load testing on the cluster of Stage Builder instances. The Stage Builder handles the creation of playable levels with PhantomJS and is very CPU-intensive. The upper limit on the rate of stages a single instance can create was determined from this load testing.
The following figure depicts the measurements resulting from load testing the Stage Builder, where 36 virtual clients continually sent rendering requests to a single instance. The average turnaround time for each stage building request was about 12 seconds.
The limits of the WebSocket servers were determined by load testing. The tests confirmed that the servers can handle 360 WebSocket connections with an average latency of 175 milliseconds.
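These measurements also give a per-instance throughput figure: by Little's law, 36 concurrent requests at an average turnaround of 12 seconds is about 3 stage builds per second per instance. The sketch below shows that capacity-planning arithmetic; the target load figure is hypothetical, not a number from WWM.

```javascript
// Capacity planning from the load-test numbers in the text.
// Little's law: throughput = concurrency / average latency.
function stagesPerSecond(concurrentClients, avgTurnaroundSec) {
  return concurrentClients / avgTurnaroundSec;
}

// How many Stage Builder instances are needed for a target load?
function instancesNeeded(targetStagesPerSec, perInstanceRate) {
  return Math.ceil(targetStagesPerSec / perInstanceRate);
}

const perInstance = stagesPerSecond(36, 12); // 3 stage builds/sec/instance
// Hypothetical target: sustain 20 stage builds per second at peak.
console.log(perInstance, instancesNeeded(20, perInstance)); // 3 7
```

Knowing this per-instance ceiling is what lets operators translate a traffic forecast into a concrete instance count before a spike arrives.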
Traffic increased suddenly after launch, but the administrators quickly provisioned additional Compute Engine instances for the Stage Builder and WebSocket servers. As a result, the traffic spike did not pose any serious problems.
Fail-over Between Compute Engine Instances in Different Zones
The second challenge was to handle both planned and unplanned downtime of Compute Engine instances, from a single instance all the way up to an availability zone (an isolated group of instances within a geographic region). In WWM, the developers followed best practices, deploying instances across multiple zones with systems in place to distribute load and quickly fail over between zones.
Two techniques are utilized by WWM to handle the planned and unplanned failures gracefully:
- Persistent Disk for data backup
- Switching external static IP addresses between Compute Engine instances
Persistent Disk for Data Backup
Persistent Disk is a highly scalable and available persistent storage service for Compute Engine. By using Persistent Disk for backup and recovery, data is kept safe and survives any unplanned or planned downtime of Compute Engine. WWM's Database Server uses Persistent Disk for periodic backups of the MySQL database. Before a planned Compute Engine maintenance window, an operator of WWM invokes a restoration process: the MySQL data backed up on Persistent Disk is restored onto Compute Engine instances reserved in another zone.
Switching External Static IP Addresses Between Compute Engine Instances
The second technique used for fail-over is static IP address switching. With Compute Engine, you can switch the external static IP addresses of instances almost instantly by using simple commands. This feature enables smooth migration between zones within a region without requiring a frontend such as a reverse proxy or load balancer.
WWM uses the following procedure to move instances out of a zone with an upcoming maintenance window:
- Switch the Web Frontend (App Engine) from the "production" version to the "maintenance" version to show the maintenance screen.
- Back up the database and put the backup file on Persistent Disk.
- Copy the Compute Engine instances to the destination zone, then switch the assignment of the external IP addresses from the original zone to the destination zone.
- Restore the Database Server records from the backup file on Persistent Disk.
- Check that the Compute Engine instances in the destination zone work properly by running tests.
- Switch the Web Frontend's version from "maintenance" back to "production."
Google Cloud Platform provides compelling technology for solving the challenges of building a production-quality real-time game with App Engine and Compute Engine. It is important for developers to understand the design constraints of these two technologies in order to leverage each to its full capabilities. When launching a production service, it is critical to conduct extensive load testing to validate the system's latency, scalability, and availability. By building on Google Cloud Platform and testing thoroughly, WWM was able to launch worldwide and survive the massive user spike that resulted from the media attention.