Real-time Gaming with Node.js + WebSocket on Google Cloud Platform

Building a real-time game that has the potential to receive global media attention and host a large number of simultaneous players can be a daunting challenge. Typically, it requires integrating various hardware, software, and platforms to satisfy the common requirements of such a game, including:

  • Scalability to handle a large number of simultaneous users
  • High availability to be tolerant of a datacenter-wide downtime
  • Low latency to process player actions within a few hundred milliseconds

The easiest way to overcome these challenges is to learn from an existing production service. In this paper, we walk through World Wide Maze, a game that has solved these challenges by utilizing the Google Cloud Platform and cutting-edge web technologies.

Chrome Experiments: World Wide Maze

Chrome Experiments is a website created by Google to showcase creative web applications built with the latest HTML5 technologies. One of these is an interactive 3D maze game called World Wide Maze (WWM), developed by the Chrome team in cooperation with third-party partners.

In WWM, players specify a web site, and a 3D maze is generated from the visual layout of that page. The player uses an Android/iOS device as a game controller, tilting the device to roll a ball through the maze.

Figure 1: World Wide Maze: Real-time 3D maze game playable with Chrome and mobile devices

One of the unique features of WWM is that all real-time interactions are orchestrated by Google Cloud Platform; specifically, Google Compute Engine and Google App Engine. To create a playable game experience, low latency between controller input and game screen is a key requirement.

When it was released, WWM was featured in many major online media outlets and attracted a surge of new users. This resulted in a huge traffic spike (as shown in Figure 2), with 6,600 simultaneous users across the world, 7,000 page views per minute, and 1,000 requests per second. Because the system was designed to use the massive scalability of Google Cloud Platform, it handled the spike by smoothly scaling out the Compute Engine and App Engine instances.

Figure 2: Google Analytics Report for WWM Traffic Analysis

One of the biggest challenges in building a real-time game is handling the persistent connections between clients and servers. In WWM, the real-time communication between the game controller and the game screen must complete within 200 milliseconds. This budget covers:

  • The request from the controller received by the server
  • The request routed to the screen from the server
  • The maze rendered on the screen based on the request
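The 200 millisecond target can be treated as a budget split across the three steps above. The following is a minimal sketch of such a check. The per-step timings are hypothetical values chosen for illustration and are not measurements from WWM; the function and property names are likewise assumptions.

```javascript
// Target stated in the paper for controller-to-screen communication.
const BUDGET_MS = 200;

// Sums per-step timings (in milliseconds) and compares them to the budget.
function checkLatencyBudget(stages, budgetMs = BUDGET_MS) {
  const totalMs = Object.values(stages).reduce((sum, ms) => sum + ms, 0);
  return {
    totalMs,
    withinBudget: totalMs <= budgetMs,
    headroomMs: budgetMs - totalMs,
  };
}

// Hypothetical timings for the three steps; not taken from WWM.
const sample = checkLatencyBudget({
  controllerToServer: 60,
  serverToScreen: 60,
  renderOnScreen: 50,
});
console.log(sample);
```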

This paper focuses on how WWM was designed and developed to overcome these challenges by utilizing unique features of Google Cloud Platform.

Architecture Overview

Orchestrating Google Compute Engine Instances with Google App Engine

The architecture of the World Wide Maze utilizes many components of the Google Cloud Platform, as highlighted in Figure 3:

Figure 3: Client and server components of World Wide Maze

  1. Clients
    • Game Screen
    • Game Controller
  2. Servers
    • Web Frontend Server (App Engine)
    • WebSocket Server (Compute Engine)
      • Node.js + Socket.IO (handles WebSocket communication)
    • Stage Builder (Compute Engine)
      • PhantomJS (responsible for game stage rendering)
      • OpenCV (responsible for game stage image processing)
    • Database Server (Compute Engine)
      • MySQL (handles request queuing)

The client side has two components: the Game Screen (Chrome browser on PC/Mac/Linux that displays the game screen) and the Game Controller (Chrome on Android/iOS). Both client side components are HTML5 applications running on Chrome.

On the server side, WWM uses a combination of App Engine and Compute Engine. Compute Engine accepts WebSocket connections that are used to implement the bi-directional low latency communication between the controller and the screen. Compute Engine is also used to host the database servers and to build game stages. App Engine accepts HTTP requests, orchestrates Compute Engine instances, and connects the clients with available Compute Engine instances.
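The controller-to-screen path can be pictured as a relay that pairs two connections under one session and forwards messages between them. The sketch below is not WWM's implementation: it uses Node.js's built-in EventEmitter as a stand-in for a real WebSocket or Socket.IO connection, and the class names, role names, and message shape are all assumptions made for illustration.

```javascript
const { EventEmitter } = require("events");

// Stand-in for a network socket; a real server would wrap a WebSocket here.
class FakeSocket extends EventEmitter {
  constructor() {
    super();
    this.sent = []; // records what the server pushed to this client
  }
  send(message) {
    this.sent.push(message);
  }
}

// Pairs a screen and a controller by session ID and forwards messages.
class SessionRelay {
  constructor() {
    this.sessions = new Map(); // sessionId -> { screen, controller }
  }

  join(sessionId, role, socket) {
    if (role !== "screen" && role !== "controller") {
      throw new Error(`unknown role: ${role}`);
    }
    const session = this.sessions.get(sessionId) ?? {};
    session[role] = socket;
    this.sessions.set(sessionId, session);

    const peerRole = role === "screen" ? "controller" : "screen";
    socket.on("message", (message) => {
      const peer = this.sessions.get(sessionId)?.[peerRole];
      if (peer) peer.send(message); // dropped if the peer has not joined yet
    });
  }
}

const relay = new SessionRelay();
const screen = new FakeSocket();
const controller = new FakeSocket();
relay.join("game-1", "screen", screen);
relay.join("game-1", "controller", controller);

// Simulate a tilt event arriving from the controller.
controller.emit("message", { tiltX: 0.2, tiltY: -0.1 });
console.log(screen.sent);
```

Keeping the pairing in a map keyed by session ID means the relay does no work beyond a lookup per message, which matters when the whole round trip has a tight latency target.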

One of the interesting aspects of the architecture is how the strengths of both App Engine and Compute Engine are combined to create a scalable solution for real-time gaming. It is important to understand the various design considerations for each platform in order to "place the right person in the right job" when designing the architecture.

Using Google App Engine for Web Frontend

App Engine is used for the Web Frontends and connects Game Screens with available Compute Engine instances. App Engine takes HTTP requests, serves static files to the clients, and dispatches each game session to available WebSocket servers.

Figure 4: Using App Engine for Web Frontend

The biggest advantages of App Engine are its automatic scalability and availability. App Engine applications must follow the design guidelines related to each unique App Engine runtime environment and various App Engine service APIs. The runtime and APIs are carefully designed to abstract datacenters as one massive parallel computer that powers your application and isolates it from single machine failures. As a result of these powerful features, developers can access a platform that is highly tolerant to events such as huge traffic spikes, rapid change in the number of users and services, and datacenter-wide downtime.

Figure 5: Traffic peak on WWM's App Engine admin console

By using App Engine as a load distributor, WWM was able to handle peak traffic, which reached 1,000 requests per second as shown in Figure 5. This was made possible by using the auto-scaling features of App Engine to smoothly dispatch the real-time game sessions to the Compute Engine instances.
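The paper does not describe how the frontend chooses a WebSocket server for each session. One plausible policy, sketched here purely as an assumption, is to pick the healthy server with the fewest open connections that is still under a per-server cap. The host names, the `healthy` flag, and the `pickServer` function are hypothetical; the cap of 360 borrows the per-server figure reported in the load testing section.

```javascript
// Returns the least-loaded healthy server under the cap, or null if none fits.
function pickServer(servers, maxConnections) {
  const candidates = servers.filter(
    (s) => s.healthy && s.connections < maxConnections
  );
  if (candidates.length === 0) return null;
  return candidates.reduce((best, s) =>
    s.connections < best.connections ? s : best
  );
}

// Hypothetical server registry; a real frontend would read this from shared state.
const servers = [
  { host: "ws-1.example.com", connections: 340, healthy: true },
  { host: "ws-2.example.com", connections: 120, healthy: true },
  { host: "ws-3.example.com", connections: 10, healthy: false },
];

const chosen = pickServer(servers, 360);
console.log(chosen ? chosen.host : "no capacity");
```

Returning `null` instead of throwing lets the caller decide what a full cluster means, for example showing a waiting screen while more instances are provisioned.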

Using Compute Engine for WebSocket Server

The developers decided to combine App Engine with Compute Engine to implement the connection with WebSocket servers.

Figure 6: Using Compute Engine for WebSocket Server

The WebSocket server is implemented with Node.js and Socket.IO running on the Compute Engine instances. Node.js is a popular JavaScript runtime that features an event-driven, non-blocking I/O model well suited to real-time applications. Socket.IO is a popular library for Node.js that provides a real-time transport between the web browser and the Node.js server. It supports various protocols and methods for real-time transport, including WebSocket, Flash socket, Comet, and polling. By default, Socket.IO tries to connect with the WebSocket protocol for better performance. If it detects that the protocol is not available on a particular network, it falls back to the other transports and still establishes a real-time connection, though potentially with lower performance.
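The fallback behavior amounts to walking an ordered preference list and taking the first transport that works on the current network. The sketch below illustrates that idea only; it does not use or reproduce Socket.IO's API, and the transport labels and the `chooseTransport` function are assumptions made for this example.

```javascript
// Ordered from most to least preferred. Labels are illustrative,
// not Socket.IO's own transport identifiers.
const PREFERRED_TRANSPORTS = ["websocket", "flashsocket", "comet", "polling"];

// Returns the first transport for which isAvailable(name) is true, else null.
function chooseTransport(isAvailable, preferred = PREFERRED_TRANSPORTS) {
  for (const name of preferred) {
    if (isAvailable(name)) return name;
  }
  return null;
}

// Hypothetical network where a proxy blocks the two socket-based transports.
const blockedByProxy = new Set(["websocket", "flashsocket"]);
const transport = chooseTransport((name) => !blockedByProxy.has(name));
console.log(transport); // prints "comet"
```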

Using Compute Engine for Stage Builder

Another interesting feature of WWM is that any website can be turned into a playable stage by using an on-demand rendering system.

Figure 7: Using Compute Engine for Stage Builder

Stage generation is completed by the Stage Builder, which runs on Compute Engine. Figure 8 shows how the stage generation process works:

Figure 8: Stage generation process

The steps for stage generation are as follows:

  1. The Game Screen sends the user-specified URL of the web site to the WebSocket Server over the WebSocket connection.
  2. The WebSocket Server adds a rendering request message to the queue on the Database Server (MySQL). The message is fetched by one of the Stage Builders.
  3. The Stage Builder renders an image of the web site with PhantomJS. The positions of constructs such as div or img tags are generated as a JSON document.
  4. The Stage Builder generates maze structure data by using OpenCV image processing and the construct position data.
  5. The Stage Builder returns a JSON document, which includes the maze structure and images, to the Game Screen.
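The queue-based hand-off in steps 2 through 5 can be sketched as follows. This is an illustration under stated assumptions, not WWM's code: an in-memory array stands in for the MySQL queue table, `renderPage` and `buildMaze` are hypothetical stubs standing in for the PhantomJS and OpenCV stages, and the row fields and status values are invented for the example.

```javascript
// In-memory stand-in for the MySQL-backed request queue.
class RenderQueue {
  constructor() {
    this.rows = [];
    this.nextId = 1;
  }
  // Step 2: the WebSocket server records a rendering request.
  enqueue(url) {
    const row = { id: this.nextId++, url, status: "pending", builder: null };
    this.rows.push(row);
    return row.id;
  }
  // A builder claims the oldest pending request, if any.
  claim(builderName) {
    const row = this.rows.find((r) => r.status === "pending");
    if (!row) return null;
    row.status = "building";
    row.builder = builderName;
    return row;
  }
  complete(id, result) {
    const row = this.rows.find((r) => r.id === id);
    if (!row) throw new Error(`no such request: ${id}`);
    row.status = "done";
    row.result = result;
  }
}

// Hypothetical stub for step 3: returns fixed element positions for any URL.
function renderPage(url) {
  return {
    url,
    elements: [
      { tag: "div", x: 0, y: 0, width: 300, height: 100 },
      { tag: "img", x: 20, y: 120, width: 200, height: 150 },
    ],
  };
}

// Hypothetical stub for step 4: one island per page element.
function buildMaze(rendered) {
  return { source: rendered.url, islands: rendered.elements.length };
}

// One polling pass of a Stage Builder: claim, render, build, report.
function runBuilderOnce(queue, builderName) {
  const job = queue.claim(builderName);
  if (!job) return null;
  const stage = buildMaze(renderPage(job.url));
  queue.complete(job.id, stage);
  return stage;
}

const queue = new RenderQueue();
queue.enqueue("https://example.com/");
const stage = runBuilderOnce(queue, "builder-1");
console.log(stage);
```

Decoupling the WebSocket servers from the Stage Builders through a queue lets the CPU-heavy builders scale independently of the connection-heavy WebSocket tier. A real database-backed queue would also need the claim step to be atomic so that two builders cannot take the same request.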

PhantomJS is a headless web browser, that is, a full browser engine designed to run on servers without a display. The tool automates web page rendering for various purposes, including automated testing, screen capture, web page scraping, and web site performance monitoring. Developers can write JavaScript or CoffeeScript to instruct PhantomJS to execute required processes, such as taking a screenshot.

The screen image rendered by PhantomJS is then processed by OpenCV to generate the maze structure. The processing includes removing unnecessary areas, creating the islands, connecting them with bridges, and placing a goal.

Challenges in Building a Cluster with Compute Engine

During development, there were two major challenges in building a cluster of Compute Engine instances. The first challenge was to determine the scalability of the Stage Builder. The second challenge was to design fail-over for the Compute Engine instances across different zones.

Load Testing of World Wide Maze

To determine the scalability of Stage Builder, the developers conducted extensive load testing on the cluster of Stage Builder instances. Stage Builder creates playable levels by using PhantomJS and is very CPU-intensive. The load tests established the maximum rate at which a single instance can create stages.

The following figure depicts the measurements resulting from load testing the Stage Builder, where 36 virtual clients continually sent rendering requests to a single instance. The average turnaround time for each stage building request was about 12 seconds.

Figure 9: Load Testing result of Stage Builder (CPU usage)
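The figures above imply a rough per-instance throughput. Assuming a closed-loop test, in which each of the 36 virtual clients sends its next request as soon as the previous one completes, Little's Law gives throughput as concurrency divided by average turnaround time. The closed-loop assumption and the function name are ours; the paper states only the client count and the average turnaround.

```javascript
// Little's Law for a closed loop: throughput = concurrency / average time in system.
function closedLoopThroughput(concurrentClients, avgTurnaroundSec) {
  if (avgTurnaroundSec <= 0) {
    throw new RangeError("average turnaround must be positive");
  }
  return concurrentClients / avgTurnaroundSec;
}

// 36 clients and a 12 second average turnaround, as reported for the test.
const stagesPerSecond = closedLoopThroughput(36, 12);
console.log(`${stagesPerSecond} stages per second per instance`);
```

Under that assumption, a single Stage Builder instance sustains about 3 stages per second at this load.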

The limits of the WebSocket servers were determined by load testing. The tests confirmed that the servers can handle 360 WebSocket connections with an average latency of 175 milliseconds.
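A per-server limit like this translates directly into a cluster size estimate. The sketch below applies it to the peak of 6,600 simultaneous users reported earlier. How many WebSocket connections each user holds is not stated in the paper, so `connectionsPerUser` is a hypothetical parameter, and the resulting server counts are illustrative rather than the numbers WWM actually ran.

```javascript
// Estimates how many servers are needed to hold all connections under the cap.
function serversNeeded(users, connectionsPerUser, maxConnectionsPerServer) {
  const totalConnections = users * connectionsPerUser;
  return Math.ceil(totalConnections / maxConnectionsPerServer);
}

// One connection per user (assumption).
const oneConnection = serversNeeded(6600, 1, 360);
// Two connections per user, for example a screen plus a controller (assumption).
const twoConnections = serversNeeded(6600, 2, 360);
console.log({ oneConnection, twoConnections });
```

The estimate leaves no headroom for failures or uneven load, so a production deployment would provision above it.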

A sudden increase in traffic occurred after launch, but the administrators quickly provisioned additional Compute Engine instances for the Stage Builder and the WebSocket server. As a result, the traffic spike did not pose any serious problems.

Fail-over Between Compute Engine Instances in Different Zones

The second challenge was to handle both planned and unplanned downtime of Compute Engine instances, from a single instance all the way up to an entire zone (an isolated location within a geographical region). In WWM, the developers followed best practices, deploying instances across multiple zones with systems in place to distribute load and quickly fail over between zones.

Two techniques are utilized by WWM to handle the planned and unplanned failures gracefully:

  • Persistent Disk for data backup
  • Switching external static IP addresses between Compute Engine instances

Persistent Disk for Data Backup

Persistent Disk is a highly scalable and available persistent storage device for Compute Engine. By utilizing Persistent Disk for backup and recovery, data survives any planned or unplanned downtime of Compute Engine instances. WWM's Database Server uses Persistent Disk for periodic backups of the MySQL database. Before a planned Compute Engine maintenance window, a WWM operator invokes a restoration process: the operator takes the MySQL data that is backed up on Persistent Disk and restores it on Compute Engine instances reserved in another zone.

Switching External Static IP Addresses Between Compute Engine Instances

The second technique used for fail-over is static IP address switching. With Compute Engine, you can switch external static IP addresses between Compute Engine instances instantly by using simple commands. This feature enables smooth migration between the zones within a region without requiring a frontend such as a reverse proxy or load balancer.

WWM uses the following procedure to move instances out of a zone with an upcoming maintenance window:

  1. On the Web Frontend (App Engine), switch from the "production" version to the "maintenance" version to show the maintenance screen.
  2. Back up the database and put the backup file on Persistent Disk.
  3. Copy the Compute Engine instances to the destination zone, then switch the assignment of external IP addresses from the instances in the original zone to those in the destination zone.
  4. Restore the Database Server records from the backup file on Persistent Disk.
  5. Check that the Compute Engine instances in the destination zone work properly by running tests.
  6. Switch the Web Frontend's version from "maintenance" back to "production."
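A procedure like this is easier to run safely when it is encoded as an ordered runbook that halts at the first failing step, so the frontend stays in maintenance mode until every check passes. The sketch below simulates that control flow against an in-memory state object. It does not call any Google Cloud API or command-line tool; the step names, the state fields, and the zone labels are all hypothetical, and real steps would likely be asynchronous.

```javascript
// Runs steps in order and stops at the first one that throws.
function runMigration(steps) {
  const completed = [];
  for (const step of steps) {
    try {
      step.run();
    } catch (err) {
      return { ok: false, failedAt: step.name, completed, error: err.message };
    }
    completed.push(step.name);
  }
  return { ok: true, completed };
}

// Simulated system state; stands in for the real frontend, disk, and instances.
const state = { frontend: "production", backup: null, ipOwner: "zone-a", dbRestored: false };

const steps = [
  { name: "enter maintenance", run: () => { state.frontend = "maintenance"; } },
  { name: "back up database", run: () => { state.backup = "backup-on-persistent-disk"; } },
  { name: "copy instances and move static IP", run: () => { state.ipOwner = "zone-b"; } },
  {
    name: "restore database",
    run: () => {
      if (!state.backup) throw new Error("no backup available");
      state.dbRestored = true;
    },
  },
  {
    name: "run checks",
    run: () => {
      if (state.ipOwner !== "zone-b" || !state.dbRestored) throw new Error("checks failed");
    },
  },
  { name: "leave maintenance", run: () => { state.frontend = "production"; } },
];

const migration = runMigration(steps);
console.log(migration.ok, state.frontend);
```

Because the switch back to "production" is the last step, any earlier failure leaves the maintenance screen in place rather than exposing players to a half-migrated cluster.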

Conclusion

Google Cloud Platform, with App Engine and Compute Engine, provides compelling technology for solving the challenges of building a production-quality real-time game. It is important for developers to understand the design constraints of these two technologies in order to use each to its full capabilities. When launching a production service, it is critical to conduct extensive load testing on the system to verify its feasibility in terms of latency, scalability, and availability. By building on Google Cloud Platform and testing thoroughly, WWM was able to launch worldwide and survive the massive user spike that resulted from the media attention.
