Building a real-time game that has the potential to receive global media attention and host a large number of simultaneous players can be a daunting challenge. Typically, it requires integrating various hardware, software, and platforms to satisfy common requirements of such a game, such as:
- Scalability to handle a large number of simultaneous users
- High availability to be tolerant of a datacenter-wide downtime
- Low latency to process player actions within a few hundred milliseconds
The easiest way to overcome these challenges is to learn from an existing production service. In this paper, we walk through World Wide Maze, a game that has solved these challenges by utilizing the Google Cloud Platform and cutting-edge web technologies.
Chrome Experiments: World Wide Maze
Chrome Experiments is a website created by Google to showcase creative web applications built with the latest HTML5 technologies. One of these is an interactive 3D maze game called World Wide Maze (WWM), developed by the Chrome team in cooperation with third-party partners.
In WWM, players specify a web site, and a 3D maze is generated from the visual layout of the web page. The player uses an Android/iOS device as a game controller, tilting the device to roll a ball through the maze.
One of the unique features of WWM is that all real-time interactions are orchestrated by Google Cloud Platform; specifically, Google Compute Engine and Google App Engine. To create a playable game experience, low latency between controller input and game screen is a key requirement.
When it was released, WWM was featured by many major online media outlets and attracted a surge of users. This resulted in a huge traffic spike (as shown in Figure 2), with 6,600 simultaneous users across the world, 7,000 page views per minute, and 1,000 requests per second. Because the system was designed to leverage the massive scalability of Google Cloud Platform, it handled the spike by smoothly scaling out the Compute Engine and App Engine instances.
One of the biggest challenges in building a real-time game is handling the persistent connections between clients and servers. In WWM, the real-time communication between the game controller and the game screen must complete within 200 milliseconds. This round trip includes:
- The request from the controller received by the server
- The request routed to the screen from the server
- The maze rendered on the screen based on the request
This paper focuses on how WWM was designed and developed to overcome these challenges by utilizing unique features of Google Cloud Platform.
Orchestrating Google Compute Engine Instances with Google App Engine
The architecture of the World Wide Maze utilizes many components of the Google Cloud Platform, as highlighted in Figure 3:
- Game Screen
- Game Controller
- Web Frontend Server (App Engine)
- WebSocket Server (Compute Engine)
  - Node.js + Socket.IO (handles WebSocket communication)
- Stage Builder (Compute Engine)
  - PhantomJS (responsible for game stage rendering)
  - OpenCV (responsible for game stage image processing)
- Database Server (Compute Engine)
  - MySQL (handles request queuing)
The client side has two components: the Game Screen (Chrome browser on PC/Mac/Linux that displays the game screen) and the Game Controller (Chrome on Android/iOS). Both client side components are HTML5 applications running on Chrome.
On the server side, WWM uses a combination of App Engine and Compute Engine. Compute Engine accepts WebSocket connections that are used to implement the bi-directional low latency communication between the controller and the screen. Compute Engine is also used to host the database servers and to build game stages. App Engine accepts HTTP requests, orchestrates Compute Engine instances, and connects the clients with available Compute Engine instances.
One of the interesting aspects of the architecture is how the strengths of both App Engine and Compute Engine are combined to create a scalable solution for real-time gaming. It is important to understand the various design considerations for each platform in order to "place the right person in the right job" when designing the architecture.
Using Google App Engine for Web Frontend
App Engine is used for the Web Frontends and connects Game Screens with available Compute Engine instances. App Engine takes HTTP requests, serves static files to the clients, and dispatches each game session to available WebSocket servers.
The biggest advantages of App Engine are its automatic scalability and availability. App Engine applications must follow the design guidelines related to each unique App Engine runtime environment and various App Engine service APIs. The runtime and APIs are carefully designed to abstract datacenters as one massive parallel computer that powers your application and isolates it from single machine failures. As a result of these powerful features, developers can access a platform that is highly tolerant to events such as huge traffic spikes, rapid change in the number of users and services, and datacenter-wide downtime.
By using App Engine as a load distributor, WWM was able to handle peak traffic, which reached 1,000 requests per second as shown in Figure 5. This was made possible by using the auto-scaling features of App Engine to smoothly dispatch real-time game sessions to the Compute Engine instances.
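The dispatch step can be pictured as a least-loaded selection over the pool of WebSocket servers. The sketch below is illustrative only, not WWM's actual frontend code; the server list, host names, and the 360-connection limit per instance (taken from the load-test figures later in this paper) are stand-ins for state the real frontend would track.

```javascript
// Minimal sketch of frontend dispatch: pick the least-loaded WebSocket
// server for a new game session. The server objects are hypothetical
// stand-ins for per-instance state the real frontend would maintain.
function pickWebSocketServer(servers) {
  // servers: [{ host, activeSessions, maxSessions }]
  const available = servers.filter(s => s.activeSessions < s.maxSessions);
  if (available.length === 0) return null; // every server is at capacity
  // Choose the server with the fewest active sessions.
  return available.reduce((best, s) =>
    s.activeSessions < best.activeSessions ? s : best);
}

const servers = [
  { host: 'ws1.example.com', activeSessions: 300, maxSessions: 360 },
  { host: 'ws2.example.com', activeSessions: 120, maxSessions: 360 },
];
console.log(pickWebSocketServer(servers).host); // ws2.example.com
```

Once a server is chosen, the frontend hands its address to both the Game Screen and the Game Controller so they can open WebSocket connections to the same instance.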
Using Compute Engine for WebSocket Server
Initially, the developers considered using only App Engine to implement the whole game service, and evaluated whether the Channel API and Socket API of App Engine could support the real-time communication requirements. However, they found that those APIs were not suited to implementing a bi-directional connection that can transmit messages in less than 200 milliseconds, so they decided to combine App Engine with Compute Engine and implement the connection with WebSocket servers.
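At its core, such a WebSocket server is a session registry that pairs each Game Screen with its Game Controller and forwards tilt messages between them. The sketch below models that routing logic in plain Node.js, independent of Socket.IO; the session IDs and message shape are assumptions for illustration, not WWM's actual protocol.

```javascript
// Sketch of the pairing/routing logic inside the WebSocket server:
// each game session joins one screen socket with one controller socket,
// and controller input is forwarded to the paired screen.
class SessionRegistry {
  constructor() {
    this.sessions = new Map(); // sessionId -> { screen, controller }
  }
  registerScreen(sessionId, screenSocket) {
    const s = this.sessions.get(sessionId) || {};
    s.screen = screenSocket;
    this.sessions.set(sessionId, s);
  }
  registerController(sessionId, controllerSocket) {
    const s = this.sessions.get(sessionId) || {};
    s.controller = controllerSocket;
    this.sessions.set(sessionId, s);
  }
  // Forward a tilt event from the controller to the paired screen.
  forwardTilt(sessionId, tilt) {
    const s = this.sessions.get(sessionId);
    if (!s || !s.screen) return false; // no screen to deliver to
    s.screen.send(JSON.stringify({ type: 'tilt', tilt }));
    return true;
  }
}

// Usage with a fake socket standing in for a real WebSocket connection:
const registry = new SessionRegistry();
const fakeScreen = { messages: [], send(m) { this.messages.push(m); } };
registry.registerScreen('abc123', fakeScreen);
registry.forwardTilt('abc123', { x: 0.2, y: -0.1 });
```

In the real system Socket.IO would own the sockets and reconnection handling; the registry only decides where each message goes, which is what keeps the controller-to-screen hop fast.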
Using Compute Engine for Stage Builder
Another interesting feature of WWM is that any website can be turned into a playable stage by using an on-demand rendering system.
Stage generation is completed by the Stage Builder, which runs on Compute Engine. Figure 5 shows how the stage generation process works:
The steps for stage generation are as follows:
- Game Screen sends the user specified URL of web site to WebSocket Server via WebSocket connection.
- WebSocket Server adds a rendering request message on the Database Server (MySQL) queue. The message is fetched by one of the Stage Builders.
- Stage Builder renders an image of the web site with PhantomJS. The positions of constructs such as img tags are generated as a JSON document.
- Stage Builder generates maze structure data by using OpenCV image processing and the construct position data.
- Stage Builder returns a JSON document, which includes the maze structure and images, to the Game Screen.
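The hand-off in step 2 is a classic work queue: the WebSocket Server enqueues a rendering request, and exactly one Stage Builder claims it. WWM keeps this queue in MySQL; the in-memory sketch below only illustrates the claim semantics (the row shape and status values are assumptions for illustration).

```javascript
// Sketch of the rendering-request queue between the WebSocket Server
// and the Stage Builders. The real queue lives in MySQL; this
// in-memory version only shows the claim semantics: each pending
// request is handed to exactly one builder.
class RenderQueue {
  constructor() {
    this.rows = [];
    this.nextId = 1;
  }
  enqueue(url) {
    this.rows.push({ id: this.nextId++, url, status: 'pending' });
  }
  // A Stage Builder claims the oldest pending request and marks it as
  // taken -- roughly what an atomic UPDATE ... WHERE status = 'pending'
  // LIMIT 1 accomplishes in SQL.
  claim(builderId) {
    const row = this.rows.find(r => r.status === 'pending');
    if (!row) return null; // nothing to do
    row.status = 'claimed';
    row.builder = builderId;
    return row;
  }
}

const queue = new RenderQueue();
queue.enqueue('https://example.com');
const job = queue.claim('builder-1');
// job.url === 'https://example.com'; a second claim finds nothing pending
```

Using the database as the queue keeps the WebSocket Server stateless with respect to rendering work: any Stage Builder instance can pick up any request, which is what lets the builder pool scale out independently.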
The screen image rendered by PhantomJS is then processed by OpenCV to generate the maze structure. This processing includes removing unnecessary areas, creating islands, connecting them with bridges, and placing a goal.
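The first image-processing step can be pictured as thresholding the rendered page into walkable and blocked cells. The toy function below is a drastically simplified stand-in for the OpenCV pipeline; the threshold value and grid representation are assumptions, and the real system does far more (island detection, bridges, goal placement).

```javascript
// Toy stand-in for the OpenCV step: threshold a grayscale bitmap of
// the rendered page into a walkable/blocked grid a maze can be built
// on. Bright cells (page background) become floor; dark cells become
// gaps. The 128 cutoff is an arbitrary illustrative choice.
function toMazeGrid(gray, threshold = 128) {
  // gray: 2D array of 0..255 luminance values
  return gray.map(row => row.map(v => (v >= threshold ? 1 : 0)));
}

const page = [
  [255, 255, 30],
  [40, 200, 220],
];
console.log(JSON.stringify(toMazeGrid(page))); // [[1,1,0],[0,1,1]]
```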
Challenges in Building a Cluster with Compute Engine
During development, there were two major challenges in building a cluster of Compute Engine instances. The first was determining the scalability of the Stage Builder. The second was designing fail-over for Compute Engine instances across different zones.
Load Testing of World Wide Maze
To determine the scalability of the Stage Builder, the developers conducted extensive load testing on the cluster of Stage Builder instances. The Stage Builder handles the creation of playable levels with PhantomJS and is very CPU-intensive. The upper limit on the rate of stages a single instance can create was determined from this load testing.
The following figure depicts the measurements resulting from load testing the Stage Builder, where 36 virtual clients continually sent rendering requests to a single instance. The average turnaround time for each stage building request was about 12 seconds.
The limits of the WebSocket servers were determined by load testing. The tests confirmed that the servers can handle 360 WebSocket connections with an average latency of 175 milliseconds.
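These measurements also give a per-instance throughput figure: by Little's law, 36 concurrent requests at an average turnaround of 12 seconds is about 3 stage builds per second per instance. The sketch below shows that capacity-planning arithmetic; the target load figure is hypothetical, not a number from WWM.

```javascript
// Capacity planning from the load-test numbers in the text.
// Little's law: throughput = concurrency / average latency.
function stagesPerSecond(concurrentClients, avgTurnaroundSec) {
  return concurrentClients / avgTurnaroundSec;
}

// How many Stage Builder instances are needed for a target load?
function instancesNeeded(targetStagesPerSec, perInstanceRate) {
  return Math.ceil(targetStagesPerSec / perInstanceRate);
}

const perInstance = stagesPerSecond(36, 12); // 3 stage builds/sec/instance
// Hypothetical target: sustain 20 stage builds per second at peak.
console.log(perInstance, instancesNeeded(20, perInstance)); // 3 7
```

Knowing this per-instance ceiling is what lets operators translate a traffic forecast into a concrete instance count before a spike arrives.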
Traffic increased suddenly after launch, but the administrators quickly provisioned additional Compute Engine instances for the Stage Builder and WebSocket servers. As a result, the traffic spike did not pose any serious problems.
Fail-over Between Compute Engine Instances in Different Zones
The second challenge was to handle both planned and unplanned downtime of Compute Engine instances, from a single instance all the way up to an availability zone (an isolated group of instances within a geographic region). In WWM, the developers followed best practices, deploying instances across multiple zones with systems in place to distribute load and quickly fail over between zones.
Two techniques are utilized by WWM to handle the planned and unplanned failures gracefully:
- Persistent Disk for data backup
- Switching external static IP addresses between Compute Engine instances
Persistent Disk for Data Backup
Persistent Disk is a highly scalable and available persistent storage service for Compute Engine. By using Persistent Disk for backup and recovery, data is kept safe and survives any unplanned or planned downtime of Compute Engine. WWM's Database Server uses Persistent Disk for periodic backups of the MySQL database. Before a planned Compute Engine maintenance window, an operator of WWM invokes a restoration process: the MySQL data backed up on Persistent Disk is restored onto Compute Engine instances reserved in another zone.
Switching External Static IP Addresses Between Compute Engine Instances
The second technique used for fail-over is static IP address switching. With Compute Engine, you can switch the external static IP addresses of instances almost instantly by using simple commands. This feature enables smooth migration between zones within a region without requiring a frontend such as a reverse proxy or load balancer.
WWM uses the following procedure to move instances out of a zone with an upcoming maintenance window:
- Switch the Web Frontend (App Engine) from the "production" version to the "maintenance" version to show the maintenance screen.
- Back up the database and put the backup file on Persistent Disk.
- Copy the Compute Engine instances to the destination zone, then switch the assignment of the external IP addresses from the original zone to the destination zone.
- Restore the Database Server records from the backup file on Persistent Disk.
- Check that the Compute Engine instances in the destination zone work properly by running tests.
- Switch the Web Frontend's version from "maintenance" back to "production."
Google Cloud Platform provides compelling technology for solving the challenges of building a production-quality real-time game with App Engine and Compute Engine. It is important for developers to understand the design constraints of these two technologies in order to leverage each to its full capabilities. When launching a production service, it is critical to conduct extensive load testing to validate the system's latency, scalability, and availability. By building on Google Cloud Platform and testing thoroughly, WWM was able to launch worldwide and survive the massive user spike that resulted from the media attention.