Building a Mobile Gaming Analytics Platform - a Reference Architecture

Mobile games generate a large amount of player telemetry and game-event data, which has the potential to provide insight into player behavior and engagement with the game. The nature of mobile games (a large number of client devices, irregular and slow internet connections, and battery and power management concerns) means that player telemetry and game-event analytics face unique challenges.

This reference architecture provides a high-level approach to collecting, storing, and analyzing large amounts of player telemetry data on Google Cloud.

More specifically, you will learn two main architecture patterns for analyzing mobile game events:

  • Real-time processing of individual events using a streaming processing pattern
  • Bulk processing of aggregated events using a batch processing pattern

Reference architecture for game telemetry

Figure 1: Game telemetry reference architecture

Figure 1 illustrates the processing pipelines for both patterns, as well as some optional components to further explore, visualize, and share the results. The reference architecture is highly available and lets you scale as your data volumes increase. Note also that this architecture is composed solely of managed services for your data-analytics pipelines, eliminating the need to run virtual machines or manage operating systems. This is especially true if your game server handles user authentication through App Engine. The rest of this article takes you through this architecture step by step.

Real-time event processing using streaming pattern

This section explores an example architecture pattern that ingests, processes, and analyzes a large number of events concurrently from many different sources. The processing happens as events in your game unfold, enabling you to respond and make decisions in real time.

Many mobile games generate a large number of event messages. Some are triggered by the player, others by the time of day and so on. As a consequence, the data sets are unbounded and you don't know how many events to process in advance. Therefore, the right approach is to process the data using a streaming execution engine.

Imagine that your mobile app is a role-playing game that allows players to fight evil forces by attempting quests to defeat powerful monsters. To track the progress of the player, a typical event message includes a unique identifier for the player, an event timestamp, metrics indicating the quest, the player's current health, and so forth. It might look something like the following example, which shows an end-of-battle event message with an event identifier of playerkillednpc.
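The following sketch shows what such a message might contain; the field names and values are illustrative, not a required schema:

```json
{
  "eventId": "playerkillednpc",
  "playerId": "93d21a4e-8f11-4d44-9b09-6a3f2f9c1c01",
  "eventTime": "2024-05-01T18:23:54Z",
  "questId": 15,
  "playerHealth": 42,
  "playerLevel": 12
}
```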


This example featured a battle-related event, but event messages can include any kind of information that is relevant to your business, for example, in-game purchase events.

Since you can’t predict what questions you’ll want to ask about your data in the future, it’s a good strategy to log as many data points as you can; this provides additional context for your future queries. For example, what is more insightful: the fact that a player made an in-game purchase costing 50 cents, or that they bought a potent magic spell against the boss in quest 15, and that the boss had killed the player five times in a row in the 30 minutes preceding the purchase? Capturing rich event data gives you detailed insight into exactly what’s going on in your game.

Message source: mobile or game server?

Regardless of the content of your event messages, you must decide whether to send them directly from the end-user device running the mobile app to the Pub/Sub ingestion layer, or to route them through the game server. The main benefit of going through the game server is that authentication and validation are handled by your application. A drawback is that additional server processing capacity is required to handle the load of event messages coming in from mobile devices.

Processing game client and game server events in real time

Figure 2: Real-time processing of events from game clients and game servers

Regardless of the source of the data, the backend architecture remains largely the same. As shown in Figure 2, there are five main parts:

  1. Real-time event messages are sent by a large number of sources, for example millions of mobile apps.
  2. Authentication is handled by the game server.
  3. Pub/Sub ingests and temporarily stores these messages.
  4. Dataflow transforms the JSON event into structured, schema-based data.
  5. That data is loaded into the BigQuery analytics engine.

Pub/Sub: ingesting events at scale

To handle this load, you need a scalable service capable of receiving and temporarily storing these event messages. Since each individual event is quite small in size, the total number of messages is more of a concern than the overall storage requirement.

Another requirement for this ingestion service is to allow for multiple output methods, so that events can be consumed by multiple destinations. Finally, there should be a choice between the service acting as a queue, where each destination polls to fetch new messages, and a push method that sends events proactively the moment they are received.

Fortunately, the Pub/Sub service provides all of this functionality. Figure 3 shows the recommended ingestion layer, capable of processing millions of messages per second and storing them on persistent storage for up to 7 days. Pub/Sub implements a publish/subscribe pattern in which one or more publishers send messages to one or more topics, and each topic can have multiple subscribers.

Pub/Sub publish subscribe model with persistent storage

Figure 3: Pub/Sub publish subscribe model with persistent storage

You can choose the number of topics and how events are grouped into them. Because there is a direct relationship between the number of topics and the number of Dataflow pipelines you create, it's best to group logically connected events. For example, a publisher can be an individual mobile device or a game server, and many publishers can send to a single topic. All that is required is the ability to authenticate and send a properly formatted message over HTTPS.
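As a sketch, and assuming the google-cloud-pubsub Python client library with hypothetical project and topic names, publishing a telemetry event could look like this (only the payload construction below uses the standard library; the publish step requires the client library and valid credentials):

```python
import json


def build_event_payload(event_id, player_id, **fields):
    """Serialize a telemetry event as UTF-8 JSON bytes for Pub/Sub."""
    message = {"eventId": event_id, "playerId": player_id}
    message.update(fields)
    return json.dumps(message).encode("utf-8")


def publish_event(project_id, topic_id, payload):
    """Publish the payload; assumes google-cloud-pubsub is installed."""
    from google.cloud import pubsub_v1  # deferred import: optional dependency

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, data=payload)
    return future.result()  # blocks until Pub/Sub returns a message ID


payload = build_event_payload(
    "playerkillednpc", "player-1234", questId=15, playerHealth=42
)
# Hypothetical project and topic names:
# publish_event("my-game-project", "telemetry-events", payload)
```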

Once a message—in this case a player telemetry event—is received by the Pub/Sub service, it is durably stored in the Message Store until every Topic Subscription has retrieved that message.

Dataflow streaming pipeline

Dataflow provides a high-level language for describing data-processing pipelines, which you can run using the Dataflow managed service. The Dataflow pipeline runs in streaming mode, receiving messages from the Pub/Sub topic through a subscription as they arrive, and performs any required processing before adding them to a BigQuery table.

This processing can take the form of simple element-wise operations, like making all usernames lowercase, or joins with other data sources, such as joining a table of usernames to player statistics.
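To make this concrete, here is a standard-library sketch of the per-element logic you might wrap in a Beam Map or ParDo step in a Dataflow pipeline; the field names are illustrative:

```python
def normalize_event(event):
    """Element-wise operation: lowercase the username field."""
    out = dict(event)
    out["username"] = out["username"].lower()
    return out


def enrich_event(event, player_stats):
    """Join-style operation: attach player statistics by username."""
    out = dict(event)
    out["stats"] = player_stats.get(out["username"], {})
    return out


stats = {"ayla": {"battlesWon": 7}}
event = normalize_event({"username": "Ayla", "eventId": "playerkillednpc"})
event = enrich_event(event, stats)
```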

Transform JSON messages into BigQuery table format

Figure 4: Transform JSON messages into BigQuery table format

Dataflow can perform any number of data-processing tasks, including real-time validation of the input data. One use case is fraud detection, for example, flagging a player whose maximum hit points fall outside the valid range. Another use case is data cleansing, such as ensuring that the event message is well formed and matches the BigQuery schema.

The example in Figure 4 unmarshals the original JSON message from the game server and converts it into a BigQuery schema. The key point is that you don't have to manage any servers or virtual machines to perform this action; Dataflow starts, runs, and stops the compute resources that process your pipeline in parallel. You can also reuse the same code for both streaming and batch processing.
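The unmarshaling step boils down to logic like the following sketch, which parses the raw JSON and coerces it into a flat, schema-shaped row. The column names and types are assumptions for illustration; in a real pipeline this logic would run inside a Dataflow transform feeding the BigQuery connector.

```python
import json

# Illustrative target schema: column name -> required Python type.
SCHEMA = {"eventId": str, "playerId": str, "questId": int, "playerHealth": int}


def to_bigquery_row(raw_json):
    """Parse a raw JSON event and coerce it into the table schema.

    Returns a schema-shaped dict, or None if the message is malformed.
    """
    try:
        event = json.loads(raw_json)
        return {col: typ(event[col]) for col, typ in SCHEMA.items()}
    except (ValueError, KeyError, TypeError):
        return None  # route to a dead-letter output in a real pipeline


row = to_bigquery_row('{"eventId": "playerkillednpc", "playerId": "p1", '
                      '"questId": "15", "playerHealth": 42}')
```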

The open-source Dataflow SDK provides pipeline execution separate from your Dataflow program's execution. Your Dataflow program constructs the pipeline, and the code you've written generates a series of steps to be executed by a pipeline runner. The pipeline runner can be the Dataflow managed service on Google Cloud, a third-party runner such as Spark or Flink, or a local pipeline runner that executes the steps directly in the local environment, which is especially useful for development and testing.

Dataflow ensures that each element is processed exactly once, and you can also take advantage of the time windowing and triggers features to aggregate events based on the actual time they occurred (event time) as opposed to the time they are sent to Dataflow (processing time). Some messages may be delayed from the source due to mobile Internet connection issues or batteries running out, but you still want to group events together by user session eventually. Dataflow's built-in session windowing functionality supports this. See The world beyond batch: Streaming 101 for an excellent introduction to data windowing strategies.
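Conceptually, session windowing groups a player's events into bursts of activity separated by a gap of inactivity. In Beam you would express this declaratively with session windows rather than writing the grouping yourself, but the grouping it performs on event time is roughly this standard-library sketch:

```python
def sessionize(timestamps, gap_seconds=600):
    """Group event timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous event
    exceeds gap_seconds -- roughly what session windowing does on
    event time, regardless of when messages actually arrive.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Two bursts of activity separated by more than 10 minutes:
groups = sessionize([0, 120, 300, 2000, 2100])
```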

Alternatively, if your events contain a unique session field, such as a universally unique identifier (UUID), you can use this key to group related events, as well. The most appropriate choice will depend on your specific scenario.

BigQuery: fully managed data warehouse for large-scale data analytics

BigQuery consists of two main parts: a storage system that persists your data with geographic redundancy and high availability, and an analytics engine that lets you run SQL-like queries against massive datasets. BigQuery organizes its data into datasets that can contain multiple tables. Because BigQuery requires a schema for each table, the main job of Dataflow in the previous section was to structure the JSON-formatted raw event data into a BigQuery schema using the built-in BigQuery connector.

Once the data is loaded into a BigQuery table, you can run interactive SQL queries against it to extract valuable intelligence. BigQuery is built for very large scale, and allows you to run aggregation queries against petabyte-scale datasets with fast response times. This is great for interactive analysis.
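For example, a query like the following could surface the quests where end-of-battle events cluster; the table and column names are illustrative, not part of any required schema:

```sql
-- Hypothetical schema: count end-of-battle events per quest over the last week.
SELECT
  questId,
  COUNT(*) AS battles
FROM mygame.events
WHERE eventId = 'playerkillednpc'
  AND eventTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY questId
ORDER BY battles DESC
LIMIT 10;
```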

Data visualization tools

Google Data Studio allows you to create and share interactive dashboards that access a wide variety of data sources, including BigQuery.

BigQuery also integrates with popular BI and visualization tools, such as Looker. Finally, you can run BigQuery queries directly from Google Sheets and Microsoft Excel.

Bulk processing using batch pattern

The other major pattern is regular processing of large, bounded data sets that don't need to be processed in real time. Batch pipelines are often used for creating reports, or are combined with real-time sources for the best of both worlds: reliable historical data that also includes the latest information arriving through the real-time streaming pipelines.

Batch processing of events and logs from game servers

Figure 5: Batch processing of events and logs from game servers

Cloud Storage is the recommended place to store large files. It is a durable, highly available, and cost-effective object storage service. But how do you get the data into Cloud Storage in the first place? The answer depends on your data sources, and is the topic of the next few sections, which cover three scenarios: on-premises servers, other cloud providers, and Google Cloud.

Scenario 1: transferring files from on-premises servers

You can securely transfer log files from your on-premises data center using a number of methods. The most common is the open-source gsutil command-line utility, which you can use to set up recurring file transfers. gsutil offers several useful features, including multi-threaded parallel uploads when you have many files, automatic synchronization of a local directory, resumable uploads for large files, and the ability to break a very large file into smaller parts and upload them in parallel. These features shorten the upload time and use as much of your network connection as possible.
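As a sketch, a recurring transfer might use commands like the following; the bucket and directory names are illustrative:

```shell
# Mirror a local log directory to Cloud Storage; -m enables
# multi-threaded parallel transfers for many small files.
gsutil -m rsync -r /var/log/game gs://my-game-logs/raw/

# For a very large file, upload its pieces in parallel by lowering
# the parallel composite upload threshold.
gsutil -o GSUtil:parallel_composite_upload_threshold=150M \
    cp /var/log/game/big-archive.log gs://my-game-logs/raw/
```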

If you don't have sufficient internet bandwidth to achieve timely uploads, you can connect directly to Google Cloud using Direct Peering or Carrier Peering. Alternatively, if you prefer to ship physical media, consider using the offline Transfer Appliance.

Scenario 2: transferring files from other cloud providers

Some log files may be stored with other cloud providers. Perhaps you run a game server there and output the logs to a storage service on that cloud provider, or you use a service that has a default storage target. Amazon CloudFront (a content delivery network service), for example, stores its logs in an Amazon S3 bucket. Fortunately, moving data from Amazon S3 to Cloud Storage is straightforward.

If you regularly transfer large numbers of files from Amazon S3 to Cloud Storage, you can use the Storage Transfer Service, which transfers files from sources including Amazon S3 and HTTP/HTTPS services. You can set up recurring transfers, and the service supports several advanced options. It takes advantage of the large network bandwidth between major cloud providers and uses advanced bandwidth-optimization techniques to achieve very high transfer speeds.
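A recurring transfer job is defined declaratively. The following is a rough sketch of such a job definition; the project and bucket names are hypothetical, the fields are abbreviated, and you should consult the Storage Transfer Service API reference for the full format:

```json
{
  "description": "Nightly game-server log sync",
  "status": "ENABLED",
  "projectId": "my-game-project",
  "schedule": {
    "scheduleStartDate": {"year": 2024, "month": 5, "day": 1}
  },
  "transferSpec": {
    "awsS3DataSource": {"bucketName": "my-cloudfront-logs"},
    "gcsDataSink": {"bucketName": "my-game-logs"}
  }
}
```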

The guideline is to use this service for transfers of 1 TB to 10 TB or more, since it saves you the operational overhead of running the gsutil tool described in Scenario 1 on multiple servers. For smaller transfers, or when you need to move data multiple times per day, the gsutil tool may be the better choice.

Scenario 3: the data is already in Google Cloud

In some cases, the data you want is stored in Cloud Storage automatically. For example, data from the Google Play Store, including Reviews, Financial Reports, Installs, Crash, and Application Not Responding (ANR) reports, is available in a Cloud Storage bucket under your Google Play Developer account. In these cases, you can keep the data in the bucket it was exported to unless you have a reason to move it elsewhere.

Asynchronous transfer service pattern

A long-term, scalable approach is to implement an asynchronous transfer service pattern, using one or more queues and messaging to initiate transfers based on events or triggers. For example, when a new log file is written to disk or to the source file store, a message is sent to the queue, and workers take on the job of transferring that object to Cloud Storage, removing the message from the queue only when the transfer has completed successfully.
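The pattern boils down to a queue of transfer tasks drained by workers that acknowledge each task only after it is handled. Here is a minimal standard-library sketch; the transfer itself is stubbed out, where a real worker would invoke gsutil or a Cloud Storage client library:

```python
import queue
import threading

transfer_queue = queue.Queue()
completed = []


def upload_to_cloud_storage(path):
    """Stub: a real worker would copy the file with gsutil or a client library."""
    completed.append(path)


def worker():
    while True:
        path = transfer_queue.get()
        if path is None:  # sentinel value: shut this worker down
            transfer_queue.task_done()
            break
        try:
            upload_to_cloud_storage(path)
        finally:
            # Acknowledge only once handled; a real system would
            # retry or dead-letter the message on failure.
            transfer_queue.task_done()


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

for log_file in ["server-0.log", "server-1.log", "server-2.log"]:
    transfer_queue.put(log_file)

transfer_queue.join()          # wait until every file is processed
for _ in threads:
    transfer_queue.put(None)   # stop the workers
for t in threads:
    t.join()
```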

Alternative batch pattern: direct loading from Cloud Storage to BigQuery

You may be asking yourself whether it is strictly necessary to use Dataflow between Cloud Storage and BigQuery. It isn't: you can load JSON files directly from Cloud Storage into BigQuery by providing a schema and starting a load job, or you can directly query CSV, JSON, or Datastore backup files in a Cloud Storage bucket. This may be an acceptable solution to get started, but keep in mind the benefits of using Dataflow:

  • Transform the data before committing to storage. For example, you can aggregate data before loading it into BigQuery, grouping different types of data in separate tables. This can help reduce your BigQuery costs by minimizing the number of rows you query against. In a real-time scenario, you can use Dataflow to compute leaderboards within individual sessions or cohorts like guilds and teams.

  • Interchangeably use streaming and batch data pipelines written in Dataflow. Change the data source and data sink, for example from Pub/Sub to Cloud Storage, and the same code will work in both scenarios.

  • Be more flexible in terms of database schema. For example, if you have added fields to your events over time, you may wish to put the additional raw JSON data in a catch-all field in the schema and use BigQuery's JSON query functionality to query inside that field. Then you can query over multiple BigQuery tables even though the source events would otherwise require different schemas. This is shown in Figure 6.

Additional column to capture new event fields in raw JSON

Figure 6: Additional column to capture new event fields in raw JSON
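The direct-load alternative described above can be as simple as a single command. The following sketch uses the bq command-line tool; the dataset, table, bucket, and schema file names are illustrative:

```shell
# Load newline-delimited JSON from Cloud Storage straight into BigQuery.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mygame.events \
    gs://my-game-logs/raw/events-*.json \
    ./events_schema.json
```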

Operational considerations for the reference architectures

Once you have created your pipelines, it is important to monitor their performance and any exceptions that occur. The Dataflow Monitoring User Interface provides a graphical view of your data pipeline jobs as well as key metrics. You can see a sample screenshot of this in Figure 7.

Built-in Dataflow monitoring console

Figure 7: Built-in Dataflow monitoring console

The Dataflow console provides information on the pipeline execution graph as well as current performance statistics such as the number of messages being processed at each step, the estimated system lag and data watermark. For more detailed information, Dataflow is integrated with the Cloud Logging service; you can see an example screenshot of this in Figure 8.

Cloud Logging is integrated with Dataflow

Figure 8: Cloud Logging is integrated with Dataflow

Further reading and additional resources