Infrastructure options for serving advertising workloads (part 1)

This article explains the components that are shared across different ad tech platforms, including ad servers and bidders. The article offers you options for implementing those components.

Ad servers and bidders are often complementary platforms with overlapping technology. To avoid duplicating content, this article and its companion, part 2, provide shared context for the rest of the series.

For a high-level overview of the entire series and the terminology used throughout, see Building advertising platforms (overview).

Platform considerations

Whether you are working on the buy side or the sell side of a platform, consider the following:

  • Compute platform: Programmatic advertising platforms comprise several services, where each service offers one or more functions. Decide early if you can containerize some or all of these functions, or if the service must run directly on virtual machine (VM) instances.
  • Geographic locations: Deploying your infrastructure close to your customers and providers helps reduce networking latency.
  • Reproducibility: When you replicate a system in different regions across the globe, the ability to deploy the same infrastructure every time keeps the platform consistent.
  • Load balancing: A single machine cannot handle ad-tech loads. Distribute both internal and external requests across multiple servers.
  • Autoscaling: Ad-request loads fluctuate over the course of the day. You can reduce costs and increase availability by automatically scaling your system up and down.
  • Network communication: With a distributed system come communication questions. For example, suppose that you are bidding in Oregon but your campaign management database is in Europe. Even if the communication consists of offline synchronization, you probably don't want to communicate over the public internet.

Compute platform

Google Cloud offers several options for running your computational workloads. Consider the following options:

  • App Engine for running a web user interface (UI) minus most of the operational overhead.
  • Compute Engine for installing and managing some relational databases or custom code not supported by App Engine.
  • Google Kubernetes Engine (GKE) for setting up stateless frontends or for running containerized applications.

These options are recommendations and are often interchangeable. Your requirements are ultimately the deciding factor, whether that factor is cost, operational overhead, or performance.

Compute Engine and GKE both support preemptible VMs, which are often used in ad-tech workloads to save on cost. Preemptible VMs can be reclaimed with only 30 seconds' notice, however, so you might want to do the following:

  • If you use Compute Engine, two different managed instance groups (one preemptible and the other with standard VMs) can reside behind the same load balancer. By making sure that one group consists of standard VMs, you ensure that your frontend is always able to handle incoming requests. The following diagram shows this approach.

    two different managed instance groups in the same load balancer

  • If you use GKE, you can balance cost against availability by creating both preemptible and standard node pools in your GKE cluster.

Geographic locations

Advertisers might want to target customers in any region around the globe. Adding a few extra milliseconds to one of the platform's UI frontends won't degrade your advertisers' experience when, for example, they are visualizing performance data. Be careful, however, if additional networking distance adds a few extra milliseconds to the bidding response: those few milliseconds might decide whether the advertiser's bid is accepted and their ad is served to the customer.

Google Cloud has a presence in several regions, including us-east, us-west, europe-west, asia-southeast, and asia-east. Each region includes multiple zones to offer both high availability and scale.

If latency is critical, you might want to distribute some of your services across zones in different regions. You can customize your setup based on your needs. For example, you could decide to have some frontend servers in us-east4 and us-west1, but store your data in a database in us-central1. In some cases, you could replicate some of the database data locally onto the frontend servers; alternatively, you might consider a multi-regional Cloud Spanner instance.


Reproducibility

Reproducibility simplifies maintenance and deployment, and running the platform in all relevant geographic locations is key to returning bids before the critical deadline. To ensure reproducibility, every region must perform similar work. The main differences are the workload and how many machines and zones are required to scale to meet the regional demand.

With Compute Engine, instance templates are the basis for setting up similar regional managed instance groups. These groups can be located in different regions for proximity to the SSPs, and they can span multiple zones for high availability. The following diagram shows this process.

Use instance template to set up regional managed instance groups

Containers offer a higher level of abstraction than machine virtualization. Kubernetes facilitates application reproducibility natively through YAML configuration files that can define pods, services, and deployments with consistency across the different clusters.

Load balancing

Two main scenarios require load balancing: distributing external requests from users and partners across your servers, and distributing internal requests between your services.

If you decide to use Kubernetes for some parts of the infrastructure, we recommend that you use GKE. Some Kubernetes features might require extra implementation work if your provider does not support them natively. With GKE, Kubernetes can use native Google Cloud features.

GKE also supports container-native load balancing to minimize networking time and possible extra networking hops. At a high level, the load balancer prevents a request from being routed to an instance that does not host a pod of the requested service.


Autoscaling

Because your platform must parse and process billions of ad requests per day, load balancing is a must; a single machine would be inadequate for that task. Moreover, the number of requests changes throughout the day, so your infrastructure must be able to scale up and down based on demand.

If you decide to use Compute Engine, you can create autoscaling managed instance groups from instance templates. You can then scale those groups on metrics such as HTTP load, CPU utilization, or Cloud Monitoring custom metrics (for example, application latency), or on a combination of these metrics.

Autoscaling decisions are based on the metric average over the last ten minutes and are made every minute using a sliding window. Every instance group can have its own set of scaling rules.
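To make the decision logic concrete, the following Python sketch is a toy model (not the actual Compute Engine algorithm) of scaling on a sliding-window metric average; the target utilization and window size are hypothetical parameters:

```python
import math
import statistics
from collections import deque

class SlidingWindowAutoscaler:
    """Toy autoscaler that targets an average utilization level."""

    def __init__(self, target_utilization=0.6, window_size=10):
        self.target = target_utilization
        self.samples = deque(maxlen=window_size)  # one sample per minute

    def record(self, utilization):
        self.samples.append(utilization)

    def desired_instances(self, current_instances):
        if not self.samples:
            return current_instances
        avg = statistics.mean(self.samples)  # average over the window
        # Scale so that average utilization returns to the target level.
        return max(1, math.ceil(current_instances * avg / self.target))
```

For example, with a 0.5 utilization target, 4 instances averaging 0.9 utilization over the window would be scaled out to ceil(4 × 0.9 / 0.5) = 8 instances.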

If you decide to use GKE, you can use the Kubernetes cluster autoscaler natively. The GKE cluster autoscaler behaves differently from the Compute Engine autoscaler: it spins up new nodes when pending pods can no longer be scheduled on the existing nodes because of a shortage of CPU or memory. Scaling down happens automatically when CPU or memory is released again.

Network communication

Virtual Private Clouds (VPCs) can span multiple regions. In other words, if you have database read replicas in us-east and a primary database in asia-southeast within the same VPC, they can communicate securely over their private IPs or hostnames without ever leaving the Google network.

In the following diagram, all instances are in the same VPC and can communicate directly without the need of VPN, even if they are in different regions.

All instances are in the same VPC and in different regions

GKE clusters are assigned a VPC when they are created and can use many of those existing networking features.

Google also offers two types of networking infrastructure:

  • Premium: Uses the Google global private network. Prioritize this option for critical workloads such as cross-region database replication.
  • Standard: Prioritize this option if you are price conscious and can use the public internet.

When you use managed products such as Cloud Bigtable, Cloud Storage, or BigQuery, Google Cloud provides Private Google Access to those products from your VPC.

User frontend

Your user frontend is important, but it requires the least technical overhead because it handles much smaller workloads. The user frontend lets platform users administer advertising resources such as campaigns, creatives, billing, or bids. It also lets them interact with reporting tools, for example, to monitor campaign or ad performance.

Both of these features require web serving to provide a UI to the platform user, and datastores to store either transactional or analytical data.

Web serving

Your advertising frontend likely needs to:

  • Offer high availability.
  • Handle hundreds of requests per second.
  • Be globally available at an acceptable latency to offer a good user experience.
  • Provide a UI.

Your interface likely includes functionality such as a dashboard and pages to set up advertisers, campaigns, and their related components. The UI design itself is a separate discipline and beyond the scope of this article.

To minimize technical overhead, we recommend App Engine as the frontend platform; it minimizes the time you spend managing your website infrastructure. If you require a custom runtime, consider App Engine custom runtimes. Alternatively, you can use GKE or Compute Engine if your preferred application stack has other requirements.


Data storage

There are two datastore types in user frontends:

  • Administration datastores that require online transaction processing (OLTP) databases. Options are detailed in the metadata management store section.
  • Reporting datastores that require online analytical processing (OLAP) databases. Options are detailed in the reporting/dashboarding store section.

Handling requests


Requests are sent for processing to an HTTP(S) endpoint that your platform provides. The key components are as follows:

  • A load balancer able to process several hundred thousand queries per second (QPS).
  • A pool of instances that can scale up and down quickly based on various KPIs.
  • Optionally, an API layer that throttles or authenticates requests to the endpoint.

Both Compute Engine and GKE are good options as computing platforms:

  • Compute Engine uses Cloud Load Balancing and managed instance groups that are mentioned in the scaling section.
  • GKE uses Cloud Load Balancing and Ingress (or Istio Ingress Gateway), Horizontal Pod Autoscaler, and Cluster Autoscaler.

Because pod scaling is faster than node scaling, GKE might offer faster autoscaling on a per-service level. GKE also supports container-native load balancing to route requests directly to an instance that hosts a pod of the relevant service.

Throttling and authentication can be managed with technologies like Apigee or Service Infrastructure.


Ad requests are commonly formatted as JSON or protobuf and contain information such as the IP address, user agent, or site category. It's critical to extract this data, which might also contain details about the (unique) user, in order to retrieve segments for selecting and filtering ads.
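As a minimal illustration of this extraction step, the following Python sketch parses a hypothetical JSON payload; real ad-request formats (such as OpenRTB bid requests) carry many more fields:

```python
import json

# A hypothetical incoming ad request, reduced to the fields this sketch uses.
RAW_REQUEST = """
{
  "ip": "203.0.113.7",
  "user_agent": "Mozilla/5.0",
  "site": {"category": "news"},
  "user": {"id": "u-42"}
}
"""

def parse_ad_request(raw):
    """Extract the attributes needed downstream for selection and filtering."""
    payload = json.loads(raw)
    return {
        "ip": payload.get("ip"),
        "user_agent": payload.get("user_agent"),
        "site_category": payload.get("site", {}).get("category"),
        "user_id": payload.get("user", {}).get("id"),
    }
```

The extracted `user_id` is what a later step would use to look up segments in the (unique) user profile store.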

Static filtering

Some requests, typically received on the buyer side, can be discarded by making use of static rules. Such early filtering can reduce the amount of data and the complex processing that is required downstream.

Rules might include publisher blocklists or content-type exclusions. During initialization, the workers can pull and load these rules from a file hosted on Cloud Storage.
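A minimal Python sketch of such static rules follows; the rule values are hypothetical, and in production the rules file could be pulled from Cloud Storage when the worker starts:

```python
# Hypothetical static rules, loaded once at worker initialization.
RULES = {
    "publisher_blocklist": {"pub-123", "pub-456"},
    "excluded_content_types": {"adult", "weapons"},
}

def passes_static_filter(request, rules=RULES):
    """Discard a request early, before any expensive downstream processing."""
    if request.get("publisher_id") in rules["publisher_blocklist"]:
        return False
    if request.get("content_type") in rules["excluded_content_types"]:
        return False
    return True
```

Because the rules are static, the check is a couple of set lookups per request, which keeps this early filter cheap relative to the work it avoids downstream.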

Ad selection

Ad selection can be performed in various services or platforms, including the publisher ad server, the advertiser ad server, or the DSP. There are different levels of complexity when selecting an ad:

  • Some selections can be as simple as selecting an ad for a specific category of the publisher's website or page. In this case, prices do not differ on a per-ad basis.
  • More advanced selections incorporate user attributes and segments and potentially involve machine learning–based ad-recommendation systems.
  • RTB systems usually make the most complex decisions. Ads are selected based on attributes such as (unique) user segments and previous bid prices. The selection also includes a bid calculation to optimize the bid price on a per-request basis.

Choosing the relevant ad is the core function of your system. You have many factors to consider, including advanced rules-based or ML-based selection algorithms. This article, however, continues to focus on the high-level process and the interactions with the different datastores.

The ad selection process consists of the following steps:

  1. Retrieve the segments associated with the targeted user from the (unique) user profile store.
  2. Select the campaigns or ads that are a good match with the user's segments. This selection requires reading metadata from the metadata management store, which is why this store requires you to implement one of the heavy-read storing patterns.
  3. Filter the selected campaigns or ads in alignment with the metrics, for example the remaining budget, stored in one of the context stores.
  4. Select the ad.
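The four steps can be sketched in Python as follows, using hypothetical in-memory dictionaries as stand-ins for the (unique) user profile store, the metadata management store, and the context store:

```python
# Hypothetical stand-ins for the three stores involved in ad selection.
USER_PROFILE_STORE = {"u-42": {"sports", "travel"}}   # user -> segments
METADATA_STORE = [                                    # campaign/ad metadata
    {"ad_id": "ad-1", "segments": {"sports"}, "cpm": 2.0},
    {"ad_id": "ad-2", "segments": {"finance"}, "cpm": 5.0},
]
CONTEXT_STORE = {                                     # counters, e.g. budget
    "ad-1": {"remaining_budget": 100.0},
    "ad-2": {"remaining_budget": 0.0},
}

def select_ad(user_id):
    # 1. Retrieve the user's segments from the (unique) user profile store.
    segments = USER_PROFILE_STORE.get(user_id, set())
    # 2. Select ads whose targeting overlaps those segments (metadata store).
    candidates = [ad for ad in METADATA_STORE if ad["segments"] & segments]
    # 3. Filter on counters such as remaining budget (context store).
    candidates = [ad for ad in candidates
                  if CONTEXT_STORE[ad["ad_id"]]["remaining_budget"] > 0]
    # 4. Select the ad; here, simply the highest CPM among the survivors.
    if not candidates:
        return None
    return max(candidates, key=lambda ad: ad["cpm"])["ad_id"]
```

The final selection rule (highest CPM) is a placeholder; as noted above, real systems can apply much richer rules-based or ML-based logic at that step.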

Bidders have more steps related to bids and auctions, and they have stricter latency requirements. For more details about bidder requirements during ad selection, see Infrastructure options for RTB bidders (part 4).

Heavy-read storing patterns

Most of the decisions made when selecting an ad depend on data that:

  • Is read in milliseconds, sometimes sub-milliseconds.
  • Is written as soon as possible, especially for time-sensitive counters.
  • Is written often as part of an offline process that uses background analytical or machine learning tasks.

How you choose your datastore depends on how you prioritize the following requirements:

  • Minimizing read or write latencies: If latency is critical, you need a store that is close to your servers and that can also handle fast reads or writes at scale.
  • Minimizing operational overhead: If you have a small engineering team, you might need a fully managed database.
  • Scaling: To support either millions of targeted users or billions of events per day, the datastore must scale horizontally.
  • Adapting querying style: Some queries can use a specific key, where others might need to retrieve records that meet a different set of conditions. In some cases, query data can be encoded in the key. In other cases, the queries need SQL-like capabilities.
  • Maximizing data freshness: Some counters, such as budget, must be updated as quickly as possible. Other data such as the audience segment or counters (for example, daily caps) can be updated at a later time.
  • Minimizing costs: It might not always be economical or practical to handle billions of strongly consistent reads and writes per day in a single global database.

There are different options to address heavy-read requirements. These include read replicas, caching systems, in-memory NoSQL databases, and managed wide-column NoSQL databases.

RDBMS read replicas

When using Cloud SQL (or an equivalent RDBMS installed and managed on Compute Engine), you can offload reads from the primary instance. Many databases natively support this feature. Workers can query the information they need by:

  • Using read replicas that match the number of workers.
  • Using a pooling proxy.

The following diagram shows what this process looks like.

Database where the reads are offloaded from the master instance

Read replicas are designed to serve heavy read traffic, but scalability is not linear, and performance can suffer with a larger number of replicas. If you need reads or writes that scale, with global consistency and minimal operational overhead, consider using Cloud Spanner.
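As a sketch of the routing idea behind both approaches, the following hypothetical Python round-robin pool sends reads to replicas and everything else to the primary; a real deployment would typically rely on a pooling proxy or the database driver's built-in replica support instead:

```python
import itertools

class ReplicaPool:
    """Round-robin selection over read replicas; writes go to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._cycle = itertools.cycle(replicas)

    def connection_for(self, query):
        # Naive routing rule: only plain SELECTs are safe to offload.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._cycle)
        return self.primary
```

Note that this routing is deliberately naive: replicas lag the primary, so reads that must see the latest write would also need to go to the primary.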

Local caching layer

You can use a caching layer such as Redis on Compute Engine with optional local replicas on the workers. This layer can greatly minimize latency in both reading and networking. The following diagram shows what this layer looks like.

Minimize latency by leveraging a caching layer

If you decide to use Kubernetes in this case, look at DaemonSets and pod affinity to make sure that:

  • You limit the amount of replicated data.
  • Data remains close to a serving pod.
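At its core, this layer is a read-through lookup; the following minimal Python sketch uses a plain mapping as a hypothetical stand-in for the regional backing store (for example, Redis):

```python
class ReadThroughCache:
    """Local in-memory cache in front of a slower shared store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # e.g., a regional Redis instance
        self.local = {}

    def get(self, key):
        if key not in self.local:
            # Cache miss: fall through to the shared store and keep a copy.
            self.local[key] = self.backing_store.get(key)
        return self.local[key]
```

The local copy can go stale until it is refreshed, which is one reason to limit how much data is replicated this way, as the DaemonSet guidance above suggests.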

In-memory key-value NoSQL

Deploy an in-memory database such as Aerospike or Redis to provide fast reads at scale. This solution can be useful for regional data and counters. If you are concerned about the size of the stored data structures, you can also use in-memory databases that can write to SSD disks. The following diagram shows this solution.

An in-memory database that can write to SSD disks

Managed wide-column NoSQL

Wide-column datastores are key-value stores that can provide fast reads and writes at scale. You can install a common open source database such as Cassandra or HBase.

If you decide to use such a store, we recommend Bigtable to minimize operational overhead. These stores let you scale input/output operations (IOPS) linearly with the number of nodes. With a proper key design, ad selectors can read, and the data pipeline can write, with single-digit-millisecond latency to the first byte on petabytes of data. The following diagram shows this process.

Wide-column data stores
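To illustrate why key design matters in such stores, this Python sketch models lexicographically ordered rows (as in Bigtable) and a prefix scan; the `user#date` key layout is a hypothetical example that keeps one user's rows contiguous:

```python
from bisect import bisect_left

# A toy, sorted key-value table standing in for a wide-column store,
# where rows are ordered lexicographically by row key.
rows = sorted([
    ("u-42#20240101", {"segments": ["sports"]}),
    ("u-42#20240102", {"segments": ["sports", "travel"]}),
    ("u-99#20240101", {"segments": ["finance"]}),
])

def prefix_scan(table, prefix):
    """Return all rows whose key starts with prefix, as a range scan would."""
    keys = [key for key, _ in table]
    start = bisect_left(keys, prefix)  # first key >= prefix
    result = []
    for key, value in table[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the scan can stop here
        result.append((key, value))
    return result
```

Because all of `u-42`'s rows share a key prefix, an ad selector can fetch that user's recent data with one short range scan instead of scattered point reads.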

Static object storage

For static data that can be saved in protobuf, AVRO, or JSON format, workers can load from Cloud Storage during initialization and persist the content in RAM. The following diagram shows what this process looks like.

Load data from Cloud Storage

There is no one-size-fits-all solution. Choose among these solutions based on your priorities, balancing latency, cost, operations, read/write performance, and data size.

| Solution            | Latency                   | Cost                      | Operational overhead | Read/write performance            | Data size                   |
|---------------------|---------------------------|---------------------------|----------------------|-----------------------------------|-----------------------------|
| RDBMS read replicas | Milliseconds              | Service- or compute-based | High                 | Limited                           | Limited                     |
| Cloud Spanner       | Milliseconds              | Service-based             | Low                  | Linear scale with number of nodes | Petabytes                   |
| In-memory stores    | Sub-millisecond           | Compute-based             | High                 | Scales with number of nodes       | Scales with number of nodes |
| Cloud Bigtable      | Single-digit milliseconds | Service-based             | Low                  | Linear scale with number of nodes | Petabytes                   |

Advertising datastores

This section covers data storage options that address one of three different scenarios:

  • Ad-serving stores are used by services concerned with ad selection. Serving workloads requires low latency and the ability to handle billions of reads per day. Data size depends on the type of data.
  • Analytical stores are used offline through ad hoc queries or batch data pipelines. They support hundreds of terabytes of data stored daily.
  • Reporting/dashboard stores can be used for pre-made dashboards, time series, or custom reporting. These options differentiate your frontend, enabling your platform users to quickly gain insights and visualize how their business is performing.

Ad-serving stores can be further broken down, as follows:

  • The metadata management store is used when selecting relevant campaigns and ads. The user frontend creates and updates data. This data requires persistence.
  • The (unique) user profile store is used when profiling the (unique) user to match the user with an ad. Data is updated mostly using events (in part 2). This data requires persistence.
  • The serving context store is used to filter ads and campaigns based on multiple counters such as budget or frequency caps. Data is updated using events. This data is frequently overwritten.

Metadata management store

The metadata management store contains the reference data to which you apply rules when making the ad selection. Some resources stored here are specific to a platform, but others might overlap:

  • For a sell-side ad server, publishers manage data about campaigns, creatives, advertisers, ad slots, and pricing. Some frontends might also grant access to their buyers.
  • For a buy-side ad server, buyers manage data about campaigns, creatives, advertisers, and pricing. Advertisers can often update this data themselves through a UI.
  • For a DSP, buyers manage data about campaigns, creatives, advertisers, and bid prices. Advertisers can often update the data themselves through a UI.

The metadata store contains relational or hierarchical semi-static data:

  • Writes are the result of platform user edits through the frontend and happen infrequently.
  • Data is read billions of times per day by the ad-selection servers.

Focusing on the user frontend, the campaign metadata database must be able to manage resource relationships and hierarchies, and store megabytes to a few gigabytes of data. The database must also provide reads and writes in the range of hundreds of QPS. To address these requirements, Google Cloud offers several database options, both managed and unmanaged:

  • Cloud SQL: A fully managed database service that can run MySQL or PostgreSQL.
  • Datastore: A highly scalable, managed, and distributed NoSQL database service. It supports ACID transactions, provides a SQL-like query language, and has strong and eventual consistency levels.
  • Spanner: A horizontally scalable relational database providing strong, consistent reads, global transactions, and cross-region replication. It can handle heavy reads and writes.
  • Custom: You can also install and manage many open source or proprietary databases (such as MySQL, PostgreSQL, MongoDB, or Couchbase) on Compute Engine or GKE.

Your requirements can help narrow down options, but at a high level you could use Cloud SQL due to its support for relational data. Cloud SQL is also managed and provides read replica options.

As mentioned previously, the metadata store is used not only by platform users for reporting and administration but also by the servers that select ads. Those reads happen billions of times a day. There are two main ways to approach that heavy-read requirement:

  • Using a database, such as Spanner, that can handle globally consistent writes and billions of reads per day.
  • Decoupling reads and writes. This approach is possible because metadata is not changed often. You can read more about this approach in exports (in part 2).

(Unique) user profile store

This store contains (unique) users and their associated information, which provides key insights for selecting a campaign or ad on request. Information can include the (unique) user's attributes, your own segments, or segments imported from third parties. In RTB, imported segments often include bid price recommendations.

This datastore must be able to store hundreds of gigabytes, possibly terabytes, of data, and it must be able to retrieve single records with at most single-digit-millisecond latency. How much data you store depends on how detailed your (unique) user information is. At a minimum, you should be able to retrieve the list of segments for the targeted user.

The store is updated frequently based on the (unique) user's interaction with ads, sites they visit, or actions they take. The more information, the better the targeting. You might also want to use third-party data management platforms (DMPs) to enrich your first-party data.

Bigtable and Datastore are common choices for (unique) user data. Both databases are well suited to random reads and writes of single records. Consider Bigtable only if you have at least several hundred gigabytes of data.

Other common databases such as MongoDB, Couchbase, Cassandra, or Aerospike can also be installed on Compute Engine or GKE. Although these databases often require more management, some might provide more flexibility, possibly lower latency, and in some cases, cross-region replication.

For more details, see user matching (in part 4).

Context stores

The context store is often used to store counters, for example, frequency caps and remaining budget. How often the data is refreshed in the context store varies. For example, a daily cap can be propagated daily, whereas the remaining campaign budget requires recalculating and propagating as soon as possible.

Depending on the storage pattern that you choose, the counters that you update, and the capabilities of your chosen database, you could write directly to the database. Or you might want to decouple the implementation by using a publish-subscribe pattern with a messaging system, such as Pub/Sub, to update the store after the calculation.
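A minimal in-process sketch of the decoupled approach follows, with a Python queue standing in for a messaging system such as Pub/Sub; the campaign IDs and amounts are hypothetical:

```python
import queue

# Spend events are published to a topic (here an in-process queue), and a
# subscriber applies them to the context store asynchronously.
topic = queue.Queue()
context_store = {"campaign-1": {"remaining_budget": 100.0}}

def publish_spend(campaign_id, amount):
    """Producer side: emit a spend event without touching the store."""
    topic.put({"campaign_id": campaign_id, "amount": amount})

def drain_and_apply():
    """Subscriber side: apply all pending spend events to the counters."""
    while not topic.empty():
        event = topic.get()
        context_store[event["campaign_id"]]["remaining_budget"] -= event["amount"]
```

Decoupling this way means ad-serving workers never block on the context store's write path; the tradeoff is that counters lag by however long events sit in the queue.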

Good candidates for context stores are:

  • Bigtable
  • The regional in-memory key-value NoSQL pattern
  • The regional caching pattern

By using horizontal scaling, these stores can handle writes and reads at scale. Infrastructure options for ad servers (part 3) and Infrastructure options for RTB bidders (part 4) discuss some of these options in more detail.

Example of how to manage a budget counter in a distributed environment

You set up budgets in the campaign management tool. You don't want your campaigns to overspend because, most of the time, advertisers will not pay for those extra impressions. But it can be challenging in a distributed environment to aggregate counters such as remaining budget, especially when the system can receive hundreds of thousands of ad requests per second. Campaigns can quickly overspend in a few seconds if the global remaining budget is not consolidated quickly.

By default, a worker spends slices of the budget without knowing how much sibling workers spent. That lack of communication can lead to a situation where the worker spends money that is no longer available.

There are different ways to handle this problem. Both of the following options implement a global budget manager, but they behave differently.

  1. Notifying workers about budget exhaustion: The budget manager tracks spending and pushes a notification to each worker when the campaign budget has been exhausted. Because QPS is likely to be high, notifications should arrive within a second in order to quickly limit overspend. The following diagram shows how this process works.

    Budget exhaustion notification

  2. Recurrently allocating slices of budget to each worker: The budget manager breaks the overall remaining budget into smaller amounts that are allocated to each worker individually. Workers spend their own, local budget, and when it is exhausted, they can request more. This option has a couple of advantages:

    1. Before being able to spend again, workers need to wait for the budget manager to allocate them a new amount. This approach prevents overspending even if some workers remain idle for a while.
    2. The budget manager can adapt the allocated amount sent to each worker based on the worker's spending behavior at each cycle. The following diagram shows this process.

      Budget manager allocates budget to each node
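Option 2 can be sketched in Python as follows; the slice size and refill policy are hypothetical simplifications of what a real, distributed budget manager would implement:

```python
class BudgetManager:
    """Allocates slices of a campaign's global budget to workers (option 2)."""

    def __init__(self, total_budget, slice_size):
        self.remaining = total_budget
        self.slice_size = slice_size

    def request_slice(self):
        """Grant a worker a local slice; never exceed the global total."""
        grant = min(self.slice_size, self.remaining)
        self.remaining -= grant
        return grant

class Worker:
    def __init__(self, manager):
        self.manager = manager
        self.local_budget = 0.0

    def spend(self, amount):
        # Refill from the budget manager when the local slice is exhausted.
        if self.local_budget < amount:
            self.local_budget += self.manager.request_slice()
        if self.local_budget >= amount:
            self.local_budget -= amount
            return True
        return False  # global budget exhausted; do not serve the ad
```

Total spend can never exceed the global budget, even though each worker coordinates with the manager only when its local slice runs out.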

Whichever option you choose, budget counters are calculated based on the pricing model agreed upon by the publisher and advertiser. For example, if the model is:

  • CPM based, a billable impression sends an event to the system that decreases the remaining budget based on the price set per thousand impressions.
  • CPC based, a user click sends an event to the system that decreases the remaining budget based on the price set per click.
  • CPA based, a tracking pixel placed on the advertiser property sends an event to the system that decreases the budget based on the price per action.
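The three pricing models above translate into a simple decrement rule per billable event, sketched here in Python with hypothetical prices:

```python
def budget_decrement(pricing_model, price, event):
    """How much a single event decreases the remaining budget.

    price is the agreed rate: per thousand impressions (CPM),
    per click (CPC), or per action (CPA).
    """
    if pricing_model == "CPM" and event == "impression":
        return price / 1000.0   # price is set per thousand impressions
    if pricing_model == "CPC" and event == "click":
        return price
    if pricing_model == "CPA" and event == "action":
        return price
    return 0.0                  # event is not billable under this model
```

For example, under a $2.00 CPM, each billable impression decreases the remaining budget by $0.002.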

The number of impressions is often a few orders of magnitude higher than the number of clicks, and the number of clicks is often a few orders of magnitude higher than the number of conversions. Ingesting these events requires a scalable event-processing infrastructure. This approach is discussed in more detail in the data pipeline article.

Analytical store

The analytical store is a database that is used to store and process terabytes of daily ingested data offline; in other words, not during any real-time processing. For example:

  • (Unique) user data is processed offline to determine the associated segments, which in turn are copied to faster databases, such as the user profile database, for serving. This process is explained in exports.
  • Requests are joined with impressions and (unique) user actions in order to aggregate offline counters used in a context store.

Reporting/dashboarding store

The reporting/dashboarding store is used in the user frontend and provides insight on how well campaigns or inventories perform. There are different reporting types. You might want to offer some or all of them, including custom analytics capabilities and semi-static dashboards updated every few seconds or in real time.

You can use BigQuery for its analytics capabilities. If you use views to limit data access and share them with your customers accordingly, you can give your platform users ad hoc analytical capabilities through your own UI or their own visualization tools. Not every company offers this capability, but it is a great addition for your customers. Consider flat-rate pricing for this use case.

If you want to offer OLAP capabilities to your customers with millisecond-to-second latency, consider using a database in front of BigQuery. You can aggregate data for reporting and export it from BigQuery to your choice of database. Relational databases, such as those supported by Cloud SQL, are often used for this purpose.

Because App Engine, or any other frontend that acts on behalf of the user, queries BigQuery by using a single service account, BigQuery sees all queries as coming from a single user. As a result, BigQuery can cache some queries and return previously calculated results faster.

Semi-static dashboards are also commonly used. These dashboards use pre-aggregated KPIs written by a data pipeline process. The stores are most likely NoSQL-based, such as Firestore for easier real-time updates, or caching layers such as Memorystore. The freshness of the data typically depends on the frequency of the updates and the duration of the aggregation window.

What's next