Serverless Pixel Tracking Architecture

This solution presents an architecture for serverless pixel tracking for advertising scenarios.

As an advertiser or an advertising agency, a primary goal is to make your brand or your customer’s brand visible. One way to do that is by using online advertising. If you do a good job, your ads get delivered to ad slots on a publisher’s property. But the work doesn't stop there.

One key part of an advertising architecture is gathering data in order to make the best use of the budget and to understand how the audience is responding. A common way to do this is through pixel tracking. But setting up pixel tracking servers often carries technical overhead: you have to deal with costs, latency, scaling, and high availability. In this solution, you learn how to build such a platform while minimizing that technical investment.

Security is always important when working with advertising data. Google Cloud Platform (GCP) helps to keep your data secure and private in several ways. For example, all data is encrypted in transit and at rest, and GCP is ISO 27001, SOC 3, FINRA, and PCI compliant.

To learn more about how to implement this solution, you can follow the tutorial.

Concepts

The pixel-tracking process usually follows these steps:

  1. A bit of code is added to the creative material or web page to call a URL that points to the advertising network backend. For example:

    <img src="AD_SERVER_URL?parameters">
    
  2. On the backend, servers receive that request, process it, and return a 1x1, invisible pixel. Usually, this pixel is created programmatically. Returning an actual image, rather than a response with no body such as HTTP 204 or 304, helps to ensure that the browser does not display a missing image.
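
As an illustration of creating the pixel programmatically, the following is a minimal sketch that builds a 1x1, fully transparent PNG using only the Python standard library. The function names are hypothetical and not part of this solution. In the serverless architecture described later, you would generate the file once and store it as a static object rather than build it for each request.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Frame one PNG chunk: length, type, data, then a CRC over type + data."""
    body = chunk_type + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


def transparent_pixel_png() -> bytes:
    """Return the bytes of a 1x1, fully transparent RGBA PNG."""
    # IHDR: width=1, height=1, bit depth=8, color type=6 (RGBA),
    # then compression, filter, and interlace methods, all 0.
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 6, 0, 0, 0)
    # One scanline: a filter-type byte (0), then R, G, B, A all zero.
    scanline = b"\x00" + b"\x00\x00\x00\x00"
    idat = zlib.compress(scanline)
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )


PIXEL = transparent_pixel_png()
```

The resulting image is well under 100 bytes, which is what keeps the network cost of each impression low.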

Because the response is a single 1x1 pixel, the networking impact stays small. The pixel can easily be made transparent, and each request enables logging of all sorts of data, such as:

  • Custom parameters, such as the name of the page that the ad is displayed on or the user ID.
  • User-based environment, such as the browser, OS, or whether the ad is viewed on a mobile device.
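
To make the custom parameters concrete, here is a small sketch of how a tracking tag could be assembled. The function name, the example domain, and the `page` and `uid` parameter names are assumptions chosen for illustration. The user-environment data does not need to be added by hand, because the browser sends it in the request headers.

```python
from html import escape
from urllib.parse import urlencode


def pixel_tag(ad_server_url: str, params: dict) -> str:
    """Build an <img> tag whose URL carries the custom tracking parameters."""
    # urlencode percent-encodes values so they are safe in a query string.
    url = f"{ad_server_url}?{urlencode(params)}"
    # escape() makes the URL safe inside a double-quoted HTML attribute.
    return f'<img src="{escape(url, quote=True)}" width="1" height="1" alt="">'


# Hypothetical values, for illustration only.
tag = pixel_tag(
    "https://ads.example.com/pixel.png",
    {"page": "home page", "uid": "42"},
)
print(tag)
```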

Requirements and architecture

While the process might seem quite simple, it requires a few things to be set up. This example makes the following assumptions:

  1. An average of 100,000 impressions happen each second, but the rate of impressions can vary.
  2. Impressions can happen from all over the world.
  3. All impressions must be logged, and in aggregate those logs can represent terabytes of data daily.
  4. All logs can be analyzed daily.
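
A quick back-of-the-envelope calculation shows why these assumptions imply terabytes of logs per day. The size of one log entry used here (1,000 bytes) is an assumption for illustration only; measure your own entries to refine the estimate.

```python
IMPRESSIONS_PER_SECOND = 100_000       # average rate from the assumptions above
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_BYTES_PER_LOG_ENTRY = 1_000    # hypothetical size of one log entry

impressions_per_day = IMPRESSIONS_PER_SECOND * SECONDS_PER_DAY
terabytes_per_day = impressions_per_day * ASSUMED_BYTES_PER_LOG_ENTRY / 1e12

print(f"{impressions_per_day:,} impressions per day")   # 8,640,000,000
print(f"{terabytes_per_day:.2f} TB of logs per day")    # 8.64
```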

These assumptions lead to a few implementation requirements:

  1. The system must scale up and down automatically based on demand.
  2. Requests must be served with millisecond latency worldwide from a single domain.
  3. Log storage must be scalable and support asynchronous writes.
  4. Stored logs must be easy to query.

The following diagram shows the architecture. Note the absence of actual servers, either for writing to the analytics platform or for serving pixels.

Architecture for serverless pixel serving

Frontends

The first idea that comes to mind, and one commonly implemented in the ad tech industry, is to use frontend servers, such as Google Compute Engine instance groups or a Google Container Engine federated cluster coupled with an autoscaler, that write their logs to object storage by using a fluentd agent. While such tools are quite common, this approach requires you to set up instances, templates, groups, scaling rules, and deployment scripts. That can be a lot of work.

This solution details an easier-to-implement alternative, leveraging a few of GCP's fully managed products:

  • Google Cloud Storage: A global or regional object storage that offers options such as Standard, Nearline, and Coldline, with various prices and SLAs depending on your needs. In this case, you will use Standard, which offers low-millisecond latency.
  • Google HTTP(S) Load Balancer: A global, anycast, IP load-balancing service that can scale to millions of QPS and integrates with Stackdriver Logging. It can also be used with Google Cloud CDN, which prevents unnecessary access to Cloud Storage by caching the pixel closer to the user in Google edge caches.

The following image shows a GCP HTTP(S) load balancer configuration in the Cloud Platform Console:

User interface shows configuration for HTTP load balancer.

Logs collection

The integration of the HTTP(S) load balancer with Stackdriver Logging makes it easy to log all requests to the load balancer with minimal setup. Logs are saved directly to Cloud Platform backends, and from there can be exported to Cloud Storage, Google Cloud Pub/Sub, or Google BigQuery. With this approach, you can do a few things:

  • Monitor how your system is behaving by using Stackdriver Monitoring, and create custom metrics and alerts.
  • Export your logs to various products, based on your needs:

    • Backup: Export to Cloud Storage or BigQuery.
    • Analysis: Export to BigQuery.
    • Stream for real-time processing: Export to Cloud Pub/Sub.

A log in Stackdriver Logging should look similar to the following image, where you can see a couple of things:

  • The pixel was served from the CDN, as indicated by cacheHit: true.
  • The URL parameters that were set up in the HTML code are visible.

Stackdriver log shows cache hit and parameters.

This solution directly exports logs to BigQuery by using the streaming API. It is also possible to leverage Google Cloud Dataflow. For more information, see What's next.

A log exported directly to BigQuery looks like the following image, where columns match the keys of the JSON-like data that you find in Stackdriver Logging. The image shows a query that displays all data for a specific log, based on its Google-created insertId field.

Google BigQuery shows all data for one log.

Logs analytics

After your data loads into BigQuery, you can run ad-hoc queries on it. Here are a few things to keep in mind:

  • Using a partitioned table with one partition for each day offers a few advantages, such as reducing the number of bytes processed by queries that target specific days.
  • Splitting your tables for each group of advertisers can also help to keep the data more manageable, even if it is not a requirement. You can always run a union on those tables by using table wildcards.
  • BigQuery is also a great tool for basic ETL jobs. By giving an output table to a query, you can process data as needed and transform it into a more presentable format. For example, you might want to aggregate your daily data into a weekly table.
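
The daily-to-weekly aggregation mentioned in the last point has the following shape. This is a plain Python sketch with made-up counts, intended only to show the transformation. In practice you would express it as a BigQuery query that writes to an output table.

```python
from collections import defaultdict
from datetime import date


def weekly_totals(daily_counts: dict) -> dict:
    """Sum daily impression counts into (ISO year, ISO week) buckets."""
    totals = defaultdict(int)
    for day, impressions in daily_counts.items():
        iso_year, iso_week = day.isocalendar()[:2]
        totals[(iso_year, iso_week)] += impressions
    return dict(totals)


# Hypothetical daily counts, for illustration only.
daily = {
    date(2017, 1, 2): 100,   # Monday of ISO week 1
    date(2017, 1, 8): 50,    # Sunday of ISO week 1
    date(2017, 1, 9): 70,    # Monday of ISO week 2
}
weekly = weekly_totals(daily)
```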

The following example code shows a subset of the fields that you can find in BigQuery for each pixel impression, where YOUR_PARAMETERS is the list of custom key-value pairs that you add and can later extract.

{
    [...]
    "resource_type": "http_load_balancer",
    "resource_labels_url_map_name": "lb-pixel-tracking",
    "resource_labels_zone": "global",
    "httpRequest_requestMethod": "GET",
    "httpRequest_requestUrl": "YOUR_DOMAIN/pixel.png?YOUR_PARAMETERS",
    "httpRequest_requestSize": "972",
    "httpRequest_status": "200",
    "httpRequest_responseSize": "1320",
    "httpRequest_userAgent": "Go-http-client/1.1",
    "httpRequest_remoteIp": "1.2.3.4",
    "httpRequest_cacheHit": "true",
    [...]
}
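
As a sketch of how the custom key-value pairs can be extracted from the httpRequest_requestUrl field, the following Python snippet parses the query string of a row and counts impressions for each page. The example URL and the `page` and `uid` parameter names are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def extract_parameters(row: dict) -> dict:
    """Return the custom key-value pairs carried in the request URL."""
    query = urlsplit(row["httpRequest_requestUrl"]).query
    return dict(parse_qsl(query))


# Hypothetical rows shaped like the exported fields shown above.
rows = [
    {"httpRequest_requestUrl": "example.com/pixel.png?page=home&uid=42"},
    {"httpRequest_requestUrl": "example.com/pixel.png?page=home&uid=7"},
    {"httpRequest_requestUrl": "example.com/pixel.png?page=pricing&uid=42"},
]

impressions_by_page = Counter(extract_parameters(r).get("page") for r in rows)
```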

After your data is stored in BigQuery, you can start working on understanding user behavior. Using the BigQuery UI, analysts with SQL knowledge can see results in seconds across gigabytes to terabytes of data.

BigQuery shows results of pixel-tracking requests

Load testing

To make sure that this solution is usable in production, you can set up a distributed load testing environment, based on Vegeta. For details, see the tutorial. As you can see in the results below, this architecture can easily manage 100,000 RPS.

Load testing results graphs

What's next

This solution presented a high-level architecture for serverless pixel serving. To take the architecture further, you can leverage other Google Cloud Platform features, such as:

  • DNS: This solution directly used the global IP of the load balancer, but you can also attach your domain name to it by using Google Cloud DNS, for example. One load balancer can have several backends of various types, such as Cloud Storage or Compute Engine.
  • Custom metrics: As mentioned earlier, you can use Stackdriver Logging to create custom metrics that Stackdriver Monitoring can use for alerts or monitoring.
  • Google Cloud Bigtable: Adding Cloud Bigtable to the Cloud Dataflow pipeline helps some of your ad servers get access to updated data in near real time. While Cloud Dataflow can still load data into BigQuery for offline analytics, it could also write to Bigtable, which offers single-digit-millisecond read latency when using SSD. This can be a good way to get your user profile, for example.
  • Google Data Studio: While BigQuery offers a UI to run SQL queries, sometimes a picture is worth a thousand words. Data Studio can help you visualize your analysis through sharable dashboards and cross-datasource analysis.
  • Log processing: This solution focused on batch analytics with an architecture that can evolve towards real-time decisions. A good approach to building such an architecture would be to use:

    • Cloud Pub/Sub: A fully managed global messaging system. Publishers can publish large numbers of messages to a topic through various clients. Workers can then process those messages, either by pulling them or by having them pushed. It is available as an option when exporting from Stackdriver Logging.
    • Cloud Dataflow: The Google evolution of MapReduce that offers batch and streaming capabilities under the same SDK, which is open source under Apache Beam. Cloud Dataflow makes it possible to build pipelines with a few lines of code, and provides stream-processing features including watermarks, timers, and windowing. It also offers various sinks, such as BigQuery (ad-hoc full queries) or Bigtable (low-latency reads and writes).

Such an architecture would be similar to the one shown in the following diagram:

Advanced architecture
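
To give a feel for the windowing mentioned under Cloud Dataflow, here is a conceptual sketch in plain Python that counts impressions in fixed, non-overlapping windows. It does not use the Apache Beam SDK, and the function name and timestamps are hypothetical. A real streaming pipeline would also have to handle late and out-of-order data, which is what watermarks are for.

```python
from collections import Counter


def count_per_fixed_window(event_timestamps, window_seconds):
    """Count events per fixed window, keyed by each window's start time."""
    counts = Counter()
    for ts in event_timestamps:
        # Round the timestamp down to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


# Hypothetical impression timestamps, in seconds.
per_minute = count_per_fixed_window([0, 10, 59.9, 60, 125], window_seconds=60)
```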
