Counting uniques faster in BigQuery with HyperLogLog++

Wednesday, July 5, 2017

By Felipe Hoffa, Developer Advocate

As a data exploration task, counting unique users is usually slow and resource intensive: your database needs to keep track of every unique id it has ever seen, and that can consume a lot of RAM.

Let’s try it with BigQuery: How many unique users did GitHub have in 2016?

#standardSQL
SELECT COUNT(DISTINCT actor.login) exact_cnt
FROM `githubarchive.year.2016`
6,610,026 (4.1s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

4.1 seconds to process 320 million rows and 3.39 GB of data, for a total of ~6.6M unique ids. Not too bad. But we can go faster if we're willing to accept approximate results:

#standardSQL
SELECT APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.year.2016`
6,643,627 (2.6s elapsed, 3.39 GB processed, 320,825,029 rows scanned)

This time we got an approximate count (with an error rate of ~0.5%), but in only 2.6 seconds!

In this post, I’ll explain how BigQuery uses HyperLogLog++, Google’s internal implementation of the HyperLogLog algorithm for cardinality estimation, to pull that feat off.

Exact counts versus approximate ones

In the paper “HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm” (2013), the authors present the improvements Google made for HyperLogLog++. Quoting Mosha Pasumansky, some of these improvements are:
  1. Better accuracy for very large cardinalities: Standard HLL starts having too many hash collisions around 1 billion unique values. Google had to deal with this problem because it has several products with more than 1 billion users (Android, Chrome, YouTube, …). There are also more things than users to count uniques for; in IoT, for example, we can expect billions of sensors.
  2. Better accuracy for very small cardinalities: HLL++ uses bias correction to reduce the error for small cardinalities (see the quick check after this list).
  3. More scalable with better memory usage: HLL++ uses a special sparse representation of the sketch. In standard HLL, memory usage is constant and determined by the specified precision; the sparse representation lets memory usage stay smaller for lower cardinalities and grow adaptively for higher ones, never exceeding the size of the dense representation.
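As promised in point 2, here's a quick way to check the approximation error yourself at a smaller cardinality: restrict the data to an arbitrarily small slice and compare the exact and approximate counts side by side. This query is my own illustration rather than something from the paper, and the 'fh' prefix is arbitrary:

#standardSQL
-- Compare exact and approximate unique counts on a deliberately small slice
SELECT
  COUNT(DISTINCT actor.login) exact_cnt,
  APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.month.201601`
WHERE SUBSTR(actor.login, 0, 2) = 'fh'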

Let me demonstrate. First, let’s see how BigQuery performed the exact count for the first query above:

Computing an exact count on BigQuery — 3 steps

Stage 1 went through ~320M rows, and had an output of ~229M rows. The most interesting part here is that the output was “shuffled by hash.” Several nodes read the underlying data in parallel and transmitted it to a second layer of nodes while shuffling it by hash. Shuffling by hash guarantees that a given id always lands on the same node in the next layer, so no id gets counted twice across nodes. Thus Stage 2 is able to reduce these ~229M rows to 132 (one count for each “node” in this layer), and Stage 3 only needs to add these 132 numbers to produce the exact final result.

Most databases would avoid doing a shuffle because it's so resource intensive. In contrast, BigQuery can pull that off extremely quickly.

For the approximate count, only two stages were required and no shuffle was involved:

Computing an approximate count on BigQuery — 2 steps

Stage 1 went through ~320M rows and had an output of 1,323 rows. We can infer that 1,323 nodes scanned these rows in parallel, and each one computed a partial result. Stage 2 takes these 1,323 partial results and combines them into the final result.

This is the first time we see some hints of HyperLogLog++: Stage 1 did an HLL_COUNT.INIT(), while Stage 2 combined these results with HLL_COUNT.MERGE(). What’s remarkable here is that we can get a total unique count by merging partial results — thereby showcasing the highly distributed nature of HyperLogLog++.
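If you want to make those two stages explicit, you can call the same functions yourself. Here's a minimal sketch of the same count written with HLL_COUNT.INIT() and HLL_COUNT.MERGE(); the result should match APPROX_COUNT_DISTINCT() above, give or take the usual approximation behavior:

#standardSQL
-- Stage 1 equivalent: build a sketch of all ids; Stage 2 equivalent: merge it into a count
SELECT HLL_COUNT.MERGE(sketch) approx_cnt
FROM (
  SELECT HLL_COUNT.INIT(actor.login) sketch
  FROM `githubarchive.year.2016`
)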

Next, let’s explore some benefits of this approach.

Counting with bigger numbers

BigQuery can choose from a variety of algorithms to run an APPROX_COUNT_DISTINCT(). In the case above, we see strong hints that it used the underlying HyperLogLog++ functions, but that won’t necessarily always be the case.

For testing purposes, let’s go larger this time: How many unique users did GitHub have each year since 2011?

#standardSQL
SELECT CONCAT('20', _TABLE_SUFFIX) year, APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.year.20*`
GROUP BY year
ORDER BY year
3.8s elapsed, 8.37 GB processed

Notice that you can use * to match multiple tables, and then identify each one with the pseudo-column _TABLE_SUFFIX.
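As an aside, you can also filter on that pseudo-column to limit which tables get scanned. A small sketch of my own, restricting the same query to 2014 through 2016:

#standardSQL
-- Only scan the 2014-2016 yearly tables
SELECT CONCAT('20', _TABLE_SUFFIX) year, APPROX_COUNT_DISTINCT(actor.login) approx_cnt
FROM `githubarchive.year.20*`
WHERE _TABLE_SUFFIX BETWEEN '14' AND '16'
GROUP BY year
ORDER BY year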

Those per-year results reveal some nice GitHub growth, but here’s an interesting follow-on question: Can you determine the total number of unique users on GitHub for the whole period 2011–2016, only by looking at each year’s unique count?

The answer to that question, of course, is “No.” It would be naive to add all these numbers and then say that GitHub had ~18M unique users between 2011 and 2016. (Unfortunately, this false premise is all too common.) Instead, we need to account for unique users who are present in more than 1 year.

To get the total number of unique users through these 6 years, we would need to remember the ids of each user seen each year to account for the repeating ones. That’s possible, but HyperLogLog++ provides an interesting and more efficient alternative: We can get a “sketch” that summarizes all the unique ids in a much smaller space. It looks like this:

#standardSQL
SELECT CONCAT('20', _TABLE_SUFFIX) year
 , APPROX_COUNT_DISTINCT(actor.login) approx_cnt
 , HLL_COUNT.INIT(actor.login) sketch
FROM `githubarchive.year.20*`
GROUP BY year
ORDER BY year

What’s interesting here is that each of those sketch values summarizes all the unique ids seen in a given year in less than 32KiB (adjustable), and the sketches are combinable. Having saved the previous results into a temp table, we only need these sketches to get an approximate count of all GitHub unique users from 2011 to 2016:

#standardSQL
SELECT HLL_COUNT.MERGE(sketch)
FROM `fh-bigquery.temp.github_year_sketches`
11,334,294, 2.7s elapsed, 192 KB processed

So, GitHub had ~11.3M unique users between 2011 and 2016. This value is only 0.3% away from the exact count. To get this result, I only needed to store a 32KiB sketch for each year, instead of the 166MB that storing all the unique ids per year would require.
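For reference, here's one way you could materialize those per-year sketches into a table like the fh-bigquery.temp.github_year_sketches one I queried above. This is just a sketch that assumes the CREATE TABLE ... AS SELECT DDL is available to you; configuring a destination table for the query works just as well:

#standardSQL
-- Store one small sketch per year (assumes a dataset named `temp` exists in your project)
CREATE TABLE `temp.github_year_sketches` AS
SELECT CONCAT('20', _TABLE_SUFFIX) year
  , HLL_COUNT.INIT(actor.login) sketch
FROM `githubarchive.year.20*`
GROUP BY year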

Other interesting properties

Parallelization

The ability to run independent HyperLogLog++ counts and combine them later allows BigQuery to efficiently distribute the unique-counting load across several nodes, without the need to shuffle results. As explained previously, this approach also allows you to save the intermediate results for combining later. Outside BigQuery, it also allows systems to keep running counts of uniques with minimal RAM requirements.
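For example, a pipeline could store one small sketch per day (built with HLL_COUNT.INIT()) and, whenever an up-to-date number is needed, merge whatever sketches exist so far instead of re-reading the raw events. A minimal sketch of that pattern, assuming a hypothetical mydataset.daily_login_sketches table with a BYTES column named sketch:

#standardSQL
-- Running count of uniques across every daily sketch stored so far
SELECT HLL_COUNT.MERGE(sketch) running_approx_uniques
FROM `mydataset.daily_login_sketches`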

Merging with individual sketches

Let’s say I create a sketch that represents all the unique logins on GitHub that start with the letter "a":

#standardSQL
SELECT HLL_COUNT.MERGE(sketch) approx_cnt
FROM (
 SELECT HLL_COUNT.INIT(actor.login) sketch
 FROM `githubarchive.month.201601`
 WHERE SUBSTR(actor.login, 0, 1) = 'a'
)
83,899

That sketch represents ~83,899 unique users (exact number: 83,619). What happens when we merge this sketch with another sketch that represents only one element?

The following query creates one sketch for each GitHub login that starts with "b." We then merge each of these sketches with the sketch that represents all of the users that start with "a." With these results, we can see how the count changes when we add a single element that's not represented in a sketch:

#standardSQL
SELECT approx, COUNT(*) ids_with_b_that_merge_this_count
FROM (
  -- For each login starting with "b", merge its single-element sketch
  -- with the sketch of all logins starting with "a"
  SELECT login, HLL_COUNT.MERGE(sketch) approx
  FROM (
    SELECT actor.login,
      [
        HLL_COUNT.INIT(actor.login),
        (
          SELECT HLL_COUNT.INIT(actor.login) sketch
          FROM `githubarchive.month.201601`
          WHERE SUBSTR(actor.login, 0, 1) = 'a'
        )
      ] sketch_pair
    FROM `githubarchive.month.201601`
    WHERE SUBSTR(actor.login, 0, 1) = 'b'
    GROUP BY login
  ), UNNEST(sketch_pair) sketch
  GROUP BY login
)
GROUP BY approx
ORDER BY approx

Merging the sketch for each id that starts with ‘b’ with the sketch of all ids that start with ‘a’, then counting ids by the resulting approximate count

What these results say:

  • For 33,397 of the ids that start with "b," merging doesn’t change the HyperLogLog++ count of uniques at all.
  • For the other 12,398 ids that start with "b," the count of uniques went up by between 1 and 11.
  • In other words, it’s easy to find a large list of elements that we could add to a collection without changing the number of uniques that HyperLogLog++ estimates, while adding a single element can also make the count go up by anywhere between 1 and 11.

Testing with more than a billion ids

In case you're wondering how HyperLogLog++ handles more than 1 billion elements, let’s test it with more than 3 billion Reddit comments stored in BigQuery:

#standardSQL
SELECT COUNT(DISTINCT id) exact
FROM `fh-bigquery.reddit_comments.20*`
3,168,770,564, 27.9s elapsed, 27.2 GB processed

#standardSQL
SELECT HLL_COUNT.MERGE(sketch) approx
FROM (
 SELECT HLL_COUNT.INIT(id) sketch
 FROM `fh-bigquery.reddit_comments.20*`
)
3,161,955,636, 5.7s elapsed, 27.2 GB processed

BigQuery was able to exactly count 3 billion uniques in 28 seconds (not bad at all), but with HyperLogLog++, it did the same in only 5.7 seconds — with a result that is only 0.2% off from the exact unique count.

Next steps

To learn more about HyperLogLog specifically and BigQuery internals generally, start with the HyperLogLog++ paper mentioned above and the BigQuery documentation for the HLL_COUNT and approximate aggregation functions.

Want more stories? Check my Medium, follow me on Twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.