COUNT(DISTINCT user_id) requires BigQuery to hold every distinct value in memory for deduplication. For 7-day, 30-day, and 90-day rolling windows, each window scans all underlying data independently.
HyperLogLog++ (HLL++) sketches solve this problem. They’re probabilistic data structures that estimate cardinality (the number of distinct values) to within roughly 0.5-3% of the true count, depending on a precision parameter. The critical property: they’re composable. You can precompute daily sketches, then merge them for any arbitrary date range in milliseconds.
How HLL++ Works (The Mental Model)
You don’t need to understand the full algorithm to use it well, but the mental model helps.
A sketch is a compact summary of the distinct values it has seen. It doesn’t store the actual values — it stores a fixed-size statistical fingerprint. When you merge two sketches, the result is equivalent to having seen all values from both sets. This is the additive property that makes sketches powerful: the merge of Monday’s sketch and Tuesday’s sketch gives you the correct distinct count for Monday+Tuesday, without re-scanning either day’s data.
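To make the additive property concrete, here’s a quick check you could run yourself (the table name and dates are illustrative, not from any real schema): the merged two-day estimate should land within the error bound of an exact two-day count.

```sql
-- Illustrative check of the merge property.
WITH daily AS (
  SELECT
    event_date,
    HLL_COUNT.INIT(user_id) AS day_sketch  -- default precision (15)
  FROM `project.dataset.events`
  WHERE event_date IN (DATE '2025-01-06', DATE '2025-01-07')
  GROUP BY event_date
)
SELECT
  (SELECT HLL_COUNT.MERGE(day_sketch) FROM daily) AS merged_estimate,
  (SELECT COUNT(DISTINCT user_id)
   FROM `project.dataset.events`
   WHERE event_date IN (DATE '2025-01-06', DATE '2025-01-07')) AS exact_count;
```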
The tradeoff is precision. Sketches give an approximate count, not an exact one. The error rate depends on the precision parameter. For most analytics use cases, the approximation is good enough to be indistinguishable from exact counts, and the cost savings are enormous.
The Precompute-and-Merge Pattern
This is the pattern you’ll use in production. It works in two steps:
Step 1: Precompute daily sketches
Run this once to create the historical table, then append daily via a scheduled query or dbt model:
```sql
CREATE OR REPLACE TABLE `project.dataset.daily_user_sketches` AS
SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM `project.dataset.events`
GROUP BY event_date;
```

The 15 parameter is the precision (more on that below). Each row in this table is a single date with a sketch that summarizes all distinct user_ids for that day. The sketch is at most ~32 KB at this precision, regardless of how many users you had.
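For the daily append, a scheduled query along these lines works (assuming the table above exists; the one-day lag is an assumption about when your events land):

```sql
-- Append yesterday's sketch to the historical table.
INSERT INTO `project.dataset.daily_user_sketches` (event_date, user_sketch)
SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM `project.dataset.events`
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY event_date;
```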
Step 2: Merge sketches for any date range
```sql
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users_30d
FROM `project.dataset.daily_user_sketches`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```

This query reads 30 rows (one sketch per day), merges them, and returns the approximate distinct count. It runs in milliseconds regardless of whether you have 1 million users or 100 million. Compare that to a COUNT(DISTINCT user_id) over 30 days of raw events, which scans terabytes.
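One related function worth knowing: HLL_COUNT.MERGE returns a number, while HLL_COUNT.MERGE_PARTIAL returns another sketch. That lets you build intermediate rollups that stay composable. A sketch of the idea, with an illustrative weekly table:

```sql
-- Weekly rollup whose output column is still a mergeable sketch.
CREATE OR REPLACE TABLE `project.dataset.weekly_user_sketches` AS
SELECT
  DATE_TRUNC(event_date, WEEK) AS week_start,
  HLL_COUNT.MERGE_PARTIAL(user_sketch) AS user_sketch
FROM `project.dataset.daily_user_sketches`
GROUP BY week_start;
```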
You can add any dimension to the sketch table to enable slicing:
```sql
CREATE OR REPLACE TABLE `project.dataset.daily_user_sketches_by_country` AS
SELECT
  event_date,
  geo_country,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM `project.dataset.events`
GROUP BY event_date, geo_country;
```

Now you can get unique users by country for any date range by adding a country filter to the merge query.
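For example, a 30-day count for a single country (the country code is illustrative):

```sql
-- 30-day unique users for one country, read from the sliced sketch table.
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users_30d_us
FROM `project.dataset.daily_user_sketches_by_country`
WHERE geo_country = 'US'
  AND event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```

Dropping the geo_country filter merges across all countries and still deduplicates correctly: a user active in two countries is counted once in the merged result.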
APPROX_COUNT_DISTINCT vs HLL_COUNT
BigQuery offers a simpler function for approximate counts:
```sql
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `project.dataset.events`
WHERE event_date >= '2025-01-01';
```

APPROX_COUNT_DISTINCT is convenient for one-off queries. It’s faster than exact COUNT(DISTINCT) and uses less memory. But it has a fundamental limitation: it’s not composable.
You can’t add yesterday’s APPROX_COUNT_DISTINCT to today’s and get a valid 2-day count. Users who appear on both days would be double-counted. The function gives you a single-query approximation, not a building block for pre-aggregation.
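You can see the failure mode directly by computing both side by side (table and dates illustrative): the naive sum double-counts anyone active on both days, while the merged sketch does not.

```sql
-- Naive sum of per-day approximations vs. a proper sketch merge.
SELECT
  SUM(daily_approx) AS naive_two_day_sum,        -- double-counts repeat users
  HLL_COUNT.MERGE(daily_sketch) AS merged_count  -- deduplicates across days
FROM (
  SELECT
    event_date,
    APPROX_COUNT_DISTINCT(user_id) AS daily_approx,
    HLL_COUNT.INIT(user_id) AS daily_sketch
  FROM `project.dataset.events`
  WHERE event_date BETWEEN DATE '2025-01-06' AND DATE '2025-01-07'
  GROUP BY event_date
) AS per_day;
```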
Use APPROX_COUNT_DISTINCT for ad-hoc exploration. Use HLL_COUNT.INIT and HLL_COUNT.MERGE for production metrics, dashboards, and anything where you need rolling windows or flexible date ranges.
The Precision Parameter
The precision parameter (valid range: 10-24) controls the tradeoff between accuracy and sketch size:
| Precision | Approximate Error Rate | Sketch Size | Use Case |
|---|---|---|---|
| 10 | ~3.25% | ~1 KB | Rough estimates, very high volume |
| 12 | ~1.63% | ~4 KB | GA4 session counts (Google’s internal choice) |
| 14 | ~0.81% | ~16 KB | GA4 user counts (Google’s internal choice) |
| 15 | ~0.57% | ~32 KB | Default recommendation for most analytics |
| 18 | ~0.20% | ~256 KB | Near-exact requirements |
| 24 | ~0.03% | ~16 MB | Practically exact |
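These rates track the standard HyperLogLog error bound, relative error ≈ 1.04 / √(2^precision): at precision 15 that’s 1.04 / √32768 ≈ 0.57%, and each additional bit of precision cuts the error by a factor of √2 while doubling the sketch size. Treat the table as ballpark figures derived from that bound, not guarantees.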
Precision 15 gives roughly 0.6% error. On a metric of 1 million unique users, that’s +/- about 6,000 users. For virtually all business decisions, this is indistinguishable from exact.
GA4 uses precision 14 for user counts and 12 for sessions internally. If Google considers that accurate enough for their own analytics product, it’s likely accurate enough for your dashboards.
Higher precision means larger sketches, which means more storage and slightly slower merges. The difference is negligible for daily sketch tables (at most ~32 KB per row at precision 15), but it matters if you store sketches at very high-cardinality dimensions (per-user, per-page-URL).
Rolling Window Metrics
The precompute-and-merge pattern shines for rolling windows. A common dashboard requirement:
```sql
-- 7-day, 30-day, and 90-day unique users in one query
SELECT
  (SELECT HLL_COUNT.MERGE(user_sketch)
   FROM `project.dataset.daily_user_sketches`
   WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)) AS unique_users_7d,
  (SELECT HLL_COUNT.MERGE(user_sketch)
   FROM `project.dataset.daily_user_sketches`
   WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)) AS unique_users_30d,
  (SELECT HLL_COUNT.MERGE(user_sketch)
   FROM `project.dataset.daily_user_sketches`
   WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)) AS unique_users_90d;
```

With exact COUNT(DISTINCT), each window scans the full raw event table for its date range. The 90-day window alone might scan terabytes. With sketches, all three windows read from the tiny sketch table and return in under a second.
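If the dashboard needs a rolling series (one value per day) rather than a single snapshot, a range self-join over the sketch table is one way to express it; it stays cheap because the table holds one small row per day:

```sql
-- 7-day rolling unique users, one row per day (illustrative variant).
SELECT
  anchor.event_date,
  HLL_COUNT.MERGE(window_days.user_sketch) AS unique_users_7d
FROM `project.dataset.daily_user_sketches` AS anchor
JOIN `project.dataset.daily_user_sketches` AS window_days
  ON window_days.event_date
     BETWEEN DATE_SUB(anchor.event_date, INTERVAL 6 DAY) AND anchor.event_date
GROUP BY anchor.event_date
ORDER BY anchor.event_date;
```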
Performance Impact
- Execution speed: 4–8x faster than exact COUNT(DISTINCT) on the same data
- Slot consumption: roughly 8x lower, because sketches require far less memory than full deduplication
- Rolling windows: A 90-day unique user count that took 30 seconds with exact counting takes 200 milliseconds with sketch merges
- Storage: A daily sketch table for a billion-user dataset is megabytes, not terabytes
The performance gain comes from precomputing once and querying many times. The daily sketch computation runs once (or incrementally, via a dbt incremental model). Subsequent dashboard loads, ad-hoc queries, and API calls read from the sketch table instead of raw events.
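If you run the precompute through dbt, a minimal incremental model might look like this (the model name, source names, and config are assumptions, not prescriptions):

```sql
-- models/daily_user_sketches.sql (hypothetical dbt incremental model)
{{ config(
    materialized='incremental',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM {{ source('analytics', 'events') }}
{% if is_incremental() %}
  -- Only process days not already in the target table.
  WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
GROUP BY event_date
```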
When Not to Use HLL++
Sketches are wrong for:
- Exact counts required by regulation or contracts: Financial reporting, SLA metrics where +/- 1% matters legally
- Very small cardinalities: If you have fewer than 10,000 distinct values, exact counting is cheap enough and the approximation error might be proportionally noticeable
- Non-count-distinct use cases: HLL++ only does cardinality estimation. For other approximate analytics (percentiles, quantiles), look at APPROX_QUANTILES (example below)
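For reference, the quantile counterpart looks like this (the table and column names are made up for illustration):

```sql
-- APPROX_QUANTILES(x, 100) returns 101 boundary values;
-- OFFSET(50) picks the approximate median.
SELECT
  APPROX_QUANTILES(session_duration_seconds, 100)[OFFSET(50)] AS approx_median_duration
FROM `project.dataset.sessions`;
```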
For unique users, unique sessions, unique devices, and unique page URLs across time windows, HLL++ sketches reduce both cost and latency significantly. The cost model math favors approximate answers when exact counting is prohibitively expensive.