COUNT(DISTINCT user_id) requires BigQuery to hold every distinct value in memory for deduplication. For 7-day, 30-day, and 90-day rolling windows, each window scans all underlying data independently.
HyperLogLog++ (HLL++) sketches solve this problem. They’re probabilistic data structures that estimate cardinality (the number of distinct values) to within roughly 0.5-3% of the true count, depending on a precision parameter. The critical property: they’re composable. You can precompute daily sketches, then merge them for any arbitrary date range in milliseconds.
How HLL++ Works (The Mental Model)
You don’t need to understand the full algorithm to use it well, but the mental model helps.
A sketch is a compact summary of the distinct values it has seen. It doesn’t store the actual values — it stores a fixed-size statistical fingerprint. When you merge two sketches, the result is equivalent to having seen all values from both sets. This is the additive property that makes sketches powerful: the merge of Monday’s sketch and Tuesday’s sketch gives you the correct distinct count for Monday+Tuesday, without re-scanning either day’s data.
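To make the additive property concrete, here’s a quick check you could run yourself (the table name and dates are illustrative, not from any real schema): the merged two-day estimate should land within the error bound of an exact two-day count.

```sql
-- Illustrative check of the merge property.
WITH daily AS (
  SELECT
    event_date,
    HLL_COUNT.INIT(user_id) AS day_sketch  -- default precision (15)
  FROM `project.dataset.events`
  WHERE event_date IN (DATE '2025-01-06', DATE '2025-01-07')
  GROUP BY event_date
)
SELECT
  (SELECT HLL_COUNT.MERGE(day_sketch) FROM daily) AS merged_estimate,
  (SELECT COUNT(DISTINCT user_id)
   FROM `project.dataset.events`
   WHERE event_date IN (DATE '2025-01-06', DATE '2025-01-07')) AS exact_count;
```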
The tradeoff is precision. Sketches give an approximate count, not an exact one. The error rate depends on the precision parameter. For most analytics use cases, the approximation is good enough to be indistinguishable from exact counts, and the cost savings are enormous.
The Precompute-and-Merge Pattern
This is the pattern you’ll use in production. It works in two steps:
Step 1: Precompute daily sketches
Run this once to create the historical table, then append daily via a scheduled query or dbt model:
```sql
CREATE OR REPLACE TABLE `project.dataset.daily_user_sketches` AS
SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM `project.dataset.events`
GROUP BY event_date;
```

The 15 parameter is the precision (more on that below). Each row in this table is a single date with a sketch that summarizes all distinct user_ids for that day. The sketch is at most ~32 KB at this precision, regardless of how many users you had.
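For the daily append, a scheduled query along these lines works (assuming the table above exists; the one-day lag is an assumption about when your events land):

```sql
-- Append yesterday's sketch to the historical table.
INSERT INTO `project.dataset.daily_user_sketches` (event_date, user_sketch)
SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM `project.dataset.events`
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY event_date;
```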
Step 2: Merge sketches for any date range
```sql
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users_30d
FROM `project.dataset.daily_user_sketches`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```

This query reads 30 rows (one sketch per day), merges them, and returns the approximate distinct count. It runs in milliseconds regardless of whether you have 1 million users or 100 million. Compare that to a COUNT(DISTINCT user_id) over 30 days of raw events, which scans terabytes.
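One related function worth knowing: HLL_COUNT.MERGE returns a number, while HLL_COUNT.MERGE_PARTIAL returns another sketch. That lets you build intermediate rollups that stay composable. A sketch of the idea, with an illustrative weekly table:

```sql
-- Weekly rollup whose output column is still a mergeable sketch.
CREATE OR REPLACE TABLE `project.dataset.weekly_user_sketches` AS
SELECT
  DATE_TRUNC(event_date, WEEK) AS week_start,
  HLL_COUNT.MERGE_PARTIAL(user_sketch) AS user_sketch
FROM `project.dataset.daily_user_sketches`
GROUP BY week_start;
```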
You can add any dimension to the sketch table to enable slicing:
```sql
CREATE OR REPLACE TABLE `project.dataset.daily_user_sketches_by_country` AS
SELECT
  event_date,
  geo_country,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM `project.dataset.events`
GROUP BY event_date, geo_country;
```

Now you can get unique users by country for any date range by adding a country filter to the merge query.
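For example, a 30-day count for a single country (the country code is illustrative):

```sql
-- 30-day unique users for one country, read from the sliced sketch table.
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users_30d_us
FROM `project.dataset.daily_user_sketches_by_country`
WHERE geo_country = 'US'
  AND event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```

Dropping the geo_country filter merges across all countries and still deduplicates correctly: a user active in two countries is counted once in the merged result.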
APPROX_COUNT_DISTINCT vs HLL_COUNT
BigQuery offers a simpler function for approximate counts:
```sql
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `project.dataset.events`
WHERE event_date >= '2025-01-01';
```

APPROX_COUNT_DISTINCT is convenient for one-off queries. It’s faster than exact COUNT(DISTINCT) and uses less memory. But it has a fundamental limitation: it’s not composable.
You can’t add yesterday’s APPROX_COUNT_DISTINCT to today’s and get a valid 2-day count. Users who appear on both days would be double-counted. The function gives you a single-query approximation, not a building block for pre-aggregation.
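You can see the failure mode directly by computing both side by side (table and dates illustrative): the naive sum double-counts anyone active on both days, while the merged sketch does not.

```sql
-- Naive sum of per-day approximations vs. a proper sketch merge.
SELECT
  SUM(daily_approx) AS naive_two_day_sum,        -- double-counts repeat users
  HLL_COUNT.MERGE(daily_sketch) AS merged_count  -- deduplicates across days
FROM (
  SELECT
    event_date,
    APPROX_COUNT_DISTINCT(user_id) AS daily_approx,
    HLL_COUNT.INIT(user_id) AS daily_sketch
  FROM `project.dataset.events`
  WHERE event_date BETWEEN DATE '2025-01-06' AND DATE '2025-01-07'
  GROUP BY event_date
) AS per_day;
```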
Use APPROX_COUNT_DISTINCT for ad-hoc exploration. Use HLL_COUNT.INIT and HLL_COUNT.MERGE for production metrics, dashboards, and anything where you need rolling windows or flexible date ranges.
The Precision Parameter
The precision parameter (valid range: 10-24) controls the tradeoff between accuracy and sketch size:
| Precision | Approximate Error Rate | Sketch Size | Use Case |
|---|---|---|---|
| 10 | ~3.25% | ~1 KB | Rough estimates, very high volume |
| 12 | ~1.63% | ~4 KB | GA4 session counts (Google’s internal choice) |
| 14 | ~0.81% | ~16 KB | GA4 user counts (Google’s internal choice) |
| 15 | ~0.57% | ~32 KB | Default recommendation for most analytics |
| 18 | ~0.20% | ~256 KB | Near-exact requirements |
| 24 | ~0.03% | ~16 MB | Practically exact |
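These rates track the standard HyperLogLog error bound, relative error ≈ 1.04 / √(2^precision): at precision 15 that’s 1.04 / √32768 ≈ 0.57%, and each additional bit of precision cuts the error by a factor of √2 while doubling the sketch size. Treat the table as ballpark figures derived from that bound, not guarantees.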
Precision 15 gives roughly 0.6% error. On a metric of 1 million unique users, that’s +/- about 6,000 users. For virtually all business decisions, this is indistinguishable from exact.
GA4 uses precision 14 for user counts and 12 for sessions internally. If Google considers that accurate enough for their own analytics product, it’s likely accurate enough for your dashboards.
Higher precision means larger sketches, which means more storage and slightly slower merges. The difference is negligible for daily sketch tables (at most ~32 KB per row at precision 15), but it matters if you store sketches at very high-cardinality dimensions (per-user, per-page-URL).
Rolling Window Metrics
The precompute-and-merge pattern shines for rolling windows. A common dashboard requirement:
```sql
-- 7-day, 30-day, and 90-day unique users in one query
SELECT
  (SELECT HLL_COUNT.MERGE(user_sketch)
   FROM `project.dataset.daily_user_sketches`
   WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)) AS unique_users_7d,
  (SELECT HLL_COUNT.MERGE(user_sketch)
   FROM `project.dataset.daily_user_sketches`
   WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)) AS unique_users_30d,
  (SELECT HLL_COUNT.MERGE(user_sketch)
   FROM `project.dataset.daily_user_sketches`
   WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)) AS unique_users_90d;
```

With exact COUNT(DISTINCT), each window scans the full raw event table for its date range. The 90-day window alone might scan terabytes. With sketches, all three windows read from the tiny sketch table and return in under a second.
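If the dashboard needs a rolling series (one value per day) rather than a single snapshot, a range self-join over the sketch table is one way to express it; it stays cheap because the table holds one small row per day:

```sql
-- 7-day rolling unique users, one row per day (illustrative variant).
SELECT
  anchor.event_date,
  HLL_COUNT.MERGE(window_days.user_sketch) AS unique_users_7d
FROM `project.dataset.daily_user_sketches` AS anchor
JOIN `project.dataset.daily_user_sketches` AS window_days
  ON window_days.event_date
     BETWEEN DATE_SUB(anchor.event_date, INTERVAL 6 DAY) AND anchor.event_date
GROUP BY anchor.event_date
ORDER BY anchor.event_date;
```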
Performance Impact
- Execution speed: 4–8x faster than exact COUNT(DISTINCT) on the same data
- Slot consumption: roughly 8x lower, because sketches require far less memory than full deduplication
- Rolling windows: A 90-day unique user count that took 30 seconds with exact counting takes 200 milliseconds with sketch merges
- Storage: A daily sketch table for a billion-user dataset is megabytes, not terabytes
The performance gain comes from precomputing once and querying many times. The daily sketch computation runs once (or incrementally, via a dbt incremental model). Subsequent dashboard loads, ad-hoc queries, and API calls read from the sketch table instead of raw events.
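If you run the precompute through dbt, a minimal incremental model might look like this (the model name, source names, and config are assumptions, not prescriptions):

```sql
-- models/daily_user_sketches.sql (hypothetical dbt incremental model)
{{ config(
    materialized='incremental',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch
FROM {{ source('analytics', 'events') }}
{% if is_incremental() %}
  -- Only process days not already in the target table.
  WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
GROUP BY event_date
```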
When Not to Use HLL++
Sketches are wrong for:
- Exact counts required by regulation or contracts: Financial reporting, SLA metrics where +/- 1% matters legally
- Very small cardinalities: If you have fewer than 10,000 distinct values, exact counting is cheap enough and the approximation error might be proportionally noticeable
- Non-count-distinct use cases: HLL++ only does cardinality estimation. For other approximate analytics (percentiles, quantiles), look at APPROX_QUANTILES (example below)
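For reference, the quantile counterpart looks like this (the table and column names are made up for illustration):

```sql
-- APPROX_QUANTILES(x, 100) returns 101 boundary values;
-- OFFSET(50) picks the approximate median.
SELECT
  APPROX_QUANTILES(session_duration_seconds, 100)[OFFSET(50)] AS approx_median_duration
FROM `project.dataset.sessions`;
```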
For unique users, unique sessions, unique devices, and unique page URLs across time windows, HLL++ sketches reduce both cost and latency significantly. The cost model math favors approximate answers when exact counting is prohibitively expensive.