March 25, 2021
SELECT COUNT(DISTINCT A) FROM R
SELECT A, COUNT(*) FROM R GROUP BY A
SELECT A, COUNT(*) ... ORDER BY COUNT(*) DESC LIMIT 10
These are all "Holistic" aggregates (O(|A|) memory). What happens when you run out of memory?
Sketching: Hash function tricks used to estimate useful statistical properties.
Challenge: To avoid double counting, we need to track which values of A we've seen. O(|A|) memory required.
A brief digression
Flips | Score |
---|---|
(👽) | 0 |
(🐕) (👽) | 1 |
(🐕) (🐕) (🐕) (🐕) (🐕) (👽) | 5 |
Flips | Score | Probability | E[# Games] |
---|---|---|---|
(👽) | 0 | 0.5 | 2 |
(🐕)(👽) | 1 | 0.25 | 4 |
(🐕)(🐕)(👽) | 2 | 0.125 | 8 |
(🐕)×N (👽) | N | $1/2^{N+1}$ | $2^{N+1}$ |
If I told you that in a series of games, my best score was N, you might expect that I played $2^{N+1}$ games.
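The Probability and E[# Games] columns can be justified in one line each, assuming fair and independent flips (an assumption the table implies but does not state). A score of N needs N dogs followed by one alien, and the number of games played until a given outcome first appears is geometric:

```latex
P(\text{score} = N) = \left(\tfrac{1}{2}\right)^{N} \cdot \tfrac{1}{2} = \frac{1}{2^{N+1}},
\qquad
E[\#\,\text{games until a score of } N] = \frac{1}{P(\text{score} = N)} = 2^{N+1}
```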
To do that, I only need to track my top score!
Idea: Simulate coin flips with a hash function
... take the index of the lowest-order nonzero bit
Object | Hash Bits | Score |
---|---|---|
O1 | 01011011 | 0 |
O2 | 00110111 | 0 |
O3 | 00111000 | 3 |
O4 | 10010010 | 1 |
O3 | 00111000 | 3 |
Top score | | 3 |
Estimate: $2^{3+1} = 16$
Duplicates can't raise the top score!
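A minimal sketch of the top-score estimator, replaying the hash bits from the table rather than calling a real hash function. The helper name `score` and the handling of an all-zero hash are assumptions made for illustration.

```python
def score(hash_bits):
    """Index of the lowest-order nonzero bit (number of trailing zeros)."""
    value = int(hash_bits, 2)
    if value == 0:
        # Assumption: an all-zero hash scores the full bit width.
        return len(hash_bits)
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1

# Hash bits copied from the table; O3 appears twice in the stream.
stream = [
    ("O1", "01011011"),
    ("O2", "00110111"),
    ("O3", "00111000"),
    ("O4", "10010010"),
    ("O3", "00111000"),
]

top_score = 0
for _, bits in stream:
    # Only one integer of state is kept, no matter how long the stream is.
    top_score = max(top_score, score(bits))

estimate = 2 ** (top_score + 1)
print(top_score, estimate)  # 3 16
```

Seeing O3 a second time leaves `top_score` unchanged, which is why duplicates are never double counted.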
Problem: Noisy estimate!
Idea 1: Instead of your top score, track the lowest score you have not gotten yet (R).
Object | Hash Bits | Score |
---|---|---|
O1 | 01011011 | 0 |
O2 | 00110111 | 0 |
O3 | 00111000 | 3 |
O4 | 10010010 | 1 |
O3 | 00111000 | 3 |
Scores seen | {0, 1, 3} | R = 2 |
Estimate: $2^R / \phi = 2^2 / 0.77351 \approx 5.2$
Idea 2: Compute several estimates in parallel and average estimates.
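Both ideas can be sketched together. The first half reproduces R = 2 and the estimate of about 5.2 from the table. The second half runs several salted hash functions and averages the per-trial estimates. The salting scheme built on `hashlib.sha256`, the 32-bit width, and the default of 16 trials are assumptions chosen for illustration, not something the slides specify.

```python
import hashlib

PHI = 0.77351  # correction factor quoted on the slide

def score(value, width):
    """Index of the lowest-order nonzero bit; `width` if no bit is set."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1

def lowest_missing(scores):
    """R: the smallest score that has not been seen yet."""
    r = 0
    while r in scores:
        r += 1
    return r

def estimate_from_scores(scores):
    return 2 ** lowest_missing(scores) / PHI

# Idea 1, replayed on the hash bits from the table.
table_bits = ["01011011", "00110111", "00111000", "10010010", "00111000"]
table_scores = {score(int(bits, 2), 8) for bits in table_bits}
print(sorted(table_scores), lowest_missing(table_scores))  # [0, 1, 3] 2
print(round(estimate_from_scores(table_scores), 1))        # 5.2

def salted_hash(obj, trial):
    """Hypothetical family of hash functions: one salt per trial."""
    digest = hashlib.sha256(f"{trial}:{obj}".encode()).digest()
    return int.from_bytes(digest[:4], "big")  # keep 32 bits

def estimate_distinct(stream, trials=16):
    """Idea 2: run independent trials and average their estimates."""
    estimates = []
    for trial in range(trials):
        scores = {score(salted_hash(obj, trial), 32) for obj in stream}
        estimates.append(estimate_from_scores(scores))
    return sum(estimates) / len(estimates)
```

A production version would store each trial's set of scores as a small bitmap instead of a Python set, so memory stays fixed regardless of the stream.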
Problem: Need a counter for each individual A
Idea: Keep only one counter!
No... seriously
Object | δ(Oi) | Running Count |
---|---|---|
O3 | -1 | -1 |
O1 | +1 | 0 |
O4 | -1 | -1 |
O2 | +1 | 0 |
O4 | -1 | -1 |
O1 | +1 | 0 |
O3 | -1 | -1 |
O3 | -1 | -2 |
O1 | +1 | -1 |
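$$
\text{Total} = \text{COUNT\_OF}(O_i) \cdot \delta(O_i) + \sum_{j \neq i} \text{COUNT\_OF}(O_j) \cdot \delta(O_j)
$$
$$
E\Big[\sum_{j \neq i} \text{COUNT\_OF}(O_j) \cdot \delta(O_j)\Big] = \frac{1}{2} \sum_{j \neq i} \text{COUNT\_OF}(O_j) - \frac{1}{2} \sum_{j \neq i} \text{COUNT\_OF}(O_j) = 0
$$
$$
\text{Total} \approx \text{COUNT\_OF}(O_i) \cdot \delta(O_i) + 0
$$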
Running total was −1
Object | δ(Oi) | Estimate |
---|---|---|
O1 | +1 | -1 |
O2 | +1 | -1 |
O3 | -1 | +1 |
O4 | -1 | +1 |
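The single-counter scheme is small enough to replay directly. This sketch hard-codes the δ values and the stream from the tables above in place of a hash function. The dictionary names are illustrative only.

```python
# Sign assigned to each object, copied from the tables above.
delta = {"O1": +1, "O2": +1, "O3": -1, "O4": -1}

# Stream in the order shown in the running-count table.
stream = ["O3", "O1", "O4", "O2", "O4", "O1", "O3", "O3", "O1"]

total = 0
for obj in stream:
    total += delta[obj]  # the single counter is the only state kept

# Estimate for each object: total * delta(object).
estimates = {obj: total * sign for obj, sign in delta.items()}

# True counts, computed only to compare against the estimates.
real = {obj: stream.count(obj) for obj in delta}

print(total)      # -1
print(estimates)  # {'O1': -1, 'O2': -1, 'O3': 1, 'O4': 1}
print(real)       # {'O1': 3, 'O2': 1, 'O3': 3, 'O4': 2}
```

Comparing `estimates` with `real` shows the problem: every object shares one counter, so the cross terms swamp the signal.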
Not... so... great
Problem 1: All of the objects use the same counter (no way to differentiate an estimate for O1 from O2).
Problem 2: The estimate is really noisy
Idea 1: Multiple Buckets (h(x) picks a bucket)
Idea 2: Multiple Trials (h→h1,h2,…; δ→δ1,δ2,…)
Object | h1(Oi) | δ1(Oi) | h2(Oi) | δ2(Oi) |
---|---|---|---|---|
O1 | Bucket 1 | -1 | Bucket 2 | 1 |
O2 | Bucket 1 | -1 | Bucket 1 | -1 |
O3 | Bucket 2 | 1 | Bucket 1 | -1 |
O4 | Bucket 1 | -1 | Bucket 1 | 1 |
Objects Seen: (none)
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | 0 | 0 |
Trial 2 | 0 | 0 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
O1 | 0 | 0 | 0.0 | 0 |
O2 | 0 | 0 | 0.0 | 0 |
O3 | 0 | 0 | 0.0 | 0 |
O4 | 0 | 0 | 0.0 | 0 |
Objects Seen: O2
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -1 | 0 |
Trial 2 | -1 | 0 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
O1 | 1 | 0 | 0.5 | 0 |
O2 | 1 | 1 | 1.0 | 1 |
O3 | 0 | 1 | 0.5 | 0 |
O4 | 1 | -1 | 0.0 | 0 |
Objects Seen: O2,O1
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -2 | 0 |
Trial 2 | -1 | 1 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
O1 | 2 | 1 | 1.5 | 1 |
O2 | 2 | 1 | 1.5 | 1 |
O3 | 0 | 1 | 0.5 | 0 |
O4 | 2 | -1 | 0.5 | 0 |
Objects Seen: O2,O1,O4
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -3 | 0 |
Trial 2 | 0 | 1 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
O1 | 3 | 1 | 2.0 | 1 |
O2 | 3 | 0 | 1.5 | 1 |
O3 | 0 | 0 | 0.0 | 0 |
O4 | 3 | 0 | 1.5 | 1 |
Objects Seen: O2,O1,O4,O1
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -4 | 0 |
Trial 2 | 0 | 2 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
O1 | 4 | 2 | 3.0 | 2 |
O2 | 4 | 0 | 2.0 | 1 |
O3 | 0 | 0 | 0.0 | 0 |
O4 | 4 | 0 | 2.0 | 1 |
Objects Seen: O2,O1,O4,O1,O2
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | -5 | 0 |
Trial 2 | -1 | 2 |
Object | Trial 1 | Trial 2 | Estimate | Real |
---|---|---|---|---|
O1 | 5 | 2 | 3.5 | 2 |
O2 | 5 | 1 | 3.0 | 2 |
O3 | 0 | 1 | 0.5 | 0 |
O4 | 5 | -1 | 2.0 | 1 |
In practice, combine trials with the median rather than the mean (the Estimate column above is the mean of the two trials).
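The walkthrough can be replayed in a few lines. This sketch hard-codes the bucket and sign assignments from the h1/δ1/h2/δ2 table instead of computing real hashes. The class name `CountSketch` and its methods are illustrative, and buckets are 0-indexed in the code (Bucket 1 is index 0).

```python
from statistics import mean

# Bucket chosen by each trial's hash function (0 = Bucket 1, 1 = Bucket 2).
H = [
    {"O1": 0, "O2": 0, "O3": 1, "O4": 0},  # h1
    {"O1": 1, "O2": 0, "O3": 0, "O4": 0},  # h2
]
# Sign chosen by each trial's delta function.
DELTA = [
    {"O1": -1, "O2": -1, "O3": +1, "O4": -1},  # delta1
    {"O1": +1, "O2": -1, "O3": -1, "O4": +1},  # delta2
]

class CountSketch:
    def __init__(self, h, delta, buckets=2):
        self.h = h
        self.delta = delta
        # One row of counters per trial.
        self.counters = [[0] * buckets for _ in h]

    def add(self, obj):
        for t, row in enumerate(self.counters):
            row[self.h[t][obj]] += self.delta[t][obj]

    def per_trial(self, obj):
        """Each trial's estimate: bucket value times the object's sign."""
        return [row[self.h[t][obj]] * self.delta[t][obj]
                for t, row in enumerate(self.counters)]

sketch = CountSketch(H, DELTA)
for obj in ["O2", "O1", "O4", "O1", "O2"]:
    sketch.add(obj)

print(sketch.counters)  # [[-5, 0], [-1, 2]]
for obj in ["O1", "O2", "O3", "O4"]:
    trials = sketch.per_trial(obj)
    print(obj, trials, mean(trials))
```

With only two trials the median and the mean coincide. Swapping `mean` for `statistics.median` starts to matter once there are three or more trials, because a single badly colliding trial then no longer drags the estimate.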
Problem: "Heavy Hitters" overwhelm smaller counts
Idea: Give up. Drop δ.
Object | Appearances | h1(Oi) | h2(Oi) |
---|---|---|---|
O1 | 10 | Bucket 1 | Bucket 2 |
O2 | 32 | Bucket 1 | Bucket 1 |
O3 | 1002 | Bucket 2 | Bucket 1 |
O4 | 500 | Bucket 1 | Bucket 1 |
Trial | Bucket 1 | Bucket 2 |
---|---|---|
Trial 1 | 542 | 1002 |
Trial 2 | 1534 | 10 |
Object | Appearances | Estimate 1 | Estimate 2 | Min |
---|---|---|---|---|
O1 | 10 | 542 | 10 | 10 |
O2 | 32 | 542 | 1534 | 542 |
O3 | 1002 | 1002 | 1534 | 1002 |
O4 | 500 | 542 | 1534 | 542 |
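Dropping δ gives what is usually called a Count-Min sketch: counters only grow, so every trial overestimates and the minimum is the tightest bound. The sketch below replays the table with its bucket assignments hard-coded (0 = Bucket 1, 1 = Bucket 2). The variable names are illustrative.

```python
# Bucket chosen by each trial's hash function, copied from the table.
H = [
    {"O1": 0, "O2": 0, "O3": 1, "O4": 0},  # h1
    {"O1": 1, "O2": 0, "O3": 0, "O4": 0},  # h2
]
appearances = {"O1": 10, "O2": 32, "O3": 1002, "O4": 500}

# One row of counters per trial, two buckets each.
counters = [[0, 0] for _ in H]
for obj, count in appearances.items():
    for t in range(len(H)):
        # Adding `count` at once is equivalent to `count` single increments.
        counters[t][H[t][obj]] += count

def estimate(obj):
    """Minimum over trials of the bucket the object hashes to."""
    return min(counters[t][H[t][obj]] for t in range(len(H)))

print(counters)  # [[542, 1002], [1534, 10]]
print({obj: estimate(obj) for obj in appearances})
# {'O1': 10, 'O2': 542, 'O3': 1002, 'O4': 542}
```

The estimate never falls below the true count. O2 and O4 stay inflated here because they collide with a heavier object in both trials, and more buckets or more trials would reduce that.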