March 23, 2021
What is the best, correct technique for task X, when Y is true?
So far: Correct = equivalent to the original RA for the query.
Only rewrites guaranteed to produce identical results are correct.
What happens when Correct allows for a margin of error?
ORDER BY ... LIMIT (Top-K), with some margin of error.
LIMIT without ORDER BY (Any-K).
Sacrifice a little accuracy for a lot of speed.
SELECT SUM(A) FROM R
Naively, you need to see all values of R.A
Avg(3,6,10,9,1,3,9,7,9,4,7,9,2,1,2,4,10,8,9,7)=6
Avg(3,6,10,9,1)=5.8 ≈6
Sum(k samples) ⋅ (N / k) ≈ Sum(∗)
Sampling lets you approximate aggregate values with orders of magnitude less data.
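A minimal sketch of this scaling in Python (the function name and the in-memory list are illustrative; a real system would sample pages or tuples, not a Python list):

```python
import random

def estimate_sum_avg(values, k):
    """Estimate SUM and AVG of `values` from k uniform samples (with replacement)."""
    N = len(values)
    sample = [random.choice(values) for _ in range(k)]  # k uniform samples of R.A
    est_avg = sum(sample) / k                           # AVG needs no rescaling
    est_sum = sum(sample) * (N / k)                     # scale the sample's SUM up by N/k
    return est_sum, est_avg

# e.g. the 20-value list above:
# estimate_sum_avg([3,6,10,9,1,3,9,7,9,4,7,9,2,1,2,4,10,8,9,7], k=5)
```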
Question: How accurate is an estimate from n samples?
With n tuples sampled uniformly with replacement
|AVG(samples) − AVG(real)| : the absolute error
P(|AVG(samples) − AVG(real)| ≥ ϵ) : its probability of exceeding the error threshold ϵ
P(|AVG(samples) − AVG(real)| ≥ ϵ) ≤ 2e^(−2nϵ² / (max(real) − min(real))²) : ... is below a threshold based on ϵ, n, and the min/max values.
"Hoeffding's Bound"
See also "Chernoff's Bound" (similar) and "Serfling's Bound" (works without replacement).
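As a hedged illustration, the bound can be evaluated directly, or inverted to ask how many samples are needed for a target error at a target confidence (the function names and example numbers below are mine):

```python
import math

def hoeffding_error_prob(n, eps, lo, hi):
    """Upper bound on P(|AVG(samples) - AVG(real)| >= eps), for n samples of values in [lo, hi]."""
    return 2 * math.exp(-2 * n * eps**2 / (hi - lo)**2)

def samples_needed(eps, delta, lo, hi):
    """Smallest n for which the Hoeffding bound drops below delta."""
    return math.ceil((hi - lo)**2 * math.log(2 / delta) / (2 * eps**2))

# e.g. values known to lie in [1, 10], target error 0.5 at 95% confidence:
# samples_needed(0.5, 0.05, 1, 10) -> 598 samples, regardless of the table's size
```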
What about non-sum-based aggregates?
Idea 1: Generate a bunch of samples of the same size and see how they're distributed.
The resulting histogram models the distribution of samples.
Problem: Generating samples is expensive!
Idea 2: Generate one sample, then resample it to see how it's distributed
The resulting histogram still models (in expectation) the distribution of samples.
Keep adding samples until you reach a target accuracy
Keep adding samples until you run out of time
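A minimal sketch of Idea 2 (the bootstrap); the resample count and the use of AVG as the aggregate are arbitrary choices here:

```python
import random

def bootstrap_avg_distribution(sample, num_resamples=1000):
    """Model how AVG-of-a-sample is distributed by resampling one sample with replacement."""
    k = len(sample)
    estimates = []
    for _ in range(num_resamples):
        resample = [random.choice(sample) for _ in range(k)]  # same size, with replacement
        estimates.append(sum(resample) / k)
    return estimates  # histogram these to estimate the error of the original sample's AVG

# Keep drawing real samples, re-running the bootstrap, and stop once the spread of
# `estimates` is within the target accuracy (or once the time budget runs out).
```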
Idea 1: Pick Randomly!
for i from 1 to num_samples:
    sample_id = random(0, num_records)               # pick one row id uniformly at random
    samples += [ table.where( rowid = sample_id ) ]  # random point lookup to fetch that row
Problem: Random scans are EXPENSIVE.
Idea 2: Assume data already randomized!
Pick a random start record and read sequentially from there.
Problem: Sequential records are almost never IID.
Idea 3: Prebuild samples!
Shuffle data into fixed size sample buckets (e.g., BlinkDB).
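A rough sketch of the prebuilt-sample idea (the bucket layout is illustrative, not BlinkDB's actual storage format): pay for the shuffle once, so that any prefix of buckets read sequentially is already a uniform sample.

```python
import random

def build_sample_buckets(table, bucket_size):
    """Shuffle the table once and split it into fixed-size sample buckets."""
    shuffled = list(table)
    random.shuffle(shuffled)                     # one-time cost when the data is organized
    return [shuffled[i:i + bucket_size]
            for i in range(0, len(shuffled), bucket_size)]

# At query time, scan buckets in order: the first b buckets form a uniform sample of
# about b * bucket_size tuples, read sequentially with no random I/O.
```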
Employee | City | Salary |
---|---|---|
Alice | NYC | $120k |
Bob | NYC | $110k |
Carol | NYC | $115k |
Dave | Syracuse | $80k |
SELECT City, AVG(Salary) FROM NYS_Salaries GROUP BY City;
Problem: Most data is about NYC. With N samples taken uniformly, margins of error for other cities are much bigger
Idea 1: Generate N / COUNT(DISTINCT City) samples for each group (sketched after this list)
Use COUNT(DISTINCT City) instead of COUNT(∗) as the total group size
Idea 2: Pre-generate sample buckets across a range of different strata (e.g., BlinkDB).
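A sketch of Idea 1 in Python (the row layout, the `City` key, and the equal per-group allocation are illustrative): split the sample budget evenly across groups so that small strata like Syracuse get as many samples as NYC.

```python
import random
from collections import defaultdict

def stratified_sample(rows, group_key, N):
    """Draw ~N samples split evenly across groups instead of uniformly over all rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    per_group = max(1, N // len(groups))              # N / COUNT(DISTINCT City)
    return {g: random.sample(members, min(per_group, len(members)))
            for g, members in groups.items()}         # uniform within each stratum

# Each group's AVG(Salary) is then estimated from its own stratum, so the margin of
# error for Syracuse no longer depends on how much of the table is about NYC.
```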
Assume: UNIQ(A, R) = UNIQ(A, S) = N
It takes O(√N) samples from both R and S to get even one match (a birthday-paradox argument: k samples from each side give k² candidate pairs, each matching with probability about 1/N, so k must be about √N before one match is expected).
R ⋈_B S ⋈_C T
Sample from R, Use all of S, T
For each tuple sampled from R, sample exactly one joining tuple from S
Question: Are we biasing the sampling process?
Goal: Sample r⋈s with probability p((r⋈s) ∈ R⋈S)
Actual: Sample r⋈s with probability p(r ∈ R) ⋅ p((r⋈s) ∈ R⋈S | r ∈ R)
p(r ∈ R) ⋅ p((r⋈s) ∈ R⋈S | r ∈ R) = p(r ∈ R, (r⋈s) ∈ R⋈S) ≠ p((r⋈s) ∈ R⋈S)
Correct the bias by scaling each sampled tuple by a factor n such that p(r ∈ R, (r⋈s) ∈ R⋈S) ⋅ n = p((r⋈s) ∈ R⋈S)
n = p((r⋈s) ∈ R⋈S) / p(r ∈ R, (r⋈s) ∈ R⋈S) = 1 / p(r ∈ R | (r⋈s) ∈ R⋈S) ≈ |R| ⋅ |{ s | s ∈ S, r.A = s.A }|
... that is, 1 over the probability that the sampled tuple of R participates in the join.
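A hedged sketch of the two-step join sampling with this correction factor (the names and the choice to sum a column of S are mine; this is plain importance weighting, not a specific system's estimator):

```python
import random
from collections import defaultdict

def sampled_join_sum(R, S, key, value, num_samples):
    """Estimate SUM(s[value]) over R ⋈ S: sample r uniformly from R, then one joining s,
    weighting each draw by n = |R| * |{s in S : s.A = r.A}| to undo the bias."""
    s_index = defaultdict(list)
    for s in S:
        s_index[s[key]].append(s)

    total = 0.0
    for _ in range(num_samples):
        r = random.choice(R)                        # p(r in R) = 1 / |R|
        matches = s_index.get(r[key], [])
        if not matches:                             # r joins with nothing: contributes 0
            continue
        s = random.choice(matches)                  # p(s | r) = 1 / |matches|
        total += len(R) * len(matches) * s[value]   # scale by the correction factor n
    return total / num_samples                      # unbiased estimate of the join's SUM
```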
Should I sample from R or S first?
(What if only one tuple from R joins with every tuple from S?)
Sample with all join orders at random.
Idea 1: Start small, keep making the sample size bigger
Idea 2: Normal BNLJ, but piggyback sampling off of the results.
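A rough sketch of Idea 2 (the structure and names are mine, not a specific system's): run an ordinary block nested loop join, but keep a running aggregate and a reservoir sample of the output produced so far, so an approximate answer is available after every block.

```python
import random

def bnlj_with_running_estimate(R_blocks, S, joins, value, reservoir_size=100):
    """Block nested loop join that piggybacks a running SUM and a reservoir sample
    of the join output, yielding a partial answer after each outer block."""
    reservoir, seen, running_sum = [], 0, 0.0
    for block in R_blocks:                      # normal BNLJ: one pass of S per block of R
        for s in S:
            for r in block:
                if joins(r, s):
                    seen += 1
                    running_sum += value(r, s)
                    if len(reservoir) < reservoir_size:   # reservoir sampling (Algorithm R)
                        reservoir.append((r, s))
                    else:
                        j = random.randrange(seen)
                        if j < reservoir_size:
                            reservoir[j] = (r, s)
        yield running_sum, seen, list(reservoir)  # approximate state so far
```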