
### OLA and AQP

#### CSE-4/562 Spring 2021

March 23, 2021

What is the best, correct technique for task X, when Y is true?

1. How do you define Correct and Best?
2. What correct alternatives are available?
3. How do you find the best available alternative?

So Far: Correct = The original RA for a query.

Only queries guaranteed to produce identical results are correct

What happens when Correct allows for a margin of error?

• Aggregate query with some margin of error.
• ORDER BY LIMIT (Top-K) with some margin of error.
• LIMIT without ORDER BY (Any-K).

Sacrifice a little accuracy for a lot of speed

### Approximate Query Processing


    SELECT SUM(A) FROM R


Naively, you need to see all values of R.A

### Online Aggregation

$Avg(3,6,10,9,1,3,9,7,9,4,7,9,2,1,2,4,10,8,9,7) = 6$

$Avg(3,6,10,9,1) = 5.8$ $\approx 6$

$Sum\left(\frac{k}{N}\; Samples\right) \cdot \frac{N}{k} \approx Sum(*)$

Sampling lets you approximate aggregate values with orders of magnitude less data.
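For instance, a minimal Python sketch of this scale-up estimator on the running example above (`estimate_sum` is a name invented here):

```python
import random

def estimate_sum(values, k):
    # Sum k values drawn uniformly at random (with replacement),
    # then scale the sample sum up by N/k to estimate the full sum.
    N = len(values)
    sample = [random.choice(values) for _ in range(k)]
    return sum(sample) * N / k

values = [3, 6, 10, 9, 1, 3, 9, 7, 9, 4, 7, 9, 2, 1, 2, 4, 10, 8, 9, 7]
print(sum(values))              # exact sum: 120
print(estimate_sum(values, 5))  # approximate; varies run to run
```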

Question: How accurate is an estimate from $n$ samples?

With $n$ tuples sampled uniformly with replacement:

$|AVG(samples) - AVG(real)|$ ... the absolute error ...

$P(|AVG(samples) - AVG(real)| \geq \epsilon)$ ... its probability of exceeding an error threshold $\epsilon$ ...

$P(|AVG(samples) - AVG(real)| \geq \epsilon) \leq 2e^{-\frac{2n\epsilon^2}{(max(real) - min(real))^2}}$ ... is below a threshold based on $\epsilon$, $n$, and the min/max values.

"Hoeffding's Bound"

### Bootstrapping

Idea 1: Generate a bunch of samples of the same size and see how they're distributed.

The resulting histogram models the distribution of samples.

Problem: Generating samples is expensive!

Idea 2: Generate one sample, then resample from it to see how it's distributed

The resulting histogram still models (in expectation) the distribution of samples.
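A minimal sketch of Idea 2, assuming we already hold one uniform sample in memory:

```python
import random
import statistics

def bootstrap_avg_interval(sample, trials=1000):
    # Resample the one sample we have (with replacement, same size)
    # and record each resample's average; the spread of these averages
    # models, in expectation, the spread of fresh samples.
    estimates = sorted(
        statistics.mean(random.choices(sample, k=len(sample)))
        for _ in range(trials)
    )
    # Read an approximate 95% interval off the resulting histogram.
    return estimates[int(0.025 * trials)], estimates[int(0.975 * trials)]
```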

### Error Bounds

| Aggregate | Error Bound |
| --- | --- |
| SUM, COUNT, AVG (sampling with replacement) | Hoeffding's Bound, Chernoff's Bound |
| SUM, COUNT, AVG (sampling without replacement) | Serfling's Bound |
| Any other aggregate | Bootstrapping |

Keep adding samples until you reach a target accuracy

Keep adding samples until you run out of time
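Both stopping rules fit in one loop; a sketch reusing the `hoeffding_bound` helper above, with a hypothetical `sample_one()` that draws one value:

```python
import time

def online_avg(sample_one, lo, hi, epsilon=0.5, target=0.05, deadline=1.0):
    # hoeffding_bound() is the sketch from the "Hoeffding's Bound" slide.
    # Stop once the estimate is within epsilon of the truth with
    # probability >= 1 - target, or when the time budget expires.
    start, total, n = time.time(), 0.0, 0
    while time.time() - start < deadline:
        total, n = total + sample_one(), n + 1
        if hoeffding_bound(n, epsilon, lo, hi) <= target:
            break
    return total / n
```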

### Generating Samples

• Sampling From Disk: Random seeks are slow.
• Sampling For Group-By or Selections: Low-frequency events don't get sampled.
• Sampling From Joins: Independent samples of the inputs rarely join with each other.

### Sampling From Disk

Idea 1: Pick Randomly!


    import random

    samples = []
    for i in range(num_samples):
        # Each rowid lookup is an independent random seek into the
        # (hypothetical) table API from the original pseudocode.
        sample_id = random.randrange(0, num_records)
        samples += [ table.where(rowid == sample_id) ]


Problem: Random scans are EXPENSIVE.

Idea 2: Assume data already randomized!

Pick a random start record and read sequentially from there.

Problem: Sequential records are almost never IID.

Idea 3: Prebuild samples!

Shuffle data into fixed-size sample buckets (e.g., BlinkDB).
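A sketch of the prebuilt-sample idea (names invented here): shuffle once, offline, and any prefix of the resulting buckets can later be read sequentially as a uniform sample.

```python
import random

def build_sample_buckets(rows, bucket_size):
    # One-time offline shuffle, so that reading buckets in order later
    # behaves like drawing uniform samples without random seeks.
    rows = list(rows)
    random.shuffle(rows)
    return [rows[i:i + bucket_size] for i in range(0, len(rows), bucket_size)]
```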

### Sampling From Group-By

| Employee | City | Salary |
| --- | --- | --- |
| Alice | NYC | $120k |
| Bob | NYC | $110k |
| Carol | NYC | $115k |
| Dave | Syracuse | $80k |

    SELECT City, AVG(Salary) FROM NYS_Salaries GROUP BY City;


Problem: Most data is about NYC. With $N$ samples taken uniformly, margins of error for other cities are much bigger

### Stratified Sampling

Generate $\frac{N}{\texttt{COUNT}(\texttt{DISTINCT } City)}$ samples for each group

Use $\texttt{COUNT}(\texttt{DISTINCT } City)$ instead of $\texttt{COUNT}(*)$ as the total group size
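A minimal sketch of stratified sampling over the salary example (helper names are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(rows, group_of, n):
    # Partition the data by group, then draw the same number of samples
    # per group, so small groups (Syracuse) get as many samples as
    # large ones (NYC).
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)
    per_group = n // len(groups)  # N / COUNT(DISTINCT City)
    return {g: random.choices(members, k=per_group)
            for g, members in groups.items()}
```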

### Index Striding

"Online Aggregation" (Hellerstein et. al.)

Idea 2: Pre-generate sample buckets across a range of different strata (e.g., BlinkDB).

### Sampling from Joins

Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$

It takes $O(\sqrt{N})$ samples from both $R$ and $S$ to get even one match: $k$ samples from each side produce only about $\frac{k^2}{N}$ joining pairs in expectation.

### Weighted Joins

• Exploit Foreign Keys: Sample from the referencing table, then join with the full referenced tables.
• Stratified Sampling: For many-many joins, stratify on the join attribute(s).

### "Join Synopses"

$R \bowtie_B S \bowtie_C T$

• $R.B$ is a foreign key reference to $S.B$
• $S.C$ is a foreign key reference to $T.C$

Sample from $R$; use all of $S$ and $T$.
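A sketch of sampling with a join synopsis, assuming prebuilt unique-key indexes `S_by_B` and `T_by_C` (names invented here):

```python
import random

def join_synopsis_sample(R, S_by_B, T_by_C, k):
    # Sample only from the referencing table R; by the foreign keys,
    # each r matches exactly one s, and each s exactly one t, so the
    # joined sample is an unbiased sample of R ⋈ S ⋈ T.
    sample = []
    for r in random.choices(R, k=k):
        s = S_by_B[r["B"]]
        t = T_by_C[s["C"]]
        sample.append((r, s, t))
    return sample
```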

### "Join Synopses"

• Foreign keys keep intermediate state small.
• Fast evaluation with INLJ or In-Mem Hash Join.
• Sampling is not biased.
"Join synopses for approximate query answering" (Acharya et. al.)

### Stratified Sampling

For each tuple sampled from $R$, sample exactly one joining tuple from $S$

Question: Are we biasing the sampling process?

Goal: Sample $r \bowtie s$ with probability $p((r\bowtie s) \in R\bowtie S)$

Actual: Sample $r \bowtie s$ with probability $p(r \in R) p((r\bowtie s) \in R\bowtie S | r \in R)$

### Bayes' Theorem

$p(r \in R) \cdot p((r\bowtie s) \in R\bowtie S \mid r \in R)$ $= p(r \in R, (r\bowtie s) \in R\bowtie S)$

$\neq p((r\bowtie s) \in R\bowtie S)$

### Corrective Factor

$p(r \in R, (r\bowtie s) \in R\bowtie S) \cdot n = p((r\bowtie s) \in R\bowtie S)$

$n = \frac{ p((r\bowtie s) \in R\bowtie S) }{ p(r \in R, (r\bowtie s) \in R\bowtie S) }$

$= \frac{ 1 }{ p(r \in R \mid (r\bowtie s) \in R\bowtie S) } \approx |R| \cdot |\{\;s\;|\;s \in S, r.A = s.A\;\}|$

The probability that a tuple in $R$ participates in a join.
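A sketch of the corrected estimator for a SUM over a many-many join, assuming a prebuilt index `S_by_A` from join values to matching $S$ tuples (names and `value_of` are invented here):

```python
import random

def weighted_join_sum(R, S_by_A, value_of, num_samples):
    # Each pair (r, s) is sampled with probability
    # 1 / (|R| * fanout(r)), so re-weighting each sample by
    # |R| * fanout(r) makes the SUM estimate unbiased.
    total = 0.0
    for _ in range(num_samples):
        r = random.choice(R)
        matches = S_by_A.get(r["A"], [])
        if not matches:
            continue  # r joins with nothing; contributes 0
        s = random.choice(matches)
        total += value_of(r, s) * len(R) * len(matches)
    return total / num_samples
```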

Should I sample from $R$ or $S$ first?

(What if only one tuple from $R$ joins with every tuple from $S$?)

### "Wander Join"

Sample the join by taking random walks over the join graph, choosing among join orders at random.

"Wander Join: Online Aggregation via Random Walks" (Li et al.)

### Convergent Joins

• Ripple Join: Incrementally increase the sample size.
• Turbo DBO Join

Idea 1: Start small, keep making the sample size bigger

### Ripple Join

$$Q = R \bowtie S$$
1. $Q \leftarrow \emptyset$, $R_{sample} \leftarrow \emptyset$, and $S_{sample} \leftarrow \emptyset$
2. Sample $r \in R$
    1. $Q \leftarrow Q \uplus (r \bowtie S_{sample})$
    2. $R_{sample} \leftarrow R_{sample} \uplus \{r\}$
3. Sample $s \in S$
    1. $Q \leftarrow Q \uplus (s \bowtie R_{sample})$
    2. $S_{sample} \leftarrow S_{sample} \uplus \{s\}$
4. Goto 2
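A minimal in-memory sketch of the loop above (the `joins` predicate and sampling with replacement are simplifications):

```python
import random

def ripple_join(R, S, joins, rounds):
    # Alternate drawing one tuple per side; each new tuple is joined
    # against everything sampled so far from the other side.
    Q, R_sample, S_sample = [], [], []
    for _ in range(rounds):
        r = random.choice(R)
        Q += [(r, s) for s in S_sample if joins(r, s)]
        R_sample.append(r)
        s = random.choice(S)
        Q += [(r2, s) for r2 in R_sample if joins(r2, s)]
        S_sample.append(s)
    return Q
```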

### Ripple Join

Idea 2: Run a normal Block-Nested-Loop Join (BNLJ), but piggyback sampling off of the results.