CSE-4/562 Spring 2019

What is the best, correct technique for task X, when Y is true?


                        SELECT SUM(A) FROM R

Naively, you need to see all values of R.A

With $n$ tuples sampled uniformly with replacement

$\|AVG(samples) - AVG(real)\|$	The absolute error
$P(\|AVG(samples) - AVG(real)\| \geq \epsilon)$	Its probability of exceeding error threshold $\epsilon$
$P(\|AVG(samples) - AVG(real)\| \geq \epsilon) \leq 2e^{-\frac{2n\epsilon^2}{(max(real) - min(real))^2}}$	... is below a threshold based on $\epsilon$, $n$, and the min/max value.

"Hoeffding's Bound"

See also "Chernoff's Bound" (similar) and "Serfling's Bound" (works without replacement).
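
To make the bound concrete, it can be solved for $n$: requiring $2e^{-\frac{2n\epsilon^2}{(max(real) - min(real))^2}} \leq \delta$ gives $n \geq \frac{(max(real) - min(real))^2 \ln(2/\delta)}{2\epsilon^2}$. The sketch below is only an illustration of that arithmetic. The function names, the failure probability `delta`, and the `value_range` parameter (standing in for max(real) - min(real)) are assumptions made for this example and do not come from the slides.

```python
import math


def hoeffding_failure_bound(n, epsilon, value_range):
    """Hoeffding upper bound on P(|AVG(samples) - AVG(real)| >= epsilon).

    value_range stands in for max(real) - min(real) (hypothetical name).
    """
    bound = 2.0 * math.exp(-2.0 * n * epsilon ** 2 / value_range ** 2)
    return min(1.0, bound)  # a probability bound above 1 carries no information


def hoeffding_sample_size(epsilon, delta, value_range):
    """Smallest n for which the Hoeffding bound is at most delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Values spanning a range of 100, error threshold 5, failure probability 5%.
    n = hoeffding_sample_size(epsilon=5, delta=0.05, value_range=100)
    print(n, hoeffding_failure_bound(n, 5, 100))
```

Note that the required $n$ depends on the range of the values and on $\epsilon$, but not on the number of tuples in R.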

Idea 1: Pick Randomly!


      samples = []
      for i from 1 to num_samples:
        sample_id = random(0, num_records)    # uniform row id; upper bound assumed exclusive
        samples += [ table.where( rowid = sample_id ) ]
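
The sampling loop above could be sketched in Python roughly as follows. This is a minimal illustration, not the course's implementation: it assumes the column R.A fits in an in-memory list (`records`), that list indexes play the role of row ids, and the helper names `sample_with_replacement` and `estimate_sum` are made up for this example. SUM(A) is estimated by scaling the sample average by the table size.

```python
import random


def sample_with_replacement(records, num_samples, rng=random):
    """Draw num_samples values uniformly at random, with replacement."""
    samples = []
    for _ in range(num_samples):
        sample_id = rng.randrange(len(records))  # uniform in [0, len(records))
        samples.append(records[sample_id])
    return samples


def estimate_sum(records, num_samples, rng=random):
    """Estimate SUM(A) as (sample average) * (number of records)."""
    samples = sample_with_replacement(records, num_samples, rng)
    sample_avg = sum(samples) / len(samples)
    return sample_avg * len(records)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs agree
    column_a = [rng.uniform(0, 100) for _ in range(100_000)]
    print("exact SUM(A):    ", sum(column_a))
    print("estimated SUM(A):", estimate_sum(column_a, 1_000, rng))
```

Because the estimate is the sample average multiplied by the number of records, an absolute error of $\epsilon$ on the average becomes an error of up to $\epsilon$ times the number of records on the sum.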


      SELECT City, AVG(Salary) FROM NYS_Salaries GROUP BY City;