March 9, 2021
Idea 1: Run each plan. (Exact, but we'd pay the full cost of every candidate plan just to pick one.)
If we can't get the exact cost of a plan, what can we do?
Idea 2: Run each plan on a small sample of the data.
Idea 3: Analytically estimate the cost of a plan.
Figure out the IO cost of the entire* subtree.
Only count the amount of memory added by each operator.
* Different from earlier in the semester.
Operation | RA | Total IOs (#pages) | Memory (#tuples) |
---|---|---|---|
Table Scan | R | |R| / P | O(1) |
Projection | π(R) | io(R) | O(1) |
Selection | σ(R) | io(R) | O(1) |
Union | R ⊎ S | io(R) + io(S) | O(1) |
Sort (In-Mem) | τ(R) | io(R) | O(|R|) |
Sort (On-Disk) | τ(R) | 2 ⋅ ⌊log_B(|R|)⌋ ⋅ |R| / P + io(R) | O(B) |
(B+Tree) Index Scan | Index(R, c) | log_I(|R|) + |σ_c(R)| / P | O(1) |
(Hash) Index Scan | Index(R, c) | 1 | O(1) |
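As a concrete illustration, here's a minimal sketch (not from the lecture) of how the unary-operator formulas above translate into code. The class names and the io() interface are made up; the formulas are the ones in the table.

```python
import math

class TableScan:
    def __init__(self, num_tuples, P):
        self.num_tuples = num_tuples   # |R|
        self.P = P                     # tuples per page

    def io(self):
        return math.ceil(self.num_tuples / self.P)   # |R| / P

class Selection:
    def __init__(self, child):
        self.child = child

    def io(self):
        return self.child.io()   # io(R): pipelined, adds no IOs of its own

class OnDiskSort:
    def __init__(self, child, num_tuples, P, B):
        self.child = child
        self.num_tuples = num_tuples   # |R|
        self.P = P                     # tuples per page
        self.B = B                     # pages of buffer

    def io(self):
        # floor(log_B(|R|)) merge passes, each reading and writing
        # |R| / P pages, plus the cost of producing the input.
        passes = math.floor(math.log(self.num_tuples, self.B))
        return 2 * passes * math.ceil(self.num_tuples / self.P) + self.child.io()

# e.g., sort a 1M-tuple table with 100 tuples/page and a 10-page buffer
plan = OnDiskSort(TableScan(1_000_000, P=100), 1_000_000, P=100, B=10)
print(plan.io())
```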
Operation | RA | Total IOs (#pages) | Mem (#tuples) |
---|---|---|---|
Nested Loop Join (Buffer S in mem) | R ×_{mem} S | io(R) + io(S) | O(|S|) |
Block NLJ (Buffer S on disk) | R ×_{disk} S | (|R| / B) ⋅ (|S| / P) + io(R) + io(S) | O(1) |
Block NLJ (Recompute S) | R ×_{redo} S | io(R) + (|R| / B) ⋅ io(S) | O(1) |
1-Pass Hash Join | R ⋈_{1PH,c} S | io(R) + io(S) | O(|S|) |
2-Pass Hash Join | R ⋈_{2PH,c} S | (2|R| + 2|S|) / P + io(R) + io(S) | O(1) |
Sort-Merge Join | R ⋈_{SM,c} S | [Sort] | [Sort] |
(Tree) Index NLJ | R ⋈_{INL,c} S | |R| ⋅ (log_I(|S|) + |σ_c(S)| / P) | O(1) |
(Hash) Index NLJ | R ⋈_{INL,c} S | |R| ⋅ 1 | O(1) |
(In-Mem) Aggregate | γ_A(R) | io(R) | O(adom(A)) |
(Sort/Merge) Aggregate | γ_A(R) | [Sort] | [Sort] |
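The same idea extends to joins. Below is a sketch (function names are illustrative) that uses three of the formulas above to pick the cheapest join for given |R|, |S|, P, and B:

```python
import math

def pages(tuples, P):
    return math.ceil(tuples / P)          # |R| / P

def one_pass_hash_io(R, S, P):
    return pages(R, P) + pages(S, P)      # io(R) + io(S); needs O(|S|) memory

def two_pass_hash_io(R, S, P):
    # (2|R| + 2|S|) / P partitioning IOs, plus io(R) + io(S)
    return pages(2 * R + 2 * S, P) + pages(R, P) + pages(S, P)

def block_nlj_io(R, S, P, B):
    # (|R| / B) * (|S| / P) + io(R) + io(S), per the table above
    return math.ceil(R / B) * pages(S, P) + pages(R, P) + pages(S, P)

def cheapest_join(R, S, P, B):
    if S <= B * P:                        # S fits in the buffer
        return ("1-pass hash", one_pass_hash_io(R, S, P))
    options = {"2-pass hash": two_pass_hash_io(R, S, P),
               "block NLJ":   block_nlj_io(R, S, P, B)}
    return min(options.items(), key=lambda kv: kv[1])

print(cheapest_join(R=1_000_000, S=50_000, P=100, B=100))
```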
Symbol | Parameter | Type |
---|---|---|
P | Tuples per Page | Fixed (|page| / |tuple|) |
|R| | Size of R | Precomputed* (|R|) |
B | Pages of Buffer | Configurable parameter |
I | Keys per Index Page | Fixed (|page| / |key + pointer|) |
adom(A) | Number of distinct values of A | Precomputed* (|δ_A(R)|) |
* unless R is a query
Estimating IOs requires estimating |Q(R)| and |δ_A(Q(R))|.
Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.
Operator | RA | Estimated Size |
---|---|---|
Table | R | |R| |
Projection | π(Q) | |Q| |
Union | Q1 ⊎ Q2 | |Q1| + |Q2| |
Cross Product | Q1 × Q2 | |Q1| × |Q2| |
Sort | τ(Q) | |Q| |
Limit | LIMIT_N(Q) | N |
Selection | σ_c(Q) | |Q| × SEL(c, Q) |
Join | Q1 ⋈_c Q2 | |Q1| × |Q2| × SEL(c, Q1 × Q2) |
Distinct | δ_A(Q) | UNIQ(A, Q) |
Aggregate | γ_{A, B←Σ}(Q) | UNIQ(A, Q) |
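Since each row of this table depends only on the operator's inputs, the whole table reads as one recursive function. A sketch, with a made-up plan encoding, assuming SEL and UNIQ estimators like the ones developed below:

```python
# Plans are nested tuples, e.g. ("select", ("table", "R"), cond).
# sizes maps a table name to its precomputed row count; sel and uniq
# are assumed estimator callbacks (sketched later in these notes).

def card(plan, sizes, sel, uniq):
    op = plan[0]
    if op == "table":                     # |R|, precomputed
        return sizes[plan[1]]
    if op in ("project", "sort"):         # π(Q), τ(Q): size unchanged
        return card(plan[1], sizes, sel, uniq)
    if op == "union":                     # |Q1| + |Q2|
        return card(plan[1], sizes, sel, uniq) + card(plan[2], sizes, sel, uniq)
    if op == "cross":                     # |Q1| * |Q2|
        return card(plan[1], sizes, sel, uniq) * card(plan[2], sizes, sel, uniq)
    if op == "limit":                     # N
        return plan[2]
    if op == "select":                    # |Q| * SEL(c, Q)
        return card(plan[1], sizes, sel, uniq) * sel(plan[2], plan[1])
    if op == "join":                      # |Q1| * |Q2| * SEL(c, Q1 x Q2)
        cross = ("cross", plan[1], plan[2])
        return card(cross, sizes, sel, uniq) * sel(plan[3], cross)
    if op in ("distinct", "aggregate"):   # UNIQ(A, Q)
        return uniq(plan[2], plan[1])
    raise ValueError(f"unknown operator: {op}")
```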
Idea 1: Assume each selection filters down to 10% of the data.
No... really!
The 10% rule gives two equivalent queries different estimates:
|σ_{c1}(σ_{c2}(R))| ≠ |σ_{c1∧c2}(R)| (1% of |R| vs. 10% of |R|)
...and gives two very different predicates the same estimate:
|σ_{id=1}(STUDENTS)| = |σ_{residence='NY'}(STUDENTS)|
... but remember that all we need is to rank plans.
Many major databases (Oracle, Postgres, Teradata, etc.) use something like the 10% rule if they have nothing better. The specific % varies by DBMS: Teradata, for example, uses 10% for the first AND clause, cut by another 75% for every subsequent clause.
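To see the inconsistency concretely, here's the arithmetic under the plain 10% rule (the table size is made up):

```python
R = 10_000          # |R|, a made-up table size
TEN_PCT = 0.10      # the 10% rule

# Two equivalent queries, two different estimates:
cascaded    = R * TEN_PCT * TEN_PCT   # |σ_c1(σ_c2(R))|  -> 100
conjunctive = R * TEN_PCT             # |σ_c1∧c2(R)|     -> 1000

# ...and two very different predicates, one estimate:
point_lookup = R * TEN_PCT            # σ_id=1            -> 1000
broad_filter = R * TEN_PCT            # σ_residence='NY'  -> 1000
```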
We assume that for σ_c(Q) or δ_A(Q):
1. The following statistics are available:
   - COUNT(*)
   - COUNT(DISTINCT A) (for each attribute A)
   - MIN(A), MAX(A) (for each numeric attribute A)
2. Attribute values are uniformly distributed.
3. Conditions are independent of one another.
If the necessary statistics aren't available (point 1), fall back to the 10% rule.
If the statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely get a better estimate than the 10% rule.
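A sketch of what the point-1 statistics might look like as a data structure (field names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ColumnStats:
    count_distinct: int                # COUNT(DISTINCT A)
    min_val: Optional[float] = None    # MIN(A)  (numeric attributes only)
    max_val: Optional[float] = None    # MAX(A)  (numeric attributes only)

@dataclass
class TableStats:
    count: int                         # COUNT(*)
    columns: Dict[str, ColumnStats]    # attribute name -> its stats
```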
- UNIQ(A, π_{A,…}(R)) = UNIQ(A, R)
- UNIQ(A, σ(R)) ≈ UNIQ(A, R)
- UNIQ(A, R × S) = UNIQ(A, R) or UNIQ(A, S) (whichever side A comes from)
- max(UNIQ(A, R), UNIQ(A, S)) ≤ UNIQ(A, R ⊎ S) ≤ UNIQ(A, R) + UNIQ(A, S)
- min_A(π_{A,…}(R)) = min_A(R)
- min_A(σ(R)) ≈ min_A(R)
- min_A(R × S) = min_A(R) or min_A(S) (whichever side A comes from)
- min_A(R ⊎ S) = min(min_A(R), min_A(S))
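These rules translate directly into a recursive pass over the plan. A sketch, reusing the made-up plan encoding and TableStats from the earlier sketches:

```python
def uniq(A, plan, base):                  # base: table name -> TableStats
    op = plan[0]
    if op == "table":
        return base[plan[1]].columns[A].count_distinct
    if op in ("project", "select", "sort", "distinct"):
        # exact for projection; only approximate for selection
        return uniq(A, plan[1], base)
    if op == "cross":
        # A comes from exactly one side; use whichever side defines it
        for side in (plan[1], plan[2]):
            try:
                return uniq(A, side, base)
            except KeyError:
                continue
        raise KeyError(A)
    if op == "union":
        # the truth lies between max(...) and the sum; take the upper bound
        return uniq(A, plan[1], base) + uniq(A, plan[2], base)
    raise ValueError(f"unsupported operator: {op}")
```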
Estimating δ_A(Q) requires only COUNT(DISTINCT A).
Selectivity is a probability (SEL(c,Q)=P(c))
Condition | Selectivity |
---|---|
P(A = x1) | 1 / COUNT(DISTINCT A) |
P(A ∈ (x1, x2, …, xN)) | N / COUNT(DISTINCT A) |
P(A ≤ x1) | (x1 − MIN(A)) / (MAX(A) − MIN(A)) |
P(x1 ≤ A ≤ x2) | (x2 − x1) / (MAX(A) − MIN(A)) |
P(A = B) | min(1 / COUNT(DISTINCT A), 1 / COUNT(DISTINCT B)) |
P(c1 ∧ c2) | P(c1) ⋅ P(c2) |
P(c1 ∨ c2) | 1 − (1 − P(c1)) ⋅ (1 − P(c2)) |
(With constants x1, x2, …)
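Read as code, this table becomes one recursive function over the condition. A sketch with a made-up condition encoding, using the ColumnStats sketched earlier:

```python
def sel(cond, stats):
    """Selectivity of a condition; stats: attribute name -> ColumnStats."""
    kind = cond[0]
    if kind == "=":            # A = x1
        return 1.0 / stats[cond[1]].count_distinct
    if kind == "in":           # A ∈ (x1, ..., xN)
        return len(cond[2]) / stats[cond[1]].count_distinct
    if kind == "<=":           # A <= x1
        s = stats[cond[1]]
        return (cond[2] - s.min_val) / (s.max_val - s.min_val)
    if kind == "between":      # x1 <= A <= x2
        s = stats[cond[1]]
        return (cond[3] - cond[2]) / (s.max_val - s.min_val)
    if kind == "attr=":        # A = B
        return min(1.0 / stats[cond[1]].count_distinct,
                   1.0 / stats[cond[2]].count_distinct)
    if kind == "and":          # independence assumption: P(c1) * P(c2)
        return sel(cond[1], stats) * sel(cond[2], stats)
    if kind == "or":           # 1 - (1 - P(c1)) * (1 - P(c2))
        return 1.0 - (1.0 - sel(cond[1], stats)) * (1.0 - sel(cond[2], stats))
    raise ValueError(f"unsupported condition: {kind}")
```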
This simple model isn't perfect... but it handles most usage patterns.