March 9, 2021
Idea 1: Run each plan. (Exact, but we'd pay the full cost of every candidate plan just to pick one.)
If we can't get the exact cost of a plan, what can we do?
Idea 2: Run each plan on a small sample of the data.
Idea 3: Analytically estimate the cost of a plan.
Figure out the IO cost of the entire* subtree.
Only count the amount of memory added by each operator.
* Different from earlier in the semester.
Operation | RA | Total IOs (#pages) | Memory (#tuples) |
---|---|---|---|
Table Scan | R | |R| / P | O(1) |
Projection | π(R) | io(R) | O(1) |
Selection | σ(R) | io(R) | O(1) |
Union | R ⊎ S | io(R) + io(S) | O(1) |
Sort (In-Mem) | τ(R) | io(R) | O(|R|) |
Sort (On-Disk) | τ(R) | 2 ⋅ ⌊log_B(|R|)⌋ ⋅ |R| / P + io(R) | O(B) |
(B+Tree) Index Scan | Index(R, c) | log_I(|R|) + |σ_c(R)| / P | O(1) |
(Hash) Index Scan | Index(R, c) | 1 | O(1) |
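As a concrete illustration, here's a minimal sketch (not from the lecture) of how the unary-operator formulas above translate into code. The class names and the io() interface are made up; the formulas are the ones in the table.

```python
import math

class TableScan:
    def __init__(self, num_tuples, P):
        self.num_tuples = num_tuples   # |R|
        self.P = P                     # tuples per page

    def io(self):
        return math.ceil(self.num_tuples / self.P)   # |R| / P

class Selection:
    def __init__(self, child):
        self.child = child

    def io(self):
        return self.child.io()   # io(R): pipelined, adds no IOs of its own

class OnDiskSort:
    def __init__(self, child, num_tuples, P, B):
        self.child = child
        self.num_tuples = num_tuples   # |R|
        self.P = P                     # tuples per page
        self.B = B                     # pages of buffer

    def io(self):
        # floor(log_B(|R|)) merge passes, each reading and writing
        # |R| / P pages, plus the cost of producing the input.
        passes = math.floor(math.log(self.num_tuples, self.B))
        return 2 * passes * math.ceil(self.num_tuples / self.P) + self.child.io()

# e.g., sort a 1M-tuple table with 100 tuples/page and a 10-page buffer
plan = OnDiskSort(TableScan(1_000_000, P=100), 1_000_000, P=100, B=10)
print(plan.io())
```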
Operation | RA | Total IOs (#pages) | Mem (#tuples) |
---|---|---|---|
Nested Loop Join (Buffer S in mem) | R ×_{mem} S | io(R) + io(S) | O(|S|) |
Block NLJ (Buffer S on disk) | R ×_{disk} S | (|R| / B) ⋅ (|S| / P) + io(R) + io(S) | O(1) |
Block NLJ (Recompute S) | R ×_{redo} S | io(R) + (|R| / B) ⋅ io(S) | O(1) |
1-Pass Hash Join | R ⋈_{1PH,c} S | io(R) + io(S) | O(|S|) |
2-Pass Hash Join | R ⋈_{2PH,c} S | (2|R| + 2|S|) / P + io(R) + io(S) | O(1) |
Sort-Merge Join | R ⋈_{SM,c} S | [Sort] | [Sort] |
(Tree) Index NLJ | R ⋈_{INL,c} S | |R| ⋅ (log_I(|S|) + |σ_c(S)| / P) | O(1) |
(Hash) Index NLJ | R ⋈_{INL,c} S | |R| ⋅ 1 | O(1) |
(In-Mem) Aggregate | γ_A(R) | io(R) | O(adom(A)) |
(Sort/Merge) Aggregate | γ_A(R) | [Sort] | [Sort] |
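The same idea extends to joins. Below is a sketch (function names are illustrative) that uses three of the formulas above to pick the cheapest join for given |R|, |S|, P, and B:

```python
import math

def pages(tuples, P):
    return math.ceil(tuples / P)          # |R| / P

def one_pass_hash_io(R, S, P):
    return pages(R, P) + pages(S, P)      # io(R) + io(S); needs O(|S|) memory

def two_pass_hash_io(R, S, P):
    # (2|R| + 2|S|) / P partitioning IOs, plus io(R) + io(S)
    return pages(2 * R + 2 * S, P) + pages(R, P) + pages(S, P)

def block_nlj_io(R, S, P, B):
    # (|R| / B) * (|S| / P) + io(R) + io(S), per the table above
    return math.ceil(R / B) * pages(S, P) + pages(R, P) + pages(S, P)

def cheapest_join(R, S, P, B):
    if S <= B * P:                        # S fits in the buffer
        return ("1-pass hash", one_pass_hash_io(R, S, P))
    options = {"2-pass hash": two_pass_hash_io(R, S, P),
               "block NLJ":   block_nlj_io(R, S, P, B)}
    return min(options.items(), key=lambda kv: kv[1])

print(cheapest_join(R=1_000_000, S=50_000, P=100, B=100))
```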
Symbol | Parameter | Type |
---|---|---|
P | Tuples per Page | Fixed (|page| / |tuple|) |
|R| | Size of R | Precomputed* (|R|) |
B | Pages of Buffer | Configurable parameter |
I | Keys per Index Page | Fixed (|page| / |key + pointer|) |
adom(A) | Number of distinct values of A | Precomputed* (|δ_A(R)|) |
* unless R is a query
Estimating IOs requires estimating |Q(R)| and |δ_A(Q(R))|.
Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.
Operator | RA | Estimated Size |
---|---|---|
Table | R | |R| |
Projection | π(Q) | |Q| |
Union | Q1 ⊎ Q2 | |Q1| + |Q2| |
Cross Product | Q1 × Q2 | |Q1| × |Q2| |
Sort | τ(Q) | |Q| |
Limit | LIMIT_N(Q) | N |
Selection | σ_c(Q) | |Q| × SEL(c, Q) |
Join | Q1 ⋈_c Q2 | |Q1| × |Q2| × SEL(c, Q1 × Q2) |
Distinct | δ_A(Q) | UNIQ(A, Q) |
Aggregate | γ_{A, B←Σ}(Q) | UNIQ(A, Q) |
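Since each row of this table depends only on the operator's inputs, the whole table reads as one recursive function. A sketch, with a made-up plan encoding, assuming SEL and UNIQ estimators like the ones developed below:

```python
# Plans are nested tuples, e.g. ("select", ("table", "R"), cond).
# sizes maps a table name to its precomputed row count; sel and uniq
# are assumed estimator callbacks (sketched later in these notes).

def card(plan, sizes, sel, uniq):
    op = plan[0]
    if op == "table":                     # |R|, precomputed
        return sizes[plan[1]]
    if op in ("project", "sort"):         # π(Q), τ(Q): size unchanged
        return card(plan[1], sizes, sel, uniq)
    if op == "union":                     # |Q1| + |Q2|
        return card(plan[1], sizes, sel, uniq) + card(plan[2], sizes, sel, uniq)
    if op == "cross":                     # |Q1| * |Q2|
        return card(plan[1], sizes, sel, uniq) * card(plan[2], sizes, sel, uniq)
    if op == "limit":                     # N
        return plan[2]
    if op == "select":                    # |Q| * SEL(c, Q)
        return card(plan[1], sizes, sel, uniq) * sel(plan[2], plan[1])
    if op == "join":                      # |Q1| * |Q2| * SEL(c, Q1 x Q2)
        cross = ("cross", plan[1], plan[2])
        return card(cross, sizes, sel, uniq) * sel(plan[3], cross)
    if op in ("distinct", "aggregate"):   # UNIQ(A, Q)
        return uniq(plan[2], plan[1])
    raise ValueError(f"unknown operator: {op}")
```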
Idea 1: Assume each selection filters down to 10% of the data.
No... really!
The 10% rule gives two equivalent queries different estimates:
|σ_{c1}(σ_{c2}(R))| ≠ |σ_{c1∧c2}(R)| (1% of |R| vs. 10% of |R|)
...and gives two very different predicates the same estimate:
|σ_{id=1}(STUDENTS)| = |σ_{residence='NY'}(STUDENTS)|
... but remember that all we need is to rank plans.
Many major databases (Oracle, Postgres, Teradata, etc.) use something like the 10% rule if they have nothing better. The specific % varies by DBMS: Teradata, for example, uses 10% for the first AND clause, cut by another 75% for every subsequent clause.
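To see the inconsistency concretely, here's the arithmetic under the plain 10% rule (the table size is made up):

```python
R = 10_000          # |R|, a made-up table size
TEN_PCT = 0.10      # the 10% rule

# Two equivalent queries, two different estimates:
cascaded    = R * TEN_PCT * TEN_PCT   # |σ_c1(σ_c2(R))|  -> 100
conjunctive = R * TEN_PCT             # |σ_c1∧c2(R)|     -> 1000

# ...and two very different predicates, one estimate:
point_lookup = R * TEN_PCT            # σ_id=1            -> 1000
broad_filter = R * TEN_PCT            # σ_residence='NY'  -> 1000
```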
We assume that for σ_c(Q) or δ_A(Q):
1. The following statistics are available:
   - COUNT(*)
   - COUNT(DISTINCT A) (for each attribute A)
   - MIN(A), MAX(A) (for each numeric attribute A)
2. Attribute values are uniformly distributed.
3. Conditions are independent of one another.
If the necessary statistics aren't available (point 1), fall back to the 10% rule.
If the statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely get a better estimate than the 10% rule.
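A sketch of what the point-1 statistics might look like as a data structure (field names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ColumnStats:
    count_distinct: int                # COUNT(DISTINCT A)
    min_val: Optional[float] = None    # MIN(A)  (numeric attributes only)
    max_val: Optional[float] = None    # MAX(A)  (numeric attributes only)

@dataclass
class TableStats:
    count: int                         # COUNT(*)
    columns: Dict[str, ColumnStats]    # attribute name -> its stats
```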
- UNIQ(A, π_{A,…}(R)) = UNIQ(A, R)
- UNIQ(A, σ(R)) ≈ UNIQ(A, R)
- UNIQ(A, R × S) = UNIQ(A, R) or UNIQ(A, S) (whichever side A comes from)
- max(UNIQ(A, R), UNIQ(A, S)) ≤ UNIQ(A, R ⊎ S) ≤ UNIQ(A, R) + UNIQ(A, S)
- min_A(π_{A,…}(R)) = min_A(R)
- min_A(σ(R)) ≈ min_A(R)
- min_A(R × S) = min_A(R) or min_A(S) (whichever side A comes from)
- min_A(R ⊎ S) = min(min_A(R), min_A(S))
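These rules translate directly into a recursive pass over the plan. A sketch, reusing the made-up plan encoding and TableStats from the earlier sketches:

```python
def uniq(A, plan, base):                  # base: table name -> TableStats
    op = plan[0]
    if op == "table":
        return base[plan[1]].columns[A].count_distinct
    if op in ("project", "select", "sort", "distinct"):
        # exact for projection; only approximate for selection
        return uniq(A, plan[1], base)
    if op == "cross":
        # A comes from exactly one side; use whichever side defines it
        for side in (plan[1], plan[2]):
            try:
                return uniq(A, side, base)
            except KeyError:
                continue
        raise KeyError(A)
    if op == "union":
        # the truth lies between max(...) and the sum; take the upper bound
        return uniq(A, plan[1], base) + uniq(A, plan[2], base)
    raise ValueError(f"unsupported operator: {op}")
```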
Estimating δ_A(Q) requires only COUNT(DISTINCT A).
Selectivity is a probability (SEL(c,Q)=P(c))
Condition | Selectivity |
---|---|
P(A = x1) | 1 / COUNT(DISTINCT A) |
P(A ∈ (x1, x2, …, xN)) | N / COUNT(DISTINCT A) |
P(A ≤ x1) | (x1 − MIN(A)) / (MAX(A) − MIN(A)) |
P(x1 ≤ A ≤ x2) | (x2 − x1) / (MAX(A) − MIN(A)) |
P(A = B) | min(1 / COUNT(DISTINCT A), 1 / COUNT(DISTINCT B)) |
P(c1 ∧ c2) | P(c1) ⋅ P(c2) |
P(c1 ∨ c2) | 1 − (1 − P(c1)) ⋅ (1 − P(c2)) |
(With constants x1, x2, …)
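Read as code, this table becomes one recursive function over the condition. A sketch with a made-up condition encoding, using the ColumnStats sketched earlier:

```python
def sel(cond, stats):
    """Selectivity of a condition; stats: attribute name -> ColumnStats."""
    kind = cond[0]
    if kind == "=":            # A = x1
        return 1.0 / stats[cond[1]].count_distinct
    if kind == "in":           # A ∈ (x1, ..., xN)
        return len(cond[2]) / stats[cond[1]].count_distinct
    if kind == "<=":           # A <= x1
        s = stats[cond[1]]
        return (cond[2] - s.min_val) / (s.max_val - s.min_val)
    if kind == "between":      # x1 <= A <= x2
        s = stats[cond[1]]
        return (cond[3] - cond[2]) / (s.max_val - s.min_val)
    if kind == "attr=":        # A = B
        return min(1.0 / stats[cond[1]].count_distinct,
                   1.0 / stats[cond[2]].count_distinct)
    if kind == "and":          # independence assumption: P(c1) * P(c2)
        return sel(cond[1], stats) * sel(cond[2], stats)
    if kind == "or":           # 1 - (1 - P(c1)) * (1 - P(c2))
        return 1.0 - (1.0 - sel(cond[1], stats)) * (1.0 - sel(cond[2], stats))
    raise ValueError(f"unsupported condition: {kind}")
```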
This simple model isn't perfect... but it handles most usage patterns.