April 10, 2019
- `CREATE TABLE`s issued
- Both phases in-memory
- `CREATE TABLE`s will include `PRIMARY KEY` and other suggested indexes

Not universal (every system is different), but one way to find hotspots is the JVM's hprof profiler:
$> java -Xrunhprof:cpu=samples dubstep.Main
or
$> java -Xrunhprof:cpu=time dubstep.Main
Sampling is faster; time is more accurate.
Output ends up in ./java.hprof.txt
...
... lots of gobbledygook ...
...
CPU SAMPLES BEGIN (total = 126) Fri Oct 22 12:12:14 2004
rank self accum count trace method
1 53.17% 53.17% 67 300027 java.util.zip.ZipFile.getEntry
2 17.46% 70.63% 22 300135 java.util.zip.ZipFile.getNextEntry
3 5.56% 76.19% 7 300111 java.lang.ClassLoader.defineClass2
4 3.97% 80.16% 5 300140 java.io.UnixFileSystem.list
5 2.38% 82.54% 3 300149 java.lang.Shutdown.halt0
6 1.59% 84.13% 2 300136 java.util.zip.ZipEntry.initFields
7 1.59% 85.71% 2 300138 java.lang.String.substring
8 1.59% 87.30% 2 300026 java.util.zip.ZipFile.open
9 0.79% 88.10% 1 300118 com.sun.tools.javac.code.Type$ErrorType.<init>
10 0.79% 88.89% 1 300134 java.util.zip.ZipFile.ensureOpen
The 'self' column shows how much time is spent in that method.
The 'count' column shows the number of samples/invocations.
The 'trace' column links back to a stack trace of the sample/invocation.
TRACE 300027:
java.util.zip.ZipFile.getEntry(ZipFile.java:Unknown line)
java.util.zip.ZipFile.getEntry(ZipFile.java:253)
java.util.jar.JarFile.getEntry(JarFile.java:197)
java.util.jar.JarFile.getJarEntry(JarFile.java:180)
Main Goal: Figure out where your code is spending its time, and have it spend less time there.
Idea: Minimize the number of "intermediate" tuples: the fewer tuples each operator produces, the less work every operator downstream of it has to do.
Estimating IOs requires estimating $\vert Q(R) \vert$ and $\vert \delta_A(Q(R)) \vert$.
| Operator | RA | Estimated Size |
|---|---|---|
| Table | $R$ | $\vert R \vert$ |
| Projection | $\pi(Q)$ | $\vert Q \vert$ |
| Union | $Q_1 \uplus Q_2$ | $\vert Q_1 \vert + \vert Q_2 \vert$ |
| Cross Product | $Q_1 \times Q_2$ | $\vert Q_1 \vert \times \vert Q_2 \vert$ |
| Sort | $\tau(Q)$ | $\vert Q \vert$ |
| Limit | $\texttt{LIMIT}_N(Q)$ | $N$ |
| Selection | $\sigma_c(Q)$ | $\vert Q \vert \times \texttt{SEL}(c, Q)$ |
| Join | $Q_1 \bowtie_c Q_2$ | $\vert Q_1 \vert \times \vert Q_2 \vert \times \texttt{SEL}(c, Q_1 \times Q_2)$ |
| Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$ |
| Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$ |
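To make the recursion concrete, here is a minimal Java sketch of how an optimizer might fold these rules over a plan tree. The `PlanNode` classes and the idea of attaching a precomputed selectivity to each node are assumptions for illustration, not any particular system's API:

```java
// A minimal sketch: each operator estimates its output size from its
// children, following the table above.  Selectivities are precomputed
// here for simplicity; a real optimizer derives them from statistics.
abstract class PlanNode {
    abstract double estimatedSize();
}

class TableNode extends PlanNode {                 // |R| from catalog stats
    final double rowCount;
    TableNode(double rowCount) { this.rowCount = rowCount; }
    double estimatedSize() { return rowCount; }
}

class SelectNode extends PlanNode {                // |Q| × SEL(c, Q)
    final PlanNode child;
    final double selectivity;
    SelectNode(PlanNode child, double selectivity) {
        this.child = child;
        this.selectivity = selectivity;
    }
    double estimatedSize() { return child.estimatedSize() * selectivity; }
}

class JoinNode extends PlanNode {                  // |Q1| × |Q2| × SEL(c, Q1 × Q2)
    final PlanNode left, right;
    final double selectivity;
    JoinNode(PlanNode left, PlanNode right, double selectivity) {
        this.left = left;
        this.right = right;
        this.selectivity = selectivity;
    }
    double estimatedSize() {
        return left.estimatedSize() * right.estimatedSize() * selectivity;
    }
}

public class EstimateDemo {
    public static void main(String[] args) {
        // σ_c1(R) ⋈_c2 S, with |R| = 1,000,000, |S| = 50,000,
        // SEL(c1, R) = 0.1 and SEL(c2, R × S) = 1/50,000:
        PlanNode plan = new JoinNode(
            new SelectNode(new TableNode(1_000_000), 0.1),
            new TableNode(50_000),
            1.0 / 50_000);
        System.out.println(plan.estimatedSize());  // 100000.0
    }
}
```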
Many major databases (Oracle, Postgres, Teradata, etc.) use something like a 10% rule if they have nothing better.
(The specific percentage varies by DBMS; e.g., Teradata uses 10% for the first `AND` clause, and 75% for every subsequent clause.)
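For example (illustrative numbers only): with $\vert R \vert = 10{,}000$ and a predicate $c_1 \wedge c_2$, a flat 10% per clause estimates $10{,}000 \times 0.1 \times 0.1 = 100$ tuples, while the Teradata-style rule estimates $10{,}000 \times 0.1 \times 0.75 = 750$.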
We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$, we have the following statistics available for $Q$:

- COUNT(*)
- COUNT(DISTINCT A) (for each attribute A)
- MIN(A), MAX(A) (for each numeric attribute A)

If the necessary statistics aren't available (point 1), fall back to the 10% rule.
If the statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely get a better estimate than the 10% rule.
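A sketch of the per-table statistics record this implies, with the 10% fallback wired in (the class and field names are hypothetical, chosen just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-table statistics record; names are illustrative.
class TableStats {
    long count;                                            // COUNT(*)
    Map<String, Long>   countDistinct = new HashMap<>();   // COUNT(DISTINCT A), per A
    Map<String, Double> min           = new HashMap<>();   // MIN(A), per numeric A
    Map<String, Double> max           = new HashMap<>();   // MAX(A), per numeric A

    // P(A = x1) = 1 / COUNT(DISTINCT A); fall back to the 10% rule
    // when no statistics are available for A.
    double equalitySelectivity(String attribute) {
        Long distinct = countDistinct.get(attribute);
        if (distinct == null || distinct == 0) { return 0.1; }
        return 1.0 / distinct;
    }
}
```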
- $\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$
- $\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$
- $\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$ (whichever input provides $A$)
- $\max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq \texttt{UNIQ}(A, R \uplus S) \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$
- $\min_A(\pi_{A, \ldots}(R)) = \min_A(R)$
- $\min_A(\sigma(R)) \approx \min_A(R)$
- $\min_A(R \times S) = \min_A(R)$ or $\min_A(S)$ (whichever input provides $A$)
- $\min_A(R \uplus S) = \min(\min_A(R), \min_A(S))$
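These rules translate almost directly into helper functions that propagate statistics bottom-up through a plan. A minimal sketch (the method names are made up; only union needs bounds rather than a single value):

```java
// Hypothetical helpers propagating UNIQ(A, ·) and MIN(A) bottom-up.
class StatPropagation {
    // UNIQ(A, π_{A,...}(R)) = UNIQ(A, R): projection keeps every value of A.
    static long uniqThroughProjection(long uniqInR) { return uniqInR; }

    // UNIQ(A, σ(R)) ≈ UNIQ(A, R): an approximation; selection may drop values.
    static long uniqThroughSelection(long uniqInR) { return uniqInR; }

    // UNIQ(A, R × S): A comes from exactly one input; pass its count through.
    static long uniqThroughCrossProduct(long uniqInSourceOfA) { return uniqInSourceOfA; }

    // Union only gives bounds:
    // max(UNIQ(A,R), UNIQ(A,S)) ≤ UNIQ(A, R ⊎ S) ≤ UNIQ(A,R) + UNIQ(A,S)
    static long uniqUnionLowerBound(long uniqR, long uniqS) { return Math.max(uniqR, uniqS); }
    static long uniqUnionUpperBound(long uniqR, long uniqS) { return uniqR + uniqS; }

    // min_A(R ⊎ S) = min(min_A(R), min_A(S)): exact for union.
    static double minThroughUnion(double minR, double minS) { return Math.min(minR, minS); }
}
```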
Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A).
Selectivity is a probability: $\texttt{SEL}(c, Q) = P(c)$.
| Condition | | Estimated Selectivity |
|---|---|---|
| $P(A = x_1)$ | $=$ | $\frac{1}{\texttt{COUNT(DISTINCT } A\texttt{)}}$ |
| $P(A \in (x_1, x_2, \ldots, x_N))$ | $=$ | $\frac{N}{\texttt{COUNT(DISTINCT } A\texttt{)}}$ |
| $P(A \leq x_1)$ | $=$ | $\frac{x_1 - \texttt{MIN}(A)}{\texttt{MAX}(A) - \texttt{MIN}(A)}$ |
| $P(x_1 \leq A \leq x_2)$ | $=$ | $\frac{x_2 - x_1}{\texttt{MAX}(A) - \texttt{MIN}(A)}$ |
| $P(A = B)$ | $=$ | $\min\left(\frac{1}{\texttt{COUNT(DISTINCT } A\texttt{)}}, \frac{1}{\texttt{COUNT(DISTINCT } B\texttt{)}}\right)$ |
| $P(c_1 \wedge c_2)$ | $=$ | $P(c_1) \cdot P(c_2)$ |
| $P(c_1 \vee c_2)$ | $=$ | $1 - (1 - P(c_1)) \cdot (1 - P(c_2))$ |

(with constants $x_1, x_2, \ldots$)
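A sketch turning these formulas into code, reusing the hypothetical `TableStats` record from above; `x1`, `x2`, and `n` stand in for the constants in the query:

```java
// Hypothetical selectivity estimators over numeric attribute A, following
// the formulas above.  Assumes TableStats (sketched earlier) holds the
// required COUNT(DISTINCT A) / MIN(A) / MAX(A) statistics.
class Selectivity {
    static double equals(TableStats s, String a) {                  // P(A = x1)
        return 1.0 / s.countDistinct.get(a);
    }
    static double inList(TableStats s, String a, int n) {           // P(A ∈ (x1..xN))
        return (double) n / s.countDistinct.get(a);
    }
    static double lessOrEqual(TableStats s, String a, double x1) {  // P(A ≤ x1)
        return (x1 - s.min.get(a)) / (s.max.get(a) - s.min.get(a));
    }
    static double between(TableStats s, String a, double x1, double x2) { // P(x1 ≤ A ≤ x2)
        return (x2 - x1) / (s.max.get(a) - s.min.get(a));
    }
    static double columnEquals(TableStats sa, String a, TableStats sb, String b) { // P(A = B)
        return Math.min(1.0 / sa.countDistinct.get(a), 1.0 / sb.countDistinct.get(b));
    }
    static double and(double p1, double p2) { return p1 * p2; }     // assumes independence
    static double or(double p1, double p2)  { return 1 - (1 - p1) * (1 - p2); }
}
```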