CSE-4/562 Spring 2021 - Exam Review

May 4, 2021

### Exam Format

You will need:

• A smartphone or second device with a camera and zoom installed.
• A quiet, well lit area.
• A view of your work surface, you, and your computer screen.

### Day-Of

• Remember to enable "Do-not-Disturb" mode on your devices.
• Everyone logged into Zoom will be given a pin to access their exam.
• Scan your exam via zoom session in a breakout room.

SELECT  [DISTINCT] targetlist
FROM    relationlist
WHERE   condition

1. Compute the $2^n$ combinations of tuples in all relations appearing in relationlist
2. Discard tuples that fail the condition
3. Delete attributes not in targetlist
4. If DISTINCT is specified, eliminate duplicate rows

This is the least efficient strategy to compute a query! A good optimizer will find more efficient strategies to compute the same answer.

## Record Layouts

### Record Layout 2: Delimiters

File
A collection of pages (or records)
Page
A fixed-size collection of records
Page size is usually dictated by hardware.
Mem Page $\approx$ 4K
td>bsp td>bsp td>nsp;Cache Line $\approx$ 64B
Record
One or more fields (for now)
Field
A primitive value (for now)

### Relational Algebra

Queries are applied to Relations $$Q(\textbf{Trees}, \textbf{SpeciesInfo})$$

Queries are also Relations! $$Q_2(\textbf{SpeciesInfo}, Q_1(\textbf{Trees}))$$ (Relational Algebra is Closed)

### Relational Algebra

OperationSymMeaning
Selection$\sigma$Select a subset of the input rows
Projection$\pi$Delete unwanted columns
Cross-product$\times$Combine two relations
Set-difference$-$Tuples in Rel 1, but not Rel 2
Union$\cup$Tuples either in Rel 1 or in Rel 2
Intersection$\cap$Tuples in both Rel 1 and Rel 2
Join$\bowtie$Pairs of tuples matching a specified condition
Division$/$"Inverse" of cross-product
Sort $\tau_A$Sort records by attribute(s) $A$
Limit$\texttt{LIMIT}_N$Return only the first $N$ records
(according to sort order if paired with sort).

### RA Equivs

We say that $Q_1 \equiv Q_2$ if and only if
we can guarantee that the bag of tuples produced by $Q_1(R, S, T, \ldots)$
is the same as the bag of tuples produced by $Q_2(R, S, T, \ldots)$
for any combination of valid inputs $R, S, T, \ldots$.

... that satisfy any necessary properties.

RuleNotes
$\sigma_{C_1\wedge C_2}(R) \equiv \sigma_{C_1}(\sigma_{C_2}(R))$
$\sigma_{C_1\vee C_2}(R) \equiv \sigma_{C_1}(R) \cup \sigma_{C_2}(R)$ Note, this is only true for set, not bag union
$\sigma_C(R \times S) \equiv R \bowtie_C S$
$\sigma_C(R \times S) \equiv \sigma_C(R) \times S$ If $C$ references only $R$'s attributes, also works for joins
$\pi_{A}(\pi_{A \cup B}(R)) \equiv \pi_{A}(R)$
$\sigma_C(\pi_{A}(R)) \equiv \pi_A(\sigma_C(R))$ If $A$ contains all of the attributes referenced by $C$
$\pi_{A\cup B}(R\times S) \equiv \pi_A(R) \times \pi_B(S)$ Where $A$ (resp., $B$) contains attributes in $R$ (resp., $S$)
$R \times (S \times T) \equiv (R \times S) \times T$ Also works for joins
$R \times S \equiv S \times R$ Also works for joins
$R \cup (S \cup T) \equiv (R \cup S) \cup T$ Also works for intersection and bag-union
$R \cup S \equiv S \cup R$ Also works for intersections and bag-union
$\sigma_{C}(R \cup S) \equiv \sigma_{C}(R) \cup \sigma_{C}(S)$ Also works for intersections and bag-union
$\pi_{A}(R \cup S) \equiv \pi_{A}(R) \cup \pi_{A}(S)$ Also works for intersections and bag-union
$\sigma_{C}(\gamma_{A, AGG}(R)) \equiv \gamma_{A, AGG}(\sigma_{C}(R))$ If $A$ contains all of the attributes referenced by $C$

### Algorithms

"Volcano" Operators (Iterators)
Operators "pull" tuples, one-at-a-time, from their children.
2-Pass (External) Sort
Create sorted runs, then repeatedly merge runs
Join Algorithms
Quickly picking out specific pairs of tuples.
Aggregation Algorithms
In-Memory vs 2-Pass, Normal vs Group-By

### Strategies for Implementing $R \bowtie_{R.A = S.A} S$

Sort/Merge Join
Sort all of the data upfront, then scan over both sides.
In-Memory Index Join (1-pass Hash; Hash Join)
Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join)
Partition both sides so that tuples don't join across partitions.

### Sort/Merge Join Sort/Merge typically expressed as 3 operators
(2xSort + Merge)

### 2-Pass Hash Join

Limited Queries
Only supports join conditions of the form $R.A = S.B$
Low Memory
Never need more than 1 pair of partitions in memory
High IO Cost
Every record gets written out to disk, and back in.

Can partition on data-values to support other types of queries.

### Index Nested Loop Join

To compute $R \bowtie_{R.A < S.B} S$ with an index on $S.B$
1. Read One Row of $R$
2. Get the value of $R.A$
3. Start index scan on $S.B > [R.A]$
4. Return rows as normal

### Basic Aggregate Pattern

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator
Finalize(Accum)
Extract the aggregate from the accumulator.

### Basic Aggregate Types

Grey et. al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Distributive
Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)
Algebraic
Finite-sized accumulator but needs a finalize (AVG)
Holistic
Unbounded accumulator (MEDIAN)

### Grouping Algorithms

2-pass Hash Aggregate
Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.
Sort-Aggregate
Like Sort-Merge Join: Sort data by groups, then group elements will be adjacent.

Cost-Based Optimization

### Accounting

Figure out the cost of each individual operator.

Only count the number of IOs added by each operator.

OperationRATotal IOs (#pages)Memory (#tuples)
Table Scan $R$ $\frac{|R|}{\mathcal P}$ $O(1)$
Projection $\pi(R)$ $\textbf{io}(R)$ $O(1)$
Selection $\sigma(R)$ $\textbf{io}(R)$ $O(1)$
Union $R \uplus S$ $\textbf{io}(R) + \textbf{io}(S)$ $O(1)$
Sort (In-Mem) $\tau(R)$ $\textbf{io}(R)$ $O(|R|)$
Sort (On-Disk) $\tau(R)$ $\frac{2 \cdot \lfloor log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$ $O(\mathcal B)$
(B+Tree) Index Scan $Index(R, c)$ $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ $O(1)$
(Hash) Index Scan $Index(R, c)$ $1$ $O(1)$
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
OperationRATotal IOs (#pages)Mem (#tuples)
Nested Loop Join (Buffer $S$ in mem) $R \times_{mem} S$ $\textbf{io}(R)+\textbf{io}(S)$ $O(|S|)$
Block NLJ (Buffer $S$ on disk) $R \times_{disk} S$ $\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ $O(1)$
Block NLJ (Recompute $S$) $R \times_{redo} S$ $\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$ $O(1)$
1-Pass Hash Join $R \bowtie_{1PH, c} S$ $\textbf{io}(R) + \textbf{io}(S)$ $O(|S|)$
2-Pass Hash Join $R \bowtie_{2PH, c} S$ $\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ $O(1)$
Sort-Merge Join $R \bowtie_{SM, c} S$ [Sort] [Sort]
(Tree) Index NLJ $R \bowtie_{INL, c}$ $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ $O(1)$
(Hash) Index NLJ $R \bowtie_{INL, c}$ $|R| \cdot 1$ $O(1)$
(In-Mem) Aggregate $\gamma_A(R)$ $\textbf{io}(R)$ $adom(A)$
(Sort/Merge) Aggregate $\gamma_A(R)$ [Sort] [Sort]
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
5. Number of distinct values of $A$ ($adom(A)$)

Estimating IOs requires Estimating $|Q(R)|$

Operator RA Estimated Size
Table $R$ $|R|$
Projection $\pi(Q)$ $|Q|$
Union $Q_1 \uplus Q_2$ $|Q_1| + |Q_2|$
Cross Product $Q_1 \times Q_2$ $|Q_1| \times |Q_2|$
Sort $\tau(Q)$ $|Q|$
Limit $\texttt{LIMIT}_N(Q)$ $N$
Selection $\sigma_c(Q)$ $|Q| \times \texttt{SEL}(c, Q)$
Join $Q_1 \bowtie_c Q_2$ $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct $\delta_A(Q)$ $\texttt{UNIQ}(A, Q)$
Aggregate $\gamma_{A, B \leftarrow \Sigma}(Q)$ $\texttt{UNIQ}(A, Q)$
• $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
• $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.

### (Some) Estimation Techniques

Guess Randomly
Rules of thumb if you have no other options...
Uniform Prior
Use basic statistics to make a very rough guess.
Sampling / History
Small, Quick Sampling Runs (or prior executions of the query).
Histograms
Using more detailed statistics for improved guesses.
Constraints
Using rules about the data for improved guesses.

### Sketching

FlipsScoreProbability E[# Games]
(👽)00.5 2
(🐕)(👽)10.25 4
(🐕)(🐕)(👽)20.125 8
(🐕)$\times N$   (👽)$N$$\frac{1}{2^{N+1}}$ $2^{N+1}$

If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.

To do that, I only need to track my top score!

### Flajolet-Martin Sketches

#### ($\approx$ HyperLogLog)

1. For each record...
1. Hash each record
2. Find the index of the lowest-order non-zero bit
3. Add the index of the bit to a set
2. Find $R$, the lowest index not in the set
3. Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)
4. Repeat (in parallel) as needed

### Count Sketches

1. Pick a number of "trials" and a number of "bins"
2. For each record $O_i$
1. For each "trial" $j$
1. Use a hash function $h_j(O_i)$ to pick a bin
2. Add a $\pm 1$ value determined by hash function $\delta_j(O_i)$ to the bin
• For each trial $j$, estimate the count of $O_i$ by the value of bin $h_j(O_i)$
• Take the median value for all trials.
• ### Count-Min Sketches

1. Pick a number of "trials" and a number of "bins"
2. For each record $O_i$
1. For each "trial" $j$
1. Use a hash function $h_j(O_i)$ to pick a bin
2. Add 1 to the bin
• For each trial $j$, estimate the count of $O_i$ by the value of bin $h_j(O_i)$
• Take the minimum value for all trials.
• Flajolet-Martin Sketches (HyperLogLog)
Estimating Count-Distinct
Count Sketches
Estimating Count-GroupBy
(roughly uniform counts)
Count-Min Sketches
Estimating Count-GroupBy
(small number of heavy hitters)