CSE-4/562 Spring 2021 - Parallel Query Execution

March 18, 2021

## Why Scale?

Scan 1 PB at 300MB/s (SATA r2)

## Communication Models

### Shared Nothing / Message Passing

Shared Memory
HDFS, S3, RAM+Modern OSes
NUMA
AMD CPUs
Shared Nothing
MPP, Actor Model

Today we want to clearly see the communications.

### Putting Workers Together

Wikipedia - Public Domain Image - Photographer Unknown

freesvg.org

freesvg.org

freesvg.org

freesvg.org

### How to Partition Data?

• Arbitrarily
• Range
• Hash

look familiar?

Can we run each worker on one partition?

## Yes

$N$ partitions in, $N$ partitions out

## Yes

$N$ partitions in, $N$ partitions out

## Yes

Trick question, just combines partitions!

## No

$N$ partitions in, $1$ partition out

### Partial Aggregation

Algebraic Aggregates (Count, Sum, Avg, Min, Max)
Bounded-size intermediate state
Holistic Aggregates (Median, Mode, Count-Distinct)
Unbounded-size intermediate state
$$\Sigma_{SUM} (R_1 \uplus R_2 \uplus \ldots \uplus R_N)$$

Aggregate needs to process $N$ partitions.

$$\Sigma_{SUM} (\Sigma_{SUM}(R_1) \uplus \Sigma_{SUM}(R_2) \uplus \ldots \uplus \Sigma_{SUM}(R_N))$$

Final aggregate only needs to process $N$ tuples.

$$\Sigma_{AVG} (R_1 \uplus R_2 \uplus \ldots \uplus R_N)$$

↓↓↓↓↓

$$\Sigma_{\frac{SUM(A)}{SUM(B)}} (\Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_1) \ldots \Sigma_{A \leftarrow SUM,\; B \leftarrow COUNT}(R_N))$$

### Basic Aggregate Pattern

Init
Define a starting value for the accumulator
Fold(Accum, New)
Merge a new value into the accumulator
Finalize(Accum)
Extract the aggregate from the accumulator.
Merge(Accum, Accum)
Merge two accumulators together.

## No

Every partition from one table needs to pair
with every partition from the other.

$$(R_1 \uplus \ldots \uplus R_N) \bowtie (S_1 \uplus \ldots \uplus S_K)$$

↓↓↓↓↓↓

$$(R_1 \bowtie S_1) \uplus \ldots \uplus (R_1 \bowtie S_K)$$ $$\ldots\uplus \ldots \uplus \ldots$$ $$(R_N \bowtie S_1) \uplus \ldots \uplus (R_N \bowtie S_K)$$
 $S_1$ $S_2$ $S_3$ $S_4$ $R_1$ $R_1\bowtie S_1$ $R_1\bowtie S_2$ $R_1\bowtie S_3$ $R_1\bowtie S_4$ $R_2$ $R_2\bowtie S_1$ $R_2\bowtie S_2$ $R_2\bowtie S_3$ $R_2\bowtie S_4$ $R_3$ $R_3\bowtie S_1$ $R_3\bowtie S_2$ $R_3\bowtie S_3$ $R_3\bowtie S_4$ $R_4$ $R_4\bowtie S_1$ $R_4\bowtie S_2$ $R_4\bowtie S_3$ $R_4\bowtie S_4$

$N$ workers gets us $\sqrt{N}$ scaling

### How to Partition Data?

• Arbitrarily
• Range
• Hash
$$R \bowtie_{A} S$$ $$R_i = \sigma_{\texttt{hash}(A) = i}(R)$$ $$S_i = \sigma_{\texttt{hash}(A) = i}(S)$$
 $S_1$ $S_2$ $S_3$ $S_4$ $R_1$ $R_1\bowtie S_1$ $R_1\bowtie S_2$ $R_1\bowtie S_3$ $R_1\bowtie S_4$ $R_2$ $R_2\bowtie S_1$ $R_2\bowtie S_2$ $R_2\bowtie S_3$ $R_2\bowtie S_4$ $R_3$ $R_3\bowtie S_1$ $R_3\bowtie S_2$ $R_3\bowtie S_3$ $R_3\bowtie S_4$ $R_4$ $R_4\bowtie S_1$ $R_4\bowtie S_2$ $R_4\bowtie S_3$ $R_4\bowtie S_4$

Back to $N$ scaling for $N$ workers

What if the partitions aren't aligned so nicely?

Can we do better?

Focus on $R_1 \bowtie_B S_1$

Problem: All tuples in $R_1$ and $S_1$ need to be
sent to the same worker.

### Data Transfer

• Limited IO/Network bandwidth
• Compute needed to receive data

Idea 1: Put the worker on the node that has the data!

Problem: What if the data is on 2 different nodes?

Idea 1.b: Put the worker on one of the nodes with data.

Can we reduce network use more?

Problem: Worker 2 is still sending a lot of data.

Idea: Compress $\pi_B(S_1)$

### Lossy Compression

(not all errors are equal)

False Positives
($b \in \pi_B(S_1)$ when it isn't)
not ideal, but ok
False Negatives
($b \not\in \pi_B(S_1)$ when it is)

### Bloom Filters

$$filter \leftarrow \texttt{Bloom}(\textbf{Alice}, \textbf{Bob}, \textbf{Carol}, \textbf{Dave})$$
 User: Is Alice part of the set? $filter$: Yes User: Is Eve part of the set? $filter$: No User: Is Fred part of the set? $filter$: Yes

### Bloom Filter

Test always returns Yes if the element is in the set.

Test usually returns No if the element is not in the set.

### Bloom Filters

A bloom filter is an array of bits.

$M$: Number of bits in the array.

$K$: Number of hash functions.

### For one record/key

1. $\forall i \in [M] : filter[i] = 0$
2. $\forall j \in [K] : filter[\texttt{hash}_j(key)] = 1$

Each bit vector has $\sim K$ bits set.

 $Key_1$ 00101010 $Key_2$ 01010110 $Key_3$ 10000110 $Key_4$ 01001100

#### Filters are combined by Bitwise-OR

$$Key_1 \;|\; Key_2 = 01111110$$

#### Test for inclusion by checking for bits

$$Key_i \;\&\; filter = Key_i$$
 $Key_1$ 00101010 $Key_2$ 01010110 $Key_3$ 10000110 $Key_4$ 01001100
$$Key_1 \;|\; Key_2 = 01111110$$
 $Key_1 \;\&\; 01111110$ 00101010 ✅ $Key_3 \;\&\; 01111110$ 00101010 ❌ $Key_4 \;\&\; 01111110$ 01001100 ✅

(False positive)

Next time: Online Aggregation/AQP