CSE 662 Fall 2019 - Differential Dataflow

November 5

### Recap: AGCA / DBToaster

Think of relations as functions from tuples to multiplicities

$$R(t) \rightarrow \text{multiplicity}$$
• Negative multiplicities are allowed
• All possible tuples have multiplicities
• Only a finite number of tuples have non-zero multiplicities
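As a minimal sketch of this view (the names `make_relation` and `multiplicity` are made up for illustration, not part of AGCA or DBToaster), a relation can be stored as a dictionary that records only the tuples with non-zero multiplicity. Every other tuple implicitly maps to 0.

```python
from collections import defaultdict

def make_relation(rows):
    """Build a relation from (tuple, multiplicity) pairs.

    Multiplicities may be negative; entries that sum to zero are dropped,
    so only finitely many tuples are ever stored.
    """
    rel = defaultdict(int)
    for t, m in rows:
        rel[t] += m
    return {t: m for t, m in rel.items() if m != 0}

def multiplicity(rel, t):
    """Every possible tuple has a multiplicity; unstored tuples have 0."""
    return rel.get(t, 0)

# ("b", 2) is inserted once and deleted once, so it cancels out.
R = make_relation([(("a", 1), 2), (("b", 2), 1), (("b", 2), -1)])
```

Here `R` stores a single entry, and `multiplicity(R, ("z", 9))` is 0 even though that tuple was never mentioned.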

### Recap: AGCA / DBToaster

Join
$[R \bowtie S](t_1 \bowtie t_2) = [R](t_1) \times [S](t_2)$
Union
$[R \uplus S](t) = [R](t) + [S](t)$
Projection
$[\pi R](t) = \sum_{t'} [R](t \bowtie t')$
Selection
$[\sigma_{\phi} R](t) = [R](t) \times \phi(t)$
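The four definitions above can be sketched over dictionaries from tuples to multiplicities. This is an illustrative encoding, not DBToaster's implementation: the helper `row` and the choice to represent a tuple as a frozenset of (attribute, value) pairs are assumptions made here so that natural join can match on shared attributes.

```python
from collections import defaultdict

def row(**attrs):
    """A tuple as a hashable set of (attribute, value) pairs."""
    return frozenset(attrs.items())

def _clean(rel):
    """Keep only tuples with non-zero multiplicity."""
    return {t: m for t, m in rel.items() if m != 0}

def union(r, s):
    """[R (+) S](t) = [R](t) + [S](t)"""
    out = defaultdict(int)
    for rel in (r, s):
        for t, m in rel.items():
            out[t] += m
    return _clean(out)

def join(r, s):
    """Natural join: multiplicities of matching tuples multiply."""
    out = defaultdict(int)
    for t1, m1 in r.items():
        for t2, m2 in s.items():
            d2 = dict(t2)
            # Tuples match when they agree on every shared attribute.
            if all(d2[a] == v for a, v in t1 if a in d2):
                out[t1 | t2] += m1 * m2
    return _clean(out)

def project(r, attrs):
    """Sum the multiplicities of all tuples that agree on `attrs`."""
    out = defaultdict(int)
    for t, m in r.items():
        out[frozenset((a, v) for a, v in t if a in attrs)] += m
    return _clean(out)

def select(r, phi):
    """Multiply each multiplicity by phi(t), read as 0 or 1."""
    return _clean({t: m * int(bool(phi(dict(t)))) for t, m in r.items()})

R = {row(a=1, b=2): 2, row(a=3, b=4): 1}
S = {row(b=2, c=5): 3}

joined = join(R, S)
projected = project(R, {"b"})
selected = select(R, lambda t: t["a"] > 1)
deleted = union(R, {row(a=1, b=2): -2})   # a negative multiplicity acts as a deletion
```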

### Recap: AGCA / DBToaster

Given $f(R)$, $R$, and $\delta R$, compute $f(R \uplus \delta R)$ as $f(R) \uplus f'(R, \delta R)$

Recursive implementation by cases.

Base case: $\delta(R) = \delta^+R - \delta^-R$

### Recap: AGCA / DBToaster

Join
$\delta(R \bowtie S) = (\delta(R) \bowtie S) \uplus (R \bowtie \delta(S)) \uplus (\delta(R) \bowtie \delta(S))$
Union
$\delta(R \uplus S) = \delta(R) \uplus \delta(S)$
Projection
$\delta(\pi R) = \pi(\delta(R))$
Selection
$\delta(\sigma R) = \sigma(\delta(R))$
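A small check of the join rule, as a sketch: the relations below are made-up binary relations $R(a, b)$ and $S(b, c)$ joined on $b$, and `delta_join` is a hypothetical helper name. The point is only that $f(R) \uplus \delta f$ gives the same result as recomputing from scratch.

```python
from collections import defaultdict

def union(*rels):
    """Add multiplicities pointwise and drop zeros."""
    out = defaultdict(int)
    for rel in rels:
        for t, m in rel.items():
            out[t] += m
    return {t: m for t, m in out.items() if m != 0}

def join(r, s):
    """Join R(a, b) with S(b, c) on b; multiplicities multiply."""
    out = defaultdict(int)
    for (a, b), m1 in r.items():
        for (b2, c), m2 in s.items():
            if b == b2:
                out[(a, b, c)] += m1 * m2
    return {t: m for t, m in out.items() if m != 0}

def delta_join(r, dr, s, ds):
    """delta(R join S) = (dR join S) + (R join dS) + (dR join dS)"""
    return union(join(dr, s), join(r, ds), join(dr, ds))

R = {(1, 10): 1, (2, 20): 1}
S = {(10, "x"): 1, (20, "y"): 2}
dR = {(3, 10): 1, (2, 20): -1}   # insert (3, 10), delete (2, 20)
dS = {(10, "z"): 1}              # insert (10, "z")

recomputed = join(union(R, dR), union(S, dS))
incremental = union(join(R, S), delta_join(R, dR, S, dS))
```

Both `recomputed` and `incremental` hold the same four tuples. The deleted tuple `(2, 20)` contributes multiplicity $-2$ through $\delta R \bowtie S$, which cancels the $+2$ already present in $R \bowtie S$.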

Examples of dataflow systems:

• GraphLab
• Spark

### Dataflow Systems

A graph of deterministic (usually idempotent) operators.

Edges between operators represent data flows.

(Simple example: Relational Algebra tree)

### Dataflow System Challenges

Partitioning
How to avoid each instance of an operator needing every record.
Scheduling
Where/When should each operator execute to minimize latency/data transfer.

### Loops in Dataflow Systems

Example: Connected Components

1. Assign each node an identifier (label).
2. Propagate identifiers along edges.
3. Each node gets the smallest label from itself or peers.
4. Repeat from step 2 until no label changes.
$$Q := min_{label}\big((Q \bowtie E) \uplus L\big)$$
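The steps above can be sketched as a naive fixpoint loop. This is an illustration only: the function name is made up, edges are assumed to be undirected, and every edge endpoint is assumed to appear in `nodes`.

```python
def connected_components_naive(nodes, edges):
    """Label propagation: every round re-joins ALL current labels with E."""
    # Assumption: edges are undirected, so add both directions.
    e = set(edges) | {(v, u) for u, v in edges}
    labels = {n: n for n in nodes}          # step 1: each node labels itself
    rounds = 0
    while True:
        rounds += 1
        proposed = dict(labels)             # a node may keep its own label
        for u, v in e:                      # step 2: propagate along edges
            proposed[v] = min(proposed[v], labels[u])   # step 3: keep the minimum
        if proposed == labels:              # step 4: stop when nothing changes
            return labels, rounds
        labels = proposed

labels, rounds = connected_components_naive(
    nodes=[1, 2, 3, 4, 5],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

Nodes 1, 2, 3 end up with label 1 and nodes 4, 5 with label 4. Every round scans all edges, including those whose endpoints have already settled.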

... but from the scheduler's perspective ...

### Or in RA

$$Q_0 := L$$

$$Q_1 := min\big((Q_0 \bowtie E) \uplus Q_0\big)$$

$$Q_2 := min\big((Q_1 \bowtie E) \uplus Q_1\big)$$

$$Q_3 := min\big((Q_2 \bowtie E) \uplus Q_2\big)$$

...

(until $Q_i = Q_{i-1}$)

Expensive! Each step re-joins all of $Q_{i-1}$ with $E$, including labels that have already stopped changing.

### The IVM View

$$\delta Q_0 := Q_0 = L$$

$$\delta Q_1 := Q_1 - Q_0 \approx min(\delta Q_0 \bowtie E)$$

$$\delta Q_2 := Q_2 - Q_1 \approx min(\delta Q_1 \bowtie E)$$

$$\delta Q_3 := Q_3 - Q_2 \approx min(\delta Q_2 \bowtie E)$$

...

(until $\delta Q_i = \emptyset$)

$$\delta Q_{i+1} = min\big((Q_{i} \bowtie E) \cup Q_{i}\big) - Q_{i}$$

$$\approx min(Q_{i} \bowtie E) - Q_{i}$$

$$= min((Q_{i-1} \cup \delta Q_{i}) \bowtie E) - Q_{i}$$

$$= min((Q_{i-2} \cup \delta Q_{i-1} \cup \delta Q_{i}) \bowtie E) - Q_{i}$$

$$= min\big( (\sum_{i' \leq i} \delta Q_{i'}) \bowtie E\big) - Q_{i}$$

Note: $\delta Q_{i'} \bowtie E \subseteq Q_{i} \;\;\; \forall i' < i$

$$= min\big( \delta Q_{i} \bowtie E\big) - Q_{i}$$

$$\delta Q_{i+1} \approx min\big( \delta Q_{i} \bowtie E\big) - Q_{i}$$
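The final rule can be sketched as a loop that joins only the labels that changed in the previous round. This is an illustrative sketch, not code from any particular system: the function name and the `joined` counter are assumptions, edges are treated as undirected, and every edge endpoint is assumed to appear in `nodes`.

```python
from collections import defaultdict

def connected_components_delta(nodes, edges):
    """Propagate only delta Q_i, the labels that changed last round."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    labels = {n: n for n in nodes}   # Q_0 = L
    delta = dict(labels)             # delta Q_0 = Q_0
    joined = 0                       # number of (label, edge) pairs examined
    while delta:                     # until delta Q_i is empty
        improved = {}
        for u, lab in delta.items():             # delta Q_i join E
            for v in adj[u]:
                joined += 1
                # "min(...) - Q_i": keep a label only if it beats what v has.
                if lab < improved.get(v, labels[v]):
                    improved[v] = lab
        labels.update(improved)
        delta = improved
    return labels, joined

labels, joined = connected_components_delta(
    nodes=[1, 2, 3, 4, 5],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

After the first round only the nodes whose label improved are joined with $E$ again, so the work per round shrinks as the labels converge.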

Now what if we want to modify E?

### Recall

$$Q_{0,0} := L$$

$$Q_{1,0} := min\big((Q_{0,0} \bowtie E) \uplus Q_{0,0}\big)$$

$$Q_{2,0} := min\big((Q_{1,0} \bowtie E) \uplus Q_{1,0}\big)$$

$$Q_{3,0} := min\big((Q_{2,0} \bowtie E) \uplus Q_{2,0}\big)$$

...

(until $Q_{i,0} = Q_{i-1,0}$)

$$Q_{0,1} := L$$

$$Q_{1,1} := min\big((Q_{0,1} \bowtie (E \uplus \delta E_1)) \uplus Q_{0,1}\big)$$

$$Q_{2,1} := min\big((Q_{1,1} \bowtie (E \uplus \delta E_1)) \uplus Q_{1,1}\big)$$

$$Q_{3,1} := min\big((Q_{2,1} \bowtie (E \uplus \delta E_1)) \uplus Q_{2,1}\big)$$

...

(until $Q_{i,1} = Q_{i-1,1}$)

$$Q_{0,2} := L$$

$$Q_{1,2} := min\big((Q_{0,2} \bowtie (E \uplus \delta E_1 \uplus \delta E_2)) \uplus Q_{0,2}\big)$$

$$Q_{2,2} := min\big((Q_{1,2} \bowtie (E \uplus \delta E_1 \uplus \delta E_2)) \uplus Q_{1,2}\big)$$

$$Q_{3,2} := min\big((Q_{2,2} \bowtie (E \uplus \delta E_1 \uplus \delta E_2)) \uplus Q_{2,2}\big)$$

...

(until $Q_{i,2} = Q_{i-1,2}$)

In general, writing $\delta E_0 := E$ for the initial edge set:

$$Q_{i+1, j} := min\big((Q_{i,j} \bowtie (\sum_{j' \leq j} \delta E_{j'})) \uplus Q_{i,j}\big)$$

observe that ...

$$Q_{i+1, j-1} := min\big((Q_{i,j-1} \bowtie (\sum_{j' < j} \delta E_{j'})) \uplus Q_{i,j-1}\big)$$

so...

$$Q_{i+1, j} := min\big((Q_{i,j-1} \bowtie E_j) \uplus Q_{i, j-1} \uplus Q_{i,j} \uplus (Q_{i,j} - Q_{i, j-1})\bowtie(\sum_{j' \leq j} \delta E_{j'}) \big)$$

Let $\delta Q_{i,j}$ be all newly introduced values relative to all predecessors.

$$\delta Q_{i,j} = Q_{i,j} - Q_{i-1,j} - Q_{i, j-1}$$

$$Q_{i,j} := min\big((Q_{i-1,j-1} \bowtie E_j) \uplus Q_{i-1, j-1} \uplus Q_{i-1,j} \uplus (Q_{i-1,j} - Q_{i-1, j-1})\bowtie(\sum_{j' \leq j} \delta E_{j'}) \big)$$

$$\delta Q_{i,j} = min\big((Q_{i-1,j-1} \bowtie E_j) \uplus (Q_{i-1,j} - Q_{i-1, j-1})\bowtie(\sum_{j' \leq j} \delta E_{j'}) \big)$$

$$\delta Q_{i,j} = min\big((\delta Q_{i,j-1} \bowtie E_j) \uplus (Q_{i-1,j} - Q_{i-1, j-1})\bowtie(\sum_{j' \leq j} \delta E_{j'}) \big)$$

$$\delta Q_{i,j} = min\big((\delta Q_{i,j-1} \bowtie E_j) \uplus (\delta Q_{i-1,j}\bowtie\sum_{j' \leq j} \delta E_{j'}) \big)$$

Much cheaper to evaluate!
(but requires keeping around all $\delta Q_{i,j}$)
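To make the bookkeeping concrete, here is a small sketch of storing one difference per (iteration $i$, edge version $j$) and rebuilding any $Q_{i,j}$ from the stored differences. It rests on an assumption that goes slightly beyond the formulas above: differences are taken with signed multiplicities, so that summing every $\delta Q_{i',j'}$ with $i' \leq i$ and $j' \leq j$ gives back $Q_{i,j}$ exactly. Under that assumption the difference works out to $Q_{i,j} - Q_{i-1,j} - Q_{i,j-1} + Q_{i-1,j-1}$, because the part shared by both predecessors would otherwise be subtracted twice. All function names and the example graph are made up, and each $Q_{i,j}$ is computed from scratch here purely to show the storage scheme, which a real system would avoid.

```python
def add(a, b, sign=1):
    """Signed multiset sum a + sign*b, dropping zero multiplicities."""
    out = dict(a)
    for t, m in b.items():
        out[t] = out.get(t, 0) + sign * m
    return {t: m for t, m in out.items() if m != 0}

def labels_after(rounds, nodes, edges):
    """Q_{i,j}: (node, label) pairs after `rounds` propagation steps."""
    und = set(edges) | {(v, u) for u, v in edges}   # assume undirected edges
    lab = {n: n for n in nodes}
    for _ in range(rounds):
        nxt = dict(lab)
        for u, v in und:
            nxt[v] = min(nxt[v], lab[u])
        lab = nxt
    return {(n, l): 1 for n, l in lab.items()}

def differences(Q, n_iters, n_versions):
    """dQ[i, j] = Q[i, j] minus the sum of all earlier differences."""
    dQ = {}
    for i in range(n_iters):
        for j in range(n_versions):
            earlier = {}
            for (i2, j2), d in dQ.items():
                if i2 <= i and j2 <= j:
                    earlier = add(earlier, d)
            dQ[i, j] = add(Q[i, j], earlier, sign=-1)
    return dQ

def reconstruct(dQ, i, j):
    """Rebuild Q[i, j] by summing every difference at or before (i, j)."""
    acc = {}
    for (i2, j2), d in dQ.items():
        if i2 <= i and j2 <= j:
            acc = add(acc, d)
    return acc

nodes = [1, 2, 3]
versions = [[(2, 3)], [(2, 3), (1, 2)]]   # E, then E plus a new edge (1, 2)
Q = {(i, j): labels_after(i, nodes, versions[j])
     for i in range(4) for j in range(2)}
dQ = differences(Q, 4, 2)
```

In this example most differences are empty: only `dQ[0, 0]`, `dQ[1, 0]`, `dQ[1, 1]` and `dQ[2, 1]` hold any tuples, which is why keeping all of them around can still be cheap.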