CSE-4/562 Spring 2019 - Provenance

April 26, 2019

### Recap

Why does a tuple appear in a query output?

The answer can be summarized by a polynomial.

RAB
112→ a
213→ b
323→ c
424→ d
SBC
121→ e
222→ f
333→ g

$Q(1) = ae + af + bg$

### Set Semantics

$Q(1) = ae + af + bg$

$a = b = e = f = g = T \rightarrow$ $T$

### Bag Semantics

$Q(1) = ae + af + bg$

$\{a = 3; b = 1; e = 1; f = 1; g = 3 \} \rightarrow$ $12$

### How Provenance

$Q(1) = ae + af + bg$

$\{a = t_1; b = t_2; e = t_5; f = t_6; g = t_7 \}\rightarrow$
$\{\{t_1, t_5\}, \{t_1, t_6\}, \{t_2, t_7\}\}$

### Access Control

With $\left< [0, 5], \min, \max, 0, 5 \right>$
(0 = minimum security, 5 = top secret)

$Q(1) = ae + af + bg$

$\{a = 5; b = 1; e = 2; f = 3; g = 2 \}\rightarrow$$2 Smoke : Fine-grained Lineage at Interactive Speed (Psallidas, Wu) ### Implementation Challenges 1. Representing Provenance 2. Generating Provenance 3. Querying Provenance ### Representing Provenance Relational/Annotation Extra columns in each table store provenance information. External Provenance stored in external data-structures or tables. ### Generating Provenance Query Rewriting Q \rightarrow Q', where Q' computes the provenance. Operator Instrumentation Specialized operators construct provenance. ### Generating Provenance Eager Compute provenance as part of query results. Lazy Store query: Compute provenance as-needed. DesignGProMSMOKE RepresentationRelationalSpecialized ImplementationRewritingInstrumentation GenerationLazy + EagerEager + Hybrid ### GProM RAB 112→ a 213→ b 323→ c 424→ d SBC 121→ e 222→ f 333→ g RAB\phi 12R_1 13R_2 23R_3 24R_4 SBC\phi 21S_1 22S_2 33S_3 Extra columns store annotations  PROVENANCE OF SELECT A FROM R NATURAL JOIN S  We'll be using bag-relational algebra. R \rightarrow \pi_{R.*, \phi \leftarrow \texttt{ROWID}}(R) \pi_A(R) \rightarrow ? \pi_{A, \phi}(R) \sigma_c(R) \rightarrow ? \sigma_c(R) R \times S \rightarrow ? \rho_{\phi \rightarrow \phi_1} (R) \times \rho_{\phi \rightarrow \phi_2} (S) RAB\phi 12R_1 13R_2 23R_3 24R_4 SBC\phi 21S_1 22S_2 33S_3 R \bowtie SABC\phi_1$$\phi_2$
121$R_1$$S_1 122R_1$$S_2$
133$R_2$$S_3 233R_3$$S_3$

$\pi_A(R) \rightarrow \pi_{A, \phi_1, \ldots, \phi_N}(R)$

$R \times S \rightarrow ?$

$R$ (resp., $S$) has $M$ (resp., $N$) annotation columns.

$R \times \rho_{\phi_{1} \rightarrow \phi_{M+1}, \ldots, \phi_{N} \rightarrow \phi_{M+N}} (S)$

(Rename $S$'s columns, appending to $R$'s)

$R \uplus S \rightarrow ?$

If $R$ has $M$ annotation columns
and $S$ has $N > M$ columns:

$\pi_{R.*, \phi_{M+1} \leftarrow \texttt{NULL}, \ldots, \phi_{N} \leftarrow \texttt{NULL}} (R) \uplus S$

(Pad the narrower relation with $\texttt{NULL}$s)

RAB$\phi$
12$R_1$
13$R_2$
23$R_3$
24$R_4$
SBC$\phi$
21$S_1$
22$S_2$
33$S_3$
TABC$\phi$
214$T_1$
455$T_2$
$(R \bowtie S) \uplus T$ABC$\phi_1$$\phi_2 121R_1$$S_1$
122$R_1$$S_2 133R_2$$S_3$
233$R_3$$S_3 214T_1NULL 455T_2NULL ### Challenge Many-to-one operators Idea: Construct provenance by combining multiple rows. Q_1ABC\phi_1$$\phi_2$
121$R_1$$S_1 122R_1$$S_2$
133$R_2$$S_3 233R_3$$S_3$
214$T_1$NULL
455$T_2$NULL

$\delta\big(\pi_{A}(Q_1)\big)$?

$\delta\big(\pi_{A}(Q_1)\big)$A$\phi_1$$\phi_2 1R_1$$S_1$
1$R_1$$S_2 1R_2$$S_3$
2$R_3$$S_3 2T_1NULL 4T_2NULL \delta R \rightarrow ? R _A\gamma_{SUM(B)}(R) \rightarrow ? \big(_A\gamma_{SUM(B)}(R)\big) \bowtie_{A} \big( \pi_{A, \phi_1, \ldots, \phi_{M}} R\big) _A\gamma_{COUNT}(Q_1) (lhs only)ACOUNT 13 22 41 _A\gamma_{COUNT}(Q_1)ACOUNT\phi_1$$\phi_2$
13$R_1$$S_1 13R_1$$S_2$
13$R_2$$S_3 22R_3$$S_3$
22$T_1$NULL
41$T_2$NULL

Does this work? ... not quite

Is it a true provenance polynomial?

Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)

$Q_1'$ABC
121→ $R_1 \wedge S_1$
122→ $R_1 \wedge S_2$
133→ $R_2 \wedge S_3$
233→ $R_3 \wedge S_3$
214→ $T_1$
455→ $T_2$

Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)

$Q_1'$ABC
121→ $F \wedge S_1$
122→ $F \wedge S_2$
133→ $T \wedge S_3$
233→ $T \wedge S_3$
214→ $T$
455→ $T$

Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)

$Q_1'$ABC
121→ $F$
122→ $F$
133→ $T$
233→ $T$
214→ $T$
455→ $T$
$Q_1'$ABC$\phi_1$$\phi_2 133R_2$$S_3$
233$R_3$$S_3 214T_1NULL 455T_2NULL _A\gamma_{COUNT}(Q_1')ACOUNT 11R_2$$S_3$
22$R_3$$S_3 22T_1NULL 41T_2NULL vs _A\gamma_{COUNT}(Q_1)ACOUNT 13R_2$$S_3$