CSE-4/562 Spring 2019 - Provenance

Provenance

CSE-4/562 Spring 2019

April 26, 2019

Textbook: GProM, SMOKE

Recap

Why does a tuple appear in a query output?

The answer can be summarized by a polynomial.

RAB
112→ a
213→ b
323→ c
424→ d
SBC
121→ e
222→ f
333→ g

$Q(1) = ae + af + bg$

Set Semantics

$Q(1) = ae + af + bg$

$a = b = e = f = g = T \rightarrow$ $T$

Bag Semantics

$Q(1) = ae + af + bg$

$\{a = 3; b = 1; e = 1; f = 1; g = 3 \} \rightarrow$ $12$

How Provenance

$Q(1) = ae + af + bg$

$\{a = t_1; b = t_2; e = t_5; f = t_6; g = t_7 \}\rightarrow$
$\{\{t_1, t_5\}, \{t_1, t_6\}, \{t_2, t_7\}\}$

Access Control

With $\left< [0, 5], \min, \max, 0, 5 \right>$
(0 = minimum security, 5 = top secret)

$Q(1) = ae + af + bg$

$\{a = 5; b = 1; e = 2; f = 3; g = 2 \}\rightarrow$$2$

Smoke : Fine-grained Lineage at Interactive Speed (Psallidas, Wu)

Implementation Challenges

  1. Representing Provenance
  2. Generating Provenance
  3. Querying Provenance

Representing Provenance

Relational/Annotation
Extra columns in each table store provenance information.
External
Provenance stored in external data-structures or tables.

Generating Provenance

Query Rewriting
$Q \rightarrow Q'$, where $Q'$ computes the provenance.
Operator Instrumentation
Specialized operators construct provenance.

Generating Provenance

Eager
Compute provenance as part of query results.
Lazy
Store query: Compute provenance as-needed.
DesignGProMSMOKE
RepresentationRelationalSpecialized
ImplementationRewritingInstrumentation
GenerationLazy + EagerEager + Hybrid

GProM

RAB
112→ a
213→ b
323→ c
424→ d
SBC
121→ e
222→ f
333→ g
RAB$\phi$
12$R_1$
13$R_2$
23$R_3$
24$R_4$
SBC$\phi$
21$S_1$
22$S_2$
33$S_3$

Extra columns store annotations


      PROVENANCE OF SELECT A FROM R NATURAL JOIN S
    

We'll be using bag-relational algebra.

$R \rightarrow \pi_{R.*, \phi \leftarrow \texttt{ROWID}}(R)$

$\pi_A(R) \rightarrow ?$

$\pi_{A, \phi}(R)$

$\sigma_c(R) \rightarrow ?$

$\sigma_c(R)$

$R \times S \rightarrow ?$

$\rho_{\phi \rightarrow \phi_1} (R) \times \rho_{\phi \rightarrow \phi_2} (S)$

RAB$\phi$
12$R_1$
13$R_2$
23$R_3$
24$R_4$
SBC$\phi$
21$S_1$
22$S_2$
33$S_3$
$R \bowtie S$ABC$\phi_1$$\phi_2$
121$R_1$$S_1$
122$R_1$$S_2$
133$R_2$$S_3$
233$R_3$$S_3$

$\pi_A(R) \rightarrow \pi_{A, \phi_1, \ldots, \phi_N}(R)$

$R \times S \rightarrow ?$

$R$ (resp., $S$) has $M$ (resp., $N$) annotation columns.

$R \times \rho_{\phi_{1} \rightarrow \phi_{M+1}, \ldots, \phi_{N} \rightarrow \phi_{M+N}} (S)$

(Rename $S$'s columns, appending to $R$'s)

$R \uplus S \rightarrow ?$

If $R$ has $M$ annotation columns
and $S$ has $N > M$ columns:

$\pi_{R.*, \phi_{M+1} \leftarrow \texttt{NULL}, \ldots, \phi_{N} \leftarrow \texttt{NULL}} (R) \uplus S$

(Pad the narrower relation with $\texttt{NULL}$s)

RAB$\phi$
12$R_1$
13$R_2$
23$R_3$
24$R_4$
SBC$\phi$
21$S_1$
22$S_2$
33$S_3$
TABC$\phi$
214$T_1$
455$T_2$
$(R \bowtie S) \uplus T$ABC$\phi_1$$\phi_2$
121$R_1$$S_1$
122$R_1$$S_2$
133$R_2$$S_3$
233$R_3$$S_3$
214$T_1$NULL
455$T_2$NULL

Challenge

Many-to-one operators

Idea: Construct provenance by combining multiple rows.

$Q_1$ABC$\phi_1$$\phi_2$
121$R_1$$S_1$
122$R_1$$S_2$
133$R_2$$S_3$
233$R_3$$S_3$
214$T_1$NULL
455$T_2$NULL

$\delta\big(\pi_{A}(Q_1)\big)$?

$\delta\big(\pi_{A}(Q_1)\big)$A$\phi_1$$\phi_2$
1$R_1$$S_1$
1$R_1$$S_2$
1$R_2$$S_3$
2$R_3$$S_3$
2$T_1$NULL
4$T_2$NULL

$\delta R \rightarrow ?$

$R$

$_A\gamma_{SUM(B)}(R) \rightarrow ?$

$\big(_A\gamma_{SUM(B)}(R)\big) \bowtie_{A} \big( \pi_{A, \phi_1, \ldots, \phi_{M}} R\big)$

$_A\gamma_{COUNT}(Q_1)$ (lhs only)ACOUNT
13
22
41
$_A\gamma_{COUNT}(Q_1)$ACOUNT$\phi_1$$\phi_2$
13$R_1$$S_1$
13$R_1$$S_2$
13$R_2$$S_3$
22$R_3$$S_3$
22$T_1$NULL
41$T_2$NULL

Does this work? ... not quite

Is it a true provenance polynomial?

Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)

$Q_1'$ABC
121→ $R_1 \wedge S_1$
122→ $R_1 \wedge S_2$
133→ $R_2 \wedge S_3$
233→ $R_3 \wedge S_3$
214→ $T_1$
455→ $T_2$

Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)

$Q_1'$ABC
121→ $F \wedge S_1$
122→ $F \wedge S_2$
133→ $T \wedge S_3$
233→ $T \wedge S_3$
214→ $T$
455→ $T$

Plug in "F" for $R_1$ (i.e., simulate removing $R_1$ from the input)

$Q_1'$ABC
121→ $F$
122→ $F$
133→ $T$
233→ $T$
214→ $T$
455→ $T$
$Q_1'$ABC$\phi_1$$\phi_2$
133$R_2$$S_3$
233$R_3$$S_3$
214$T_1$NULL
455$T_2$NULL
$_A\gamma_{COUNT}(Q_1')$ACOUNT
11$R_2$$S_3$
22$R_3$$S_3$
22$T_1$NULL
41$T_2$NULL

vs

$_A\gamma_{COUNT}(Q_1)$ACOUNT
13$R_2$$S_3$
22$R_3$$S_3$
22$T_1$NULL
41$T_2$NULL

Not quite a provenance polynomial...

...but still correct for lineage queries

(and there are other solutions)

Pro: Standardized, relational representation

Con: Huge blow-up in data size (aggregation/distinct)

... but it works better for lazy evaluation


                  SELECT DISTINCT A, COUNT 
                  FROM (PROVENANCE OF Q_1)
                  WHERE phi_1 = 'R_1'
    

Observation 1: predicate on $\phi$ can be pushed down.

Observation 2: DISTINCT would be a no-op (group-by attributes are always a key) if not for provenance additions.

This query can be made to run very fast!

The power of the relational representation is that it can be queried and optimized using an existing database. Everything is just a query.

SMOKE

Goal 1: Find the input tuples that produced a given output tuple (backward query).

Goal 2: Find the output tuples that result from a given input tuple (forward query).

Combine for interactive visualizations

Smoke : Fine-grained Lineage at Interactive Speed (Psallidas, Wu)

Basic Idea: Each operator emits two structures:
$\textbf{id}_{in} \rightarrow \{ \textbf{id}_{out} \}$ and $\textbf{id}_{out} \rightarrow \{ \textbf{id}_{in} \}$

A "Forward index" and a "Backward index"

Again, we'll be using bag-relational algebra.

$R$AB
112
213
323
424
$\pi_A R$AB
11
21
32
42
forwardinout
11
22
33
44
backwardoutin
11
22
33
44
$R$AB
112
213
323
424
$\sigma_{B=3} R$AB
113
223
forwardinout
21
32
backwardoutin
12
23
$R$AB
112
213
323
424
$S$BC
121
222
333
$R \bowtie S$ABC
1121
2122
3133
forwardinout
l-11
r-11
l-12
r-22
l-23
r-34
backwardoutin
1l-1, r-1
2l-1, r-2
3l-2, r-3
$R$AB
112
213
323
424
$_A \gamma_{SUM(B)}(R)$ASUM
112
222
forwardinout
11
21
32
42
backwardoutin
11, 2
23, 4

Optimizations

Multimap
Avoid DB overheads by storing forward/backward indexes in special in-mem datastructure
Simple Array where possible
Further specialize layout for 1-1 provenance operators.
Size selection
Avoid (expensive) structure reallocation costs by using DB statistics for join selectivity, number of rows
Deferred Provenance
2-stage execution: First results (and fixed-size metadata), then provenance

Recovering Provenance

$_A\gamma_{COUNT}\big((R \bowtie S) \bowtie T\big) \rightarrow \\ \textbf{fwd}_{\bowtie_1}, \textbf{fwd}_{\bowtie_2}, \textbf{fwd}_{\gamma}, \textbf{back}_{\bowtie_1}, \textbf{back}_{\bowtie_2}, \textbf{back}_{\gamma}$

What tuples went in to $\left< A: 1, COUNT: 3 \right>$
(output tuple 1)?

$$\pi_{\textbf{back}_{\bowtie_2}.in}\big(\big( (\textbf{back}_{\gamma} \bowtie_{\textbf{back}_{\gamma}.in = \textbf{back}_{\bowtie_2}.out} \textbf{back}_{\bowtie_2})$$ $$\bowtie_{\textbf{back}_{\bowtie_2}.in = \textbf{back}_{\bowtie_1}.out} \textbf{back}_{\bowtie_1}\big)\big)$$