April 26, 2019
Why does a tuple appear in a query output?
The answer can be summarized by a polynomial.
|
|
Q(1)=ae+af+bg
Q(1)=ae+af+bg
a=b=e=f=g=T→ T
Q(1)=ae+af+bg
{a=3;b=1;e=1;f=1;g=3}→ 12
Q(1)=ae+af+bg
{a=t1;b=t2;e=t5;f=t6;g=t7}→
{{t1,t5},{t1,t6},{t2,t7}}
With ⟨[0,5],min,max,0,5⟩
(0 = minimum security, 5 = top secret)
Q(1)=ae+af+bg
{a=5;b=1;e=2;f=3;g=2}→2
Design | GProM | SMOKE |
---|---|---|
Representation | Relational | Specialized |
Implementation | Rewriting | Instrumentation |
Generation | Lazy + Eager | Eager + Hybrid |
|
|
|
|
Extra columns store annotations
PROVENANCE OF SELECT A FROM R NATURAL JOIN S
We'll be using bag-relational algebra.
R→πR.∗,ϕ←ROWID(R)
πA(R)→?
πA,ϕ(R)
σc(R)→?
σc(R)
R×S→?
ρϕ→ϕ1(R)×ρϕ→ϕ2(S)
|
|
R⋈S | A | B | C | ϕ1 | ϕ2 |
---|---|---|---|---|---|
1 | 2 | 1 | R1 | S1 | |
1 | 2 | 2 | R1 | S2 | |
1 | 3 | 3 | R2 | S3 | |
2 | 3 | 3 | R3 | S3 |
πA(R)→πA,ϕ1,…,ϕN(R)
R×S→?
R (resp., S) has M (resp., N) annotation columns.
R×ρϕ1→ϕM+1,…,ϕN→ϕM+N(S)
(Rename S's columns, appending to R's)
R⊎S→?
If R has M annotation columns
and S has N>M columns:
πR.∗,ϕM+1←NULL,…,ϕN←NULL(R)⊎S
(Pad the narrower relation with NULLs)
|
| ||||||||||||||||||||||||||||||||||||
|
(R⋈S)⊎T | A | B | C | ϕ1 | ϕ2 |
---|---|---|---|---|---|
1 | 2 | 1 | R1 | S1 | |
1 | 2 | 2 | R1 | S2 | |
1 | 3 | 3 | R2 | S3 | |
2 | 3 | 3 | R3 | S3 | |
2 | 1 | 4 | T1 | NULL | |
4 | 5 | 5 | T2 | NULL |
Many-to-one operators
Idea: Construct provenance by combining multiple rows.
Q1 | A | B | C | ϕ1 | ϕ2 |
---|---|---|---|---|---|
1 | 2 | 1 | R1 | S1 | |
1 | 2 | 2 | R1 | S2 | |
1 | 3 | 3 | R2 | S3 | |
2 | 3 | 3 | R3 | S3 | |
2 | 1 | 4 | T1 | NULL | |
4 | 5 | 5 | T2 | NULL |
δ(πA(Q1))?
δ(πA(Q1)) | A | ϕ1 | ϕ2 |
---|---|---|---|
1 | R1 | S1 | |
1 | R1 | S2 | |
1 | R2 | S3 | |
2 | R3 | S3 | |
2 | T1 | NULL | |
4 | T2 | NULL |
δR→?
R
AγSUM(B)(R)→?
(AγSUM(B)(R))⋈A(πA,ϕ1,…,ϕMR)
AγCOUNT(Q1) (lhs only) | A | COUNT |
---|---|---|
1 | 3 | |
2 | 2 | |
4 | 1 |
AγCOUNT(Q1) | A | COUNT | ϕ1 | ϕ2 |
---|---|---|---|---|
1 | 3 | R1 | S1 | |
1 | 3 | R1 | S2 | |
1 | 3 | R2 | S3 | |
2 | 2 | R3 | S3 | |
2 | 2 | T1 | NULL | |
4 | 1 | T2 | NULL |
Does this work? ... not quite
Is it a true provenance polynomial?
Plug in "F" for R1 (i.e., simulate removing R1 from the input)
Q′1 | A | B | C | |
---|---|---|---|---|
1 | 2 | 1 | → R1∧S1 | |
1 | 2 | 2 | → R1∧S2 | |
1 | 3 | 3 | → R2∧S3 | |
2 | 3 | 3 | → R3∧S3 | |
2 | 1 | 4 | → T1 | |
4 | 5 | 5 | → T2 |
Plug in "F" for R1 (i.e., simulate removing R1 from the input)
Q′1 | A | B | C | |
---|---|---|---|---|
1 | 2 | 1 | → F∧S1 | |
1 | 2 | 2 | → F∧S2 | |
1 | 3 | 3 | → T∧S3 | |
2 | 3 | 3 | → T∧S3 | |
2 | 1 | 4 | → T | |
4 | 5 | 5 | → T |
Plug in "F" for R1 (i.e., simulate removing R1 from the input)
Q′1 | A | B | C | |
---|---|---|---|---|
1 | 2 | 1 | → F | |
1 | 2 | 2 | → F | |
1 | 3 | 3 | → T | |
2 | 3 | 3 | → T | |
2 | 1 | 4 | → T | |
4 | 5 | 5 | → T |
Q′1 | A | B | C | ϕ1 | ϕ2 |
---|---|---|---|---|---|
1 | 3 | 3 | R2 | S3 | |
2 | 3 | 3 | R3 | S3 | |
2 | 1 | 4 | T1 | NULL | |
4 | 5 | 5 | T2 | NULL |
AγCOUNT(Q′1) | A | COUNT | ||
---|---|---|---|---|
1 | 1 | R2 | S3 | |
2 | 2 | R3 | S3 | |
2 | 2 | T1 | NULL | |
4 | 1 | T2 | NULL |
vs
AγCOUNT(Q1) | A | COUNT | ||
---|---|---|---|---|
1 | 3 | R2 | S3 | |
2 | 2 | R3 | S3 | |
2 | 2 | T1 | NULL | |
4 | 1 | T2 | NULL |
Not quite a provenance polynomial...
...but still correct for lineage queries
(and there are other solutions)
Pro: Standardized, relational representation
Con: Huge blow-up in data size (aggregation/distinct)
... but it works better for lazy evaluation
SELECT DISTINCT A, COUNT
FROM (PROVENANCE OF Q_1)
WHERE phi_1 = 'R_1'
Observation 1: predicate on ϕ can be pushed down.
Observation 2: DISTINCT
would be a no-op (group-by attributes are always a key) if not for provenance additions.
This query can be made to run very fast!
The power of the relational representation is that it can be queried and optimized using an existing database. Everything is just a query.
Goal 1: Find the input tuples that produced a given output tuple (backward query).
Goal 2: Find the output tuples that result from a given input tuple (forward query).
Combine for interactive visualizations
Basic Idea: Each operator emits two structures:
idin→{idout} and idout→{idin}
A "Forward index" and a "Backward index"
Again, we'll be using bag-relational algebra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
AγCOUNT((R⋈S)⋈T)→fwd⋈1,fwd⋈2,fwdγ,back⋈1,back⋈2,backγ
What tuples went in to ⟨A:1,COUNT:3⟩
(output tuple 1)?
πback⋈2.in(((backγ⋈backγ.in=back⋈2.outback⋈2) ⋈back⋈2.in=back⋈1.outback⋈1))