May 6, 2019
Limitation: Can't distinguish between possible-but unlikely and possible-but very likely.
Idea: Make variables probabilistic
bob={4p=0.89p=0.2
carol={3p=0.48p=0.6
R | Name | ZipCode | |
---|---|---|---|
1 | Alice | 10003 | → always |
2 | Bob | 14260 | → if bob=4 |
3 | Bob | 19260 | → if bob=9 |
4 | Carol | 13201 | → if carol=3 |
5 | Carol | 18201 | → if carol=8 |
SELECT COUNT(*)
FROM R NATURAL JOIN ZipCodeLookup
WHERE State = 'NY'
Q(D)={1if bob=9∧carol=82if bob=4∧carol=8∨bob=9∧carol=33if bob=4∧carol=3
={1p=0.2×0.62p=0.8×0.6+0.2×0.43p=0.8×0.4
={1p=0.122p=0.563p=0.32
Q(D)={1p=0.122p=0.563p=0.32
E[Q(D)]=0.12+1.12+0.96=2.20
P[Q(D)≥2]=0.56+0.32=0.88
In general, computing marginal probabilities for result tuples exactly is #P
... so we approximate
Idea 1: Sample. Pick (e.g.) 10 random possible worlds and compute results for each.
R1⇐{bob→4,carol→3}
R1 | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 13201 |
Q={3}
R2⇐{bob→9,carol→8}
R2 | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 19260 |
3 | Carol | 18201 |
Q={3,1}
R3⇐{bob→4,carol→8}
R3 | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 18201 |
Q={3,1,2}
R4⇐{bob→4,carol→8}
R4 | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 18201 |
Q={3,1,2,2}
R5⇐{bob→9,carol→8}
R5 | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 19260 |
3 | Carol | 18201 |
Q={3,1,2,2,1}
Problem: Sloooooooooooow.
Can we make it faster?
Idea 1.A: Combine all samples into one query.
R | Name | ZipCode | ID |
---|---|---|---|
1 | Alice | 10003 | 1 |
2 | Bob | 14260 | 1 |
3 | Carol | 13201 | 1 |
4 | Alice | 10003 | 2 |
5 | Bob | 19260 | 2 |
6 | Carol | 18201 | 2 |
7 | Alice | 10003 | 3 |
8 | Bob | 14260 | 3 |
9 | Carol | 18201 | 3 |
10 | Alice | 10003 | 4 |
11 | Bob | 14260 | 4 |
12 | Carol | 18201 | 4 |
13 | Alice | 10003 | 5 |
14 | Bob | 19260 | 5 |
15 | Carol | 18201 | 5 |
Q | Count | ID |
---|---|---|
1 | 3 | 1 |
2 | 1 | 2 |
3 | 2 | 3 |
4 | 2 | 4 |
5 | 1 | 5 |
πA(R)→ πA,ID(R)
σϕ(R)→ σϕ(R)
R⊎S→ R⊎S
R×S→ πR.∗,S.∗,R.ID(σR.ID=S.ID(R×S))
δR→ δR
AγAgg(∗)(R)→ A,IDγAgg(∗)(R)
Still sloooooow.
There's a lot of repetition.
Idea 2.B Use native array-types in DBs
R | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | [14260, 19260, 14260, 14260, 19260] |
3 | Carol | [13201, 18201, 18201, 18201, 18201] |
πA(R)→ πA(R)
σϕ(R)→ ?
Idea 1.B' Also mark which tuples are present in which samples
R | Name | ZipCode | W |
---|---|---|---|
1 | Alice | 10003 | 11111 |
2 | Bob | [14260, 19260, 14260, 14260, 19260] | 11111 |
3 | Carol | [13201, 18201, 18201, 18201, 18201] | 11111 |
R | Name | ZipCode | W |
---|---|---|---|
1 | Alice | 10003 | 11111 |
2 | Bob | [14260, 19260, 14260, 14260, 19260] | 10110 |
3 | Carol | [13201, 18201, 18201, 18201, 18201] | 10000 |
πA(R)→ πA(R)
σϕ(R)→ σW=0(πW&→ϕ(R))
R⊎S→ R⊎S
R×S→ σW=0(πR.∗,S.∗,R.W&S.W(R×S))
AγAgg(B)(R)→ Aγ[Agg(if(W[1]){R.B[1]}),Agg(if(W[2]){R.B[2]}),…](R)
πA(R)→πA(R)
σϕ(R)→σW=0(πW&→ϕ(R))
R⊎S→R⊎S
R×S→σW=0(πR.∗,S.∗,R.W&S.W(R×S))
AγAgg(B)(R)→ Aγ[Agg(if(W[1]){R.B[1]}),Agg(if(W[2]){R.B[2]}),…](R)
(Generate aggregates for each sample separately)
Good luck ever doing an equi-join.
Hope your group-by variables aren't uncertain.
Inefficient equi-joins on uncertain variables.
Inefficient aggregates with uncertain variables.
How many samples necessary to get desired precision?
Idea 2: Symbolic Execution (Provenance)
σcount≥2(Q)=
bob=4∧carol=8
∨bob=9∧carol=3
∨bob=4∧carol=3
P[σcount≥2(Q)]=? ≈ #SAT
P[x∧y]=P[x]⋅P[y]
(iff x and y are independent)
P[x∧y]=0
(iff x and y are mutually exclusive)
P[x∨y]=1−(1−P[x])⋅(1−P[y])
(iff x and y are independent)
P[x∨y]=P[x]+P[y]
(iff x and y are mutually exclusive)
Good enough to get us the probability of any boolean formula over mutually exclusive or independent variables
... and otherwise?
For a boolean formula f and variable x:
f=(x∧f[x∖T])∨(¬x∧f[x∖F])
Disjunction of mutually-exclusive terms!
... each a conjunction of independent terms.
... and x removed from f
Ok... just keep applying Shannon!
Each application creates 2 new formulas (ExpTime!)
Idea 2.A: Combine the two. Use Shanon expansion as long as time/resources permit, then use a #SAT approximation.