May 6, 2019
Limitation: Can't distinguish between possible-but unlikely and possible-but very likely.
Idea: Make variables probabilistic
$$\texttt{bob} = \begin{cases} 4 & p = 0.8 \\ 9 & p = 0.2\end{cases}$$
$$\texttt{carol} = \begin{cases} 3 & p = 0.4 \\ 8 & p = 0.6\end{cases}$$
$\mathcal R$ | Name | ZipCode | |
---|---|---|---|
1 | Alice | 10003 | → always |
2 | Bob | 14260 | → if $\texttt{bob} = 4$ |
3 | Bob | 19260 | → if $\texttt{bob} = 9$ |
4 | Carol | 13201 | → if $\texttt{carol} = 3$ |
5 | Carol | 18201 | → if $\texttt{carol} = 8$ |
SELECT COUNT(*)
FROM R NATURAL JOIN ZipCodeLookup
WHERE State = 'NY'
$$Q(\mathcal D) = \begin{cases} 1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\ 2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\ 3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3 \end{cases}$$
$$ = \begin{cases} 1 & p = 0.2 \times 0.6\\ 2 & p = 0.8 \times 0.6 + 0.2 \times 0.4\\ 3 & p = 0.8 \times 0.4 \end{cases}$$
$$ = \begin{cases} 1 & p = 0.12\\ 2 & p = 0.56\\ 3 & p = 0.32\end{cases}$$
$$Q(\mathcal D) = \begin{cases} 1 & p = 0.12\\ 2 & p = 0.56\\ 3 & p = 0.32\end{cases}$$
$E\left[Q(\mathcal D)\right] = 0.12+1.12+0.96 = 2.20$
$P\left[Q(\mathcal D) \geq 2\right] = 0.56+0.32 = 0.88$
In general, computing marginal probabilities for result tuples exactly is #P
... so we approximate
Idea 1: Sample. Pick (e.g.) 10 random possible worlds and compute results for each.
$$R_{1} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 3\}$$
$\mathcal R_{1}$ | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 13201 |
$$\mathcal Q = \{\;3\;\}$$
$$R_{2} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$
$\mathcal R_{2}$ | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 18201 |
$$\mathcal Q = \{\;3,\;2\;\}$$
$$R_{3} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$
$\mathcal R_{3}$ | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 18201 |
$$\mathcal Q = \{\;3,\;2,\;2\;\}$$
$$R_{4} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 3\}$$
$\mathcal R_{4}$ | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 13201 |
$$\mathcal Q = \{\;3,\;2,\;2,\;3\;\}$$
$$R_{5} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 3\}$$
$\mathcal R_{5}$ | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | 14260 |
3 | Carol | 13201 |
$$\mathcal Q = \{\;3,\;2,\;2,\;3,\;3\;\}$$
Problem: Sloooooooooooow.
Can we make it faster?
Idea 1.A: Combine all samples into one query.
$\mathcal R$ | Name | ZipCode | $\mathcal{ID}$ |
---|---|---|---|
1 | Alice | 10003 | 1 |
2 | Bob | 14260 | 1 |
3 | Carol | 13201 | 1 |
4 | Alice | 10003 | 2 |
5 | Bob | 14260 | 2 |
6 | Carol | 18201 | 2 |
7 | Alice | 10003 | 3 |
8 | Bob | 14260 | 3 |
9 | Carol | 18201 | 3 |
10 | Alice | 10003 | 4 |
11 | Bob | 14260 | 4 |
12 | Carol | 13201 | 4 |
13 | Alice | 10003 | 5 |
14 | Bob | 14260 | 5 |
15 | Carol | 13201 | 5 |
$\mathcal Q$ | Count | $\mathcal{ID}$ |
---|---|---|
1 | 3 | 1 |
2 | 2 | 2 |
3 | 2 | 3 |
4 | 3 | 4 |
5 | 3 | 5 |
$\pi_A(R) \rightarrow $ $\pi_{A, \mathcal{ID}}(R)$
$\sigma_\phi(R) \rightarrow $ $\sigma_{\phi}(R)$
$R \uplus S \rightarrow $ $R \uplus S$
$R \times S \rightarrow $ $\pi_{R.*, S.*, R.\mathcal{ID}}\big($$\sigma_{R.\mathcal{ID} = S.\mathcal{ID}}( $$ R \times S)\big)$
$\delta R \rightarrow $ $\delta R$
$_A\gamma_{Agg(*)}(R) \rightarrow $ $_{A, \mathcal{ID}}\gamma_{Agg(*)}(R)$
Still sloooooow.
There's a lot of repetition.
Idea 2.B Use native array-types in DBs
$\mathcal R$ | Name | ZipCode |
---|---|---|
1 | Alice | 10003 |
2 | Bob | [14260, 14260, 14260, 14260, 14260] |
3 | Carol | [13201, 18201, 18201, 13201, 13201] |
$\pi_A(R) \rightarrow $ $\pi_{A}(R)$
$\sigma_\phi(R) \rightarrow $ ?
Idea 1.B' Also mark which tuples are present in which samples
$\mathcal R$ | Name | ZipCode | $\mathcal W$ |
---|---|---|---|
1 | Alice | 10003 | 11111 |
2 | Bob | [14260, 14260, 14260, 14260, 14260] | 11111 |
3 | Carol | [13201, 18201, 18201, 13201, 13201] | 11111 |
$\mathcal R$ | Name | ZipCode | $\mathcal W$ |
---|---|---|---|
1 | Alice | 10003 | 11111 |
2 | Bob | [14260, 14260, 14260, 14260, 14260] | 11111 |
3 | Carol | [13201, 18201, 18201, 13201, 13201] | 10011 |
$\pi_A(R) \rightarrow $ $\pi_{A}(R)$
$\sigma_\phi(R) \rightarrow $ $\sigma_{\mathcal W = 0}($$\pi_{\mathcal W \;\&\; \vec \phi}(R))$
$R \uplus S \rightarrow $ $R \uplus S$
$R \times S \rightarrow $ $\sigma_{\mathcal{W} = 0}\big($$\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( $$ R \times S)\big)$
$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$
$\pi_A(R) \rightarrow \pi_{A}(R)$
$\sigma_\phi(R) \rightarrow \sigma_{\mathcal W = 0}(\pi_{\mathcal W \;\&\; \vec \phi}(R))$
$R \uplus S \rightarrow R \uplus S$
$R \times S \rightarrow \sigma_{\mathcal{W} = 0}\big(\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( R \times S)\big)$
$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$
(Generate aggregates for each sample separately)
Good luck ever doing an equi-join.
Hope your group-by variables aren't uncertain.
Inefficient equi-joins on uncertain variables.
Inefficient aggregates with uncertain variables.
How many samples necessary to get desired precision?
Idea 2: Symbolic Execution (Provenance)
$\sigma_{count \geq 2}(Q) =$
$\texttt{bob} = 4 \wedge \texttt{carol} = 8 $
$\vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3 $
$\vee\; \texttt{bob} = 4 \wedge \texttt{carol} = 3$
$P[\sigma_{count \geq 2}(Q)] = ?$ $\approx$ #SAT
$P[\texttt{x} \wedge \texttt{y}] = P[\texttt{x}] \cdot P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)
$P[\texttt{x} \wedge \texttt{y}] = 0$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)
$P[\texttt{x} \vee \texttt{y}] = 1- (1-P[\texttt{x}]) \cdot (1-P[\texttt{y}])$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)
$P[\texttt{x} \vee \texttt{y}] = P[\texttt{x}] + P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)
Good enough to get us the probability of any boolean formula over mutually exclusive or independent variables
... and otherwise?
For a boolean formula $f$ and variable $\texttt{x}$:
$$f = (\texttt{x} \wedge f[\texttt{x}\backslash T]) \vee (\neg \texttt{x} \wedge f[\texttt{x}\backslash F])$$
Disjunction of mutually-exclusive terms!
... each a conjunction of independent terms.
... and $\texttt{x}$ removed from $f$
Ok... just keep applying Shannon!
Each application creates 2 new formulas (ExpTime!)
Idea 2.A: Combine the two. Use Shanon expansion as long as time/resources permit, then use a #SAT approximation.