CSE-4/562 Spring 2019 - Probabilistic Databases

May 6, 2019

### (One Form of) Incomplete Databases

• Define each choice as a variable
• Tag each row with a boolean formula over those variables
• Each possible world is one assignment of values to the variables
• A possible world contains exactly the rows whose formulas evaluate to "true"

**Certain Tuple:** A tuple that appears in all possible worlds

**Possible Tuple:** A tuple that appears in at least one possible world

Limitation: Can't distinguish between possible-but-unlikely and possible-but-very-likely tuples.

Idea: Make variables probabilistic

### Example

$$\texttt{bob} = \begin{cases} 4 & p = 0.8 \\ 9 & p = 0.2\end{cases}$$

$$\texttt{carol} = \begin{cases} 3 & p = 0.4 \\ 8 & p = 0.6\end{cases}$$

| $\mathcal R$ | Name | ZipCode | Condition |
|---|---|---|---|
| 1 | Alice | 10003 | always |
| 2 | Bob | 14260 | if $\texttt{bob} = 4$ |
| 3 | Bob | 19260 | if $\texttt{bob} = 9$ |
| 4 | Carol | 13201 | if $\texttt{carol} = 3$ |
| 5 | Carol | 18201 | if $\texttt{carol} = 8$ |
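The four possible worlds and their probabilities follow directly from the two variable distributions above. A minimal sketch that enumerates them (variable names and probabilities taken from the example; since the variables are independent, each world's probability is the product of its choices):

```python
from itertools import product

# Probability distribution for each variable (from the example above)
variables = {
    "bob":   {4: 0.8, 9: 0.2},
    "carol": {3: 0.4, 8: 0.6},
}

# Each possible world is one assignment of values to variables.
worlds = []
for assignment in product(*[[(name, val, p) for val, p in dist.items()]
                            for name, dist in variables.items()]):
    world = {name: val for name, val, _ in assignment}
    prob = 1.0
    for _, _, p in assignment:
        prob *= p
    worlds.append((world, prob))

for world, prob in worlds:
    print(world, prob)
```

Note that the four probabilities sum to 1, as they must for a well-formed probabilistic database.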

```sql
SELECT COUNT(*)
FROM R NATURAL JOIN ZipCodeLookup
WHERE State = 'NY'
```


$$Q(\mathcal D) = \begin{cases} 1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\ 2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\ 3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3 \end{cases}$$

$$= \begin{cases} 1 & p = 0.2 \times 0.6\\ 2 & p = 0.8 \times 0.6 + 0.2 \times 0.4\\ 3 & p = 0.8 \times 0.4 \end{cases}$$

$$= \begin{cases} 1 & p = 0.12\\ 2 & p = 0.56\\ 3 & p = 0.32\end{cases}$$


$E\left[Q(\mathcal D)\right] = 0.12+1.12+0.96 = 2.20$

$P\left[Q(\mathcal D) \geq 2\right] = 0.56+0.32 = 0.88$
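The expectation and tail probability above are easy to check numerically from the result distribution just derived:

```python
# Result distribution of Q(D) derived above: value -> probability
dist = {1: 0.12, 2: 0.56, 3: 0.32}

# E[Q(D)] = sum of value * probability
expectation = sum(value * p for value, p in dist.items())

# P[Q(D) >= 2] = total probability mass at values 2 and 3
tail = sum(p for value, p in dist.items() if value >= 2)

print(round(expectation, 2))  # 2.2
print(round(tail, 2))         # 0.88
```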

In general, computing marginal probabilities for result tuples exactly is #P-hard.

... so we approximate

Idea 1: Sample. Pick (e.g.) 10 random possible worlds and compute results for each.

$$R_{1} \Leftarrow \{\; \texttt{bob} \rightarrow 9, \; \texttt{carol} \rightarrow 8\}$$

| $\mathcal R_{1}$ | Name | ZipCode |
|---|---|---|
| 1 | Alice | 10003 |
| 2 | Bob | 19260 |
| 3 | Carol | 18201 |

$$\mathcal Q = \{\;1\;\}$$

$$R_{2} \Leftarrow \{\; \texttt{bob} \rightarrow 9, \; \texttt{carol} \rightarrow 3\}$$

| $\mathcal R_{2}$ | Name | ZipCode |
|---|---|---|
| 1 | Alice | 10003 |
| 2 | Bob | 19260 |
| 3 | Carol | 13201 |

$$\mathcal Q = \{\;1,\;2\;\}$$

$$R_{3} \Leftarrow \{\; \texttt{bob} \rightarrow 9, \; \texttt{carol} \rightarrow 8\}$$

| $\mathcal R_{3}$ | Name | ZipCode |
|---|---|---|
| 1 | Alice | 10003 |
| 2 | Bob | 19260 |
| 3 | Carol | 18201 |

$$\mathcal Q = \{\;1,\;2,\;1\;\}$$

$$R_{4} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$

| $\mathcal R_{4}$ | Name | ZipCode |
|---|---|---|
| 1 | Alice | 10003 |
| 2 | Bob | 14260 |
| 3 | Carol | 18201 |

$$\mathcal Q = \{\;1,\;2,\;1,\;2\;\}$$

$$R_{5} \Leftarrow \{\; \texttt{bob} \rightarrow 9, \; \texttt{carol} \rightarrow 8\}$$

| $\mathcal R_{5}$ | Name | ZipCode |
|---|---|---|
| 1 | Alice | 10003 |
| 2 | Bob | 19260 |
| 3 | Carol | 18201 |

$$\mathcal Q = \{\;1,\;2,\;1,\;2,\;1\;\}$$
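The sampling loop above can be sketched end-to-end: draw a world, evaluate the query, repeat. A minimal sketch under stated assumptions (the `NYS_ZIPS` set is an illustrative stand-in for the `ZipCodeLookup` table, and `run_query` hard-codes the running example's three tuples):

```python
import random

# Variable distributions from the running example
VARIABLES = {
    "bob":   [(4, 0.8), (9, 0.2)],
    "carol": [(3, 0.4), (8, 0.6)],
}

# Illustrative stand-in for the NY rows of ZipCodeLookup
NYS_ZIPS = {"10003", "14260", "13201"}

def sample_world(rng):
    """Draw one possible world by sampling each variable independently."""
    world = {}
    for name, dist in VARIABLES.items():
        values, weights = zip(*dist)
        world[name] = rng.choices(values, weights=weights)[0]
    return world

def run_query(world):
    """COUNT(*) of tuples whose zip code is in NYS, under this world."""
    rows = [
        ("Alice", "10003"),
        ("Bob",   "14260" if world["bob"] == 4 else "19260"),
        ("Carol", "13201" if world["carol"] == 3 else "18201"),
    ]
    return sum(1 for _, zipcode in rows if zipcode in NYS_ZIPS)

rng = random.Random(42)
samples = [run_query(sample_world(rng)) for _ in range(10)]
print(samples)                      # one query result per sampled world
print(sum(samples) / len(samples))  # Monte Carlo estimate of E[Q(D)]
```

With enough samples the average converges to the exact expectation of 2.20 computed earlier, but each sample requires a full query evaluation.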

Problem: Sloooooooooooow.

Can we make it faster?

Idea 1.A: Combine all samples into one query.

| $\mathcal R$ | Name | ZipCode | $\mathcal{ID}$ |
|---|---|---|---|
| 1 | Alice | 10003 | 1 |
| 2 | Bob | 19260 | 1 |
| 3 | Carol | 18201 | 1 |
| 4 | Alice | 10003 | 2 |
| 5 | Bob | 19260 | 2 |
| 6 | Carol | 13201 | 2 |
| 7 | Alice | 10003 | 3 |
| 8 | Bob | 19260 | 3 |
| 9 | Carol | 18201 | 3 |
| 10 | Alice | 10003 | 4 |
| 11 | Bob | 14260 | 4 |
| 12 | Carol | 18201 | 4 |
| 13 | Alice | 10003 | 5 |
| 14 | Bob | 19260 | 5 |
| 15 | Carol | 18201 | 5 |

| $\mathcal Q$ | Count | $\mathcal{ID}$ |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 1 | 3 |
| 4 | 2 | 4 |
| 5 | 1 | 5 |

### Querying Joint Sample Tables

$\pi_A(R) \rightarrow$ $\pi_{A, \mathcal{ID}}(R)$

$\sigma_\phi(R) \rightarrow$ $\sigma_{\phi}(R)$

$R \uplus S \rightarrow$ $R \uplus S$

$R \times S \rightarrow$ $\pi_{R.*, S.*, R.\mathcal{ID}}\big($$\sigma_{R.\mathcal{ID} = S.\mathcal{ID}}($$ R \times S)\big)$

$\delta R \rightarrow$ $\delta R$

$_A\gamma_{Agg(*)}(R) \rightarrow$ $_{A, \mathcal{ID}}\gamma_{Agg(*)}(R)$
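The rewrites above amount to threading the sample identifier through the plan: joins add an $\mathcal{ID}$ equality condition, and aggregates add $\mathcal{ID}$ to the group-by clause so each sample is aggregated separately. A sketch of the aggregate case in Python terms (the row layout is hypothetical, with one sample identifier per tuple):

```python
from collections import defaultdict

# Joint sample table: each row carries the sample ID it belongs to
rows = [
    # (Name, ZipCode, sample_id)
    ("Alice", "10003", 1), ("Bob", "19260", 1), ("Carol", "18201", 1),
    ("Alice", "10003", 2), ("Bob", "19260", 2), ("Carol", "13201", 2),
]

# Illustrative stand-in for the NY rows of ZipCodeLookup
NYS_ZIPS = {"10003", "14260", "13201"}

# COUNT(*) ... GROUP BY sample_id: one result row per sample
counts = defaultdict(int)
for _, zipcode, sample_id in rows:
    if zipcode in NYS_ZIPS:
        counts[sample_id] += 1

print(dict(counts))  # {1: 1, 2: 2}
```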

Still sloooooow.

There's a lot of repetition.

Idea 1.B: Use native array types in DBs

### Tuple Bundles

| $\mathcal R$ | Name | ZipCode |
|---|---|---|
| 1 | Alice | 10003 |
| 2 | Bob | [19260, 19260, 19260, 14260, 19260] |
| 3 | Carol | [18201, 13201, 18201, 18201, 18201] |

MCDB: A Monte Carlo Approach to Managing Uncertain Data (Jampani et al.)

### Querying Tuple Bundles

$\pi_A(R) \rightarrow$ $\pi_{A}(R)$

$\sigma_\phi(R) \rightarrow$ ?

Idea 1.B': Also mark which tuples are present in which samples

| $\mathcal R$ | Name | ZipCode | $\mathcal W$ |
|---|---|---|---|
| 1 | Alice | 10003 | 11111 |
| 2 | Bob | [19260, 19260, 19260, 14260, 19260] | 11111 |
| 3 | Carol | [18201, 13201, 18201, 18201, 18201] | 11111 |

↓ $\sigma_{InNYS(ZipCode)}(\mathcal R)$ ↓

| $\mathcal R$ | Name | ZipCode | $\mathcal W$ |
|---|---|---|---|
| 1 | Alice | 10003 | 11111 |
| 2 | Bob | [19260, 19260, 19260, 14260, 19260] | 00010 |
| 3 | Carol | [18201, 13201, 18201, 18201, 18201] | 01000 |
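The selection above is simple bit-twiddling: evaluate the predicate once per sample and AND the outcomes with the tuple's existing presence mask. A sketch under stated assumptions (the `NYS_ZIPS` set stands in for `InNYS`, and the mask is represented as a string with the leftmost bit for sample 1, matching the tables above):

```python
# Illustrative stand-in for the InNYS predicate
NYS_ZIPS = {10003, 14260, 13201}

def in_nys(zipcode):
    return zipcode in NYS_ZIPS

def select(bundle_zip, w):
    """Apply sigma_{InNYS(ZipCode)} to one tuple bundle.

    bundle_zip: a single value, or a list with one value per sample.
    w: presence mask as a string like '11111' (leftmost bit = sample 1).
    """
    out = []
    for i, bit in enumerate(w):
        value = bundle_zip[i] if isinstance(bundle_zip, list) else bundle_zip
        out.append("1" if bit == "1" and in_nys(value) else "0")
    return "".join(out)

print(select(10003, "11111"))                                 # 11111
print(select([19260, 19260, 19260, 14260, 19260], "11111"))   # 00010
print(select([18201, 13201, 18201, 18201, 18201], "11111"))   # 01000
```

In a real array-typed DBMS the mask would be a machine-word bitmask and the AND a single instruction; the string form here just makes the per-sample semantics visible.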

### Querying Tuple Bundles

$\pi_A(R) \rightarrow$ $\pi_{A}(R)$

$\sigma_\phi(R) \rightarrow$ $\sigma_{\mathcal W \neq 0}\big(\pi_{\mathcal W \;\&\; \vec \phi}(R)\big)$

$R \uplus S \rightarrow$ $R \uplus S$

$R \times S \rightarrow$ $\sigma_{\mathcal{W} \neq 0}\big(\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}(R \times S)\big)$

$_A\gamma_{Agg(B)}(R) \rightarrow$ $_A\gamma_{[Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots]}(R)$

(Generate aggregates for each sample separately)

Good luck ever doing an equi-join. Hope your group-by variables aren't uncertain.

• Inefficient equi-joins on uncertain variables.
• Inefficient aggregates with uncertain variables.
• How many samples are necessary to get the desired precision?

Idea 2: Symbolic Execution (Provenance)

$$\sigma_{count \geq 2}(Q) \;=\; (\texttt{bob} = 4 \wedge \texttt{carol} = 8) \;\vee\; (\texttt{bob} = 9 \wedge \texttt{carol} = 3) \;\vee\; (\texttt{bob} = 4 \wedge \texttt{carol} = 3)$$

$$P[\sigma_{count \geq 2}(Q)] = \;?\; \approx \text{\#SAT}$$

### Computing Probabilities

$P[\texttt{x} \wedge \texttt{y}] = P[\texttt{x}] \cdot P[\texttt{y}]$ (iff $\texttt{x}$ and $\texttt{y}$ are independent)

$P[\texttt{x} \wedge \texttt{y}] = 0$ (iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)

$P[\texttt{x} \vee \texttt{y}] = 1 - (1-P[\texttt{x}]) \cdot (1-P[\texttt{y}])$ (iff $\texttt{x}$ and $\texttt{y}$ are independent)

$P[\texttt{x} \vee \texttt{y}] = P[\texttt{x}] + P[\texttt{y}]$ (iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)

Good enough to get us the probability of any boolean formula over mutually exclusive or independent variables.

... and otherwise?
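These rules already suffice for the running example: each conjunct pairs independent variables (multiply), and the three disjuncts of $\sigma_{count \geq 2}(Q)$ are pairwise mutually exclusive, since any two of them assign a different value to $\texttt{bob}$ or $\texttt{carol}$ (add). A quick check:

```python
# P[bob = v] and P[carol = v] from the running example
P_BOB = {4: 0.8, 9: 0.2}
P_CAROL = {3: 0.4, 8: 0.6}

# Each conjunct pairs independent variables: multiply.
disjuncts = [
    P_BOB[4] * P_CAROL[8],  # bob = 4 AND carol = 8
    P_BOB[9] * P_CAROL[3],  # bob = 9 AND carol = 3
    P_BOB[4] * P_CAROL[3],  # bob = 4 AND carol = 3
]

# The disjuncts are mutually exclusive: add.
print(round(sum(disjuncts), 2))  # 0.88
```

This matches the $P[Q(\mathcal D) \geq 2] = 0.88$ computed earlier by enumerating worlds.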
### Shannon Expansion

For a boolean formula $f$ and a variable $\texttt{x}$:

$$f = (\texttt{x} \wedge f[\texttt{x}\backslash T]) \vee (\neg \texttt{x} \wedge f[\texttt{x}\backslash F])$$

• A disjunction of mutually exclusive terms!
• ... each a conjunction of independent terms.
• ... and $\texttt{x}$ is removed from $f$.

Ok... just keep applying Shannon!

Each application creates 2 new formulas (ExpTime!)
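Repeated Shannon expansion gives a direct (worst-case exponential) algorithm: pick a variable, recurse on both cofactors, and combine with the mutually-exclusive-or rule. A minimal sketch over formulas represented as Python callables on a truth assignment (this representation is illustrative, not from the slides):

```python
def probability(formula, var_probs):
    """P[formula] by exhaustive Shannon expansion.

    formula: callable taking a dict {var: True/False}
    var_probs: dict {var: P[var is True]}, variables assumed independent
    """
    def expand(assignment, remaining):
        if not remaining:
            return 1.0 if formula(assignment) else 0.0
        x, rest = remaining[0], remaining[1:]
        p = var_probs[x]
        # f = (x AND f[x\T]) OR (NOT x AND f[x\F]): branches are mutually
        # exclusive, and x is independent of the residual formula.
        return (p * expand({**assignment, x: True}, rest)
                + (1 - p) * expand({**assignment, x: False}, rest))

    return expand({}, list(var_probs))

# Encode the example's choices as booleans: b4 = "bob = 4", c3 = "carol = 3"
f = lambda a: ((a["b4"] and not a["c3"]) or (not a["b4"] and a["c3"])
               or (a["b4"] and a["c3"]))
print(round(probability(f, {"b4": 0.8, "c3": 0.4}), 2))  # 0.88
```

Each recursion level doubles the number of branches, which is exactly the ExpTime blow-up noted above; hence the hybrid in Idea 2.A.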

Idea 2.A: Combine the two. Use Shannon expansion as long as time/resources permit, then use a #SAT approximation.

Sprout: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases (Olteanu et al.)

### More Resources

• MCDB: Sampling-based probabilistic databases.
• Sprout: "Any-time" approximation.
• Mimir: PL tricks to make ProbDBs faster.
• DeepDive: ProbDBs used in practice to populate knowledge bases.
• Integrating and Ranking Uncertain Scientific Data: ProbDBs used in practice to predict gene expressions / propose experiments.