
Dynamic Imputation

CSE 662 Fall 2019

November 25

Missing Data

Missing Values

What is your age?
18-34
35-49
50-64
65+
Prefer not to answer

Errors

Problem: You want to analyze a new dataset where specific values are missing or clearly wrong.

Drop Invalid Data

SELECT * FROM R WHERE X IS NOT NULL

Advantages
Easy and fast.
Compatible with scale-free aggregates (max, min, avg).
Disadvantages
Breaks scaling aggregates (sum, count).
Can be unsafe if data is correlated.
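As a quick sketch of the aggregate distinction, in plain Python (the column and values are made up):

# Toy column; None stands in for SQL NULL.
values = [10, None, 30, None, 50]

# Drop NULLs, as the WHERE clause above does.
kept = [v for v in values if v is not None]

# Scale-free aggregates still behave sensibly:
print(max(kept), min(kept), sum(kept) / len(kept))  # 50 10 30.0

# Scaling aggregates silently shrink with the data:
print(sum(kept), len(kept))  # 90 and 3: not the "true" sum/count over 5 rows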

Interpolation

$\hat{X}_i = \begin{cases} X_i &\textbf{if } X_i \text{ is present}\\ \frac{X_{i-1}+X_{i+1}}{2}& \textbf{otherwise}\end{cases}$

Advantages
Can be fast.
Rarely a problem with correlations.
Disadvantages
Requires an appropriate order...
...if one even exists
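A minimal Python sketch of the midpoint rule above, assuming the column is already in an appropriate order and gaps are isolated (runs of consecutive NULLs would need full linear interpolation):

def interpolate(xs):
    # Fill each isolated None with the mean of its two neighbors.
    out = list(xs)
    for i, v in enumerate(out):
        if v is None and 0 < i < len(out) - 1 \
                and out[i - 1] is not None and out[i + 1] is not None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out

print(interpolate([1, None, 3, 4, None, 6]))  # [1, 2.0, 3, 4, 5.0, 6]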

"Hot Deck" Imputation

Replace invalid values with the last valid value

Advantages
Usually accurate.
General.
Disadvantages
Requires shuffling and/or sorting.
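A minimal sketch of the carry-forward rule in Python (leading NULLs have no donor value and stay missing):

def hot_deck(xs):
    # Carry the last valid value forward over any NULL run.
    out, last = [], None
    for v in xs:
        if v is not None:
            last = v
        out.append(last)
    return out

print(hot_deck([None, 7, None, None, 9]))  # [None, 7, 7, 7, 9]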

Mean Value Imputation

Replace invalid values with the column mean

Advantages
Easy and somewhat fast.
General.
Disadvantages
Requires computing the mean.
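A Python sketch; the extra pass to compute the mean is exactly the cost flagged above:

def mean_impute(xs):
    # One pass to compute the mean of observed values, one pass to repair.
    observed = [v for v in xs if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in xs]

print(mean_impute([10, None, 20]))  # [10, 15.0, 20]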

Classifier Imputation

Replace invalid values with the prediction of a classifier trained on the valid rows

Advantages
General.
Resilient to correlations.
Disadvantages
Requires training a classifier.
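An illustrative sketch using scikit-learn's DecisionTreeRegressor (decision trees reappear in the penalty model below; any trainable model works). It assumes the remaining columns are numeric and complete:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def classifier_impute(X, col):
    # Train on rows where X[:, col] is observed; predict where it is NaN.
    miss = np.isnan(X[:, col])
    features = np.delete(X, col, axis=1)  # assumed complete
    model = DecisionTreeRegressor().fit(features[~miss], X[~miss, col])
    X[miss, col] = model.predict(features[miss])
    return X

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan]])
print(classifier_impute(X, col=1))  # last row's NaN replaced by a prediction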

Missing Values

Drop missing values: fast.
Impute (with some method X): accurate.

Which to use?

Observations

  • Some columns are not relevant to the current workload.
  • Some rows will get filtered out by query operators.
  • Joins might fill in missing correlations.

Idea: Make the query optimizer aware of imputation needs.

In-Engine Imputation

Two new operators enforce COLUMN NOT NULL constraints.

Drop ($\delta_{C}$)
Non-blocking operator discards tuples if $C = \texttt{NULL}$.
Impute ($\mu_{C}$)
Blocking operator that replaces $C$ when $C = \texttt{NULL}$.
Train + Repair in one step.
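A schematic of the two operators over a stream of tuples (plain Python dicts), showing why $\delta$ can stream while $\mu$ must block:

def drop_op(tuples, c):
    # delta_C: non-blocking; discard tuples where c is NULL.
    for t in tuples:
        if t[c] is not None:
            yield t

def impute_op(tuples, c, fit, predict):
    # mu_C: blocking; buffer the whole input, train, then repair and emit.
    # fit/predict are hypothetical stand-ins for the imputation model.
    buffered = list(tuples)
    model = fit([t for t in buffered if t[c] is not None])
    for t in buffered:
        yield t if t[c] is not None else {**t, c: predict(model, t)}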

Challenges

  1. $\delta$ or $\mu$?
  2. Where in the plan to repair?

Basic query optimization problem!

The user manually identifies "dirty" input table columns.

Intuition: An attribute is dirty if it contains nulls that must be repaired.

Constraint: A dirty attribute must be repaired before it is used (e.g., by a selection or aggregate).

Question: Which columns are dirty?

$$\textbf{Dirty}(R) = \text{The dirty columns of R}$$

$$\textbf{Dirty}(\pi_C(q)) = \textbf{Dirty}(q) \cap C$$

$$\textbf{Dirty}(\sigma(q)) = \textbf{Dirty}(q)$$

$$\textbf{Dirty}(q_1 \bowtie q_2) = \textbf{Dirty}(q_1) \cup \textbf{Dirty}(q_2)$$

$$\textbf{Dirty}(\delta_C(q)) = \textbf{Dirty}(\mu_C(q)) = \textbf{Dirty}(q) - C$$
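The rules above translate directly into a recursive pass over the plan. A sketch, with an illustrative (not from the paper) tuple encoding of plan nodes:

# Node encodings (illustrative): ("table", name, dirty_cols),
# ("project", cols, q), ("select", q), ("join", q1, q2),
# ("drop", cols, q), ("impute", cols, q).
def dirty(q):
    op = q[0]
    if op == "table":
        return set(q[2])
    if op == "project":
        return dirty(q[2]) & set(q[1])
    if op == "select":
        return dirty(q[1])
    if op == "join":
        return dirty(q[1]) | dirty(q[2])
    if op in ("drop", "impute"):
        return dirty(q[2]) - set(q[1])

plan = ("select", ("impute", {"age"}, ("table", "R", {"age", "income"})))
print(dirty(plan))  # {'income'}: age was repaired, income is still dirty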

Cost Model

Original Query → Optimizer → Optimized Query

A Classical Query Optimizer

  • Selection Pushdown
  • Join Reordering
  • Join Selection
  • Projection Inlining
  • Dead Code Elimination
  • ... and more ...

Optimization Rules

When a query subtree matches a specific pattern...

... (optionally) replace it with a different pattern.

Imputation

Question: How to inject repair operators?

Challenge: Changes require re-evaluating the cost model.

Idea: Restrict variations in columns being repaired.

Imputation

Before each operator...

  • Repair only the dirty columns used by the operator, OR repair all dirty columns in the input.
  • Drop OR Impute.

($2\times 2 = 4$ options for each operator)
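A sketch of how the four candidates might be enumerated for one operator (names are illustrative, not from the paper):

from itertools import product

def repair_candidates(cols_used, input_dirty):
    # The 2x2 grid: {only the dirty columns the operator uses, all dirty
    # columns in the input} x {drop, impute}.
    needed = cols_used & input_dirty
    for cols, kind in product((needed, input_dirty), ("drop", "impute")):
        if cols:
            yield (kind, frozenset(cols))

for cand in repair_candidates({"age"}, {"age", "income"}):
    print(cand)  # drop/impute over {age} and over {age, income}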

Optimize the query normally.

Try every plan that satisfies the constraint on dirty columns.

Optimize for precision and time.

Accelerating the Search

  • Cache performance/cardinality/dirty column results for intermediate subtrees
  • Drop/ignore subtrees with the same dirty column set that are worse in both time and precision than another plan.
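The second rule is Pareto pruning over (time, precision) per dirty-column set. A sketch, with plans as hypothetical dicts:

def pareto_prune(plans):
    # Keep a plan only if no other plan with the same dirty-column set
    # is at least as good on both axes and strictly better on one.
    def dominated(p):
        return any(q["dirty"] == p["dirty"]
                   and q["time"] <= p["time"] and q["penalty"] <= p["penalty"]
                   and (q["time"] < p["time"] or q["penalty"] < p["penalty"])
                   for q in plans)
    return [p for p in plans if not dominated(p)]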

Cost Model

Need to predict the computational cost of the plan.

Need to predict the accuracy of the result.

For both, we need histograms.

Histograms

Propagating per-column histograms is a standard part of query optimizers.

How do $\delta$ and $\mu$ propagate histograms?

  1. Figure out the number of rows (cardinality) after the operator.
  2. Rescale the histograms to the new number of rows.

Drop ($\delta_C$)

Conservative assumption of full overlap between NULLs

  1. $\textbf{card}(\delta_C(R)) = \textbf{card}(R) - \max_{col \in C}(R.col.\text{hist}[\texttt{NULL}])$
  2. $\delta_C(R).col.\text{hist}[v] = R.col.\text{hist}[v]$ (except $v = \texttt{NULL}$, which becomes $0$ for $col \in C$)

Impute ($\mu_C$)

Assume distribution of values unchanged

  1. $\textbf{card}(\mu_C(R)) = \textbf{card}(R)$
  2. $\mu_C(R).col.\text{hist}[v] = R.col.\text{hist}[v] \cdot \frac{\textbf{card}(R)}{\textbf{card}(R) - R.col.\text{hist}[\texttt{NULL}]}$ for $col \in C$ (except $v = \texttt{NULL}$, which becomes $0$)
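Both propagation rules in one sketch, with histograms as value-to-count dicts and None as the NULL bucket (an assumed encoding):

def propagate_drop(card, hists, C):
    # delta_C: conservatively assume the NULL sets of columns in C overlap.
    new_card = card - max(hists[c].get(None, 0) for c in C)
    new_hists = {col: ({v: n for v, n in h.items() if v is not None}
                       if col in C else h)
                 for col, h in hists.items()}
    return new_card, new_hists

def propagate_impute(card, hists, C):
    # mu_C: same cardinality; redistribute each NULL bucket proportionally.
    new_hists = dict(hists)
    for c in C:
        scale = card / (card - hists[c].get(None, 0))
        new_hists[c] = {v: n * scale for v, n in hists[c].items()
                        if v is not None}
    return card, new_hists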

Estimating Precision

Given: Penalty function $P(\mu_C(q)) \in [0,1]$

0 = perfect replacement

1 = as bad as dropping all columns in $C$

Example: Decision Trees

$$|C|\cdot \frac{1}{\sqrt{|\textbf{attrs}(q)|\cdot \textbf{card}(q)}}$$
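Plugging made-up numbers into this heuristic shows its shape: imputing $|C| = 2$ columns with $|\textbf{attrs}(q)| = 10$ and $\textbf{card}(q) = 10000$ rows to train on gives

from math import sqrt
# Hypothetical figures: |C| = 2, |attrs(q)| = 10, card(q) = 10000.
print(2 / sqrt(10 * 10000))  # ~0.0063: more training data => lower penalty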

Query Penalty

$$L(q) = \begin{cases} 1 + L(q') & \textbf{if } q = \delta_C(q') \\ P(q) + L(q') & \textbf{if } q = \mu_C(q') \\ L(q_1) + L(q_2) & \textbf{if } q = q_1 \bowtie q_2\\ L(q') & \textbf{if } q = \sigma(q') \textbf{ or } q = \pi(q')\\ 0 & \textbf{if } q \text{ is a table} \end{cases}$$

Total penalties judged relative to a plan that drops all attributes.
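The case analysis maps directly onto the same tuple-encoded plans used in the dirty-set sketch; P is the per-impute penalty function from the previous slide:

def L(q, P):
    # Recursive query penalty: drops cost 1, imputes cost P(q),
    # joins sum their children, everything else passes through.
    op = q[0]
    if op == "drop":
        return 1 + L(q[2], P)
    if op == "impute":
        return P(q) + L(q[2], P)
    if op == "join":
        return L(q[1], P) + L(q[2], P)
    if op in ("select", "project"):
        return L(q[-1], P)
    return 0  # base table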

Estimated Query Cost

Handled as normal by the query optimizer.

  • $\delta_C \approx \sigma$
  • $\mu_C$ cost estimated by given function $T(q)$.

Concerns

Later imputation means more opportunities to drop tuples naturally, but some joins produce larger output relations.

Larger data is sometimes better (more opportunities to find correlations), but not always.

Concerns

Penalty function is measured per-operator rather than per-column.

Advantages
Biases search towards fewer repair operators.
Disadvantages
One bad imputation covering two columns may have a lower penalty than one good imputation for one column.
Does not measure the impact of dropping non-null values.
Penalty effects not scaled by cardinality.