February 13, 2019
Like normal projection, but can create new columns
π_{M ← A+B∗C, N ← 2}(R) produces 1 row for every row of R, with 2 columns: M and N
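A minimal Python sketch of this extended projection, assuming each row of R is a dict with columns A, B, and C:

```python
# Sketch of extended projection pi_{M <- A + B*C, N <- 2}(R):
# evaluate the expressions once per input row, emitting columns M and N.
def extended_projection(rows):
    for row in rows:
        yield {"M": row["A"] + row["B"] * row["C"], "N": 2}

R = [{"A": 1, "B": 2, "C": 3}, {"A": 4, "B": 5, "C": 6}]
print(list(extended_projection(R)))  # [{'M': 7, 'N': 2}, {'M': 34, 'N': 2}]
```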
... but first
NULL Values
SQL uses a special NULL value for these cases.
NULL makes things more complicated.
What happens if Trees.SPC_COMMON is NULL?
Unknown AND Unknown ≡ Unknown
Unknown AND True ≡ Unknown
Unknown AND False ≡ False
Unknown OR Unknown ≡ Unknown
Unknown OR True ≡ True
Unknown OR False ≡ Unknown
NOT Unknown ≡ Unknown
WHERE clauses eliminate all non-True rows.
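A minimal Python sketch of this three-valued logic, using None for Unknown/NULL, plus a WHERE that keeps only rows whose predicate evaluates to exactly True:

```python
# Three-valued logic: None plays the role of Unknown.
def and3(a, b):
    if a is False or b is False:
        return False            # Unknown AND False = False
    if a is None or b is None:
        return None             # Unknown AND (True or Unknown) = Unknown
    return True

def or3(a, b):
    if a is True or b is True:
        return True             # Unknown OR True = True
    if a is None or b is None:
        return None             # Unknown OR (False or Unknown) = Unknown
    return False

def not3(a):
    return None if a is None else (not a)

# WHERE keeps only rows whose predicate is exactly True; False and Unknown rows are dropped.
def where(rows, predicate):
    return [r for r in rows if predicate(r) is True]
```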
What happens if some streets have no trees?
Streets
District | StreetName |
---|---|
A | Apple St. |
B | Border St. |
C | Coventry St. |
D | Durham St. |
Trees
TreeId | StreetName |
---|---|
1 | Apple St. |
2 | Border St. |
3 | Durham St. |
4 | Elicott St. |
Streets ⟗ Trees
District | StreetName | TreeId | StreetName |
---|---|---|---|
A | Apple St. | 1 | Apple St. |
B | Border St. | 2 | Border St. |
C | Coventry St. | NULL | NULL |
D | Durham St. | 3 | Durham St. |
NULL | NULL | 4 | Elicott St. |
⟕ LEFT outer join: keep unmatched rows from the left relation only (Coventry St. appears with NULLs; the Elicott St. tree row is dropped).
⟖ RIGHT outer join: keep unmatched rows from the right relation only (the Elicott St. tree appears with NULLs; Coventry St. is dropped).
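A minimal Python sketch of the full outer join above (nested loops, with None standing in for NULL). Dropping the final loop gives the LEFT variant; keeping only the matches plus the unmatched right rows gives the RIGHT variant:

```python
# Full outer join of Streets and Trees on StreetName.
# Result tuples: (District, Streets.StreetName, TreeId, Trees.StreetName).
def full_outer_join(streets, trees):
    out, matched = [], set()
    for s in streets:
        found = False
        for i, t in enumerate(trees):
            if s["StreetName"] == t["StreetName"]:
                out.append((s["District"], s["StreetName"], t["TreeId"], t["StreetName"]))
                matched.add(i)
                found = True
        if not found:   # unmatched left row: NULL-pad the Trees columns
            out.append((s["District"], s["StreetName"], None, None))
    for i, t in enumerate(trees):
        if i not in matched:  # unmatched right row: NULL-pad the Streets columns
            out.append((None, None, t["TreeId"], t["StreetName"]))
    return out
```

Run on the Streets and Trees rows above, this produces exactly the five result rows shown.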
τ_A(R): The tuples of R in ascending order according to attribute A
L_n(R): The first n tuples of R
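A minimal Python sketch of both operators over an in-memory list of rows:

```python
# tau_A(R): sort the rows ascending by attribute A.
def tau(rows, attr):
    return sorted(rows, key=lambda r: r[attr])

# L_n(R): keep only the first n rows.
def limit(rows, n):
    return rows[:n]
```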
Pick your favorite sort algorithm.
What happens if you don't have enough memory?
Key Idea: Merging 2 sorted lists requires O(1) memory.
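A minimal Python sketch of that idea: merging two already-sorted streams while holding only the current head of each:

```python
# Merge two sorted iterables lazily; working memory is just two buffered elements.
def merge(left, right):
    left, right = iter(left), iter(right)
    l = next(left, None)
    r = next(right, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l <= r):
            yield l
            l = next(left, None)
        else:
            yield r
            r = next(right, None)
```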
Repeat Pass 2 As Needed.
What's the bottleneck?
IO Cost: O(N ⋅ ⌈log₂(N)⌉) (with N blocks)
On average, we'll get runs of size 2⋅|WS| (twice the size of the in-memory working set).
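Putting the pieces together, a minimal Python sketch of 2-way external merge sort. It reuses the merge above; runs are plain in-memory lists standing in for sorted files, and Pass 1 simply sorts memory-sized chunks (so runs come out at size |WS| rather than the 2⋅|WS| from the run-formation trick mentioned above):

```python
# Pass 1: read memory_size rows at a time, sort each chunk into a run.
def make_runs(rows, memory_size):
    return [sorted(rows[i:i + memory_size]) for i in range(0, len(rows), memory_size)]

# Pass 2+: merge runs pairwise until one remains. Each pass halves the number of
# runs and touches all the data once, giving about ceil(log2(#runs)) merge passes.
def external_sort(rows, memory_size):
    runs = make_runs(rows, memory_size)
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                next_runs.append(list(merge(runs[i], runs[i + 1])))  # merge() sketched above
            else:
                next_runs.append(runs[i])
        runs = next_runs
    return runs[0] if runs else []
```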
SELECT COUNT(*) FROM R
SELECT SUM(A) FROM R
SELECT A, SUM(B) FROM R GROUP BY A
SELECT DISTINCT A FROM R
SELECT A FROM R GROUP BY A
TREE_ID | SPC_COMMON | BORONAME | TREE_DBH |
---|---|---|---|
180683 | 'red maple' | 'Queens' | 3 |
315986 | 'pin oak' | 'Queens' | 21 |
204026 | 'honeylocust' | 'Brooklyn' | 3 |
204337 | 'honeylocust' | 'Brooklyn' | 10 |
189565 | 'American linden' | 'Brooklyn' | 21 |
... and 683783 more |
SELECT COUNT(*) FROM TREES
TREE_ID | SPC_COMMON | BORONAME | TREE_DBH | COUNT so far |
---|---|---|---|---|
(start) | | | | 0 |
180683 | 'red maple' | 'Queens' | 3 | 1 |
315986 | 'pin oak' | 'Queens' | 21 | 2 |
204026 | 'honeylocust' | 'Brooklyn' | 3 | 3 |
204337 | 'honeylocust' | 'Brooklyn' | 10 | 4 |
189565 | 'American linden' | 'Brooklyn' | 21 | 5 |
... and 683783 more | | | | 683788 |
SELECT SUM(TREE_DBH) FROM TREES
TREE_ID | SPC_COMMON | BORONAME | TREE_DBH | SUM so far |
---|---|---|---|---|
(start) | | | | 0 |
180683 | 'red maple' | 'Queens' | 3 | 3 |
315986 | 'pin oak' | 'Queens' | 21 | 24 |
204026 | 'honeylocust' | 'Brooklyn' | 3 | 27 |
204337 | 'honeylocust' | 'Brooklyn' | 10 | 37 |
189565 | 'American linden' | 'Brooklyn' | 21 | 58 |
... and 683783 more | | | | ... |
This is also sometimes called a "fold"
Gray et al., "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals"
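In that spirit, a minimal Python sketch of COUNT(*) and SUM(A) written as folds: an accumulator plus a per-row update function (the rows here are hypothetical dicts):

```python
from functools import reduce

# COUNT(*): start at 0, add 1 per row.
def count_star(rows):
    return reduce(lambda acc, row: acc + 1, rows, 0)

# SUM(attr): start at 0, add the row's value.
def sum_of(rows, attr):
    return reduce(lambda acc, row: acc + row[attr], rows, 0)

trees = [{"TREE_DBH": 3}, {"TREE_DBH": 21}, {"TREE_DBH": 3}]
print(count_star(trees))          # 3
print(sum_of(trees, "TREE_DBH"))  # 27
```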
SELECT SPC_COMMON, COUNT(*) FROM TREES GROUP BY SPC_COMMON
Naive Idea: Keep a separate accumulator for each group
TREE_ID | SPC_COMMON | BORONAME | TREE_DBH | Accumulators so far |
---|---|---|---|---|
(start) | | | | {} |
180683 | 'red maple' | 'Queens' | 3 | { 'red maple' = 1 } |
204337 | 'honeylocust' | 'Brooklyn' | 10 | { 'red maple' = 1, 'honeylocust' = 1 } |
315986 | 'pin oak' | 'Queens' | 21 | { 'red maple' = 1, 'honeylocust' = 1, 'pin oak' = 1 } |
204026 | 'honeylocust' | 'Brooklyn' | 3 | { 'red maple' = 1, 'honeylocust' = 2, 'pin oak' = 1 } |
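A minimal Python sketch of this naive idea, keeping an in-memory dict of per-group counters keyed by SPC_COMMON:

```python
# SELECT key, COUNT(*) FROM rows GROUP BY key, done with one accumulator per group.
def group_by_count(rows, key):
    accumulators = {}
    for row in rows:
        k = row[key]
        accumulators[k] = accumulators.get(k, 0) + 1
    return accumulators

trees = [{"SPC_COMMON": "red maple"}, {"SPC_COMMON": "honeylocust"},
         {"SPC_COMMON": "pin oak"}, {"SPC_COMMON": "honeylocust"}]
print(group_by_count(trees, "SPC_COMMON"))
# {'red maple': 1, 'honeylocust': 2, 'pin oak': 1}
```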
What could go wrong?
TREE_ID | SPC_COMMON | BORONAME | TREE_DBH | Accumulators so far |
---|---|---|---|---|
(start) | | | | {} |
204337 | 'honeylocust' | 'Brooklyn' | 10 | { 'honeylocust' = 1 } |
204026 | 'honeylocust' | 'Brooklyn' | 3 | { 'honeylocust' = 2 } |
... and more | | | | |
315986 | 'pin oak' | 'Queens' | 21 | { 'honeylocust' = 3206, 'pin oak' = 1 } |
... and more | | | | |
180683 | 'red maple' | 'Queens' | 3 | { 'pin oak' = 53814, 'red maple' = 1 } |