February 11, 2019
Databases are usually IO- or Memory-bound
Core Question: Do we have enough memory to use this operator?
IO cost is measured in the number of pages (blocks) read from or written to disk.
Figure out the cost of each individual operator.
Only count the number of IOs added by each operator.
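The per-operator accounting can be sketched as a simple sum; the page counts below are hypothetical.

```python
# Sketch: total plan cost = sum of the IOs each operator adds.
# All page counts are made up for illustration.
scan_R = 1000    # table scan: reads every page of R once
scan_S = 500     # table scan: reads every page of S once
select_R = 0     # a streaming selection adds no IOs of its own
plan_cost = scan_R + scan_S + select_R
print(plan_cost)
```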
If S is itself a query (not a base table), re-evaluating it repeatedly can get very expensive.
Optimization 1: Cache S in memory
Optimization 2: Cache S on disk
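Optimization 1 can be sketched as follows, assuming a hypothetical iterable interface for the operands: S is materialized in memory once, then reused for every record of R.

```python
def apply_cross_cached(lhs, rhs):
    # Optimization 1 (sketch): materialize all of S in memory once,
    # instead of re-running the rhs query per record of R.
    s_cached = list(rhs)
    result = []
    for r in lhs:
        for s in s_cached:
            result += [r + s]
    return result
```

This only works when S fits in memory, which motivates the disk-cache variant and the middle ground discussed next.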
Is there a middle ground?
Problem: We need to evaluate the rhs iterator once per record in lhs.
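The problem is easiest to see in the naive record-at-a-time version; the `reset()` method here is a hypothetical iterator interface that restarts the rhs scan (re-running the query if S is a view).

```python
def apply_cross_naive(lhs, rhs):
    result = []
    for r in lhs:
        rhs.reset()      # S is evaluated again for EVERY record of R
        for s in rhs:
            result += [r + s]
    return result
```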
Better Solution: Load both lhs and rhs records in blocks.
def apply_cross(lhs, rhs):
    result = []
    # Read R in blocks of 100 records; take() returns an
    # empty (falsy) block once the input is exhausted.
    while r_block := lhs.take(100):
        # For each block of R, scan S in blocks as well.
        while s_block := rhs.take(100):
            for r in r_block:
                for s in s_block:
                    result += [r + s]
        rhs.reset()  # rewind S so the next block of R can rescan it
    return result
(with B as the block size for R, and with caching S to disk, the IO cost is |R| + ⌈|R|/B⌉ · |S|)
In-memory caching of S is a special case of block-nested loop where the block holds all of S (block size = |S|).
Does the block size for R matter?
How big should the blocks be?
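The block size matters a great deal. Under the standard block-nested-loop cost model (read R once, plus one full scan of S per block of R), the arithmetic can be sketched with hypothetical page counts:

```python
from math import ceil

def bnl_io_cost(R_pages, S_pages, B):
    # Read every page of R once, and scan all of S
    # once for each block of R.
    return R_pages + ceil(R_pages / B) * S_pages

# Hypothetical sizes: |R| = 1000 pages, |S| = 500 pages
print(bnl_io_cost(1000, 500, 1))     # tuple-at-a-time
print(bnl_io_cost(1000, 500, 100))   # blocks of 100 pages
print(bnl_io_cost(1000, 500, 1000))  # all of R in one block
```

Larger blocks mean fewer rescans of S, so the answer is: as big as available memory allows.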
Cross product is expensive!
Can we do better?
σ_c(R × S) ≡ R ⋈_c S (a join is a cross product followed by a selection on the join condition c)
Problem: Naively, any tuple matches any other
Solution: First organize the data
Hash on the join key: a tuple with key X goes to bucket h(X) mod N.
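The hash-based organization can be sketched as an in-memory hash join; the function and key extractors below are illustrative, not the course's exact interface.

```python
from collections import defaultdict

def hash_join(R, S, key_r, key_s):
    # Build phase: bucket the tuples of S by the hash of their join key.
    buckets = defaultdict(list)
    for s in S:
        buckets[hash(key_s(s))].append(s)
    # Probe phase: each R tuple is only compared against its own bucket,
    # not against all of S.
    result = []
    for r in R:
        for s in buckets[hash(key_r(r))]:
            if key_r(r) == key_s(s):  # verify: different keys can collide
                result.append((r, s))
    return result
```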
Alternative: Build an in-memory tree (e.g., B+Tree) instead of a hash table!
Why is it important that the hash function is pseudorandom?
What if the data is already organized (e.g., sorted) in a useful way?
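If both inputs are already sorted on the join key, they can be joined in a single coordinated pass. A minimal sketch, assuming keys are unique within R:

```python
def merge_join(R, S, key):
    # Assumes R and S are each sorted on the join key,
    # and (for simplicity) that keys are unique within R.
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        kr, ks = key(R[i]), key(S[j])
        if kr < ks:
            i += 1          # R is behind: advance R
        elif kr > ks:
            j += 1          # S is behind: advance S
        else:
            result.append((R[i], S[j]))
            j += 1          # collect every S match for this R key
    return result
```

Neither input is rescanned, so the data passes only once through each relation.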