March 4, 2021
Reads Want: Nice sorted, compact list.
Writes Want: Random order.
What happens if we optimize for reads
(and even do away with pages)
Each insert requires O(N) write IOs.
"Write Amplification"
How can we reduce Write Amplification?
Idea: Buffer writes
Pro: With a B element buffer, O(NB) write amplification (amortized)
Con: Every read now needs to go to two places
Con: O(NB) is still linear
Idea: Don't merge!
Pro: No write amplification!
Con: Every read now needs to go to O(NB) places on disk (Read Amplification)
Idea: Combine the two?
In general, level i contains 2i−1B records.
When writing to a full level, merge and write to next level instead.
Write Amplification: Every record copied O(logN) times
Read Amplification: At most O(logN) levels
"The Case for Learned Index Structures"
by Kraska, Beutel, Chi, Dean, Polyzotis
f(key)↦position
(not exactly true, but close enough for today)
Simplified Use Case: Static data with "infinite" prep time.
We have infinite prep time, so fit a (tiny) neural network to the CDF.
if
statements are really expensive on modern processors.
SELECT partkey
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate >= DATE(NOW() - '1 Month')
ORDER BY shipdate DESC LIMIT 10;
SELECT suppkey, COUNT(*)
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate >= DATE(NOW() - '1 Month')
GROUP BY suppkey;
SELECT partkey, COUNT(*)
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
GROUP BY partkey;
All of these views share the same business logic!
Started as a convenience
CREATE VIEW salesSinceLastMonth AS
SELECT l.*
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
SELECT partkey FROM salesSinceLastMonth
ORDER BY shipdate DESC LIMIT 10;
SELECT suppkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY suppkey;
SELECT partkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY partkey;
But also useful for performance
CREATE MATERIALIZED VIEW salesSinceLastMonth AS
SELECT l.*
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(NOW() - '1 Month')
Materializing the view, or pre-computing and saving the view lets us answer all of the queries on the view faster!
What if the query doesn't use the view?
SELECT l.partkey
FROM lineitem l, orders o
WHERE l.orderkey = o.orderkey
AND o.orderdate > DATE(’2015-03-31’)
ORDER BY l.shipdate DESC
LIMIT 10;
Can we detect that a query could be answered with a view?
(sometimes)
View Query | User Query | |
---|---|---|
SELECT Lv FROM Rv WHERE Cv
|
SELECT Lq FROM Rq WHERE Cq
|
When are we allowed to rewrite this table?
View Query | User Query | |
---|---|---|
SELECT Lv FROM Rv WHERE Cv
|
SELECT Lq FROM Rq WHERE Cq
|
View Query | User Query | |
---|---|---|
SELECT Lv FROM Rv WHERE Cv
|
SELECT Lq FROM Rq WHERE Cq
|
SELECT LQ
FROM (RQ−RV), view
WHERE CQ
Enumerate all possible plans
... then how do you pick? (more soon)