April, 2021
Just like Checkpoint 3, but now...
CREATE TABLE CUSTOMER(
CUSTKEY INT,
NAME STRING,
ADDRESS STRING,
NATIONKEY INT,
PHONE STRING,
ACCTBAL FLOAT,
MKTSEGMENT STRING,
COMMENT STRING
) USING csv OPTIONS(
path 'data/CUSTOMER.data',
delimiter = '|',
primary_key = 'custkey'
)
primary_key = 'custkey',
or
primary_key = 'orderkey,lineitem',
comma-separated list describing the primary key of the table
Time to scan SF 0.1 LINEITEM
Source | Time |
---|---|
CSV on NVME SSD | 0.88745s |
Time to scan SF 0.1 LINEITEM
Source | Time |
---|---|
CSV on NVME SSD | 0.88745s |
IndexedSeq[InternalRow] | 0.018s |
~30x speedup
Takeaway: Read data in at CREATE TABLE
Table(...)
Return an iterator over the preloaded table.
σC(R) and (…⋈CR)
Original Query: πA(σB=1∧C<3(R))
Possible Implementations:
Sort data on (A,B,C,…)
First sort on A, B is a tiebreaker for A,
C is a tiebreaker for B, etc...
Which one do we pick?
(Start picking arbitrarily, then experiment)
These are called "Access Paths"
Time to filter SF 0.1 LINEITEM for one orderkey
Source | Time |
---|---|
CSV on NVME SSD | 0.9196s |
IndexedSeq[InternalRow] | 0.0624s |
Time to filter SF 0.1 LINEITEM for one orderkey
Source | Time |
---|---|
CSV on NVME SSD | 0.9196s |
IndexedSeq[InternalRow] | 0.0624s |
Sorted IndexedSeq[InternalRow] + Bin Search | 0.0008s |
~80x speedup
Takeaway: Sort on primary key and binary search.
Filter(expression, Table(...))
If expression is a ...
If expression is a ...
But TPC-H doesn't have filters on keys...
USING csv OPTIONS(
path '../TPCH/LINEITEM.csv',
delimiter = '|',
primary_key = 'orderkey,linenumber',
tree_index = 'shipdate',
hash_index = 'linestatus|shipmode'
)
tree_index and hash_index are |-separated lists of ,-separated indexes.