DuBStep Checkpoint 3

Overview: Use a precomputation period to make queries faster.
Grade: Scored out of 10 points

Databases designed for analytics are frequently required to respond to queries very quickly -- The longer you make an analyst wait, the harder it is to explore a dataset. Fortunately, however, analysts are often ok with a preparation period: a chance to 'load' data in. In this checkpoint, we'll be moving closer to this interactive setting. This means several things:

We'll be expecting a little user friendliness: Print out a prompt ('$>' on its own line) when your program first starts up, and after it finishes processing each statement (SELECT or CREATE TABLE). The testing script will wait to see a prompt (and a correct result for a SELECT statement) before it gives you the next statement.
You'll get 2 minutes to process each CREATE TABLE statement. As long as you remain under the cutoff, this time will not be counted towards your grade.
You'll get a period of between 1-2 minutes (depending on query) to process as many queries as you can. You'll get a stream of semi-randomly generated queries based on one of the TPC-H templates, one at a time. Your grade will be based on how many queries you can process in the allotted time.

Indexes

JSQLParser's CreateTable object includes a getIndexes() method that lists off every type of constraint/index definition included in the Create Table statement. There are three types of "index" (aka constraint) definitions that can be used: PRIMARY KEY, INDEX, and UNIQUE:

PRIMARY KEY: A table can have at most one PRIMARY KEY definition. The columns indicated in the definition are guaranteed to be unique (you may assume that they are; a typical analytical database would be required to check that this constraint holds). Primary keys are often used to refer to the elements of a given table (e.g., custkey for Customer or orderkey for Orders).
INDEX: An index indicates that the specified columns will be used in queries and that an index should be built over them. Assume that the user may want range queries as well as exact lookup queries for explicitly indexed attributes.
UNIQUE: A unique constraint indicates that the specified columns are guaranteed to uniquely identify a row of the data. In general, a unique constraint can be used to better estimate arities for cost-based optimization, and may also be used to refine choice of index.

Different types of index structures serve different purposes. When selecting an index to use, there are several orthogonal properties to consider. Three of the more important properties are clustering, uniqueness, and organization:

Clustered (Primary) vs Unclustered (Secondary)

A clustered index stores the full record in the index structure itself (i.e., in java, a Map<Key,Row>). An unclustered index stores a pointer to the full record (either a literal pointer, the position on disk, or another way to uniquely identify the record like the record's primary key).

Unique vs Nonunique

The structure of an index changes slightly depending on whether you can assume that the attributes being indexed uniquely identify a row or not (in java: Map<Key,Row> vs Map<Key,List<Row>>).

Hash vs Tree

Hash indexes are faster (O(1) put/get vs O(log N)), but Tree indexes support support range scans, and typically use less memory than a hash index. See documentation on Java's HashMap and TreeMap. The reference implementation uses a Clustered Unique Hash index for Primary Keys and an Unclustered Non-Unique Tree index for everything else. For this particular checkpoint, these choices are somewhat arbitrary, and different combinations of indexes can do just as well or better. For example, you will be in an extremely memory-constrained setting, so a tree index for primary keys may be an acceptable tradeoff. As another example, the reference implementation uses Primary Keys to implement unclustered indexes, which may not be the most optimal decision.

The Reference Implementation

The reference implementation was extended in the following ways for checkpoint 3:

When parsing CREATE TABLE statements, the implementation now interprets the Index objects for PRIMARY KEY and INDEX definitions included in the CreateTable object parsed by JSQLParser. It immediately loads data in from the CSV files, parses it, and indexes it according to the definitions given.
A new Index Scan (leaf) operator is added that reads from an index. The operator operates in three modes: Scan everything (saving on the Disk IO), Scan equal-to, and Scan range (given low and high values).
A new Index Nested Loop Join (unary) operator is added that joins rows against an index. INLJ works like Index Scan, but keeps changing the key being scanned for every row visited.
The optimizer now recognizes patterns for which Index Scan is appropriate (SELECT(FILESCAN), or just FILESCAN), as well as patterns for which INLJ is apropriate (SELECT(CROSS(FILESCAN, ANYTHING))). This change is very similar to checkpoint 2 where the reference implementation added a rule to convert CROSS to HASH JOIN.

Interface

Unlike previous checkpoints, your code will be run "interactively" as described below. Data will drawn from a 0.1GB (SF 0.1) TPC-H dataset. Your program will be called as follows:

java -cp build:checkpoint2.jar:jsqlparser.jar 
     dubstep.Main 
     --data [data] 
     -

on the command-line indicates that you should read from System.in and not a file. An example of how to do this is included in the checkpoint 2 solution documents. The testing software will expect the string "$> " printed on its own line as a prompt. Your code should print a prompt when it is ready to receive a command: When the program first starts, after each CREATE TABLE statement, and after the last line of output from a SELECT statement.

Grading

Your code will be launched and given all relevant CREATE TABLE statements. You will have 2 minutes for each statement. If your code takes more than 2 minutes to output a prompt after the CREATE TABLE statement is issued, you will receive a 0 for the submission and need to try again.

After loading the statements, your code will be subjected to several test cases. For each test case, we will pick one of the 22 TPC-H query templates and generate a stream of random queries according to this template. Your database will be asked to process one of these queries at a time. You have between 1 and 2 minutes (depending on test case) to process as many queries as you can. Your grade will be based on the number of queries processed.

We will select a subset of the queries processed by your database to act as 'validation' queries using reservoir sampling. Your database's answers to these queries will be saved and checked for correctness after the experiment using the same rules as for checkpoint 2. If your database produces an incorrect result for a validation query, you will receive a 0 for the test case.