Indexing (Part 2)
CSE-4/562 Spring 2019
February 20, 2019
Textbook: Ch. 14.3
Index
Data
Data, even if well organized, still requires you to page through a lot of it.
An index helps you quickly jump to specific data you might be interested in.
Data Organization
- Unordered Heap
- No organization at all. O(N) reads.
- (Secondary) Index
- Index structure over unorganized data. O(≪N) random reads for some queries.
- Clustered (Primary) Index
- Index structure over clustered data. O(≪N) sequential reads for some queries.
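As a rough illustration of the three organizations above, here is a minimal in-memory sketch. The record layout, the function names, and the use of a Python dict and a sorted list as stand-ins for on-disk structures are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical (key, payload) records stored in arrival order (an unordered heap).
records = [(17, "q"), (3, "a"), (42, "z"), (3, "b"), (8, "m")]

def heap_lookup(heap, key):
    """Unordered heap: every record must be examined, O(N)."""
    return [r for r in heap if r[0] == key]

def build_secondary_index(heap):
    """Secondary index: key -> positions in the (still unordered) heap."""
    index = defaultdict(list)
    for pos, (k, _) in enumerate(heap):
        index[k].append(pos)
    return index

def secondary_lookup(heap, index, key):
    """Each matching position is a separate (random) read into the heap."""
    return [heap[pos] for pos in index.get(key, [])]

def clustered_lookup(sorted_records, sorted_keys, key):
    """Clustered data: matches are adjacent, so one contiguous slice is read."""
    lo = bisect_left(sorted_keys, key)
    hi = bisect_right(sorted_keys, key)
    return sorted_records[lo:hi]

index = build_secondary_index(records)
clustered = sorted(records)
clustered_keys = [k for k, _ in clustered]
```

All three lookups return the same records for key 3; they differ only in how much data is touched and whether the reads are scattered or contiguous.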
Hash Indexes
A hash function h(k) is ...
- ... deterministic
- The same k always produces the same hash value.
- ... (pseudo-)random
- Different ks are unlikely to have the same hash value.
Taking the modulus h(k) % N gives you a (pseudo-)random number in [0, N), which can be used as a bucket number.
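A minimal sketch of mapping a key to a bucket. The choice of SHA-256 and the names h and bucket_of are assumptions for illustration; any deterministic, well-mixed hash function would do.

```python
import hashlib

def h(key: str) -> int:
    """Deterministic hash: the first 8 bytes of SHA-256, read as an integer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def bucket_of(key: str, n_buckets: int) -> int:
    """Map a key to a bucket number in [0, n_buckets)."""
    return h(key) % n_buckets
```

Python's built-in hash() is avoided here because, for strings, it is randomized per process by default, so it is not deterministic across runs.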
Problems
- N is too small
- Too many overflow pages (slower reads).
- N is too big
- Too many normal pages (wasted space).
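The trade-off above can be sketched by counting pages for a fixed set of keys. The page capacity of 4 records, the identity hash on integer keys, and the function name page_stats are assumptions for the example.

```python
PAGE_CAPACITY = 4  # assumed number of records that fit on one page

def page_stats(keys, n_buckets, capacity=PAGE_CAPACITY):
    """Return (overflow_pages, empty_slots) for a static hash table.

    Each bucket owns one primary page even when it is empty; records
    beyond the first page of a bucket spill into overflow pages.
    """
    counts = [0] * n_buckets
    for k in keys:
        counts[k % n_buckets] += 1  # identity hash, for illustration only
    # ceil(c / capacity) pages are needed per bucket; all but the first are overflow
    overflow = sum(max(0, -(-c // capacity) - 1) for c in counts)
    total_slots = (n_buckets + overflow) * capacity
    return overflow, total_slots - len(keys)

keys = range(64)
too_small = page_stats(keys, 2)    # long overflow chains
just_right = page_stats(keys, 16)  # every page exactly full
too_big = page_stats(keys, 64)     # mostly empty primary pages
```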
Idea: Resize the structure as needed
To keep things simple, let's use h(k)=k
(you wouldn't actually do this in practice)
Problems
- Changing hash functions reallocates everything
- Idea: Only ever double/halve the number of buckets N
- Changing sizes still requires reading everything
- Idea: Only redistribute buckets that are too big
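Why doubling helps can be sketched as follows, using h(k)=k as above: when N doubles, every key in bucket i lands in either bucket i or bucket i+N, so one bucket can be redistributed without reading any other. The function name split_bucket is an assumption for the example.

```python
def split_bucket(bucket_keys, i, n):
    """Split bucket i of an n-bucket table for a table with 2n buckets.

    With h(k) = k, every key in bucket i satisfies k % n == i, so
    k % (2 * n) is either i (the key stays) or i + n (the key moves).
    """
    stay = [k for k in bucket_keys if k % (2 * n) == i]
    move = [k for k in bucket_keys if k % (2 * n) == i + n]
    return stay, move

# Bucket 1 of a 4-bucket table; only these keys need to be read.
stay, move = split_bucket([1, 5, 9, 13], i=1, n=4)
```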
Dynamic Hashing
- Add a level of indirection (Directory).
- A data page i can store data with h(k) % 2^n = i for any n.
- Double the size of the directory (almost free) by duplicating existing entries.
- When bucket i fills up, split on the next power of 2.
- Can also merge buckets/halve the directory size.
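The bullets above can be sketched as a small in-memory structure. This is an illustrative sketch, not a reference implementation: it assumes h(k)=k on distinct non-negative integer keys, a tiny bucket capacity, and hypothetical class and method names, and it omits merging buckets and halving the directory.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # keys stored here agree on their low `depth` bits
        self.keys = []

class DynamicHashIndex:
    def __init__(self, capacity=2):
        self.capacity = capacity       # assumed records per data page
        self.global_depth = 0          # the directory has 2**global_depth entries
        self.directory = [Bucket(0)]   # level of indirection: slot -> bucket

    def _slot(self, key):
        return key % (1 << self.global_depth)  # h(k) = k, for illustration

    def contains(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if key in bucket.keys:
            return
        bucket.keys.append(key)
        # Split until the bucket holding `key` fits on one page again.
        while len(bucket.keys) > self.capacity:
            self._split(bucket)
            bucket = self.directory[self._slot(key)]

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the directory by duplicating its entries; no data moves.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << bucket.depth        # the next power of 2 to split on
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (sibling if k & bit else bucket).keys.append(k)
        # Repoint the directory slots whose new bit is set to the sibling.
        for slot, b in enumerate(self.directory):
            if b is bucket and slot & bit:
                self.directory[slot] = sibling

idx = DynamicHashIndex(capacity=2)
for k in range(16):
    idx.insert(k)
```

Only the overflowing bucket is read and rewritten during a split; every other bucket is untouched, and several directory slots may keep pointing at the same bucket until it needs to split.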