CSE-4/562 Spring 2021 - Streaming Queries

March 30, 2021

## Checkpoint 2

• Deadline moved to April 7
• Extra office hours Wed April 7

### Sequential Data

• Temporal Data (today's focus)
• Bi-Temporal (time the event happened + time the event was recorded)
• Spatial (2d data, 3d data)
• Spatio-Temporal Data (Spatial + Time)

### Example Queries

Find the % change in monthly sales, each month


SELECT A.Month, (A.Sales-B.Sales) / B.Sales
FROM (SELECT … AS Month, SUM(…) AS Sales FROM …) A,
(SELECT … AS Month, SUM(…) AS Sales FROM …) B
WHERE A.Month = B.Month + 1


### Example Queries

Find the % change in monthly sales, each month

### Example Queries

Find the daily top-5 products by sales in the last week


SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 1
ORDER BY Sales Desc LIMIT 5 UNION ALL
SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 2
ORDER BY Sales Desc LIMIT 5 UNION ALL …


### Example Queries

Find the trailing n-day moving average of sales

… almost impossible to express if n is a parameter
(i.e., query size depends on N)

### The WINDOW Operator

1. Define a Sequence (i.e., sort the relation)
2. Compute all subsequences
• Fixed Physical Size: N records exactly.
• Fixed Logical Size: Records within N units of time.
3. Compute an aggregate for each subsequence (one output row per subsequence)

SELECT L.state, T.month,
AVG(S.sales) OVER W as movavg
FROM   Sales S, Times T, Locations L
WHERE  S.timeid = T.timeid
AND  S.locid = L.locid
WINDOW W AS (
PARTITION BY L.state
ORDER BY T.month
RANGE BETWEEN INTERVAL ‘1’ MONTH PRECEDING
AND INTERVAL ‘1’ MONTH FOLLOWING
)

PARTITION BY
Like GROUP BY
ORDER BY
The sequence to create. The output has one row for each value of this column.
RANGE BETWEEN ... AND ...
Physical/Logical size of the window
[Aggregate] OVER [WindowName]
A single query can have multiple windows.
OLAP
Changing Queries, Fixed Data
OLTP
Changing Data, Minimal Queries
Streaming
Changing Data, Fixed Queries

Challenge: Need to react to new data as it arrives

### Streaming Queries

• Push vs Pull Data Flow
• Revisit Joins
• Revisit Indexing
• Revisit Aggregation
• Each operator operates independently
• Each operator buffers input
• Operators are scheduled (possibly in different threads)

### Stream Joins

• Problem 1: Eventually run out of buffer!
• Problem 2: Looping over all buffer entries is slow

Idea 1: Mandate ONLY WINDOW queries

Idea 2: Index the buffer!

Challenge: Maintaining the index as tuples fall out of the buffer

### Streaming Indexes

Tuples always enter from one side and exit out the other

Lots of lookups for active tuples.

### What is the best layout?

Insert/remove in (nearly) temporal order
Hash/Tree
Lookup (randomly ordered) Join Key

$O(1 + log(|W|))$ insertions.

$O(1 + log(|W|))$ expiration.

### Streaming Aggregation

Ring Aggregates (Sum, Count, Average)
Add new values - $O(|\Delta|)$
Subtract old values - $O(|\Delta|)$
Semiring Aggregates (Min, Max)
Rescan for new max - $O(|W|)$

### Summary

Push vs Pull Data Flow
Push is a better fit because sources produce data at different rates.
Revisit Joins
Focus on ripple-style WINDOW joins.
Revisit Indexing
Linked Hash/Tree Indexes for efficient windowed indexing.
Revisit Aggregation
Sliding window aggregates.