Checkpoint 2

March 16, 2021

Garcia-Molina/Ullman/Widom: Ch. 15.1-15.5, 16.2-16.3, 16.7

Checkpoint 2

Just like Checkpoint 1, but now...

Selecting actual expressions
New SQL features (Limit, Order-By)
Tighter Constraints (Bigger data and less time/query)
Bonus Leaderboard Queries: More data than memory

New SQL Features: Sort

ORDER BY col1 asc/desc, col2 asc/desc, ...: Sort the data on col1 (using col2, col3, ... as tiebreakers) in ascending or descending order.


  'Sort ['bar ASC NULLS FIRST], true
  +- 'Project [*]
     +- 'UnresolvedRelation [FOO], [], false

You can rely on Scala's native sort.


  case class Sort(
    order: Seq[SortOrder],  // Order clauses
    global: Boolean,        // ignore this (for distributed execution)
    child: LogicalPlan      // Input plan
  ) extends UnaryNode with Product with Serializable


  case class SortOrder(
    child: Expression,                    // The expression to sort by
    direction: SortDirection,             // Ascending or Descending
    nullOrdering: NullOrdering,           // NullsFirst or NullsLast
    sameOrderExpressions: Seq[Expression] // ignore this (used by spark)
  )

https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/Sort.html


      SELECT * FROM R ORDER BY A ASC, B DESC


      Sort(Seq(
          SortOrder(UnresolvedAttribute(Seq("A")), 
                    Ascending,
                    Ascending.defaultNullOrdering,
                    Seq()),
          SortOrder(UnresolvedAttribute(Seq("B")), 
                    Descending,
                    Descending.defaultNullOrdering,
                    Seq())
        ), true, UnresolvedRelation(Seq("R")))
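As a rough illustration of how a list of sort keys could be turned into a comparator for Scala's native sort, the sketch below uses hypothetical stand-ins (`Row`, `Key`, `rowOrdering`, `sortRows`) rather than the Catalyst classes. It assumes integer columns and no NULL values, so null ordering is not modeled.

```scala
object SortSketch {
  // Hypothetical stand-in for a tuple: a fixed-width vector of Int columns.
  type Row = Vector[Int]

  // Hypothetical simplification of SortOrder: a column position plus a direction.
  final case class Key(column: Int, ascending: Boolean)

  // Compare on the first key; later keys act only as tiebreakers.
  def rowOrdering(keys: Seq[Key]): Ordering[Row] = new Ordering[Row] {
    def compare(a: Row, b: Row): Int =
      keys.iterator
        .map { k =>
          val c = Integer.compare(a(k.column), b(k.column))
          if (k.ascending) c else -c
        }
        .find(_ != 0)   // first key on which the rows differ
        .getOrElse(0)   // equal on every key
  }

  // Delegate the actual sorting to Scala's built-in sort.
  def sortRows(rows: Seq[Row], keys: Seq[Key]): Seq[Row] =
    rows.sorted(rowOrdering(keys))
}
```

For `ORDER BY A ASC, B DESC` over rows laid out as `(A, B)`, the keys would be `Seq(Key(0, true), Key(1, false))`.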

New SQL Features: Limit

LIMIT N: Return only the first N rows. If the data is sorted, return the first rows according to the sort order. If not, return an arbitrary N rows.


  'GlobalLimit 3
  +- 'LocalLimit 3
      +- 'Project [*]
         +- 'UnresolvedRelation [FOO], [], false

Output exactly limitExpr rows if available


  case class GlobalLimit(
    limitExpr: Expression, 
    child: LogicalPlan
  ) extends OrderPreservingUnaryNode with Product with Serializable

Output exactly limitExpr rows if available per partition


  case class LocalLimit(
    limitExpr: Expression, 
    child: LogicalPlan
  ) extends OrderPreservingUnaryNode with Product with Serializable

For this checkpoint: Pick one, ignore the other

https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/GlobalLimit.html
https://doc.odin.cse.buffalo.edu/catalyzer/org/apache/spark/sql/catalyst/plans/logical/LocalLimit.html


      SELECT * FROM R LIMIT 3


  GlobalLimit(
    Literal(3),
    LocalLimit(
      Literal(3),
      UnresolvedRelation(Seq("R"))
    )
  )
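If tuples flow through the plan as iterators (an assumption about the execution model, not something these slides prescribe), a limit can be sketched as a lazy `take`. The `limit` helper is a hypothetical name.

```scala
object LimitSketch {
  // Iterator.take is lazy: the child is never asked for more than n rows.
  def limit[A](rows: Iterator[A], n: Int): Iterator[A] = rows.take(n)
}
```

Because the iterator stops pulling from its child after N rows, the limit also bounds the work done below it, as long as nothing underneath (such as a sort) has to read its whole input first.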

Handling Tighter Constraints

Selection Pushdown: Push selections down through projections, joins, unions; as close to the data as possible
Join Conversion: Identify useful join conditions (e.g., $R.B = S.B$) and replace the filtered cross product with a better join algorithm.

$\sigma_{c_1 \wedge c_2 \wedge c_3}(R \times S) = \sigma_{c_1}(R) \bowtie_{c_2} \sigma_{c_3}(S)$
(when $c_1$ references only $R$, $c_3$ references only $S$, and $c_2$ references both)

$\sigma(\pi(...)) \rightarrow \pi(\sigma(...))$?


      plan transform {
        case Filter(condition, 
              Project(expressions, child)) => ???
      }
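One possible shape for this rewrite, sketched on a hypothetical miniature plan algebra (`Plan`, `Table`, `Project`, `Filter`, `Cond`) instead of the Catalyst classes. It assumes projections only keep or drop named columns. With computed expressions or aliases, the condition would have to be rewritten before it can move below the projection.

```scala
object PushdownSketch {
  sealed trait Plan
  final case class Table(name: String) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan
  final case class Filter(condition: Cond, child: Plan) extends Plan

  // Hypothetical condition: opaque text plus the set of columns it reads.
  final case class Cond(text: String, references: Set[String])

  def pushDown(plan: Plan): Plan = plan match {
    // The filter reads only columns that pass through the projection
    // unchanged, so it is safe to evaluate it underneath.
    case Filter(cond, Project(cols, child)) if cond.references.subsetOf(cols.toSet) =>
      Project(cols, pushDown(Filter(cond, child)))
    // Otherwise leave the node in place and keep rewriting below it.
    case Filter(cond, child)  => Filter(cond, pushDown(child))
    case Project(cols, child) => Project(cols, pushDown(child))
    case t: Table             => t
  }
}
```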

$\sigma(L\times R) \rightarrow L \bowtie R$?


      plan transform {
        case Filter(condition, 
              Join(lhs, rhs, Cross, joinCondition, hint)) => ???
      }
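To apply the identity $\sigma_{c_1 \wedge c_2 \wedge c_3}(R \times S) = \sigma_{c_1}(R) \bowtie_{c_2} \sigma_{c_3}(S)$, the conjuncts of the filter first have to be sorted into three groups. The sketch below does only that classification, using hypothetical types (`Cond`, `Split`) and assuming the condition has already been broken apart at its top-level ANDs.

```scala
object JoinConversionSketch {
  // Hypothetical condition: opaque text plus the set of columns it reads.
  final case class Cond(text: String, references: Set[String])

  // left-only conjuncts, conjuncts spanning both sides, right-only conjuncts
  final case class Split(left: Seq[Cond], join: Seq[Cond], right: Seq[Cond])

  def split(conjuncts: Seq[Cond], leftCols: Set[String], rightCols: Set[String]): Split = {
    // Conjuncts reading only left-hand columns can be pushed to the left input.
    val (leftOnly, rest) = conjuncts.partition(_.references.subsetOf(leftCols))
    // Of the remainder, those reading only right-hand columns go right;
    // whatever is left touches both sides and becomes the join condition.
    val (rightOnly, both) = rest.partition(_.references.subsetOf(rightCols))
    Split(leftOnly, both, rightOnly)
  }
}
```

Equality conjuncts in the `join` group (such as `R.B = S.B`) are the ones a hash or sort/merge join can exploit.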

Spark's Optimizer

(simplified version)


  trait OptimizationRule {
    def apply(plan: LogicalPlan): LogicalPlan
  }

  object PushDownSelections 
    extends OptimizationRule  
  {
    def apply(plan: LogicalPlan) = 
      plan.transform { 
        case Filter(condition, 
                Project(expressions, child)) => ???
        /* and other cases here... */
      }
  }

Spark's Optimizer

(simplified version)

1. Save a copy of the plan
2. Apply every optimization rule once
3. If the plan changed since the saved copy, goto 1
4. If not, done!

Test whether two logical plans are the same with fastEquals

Watch out for infinite loops.


  val rules = Seq[OptimizationRule]( ??? )

  // Apply every rule once, feeding each rule the output of the previous one
  def onePass(plan: LogicalPlan): LogicalPlan = 
    rules.foldLeft(plan) { (current, rule) => rule.apply(current) }

  // Repeat passes until a full pass leaves the plan unchanged
  def fixpoint(plan: LogicalPlan): LogicalPlan = 
    { var current = plan
      var last: LogicalPlan = null
      while(last == null || !current.fastEquals(last)){
        last = current
        current = onePass(current)
      }
      current
    }
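One way to guard against infinite loops is to cap the number of passes. The sketch below is generic over the plan type so that it runs without Spark. The `maxPasses` parameter and its default are assumptions made for illustration, and plain `!=` stands in for `fastEquals`.

```scala
object FixpointSketch {
  // Apply all rules repeatedly until nothing changes or the pass budget runs out.
  def fixpoint[T](start: T, rules: Seq[T => T], maxPasses: Int = 100): T = {
    var current = start
    var passes = 0
    var changed = true
    while (changed && passes < maxPasses) {
      // One pass: feed each rule the output of the previous rule.
      val next = rules.foldLeft(current)((plan, rule) => rule(plan))
      changed = next != current
      current = next
      passes += 1
    }
    current
  }
}
```

Two rules that undo each other's work (for example, one pushing a filter down and another pulling it back up) would otherwise keep the plan changing forever.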

One-Pass Hash Join

The reference implementation uses Scala's Map.

Keep in mind that several tuples may share the same join key, so each key in the map needs to hold a collection of tuples rather than a single one.
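A minimal sketch of a one-pass hash join built on Scala's `Map`, where each key maps to a sequence of tuples. The names `hashJoin`, `buildKey`, and `probeKey` are hypothetical, and this is not the reference implementation.

```scala
object HashJoinSketch {
  def hashJoin[K, L, R](build: Seq[L], probe: Iterator[R])(
      buildKey: L => K,
      probeKey: R => K
  ): Iterator[(L, R)] = {
    // Build phase: groupBy yields a Map from each key to ALL tuples with that key.
    val index: Map[K, Seq[L]] = build.groupBy(buildKey)
    // Probe phase: stream the other input once and emit every matching pair.
    probe.flatMap { r =>
      index.getOrElse(probeKey(r), Seq.empty).map(l => (l, r))
    }
  }
}
```

Only the build side is held in memory. The probe side is consumed as a stream.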

Picking a Join Order

Pick the smaller table for the outer relation.
LogicalPlan.maxRows is one way to bound the size of the relation (but needs to be defined for your Table class).
Optional: Use commutativity/associativity to find all join orders.

Cross product is expensive!
Can we do better?

$\sigma_c(R\times S) \equiv R\bowtie_c S$

Cross Product

Problem: Naively, any tuple matches any other

Join Conditions

Solution: First organize the data

Strategies for Implementing $R \bowtie_{R.A = S.A} S$

In-Memory Index Join (1-pass Hash; Hash Join): Build an in-memory index on one table, scan the other.
Partition Join (2-pass Hash; External Hash Join): Partition both sides so that tuples don't join across partitions.
Sort/Merge Join: Sort all of the data upfront, then scan over both sides.

Hash Functions

A hash function is a function that maps a large data value to a small fixed-size value
- Typically is deterministic & pseudorandom
Used in Checksums, Hash Tables, Partitioning, Bloom Filters, Caching, Cryptography, Password Storage, …
Examples: MD5, SHA1, SHA2
- MD5() part of OpenSSL (on most OSX / Linux / Unix)
Can map h(k) to range [0,N) with h(k) % N (modulus)

Hash Functions

$$h(X) \mod N$$

Pseudorandom output between $[0, N)$
Always the same output for a given $X$
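On the JVM, `hashCode` can be negative and `%` keeps the sign of its left operand, so a plain `h(k) % N` may fall outside $[0, N)$. The sketch below uses `Math.floorMod` to avoid that. The helper names `bucket` and `partition` are hypothetical.

```scala
object PartitionSketch {
  // Map any key to a bucket in [0, n), even when hashCode is negative.
  def bucket(key: Any, n: Int): Int = Math.floorMod(key.hashCode, n)

  // Split rows into at most n partitions; rows with equal keys always share a partition.
  def partition[A](rows: Seq[A], n: Int)(key: A => Any): Map[Int, Seq[A]] =
    rows.groupBy(r => bucket(key(r), n))
}
```

Applying the same function to both join inputs is what guarantees that matching tuples end up in partitions with the same number.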

1-Pass Hash Join

Limited Queries: Only supports join conditions of the form $R.A = S.B$
Moderate-High Memory: Keeps 1 full relation in memory
Low Added IO Cost: Only requires 1 scan over each input.

Bonus Problem: Memory

Which on-disk algorithms to use?
When to use them?

On-Disk

2-Pass sort is the only on-disk algorithm you need
(plus Sort/Merge join)

Sort/Merge Join

Limited Queries: Only supports join conditions of the form $R.A = S.B$
Low Memory: Only needs to keep ~2 rows in memory at a time (not counting sort).
Low Added IO Cost: No added IO! (not counting sort).
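The merge step can be sketched as follows, assuming both inputs are already sorted on an integer join key. For simplicity this version works on in-memory indexed sequences so that it can revisit a run of duplicate keys on the right-hand side. A streaming version would buffer only that run. All names here are hypothetical.

```scala
object MergeJoinSketch {
  def mergeJoin[L, R](left: IndexedSeq[L], right: IndexedSeq[R])(
      leftKey: L => Int,
      rightKey: R => Int
  ): Vector[(L, R)] = {
    val out = Vector.newBuilder[(L, R)]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val a = leftKey(left(i))
      val b = rightKey(right(j))
      if (a < b) i += 1        // left key too small: advance left
      else if (a > b) j += 1   // right key too small: advance right
      else {
        // Find the end of the run of right tuples sharing this key.
        var jEnd = j
        while (jEnd < right.length && rightKey(right(jEnd)) == a) jEnd += 1
        // Pair every left tuple with this key against the whole right run.
        while (i < left.length && leftKey(left(i)) == a) {
          for (k <- j until jEnd) out += ((left(i), right(k)))
          i += 1
        }
        j = jEnd
      }
    }
    out.result()
  }
}
```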

2-Way Sort

Pass 1: Create Sorted Runs

Pass 2: Merge Sorted Runs
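The two passes can be sketched in memory as below, with vectors standing in for the on-disk runs. This is an illustration under stated assumptions: integer records, a hypothetical `runSize` playing the role of the memory budget, and no actual file IO.

```scala
import scala.collection.mutable

object TwoPassSortSketch {
  // Pass 1: read the input in memory-sized chunks and sort each chunk into a run.
  def createRuns(input: Iterator[Int], runSize: Int): Vector[Vector[Int]] =
    input.grouped(runSize).map(_.toVector.sorted).toVector

  // Pass 2: merge all runs at once, holding only one record per run in memory.
  def mergeRuns(runs: Seq[Vector[Int]]): Iterator[Int] = {
    val cursors = runs.map(_.iterator).toVector
    // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest value.
    val heap = mutable.PriorityQueue.empty[(Int, Int)](
      Ordering.by[(Int, Int), Int](_._1).reverse)
    for ((cursor, idx) <- cursors.zipWithIndex if cursor.hasNext)
      heap.enqueue((cursor.next(), idx))

    new Iterator[Int] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Int = {
        val (value, idx) = heap.dequeue()
        // Refill from the run that just supplied the smallest value.
        if (cursors(idx).hasNext) heap.enqueue((cursors(idx).next(), idx))
        value
      }
    }
  }
}
```

In a real on-disk version, each run would be written to a file in pass 1 and read back sequentially in pass 2.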

Questions?