April 10, 2019
- `CREATE TABLE`s issued
- Both phases in-memory
- `CREATE TABLE`s will include `PRIMARY KEY` and other suggested indexes

Not universal (every system is different), but one way to find hotspots is the JVM's hprof profiler:
$> java -Xrunhprof:cpu=samples dubstep.Main
or
$> java -Xrunhprof:cpu=time dubstep.Main
Sampling is faster; time is more accurate.
Output ends up in ./java.hprof.txt
...
... lots of gobbledygook ...
...
CPU SAMPLES BEGIN (total = 126) Fri Oct 22 12:12:14 2004
rank self accum count trace method
1 53.17% 53.17% 67 300027 java.util.zip.ZipFile.getEntry
2 17.46% 70.63% 22 300135 java.util.zip.ZipFile.getNextEntry
3 5.56% 76.19% 7 300111 java.lang.ClassLoader.defineClass2
4 3.97% 80.16% 5 300140 java.io.UnixFileSystem.list
5 2.38% 82.54% 3 300149 java.lang.Shutdown.halt0
6 1.59% 84.13% 2 300136 java.util.zip.ZipEntry.initFields
7 1.59% 85.71% 2 300138 java.lang.String.substring
8 1.59% 87.30% 2 300026 java.util.zip.ZipFile.open
9 0.79% 88.10% 1 300118 com.sun.tools.javac.code.Type$ErrorType.<init>
10 0.79% 88.89% 1 300134 java.util.zip.ZipFile.ensureOpen
The 'self' column shows how much time is spent in that method.
The 'count' column shows the number of samples/invocations.
The 'trace' column links back to a stack trace of the sample/invocation.
TRACE 300027:
java.util.zip.ZipFile.getEntry(ZipFile.java:Unknown line)
java.util.zip.ZipFile.getEntry(ZipFile.java:253)
java.util.jar.JarFile.getEntry(JarFile.java:197)
java.util.jar.JarFile.getJarEntry(JarFile.java:180)
Main Goal: Figure out where your code is spending its time, and have it spend less time there.
Idea: Minimize the number of "intermediate" tuples: the fewer tuples each operator produces, the less work every operator downstream of it has to do.
Estimating IOs requires estimating $\vert Q(R) \vert$ and $\vert \delta_A(Q(R)) \vert$.
| Operator | RA | Estimated Size |
|---|---|---|
| Table | $R$ | $\vert R \vert$ |
| Projection | $\pi(Q)$ | $\vert Q \vert$ |
| Union | $Q_1 \uplus Q_2$ | $\vert Q_1 \vert + \vert Q_2 \vert$ |
| Cross Product | $Q_1 \times Q_2$ | $\vert Q_1 \vert \times \vert Q_2 \vert$ |
| Sort | $\tau(Q)$ | $\vert Q \vert$ |
| Limit | $\texttt{LIMIT}_N(Q)$ | $N$ |
| Selection | $\sigma_c(Q)$ | $\vert Q \vert \times \texttt{SEL}(c, Q)$ |
| Join | $Q_1 \bowtie_c Q_2$ | $\vert Q_1 \vert \times \vert Q_2 \vert \times \texttt{SEL}(c, Q_1 \times Q_2)$ |
| Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$ |
| Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$ |
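To make the recursion concrete, here is a minimal Java sketch of how an optimizer might fold these rules over a plan tree. The `PlanNode` classes and the idea of attaching a precomputed selectivity to each node are assumptions for illustration, not any particular system's API:

```java
// A minimal sketch: each operator estimates its output size from its
// children, following the table above.  Selectivities are precomputed
// here for simplicity; a real optimizer derives them from statistics.
abstract class PlanNode {
    abstract double estimatedSize();
}

class TableNode extends PlanNode {                 // |R| from catalog stats
    final double rowCount;
    TableNode(double rowCount) { this.rowCount = rowCount; }
    double estimatedSize() { return rowCount; }
}

class SelectNode extends PlanNode {                // |Q| × SEL(c, Q)
    final PlanNode child;
    final double selectivity;
    SelectNode(PlanNode child, double selectivity) {
        this.child = child;
        this.selectivity = selectivity;
    }
    double estimatedSize() { return child.estimatedSize() * selectivity; }
}

class JoinNode extends PlanNode {                  // |Q1| × |Q2| × SEL(c, Q1 × Q2)
    final PlanNode left, right;
    final double selectivity;
    JoinNode(PlanNode left, PlanNode right, double selectivity) {
        this.left = left;
        this.right = right;
        this.selectivity = selectivity;
    }
    double estimatedSize() {
        return left.estimatedSize() * right.estimatedSize() * selectivity;
    }
}

public class EstimateDemo {
    public static void main(String[] args) {
        // σ_c1(R) ⋈_c2 S, with |R| = 1,000,000, |S| = 50,000,
        // SEL(c1, R) = 0.1 and SEL(c2, R × S) = 1/50,000:
        PlanNode plan = new JoinNode(
            new SelectNode(new TableNode(1_000_000), 0.1),
            new TableNode(50_000),
            1.0 / 50_000);
        System.out.println(plan.estimatedSize());  // 100000.0
    }
}
```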
Many major databases (Oracle, Postgres, Teradata, etc.) use something like a 10% rule if they have nothing better.
(The specific percentage varies by DBMS; e.g., Teradata uses 10% for the first `AND` clause, and 75% for every subsequent clause.)
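For example (illustrative numbers only): with $\vert R \vert = 10{,}000$ and a predicate $c_1 \wedge c_2$, a flat 10% per clause estimates $10{,}000 \times 0.1 \times 0.1 = 100$ tuples, while the Teradata-style rule estimates $10{,}000 \times 0.1 \times 0.75 = 750$.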
We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$, we have the following statistics available for $Q$:

- COUNT(*)
- COUNT(DISTINCT A) (for each attribute A)
- MIN(A), MAX(A) (for each numeric attribute A)

If the necessary statistics aren't available (point 1), fall back to the 10% rule.
If the statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely get a better estimate than the 10% rule.
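A sketch of the per-table statistics record this implies, with the 10% fallback wired in (the class and field names are hypothetical, chosen just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-table statistics record; names are illustrative.
class TableStats {
    long count;                                            // COUNT(*)
    Map<String, Long>   countDistinct = new HashMap<>();   // COUNT(DISTINCT A), per A
    Map<String, Double> min           = new HashMap<>();   // MIN(A), per numeric A
    Map<String, Double> max           = new HashMap<>();   // MAX(A), per numeric A

    // P(A = x1) = 1 / COUNT(DISTINCT A); fall back to the 10% rule
    // when no statistics are available for A.
    double equalitySelectivity(String attribute) {
        Long distinct = countDistinct.get(attribute);
        if (distinct == null || distinct == 0) { return 0.1; }
        return 1.0 / distinct;
    }
}
```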
- $\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$
- $\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$
- $\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$ (whichever input provides $A$)
- $\max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq \texttt{UNIQ}(A, R \uplus S) \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$
- $\min_A(\pi_{A, \ldots}(R)) = \min_A(R)$
- $\min_A(\sigma(R)) \approx \min_A(R)$
- $\min_A(R \times S) = \min_A(R)$ or $\min_A(S)$ (whichever input provides $A$)
- $\min_A(R \uplus S) = \min(\min_A(R), \min_A(S))$
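These rules translate almost directly into helper functions that propagate statistics bottom-up through a plan. A minimal sketch (the method names are made up; only union needs bounds rather than a single value):

```java
// Hypothetical helpers propagating UNIQ(A, ·) and MIN(A) bottom-up.
class StatPropagation {
    // UNIQ(A, π_{A,...}(R)) = UNIQ(A, R): projection keeps every value of A.
    static long uniqThroughProjection(long uniqInR) { return uniqInR; }

    // UNIQ(A, σ(R)) ≈ UNIQ(A, R): an approximation; selection may drop values.
    static long uniqThroughSelection(long uniqInR) { return uniqInR; }

    // UNIQ(A, R × S): A comes from exactly one input; pass its count through.
    static long uniqThroughCrossProduct(long uniqInSourceOfA) { return uniqInSourceOfA; }

    // Union only gives bounds:
    // max(UNIQ(A,R), UNIQ(A,S)) ≤ UNIQ(A, R ⊎ S) ≤ UNIQ(A,R) + UNIQ(A,S)
    static long uniqUnionLowerBound(long uniqR, long uniqS) { return Math.max(uniqR, uniqS); }
    static long uniqUnionUpperBound(long uniqR, long uniqS) { return uniqR + uniqS; }

    // min_A(R ⊎ S) = min(min_A(R), min_A(S)): exact for union.
    static double minThroughUnion(double minR, double minS) { return Math.min(minR, minS); }
}
```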
Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A).
Selectivity is a probability: $\texttt{SEL}(c, Q) = P(c)$.
| Condition | | Estimated Selectivity |
|---|---|---|
| $P(A = x_1)$ | $=$ | $\frac{1}{\texttt{COUNT(DISTINCT } A\texttt{)}}$ |
| $P(A \in (x_1, x_2, \ldots, x_N))$ | $=$ | $\frac{N}{\texttt{COUNT(DISTINCT } A\texttt{)}}$ |
| $P(A \leq x_1)$ | $=$ | $\frac{x_1 - \texttt{MIN}(A)}{\texttt{MAX}(A) - \texttt{MIN}(A)}$ |
| $P(x_1 \leq A \leq x_2)$ | $=$ | $\frac{x_2 - x_1}{\texttt{MAX}(A) - \texttt{MIN}(A)}$ |
| $P(A = B)$ | $=$ | $\min\left(\frac{1}{\texttt{COUNT(DISTINCT } A\texttt{)}}, \frac{1}{\texttt{COUNT(DISTINCT } B\texttt{)}}\right)$ |
| $P(c_1 \wedge c_2)$ | $=$ | $P(c_1) \cdot P(c_2)$ |
| $P(c_1 \vee c_2)$ | $=$ | $1 - (1 - P(c_1)) \cdot (1 - P(c_2))$ |

(with constants $x_1, x_2, \ldots$)
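A sketch turning these formulas into code, reusing the hypothetical `TableStats` record from above; `x1`, `x2`, and `n` stand in for the constants in the query:

```java
// Hypothetical selectivity estimators over numeric attribute A, following
// the formulas above.  Assumes TableStats (sketched earlier) holds the
// required COUNT(DISTINCT A) / MIN(A) / MAX(A) statistics.
class Selectivity {
    static double equals(TableStats s, String a) {                  // P(A = x1)
        return 1.0 / s.countDistinct.get(a);
    }
    static double inList(TableStats s, String a, int n) {           // P(A ∈ (x1..xN))
        return (double) n / s.countDistinct.get(a);
    }
    static double lessOrEqual(TableStats s, String a, double x1) {  // P(A ≤ x1)
        return (x1 - s.min.get(a)) / (s.max.get(a) - s.min.get(a));
    }
    static double between(TableStats s, String a, double x1, double x2) { // P(x1 ≤ A ≤ x2)
        return (x2 - x1) / (s.max.get(a) - s.min.get(a));
    }
    static double columnEquals(TableStats sa, String a, TableStats sb, String b) { // P(A = B)
        return Math.min(1.0 / sa.countDistinct.get(a), 1.0 / sb.countDistinct.get(b));
    }
    static double and(double p1, double p2) { return p1 * p2; }     // assumes independence
    static double or(double p1, double p2)  { return 1 - (1 - p1) * (1 - p2); }
}
```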