Analyzing database access logs is a key part of performance tuning, intrusion detection, and many other database administration tasks. Unfortunately, production databases commonly process millions of queries or more each day, so these logs must be summarized before they can be used. On one hand, we want to compress logs to facilitate efficient storage and human inspection. On the other hand, we want to accurately infer the frequencies of patterns that are of interest to workload-analytic applications. We established a framework for inferring pattern frequencies in a principled way using only a small subset of patterns, and proposed an efficiently computable measure of overall inference accuracy. Achieving higher accuracy requires more patterns, but we found that the runtime of pattern mining algorithms also increases steeply. We hypothesized that this is due to the mixing of workloads in the log, and proposed partitioning the log into separate clusters. Clustering reduces the search space of candidate patterns, and we showed empirically that state-of-the-art pattern mining algorithms can be greatly improved in both runtime and accuracy. We further improved the clustering to the point that, as more clusters are created, each cluster becomes simple enough that different pattern mining algorithms differ little in accuracy. As a result, we finally proposed naive mixture encodings, which focus on partitioning workload mixtures and summarize each partition using the most efficient, though naive, encoding. We showed that a naive mixture encoding is orders of magnitude faster to construct and provides summarization accuracy competitive with more complicated pattern mining algorithms.
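To make the idea of a naive mixture encoding concrete, here is a minimal Python sketch. It rests on assumptions that the text above does not spell out: each query is modeled as a set of features, the "naive" encoding of a partition is taken to be its per-feature marginal frequencies with features treated as independent, and the cluster labels are supplied by some external clustering step. The function names and the sample feature strings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def naive_encoding(queries):
    """Per-feature marginal frequencies for one partition of the log."""
    n = len(queries)
    counts = Counter(f for q in queries for f in set(q))
    return {f: c / n for f, c in counts.items()}


def naive_mixture_encoding(queries, labels):
    """Group queries by cluster label and encode each group naively.

    Returns a list of (weight, marginals) pairs, where weight is the
    fraction of the log that falls into that cluster.
    """
    clusters = {}
    for q, label in zip(queries, labels):
        clusters.setdefault(label, []).append(q)
    total = len(queries)
    return [(len(qs) / total, naive_encoding(qs)) for qs in clusters.values()]


def estimate_frequency(encoding, pattern):
    """Estimate the fraction of queries containing every feature in pattern.

    Within each cluster the features are assumed independent, so the
    estimate is the product of marginals; clusters are combined by weight.
    """
    estimate = 0.0
    for weight, marginals in encoding:
        p = weight
        for f in pattern:
            p *= marginals.get(f, 0.0)
        estimate += p
    return estimate


# Hypothetical log mixing two workloads with disjoint features.
log = [
    {"FROM orders", "WHERE status = ?"},
    {"FROM orders", "WHERE status = ?"},
    {"FROM users", "WHERE id = ?"},
    {"FROM users", "WHERE id = ?"},
]
pattern = {"FROM orders", "WHERE status = ?"}

one_cluster = naive_mixture_encoding(log, [0, 0, 0, 0])
two_clusters = naive_mixture_encoding(log, [0, 0, 1, 1])

print(estimate_frequency(one_cluster, pattern))   # unpartitioned estimate
print(estimate_frequency(two_clusters, pattern))  # partitioned estimate
```

In this toy log the pattern occurs in half of the queries. Under the stated assumptions, the unpartitioned encoding underestimates its frequency because the two workloads are blended into one set of marginals, while the two-cluster encoding recovers it, which is the intuition behind partitioning before summarizing.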
Read more in the preprint