November 19
Precise question → Precise answer
Vague question → Imprecise answer
Vague question → Precise answer
User: Return the average number of publications by Bob in each year.
Database: 5
User: Return authors who have more papers than Bob in VLDB after 2000.
Recover sentence structure as a tree of concepts
(Implemented via the Stanford parser)
Tag each node with a label describing its (lightly disambiguated) role in the query.
Node | Tag | Metadata |
---|---|---|
return | SN | SELECT |
author | NN | author |
more | ON | > |
paper | NN | publication |
Bob | VN | author.name |
VLDB | VN | conference.name |
after | ON | > |
2000 | VN | publication.year |
Select, Operator, Function, Quantifier, and Logic Nodes are schema-independent.
Pre-generate (semi-manually) a knowledge base of terms that correspond to each node type.
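A minimal sketch of what such a knowledge base could look like, using only the schema-independent rows from the table above. The dictionary name and the helper `tag_schema_independent` are hypothetical, chosen for illustration; a real term list would be much larger.

```python
# Hypothetical hand-written knowledge base: term -> (tag, metadata).
# Tags follow the table above: SN = select node, ON = operator node.
SCHEMA_INDEPENDENT_TERMS = {
    "return": ("SN", "SELECT"),
    "more": ("ON", ">"),
    "after": ("ON", ">"),
}


def tag_schema_independent(word):
    """Return (tag, metadata) for a known term, or None when the word
    has to be matched against the database schema instead."""
    return SCHEMA_INDEPENDENT_TERMS.get(word.lower())


print(tag_schema_independent("Return"))
print(tag_schema_independent("Bob"))
```

Words that come back as `None` (such as "Bob") are the ones left for the schema-dependent matching step.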
Create a knowledge base of every table name, attribute name, and cell value in the database.
(Ideally the user types in a noun phrase that is one of these)
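One way to collect these entries, sketched here for SQLite only as an assumption (the notes do not name a database engine). The function name `build_schema_knowledge_base` and the tuple layout `(kind, table, column, text)` are hypothetical.

```python
import sqlite3


def build_schema_knowledge_base(conn):
    """Collect one (kind, table, column, text) entry for every table name,
    attribute name, and distinct non-NULL cell value in the database."""
    entries = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        entries.append(("table", table, None, table))
        # Column 1 of each PRAGMA table_info row is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            entries.append(("attribute", table, column, column))
            query = (f'SELECT DISTINCT "{column}" FROM "{table}" '
                     f'WHERE "{column}" IS NOT NULL')
            for (value,) in conn.execute(query):
                entries.append(("value", table, column, str(value)))
    return entries


# Toy database, for illustration only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE author (name TEXT)")
demo.execute("INSERT INTO author VALUES ('Bob')")
for entry in build_schema_knowledge_base(demo):
    print(entry)
```

Scanning every cell value is expensive on a large database; this sketch ignores that cost.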
Given a parse tree node ($node$) and every schema element ($elem$), find the best match.
$$Sim(node, elem) = \max\big(Jaccard(node, elem),\ Word2vec(node, elem)\big)$$
Every node that matches at least one schema element is labeled either NN or VN.
Every node that matches more than one schema element is ambiguous.
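A rough sketch of the matching rule above. Assumptions: Jaccard is computed over lowercase word tokens, the Word2vec score is passed in as an optional callable `embedding_sim` (no real embedding model is loaded here), and the cutoff `threshold=0.5` is an arbitrary illustrative value. All function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphanumeric word tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def sim(node, elem, embedding_sim=None):
    """Max of Jaccard and an optional embedding similarity (assumed in [0, 1])."""
    scores = [jaccard(node, elem)]
    if embedding_sim is not None:
        scores.append(embedding_sim(node, elem))
    return max(scores)


def candidate_matches(node, elems, threshold=0.5, embedding_sim=None):
    """All schema elements scoring at or above the threshold, best first.
    More than one result means the node is ambiguous."""
    scored = [(sim(node, e, embedding_sim), e) for e in elems]
    return sorted([(s, e) for s, e in scored if s >= threshold], reverse=True)


matches = candidate_matches("name", ["author name", "conference name", "year"])
print(matches, "ambiguous" if len(matches) > 1 else "unambiguous")
```

Token overlap alone cannot relate "paper" to "publication"; that is the case the embedding term is there to cover.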
Leverage relationships exposed in the schema:
If proper noun A is an ancestor of proper noun B in the parse tree...
... attempt to minimize the distance between the tags of A and B in the schema graph.
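The distance heuristic above can be sketched as a shortest-path search. Assumptions: the schema graph has one vertex per table and one edge per foreign-key link, and the tables `writes` and `journal` are invented for this example. The helper names are hypothetical.

```python
from collections import deque

# Hypothetical schema graph: tables are vertices, foreign-key links are edges.
SCHEMA_GRAPH = {
    "author": ["writes"],
    "writes": ["author", "publication"],
    "publication": ["writes", "conference", "journal"],
    "conference": ["publication"],
    "journal": ["publication"],
}


def schema_distance(graph, start, goal):
    """Number of edges on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, dist = queue.popleft()
        for neighbor in graph.get(vertex, []):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def closest_candidate(graph, anchor_table, candidates):
    """Pick the candidate table nearest to the table the ancestor node maps to.
    Ties are broken alphabetically, which is an arbitrary choice."""
    scored = [(schema_distance(graph, anchor_table, c), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d is not None]
    return min(scored)[1] if scored else None


# An ambiguous "name" under an ancestor tagged `author` resolves to author.
print(closest_candidate(SCHEMA_GRAPH, "author", ["conference", "author"]))
```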
Approach: Explore "similar" parse trees in the vicinity of the original parse tree.
Two NN or VN nodes are "close" in the parse tree if:
We want close nodes in the parse tree to be close in the schema graph as well.
Dynamic programming to avoid processing the same tree more than once.
Return all permitted edits encountered.
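A sketch of the exploration loop, under several assumptions not stated in the notes: a parse tree is stored as a child-to-parent map with unique node labels, the only edit is moving one subtree under a different parent, and `is_permitted` is a caller-supplied validity check. A visited set stands in for the dynamic-programming table, so no tree is expanded twice.

```python
from collections import deque


def descendants(parents, node):
    """All nodes whose path to the root passes through `node`."""
    found = set()
    for child in parents:
        current = child
        while current in parents:
            current = parents[current]
            if current == node:
                found.add(child)
                break
    return found


def neighbors(parents, root):
    """Every tree reachable by moving one subtree under a new parent."""
    nodes = set(parents) | {root}
    for node in parents:  # the root is never a key, so it never moves
        blocked = descendants(parents, node) | {node, parents[node]}
        for new_parent in nodes - blocked:
            edited = dict(parents)
            edited[node] = new_parent
            yield edited


def explore(parents, root, max_edits, is_permitted):
    """Breadth-first search over trees within `max_edits` moves of the
    original; returns every permitted tree encountered."""
    seen = {frozenset(parents.items())}
    queue = deque([(parents, 0)])
    permitted = []
    while queue:
        tree, edits = queue.popleft()
        if is_permitted(tree):
            permitted.append(tree)
        if edits == max_edits:
            continue
        for candidate in neighbors(tree, root):
            key = frozenset(candidate.items())
            if key not in seen:  # skip trees that were already queued
                seen.add(key)
                queue.append((candidate, edits + 1))
    return permitted


original = {"author": "return", "paper": "author"}
for tree in explore(original, "return", 2, lambda t: True):
    print(tree)
```

The search radius `max_edits` keeps the neighborhood small; the number of reachable trees grows quickly with it.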
ON (comparators) should have two descendants.
We expect:
Otherwise we copy the LHS subtree into the RHS.