CrowdQ: A Search Engine with Crowdsourced Query Understanding

Daniel Haas, Daniel Bruckner, Jonathan Harper
{dhaas, bruckner, jharper}@cs.berkeley.edu

University of California, Berkeley
CS262a: Advanced Topics in Computer Systems

ABSTRACT

Providing direct answers to web search queries rather than links to related documents has been shown to increase user satisfaction [3]. However, most search engines only provide answers to a few simple query types, and keyword queries provide little semantic information. To support direct answers for more complex and less common queries, we have developed CrowdQ, a novel hybrid human-machine system that leverages the crowd to gain knowledge of query structure and entity relationships without the need for expert query annotation. Keyword query understanding techniques provide candidate semantic interpretations of queries, which are generalized as structured query templates. Where existing query understanding techniques result in ambiguity, we use web users to disambiguate via tasks on a crowdsourcing platform. We describe our prototype of CrowdQ, demonstrating that it can answer a wide range of queries with high precision at interactive latencies, and evaluate the quality of the crowd interface, the correctness of template answers, and the responsiveness of live template matching and query execution.

1. INTRODUCTION

Modern web search engines are beginning to provide structured results in response to users’ queries—rather than merely links to documents—in order to more directly meet their information needs. These direct answers, however, are only available for a small set of common query types [6], leaving less common, more complex queries (for example, “birthdays of bay area mayors”) unaddressed. Uncommon queries actually form the bulk of a search engine’s workload: White et al. [19] found that over six months, 97% of unique queries occurred 10 or fewer times. A solution is needed to interpret and directly answer these long-tail queries.

Recent crowd-based systems have demonstrated the effectiveness of crowds for solving common database tasks like gathering missing data and entity resolution [14, 11]. Moreover, crowdsourcing techniques have been used to link entities from keyword queries to semantic datastores [8]. While these techniques have all focused on fixing data, we believe they can be extended to understand queries as well. Existing approaches to keyword query understanding link queries with entities and relationships in structured Linked Data stores [18], though the best approaches require expert annotation [16].

Crowds provide an opportunity for deeper query understanding: every web user has intimate experience with crafting and executing keyword searches on the internet, making them experts at understanding queries. We propose injecting this expertise into each of the hard problems that compose query understanding: recognizing entities, identifying the relationships between entities, and linking these entities and relationships to structured semantic stores. Further, we observe that queries can be grouped by their semantics, and these groups can be utilized to extend the crowd’s understanding of a single query to enable automated direct answering of other queries in the same semantic grouping.

To this end, we introduce CrowdQ, a system for crowdsourced query understanding and answering. CrowdQ consists of an offline query analyzer and an online answer lookup engine. The offline analyzer uses the crowd to build structured query templates that are used for efficient answer retrieval from a structured datastore. Our initial prototype takes advantage of the crowd to verify the results of automated inter-entity relationship extraction. The result is a structured template for each semantic class of queries that is used by the online answer engine to efficiently find answers in real time. Users interact with the system via a standard web interface (see Figure 4) that provides query answers as well as contextual information and alternate interpretations of the query.

The main contributions of this work are as follows:

• An abstraction (“1-Hops”) that enables efficient representation and discovery of query interpretations

• An efficient crowd interface for verifying query semantics with high accuracy

• An online query answer engine that maps queries to learned templates with latencies acceptable for interactive web search

• Experimental evaluation of the above components against a corpus of keyword queries designed to challenge semantic parsers

The remainder of this paper is organized as follows. Section 2 provides context and a discussion of related work. Section 3 describes the challenges faced by CrowdQ. Section 4 describes the semantic constraints on queries in CrowdQ and the query answering abstraction used. Section 5 describes each component of the CrowdQ system in detail. In Section 6 we evaluate the prototype implementation of CrowdQ. In Sections 7 and 8 we discuss future work on the CrowdQ project and the conclusions drawn from the development of the prototype.

2. RELATED WORK

Recent work has dealt with keyword query understanding over heterogeneous knowledge bases. Tran et al. link query terms with an RDF knowledge base and search for connections to generate the top-k answer candidate subgraphs [18]. Zhou et al. use a similar graph-search-based approach [20]. In these approaches, keywords (or words related to keywords) are assumed to correspond one-to-one to objects in the knowledge base. In contrast, CrowdQ takes into account named entities and part-of-speech tags to group related keywords that correspond to a single entity or relationship in the data graph. Additionally, these approaches provide a list of graphs or queries on the data store and leave the choice of the most appropriate query to the user. CrowdQ uses the semantic query understanding ability of the crowd to attempt to choose the most appropriate query.

Pound et al. generate a model using query log mining to map keyword queries to semantic structures, which are linked with structured query templates [16]. The linking process in this approach requires expert annotation of a training dataset that must include every possible interpretation of each training query. Agarwal et al. also use query log mining to extract structured query templates [1]. CrowdQ proposes that the interpretation of keyword queries is a suitable task for the crowd.

Natural language question answering (QA) using Web knowledge bases is another related area. Fernandez et al. develop the PowerAqua system [10, 13], which links a QA system with a Linked Data store. The FREyA project by Damljanovic et al. [7] exploits interaction with the user to disambiguate natural language queries on Linked Data. QA approaches often rely on having properly formed questions and more semantic context than is available in keyword queries.

Bernstein et al. [3] recently used the crowd to extract answers to keyword queries directly from result web pages. However, this approach cannot answer questions that are not already answered explicitly in an existing document. Additionally, each answer solves only a single search query. In contrast, our approach uses the crowd for semantic query understanding and linking to the datastore, and attempts to generalize results from the crowd by generating templates that solve larger classes of related queries. Other approaches [2, 5] also take advantage of the crowd for information extraction in the context of search engines.

Crowdsourcing has been used previously for query processing in the context of traditional database systems in CrowdDB [11] and Qurk [14]. These systems, however, focus on using the crowd for gathering, cleaning, or ranking the data. CrowdQ instead uses the crowd to understand query semantics, especially for keyword queries, which have less context than natural language queries.

3. CHALLENGES

Complexity of query semantics. Parsing natural language is a non-trivial problem, and keyword queries pose an even harder challenge because they lack the grammatical context needed to identify relationships between terms. This means that even a relatively short keyword query can represent a considerably involved semantic structure. For example, the query “bay area mayors” has only two components (“bay area”, an entity, and “mayors”, a class), but implies a much more complicated interpretation (“list of people who are political leaders of type mayor in cities which are located in the San Francisco Bay Area”).

As such, our initial approach to understanding queries does not attempt to interpret arbitrarily complex keyword groupings. Rather, we make two key restrictions on the problem space. First, we only attempt to understand queries containing two terms, in order to restrict the space of possible term-term interactions. Second, we only attempt to understand queries with relatively simple term-term interactions; namely, we only search for query interpretations with 1-Hop semantics (which we define below). Our approach can be extended to more complicated queries, however, as we propose in Section 7.

Data quality issues. In order to provide structured answers to a keyword search, we must transform the search into a structured query against a knowledge database. Our approach uses a database populated with publicly available semantic web (RDF) datasets. These datasets have two major drawbacks.

First, data is often missing, incomplete, or incorrect. We do not address the issues associated with constructing a robust knowledge database in this work, as they could easily fill a paper of their own. Rather, we evaluate a query interpretation as successful if, were the datasets clean and available, the query could be answered correctly. We also note in Section 7 that extra crowdsourcing steps could be added to the CrowdQ workflow in order to dynamically clean data as users encounter errors.

Second, data is often organized inconsistently. For example, in the DBPedia dataset, leaders of cities are related to the cities via the “leader” relation. However, to find the current mayors of these cities, one must check the “leaderTitle” relation as well. To further confuse things, some cities have a “mayor” relationship which directly gives the mayor (though this is only true for a minority of cities). This issue is a direct consequence of the flexibility of the RDF model (and its lack of a formal schema), but it has unfortunate implications for this work. When we attempt to generalize our understanding of a single query to other similar queries, we assume that the data for all such queries is stored consistently and can therefore be queried consistently. At worst, this could lead us to generate query templates that were correct for some initial query, but which incorrectly interpret subsequent queries (and return incorrect results). More commonly, inconsistent data representation merely leads to less general templates and fails to yield results for queries that do not match the template’s implicit schema.

Unreliable crowds. Any work that relies on crowd-based feedback must cope with the fact that the crowd often provides conflicting and incorrect responses. We seek to mitigate this introduced error in three ways. First, we generate candidate query interpretations ourselves and ask the crowd to disambiguate them, rather than giving the crowd the open-ended task of generating interpretations from the query text alone. Second, we provide clear and explicit English descriptions of each query interpretation, to prevent misunderstanding. Finally, we require multiple votes from the crowd before we accept an interpretation as correct. However, this methodology is by no means perfect, and as we suggest in Section 7, additional quality assurance steps could be added to improve crowd accuracy.
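To make the final safeguard concrete, the short sketch below shows one way the multiple-vote requirement could be aggregated into an accepted interpretation. It is only an illustration: the vote format, the three-vote minimum, and the function name are our assumptions, not the exact CrowdQ implementation.

from collections import Counter

def aggregate_votes(votes, min_votes=3):
    # `votes` is a list of candidate-interpretation IDs, one per worker
    # response (illustrative format). Returns the accepted interpretation,
    # or None if there is not yet enough agreement.
    if len(votes) < min_votes:
        return None  # wait for more workers before accepting anything
    counts = Counter(votes)
    winner, count = counts.most_common(1)[0]
    # require a strict majority of the submitted votes
    return winner if count > len(votes) / 2 else None

# Three workers agree, one dissents: candidate "c1" is accepted.
print(aggregate_votes(["c1", "c1", "c2", "c1"]))  # -> c1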

Latency requirements. Our goal is to build a system that provides live, ad-hoc query answering, much like a modern search engine. This requires that users receive answers in interactive timescales, and has important implications for CrowdQ’s system design. Namely, we cannot embed crowdsourcing steps that require feedback from other workers into the online workflow, and we must represent and store templates in a format that can be retrieved, matched against, and executed against the knowledge base while providing sub-second latencies. This acknowledgment led us to break CrowdQ into an offline query processor that uses the crowd to generate and store templates (see Sections 5.1 and 5.2.1) and an online query processor that simply matches queries against templates, executes them, and returns results (see Section 5.2.2). Evaluation of the latency characteristics of CrowdQ can be found in Section 6.3.

4. QUERY SEMANTICS

To make query understanding tractable, an answer system must make assumptions about the queries it can handle. Queries that do not conform to the assumptions are ignored by the system. Traditionally, search engines have restricted direct answers to specific content domains, for example, flight times or the weather. Given a restricted domain, the system can leverage specific semantic structure to give high quality answers to covered queries. The drawback is that domain-specific systems are, by definition, hard to extend to other domains.

Another approach to restricting the answerable query space is to make assumptions about the linguistic syntax of input queries. For example, the Tail Answers system mines query logs for queries beginning with question words like “how,” “who,” and “where” [3]. Hardcoded syntax rules can be leveraged by a parser to achieve high quality understanding, but they also make the system rigid and difficult to extend.

CrowdQ is designed to be more extensible than other answer systems by making weaker assumptions about input. CrowdQ is like domain-specific systems in that it depends on an underlying data store to generate answers. CrowdQ’s maximum answerable query space is bounded by the content of its store. Unlike domain-specific systems, however, CrowdQ makes no assumptions about the structure of its store, and instead uses search algorithms and the crowd to learn semantic structure.

Figure 1: The 1-Hop abstraction encapsulates a limited but common class of query semantics.

The schema of a data store is limited in its ability to assist with query understanding because queries may contain an arbitrary number of semantic “hops.” To illustrate, consider that any noun phrase, say “CS262a,” can be expanded indefinitely (for example, to “students in CS262a,” or “favorite football teams of cousins of professors of students in CS262a”). To make CrowdQ’s job tractable, we limit it to answering queries with at most one semantic hop, plus an additional constraint hop. We believe this is a good assumption because queries with less indirection occur more frequently. We intend to extend CrowdQ to handle more complex queries in the future.

To make the limit of “one semantic hop, plus a constraint” precise, we introduce the 1-Hop abstraction (see Figure 1). A 1-Hop is a sub-graph consisting of four nodes and three edges. Consider the query “students in CS262a.” The source node corresponds to an explicit entity found in the query (in this case “CS262a”). The answer node is an entity in the database that has an attribute that answers the query (any student in CS262a). The 1-Hop semantics require that the answer node be one hop—the source predicate—from the source node, where a hop corresponds to one edge in a semantic graph data store or to one foreign-key relationship in a relational database. Answer nodes can optionally be required to meet some constraint in the form of being one hop (the filter predicate) from a filter node. Filters often enforce a type constraint; for example, “is a student” may be applied to weed out professors and TAs in case the source predicate is too general (e.g., “participants in CS262a”).
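To make the abstraction concrete, a 1-Hop can be captured by little more than its four nodes and three edges. The sketch below is one possible representation; the field names and the example identifiers (e.g., dbpedia:CS262a, rel:attendee) are illustrative, not CrowdQ’s actual schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class OneHop:
    # source --source_predicate--> answer,
    # optionally constrained by answer --filter_predicate--> filter_node
    source: str                              # e.g. "dbpedia:CS262a"
    source_predicate: str                    # e.g. "rel:attendee"
    answer: str                              # e.g. "?student" (variable or entity)
    filter_predicate: Optional[str] = None   # e.g. "rdf:type"
    filter_node: Optional[str] = None        # e.g. "class:Student"

hop = OneHop(source="dbpedia:CS262a", source_predicate="rel:attendee",
             answer="?student", filter_predicate="rdf:type",
             filter_node="class:Student")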

The 1-Hop abstraction is used by the components of CrowdQ to add new semantics to the system and extend its query coverage. Often, it is useful to group 1-Hops that are identical except for their answer nodes—these groups represent answer sets. The relationship extraction system generates candidate answer sets. The crowd verification system vets these candidates and chooses correct ones to be made into templates. The template system maps query patterns to generalized versions of selected candidates—in these templates, the source node is abstracted as a wildcard, for example, “students in <course number>” (see Section 5.2.2 for more details). Adding new templates enables CrowdQ’s online component to quickly answer new classes of queries, thus expanding the space of answerable queries. The following section will describe this process and the CrowdQ system components in greater detail.

5. THE CROWDQ SYSTEM

The major components of CrowdQ are an offline query analyzer, a crowd-based verification system, and an online query answer engine. To answer a live search query, the answer engine searches a template database for an answer template that matches the query. The template in turn maps the query to a parameterized structured query on an underlying answer store. This latter query is executed to return the result to the user.

CrowdQ can be extended to answer more queries by adding new templates to its template database. Given an exemplar query, the query analyzer generates candidate templates that match the 1-Hop semantic abstraction described in the previous section. Candidate templates are then passed to the crowd to be cleaned, filtered, and verified. Templates that pass the crowd are added to the template database to be used by the online engine. The following section explores these component systems in detail.

Figure 2 illustrates how these components fit together in CrowdQ’s architecture (as proposed originally in [9]). To this point, CrowdQ has been implemented against semantic-web answer stores, namely DBPedia [4], YAGO [17], and MusicBrainz¹. We put these data sets into the Apache Jena [15] triple store, and access the data with the SPARQL query interface.

¹ http://musicbrainz.org

5.1 Query Analyzer

Given a new exemplar query, the CrowdQ query analyzer attempts to interpret the query to generate a set of sub-graph patterns in the answer database. In particular, the analyzer parses the query to identify relevant entities, then searches for relationships between those entities that conform to the 1-Hop abstraction. The 1-Hop objects that the analyzer identifies are used as input to the crowd interface for candidate templates that answer the exemplar query and others with equivalent semantics.

5.1.1 Entity Extraction

The first step when given an exemplar query is to parse it and map n-grams in it to objects in the answer database. For example, given the query “actors in Top Gun,” we would like to map “Top Gun” to its corresponding entity in DBPedia, and “actors” to an actor type or relation. In practice, an n-gram in the query may map to multiple objects—the YAGO data set, as an example, contains hundreds of categories about actors, including “ActorsFromNewMexico” and “ActorsFromCalifornia.”

To accommodate the 1-Hop abstraction, we expect the entity extractor to find exactly two objects, a source and a sink, both linked to n-grams in the query. If the extractor finds more objects, the query is discarded, since CrowdQ does not handle more complex queries at present. Likewise, if no objects are found, the query is dropped as trivial. In cases where only a single entity is found, select information about that entity is displayed on the search results page.

Named entity recognition (NER) is a challenging problem that has been studied deeply in the NLP field. Many tools exist to perform NER with acceptable accuracy, but they require careful training and setup to achieve accuracy on a new corpus. Working within the scope of a class project, we opted to avoid the potential rabbit hole of exploring sophisticated NLP extraction techniques. Therefore, the present implementation of CrowdQ expects the source portion of exemplar queries to be tagged and mapped by hand. The sink text must also be labeled, but not mapped to any database object, since it is used simply as input to a full-text search for matching objects.
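For reference, a hand-tagged exemplar in this scheme needs only the source entity’s mapping and the raw sink text. The dictionary below is a hypothetical illustration of that input; the key names and the DBPedia URI are assumptions made for the example.

# Hypothetical hand-tagged exemplar for "actors in Top Gun": the source is
# mapped to a concrete database entity, while the sink is left as free text
# for the later Lucene full-text search.
tagged_query = {
    "raw": "actors in Top Gun",
    "source": {"ngram": "Top Gun", "entity": "http://dbpedia.org/resource/Top_Gun"},
    "sink": {"ngram": "actors"},   # labeled, but not linked to any database object
}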

5.1.2 Candidate Generation

The procedure to produce candidate templates is a search of the answer database for sub-graphs that connect a query’s source and sink and that match the 1-Hop graph pattern. The 1-Hop abstraction greatly simplifies the search, which, given a triple store answer database, reduces to a handful of SPARQL queries. There are four cases that need to be checked. In all cases, the query’s source is identified with the 1-Hop’s source node. Then each case identifies the query’s sink as one of the source predicate, answer node, filter predicate, or filter node. In fact, the sink text is identified with these objects indirectly, by means of a full-text search query. Apache Jena supports full-text search powered by Lucene. Non-string objects are returned by the search when their text labels match the search term.

These four semantic cases are illustrated below for mapping the query “CS262a room number” to different database schemas; a sketch of one such search query follows the list:

1. Sink matches source predicate. This case covers the simple situation where the answer to a query is an attribute of the source node, for example, if the CS262a entity has an attribute labeled “Room Number” that points to the string “Soda 310”.

2. Sink matches answer node. Here, the sink matches an entity directly, for example, if the property location of CS262a points to an entity named “Soda Hall Room number 310.”

3. Sink matches filter node. This case is slightly more complex because the sink text points to a node two hops away from the source. An example is if CS262a’s location property points to multiple entities, named perhaps “Soda Hall” and “310.” The correct answer node here is distinguished because its type property points to “Room Number.”

4. Sink matches filter predicate. This is an odd variant of the previous case, and would apply to the case where the “310” node above has, say, a property called “Is Room Number” that points to true. In practice this case introduces many low quality candidates, and so it is not used in our current implementation.
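As a sketch of what one of these searches looks like, the function below builds a SPARQL query for the third case (sink matches filter node). In the real system the sink text is resolved through Jena’s Lucene full-text index; here a plain rdfs:label filter stands in for that step, and the CS262a URI is purely illustrative.

def filter_node_case_query(source_uri, sink_text):
    # Case 3: one hop from the source to a candidate answer, plus a
    # constraint hop to a filter node whose label matches the sink text.
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?p ?answer ?fp ?filter
WHERE {{
  <{source_uri}> ?p ?answer .    # source predicate hop
  ?answer ?fp ?filter .          # filter predicate hop
  ?filter rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{sink_text}")))
}}"""

print(filter_node_case_query("http://example.org/CS262a", "room number"))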

Figure 2: CrowdQ architecture, as proposed in our original vision paper [9]. Offline processing starts at the top right with a search log as input. Online processing starts at the top left with a user keyword query.

Note that we do not find the answer if it is more than one hop from the source node, for example, if CS262a points to an abstract ClassLocation entity that itself points to location details like a Soda Hall node and a Room 310 node.

The candidate generator in CrowdQ is a Python module that generates and executes SPARQL queries corresponding to the cases above. The queries are executed against Jena. Duplicate results are removed, and those with equivalent graph patterns but different answer nodes are grouped. These groups are joined with their associated tagged exemplar query and passed to the crowd system for verification.
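One plausible way to perform the grouping step (1-Hops identical except for their answer node) is to key each result on everything but the answer, as in the sketch below; it assumes result rows shaped like the OneHop fields from the earlier sketch.

from collections import defaultdict

def group_answer_sets(hops):
    # Group 1-Hop rows that differ only in their answer node; each group
    # is a candidate answer set for crowd verification.
    groups = defaultdict(list)
    for h in hops:
        key = (h.source, h.source_predicate, h.filter_predicate, h.filter_node)
        groups[key].append(h.answer)
    return groups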

5.2 Template Verification

5.2.1 Crowd Verification

Candidate generation produces a set of potential 1-Hop graphs without any indication of which are most likely to correspond to the intent of the query. In order to find the answer graph with the most appropriate semantics, natural language representations of the 1-Hop graphs are generated and presented to the crowd. The most common answer from this crowd verification step is used to determine the answer pattern for similar queries, while other common answers may be presented as alternatives on the search results page.

Figure 3 shows the interface presented to the crowd to verify the correct query meaning. Due to the tendency of crowd workers to pick one of the first few answer choices, the list of answer candidates is randomized for each crowd worker. Additionally, since multiple answer graphs are sometimes correct for a query, the interface has the option of multiple-choice or single-select answers from the crowd. Since displaying 1-Hop graph patterns to the crowd might be confusing, we build explicit natural language sentences to describe each answer candidate. For example, in the query “actors in Top Gun”, the 1-Hop with a source node of “Top Gun”, a source predicate of “starring”, the filter predicate “type”, and the filter node “ActorsFromNewYork” would generate the natural language text “objects related to Top Gun with the starring relationship, to ActorsFromNewYork with the type relationship”.
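The sentence shown to workers can be produced by a simple fill-in of the 1-Hop’s pieces. The function below is a sketch of that templating and reproduces the wording of the example above; the exact phrasing used by CrowdQ may differ.

def describe_hop(source, source_predicate, filter_predicate=None, filter_node=None):
    # Render a 1-Hop candidate as the plain-English sentence shown to workers.
    text = f"objects related to {source} with the {source_predicate} relationship"
    if filter_predicate and filter_node:
        text += f", to {filter_node} with the {filter_predicate} relationship"
    return text

print(describe_hop("Top Gun", "starring", "type", "ActorsFromNewYork"))
# -> objects related to Top Gun with the starring relationship,
#    to ActorsFromNewYork with the type relationship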

5.2.2 Templates

At this point, CrowdQ has matched the query to a 1-Hop subgraph representation, with an accompanying structured SPARQL query that would answer the query if run against our knowledge database. We observe, however, that by abstracting parts of the 1-Hop graph structure, we can answer a wide range of related queries. Particularly, by converting the source node of a 1-Hop to a wildcard, we can answer any query that would match the rest of the 1-Hop subgraph. For example, “actors in Top Gun” would generate a 1-Hop with the <Top_Gun> entity as its source node. By replacing <Top_Gun> with the wildcard <MOVIE>, we can trivially match “actors in Forrest Gump”, “actors in Cloud Atlas”, or “actors in Saving Private Ryan”. We call a 1-Hop graph with wildcard replacements a generalized 1-Hop, and the specific 1-Hop from which it was created the origin 1-Hop. Given a generalized 1-Hop, we can trivially produce an equally general structured SPARQL query:

1. For each entity e_i that was generalized from the origin 1-Hop, generate a variable v_i.

2. Replace each occurrence of e_i in the origin 1-Hop’s SPARQL query with v_i.

So we could generalize the SPARQL query for “actors in Top Gun”:

Figure 3: Crowd Relationship Extraction Interface.

SELECT distinct ?actor
WHERE {
  ?actor a class:Actor .
  <Top_Gun> rel:starring ?actor
}

to a query for “actors in <MOVIE>”:

SELECT distinct ?actor
WHERE {
  ?actor a class:Actor .
  ?movie rel:starring ?actor
}

We term these generalized SPARQL queries query templates. Thus, given an incoming keyword query, CrowdQ’s query analyzer builds and stores a corresponding query template.
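Mechanically, the two-step generalization above amounts to substituting a fresh variable for the origin entity in the stored SPARQL text. A minimal sketch, assuming the origin 1-Hop’s SPARQL is available as a string, is:

def generalize(origin_sparql, substitutions):
    # `substitutions` maps each generalized entity to its new variable,
    # e.g. {"<Top_Gun>": "?movie"} (step 1); each occurrence is then
    # replaced in the origin query (step 2).
    template = origin_sparql
    for entity, variable in substitutions.items():
        template = template.replace(entity, variable)
    return template

origin = "SELECT distinct ?actor WHERE { ?actor a class:Actor . <Top_Gun> rel:starring ?actor }"
print(generalize(origin, {"<Top_Gun>": "?movie"}))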

Templates are only useful, however, if incoming queries can be matched against them. The challenge for this matching process is twofold—incoming queries must be matched against templates and results looked up at web search timescales, and matching must be accurate to avoid bombarding users with irrelevant answers. As an initial approach, CrowdQ uses regular expressions to match only queries that bear strong textual similarity to the exemplar. Regular expressions are generated along with the query templates, and simply represent the exact text of the query with potential wildcards. For example, “actors in Top Gun” would generate the regular expression ^actors in (.+)$. This would match “actors in Top Gun”, “actors in Forrest Gump”, etc. The group of text which matches (.+) corresponds to the movie name, and can be linked using our Lucene index to an entity in the knowledge database.
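A stored template and its match step can then be as small as the following sketch; the shape of the template record is an assumption made for illustration.

import re

TEMPLATE = {
    "pattern": re.compile(r"^actors in (.+)$"),
    "sparql": "SELECT distinct ?actor WHERE { ?actor a class:Actor . ?movie rel:starring ?actor }",
    "wildcard": "?movie",
}

def match_template(query, template):
    # Return the wildcard text (e.g. the movie name) if the query matches.
    m = template["pattern"].match(query)
    return m.group(1) if m else None

print(match_template("actors in Forrest Gump", TEMPLATE))  # -> Forrest Gump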

This approach is intentionally limited: it cannot match “Top Gun actors”, for example. However, regular expression matching still allows templates to answer a large number of related queries (see Section 6). We have considered other, more sophisticated approaches as well—see Section 7 for a discussion.

Figure 4: In-browser Results Interface.

5.3 Query Answering

Once the system has stored a body of query templates, it can immediately begin processing incoming queries. In order to successfully answer a new query, CrowdQ must match the keyword query against the set of stored templates, evaluate the query against matching templates, and present results to the user.

In order to successfully match a query against a template, we simply evaluate the query against the template’s associated regular expression. We attempt to match the incoming query against each template in the store. If there are multiple matches, we have several possible interpretations of the query. In such a case, when results are displayed to the user, only the first interpretation is displayed, with a note that other interpretations are available and the option to switch between them.

To evaluate a template, we link wildcard entities in the incoming query to our knowledge database using a Lucene index, substitute them into the template’s SPARQL query, and execute the query against our database. If the result set of such a query is non-empty, we parse the results and display them to the user via the browser interface (Figure 4). If there were no returned results, there are several possible explanations: the query might be unanswerable, our data store might not cover the requested data, or we may have matched against an incorrect template. In such a case, we return an empty result to the user (falling back to a straight listing of Google results) and queue the query for inspection to determine the cause of the failure. This step is currently manual, though it could be implemented as a crowdsourced task.
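Putting the pieces together, the online path is a regex match, an entity substitution, and a SPARQL execution. The sketch below strings those steps together; run_sparql and link_entity stand in for the Jena/SPARQL and Lucene layers described above, and their signatures are assumptions.

def answer_query(query, templates, link_entity, run_sparql):
    # On-line answering: match the keyword query against each stored template,
    # link the wildcard text to an entity, fill in the SPARQL, and execute it.
    for template in templates:
        wildcard_text = match_template(query, template)   # see the regex sketch above
        if wildcard_text is None:
            continue
        entity = link_entity(wildcard_text)                # e.g. "<Forrest_Gump>"
        if entity is None:
            continue
        sparql = template["sparql"].replace(template["wildcard"], entity)
        results = run_sparql(sparql)
        if results:            # show the first non-empty interpretation
            return results
    return None                # fall back to a plain listing of web results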

6. EVALUATION

CrowdQ has been evaluated using a set of queries from the QALD-2 Challenge² for question answering on linked data. The QALD-2 data set contains 200 natural language queries with associated important keywords and gold standard SPARQL queries for the DBPedia [4] and MusicBrainz³ datastores. Since our system doesn’t attempt to answer queries involving aggregation, any queries requiring aggregation were removed from the dataset. Our implementation of CrowdQ uses the Jena [15] framework for storing and querying RDF data and Lucene⁴ for text indexing of the data. All evaluations are run using the DBPedia [4] infobox and category data.

² http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=challenge&q=2
³ http://musicbrainz.org
⁴ http://lucene.apache.org

Evaluation of CrowdQ requires defining quality metrics for the crowd interface, template generation, and template matching. Answerable query classification and query tagging aren’t addressed in the CrowdQ evaluation, but are dealt with in detail in related work [3, 12].

6.1 Crowd Interface

To evaluate the crowd interface, the “gold standard” answer graph (or graphs, if multiple choices are equally correct) was hand-labeled for each query before submission to the crowd interface, and then compared to the results from the crowd. The query corpus used was the QALD-2 DBPedia and MusicBrainz datasets. In the first experiment, answer candidates were generated from the DBPedia graph with category data, and in the second from DBPedia without categories. We found that categories resulted in far more false answer candidates, though a correct answer was present more often when using categories. Without categories, the most common answer was the best answer 64.1% of the time, and the correct answer was one of the top two most common 87.1% of the time. With the category data from DBPedia, the most common answer was the “gold standard” 26.9% of the time and one of the top two 30.7% of the time. This demonstrates the danger of presenting too many potentially confusing results: the crowd often chose results belonging to categories too specific for the query.

6.2 Query Coverage

Figure 5: Sizes of crowd-vetted templates, measured as the number of distinct source entities per template.

An answer system can only be as useful as the number of queries it is able to respond to. We have argued that the 1-Hop abstraction allows CrowdQ to extend the size of its answerable-query space rapidly, by enabling templates with simple yet generalizable semantics. To test this hypothesis, we need a measure of the generality of a template. We define a template’s size to be the number of distinct entities that appear in the source node position of a sub-graph that matches the template’s 1-Hop pattern. For example, suppose a template fitting queries of the form “students in <course>” uses a 1-Hop where the source node points to answer nodes via the source predicate attendee. Then the size of this template is the number of distinct entities that have one or more attendee properties (i.e., the number of classes for which we can find students).

We believe this notion of size is a good measure of the generality of a template because it is proportional to the number of distinct keyword queries—given successful entity extraction—that the template can answer. We measured the sizes of templates vetted by the crowd and found that the majority of our templates cover more than one thousand distinct source entities (see Figure 5). Though our sample is not large (N=39), this is evidence for the efficiency of our approach: using templates enables CrowdQ to expand its query coverage several orders of magnitude faster than a query-at-a-time system. And while these results were measured for templates covering only one data set, DBPedia is a worst case for templates because it is mass-authored and has a very heterogeneous schema. In a more uniformly structured data set, for example a properly modeled relational schema, templates should perform better.

Finally, outliers in template size tend to reveal incorrect templates. On the one hand, a template with a very small size is often overfit. For example, for the exemplar query “members of The Prodigy,” the crowd selected a template with a filter node that required all band members to belong to the category “Members of The Prodigy.” Such a template cannot generalize to bands other than The Prodigy; its template size was 3 distinct entities (The Prodigy, and two bands that Prodigy members also belonged to). On the other hand, very large templates tend to be overly general, and perhaps lack a necessary filter constraint. In the future, we plan to use template size to assess template candidates and optimize crowd review.

Figure 6: Request latency on a local machine.

6.3 Template Matching

For template matching to be useful, the process of matching the query text to the correct template, linking the entity in the template to the data store, and querying the data store for the result set must occur with latency appropriate for a search results page. Figure 6 shows the latency results for local CrowdQ requests over randomly chosen templates and entities which match them (to avoid the effect of caching similar queries). Latency rarely climbs above 50ms, though this figure would be higher once network latency is included. This demonstrates the practicality of our approach even with an unoptimized storage solution.

6.4 Template Accuracy

Users of search engines demand high-quality, accurate results, and an answer engine that returns incorrect results only trains users to ignore its output. We use the standard IR metric of precision—the ratio of correct answers returned to total answers returned—to measure the accuracy of our templates. The QALD-2 benchmark provides questions and answers for the DBPedia data set. We assessed the precision of 13 of our templates against QALD-2 (the rest of our templates were either from the MusicBrainz data set, which did not have usable answer queries, or their QALD-2 answers returned no results). For these templates, we measured a precision of 83%, a respectable number, but one which we believe can be improved with further optimization of the crowd pipeline.

Another standard IR metric is recall, or the ratio of answers returned to answers requested. On our test set, recall was a low 46%, owing primarily to limitations of the Lucene full-text search we depend on for entity extraction at query time. We believe that using more sophisticated entity extraction tools would significantly improve recall, but such work is out of scope for our first prototype.

7. FUTURE WORK

In the scope of the semester, we were able to prototype and evaluate our approach to understanding queries with the crowd, but there are a number of areas in which we can continue to build upon our ideas.

Closing the loop. Our current system uses a hand-tagged query corpus to generate templates. By implementing state-of-the-art NLP techniques for entity extraction and recognition, augmented with crowdsourced verification steps, we can automate this step. This will enable us to use our live query interface as a feedback loop for template generation: incoming queries that do not match any templates can be immediately queued for processing by the query analyzer, leading to the desirable behavior that users who find no results for a query may return minutes later to find that their queries can now be answered.

Improved use of the crowd. Our experiments showed that in the majority of cases, the crowd can be used successfully to interpret queries. However, we can continue to develop our interface to improve the crowd’s accuracy. For example, we detected a bias in the crowd’s choices towards earlier items in the list of candidates, possibly indicating that crowd workers don’t bother to read all candidates before answering. In order to ameliorate this effect, we could experiment with breaking candidates into subsets to minimize the number of candidates each worker must consider. Additionally, crowd accuracy can be improved by adding in extra verification steps, both to confirm initial interpretations of queries and to verify that generalized templates preserve their desired semantics.

Beyond simply using the crowd to generate accurate query templates, we can begin to leverage an engaged user base in order to fundamentally improve the system. By making the results page more interactive, we could allow users to indicate issues with the results returned by CrowdQ. Users’ responses would constitute input from a ‘secondary crowd’, which could correct errors in templates or knowledge base data, or even define the types and format of returned data.

Support for more complex query semantics. One of the major constraints of the current system is its inability to handle queries which have more than two terms, and which do not correspond to a 1-Hop semantic subgraph in the knowledge database. Future work could easily expand on this. Allowing more terms in search queries would complicate the graph search problem, as discovered subgraphs would need to match against more than just a single source node and sink text. Matching subgraphs with more complex structures than 1-Hop graphs would allow us to interpret keyword queries with more involved semantics (“bay area mayors”, for example), but would also result in more complexity in the graph exploration. A combination of heuristics for graph search and increased crowd involvement in the earlier phases of candidate generation could help limit this complexity.

Generation of more general templates without loss of accuracy. Templates currently can only match keyword queries with identical semantic structure, with text that matches a simple regular expression. While this limited model is remarkably effective at expanding knowledge of a single query to a large number of similar queries, the model could be improved. First, we could increase the number of queries that match a stored template by developing a more sophisticated matching algorithm. For example, following the work of Pound et al. [16], we can model queries as ordered sets of semantic types (e.g., “capital of Canada” becomes {<property> <relationship> <entity>}), then train a model to estimate whether an incoming query matches each template. This would allow us to match incoming queries against templates without requiring identical syntax. For example, our current system would build a template for “capital of Canada” that could match “capital of France”, but not “France’s capital”. A machine learning-based model could successfully determine that the two forms are identical.

Additionally, we could expand on the semantic structures recognized by a template by integrating NLP tools such as stemmers and WordNet to capture queries with synonymous (but not identical) semantics. For example, if our knowledge database represented the “acted in” relationship with both the rel:actedIn and the rel:starredIn relations, our current system might not find rel:starredIn for a query like “actors in Top Gun”, as there is little textual similarity between “actors” and “starredIn”. Using word stems and synsets, however, we could discover that “act in” is related to “star in”, and match the query against our template.
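As a rough sketch of that idea, WordNet synsets (via NLTK) and a stemmer can expand a query term into related search terms before matching it against relation labels. This is only an illustration of the proposed extension, not something the current prototype does, and the terms it returns depend on WordNet’s coverage.

# Requires: pip install nltk; then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

def expand_terms(word, pos=wn.VERB):
    # Return the word's stem plus the lemma names of its WordNet synsets,
    # for use as extra search terms against labels like rel:starredIn.
    stemmer = PorterStemmer()
    terms = {stemmer.stem(word)}
    for syn in wn.synsets(word, pos=pos):
        for lemma in syn.lemma_names():
            terms.add(lemma.replace("_", " "))
    return terms

print(expand_terms("act"))   # synonyms such as "play" may appear, depending on WordNet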

Evaluation of system scalability. While we showed that our current template system is sufficient to respond at interactive timescales to user queries, we currently store a limited number of templates. We could evaluate how the system performs at scale, and develop more sophisticated storage and lookup mechanisms to improve performance when there are many templates in the database and users query templates with high concurrency. We expect that templates would work well in a distributed, highly available store, as user queries are read-only and templates are immutable.

8. CONCLUSION

We have introduced CrowdQ, a search engine that leverages the crowd to understand keyword queries and build them into general templates that can answer large classes of similar queries. By limiting the keyword queries we interpret to those that correspond to 1-Hop semantic subgraphs over our knowledge database, we ensure that our templates return precise and accurate answers to the queries they address. We have found that the crowd can verify query semantics correctly with reasonable success. Our prototype provides a low-latency interface for the delivery of structured answers to unstructured queries, especially those in the long tail of complex queries with which traditional keyword-based information retrieval systems struggle. Thus, CrowdQ represents a promising step towards building search engines that truly understand user queries and can respond immediately with structured answers to complex questions.

9. REFERENCES

[1] G. Agarwal, G. Kabra, and K. Chang. Towards rich query interpretation: walking back and forth for mining query templates. In Proceedings of the 19th International Conference on World Wide Web, pages 1–10. ACM, 2010.

[2] O. Alonso and M. Lease. Crowdsourcing for information retrieval: principles, methods, and applications. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1299–1300. ACM, 2011.

[3] M. Bernstein, J. Teevan, S. Dumais, D. Liebling, and E. Horvitz. Direct answers for search queries in the long tail. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems, pages 237–246. ACM, 2012.

[4] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia: a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009.

[5] M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1):538–549, 2008.

[6] L. Chilton and J. Teevan. Addressing people’s information needs directly in a web search result page. In Proceedings of the 20th International Conference on World Wide Web, pages 27–36. ACM, 2011.

[7] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: an interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[8] G. Demartini, D. Difallah, and P. Cudre-Mauroux. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469–478. ACM, 2012.

[9] G. Demartini, B. Trushkowsky, T. Kraska, and M. J. Franklin. CrowdQ: crowdsourced query understanding. In CIDR, 2013.

[10] M. Fernandez, V. Lopez, M. Sabou, V. Uren, D. Vallet, E. Motta, and P. Castells. Semantic search meets the web. In Semantic Computing, 2008 IEEE International Conference on, pages 253–260. IEEE, 2008.

[11] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: answering queries with crowdsourcing. 2011.

[12] J. Guo, G. Xu, X. Cheng, and H. Li. Named entity recognition in query. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–274. ACM, 2009.

[13] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.

[14] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Crowdsourced databases: query processing with people. In CIDR, 2011.

[15] B. McBride. Jena: a semantic web toolkit. IEEE Internet Computing, 6(6):55–59, 2002.

[16] J. Pound, A. Hudek, I. Ilyas, and G. Weddell. Interpreting keyword queries over web knowledge bases. In Proceedings of the 21st International Conference on Information and Knowledge Management, 2012.

[17] F. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.

[18] T. Tran, H. Wang, S. Rudolph, and P. Cimiano. Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data. In Data Engineering, 2009. ICDE ’09. IEEE 25th International Conference on, pages 405–416. IEEE, 2009.

[19] R. White, M. Bilenko, and S. Cucerzan. Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 159–166. ACM, 2007.

[20] Q. Zhou, C. Wang, M. Xiong, H. Wang, and Y. Yu. SPARK: adapting keyword query to semantic search. The Semantic Web, pages 694–707, 2007.