Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets
Thomson Reuters © 2014. Confidential. All Rights Reserved. No part of this document may be disclosed, reproduced or used in any form without the prior permission of Thomson Reuters
TR DISCOVER
DeZhao Song, Frank Schilder, Charese Smiley…
TR Corporate Research and Development, Eagan, Minnesota
Chris Brew
TR Corporate Research & Development, London
ML Prague, April 23rd 2016
Outline
• TR Discover: NLP as part of the solution to a business problem.
– Problem
– Technologies used
– Demonstration
– Reflections
• What is it like to be a scientist working in a business setting?
About me
• B.Sc Chemistry, Bristol
• Search Examiner, European Patent Office, Berlin, Germany
• M.Sc and D.Phil, Sussex, with Steve Isard in EP
• Postdoc at Edinburgh, Scotland
• Sharp Laboratories of Europe, Oxford
• Research (and faculty-ish) positions at Edinburgh
• Core faculty in Linguistics and CSE, OSU, Columbus OH, USA
• Educational Testing Service, Princeton, NJ, USA
• Nuance Communications, Sunnyvale, CA, USA
• Thomson Reuters Corporate Research, London, England
Disclaimer
All opinions are my own, and do not reflect official positions of The Thomson Reuters Corporation
Thomson Reuters’ Business
• Offer people information that they value enough to pay for.
• Professional users
• Many products, each catering for its own market segment.
Thomson Reuters’ Business
• Not an internet company with tens of millions of users
• Company is set up to build long-term trust relationships with clients.
– Many product managers and marketers, who are highly expert in maintaining these relationships
– Privacy and data security are crucial.
• Cannot be “one size fits all”.
• Role of technology is to support and improve products. Primary responsibility for daily support lies with the Platform Group.
Who are Corporate Research & Development?
• Mission: To support Thomson Reuters by carrying out applied research relevant to our businesses.
• Team of approx. 40 researchers, developers, managers, administrators, architects
• Distributed group:
– Rochester, NY, USA
– Eagan, MN, USA
– New York City, NY, USA (3 Times Square, 12th floor)
– London, UK (1 Mark Square), opened in August 2013
The business context
Financial and Risk
Legal
News
IP&Science
The technology context
• Databases (mainly SQL, mainly Oracle)
• Search (mainly Elasticsearch, built on top of Lucene)
• Virtualized servers in data centers
• Front ends mostly in JavaScript with AngularJS + components to aid branding via common look and feel.
• Back ends mostly Java.
• Products often consist of a bundle of related capabilities, packaged together to help potential users understand.
BOLD
• Big
• Open
• Linked
• Data
The Knowledge Graph
Experts and non-experts
• There are also expert and non-expert professional users
– Cortellis (product for drug companies)
  • First-time user: asks broad questions. No idea what is available. Needs whatever guidance we can give.
  • Expert user: knows roughly what is available, but may need help locating what they want.
– Common thread: users are trying to do something specific, such as a market overview, a comparison, or verification of a hunch about a trend. Give them a data visualization, not just raw data.
Expert user
Natural Language Query: TR-Discover
• Keyword-based search is not enough to express user intent.
• What if the user could type queries, and be guided towards things that our system can answer?
– Experts and first-timers alike can access through NL
– Enables discovery of data
– Capture of user intent allows well-targeted analytics
• This is not new. There have been NL database query systems since the 60s, but these tend to be hard-wired to specific databases and their schemas. We want a reusable tool.
Placeholder for demo.
Available to you at http://cortellislabs.com/. You do have to register, but anyone can. NB. Beta version. Works, but has rough edges.
First-time user
Market technology trends
NER Sentiment Analytics
Comparing top 10 indications for companies for Drugs having a primary indication of pain
How we did it
• Feature-based context-free grammar, using the formalism of NLTK.
• Real logic-based formal semantics.
• Autosuggest based on grammar, logical form and heuristics derived from our databases.
• Query via translation from logic to SPARQL or SQL
– SQL is just for now, for efficiency.
– But we plan to keep logic as a separate level, not translate directly to query language.
The Grammar
• Feature-based context-free grammar, using the formalism of NLTK.
– Grammar captures selectional restrictions relevant to the drug domain.
– Adding a new domain should (mainly) be a matter of adding new lexical entries.
Grammar
• The word “drugs” is plural, and has λx.drug(x) as its semantics
• For now, prepositional phrases have features that enforce very tight attachment preferences. This is going to break, but OK for now.
• The type of a verb specifies both the potential subject-type and object-type, which can be used to filter out nonsensical questions like “drugs headquartered in the U.S”.
G1: Nom → N
G2: NP → Nom
G3: NPbar → NP
G4: NPbar → NPbar VPbar
G5: VPbar → TV NPbar
Lex1: N[type=drug, num=pl, sem=<λx.drug(x)>] → 'drugs'
Lex2: TV[TYPE=[drug,org,dev], sem=<λX x.X(λy.dev_org_drug(y,x))>, tns=past, NUM=?n] → 'developed by'
Lex3: TV[TYPE=[org,country,hq], NUM=?n] → 'headquartered in'
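The way verb types filter out nonsensical questions can be sketched in plain Python, without NLTK. This is only an illustration of the idea, not the TR Discover implementation: the dictionaries and the function `is_well_typed` are hypothetical, and the noun "companies" is an assumed extra lexical entry that is not on the slide.

```python
# Hypothetical mini-lexicon mirroring the TYPE features in the lexical entries:
# each verb lists (subject type, object type, relation name).
NOUN_TYPES = {"drugs": "drug", "companies": "org"}  # "companies" is assumed
VERB_TYPES = {
    "developed by": ("drug", "org", "dev"),
    "headquartered in": ("org", "country", "hq"),
}
ENTITY_TYPES = {"Merck": "org", "the U.S": "country"}


def is_well_typed(subject: str, verb: str, obj: str) -> bool:
    """Return True if subject and object have the types the verb expects."""
    subj_type, obj_type, _relation = VERB_TYPES[verb]
    return (
        NOUN_TYPES.get(subject) == subj_type
        and ENTITY_TYPES.get(obj) == obj_type
    )


# "drugs developed by Merck" is accepted; "drugs headquartered in the U.S"
# is rejected because the verb expects a subject of type org, not drug.
print(is_well_typed("drugs", "developed by", "Merck"))
print(is_well_typed("drugs", "headquartered in", "the U.S"))
print(is_well_typed("companies", "headquartered in", "the U.S"))
```

In a feature grammar the same check happens through unification of the TYPE features during parsing, so an ill-typed question simply receives no parse.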
Query translation input: Drugs developed by Merck
Query translation output: Drugs developed by Merck
• This was SPARQL. It works, but is by far the slowest part of the system
• Similar translation for SQL.
– In our demo, we can use the fact that we know all the words of the grammar to make the database small.
– This lets us replace the big costly knowledge graph with a single-file SQLite database.
• Yes, we know it won’t scale
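A rough idea of what the SQL route for "Drugs developed by Merck" could look like, using Python's standard `sqlite3` module. The table layout, the column names, and the rows are all made up for illustration (the drug labels are placeholders, not real data); the actual demo schema is not shown in the talk.

```python
import sqlite3

# In-memory database standing in for the single-file SQLite database.
# Schema and rows are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE company (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE drug (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE develops (company_id INTEGER, drug_id INTEGER);
    INSERT INTO company VALUES (1, 'Merck'), (2, 'ExampleCo');
    INSERT INTO drug VALUES (10, 'drug_a'), (11, 'drug_b'), (12, 'drug_c');
    INSERT INTO develops VALUES (1, 10), (1, 12), (2, 11);
    """
)


def drugs_developed_by(connection, company_label):
    """Return the labels of drugs linked to the named company, sorted by label."""
    rows = connection.execute(
        """
        SELECT drug.label
        FROM drug
        JOIN develops ON develops.drug_id = drug.id
        JOIN company ON company.id = develops.company_id
        WHERE company.label = ?
        ORDER BY drug.label
        """,
        (company_label,),  # parameter binding avoids quoting problems
    ).fetchall()
    return [label for (label,) in rows]


result = drugs_developed_by(conn, "Merck")
print(result)
```

The joins play the role of the triple patterns in the graph query: one join per relationship, one WHERE clause per literal constraint.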
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX example: <http://www.example.com#>
select ?x
where {
  ?id042 rdfs:label 'Merck' .
  ?id042 rdf:type example:Company .
  ?x rdf:type example:Drug .
  ?id042 example:develops ?x .
}
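Keeping logic as a separate level means the query text is generated from a logical form rather than written by hand. A minimal sketch of that step follows, assuming a very simple representation of the logical form as a list of tuples; the tuple format and the functions `atom_to_triple` and `to_sparql` are inventions for this example, not the system's actual data structures.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "example": "http://www.example.com#",
}


def atom_to_triple(atom: tuple) -> str:
    """Turn one logical atom (hypothetical tuple format) into a triple pattern."""
    kind = atom[0]
    if kind == "type":
        _, var, cls = atom
        return f"?{var} rdf:type example:{cls} ."
    if kind == "label":
        _, var, text = atom
        # Escape backslashes and single quotes inside the string literal.
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        return f"?{var} rdfs:label '{escaped}' ."
    if kind == "rel":
        _, subj, pred, obj = atom
        return f"?{subj} example:{pred} ?{obj} ."
    raise ValueError(f"unknown atom kind: {kind!r}")


def to_sparql(answer_var: str, atoms: list) -> str:
    """Assemble prefixes, SELECT clause and triple patterns into one query."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items()]
    lines.append(f"SELECT ?{answer_var}")
    lines.append("WHERE {")
    lines.extend("  " + atom_to_triple(atom) for atom in atoms)
    lines.append("}")
    return "\n".join(lines)


# Logical form for "Drugs developed by Merck", written as flat atoms.
logical_form = [
    ("label", "id042", "Merck"),
    ("type", "id042", "Company"),
    ("type", "x", "Drug"),
    ("rel", "id042", "develops", "x"),
]
query = to_sparql("x", logical_form)
print(query)
```

Because the logical form is independent of the target, a second backend (for example one that emits SQL joins) can reuse the same atoms.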
Autosuggest
• Use the grammar to calculate possible continuations from what we have so far.
– Currently this process does not use a full-fledged parser, and relies on the fact that the grammar is carefully engineered to minimize local and eradicate global ambiguity.
– I want to achieve a tighter integration with the parser, and generate predictions based on elements present in the parser’s chart, allowing more ambiguity
• Rank suggestions by preferring concepts that correspond to nodes in the RDF graph that are involved in many relationships.
– When we have large enough query logs we hope to add in an additional preference component based on a domain-specific n-gram language model.
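The two ingredients of autosuggest described above (grammar-driven continuations, then ranking by how connected a concept is) can be sketched as follows. This is a toy under strong assumptions: the input is already split into phrases, the "grammar" is reduced to typed lookup tables, and the edge list is invented to stand in for the RDF graph. `ExampleCo`, `Elbonia`, and the drug names are placeholders.

```python
from collections import Counter

# Hypothetical typed lexicon: nouns, verbs as (subject type, object type),
# and named entities with their types.
NOUNS = {"drugs": "drug", "companies": "org"}
VERBS = {"developed by": ("drug", "org"), "headquartered in": ("org", "country")}
ENTITIES = {"Merck": "org", "ExampleCo": "org", "Elbonia": "country"}

# Toy edge list standing in for relationships in the RDF graph.
EDGES = [
    ("Merck", "drug_a"),
    ("Merck", "drug_b"),
    ("Merck", "drug_c"),
    ("ExampleCo", "drug_d"),
    ("ExampleCo", "Elbonia"),
]


def degree(edges):
    """Count how many relationships each node takes part in."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


def suggest(phrases, edges):
    """Return type-compatible continuations, best-connected first."""
    if not phrases:
        candidates = list(NOUNS)
    elif len(phrases) == 1:
        noun_type = NOUNS.get(phrases[0])
        candidates = [v for v, (subj, _obj) in VERBS.items() if subj == noun_type]
    elif len(phrases) == 2 and phrases[1] in VERBS:
        wanted = VERBS[phrases[1]][1]
        candidates = [e for e, etype in ENTITIES.items() if etype == wanted]
    else:
        candidates = []
    counts = degree(edges)
    # Higher degree first; ties broken alphabetically for stable output.
    return sorted(candidates, key=lambda c: (-counts[c], c))


print(suggest(["drugs"], EDGES))
print(suggest(["drugs", "developed by"], EDGES))
```

A version driven by the parser's chart would replace the length-based branching with the set of categories the chart predicts next, and a query-log language model could be added as a second term in the sort key.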
What is it like being a scientist in the business world?
• It varies with the DNA of the organization…
– ETS
– Nuance
– Thomson Reuters