Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

Thomson Reuters © 2014. Confidential. All Rights Reserved. No part of this document may be disclosed, reproduced or used in any form without the prior permission of Thomson Reuters

TR DISCOVER

DeZhao Song, Frank Schilder, Charese Smiley…
TR Corporate Research and Development, Eagan, Minnesota

Chris Brew
TR Corporate Research & Development, London

ML Prague, April 23rd 2016

Outline

• TR Discover: NLP as part of the solution to a business problem.

– Problem

– Technologies used

– Demonstration

– Reflections

• What is it like to be a scientist working in a business setting?

About me

• B.Sc Chemistry, Bristol
• Search Examiner, European Patent Office, Berlin, Germany
• M.Sc and D.Phil, Sussex, with Steve Isard in EP
• Postdoc at Edinburgh, Scotland
• Sharp Laboratories of Europe, Oxford
• Research (and faculty-ish) positions at Edinburgh
• Core faculty in Linguistics and CSE, OSU, Columbus OH, USA
• Educational Testing Service, Princeton, NJ, USA
• Nuance Communications, Sunnyvale, CA, USA
• Thomson Reuters Corporate Research, London, England

Disclaimer

All opinions are my own, and do not reflect official positions of The Thomson Reuters Corporation

Thomson Reuters’ Business

• Offer people information that they value enough to pay for.
• Professional users
• Many products, each catering for its own market segment.

Thomson Reuters’ Business

• Not an internet company with tens of millions of users
• Company is set up to build long-term trust relationships with clients.
– Many product managers and marketers, who are highly expert in maintaining these relationships
– Privacy and data security are crucial.
• Cannot be “one size fits all”.
• Role of technology is to support and improve products. Primary responsibility for daily support lies with the Platform Group.

Who are Corporate Research & Development?

• Mission: To support Thomson Reuters by carrying out applied research relevant to our businesses.
• Team of approx. 40 researchers, developers, managers, administrators, architects
• Distributed group:
– Rochester, NY, USA
– Eagan, MN, USA
– New York City, NY, USA (3 Times Square, 12th floor)
– London, UK (1 Mark Square), opened in August 2013

The business context

Financial and Risk

Legal

News

IP&Science

The technology context

• Databases (mainly SQL, mainly Oracle)
• Search (mainly Elasticsearch, built on top of Lucene)
• Virtualized servers in data centers
• Front ends mostly in JavaScript with AngularJS + components to aid branding via common look and feel.
• Back ends mostly Java.
• Products often consist of a bundle of related capabilities, packaged together to help potential users understand.

BOLD

• Big
• Open
• Linked
• Data

The Knowledge Graph

Experts and non-experts

• There are also expert and non-expert professional users
– Cortellis (product for drug companies)
  • First time user, asks broad questions. No idea what is available. Needs whatever guidance we can give.
  • Expert user. Knows roughly what is available, but may need help locating what they want.
– Common thread: users are trying to do something specific, such as a market overview, a comparison, or verification of a hunch about a trend. Give them a data visualization, not just raw data.

Expert user

Natural Language Query: TR-Discover

• Keyword based search is not enough to express user intent.
• What if the user could type queries, and be guided towards things that our system can answer?
– Experts and first timers alike can access through NL
– Enables discovery of data
– Capture of user intent allows well-targeted analytics
• This is not new: there have been NL database query systems since the 60s, but these tend to be hard wired to specific databases and their schemas. We want a reusable tool.

Placeholder for demo. Available to you at http://cortellislabs.com/. You do have to register, but anyone can. NB. Beta version. Works, but has rough edges.

http://cortellislabs.com/


First-time user

Market technology trends

NER Sentiment Analytics

Comparing top 10 indications for companies for Drugs having a primary indication of pain

How we did it

• Feature-based context-free grammar, using the formalism of NLTK.
• Real logic-based formal semantics.
• Autosuggest based on grammar, logical form and heuristics derived from our databases.
• Query via translation from logic to SPARQL or SQL
– SQL is just for now, for efficiency.
– But we plan to keep logic as a separate level, not translate directly to query language.

The Grammar

• Feature-based context-free grammar, using the formalism of NLTK.
– Grammar captures selectional restrictions relevant to the drug domain.
– Adding a new domain should (mainly) be a matter of adding new lexical entries.

Grammar

• The word “drugs” is plural, and has λx.drug(x) as its semantics
• For now, prepositional phrases have features that enforce very tight attachment preferences. This is going to break, but OK for now.
• The type of a verb specifies both the potential subject-type and object-type, which can be used to filter out nonsensical questions like “drugs headquartered in the U.S”.

G1: Nom → N
G2: NP → Nom
G3: NPbar → NP
G4: NPbar → NPbar VPbar
G5: VPbar → TV NPbar
Lex1: N[type=drug, num=pl, sem=<λx.drug(x)>] → 'drugs'
Lex2: TV[TYPE=[drug,org,dev], sem=<λX x.X(λy.dev_org_drug(y,x))>, tns=past, NUM=?n] → 'developed by'
Lex3: TV[TYPE=[org,country,hq], NUM=?n] → 'headquartered in'
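The type-based filtering can be illustrated without the full grammar. The sketch below is not the TR Discover implementation: it is a minimal stand-in, assuming a hypothetical lexicon in which each noun has one type and each verb carries a (subject, object, relation) triple like the TYPE feature in Lex2 and Lex3. The dictionary names, the function name and the 'companies' entry are invented for illustration.

```python
# Hypothetical lexicon (an assumption, not the real TR Discover lexicon).
# Noun entries map a word to its semantic type, as in Lex1.
NOUN_TYPES = {"drugs": "drug", "companies": "org"}

# Verb entries map a phrase to (subject type, object type, relation),
# mirroring TYPE=[drug,org,dev] and TYPE=[org,country,hq].
VERB_TYPES = {
    "developed by": ("drug", "org", "dev"),
    "headquartered in": ("org", "country", "hq"),
}


def is_well_typed(noun: str, verb: str) -> bool:
    """Return True if the noun's type matches the verb's subject type."""
    subject_type, _object_type, _relation = VERB_TYPES[verb]
    return NOUN_TYPES[noun] == subject_type


print(is_well_typed("drugs", "developed by"))      # True
print(is_well_typed("drugs", "headquartered in"))  # False
```

In the real system this check falls out of feature unification in the grammar; the dictionary lookup here only shows why "drugs headquartered in ..." is rejected while "drugs developed by ..." is accepted.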

Query translation input: Drugs developed by Merck

Query translation output: Drugs developed by Merck

• This was SPARQL. It works, but is by far the slowest part of the system
• Similar translation for SQL.
– In our demo, we can use the fact that we know all the words of the grammar to make the database small.
– This lets us replace the big costly knowledge graph with a single-file SQLite database.
  • Yes, we know it won’t scale
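A small sketch of what the SQL side might look like for "Drugs developed by Merck", using Python's built-in sqlite3 module. The table layout, column names and drug labels are all assumptions made for illustration; the slides do not describe the actual demo schema.

```python
import sqlite3

# In-memory database for the sketch; passing a file path instead would
# give a single-file database.
conn = sqlite3.connect(":memory:")

# Hypothetical schema and invented placeholder rows (not real data).
conn.executescript("""
CREATE TABLE company (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE drug (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE develops (company_id INTEGER, drug_id INTEGER);
INSERT INTO company VALUES (1, 'Merck'), (2, 'ExampleCo');
INSERT INTO drug VALUES (10, 'drug_a'), (11, 'drug_b'), (12, 'drug_c');
INSERT INTO develops VALUES (1, 10), (1, 12), (2, 11);
""")


def drugs_developed_by(connection, company_label):
    """Return labels of drugs linked to the named company, sorted by label."""
    rows = connection.execute(
        """SELECT d.label
           FROM drug d
           JOIN develops dv ON dv.drug_id = d.id
           JOIN company c ON c.id = dv.company_id
           WHERE c.label = ?
           ORDER BY d.label""",
        (company_label,),
    )
    return [label for (label,) in rows]


print(drugs_developed_by(conn, "Merck"))  # ['drug_a', 'drug_c']
```

The company label is passed as a bound parameter rather than spliced into the SQL string, which matters once the text comes from user-typed queries.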

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX example: <http://www.example.com#>
select ?x
where {
  ?id042 rdfs:label 'Merck' .
  ?id042 rdf:type example:Company .
  ?x rdf:type example:Drug .
  ?id042 example:develops ?x .
}
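To make the "logic as a separate level" idea concrete, here is a minimal sketch of generating a query of this shape from an intermediate representation. The RelationQuery class and to_sparql function are hypothetical: they assume the logical form has already been flattened into one relation between a named entity and the answer variable, which is much simpler than the lambda-calculus semantics the grammar produces.

```python
from dataclasses import dataclass

PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX example: <http://www.example.com#>\n"
)


@dataclass(frozen=True)
class RelationQuery:
    """Hypothetical flattened logical form: find x related to a named entity."""
    answer_class: str   # class of the answer variable, e.g. "Drug"
    predicate: str      # relation from the entity to the answer, e.g. "develops"
    entity_class: str   # class of the named entity, e.g. "Company"
    entity_label: str   # label of the named entity, e.g. "Merck"


def to_sparql(query: RelationQuery, entity_var: str = "?e") -> str:
    """Render the flattened logical form as a SPARQL select query."""
    # Escape backslashes and single quotes so the label stays a valid literal.
    label = query.entity_label.replace("\\", "\\\\").replace("'", "\\'")
    patterns = [
        f"{entity_var} rdfs:label '{label}' .",
        f"{entity_var} rdf:type example:{query.entity_class} .",
        f"?x rdf:type example:{query.answer_class} .",
        f"{entity_var} example:{query.predicate} ?x .",
    ]
    return PREFIXES + "select ?x\nwhere {\n  " + "\n  ".join(patterns) + "\n}"


sparql = to_sparql(RelationQuery("Drug", "develops", "Company", "Merck"))
print(sparql)
```

Because the query is rendered from RelationQuery and not from the sentence itself, a second renderer targeting SQL could consume the same object, which is the point of keeping the logical level separate.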

Autosuggest

• Use the grammar to calculate possible continuations from what we have so far.
– Currently this process does not use a full-fledged parser, and relies on the fact that the grammar is carefully engineered to minimize local and eradicate global ambiguity.
– I want to achieve a tighter integration with the parser, and generate predictions based on elements present in the parser’s chart, allowing more ambiguity
• Rank suggestions by preferring concepts that correspond to nodes in the RDF graph that are involved in many relationships.
– When we have large enough query logs we hope to add in an additional preference component based on a domain specific n-gram language model.
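The ranking heuristic can be sketched as a degree count over triples. This is an illustration only: the triples are invented placeholders, the function names are hypothetical, and counting a node once per triple it appears in is one plausible reading of "involved in many relationships", not a description of the actual scoring.

```python
from collections import Counter

# Invented toy triples standing in for the RDF graph (not real data).
TRIPLES = [
    ("company_a", "develops", "drug_a"),
    ("company_a", "develops", "drug_b"),
    ("company_a", "headquartered_in", "country_a"),
    ("company_b", "develops", "drug_c"),
    ("company_c", "headquartered_in", "country_a"),
]


def node_degrees(triples):
    """Count how many triples each node takes part in, as subject or object."""
    degrees = Counter()
    for subject, _predicate, obj in triples:
        degrees[subject] += 1
        degrees[obj] += 1
    return degrees


def rank_suggestions(candidates, degrees):
    """Order candidates by descending degree, breaking ties alphabetically."""
    return sorted(candidates, key=lambda name: (-degrees[name], name))


degrees = node_degrees(TRIPLES)
print(rank_suggestions(["company_b", "company_c", "company_a"], degrees))
# ['company_a', 'company_b', 'company_c']
```

A score from an n-gram language model over query logs could later be combined with the degree in the sort key, but how the two would be weighted is not stated in the slides.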

What is it like being a scientist in the business world?

• It varies with the DNA of the organization…
– ETS
– Nuance
– Thomson Reuters
