Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Relationship Discovery Across Public Data Webinar Ontotext, 2016

Transcript of Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Page 1: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Relationship Discovery Across Public Data Webinar

Ontotext, 2016

Page 2: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Relationship Discovery Webinar

Presentation Outline

• Use cases: Relation discovery and Media monitoring
• FactForge-News open-data playground
• Relationship Discovery Examples
• Media Monitoring Examples
• Panama Papers and Global Legal Entity Identifier as Open Data
• Mapping Datasets to DBpedia with the GraphDB Lucene Connector
• Tracing Panama Papers entities in the news

Aug 2016

Page 3: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Using FIBO and Open Data to Discover Relationships

Relation Discovery Case

Mar 2016 #3

• Find suspicious relationships like:
− A company in the USA controls
− another company in the USA
− through a company in an off-shore zone

• Show news relevant to them

Page 4: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Linking News to Big Knowledge Graphs

• The DSP platform links text to knowledge graphs

• One can navigate from news to concepts, entities and topics, and from there to other news

Try it at http://now.ontotext.com

Page 5: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Semantic Media Monitoring

For each entity:

• Popularity trends

• Relevant news

• Related entities

• Knowledge graph information

Try it at http://now.ontotext.com

Page 7: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Our approach to Big Data

1. Integrate relevant data from many sources
− Build a Big Knowledge Graph from proprietary databases and taxonomies integrated with millions of facts of Linked Data
2. Infer new facts and unveil relationships
− Perform reasoning across data from different sources
3. Interlink text with big data
− Use text-mining to automatically discover references to concepts and entities
4. Use a NoSQL graph database for metadata management, querying and search

Page 8: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

FF-NEWS: Data Integration and Loading

• DBpedia (the English version only): 496M statements
• Geonames (all geographic features on Earth): 150M statements
− owl:sameAs links between DBpedia and Geonames: 471K statements
• Company registry data (GLEI): 3M statements
• Panama Papers DB (#LinkedLeaks): 20M statements
• News metadata (from NOW): 145M statements
• Total size: 1 026M statements
− Mapped to FIBO; 724M explicit statements + 302M inferred statements

Page 9: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

News Metadata

• Metadata from Ontotext’s Dynamic Semantic Publishing platform
− Automatically generated as part of the NOW.ontotext.com semantic news showcase
• News stream from Google since Feb 2015, about 10k news/month
− ~70 tags (annotations) per news article
• Tags link text mentions of concepts to the knowledge graph
− Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases

Page 10: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

News Metadata

Category                  Count
International             52 074
Science and Technology    23 201
Sports                    20 714
Business                  15 155
Lifestyle                 11 684
Total                     122 828

Mentions / entity type    Count
Keyphrase                 2 589 676
Organization              1 276 441
Location                  1 260 972
Person                    1 248 784
Work                      309 093
Event                     258 388
RelationPersonRole        236 638
Species                   180 946

Page 11: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Class Hierarchy Map (by number of instances)

Left: The big picture
Right: dbo:Agent class (2.7M organizations and persons)

Page 12: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Sample queries at http://ff-news.ontotext.com

F1: Big cities in Eastern Europe
F2: Airports near London
F3: People and organizations related to Google
F4: Top-level industries by number of companies

Available as Saved Queries at http://ff-news.ontotext.com/sparql

Note 1: Open Saved Queries with the folder icon in the upper-right corner
Note 2: FF-NEWS is still in beta testing, but it is available to play with

Page 14: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Offshore control example

Query: Find companies that control other companies in the same country through a company in an off-shore zone

How it works:
1. Establish a control relationship
2. Establish a company-to-country mapping that is good for the purpose
3. Establish an “off-shore criterion”
4. SPARQL it

Page 15: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Off-shore company control example

SELECT *
FROM onto:disable-sameAs
WHERE {
    # c1 controls c3 through the intermediary c2
    ?c1 fibo-fnd-rel-rel:controls ?c2 .
    ?c2 fibo-fnd-rel-rel:controls ?c3 .
    # c1 and c3 share a country (the same variable is reused for both)
    ?c1 ff-map:orgCountry ?c1_country .
    ?c2 ff-map:orgCountry ?c2_country .
    ?c3 ff-map:orgCountry ?c1_country .
    # the intermediary sits in a different country with off-shore provisions
    FILTER (?c1_country != ?c2_country)
    ?c2_country ff-map:hasOffshoreProvisions true .
}
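To make the shape of the off-shore control pattern easier to follow outside a triple store, here is a minimal Python sketch that applies the same three conditions to a toy in-memory dataset. The company names, the country codes and the `offshore` set are invented for illustration and are not taken from FF-NEWS; the real query runs in SPARQL over FIBO-mapped data.

```python
# Toy data: all names and countries are hypothetical, for illustration only.
controls = {("AlphaCorp", "ShellCo"), ("ShellCo", "BetaCorp"), ("AlphaCorp", "GammaLtd")}
org_country = {"AlphaCorp": "US", "ShellCo": "PA", "BetaCorp": "US", "GammaLtd": "US"}
offshore = {"PA"}  # countries flagged as having off-shore provisions


def offshore_control_chains(controls, org_country, offshore):
    """Return (c1, c2, c3) where c1 controls c3 via c2, c1 and c3 share a
    country, and c2 sits in a different, off-shore country."""
    chains = []
    for c1, c2 in controls:
        for mid, c3 in controls:
            if mid != c2:
                continue
            same_country = org_country[c1] == org_country[c3]
            via_offshore = (org_country[c2] != org_country[c1]
                            and org_country[c2] in offshore)
            if same_country and via_offshore:
                chains.append((c1, c2, c3))
    return sorted(chains)


chains = offshore_control_chains(controls, org_country, offshore)
print(chains)
```

The nested loop is quadratic in the number of control links, which is fine for a toy example; a graph database answers the same pattern through its indexes.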

Page 17: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Semantic Media Monitoring/Press-Clipping

• We can trace references to a specific company in the news
− This is fairly standard; however, we can handle syntactic variations in the names, because state-of-the-art Named Entity Recognition technology is used
− More importantly, we correctly determine whether a given mention of “Paris” refers to Paris (the capital of France), Paris in Texas, Paris Hilton, or Paris (the Greek hero)
• We can trace and consolidate references to subsidiaries
• We have a comprehensive industry classification
− The one from DBpedia, but refined to accommodate identifier variations and specialization (e.g. a company classified as dbr:Bank will also be considered classified as dbr:FinancialServices)
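The specialization rule in the industry classification (a bank also counts as a financial-services company) can be sketched as a walk up a narrower-to-broader mapping. This is only an illustration: the `broader` dictionary is hypothetical, and only the dbr:Bank to dbr:FinancialServices link comes from the slide; in FF-NEWS this kind of expansion would presumably be handled by inference in the repository.

```python
# Hypothetical specialization links (narrower -> broader); only the
# Bank -> FinancialServices pair is mentioned on the slide.
broader = {
    "dbr:Bank": "dbr:FinancialServices",
    "dbr:InvestmentBank": "dbr:Bank",  # invented for the example
}


def industries_with_ancestors(industry, broader):
    """Return the industry followed by every broader industry above it."""
    result = []
    seen = set()
    current = industry
    # The `seen` set guards against accidental cycles in the mapping.
    while current is not None and current not in seen:
        seen.add(current)
        result.append(current)
        current = broader.get(current)
    return result


print(industries_with_ancestors("dbr:Bank", broader))
```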

Page 18: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Media Monitoring Queries

F5: Mentions in the news of an organization and its related entities
F7: Most popular companies per industry, including children
F8: Regional exposition of company – normalized

F5 query (http://ff-news.ontotext.com/sparql, with infer=true and sameAs=false):

# F5: Mentions in the news of an organization and its related entities
# - retrieves people related to a given organization with any relation;
#   this would be slow if predicate indices are not switched on
# - retrieves related organizations using ff-map:agentRelation;
#   it generalizes the important relations between agents
#   (people and organizations) from DBPedia
# - the entity itself is also added to the set of "related entities"
#   so that its mentions in the news are easily extracted
# - uses news metadata imported continuously from http://now.ontotext.com
# Change Gazprom to any organization, e.g. type dbr:Berks and press
# Ctrl-Space to auto-complete and get dbr:Berkshire_Hathaway

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

SELECT DISTINCT ?news ?title ?date ?related_entity
{
    { SELECT DISTINCT ?related_entity {
        BIND ( dbr:Gazprom as ?entity )
        { ?related_entity a dbo:Person ; ?p ?entity .
          FILTER NOT EXISTS { ?related_entity dbo:club ?entity . }
        }
        UNION
        { ?related_entity a dbo:Organisation ; dbo:parent ?entity . }
        UNION
        { BIND(?entity as ?related_entity) }
    } }

    ?news pub-old:containsMention / pub-old:hasInstance / pub:exactMatch ?related_entity .
    ?news pub-old:creationDate ?date ; pub-old:title ?title .
}
ORDER BY DESC(?date) LIMIT 1000

Page 19: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

News Popularity Ranking: Automotive

Rank  Company                     News #   Company incl. mentions of child companies   News #
1     General Motors              2722     General Motors                              4620
2     Tesla Motors                2346     Volkswagen Group                            3999
3     Volkswagen                  2299     Fiat Chrysler Automobiles                   2658
4     Ford Motor Company          1934     Tesla Motors                                2370
5     Toyota                      1325     Ford Motor Company                          2125
6     Chevrolet                   1264     Toyota                                      1656
7     Chrysler                    1054     Renault-Nissan Alliance                     1332
8     Fiat Chrysler Automobiles   1011     Honda                                       864
9     Audi AG                     972      BMW                                         715
10    Honda                       717      Takata Corporation                          547
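The difference between the two rankings comes from crediting mentions of child companies to their parent. A minimal sketch of such a roll-up is shown below; the company names, counts and `parent_of` links are hypothetical and are not the figures from the slide. It also simply sums mention counts, so an article that mentions both a parent and a child would be counted twice, which the actual query may well avoid.

```python
# Hypothetical counts and parent links, not the figures from the slide.
direct_mentions = {"ParentCo": 100, "BrandA": 40, "BrandB": 25, "Independent": 90}
parent_of = {"BrandA": "ParentCo", "BrandB": "ParentCo"}


def rollup_to_top_parent(direct_mentions, parent_of):
    """Add each company's mentions to its top-level parent's total and
    return (company, total) pairs sorted by total, highest first."""
    totals = {}
    for company, count in direct_mentions.items():
        top = company
        seen = {top}
        # Follow parent links upwards; `seen` protects against cycles.
        while top in parent_of and parent_of[top] not in seen:
            top = parent_of[top]
            seen.add(top)
        totals[top] = totals.get(top, 0) + count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rollup_to_top_parent(direct_mentions, parent_of)
print(ranking)
```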

Page 20: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

News Popularity: Finance

Rank  Company           News #   Company incl. mentions of controlled   News #
1     Bloomberg L.P.    3203     Intra Bank                             261667
2     Goldman Sachs     1992     Hinduja Bank (Switzerland)             49731
3     JP Morgan Chase   1712     China Merchants Bank                   38288
4     Wells Fargo       1688     Alphabet Inc.                          22601
5     Citigroup         1557     Capital Group Companies                4076
6     HSBC Holdings     1546     Bloomberg L.P.                         3611
7     Deutsche Bank     1414     Exor                                   2704
8     Bank of America   1335     Nasdaq, Inc.                           2082
9     Barclays          1260     JP Morgan Chase                        1972
10    UBS               694      Sentinel Capital Partners              1053

Note: Including investment funds, stock exchanges, agencies, etc.

Page 21: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

News Popularity: Banking

Rank  Company           News #   Company incl. mentions of controlled   News #
1     Goldman Sachs     996      China Merchants Bank *                 38288
2     JP Morgan Chase   856      JP Morgan Chase                        1972
3     HSBC Holdings     773      Goldman Sachs                          1030
4     Deutsche Bank     707      HSBC                                   966
5     Barclays          630      Bank of America                        771
6     Citigroup         519      Deutsche Bank                          742
7     Bank of America   445      Barclays                               681
8     Wells Fargo       422      Citigroup                              630
9     UBS               347      Wells Fargo                            428
10    Chase             126      UBS                                    347

Page 22: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Relations extracted from text

Apr 2016

Subject                          Object                                        Count
dbr:Chrysler                     dbr:Fiat_Chrysler_Automobiles                 455
dbr:NASA                         dbr:Goddard_Space_Flight_Center               69
dbr:Time_Warner_Cable            dbr:Comcast                                   44
dbr:National_Football_League     dbr:New_England_Patriots                      40
dbr:DirecTV                      dbr:AT&T                                      33
dbr:Alcatel-Lucent               dbr:Nokia                                     31
dbr:AOL                          dbr:Verizon_Communications                    30
dbr:University_of_Pennsylvania   dbr:Perelman_School_of_Medicine_at_... UPEN   29
dbr:Time_Warner_Cable            dbr:Charter_Communications                    27
dbr:Continental_Airlines         dbr:United_Airlines                           26

Note: relation types "RelationOrganizationAffiliatedWithOrganization", "RelationAcquisition", "RelationMerger"

Page 24: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Global Legal Entity Identifier (GLEI) data

• Global Markets Entity Identifier (GMEI) Utility data
− The Global Markets Entity Identifier (GMEI) utility is DTCC's legal entity identifier solution offered in collaboration with SWIFT
− We downloaded it as an XML data dump from https://www.gmeiutility.org/
• RDF-ized company records
− Fields: LEI#, legal name, ultimate parent, registered country
− 3M explicit statements for 211 thousand organizations
▪ For comparison, there are 490 000 organizations in DBpedia, and D&B covers more than 200 million
− 10 821 ultimate parent relationships and 1 632 ultimate parents
• 2 800 organizations from the GLEI dump mapped to DBpedia

Page 25: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

GLEI Company Data Sample: ABN-AMRO

lei:businessRegistry "Kamer van Koophandel"^^xsd:string

lei:businessRegistryNumber "34334259"^^xsd:string

lei:duplicateReference data:549300T5O0D0T4V2ZB28

lei:entityStatus "ACTIVE"^^xsd:string

lei:headquartersCity "Amsterdam"^^xsd:string

lei:headquartersState "Noord-Holland"^^xsd:string

lei:legalForm "NAAMLOZE VENNOOTSCHAP"^^xsd:string

lei:legalName "ABN AMRO Bank N.V."^^xsd:string

lei:lei "BFXS5XCH7N0Y05NIXW11"^^xsd:string

lei:registeredCity "Amsterdam"^^xsd:string

lei:registeredCountry "NL"^^xsd:string

lei:registeredPostCode "1082 PP"^^xsd:string

lei:registeredState "Noord-Holland"^^xsd:string

Page 26: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Global Legal Entity Identifier (GLEI) data

#    Ultimate parent                    Children   Country
1    The Goldman Sachs Group, Inc.      1 851      US
2    United Technologies Corporation    427        US
3    Honeywell International Inc.       341        US
4    Morgan Stanley                     228        US
5    Cargill, Incorporated              217        US
6    1832 Asset Management L.P.         202        CA
7    Aegon N.V.                         174        NL
8    Union Bancaire Privée, UBP SA      138        CH
9    Citigroup Inc.                     135        US
10   State Street Corporation           128        US

#    Country              Companies
1    dbr:United_States    103 548
2    dbr:Canada           17 425
3    dbr:Luxembourg       13 984
4    dbr:Sweden           7 934
5    dbr:United_Kingdom   7 421
6    dbr:Belgium          6 868
7    dbr:Ireland          4 762
8    dbr:Australia        4 385
9    dbr:Germany          3 039
10   dbr:Netherlands      2 561

Page 27: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Offshore Leaks Database from ICIJ

• Published by the International Consortium of Investigative Journalists (ICIJ) on 9 May 2016
• A “searchable database” of about 320 000 offshore companies
− 214 000 extracted from the Panama Papers (valid until 2015)
− More than 100 000 from the 2013 Offshore Leaks investigation (valid until 2010)
• CSV extract from a graph database available for download
• https://offshoreleaks.icij.org/

Page 28: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Offshore Leaks Database

Page 29: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Offshore Leaks DB as Linked Open Data

• Ontotext published the Offshore Leaks DB as Linked Open Data
• Available for exploration, querying and download at http://data.ontotext.com
• ONTOTEXT DISCLAIMERS
We use the data as provided by ICIJ. We make no representations and warranties of any kind, including warranties of title, accuracy, absence of errors or fitness for a particular purpose. All transformations, query results and derivative works are used only to showcase the service and technological capabilities, and not to serve as a basis for any statements or conclusions.

Page 30: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Enrichment and structuring of the data

• Relationship type hierarchy
− About 80 relationship types in the original dataset were organized into a property hierarchy
• Classification of officers into Person and Company
− In the original database there is no way to tell whether an officer is a natural person or a company
• Mapping to DBpedia:
− 209 countries referred to in the Offshore Leaks DB are mapped to DBpedia
− About 3000 persons and 300 companies mapped to DBpedia
• Overall size of the repository: 22M statements (20M explicit)

Page 31: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

The RDF-ization Process

• Linked data variant produced without programming
− The raw CSV files are RDF-ized using TARQL, http://tarql.github.io/
− Data was further interlinked and enriched in GraphDB using SPARQL
• The process is documented in a README file
• All relevant artifacts are open-source, available at https://github.com/Ontotext-AD/leaks/
• The entire publishing and mapping took about 15 person-days
− Including data.ontotext.com portal setup, promotion, documentation, etc.
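For readers who have not seen a CSV-to-RDF conversion, the idea behind the RDF-ization step can be sketched in a few lines of Python. This is not TARQL and not the mapping Ontotext used: it is a plain-Python stand-in, and the column names, the sample rows and the `http://example.org/leaks/` namespace are all assumptions made for the example. The actual mappings are in the GitHub repository linked on the slide.

```python
import csv
import io

# Hypothetical sample rows and namespace; the real ICIJ files and the
# vocabulary used by Ontotext may differ.
SAMPLE_CSV = (
    "node_id,name,jurisdiction\n"
    "1001,Example Holdings Ltd,PAN\n"
    "1002,Sample Trading Inc,BVI\n"
)
BASE = "http://example.org/leaks/"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal
    (newlines and other control characters are not handled here)."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text):
    """Turn each CSV row into one triple per non-key column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}entity/{row['node_id']}>"
        for column in ("name", "jurisdiction"):
            value = escape_literal(row[column])
            triples.append(f'{subject} <{BASE}{column}> "{value}" .')
    return triples


triples = rows_to_ntriples(SAMPLE_CSV)
print("\n".join(triples))
```

TARQL expresses the same row-to-triples mapping declaratively as a SPARQL CONSTRUCT query over the CSV columns, which is why the slide can describe the process as done "without programming".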

Page 34: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Mapping datasets to DBpedia

• The task: map people, organizations and locations to IDs in DBpedia
− So that we can analyze the original data with the help of the extra information available in DBpedia and other datasets that are related to it, e.g. Geonames
− For instance, #LinkedLeaks doesn’t contain any extra information about the companies, e.g. industry sector, controlling or controlled companies, etc.
• Specific conditions: we had to map by names
− Other than names, the information about the entities in the source datasets couldn’t help the mapping
▪ Address and country attributes are present, but those turned out to be only marginally useful for mapping
− In both cases we mapped locations only in terms of countries and not finer-grained locations
▪ For this purpose DBpedia geographic data is sufficient, and it is also well mapped to GeoNames

Page 35: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Mapping datasets to DBpedia (2)

• We used the GraphDB connector to Lucene for these mappings
− Using the GraphDB connector, a Lucene index was created for Organizations and People from DBpedia, indexing all sorts of names, descriptions and other textual information for each entity
− The mapping process consists mostly of using the name of the entity from the third-party dataset (in this case Panama Papers or GLEI) as a full-text search (FTS) query, embedded in a SPARQL query
• What does Lucene do better than SPARQL?
− When there is little information other than the name, we benefit from the free-text indexing of Lucene, because it deals well with minor syntactic variations and sorts the results by relevance
− When mapping 300 000 organizations against another 500 000 organizations without a key, a SPARQL query has to consider 300 000 x 500 000 candidate pairs, which is slower than 300 000 Lucene queries
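The argument for an index over pairwise comparison can be illustrated with a small inverted index in Python. This is a deliberately simplified stand-in for Lucene, not its actual scoring: candidates are ranked only by the number of shared name tokens, and the entity names, the `STOPWORDS` set and the helper functions are all invented for the example.

```python
from collections import defaultdict

# Hypothetical names; a real run would index DBpedia labels instead.
target_names = {
    "dbr:Goldman_Sachs": "The Goldman Sachs Group, Inc.",
    "dbr:Morgan_Stanley": "Morgan Stanley",
    "dbr:Citigroup": "Citigroup Inc.",
}
STOPWORDS = {"the", "inc", "group", "ltd"}


def tokens(name):
    """Lower-case, strip punctuation, and drop very common company words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return {t for t in cleaned.split() if t not in STOPWORDS}


# Build the inverted index once: token -> set of entity ids.
index = defaultdict(set)
for entity, name in target_names.items():
    for tok in tokens(name):
        index[tok].add(entity)


def best_match(name):
    """Score candidates by the number of shared tokens; return the top one,
    or None when no token matches."""
    scores = defaultdict(int)
    for tok in tokens(name):
        for entity in index.get(tok, ()):
            scores[entity] += 1
    if not scores:
        return None
    # Sorting first makes ties resolve deterministically by entity id.
    return max(sorted(scores), key=lambda e: scores[e])


print(best_match("GOLDMAN SACHS GROUP INC"))
print(300_000 * 500_000)  # candidate pairs when comparing without an index
```

Each lookup touches only the entities that share a token with the query name, so 300 000 lookups stay far below the 150 billion candidate pairs of a naive all-against-all comparison.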

Page 36: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

#LinkedLeaks Mapping Queries

Companies mapped by industry

Companies mapped in the Finance sector

Politicians mapped

Available as Saved Queries at http://ff-news.ontotext.com/sparql

Note 1: Open Saved Queries with the folder icon in the upper-right corner

Note 2: FF-NEWS is still in beta testing, but it is available to play with

Page 38: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Tracing Panama Papers entities in the news

• After mapping #LinkedLeaks entities to DBpedia identifiers, we can load them, together with the mappings, into the FF-NEWS repository
• This way we have, in a single repository and mapped to one another: #LinkedLeaks data, DBpedia and news metadata
• We can make queries like: give me news mentions of entities that appear in the Panama Papers dataset
• This way the mapping enabled media monitoring at no extra cost
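Once the mappings are loaded, tracing leaked entities in the news is essentially a join through the shared DBpedia identifier. A minimal sketch of that join is given below, with hypothetical identifiers standing in for #LinkedLeaks entities, DBpedia resources and news articles; in FF-NEWS the same join would be written in SPARQL against the repository.

```python
# Hypothetical identifiers; real data would come from the FF-NEWS repository.
leaks_to_dbpedia = {
    "leaks:officer/1": "dbr:Person_A",
    "leaks:entity/2": "dbr:Company_B",
}
news_mentions = [
    ("news:101", "dbr:Company_B"),
    ("news:102", "dbr:Company_C"),
    ("news:103", "dbr:Person_A"),
    ("news:104", "dbr:Company_B"),
]


def news_for_leaked_entities(leaks_to_dbpedia, news_mentions):
    """Group news ids by DBpedia entity, keeping only mapped leak entities."""
    mapped = set(leaks_to_dbpedia.values())
    result = {}
    for news_id, entity in news_mentions:
        if entity in mapped:
            result.setdefault(entity, []).append(news_id)
    return result


hits = news_for_leaked_entities(leaks_to_dbpedia, news_mentions)
print(hits)
```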

Page 39: Gain Super Powers in Data Science: Relationship Discovery Across Public Data

Thank you!

Experience the technology with NOW: Semantic News Portal
http://now.ontotext.com

Play with open data at http://data.ontotext.com and http://ff-news.ontotext.com