Search challenges for collections of book records

Search challenges for collections of book records Roberto Cornacchia ECIR 2014 – Industry day Amsterdam, 16 April 2014 > design > publish > search!

Page 1: Search challenges for collections of book records

Search challenges for collections of book records

Roberto Cornacchia

ECIR 2014 – Industry day, Amsterdam, 16 April 2014

> design > publish > search!

Page 2: Search challenges for collections of book records

Outline

● COMSODE (EU-FP7)
  – Publication platform for Linked Open Data
● Spinque
  – Search modelling
● A use-case from Digital Humanities
  – link, clean, search
● A step further
  – Rank. Everything. Always.
  – Query-time resolution of data conflicts

Page 3: Search challenges for collections of book records

Unlocking the value of L(O)D... is a hot topic

In the public sector

Source: Open Data 500 by The GovLab

In industry

In science

Source: Bradley Allen, SlideShare

Page 4: Search challenges for collections of book records

The COMSODE project has received funding from the Seventh Framework Programme of the European Union under grant agreement number 611358.

COMSODE

Unlock LOD value by improving publication

www.comsode.eu

Page 5: Search challenges for collections of book records

Spinque

● Spin-off of CWI Amsterdam (2009)
● Develops domain-tailored search technology
  – Applied to:
    ● IP, multimedia, cultural heritage, child-friendly, ...
  – Search by Strategy
    ● visual modelling of search processes
  – Rank. Everything. Always.
    ● integrated support for all-round probabilistic search
● Work in progress in COMSODE
  – Search Linked Data

Page 6: Search challenges for collections of book records

A use case in Digital Humanities

● "Can We Rank Scholarly Book Publishers? A Bibliometric Experiment with the Field of History" (Zuccala et al., Journal of the American Society for Information Science and Technology, 2014)
● Goal: indicate publisher prestige quantitatively
  – bibliographic citations to books from journal articles
● Dataset: Elsevier Scopus journal citations
  – Granted via the 2012 Elsevier Bibliometrics Research Program
  – 5.6M citations, 3M from journals to books
  – History & literature
  – Periods 1996-2000 and 2007-2011

Page 7: Search challenges for collections of book records

Elsevier Scopus dataset

Article: "Power and community: The archaeology of slavery at the hermitage plantation", Thomas B., 1998, American Antiquity (journal, history)

Book: MISSISSIPPIAN POLITICAL ECONOMY, Muller J., 1997

citing_eid,cited_eid,source_title,source_id,article_pubyear,authors,article_title,volume,page_start,doctype
4702,232311,"American Antiquity",40554,1996,"Graybill D. (6603866252);Michaelsen J. (7003483600);Neff H. (7005907495);Larson D. (7402633779);Ambos E. (14048059100)","Risk, climatic variability, and the study of southwestern prehistory: An evolutionary perspective",61,217,re
4702,1333725,"American Antiquity",40554,1997,"Raab L. (6601955075);Larson D. (7402633779)","Medieval climatic anomaly and punctuated cultural evolution in coastal Southern California",62,319,ar
4702,7613691,"American Antiquity",40554,1997,"Colten R. (8363369400);Arnold J. (8754215200);Pletka S. (25221793700)","Contexts of cultural change in insular California",62,300,ar
4702,30302643,"Quarternary Science Reviews",26239,1996,"Stuiver M. (7007003882);Reimer P. (7103071876);Taylor R. (26030669400)","Development and extension of the calibration of the radiocarbon time scale: Archaeological applications",15,655,ar
4702,30317536,"Canadian Journal of Earth Sciences",22031,1996,"Dyke A. (7003706220);McNeely R. (7004891098);Hooper J. (7102438470)","Marine reservoir corrections for bowhead whale radiocarbon age determinations",33,1628,ar
4702,30739323,"Journal of Coastal Research",27374,1997,"Mason O. (7004241927);Hopkins D. (7202255075);Plug L. (7801522080)","Chronology and paleoclimate of storm-induced erosion and episodic dune growth across Cape Espenberg spit, Alaska, U.S.A.",13,770,ar
7154,2287569,"American Sociological Review",16929,1997,"Goodwin J. (7402339411)","The libidinal constitution of a high-risk social movement: Affectual ties and solidarity in the Huk rebellion, 1946 to 1954",62,53,re
7154,30495855,"Sociological Theory",18110,1996,"Emirbayer M. (23110549400)","Useful Durkheim",14,109,ar
9412,9986565,"British Journal for the Philosophy of Science",19977,1997,"Eliasmith C. (6603720957);Thagard P. (6701846211)","Waves, Particles, and Explanatory Coherence",48,1,ar
9412,30006171,Gastroenterology,28330,1996,"Hamlet A. (6701690210);Dalenbäck J. (7003418017);Fändriks L. (7005233384);Olbe L. (7006954993)","A mechanism by which Helicobacter pylori infection of the antrum contributes to the development of duodenal ulcer",110,1386,ar

article → cites → book

CSV files → RDF

Page 8: Search challenges for collections of book records

Warm-up

● Load RDF data
  – (subject, predicate, object)
● Most cited publications
● No problem with SPARQL or SQL

SELECT ?publication (COUNT(*) AS ?nCitations)
WHERE { [] scopus:cites ?publication }
GROUP BY ?publication
ORDER BY DESC(?nCitations)

Input triples:

subject       predicate  object
publication1  cites      publication2
publication1  cites      publication3
publication3  publisher  publisher5

Result:

publication   nCitations
publication3  288
publication5  223
publication2  124
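The same warm-up can be sketched outside a query language. The following is a minimal Python illustration, not the Spinque implementation: the triples are made up for the example, and the function name `most_cited` is hypothetical. It mirrors the two steps of the query: select the objects of `cites` triples, then count them.

```python
from collections import Counter

# Hypothetical triples in (subject, predicate, object) form, for illustration only.
triples = [
    ("publication1", "cites", "publication2"),
    ("publication1", "cites", "publication3"),
    ("publication4", "cites", "publication3"),
    ("publication3", "publisher", "publisher5"),
]

def most_cited(triples):
    """Return (publication, nCitations) pairs, most cited first."""
    # Predicate traversal: keep only the objects of 'cites' triples.
    cited = (obj for _, pred, obj in triples if pred == "cites")
    # Aggregation: count occurrences, sorted by descending count.
    return Counter(cited).most_common()

ranking = most_cited(triples)
```

On this toy input, `ranking` lists publication3 before publication2, since it is cited twice.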

Page 9: Search challenges for collections of book records

Warm-up

The warm-up query has two steps: predicate traversal (the scopus:cites pattern) and aggregation (the count per publication).

Page 10: Search challenges for collections of book records

Warm-up... visually: "Search by Strategy"

Page 11: Search challenges for collections of book records

Warm-up... visually: "Search by Strategy"

Elsevier data source

Predicate traversal

Aggregation

Deploy REST API

Data flow

Deploy search engine

Page 12: Search challenges for collections of book records

Back to the original goal: rank publishers

[Diagram: journal articles; cited books; "cited" publishers. Dataset: Elsevier – Scopus (closed data)]

Page 13: Search challenges for collections of book records

Back to the original goal: rank publishers

[Diagram: journal articles; cited books (shown twice); sameAs; "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

● Open Data Node
  – Links books
● Search
  – Uses links
  – On-the-fly matching?

Deploy search engine
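How a search step might use such links can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the identifiers (`scopus:book1`, `worldcat:book9`, and so on), the dictionaries, and the function name `rank_publishers` are all hypothetical, and the sameAs links are treated as certain facts for now.

```python
from collections import Counter

# Hypothetical data: citations come from one dataset, publishers from another,
# and sameAs links connect records that describe the same book.
cites = [
    ("article1", "scopus:book1"),
    ("article2", "scopus:book1"),
    ("article3", "scopus:book2"),
]
same_as = {"scopus:book1": "worldcat:book9", "scopus:book2": "worldcat:book7"}
publisher_of = {"worldcat:book9": "publisher5", "worldcat:book7": "publisher3"}

def rank_publishers(cites, same_as, publisher_of):
    """Count citations per publisher by following sameAs, then publisher."""
    counts = Counter()
    for _, cited_book in cites:
        linked_book = same_as.get(cited_book)       # cross-dataset link
        publisher = publisher_of.get(linked_book)   # publisher of the linked record
        if publisher is not None:                   # skip books without a link
            counts[publisher] += 1
    return counts.most_common()

publisher_ranking = rank_publishers(cites, same_as, publisher_of)
```

Books without a sameAs link simply drop out of the count here, which is one reason the quality of the links matters for the final ranking.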

Page 14: Search challenges for collections of book records

Surprise..

– University Press,Cambridge [England]
– University Press,Cambridge [etc.]
– University Press,"Cambridge, Mass.,"
– University Press,"Cambridge, N.E."
– University Press,"Cambridge, U.K."
– University Press,"Cambridge, UK"
– University Press,Cambridge [U.K.]
– University Press [etc.],Cambridge
– University Press [etc.],"Cambridge [Eng., etc.]"
– University Press [etc.],Cambridge [etc.]
– "University press [etc., etc.]","Cambridge,"
– University Pressf ats collnutz,Cambridge
– University Press of Cambridge,"Boston, Mass."
– University Press of Cambridge,"[Cambridge, Mass.]"
– Univ. of Cambridge,Cambridge
– Univ. P.,Cambridge
– Univ. Pr,Cambridge
– Univ. Pr.,Cambridge
– Univ.Pr.,Cambridge
– Univ. Pr.,Cambridge [u.a.]
– Univ. Pr.,"Cambridge, U.S.A."
– Univ. Pr.,Cambridge [usw.]

2588 variations (just for "Cambridge University Press").

Probably only 2 or 3 distinct entities in there.
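One simple way to start collapsing such variants is to normalise the publisher string into a grouping key. The sketch below is an assumption-laden illustration, not the method used in the talk: the abbreviation table and the function name `normalise_publisher` are hypothetical, and only a handful of variants from the list above are used.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a real one would need to be much larger.
ABBREVIATIONS = {"univ": "university", "pr": "press", "p": "press"}

def normalise_publisher(name: str) -> str:
    """Lowercase, drop bracketed notes such as "[etc.]", expand abbreviations."""
    name = re.sub(r"\[[^\]]*\]", " ", name.lower())
    tokens = re.findall(r"[a-z]+", name)
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

# A few (publisher, place) variants taken from the list above.
variants = [
    ("University Press", "Cambridge [England]"),
    ("University Press [etc.]", "Cambridge"),
    ("Univ. Pr.", "Cambridge"),
    ("Univ.Pr.", "Cambridge"),
    ("Univ. P.", "Cambridge"),
    ("University Press", "Cambridge, Mass."),
]

# Group places by normalised publisher key; the place is kept as evidence.
groups = defaultdict(list)
for publisher, place in variants:
    groups[normalise_publisher(publisher)].append(place)
```

All six variants end up under the key "university press", including the one located in "Cambridge, Mass.", which may well be a different entity. The key alone over-merges, so the place is kept alongside it rather than discarded.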

Page 15: Search challenges for collections of book records

De-duplicate publishers

[Diagram: journal articles; cited books (shown twice); sameAs; "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

Page 16: Search challenges for collections of book records

De-duplicate publishers

● Open Data Node
  – Links duplicates
● Search
  – Uses links
  – On-the-fly matching?

[Diagram: journal articles; cited books (shown twice); sameAs (shown twice); "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

Deploy search engine

Page 17: Search challenges for collections of book records

Is the DH researcher happy?

● Yes. All very nice...
  – ...but...?
● Data are not 100% clean yet.
● Can we rank publishers of books about “women in war”?

The initial database problem needs to deal with uncertainty

Page 18: Search challenges for collections of book records

Uncertainty from ranking

[Diagram: journal articles; cited books; ranked cited books about "women in war"; "cited" publishers; aggregated "cited" publishers; joins; aggregations. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

subject  predicate  object
book1    sameAs     book9
book7    publisher  publisher3
book9    publisher  publisher5

subject
book1
book1
book2

Page 19: Search challenges for collections of book records

Uncertainty from ranking

[Diagram: journal articles; cited books; ranked cited books about "women in war"; "cited" publishers; aggregated "cited" publishers; joins; aggregations. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

subject  predicate  object
book1    sameAs     book9
book7    publisher  publisher3
book9    publisher  publisher5

prob
0.7
0.5
0.4

subject
book1
book1
book2

Page 20: Search challenges for collections of book records

Uncertainty from ranking

[Diagram: journal articles; cited books; ranked cited books about "women in war"; "cited" publishers; aggregated "cited" publishers; probabilistic joins; probabilistic aggregations. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

subject  predicate  object
book1    sameAs     book9
book7    publisher  publisher3
book9    publisher  publisher5

prob
0.7
0.5
0.4

subject
book1
book1
book2

Deploy search engine
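What "probabilistic joins" and "probabilistic aggregations" could look like is sketched below in Python. This is not the Spinque engine: the data, the function names `prob_join` and `prob_aggregate`, and the independence assumption are all choices made for the illustration. Under independence, a join multiplies the probabilities of the matching tuples, and an aggregation combines several pieces of evidence for the same publisher as 1 minus the product of (1 − p).

```python
from collections import defaultdict

# Hypothetical ranked books for a topic query: (book, probability of relevance).
ranked_books = [("book1", 0.7), ("book7", 0.5), ("book9", 0.4)]
# Hypothetical (book, publisher, probability) evidence.
published_by = [
    ("book1", "publisher5", 1.0),
    ("book7", "publisher3", 1.0),
    ("book9", "publisher5", 0.9),
]

def prob_join(left, right):
    """Join on book; assuming independence, probabilities multiply."""
    right_index = defaultdict(list)
    for book, publisher, p in right:
        right_index[book].append((publisher, p))
    return [
        (book, publisher, p_book * p_pub)
        for book, p_book in left
        for publisher, p_pub in right_index[book]
    ]

def prob_aggregate(joined):
    """Project onto publisher; combine independent evidence as 1 - prod(1 - p)."""
    miss = defaultdict(lambda: 1.0)
    for _, publisher, p in joined:
        miss[publisher] *= 1.0 - p
    return {publisher: 1.0 - m for publisher, m in miss.items()}

scores = prob_aggregate(prob_join(ranked_books, published_by))
```

With these made-up numbers, publisher5 receives evidence from two books (0.7 and 0.4 × 0.9 = 0.36) and scores 1 − 0.3 × 0.64 = 0.808, while publisher3 scores 0.5. Other combination rules, such as summing disjoint events, are equally possible; the choice depends on what the probabilities are taken to mean.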

Page 21: Search challenges for collections of book records

More uncertainty from...

[Diagram: journal articles; cited books (shown twice); "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

Page 22: Search challenges for collections of book records

More uncertainty from...

[Diagram: journal articles; cited books (shown twice); "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

Ranking

Page 23: Search challenges for collections of book records

More uncertainty from...

[Diagram: journal articles; cited books (shown twice); "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

Ranking

Fuzzy matching

Page 24: Search challenges for collections of book records

More uncertainty from...

[Diagram: journal articles; cited books (shown twice); "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

Ranking

Priors in data

Fuzzy matching

Page 25: Search challenges for collections of book records

More uncertainty from...

[Diagram: journal articles; cited books (shown twice); "cited" publishers; aggregated "cited" publishers. Datasets: Elsevier – Scopus (closed data), OCLC – WorldCat (open data)]

Ranking

Priors in data

Fuzzy matching

In fact...

Page 26: Search challenges for collections of book records

Rank. Everything. Always.

● Unstructured search: uncertainty is a first-class citizen
● Structured search: let's switch from "facts" to "evidence"
  – Forcing uncertainty into “facts” risks corrupting data and search results
    ● Static data normalisation is good when it comes with high confidence
    ● Otherwise, evidence can be used at query time, depending on the context
  – Strategy blocks contain code for a probabilistic DB
    ● Based on Probabilistic Relational Algebra (Fuhr 1990, Rölleke et al. 2008)
● Let's just call it "search", finally.
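The distinction between static normalisation and query-time evidence can be made concrete with a small sketch. It is only an illustration under assumptions: the threshold value, the link data, and the function name `split_links` are hypothetical and not taken from the talk.

```python
def split_links(links, threshold=0.9):
    """Separate sameAs links into confident ones (candidates for static
    normalisation) and uncertain ones (kept as weighted evidence)."""
    confident = [(a, b) for a, b, p in links if p >= threshold]
    evidence = [(a, b, p) for a, b, p in links if p < threshold]
    return confident, evidence

# Hypothetical (record, record, confidence) sameAs links.
links = [("pubA", "pubB", 0.98), ("pubA", "pubC", 0.6)]
confident, evidence = split_links(links)
```

Links in `confident` could be merged once, ahead of time. Links in `evidence` keep their probability, so that a query can decide, depending on its context, how much weight to give them.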

Page 27: Search challenges for collections of book records

Summary

● The use case shown
  – benefits from LOD
    ● data and results can be expanded / improved
  – benefits from Search by Strategy
    ● probabilistic modelling of search scenarios
● On-going effort in the COMSODE context
  – Open Data Node: good quality LOD
  – Search by Strategy: exploit uncertainty
● Currently
  – improving RDF support (e.g. vocabularies, inference)
  – improving query-time resolution of data conflicts

Page 28: Search challenges for collections of book records

www.spinque.com

www.youropendata.eu

www.comsode.eu

Thank you