Search challenges for collections of book records
Roberto Cornacchia
ECIR 2014 – Industry day, Amsterdam, 16 April 2014
> design > publish > search!
2
Outline
● COMSODE (EU-FP7)
  – Publication platform for Linked Open Data
● Spinque
  – Search modelling
● A use-case from Digital Humanities
  – link, clean, search
● A step further
  – Rank. Everything. Always.
  – Query-time resolution of data conflicts
3
Unlocking the value of L(O)D... is a hot topic
In the public sector
In industry
In science
Sources: Open Data 500 by The GovLab; Bradley Allen, SlideShare
4
The COMSODE project has received funding from the Seventh Framework Programme of the European Union under grant agreement number 611358.
COMSODE
Unlock LOD value by improving publication
www.comsode.eu
5
Spinque
● Spin-off of CWI Amsterdam (2009)
● Develops domain-tailored search technology
  – Applied to:
    ● IP, multimedia, cultural heritage, child-friendly, ...
  – Search by Strategy
    ● visual modelling of search processes
  – Rank. Everything. Always.
    ● integrated support for all-round probabilistic search
● Work in progress in COMSODE
  – Search Linked Data
6
A use case in Digital Humanities
● "Can We Rank Scholarly Book Publishers? A Bibliometric Experiment with the Field of History" (Zuccala et al., Journal of the American Society for Information Science and Technology, 2014)
● Goal: indicate publisher prestige quantitatively
  – bibliographic citations to books from journal articles
● Dataset: Elsevier Scopus journal citations
  – Granted via the 2012 Elsevier Bibliometrics Research Program
  – 5.6M citations, 3M from journals to books
  – History & literature
  – Periods 1996-2000 and 2007-2011
7
Elsevier Scopus dataset
Example citation (diagram):
article: "Power and community: The archaeology of slavery at the hermitage plantation", Thomas B., American Antiquity (journal, history), 1998
book: MISSISSIPPIAN POLITICAL ECONOMY, Muller J., 1997
citing_eid,cited_eid,source_title,source_id,article_pubyear,authors,article_title,volume,page_start,doctype
4702,232311,"American Antiquity",40554,1996,"Graybill D. (6603866252);Michaelsen J. (7003483600);Neff H. (7005907495);Larson D. (7402633779);Ambos E. (14048059100)","Risk, climatic variability, and the study of southwestern prehistory: An evolutionary perspective",61,217,re
4702,1333725,"American Antiquity",40554,1997,"Raab L. (6601955075);Larson D. (7402633779)","Medieval climatic anomaly and punctuated cultural evolution in coastal Southern California",62,319,ar
4702,7613691,"American Antiquity",40554,1997,"Colten R. (8363369400);Arnold J. (8754215200);Pletka S. (25221793700)","Contexts of cultural change in insular California",62,300,ar
4702,30302643,"Quarternary Science Reviews",26239,1996,"Stuiver M. (7007003882);Reimer P. (7103071876);Taylor R. (26030669400)","Development and extension of the calibration of the radiocarbon time scale: Archaeological applications",15,655,ar
4702,30317536,"Canadian Journal of Earth Sciences",22031,1996,"Dyke A. (7003706220);McNeely R. (7004891098);Hooper J. (7102438470)","Marine reservoir corrections for bowhead whale radiocarbon age determinations",33,1628,ar
4702,30739323,"Journal of Coastal Research",27374,1997,"Mason O. (7004241927);Hopkins D. (7202255075);Plug L. (7801522080)","Chronology and paleoclimate of storm-induced erosion and episodic dune growth across Cape Espenberg spit, Alaska, U.S.A.",13,770,ar
7154,2287569,"American Sociological Review",16929,1997,"Goodwin J. (7402339411)","The libidinal constitution of a high-risk social movement: Affectual ties and solidarity in the Huk rebellion, 1946 to 1954",62,53,re
7154,30495855,"Sociological Theory",18110,1996,"Emirbayer M. (23110549400)","Useful Durkheim",14,109,ar
9412,9986565,"British Journal for the Philosophy of Science",19977,1997,"Eliasmith C. (6603720957);Thagard P. (6701846211)","Waves, Particles, and Explanatory Coherence",48,1,ar
9412,30006171,Gastroenterology,28330,1996,"Hamlet A. (6701690210);Dalenbäck J. (7003418017);Fändriks L. (7005233384);Olbe L. (7006954993)","A mechanism by which Helicobacter pylori infection of the antrum contributes to the development of duodenal ulcer",110,1386,ar
[Diagram labels: article cites book; CSV files; RDF]
8
Warm-up
● Load RDF data
  – (subject, predicate, object)
● Most cited publications
● No problem with SPARQL or SQL
SELECT ?publication (COUNT(*) AS ?nCitations)
WHERE { [] scopus:cites ?publication }
GROUP BY ?publication
ORDER BY DESC(?nCitations)
Input triples:
subject       predicate  object
publication1  cites      publication2
publication1  cites      publication3
publication3  publisher  publisher5
Result:
publication   nCitations
publication3  288
publication5  223
publication2  124
Query annotations: predicate traversal, aggregation
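The two steps named on this slide, predicate traversal and aggregation, can also be sketched outside a query language. The Python below is only an illustration over a toy triple list (the data and the function name most_cited are assumptions, not part of the Scopus dataset or of Spinque's platform):

```python
from collections import Counter

# Toy (subject, predicate, object) triples; illustrative data only.
triples = [
    ("publication1", "cites", "publication2"),
    ("publication1", "cites", "publication3"),
    ("publication4", "cites", "publication3"),
    ("publication3", "publisher", "publisher5"),
]

def most_cited(triples):
    """Predicate traversal (follow 'cites'), then aggregation (count per cited object)."""
    cited = (obj for _, pred, obj in triples if pred == "cites")
    # most_common() returns (publication, count) pairs, highest count first
    return Counter(cited).most_common()

ranking = most_cited(triples)
print(ranking)
```

The generator plays the role of the WHERE clause, and Counter plays the role of GROUP BY plus ORDER BY.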
10
Warm-up... visually: "Search by Strategy"
[Diagram labels: Elsevier data source; Predicate traversal; Aggregation; Data flow; Deploy REST API; Deploy search engine]
12
Back to the original goal: rank publishers
[Diagram labels: journal articles; cited books (in both sources, linked by sameAs); "cited" publishers; aggregated "cited" publishers; Elsevier – Scopus (closed data); OCLC – WorldCat (open data); Deploy search engine]
● Open Data Node
  – Links books
● Search
  – Uses links
  – On-the-fly matching?
14
Surprise..
– University Press,Cambridge [England]
– University Press,Cambridge [etc.]
– University Press,"Cambridge, Mass.,"
– University Press,"Cambridge, N.E."
– University Press,"Cambridge, U.K."
– University Press,"Cambridge, UK"
– University Press,Cambridge [U.K.]
– University Press [etc.],Cambridge
– University Press [etc.],"Cambridge [Eng., etc.]"
– University Press [etc.],Cambridge [etc.]
– "University press [etc., etc.]","Cambridge,"
– University Pressf ats collnutz,Cambridge
– University Press of Cambridge,"Boston, Mass."
– University Press of Cambridge,"[Cambridge, Mass.]"
– Univ. of Cambridge,Cambridge
– Univ. P.,Cambridge
– Univ. Pr,Cambridge
– Univ. Pr.,Cambridge
– Univ.Pr.,Cambridge
– Univ. Pr.,Cambridge [u.a.]
– Univ. Pr.,"Cambridge, U.S.A."
– Univ. Pr.,Cambridge [usw.]
2588 variations (just for "Cambridge University Press").
Probably only 2 or 3 distinct entities in there.
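A rough sketch of how such variants might be brought closer together before matching. This is not the method used in the talk: the abbreviation table, the function names and the use of difflib are assumptions made for illustration, and the similarity score is evidence of a match, not a fact.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be curated per dataset.
ABBREVIATIONS = {"univ": "university", "pr": "press", "p": "press"}

def normalise(name: str) -> str:
    """Lower-case, drop bracketed notes and punctuation, expand abbreviations."""
    name = re.sub(r"\[.*?\]", " ", name.lower())  # remove notes such as "[etc.]"
    name = re.sub(r"[^a-z ]", " ", name)          # punctuation becomes a space
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in name.split())

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalised forms of two names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

variants = ["University Press", "Univ. Pr.", "Univ.Pr.",
            "University Press [etc.]", "Univ. of Cambridge"]

# Group the raw strings by their normalised form.
groups = defaultdict(list)
for v in variants:
    groups[normalise(v)].append(v)
print(dict(groups))
```

The place of publication (Cambridge, England versus Cambridge, Mass.) is deliberately left out here; it would be needed to separate the 2 or 3 distinct entities.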
15
De-duplicate publishers
[Diagram labels: journal articles; cited books (in both sources, linked by sameAs); "cited" publishers (duplicates linked by sameAs); aggregated "cited" publishers; Elsevier – Scopus (closed data); OCLC – WorldCat (open data); Deploy search engine]
● Open Data Node
  – Links duplicates
● Search
  – Uses links
  – On-the-fly matching?
17
Is the DH researcher happy?
● Yes. All very nice...
  – ...but...?
● Data are not 100% clean yet.
● Can we rank publishers of books about “women in war”?
The initial database problem needs to deal with uncertainty
18
Uncertainty from ranking
[Diagram labels: journal articles; cited books; ranked cited books about "women in war"; "cited" publishers; aggregated "cited" publishers; Elsevier – Scopus (closed data); OCLC – WorldCat (open data); Deploy search engine]
subject  predicate  object
book1    sameAs     book9
book7    publisher  publisher3
book9    publisher  publisher5
prob: 0.7, 0.5, 0.4
subject: book1, book1, book2
● joins become probabilistic joins
● aggregations become probabilistic aggregations
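A minimal sketch of what a probabilistic join and a probabilistic aggregation could look like, assuming all tuples are independent events. The data, the probabilities and the function names are invented for illustration; this is not Spinque's implementation.

```python
from collections import defaultdict

# Illustrative (book, probability) pairs from a ranking step,
# e.g. relevance to the query "women in war".
ranked_books = [("book1", 0.7), ("book2", 0.4), ("book3", 0.5)]

# Illustrative uncertain triples: (subject, predicate, object, probability).
triples = [
    ("book1", "sameAs", "book9", 0.9),
    ("book2", "sameAs", "book7", 0.5),
    ("book3", "sameAs", "book8", 0.8),
    ("book9", "publisher", "publisher5", 1.0),
    ("book8", "publisher", "publisher5", 1.0),
    ("book7", "publisher", "publisher3", 1.0),
]

def prob_join(left, triples, predicate):
    """Follow `predicate` from each left item; multiply probabilities (independence assumed)."""
    return [(obj, p * q)
            for item, p in left
            for subj, pred, obj, q in triples
            if pred == predicate and subj == item]

def prob_aggregate(rows):
    """Merge rows with the same key as independent events: 1 - prod(1 - p)."""
    miss = defaultdict(lambda: 1.0)  # probability that no row supports the key
    for key, p in rows:
        miss[key] *= 1.0 - p
    return sorted(((k, 1.0 - m) for k, m in miss.items()),
                  key=lambda kv: kv[1], reverse=True)

linked = prob_join(ranked_books, triples, "sameAs")
publishers = prob_aggregate(prob_join(linked, triples, "publisher"))
print(publishers)
```

Other aggregation semantics are possible, for example summing probabilities to get an expected citation count; the choice depends on the search scenario.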
21
More uncertainty from...
[Diagram labels: journal articles; cited books (in both sources); "cited" publishers; aggregated "cited" publishers; Elsevier – Scopus (closed data); OCLC – WorldCat (open data)]
● Ranking
● Fuzzy matching
● Priors in data
In fact...
26
Rank. Everything. Always.
● Unstructured search: uncertainty is a first-class citizen
● Structured search: let's switch from "facts" to "evidence"
  – Forcing uncertainty into “facts” risks corrupting data and search results
    ● Static data normalisation is good when it comes with high confidence
    ● Otherwise, evidence can be used at query time, depending on the context
  – Strategy blocks contain code for a probabilistic DB
    ● Based on Probabilistic Relational Algebra (Fuhr 1990, Rölleke et al. 2008)
● Let's just call it "search", finally.
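The distinction between static normalisation and query-time evidence could be sketched as follows. The threshold value, the type alias and the function name resolve are hypothetical, chosen only to make the idea concrete.

```python
from typing import List, Tuple

Evidence = List[Tuple[str, float]]  # (candidate value, confidence)

def resolve(evidence: Evidence, threshold: float = 0.95) -> Evidence:
    """Collapse to a single "fact" only when the best candidate is confident enough."""
    best_value, best_conf = max(evidence, key=lambda e: e[1])
    if best_conf >= threshold:
        # High confidence: static normalisation is safe.
        return [(best_value, 1.0)]
    # Otherwise keep all candidates as evidence, to be weighed at query time.
    return sorted(evidence, key=lambda e: e[1], reverse=True)

clear_case = resolve([("Cambridge University Press", 0.98),
                      ("Harvard University Press", 0.02)])
unclear_case = resolve([("Harvard University Press", 0.4),
                        ("Cambridge University Press", 0.6)])
print(clear_case)
print(unclear_case)
```

In the unclear case nothing is thrown away: both candidates stay available, with their confidences, to whichever search strategy uses them.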
27
Summary
● The use case shown
  – benefits from LOD
    ● data and results can be expanded / improved
  – benefits from Search by Strategy
    ● probabilistic modelling of search scenarios
● On-going effort in the COMSODE context
  – Open Data Node: good quality LOD
  – Search by Strategy: exploit uncertainty
● Currently
  – improving RDF support (e.g. vocabularies, inference)
  – improving query-time resolution of data conflicts
www.spinque.com
www.youropendata.eu
www.comsode.eu
Thank you