The Semantic Web for Librarians and Publishers, by Michael Keller, Stanford University

Semantic Web for Libraries & Publishers

Charleston Conference

111103

Monday, November 21, 11

so, what’s the problem?

2

The Problem Set



Silos


More silos


Lots of different silos


Blue silos


Old Silos

We in the library and publishing trades force readers, some of whom are authors as well, to search iteratively for information they want or need or think might exist, in many different silos, using many different search engines, forms, and vocabularies. We do not make it easy for them to discover what is locally available, what is more or less easy to get, or everything that might be available. No wonder the young and foolish depend upon and believe in Google’s searches. Google is quick...and in terms of relevance, very, very dirty.


We give them better interfaces, ones that permit refinement of results, to our holdings at the title level, BUT...


Simultaneously, we show them many other tools, each excellent in some ways, to continue their exploration of the literature. No single tool is comprehensive. We do not refer our clients to the Web, at least not on our own web sites! // Our OPACs refer to our holdings, while indices and abstracts refer our readers to articles in journals that we may have licensed. SFX and similar tools provide readers with links to the titles to which we have subscribed. Neither our OPACs nor the secondary databases refer directly to more than a tiny percentage of the vast collection of pages that is the World Wide Web. The Web, of course, refers in fragmentary fashion to information resources we might, I emphasize, MIGHT have on hand for our readers.


And the results of using other, often very good, discovery tools differ in relevance ranking, format, and options from the ones we provide for our OPACs, thus adding confusion.


Some of us provide our readers with lots of databases to search. Too many, really, since all but a few of our readers are not forensic-level scholars.


Selecting a licensed database is an art in itself! Once again, notice that we rarely offer a web search engine as an option, and for good reasons. Nevertheless, the discoverable relevant information resources on the web apparently are not part of our repertory.

!!!


We have not conspired to make the search for relevant information objects difficult. We just have not yet had the tools, the methods, the vision, and yes, the gumption to try something new.

Ntl Cntr for Biotech Info

NSF CyberInfrastructure

quake engineering simulation

ATLAS at LHC -- 150*10^6 sensors


Here’s a teensy slice of the information and communication environment in which our faculty and students find themselves. And it gets more complex every day. Alas the larger the number of websites indexed by Bing or Google or whatever search engine du jour, the more likely it is that the relevance of the returns will be less pointed and precisely matched to what the searcher hoped to find.


Too many silos. Here’s the biggest of the lot...

16


17

One size fits all???


Does one size fit all?

18


Not quite. Even Google has silos and uses, as do others, clever interfaces to hide the fact of the silos.


Given all these silos and search engines, our users, our authors, readers, teachers, and students, people on the street, our nations...need us to find a better way. Facts about the information objects we have acquired or leased, and facts about the books, articles, films, and so forth that we have published, need to be found in the wild, on the web. Ideally, we librarians and publishers will make the facts about what we have and what we are making public, for fun or profit, discoverable on the Web.

Discovery & Access

... the problems


Let’s dwell on the problems briefly...

1. Too many stovepipe systems

2. Too little precision with inadequate recall

3. Too far removed from W3, the World Wide Web


The landscape of discovery & access services is a shambles




It can’t be mapped in any logical way
• not by us (the supposed information pros)
• not by the faculty & students who must navigate the chaos

This state of affairs shouldn’t be a surprise



Some of the problem ... too many stovepipe systems
• dumbing-down effects of federation often hinder explicit searches
• each interface has its own search-refinement tricks
• numerous, overlapping discovery paths hamper full recall

Most of the problem ... limitations in the design & execution of infrastructure that supports discovery & access



the 1st limiting factor ... ambiguity

Most of our metadata uses a string of bytes to label a semantic entity [person, place, thing, event, ...]
• discovery based on matching text labels
• not on the gist of semantic entities

For libraries, the fix is authorities
• authoritative forms of strings (names, organizations, titles, places, events, topics, etc.) work to improve precision and recall

hold on ... what about cases where no one-to-one relationship exists between a string-of-text label & the underlying semantic entity?

Take for example the text string: Jaguar

byte string: 4a 61 67 75 61 72



[Slide build: the many things labeled “Jaguar”]

• company: ... Ltd.
• cars: E-Type (UK) or XK-E (US), mfg. 1961 to 1974; XK series, in production since 1996
• hardware & software: Macintosh OS X 10.2; Atari videogame console
• music: Fender electric guitar, introduced in 1962; heavy metal band formed in Bristol, England, Dec 1979; Philadelphia-based singer/songwriter Jaguar Wright
• military: type 140 Jaguar class fast attack craft [torpedo], Germany WWII; XF10F prototype swing-wing fighter, early 1950s, Grumman; Anglo-French ground attack aircraft
• heroes: The Jaguar is a superhero published by Archie Comics; DC Comics' Impact series, ... loosely based on Archie Comics' character
• pro football: Jacksonville
• etc.

Prrrrr ... a rose is a rose is a rose

Imagine this keyword search and realize the ambiguity of the term “jaguar”

inspired by John Giannandrea, CTO, Metaweb ... from his presentation at PARC in April, 2008
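The ambiguity can be made concrete with a small sketch. The entity identifiers and the `ENTITIES` table are hypothetical, invented for illustration (the example.org URIs are not real); only the byte string and the kinds of Jaguar come from the slides. Matching on the text label returns every entity that shares the label, while matching on an identifier returns exactly one.

```python
# Hypothetical identifiers: the example.org URIs are made up for illustration.
ENTITIES = {
    "http://example.org/id/jaguar-e-type": ("Jaguar", "cars"),
    "http://example.org/id/mac-os-x-10-2": ("Jaguar", "hardware & software"),
    "http://example.org/id/fender-jaguar": ("Jaguar", "music"),
    "http://example.org/id/jacksonville-jaguars": ("Jaguar", "pro football"),
}

def search_by_label(label):
    """String matching: return every identifier whose label equals the query."""
    return sorted(uri for uri, (name, _) in ENTITIES.items()
                  if name.lower() == label.lower())

def lookup_by_uri(uri):
    """Entity matching: an identifier names exactly one thing."""
    return ENTITIES[uri]

# The byte string from the slide decodes to the text label.
label = bytes.fromhex("4a 61 67 75 61 72").decode("ascii")
hits = search_by_label(label)
print(label, "matches", len(hits), "different entities")
print(lookup_by_uri("http://example.org/id/fender-jaguar"))
```

The label alone cannot tell the guitar from the car; the identifier can.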

the 2nd limiting factor ... instance-based metadata



Most of our metadata focuses on publication artifacts
• identify responsibility for its creation
• list topical headings



For simple cases ... few worries
• as with ambiguity, one-to-one relationships pose few problems
• things work for authors with a few books in several editions



But, as complexity increases, precision & recall suffer


Prolific authors ...

search: Shakespeare’s Hamlet, 811 entries

Wading thru search results for authors like Shakespeare shows clearly the effects that instance-based metadata has on precision & recall


A Socrates (Stanford Libraries OPAC) keyword search for the terms shakespeare and hamlet



Unflagging patience marks the task of flipping back & forth between hundreds of brief and full records to sort thru the varied instances of a single entity, e.g.
• critical editions based on primary sources
• 18th & 19th century collections of the plays
• social, historical and literary essays
• histories & critiques of such writings
• video and audio recordings of performances
• reviews and indices of the same
• treatments of stagecraft, costumes, music
• life & works of notables associated with the plays (e.g., performers, directors)
• other art forms inspired by the plays




Web

Together, our metadata & collections make up a big chunk of the “dark web”
[ info resources that search-engine spiders can’t see ]

It’s clear that visibility on the web promotes dramatic increases in discovery and access
• Library of Congress & Smithsonian images (Flickr)
• SULAIR’s HighWire Press ( > 2x increase via Google)

The state of affairs is well known ...


54

Our Working Environment


[Diagram labels: library; academy; publisher; scholars & students; produce; provide]


Here is a schematic to suggest how our ecosystem works. It is more complex, of course, but the basics are embodied here.

internet

Once upon a time…the Internet


And here is the way the e-discovery and e-communication environment is developing. First there was the Internet. Prophets such as Vannevar Bush, Ted Nelson, and Doug Engelbart showed us the way.

internet

Then…the World Wide Web

web of pages


Thanks to another prophet, Tim Berners-Lee, the Internet, a network of communicating computers, became a web of pages of information. Scholarly journal publishers and some librarians realized early on that there were functional advantages to scholarship and to publishing in the web of pages. Yahoo, Google, and others realized that mining the web of pages by the words on those pages could make the rapidly growing web of pages reveal more through indexing and cataloging the web. Indexing won out over cataloging, as we now know.

The next thing is the subject of this talk. It is the web of data. It is the web of relationships constructed and expressed so that both computers and humans can identify and understand relationships in that web. The web of data lives with the web of pages and is carried on the Internet, the global carrier.

internet

web of pages

web of data

Under construction


This web of data is the next big thing in discovering relevant information objects and the next big thing in empowering individuals, communities, and industries in making better use of information that they or others create. What distinguishes this web of data, this linked data environment, is the principle of identifying entities, virtual & real, by statements of relationships and descriptions in machine readable form. More about this as we go along.

internet

web of pages

web of data

aka Linked Data

Under construction


We are calling this next phase the Linked Data phase, because it is entirely dependent upon statements of relationships and descriptions in machine readable form, but this phase may be only a precursor to another, more complex and more difficult web world to engineer. The next phase is the Semantic Web, which in theory allows the machine readable relationships and descriptions to interoperate to satisfy a person’s requirements, albeit without constant interaction. In short, in the Semantic Web, the machines will understand meaning and presumably act on it. Scary, eh?

60

Construction Tools


How do we work to alleviate our problems as information professionals, librarians and publishers?

Recipe for creating the web of data

• identify people, places, things, events, and other entities embedded in the knowledge resources that a research university consumes and produces
• tie those facts together with named connections
• publish the relationships as crawl-able links on the web


Build/use apps supporting discovery via the web of data
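The recipe can be sketched in a few lines. Everything named here is an assumption for illustration: the example.org URIs, the predicate names, and the helper `to_ntriples`. The output follows the style of the N-Triples line format (one statement per line, URIs in angle brackets, literals in quotes).

```python
# Step 1: identify entities (hypothetical URIs, invented for this sketch).
BASE = "http://example.org/"
hamlet = BASE + "work/hamlet"
shakespeare = BASE + "person/william-shakespeare"

# Step 2: tie facts together with named connections (predicates).
facts = [
    (hamlet, BASE + "vocab/creator", shakespeare),
    (hamlet, BASE + "vocab/title", "Hamlet"),
    (shakespeare, BASE + "vocab/name", "William Shakespeare"),
]

def to_ntriples(triples):
    """Step 3: serialize statements as crawl-able lines, one triple per line."""
    lines = []
    for s, p, o in triples:
        # Treat anything starting with http:// as a URI, everything else as a literal.
        if o.startswith("http://"):
            obj = f"<{o}>"
        else:
            obj = '"' + o.replace('"', '\\"') + '"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

published = to_ntriples(facts)
print(published)
```

A crawler that reads these lines can follow the URI for the author to whatever else has been published about that person, which is the point of the third step.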


65


Here is a pile of words representing all the words on the web that most search engines index constantly. Good search engines today can do a lot with this pile. BUT, the search engines create the perception of relationships, not based on meaning, but on other factors, such as number of links to a site containing the words of interest OR the traffic to a site.

66

From this pile of words, structure!


The Linked Data approach attempts to structure the pile in anticipation of the need for discovery. That structure is based on meaning, on relationships. I will make this clearer in the next slides.

67


Here’s a graph of a very few relationships to Yo-Yo Ma, the great cellist.

68

Linked Data Web

Here’s a graph of relationships to Haggis, just a fun one I could not resist throwing in. Meaning is provided by understanding relationships.

69

RDF triples & URIs

• RDF triples = subject – predicate – object
– A way to describe objects or even ideas on the web
– An object or idea might have many RDF triples describing it
– Objects or ideas need not exist on the web!

• URIs = Uniform Resource Identifiers
– Allow machine interaction among Web objects
– Various syntactical schemes & protocols used to construct URIs
– At least 3 needed to support an RDF triple (subject – predicate – object)


Geek ingredients for the construction of the Linked Data Web. RDF means Resource Description Framework; each RDF statement is expressed as a simple sentence, though multiple such statements might attach to a single entity. In fact, we need multiple RDFs in this scheme.
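A minimal sketch of the idea: each RDF statement is a three-part sentence, conventionally ordered subject, predicate, object, and several statements can attach to one entity. The URIs and predicate names are hypothetical, and the `match` helper is an illustration, not a standard API.

```python
from typing import NamedTuple, Optional

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

EX = "http://example.org/"  # hypothetical namespace for this sketch

graph = [
    Triple(EX + "yo-yo-ma", EX + "plays", EX + "cello"),
    Triple(EX + "yo-yo-ma", EX + "occupation", EX + "cellist"),
    Triple(EX + "cello", EX + "memberOf", EX + "string-instruments"),
]

def match(triples, s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the triples that agree with every part that is not None."""
    return [t for t in triples
            if (s is None or t.subject == s)
            and (p is None or t.predicate == p)
            and (o is None or t.obj == o)]

about_ma = match(graph, s=EX + "yo-yo-ma")
print(len(about_ma), "statements describe the same entity")
```

Leaving a part as `None` turns it into a wildcard, so the same helper answers "everything about this subject" and "everything connected by this predicate".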

70


A graph of RDF statements and URIs

71

The Linked Data Principles
1. Use URIs as names of things (people, places, times, objects, ideas...anything really)
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful RDF information
4. Include RDF statements that link to other URIs so that they can discover related things
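The four principles can be sketched without a network. The `WEB` dictionary stands in for HTTP lookups, and every URI and predicate in it is hypothetical; a real client would issue HTTP requests and parse RDF instead.

```python
# Stand-in for the web: looking up a URI returns RDF-like statements about it.
# All URIs and predicates here are hypothetical.
WEB = {
    "http://example.org/haggis": [
        ("http://example.org/haggis", "http://example.org/vocab/cuisine",
         "http://example.org/scotland"),
    ],
    "http://example.org/scotland": [
        ("http://example.org/scotland", "http://example.org/vocab/partOf",
         "http://example.org/united-kingdom"),
    ],
    "http://example.org/united-kingdom": [],
}

def dereference(uri):
    """Principles 2 and 3: looking up a URI yields useful statements."""
    return WEB.get(uri, [])

def discover(start, max_hops=2):
    """Principle 4: follow links in the statements to find related things."""
    seen, frontier = {start}, [start]
    for _ in range(max_hops):
        next_frontier = []
        for uri in frontier:
            for _s, _p, o in dereference(uri):
                if o.startswith("http://") and o not in seen:
                    seen.add(o)
                    next_frontier.append(o)
        frontier = next_frontier
    return seen

related = discover("http://example.org/haggis")
print(sorted(related))
```

Each hop only works because the objects of the statements are themselves URIs that can be looked up, which is what the fourth principle asks for.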


The really great aspect of RDFs is that they can refer to ideas, not just to physical or virtual entities. Any kind of idea could be treated.

72

Library Metadata

• Library metadata standards closed
• “Passive” metadata, searchable, but…
• In Silos
• Readable, but not actionable
• Search results refinable, but final


These are some of the edges of the problem of library metadata.

73

Library Metadata
• Library metadata standards closed
• “Passive” metadata, searchable, but…
• In Silos
• Readable, but not actionable
• Search results refinable, but final

Semantic Web Metadata
• Open
• Dynamic, Contextualized
• In the wild
• Interactive, Responsive
• Leading to other queries & views


And here is the comparison between the library metadata scene now and the one we advocate for the Linked Data/Semantic Web. Library metadata in the Linked Data Web should be freely available, constantly updated, often reconciled with RDF triple statements from non-library sources. Library Linked Data should be entirely open on the web.

74

Make Library bibliographic facts into RDFs & URIs;
Release them into the wild.
Make Library Linked Data OPEN.


I should add that accounting for physical objects in our collections, locating them, making our collections auditable, and managing our collections seems to be possible using Linked Data too, at least in principle.

75

What about Publishers?


76

Publishers & Societies making use of Linked Data

• Aggregate content in their own realms & beyond
• Aggregate information about
– Conferences
– Career building & employment opportunities
– Communities in collaboration
– Commercial & other services supporting research with specimens, source material, processing, trials
– Productive relationships with others
• Provide actionable, constantly updated links in support of scholars, teachers, and learners
• Provide compelling services tying users to them


Libraries too can use Linked Data to reveal and advertise compelling services offered to their clients.

77

Semantic Web adopters

Here are some of the big players in the Linked Data / Semantic Web world. The British Library has released RDFs/URIs for the entire British National Bibliography. The Library of Congress has released the same for LCSH & Name Authority Files. LCSH includes links to AGROVOC, RAMEAU, DNB, GLIN Subject Thesaurus, and the National Agricultural Library's Subject Index. Every Personal and Corporate entry in LC/NAF links to VIAF, the Virtual International Authority File based at OCLC. The N Y Times 18 months ago made all 500,000 (and growing) of its index terms available in the wild as RDFs and URIs.

78


For publishers and libraries...though we should not neglect services.

79

...if users can find it in their own context

Context

80

Content

Users

Users = readers, authors, teachers, students


Context

81

Content

Users

Publishers must make content VISIBLE

I am using the imperative here, because invisible published content means invisible benefit to the author and/or the publisher.

82


Here is a recent PLoS article from PLoS Neglected Tropical Diseases.

83


And here is the semantically enhanced version of this article, enhancements provided by David Shotton et al. in the form of links to further information, interactive figures, re-orderable reference list, citations in context and tag trees. These enhancements took 10 man weeks in 2009! However, with the growing ecology of linked data, much of this could be accomplished by auto-tagging and algorithmic construction of the basic RDFs & URIs for the unique article. Microdata submitted by some publishers and their supporting services to schema.org lead to these exciting possibilities.

84

aggregation

Aggregation counts, but think how much more we would get if we could aggregate from libraries, publishers, and the wild and weird variety of sources on the web?

85


86

Disambiguation


RDFs and URIs can operate in many languages and relationships can be expressed across languages, a potential big benefit to research and collaboration in research.

87

Web of Data Progress


88

2007


FOAF = Friend of a Friend. Hundreds of millions of RDFs/URIs. Fortunately they do not take much space in memory!

89


This is the 2011 graph of entities supplying RDFs and URIs. Now the population is in the hundreds of billions, heading to trillions.

90

2011

http://inkdroid.org/lod-graph/

91

Encouragement

Examples


92

Linked Open Data Value Proposition
• Linked open data (LOD) puts information where people are looking for it – on the Web;
• LOD can expand discoverability of our content;
• LOD opens opportunities for creative innovation in digital scholarship and participation;
• LOD allows for open continuous improvement of data;
• LOD creates a store of machine-actionable data on which improved services can be built;
• Library linked open data might facilitate the breakdown of the tyranny of domain silos;
• LOD can provide direct access to data in ways that are not currently possible;
• LOD provides unanticipated benefits that will emerge later as the stores of LOD expand exponentially.

A product of the Stanford/CLIR Linked Data Workshop June 2011.


25 participants from the British Library, the Bibliothèque nationale de France, the Deutsche Nationalbibliothek, the Royal Library of Denmark, Aalto University in Finland, the Library of Congress, the Bibliotheca Alexandrina, the National Institute of Informatics of Japan, Google, Seme4, Emory, University of Virginia, University of Michigan, California Digital Library, Knowledge Motifs, CLIR, and Stanford.

93

Google using Stanford bib facts + web resources


This is a movie of a live interaction with Freebase using bibliographic facts from Stanford, and linked information resources from the web. It shows in a limited way the potential for discovery and retrieval in the Linked Data Web.

94

BnF using data only from its catalogs & Gallica


This is another movie of the Linked Data prototype based entirely on bibliographic facts from the BnF catalogs and digital texts in Gallica. There are no other web resources drawn into this prototype...yet.

95


96

A Bibliographic Framework for the Digital Age (October 31, 2011)

• “The new bibliographic framework project will be focused on the Web environment, Linked Data principles and mechanisms, and the Resource Description Framework (RDF) as a basic data model. The protocols and ideas behind Linked Data are natural exchange mechanisms for the Web that have found substantial resonance even beyond the cultural heritage sector. Likewise, it is expected that the use of RDF and other W3C (World Wide Web Consortium) developments will enable the integration of library data and other cultural heritage data on the Web for more expansive user access to information.”

Deanna Marcum, Associate Librarian of Congress, introducing a transition from MARC.


97

We in the cultural heritage and knowledge management institutions are discovering better ways of publishing, sharing, and using information by linking data and helping others do the same. Through this work, we have come to value and to promote the following practices:

1. Publishing data on the web for discovery and use, rather than preserving it in dark, more or less unreachable archives that are often proprietary and profit driven;

2. Continuously improving data and Linked Data, rather than waiting to publish “perfect” data;

3. Structuring data semantically, rather than preparing flat, unstructured data;

4. Collaborating, rather than working alone;

5. Adopting Web standards, rather than domain specific ones;

6. Using open, commonly understood licenses, rather than closed and/or local licenses.

Value Proposition for LAMs

from the Stanford/CLIR Workshop on Linked Data, June 2011


In each couplet, we emphasize the first half, before “rather than”, admitting that sometimes the second half of the couplet has to be operative.

98

DARPA Internet

This is where we started 2.5 decades ago.

99

World Wide Web


Thanks to Tim Berners-Lee and many others, we advanced in this environment from the early 1990s until today.

100

SOCIAL WEB


We cannot ignore the social web that exists in the current WWW, but think how much more, some of it scary, could be done in the Linked Data Web with the behaviors of the Social Web.

101

Linked Data Web

Just that funny reminder of the fundamental nature of the Linked Data Web: expressing machine actionable relationships.

102

Semantic Web

And in the next web, the Semantic Web, who knows what may be possible.

103

Ubiquitous computing


To the progression of network types, we need to add a couple of enormously important environmental factors. Ubiquitous computing is a very important one. Having lots of computers on the net makes the possibility of an open global linked data web very strong.

104

Mobility


And our ability to communicate by voice (how about that Siri?) and by bits/bytes from everywhere is, perhaps, just another aspect of ubiquitous computing.

105

Ubiquitous Computing

Mobile

Internet

Web

Social Web

Linked Web


The black box in the upper right corner is the Semantic Web, a level of sophistication yet to be achieved. The linked data web is at hand, though.

Will Librarians and Publishers join the development of the Linked Open Data web? I certainly think we should.


NO MORE SILOS ARE NEEDED or wanted.

107

W3C Library Linked Data Incubator Group
http://www.w3.org/2005/Incubator/lld/

A Bibliographic Framework Initiative General Plan for the Digital Age (October 31, 2011)
http://www.loc.gov/marc/transition/news/framework-103111.html

Linked Data Survey & Workshop June 2011
http://www.clir.org/pubs/archives/linked-data-survey/



