1 Stanford InterLib Technologies Hector Garcia-Molina and the Stanford DigLib Team.

1

Stanford InterLib Technologies

Hector Garcia-Molina

and the Stanford DigLib Team

2

Stanford Digital Libraries Team

Faculty

– Dan Boneh, Hector Garcia-Molina, Terry Winograd

Research Scientist

– Andreas Paepcke

Librarians

– Vicky Reich, Rebecca Wesley

Partners

– InterLib Partners, ACM, Dialog, Hitachi, IBM, Intel, Microsoft, NASA Ames Library, Stanford Libraries, SUL HighWire Press, Xerox

3

Barriers to Effective DLs

Economic Concerns

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

4

Thrusts

Economic Concerns

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

• Interoperability

• Value Filtering

• Mobile Access

• IP Infrastructure

• Archival Repository

5

DL Interoperability Challenges

Growing number of players, formats, countries, ...

Repositories

Services

Dynamic artifacts

Reliability

Digital Libraries

6

DL Interoperability Challenges

Growing number of players, formats, countries, ...

Repositories

Services

Dynamic artifacts

Reliability

Solution:

InfoBus InterServ

InfoBus Example

Folio Dialog DigiCash F.V.

FolioProxy

DialogProxy

DigiCashProxy

F.V.Proxy

DLite Gloss QueryTrans

MetaData U-Pai

Contracts

Q: Find Ti distributed (W) systems

InfoBus Example

Folio Dialog DigiCash F.V.

FolioProxy

DialogProxy

DigiCashProxy

F.V.Proxy

DLite Gloss QueryTrans

MetaData U-Pai

Contracts

Q: Find Ti distributed (W) systems

Suggested: Folio, Dialog
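The source-suggestion step can be illustrated with a small sketch. It assumes each source publishes a summary with its collection size and per-term document frequencies, and it estimates matches under a term-independence assumption. The function names, the summary layout, and every number below are invented for illustration and are not taken from the InfoBus or Gloss code.

```python
from typing import Dict, List

# Hypothetical summary layout: {"size": <docs>, "df": {term: <docs containing term>}}
Summary = Dict[str, object]

def estimate_matches(summary: Summary, terms: List[str]) -> float:
    """Estimate how many documents contain all terms, assuming independence."""
    size = summary["size"]
    if size == 0:
        return 0.0
    estimate = float(size)
    for term in terms:
        estimate *= summary["df"].get(term, 0) / size
    return estimate

def suggest_sources(summaries: Dict[str, Summary], terms: List[str], k: int = 2) -> List[str]:
    """Return up to k sources with a non-zero estimate, best first."""
    scored = [(estimate_matches(s, terms), name) for name, s in summaries.items()]
    ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
    return [name for _, name in ranked[:k]]

# Invented statistics, for illustration only.
SUMMARIES = {
    "Folio": {"size": 1000, "df": {"distributed": 100, "systems": 200}},
    "Dialog": {"size": 5000, "df": {"distributed": 500, "systems": 1000}},
    "F.V.": {"size": 100, "df": {"systems": 10}},
}
print(suggest_sources(SUMMARIES, ["distributed", "systems"]))
```

A source whose summary lacks one of the terms gets an estimate of zero and is never suggested, which is the property that lets a client skip it without querying it.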

InfoBus Example

Folio Dialog DigiCash F.V.

FolioProxy

DialogProxy

DigiCashProxy

F.V.Proxy

DLite Gloss QueryTrans

MetaData U-Pai

Contracts

Q: Find Ti distributed (W) systems

Q’: Find Ti distributed AND systems

Query Translation
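The translation from Q to Q’ can be sketched as a simple rewrite. This is a minimal illustration, assuming the target source has no proximity operator, so the `(W)` operator is relaxed to `AND`. The name `translate_query` and the regular expression are hypothetical and not part of the InfoBus code.

```python
import re

# Hypothetical sketch: relax the proximity operator (W) to a plain AND
# for a target source that does not support proximity search.
_PROXIMITY = re.compile(r"\s*\(W\)\s*")

def translate_query(query: str) -> str:
    """Rewrite 'a (W) b' as 'a AND b'; all other text is left untouched."""
    return _PROXIMITY.sub(" AND ", query)

print(translate_query("Find Ti distributed (W) systems"))
```

Because `AND` matches more documents than word adjacency does, a translator built this way would likely need to post-filter the results to recover the original query's meaning.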

InfoBus Example

Folio Dialog DigiCash F.V.

FolioProxy

DialogProxy

DigiCashProxy

F.V.Proxy

DLite Gloss QueryTrans

MetaData U-Pai

Contracts

Q: Find Ti distributed (W) systems

Pay per View

11

InterServ

InfoBus

Perpetual Activity

Services

Dynamic Artifacts

InfoBus Pro

“Maturity”

“Sophistication”

12

Perpetual Activity Service

P.A.S. P.A.S.

Service

UserRequest

register

state & plans

13

Perpetual Activity Service

P.A.S. P.A.S.

Service

UserRequest

register

state & plans

restart service, use alternate

restore state, try alternatives

check

check
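The register / check / recover cycle on this slide can be sketched as follows. This is one possible reading of the diagram, not the actual P.A.S. design: the class and field names are assumptions, and the health probe and restart hooks are plain callables supplied by the service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Registration:
    """What a service hands over when it registers (field names are assumptions)."""
    is_alive: Callable[[], bool]                  # probe used by "check"
    restart: Callable[[], bool]                   # returns True if the restart worked
    alternates: List[str] = field(default_factory=list)
    state: dict = field(default_factory=dict)     # saved "state & plans"

class PerpetualActivityService:
    def __init__(self) -> None:
        self._registry: Dict[str, Registration] = {}

    def register(self, name: str, registration: Registration) -> None:
        self._registry[name] = registration

    def save_state(self, name: str, state: dict) -> None:
        self._registry[name].state = dict(state)

    def check(self, name: str) -> str:
        """Return the name of the service that should handle requests now."""
        reg = self._registry[name]
        if reg.is_alive():
            return name
        if reg.restart():                         # "restart service"
            return name
        for alt in reg.alternates:                # "use alternate"
            alt_reg = self._registry.get(alt)
            if alt_reg is not None and alt_reg.is_alive():
                alt_reg.state = dict(reg.state)   # "restore state"
                return alt
        raise RuntimeError(f"no live service or alternate for {name!r}")
```

The slide shows two P.A.S. boxes; the sketch models only one and leaves out how the two would monitor each other.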

14

SDLIP

Simple Digital Library Interoperability Protocol

Goal: get InterLib (and DLI2) to interoperate!!

15

Search Protocol: Initial Goals

Trivial to implement!

Works over CORBA/COM, DASL/HTTP

Use XML

Does not prescribe query format

Does not prescribe result format

Small footprint (Desktop/Laptop/PDA)

Allows for stateful or stateless operation

But lets you say what you’re using
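The idea of carrying an opaque query while declaring which query language it uses can be sketched in XML. The element and attribute names below (`searchRequest`, `query`, `language`, `maxResults`) are invented for illustration and are not the SDLIP wire format; see the SDLIP documentation for the real one.

```python
import xml.etree.ElementTree as ET

def build_search_request(query: str, query_language: str, max_results: int = 10) -> str:
    """Wrap an opaque query string and label the language it is written in."""
    request = ET.Element("searchRequest")                  # hypothetical element name
    q = ET.SubElement(request, "query", {"language": query_language})
    q.text = query                                         # passed through untouched
    ET.SubElement(request, "maxResults").text = str(max_results)
    return ET.tostring(request, encoding="unicode")

def read_search_request(xml_text: str) -> dict:
    """Server side: recover the query and the language label."""
    request = ET.fromstring(xml_text)
    q = request.find("query")
    return {
        "query": q.text,
        "language": q.get("language"),
        "max_results": int(request.findtext("maxResults")),
    }

message = build_search_request("Find Ti distributed (W) systems", "dialog")
print(read_search_request(message))
```

The request carries everything the server needs, so this sketch corresponds to the stateless mode of operation; a stateful variant would add some form of session or result-set identifier.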

16

Interface Consists of Four Components

InformationClient

DeliveryInterface

InterLibWrapper

ResultAccessInterface

SourceMetadataInterface

SearchInterface

17

SDLIP Status Design Meeting June 22, 1999

18

SDLIP Status

Design Meeting June 22, 1999

Client & Server Toolkits Available

Extensive Documentation

See http://www-diglib.Stanford.EDU/~testbed/doc2/SDLIP/

19

Current SDLIP Sources

Some Web sources

– People Lookup: www.switchboard.com

– Altavista

– IMDB (movies)

NCSTRL services: www.ncstrl.org

– Dienst compliant services, e.g., CoRR?

Z39.50 servers

– e.g., Library of Congress

Stanford WebBase

CDL

– e.g., MELVYL gateway

DASL-compliant servers

20

Existing Clients

Java

– command line

– applet

C++

– Palm Pilot

TCL (Ray Larson)

DASL-compliant clients

21

Filtering Challenges

Too much information

Not controlled

22

Current Filtering

textual similarity

23

Page Rank Filtering

textual similarity

page rank (Google)

24

Initial Page Rank

4

1

25

Recursive Page Rank

4

1

6

1

2

2

1+2+1+2 = 6
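The two slides can be read as follows: the initial rank of a page is its number of incoming links, and the recursive rank is the sum of the ranks of the pages linking to it. A minimal sketch, assuming the figure shows a page with four incoming links whose source pages have ranks 1, 2, 1 and 2 (the page names are invented):

```python
from typing import Dict, List

def initial_rank(backlinks: Dict[str, List[str]]) -> Dict[str, int]:
    """Initial rank: the number of pages that link to each page."""
    return {page: len(sources) for page, sources in backlinks.items()}

def recursive_rank(backlinks: Dict[str, List[str]], rank: Dict[str, int]) -> Dict[str, int]:
    """One recursive step: a page's rank is the sum of its linkers' ranks."""
    return {
        page: sum(rank.get(source, 0) for source in sources)
        for page, sources in backlinks.items()
    }

# Page "P" is linked from four pages (names are illustrative).
backlinks = {"P": ["a", "b", "c", "d"]}
source_rank = {"a": 1, "b": 2, "c": 1, "d": 2}

print(initial_rank(backlinks)["P"])                  # 4 incoming links
print(recursive_rank(backlinks, source_rank)["P"])   # 1 + 2 + 1 + 2 = 6
```

This follows the simplified picture on the slides. The published PageRank formulation additionally divides each page's contribution by its number of outgoing links and applies a damping factor.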

26

Value Filtering

textual similarity

page rank

geography

context

opinions

access

27

Value Filtering Challenges

Collection of Value Information

Scalability

Privacy of Value Information

Understanding Page Rank

Searching Non-Text Objects

Combining Value Information

HCI Aspects

28

WebBase Goals

Manage very large collections of Web pages

Enable large-scale Web-related research

Locally provide a significant portion of the Web

Efficient wide-area Web data distribution

29

Challenges

Huge information space

– Wide area distribution

– URL space (to remember while crawling)

– Web content (to store)

Limited resources

– Disk

– Time

– Memory

– Bandwidth

– Server administrator tolerance

Continuous evolution

– More pages

– Pages change/disappear

– Mirror sites installed

– Keeping data “fresh”

Crawling issues

– Data ‘fiefdoms’: firewalls; access permissions; load controls

– Overhead per site: DNS lookups; processing robots.txt

– Parallelization

– Ability to interrupt & restart
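Two of the crawling issues, processing robots.txt and the ability to interrupt and restart, can be sketched with the Python standard library. This is an illustration only and not the WebBase crawler: the `Frontier` class, its JSON checkpoint format, and the agent name are assumptions.

```python
import json
from collections import deque
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

class Frontier:
    """Pending-URL queue plus 'seen' set that survives an interrupt."""

    def __init__(self) -> None:
        self.pending = deque()
        self.seen = set()

    def add(self, url: str) -> None:
        if url not in self.seen:          # the URL space to remember while crawling
            self.seen.add(url)
            self.pending.append(url)

    def next_url(self):
        return self.pending.popleft() if self.pending else None

    def checkpoint(self) -> str:
        """Serialize the crawl state so it can be written to disk."""
        return json.dumps({"pending": list(self.pending), "seen": sorted(self.seen)})

    @classmethod
    def restore(cls, blob: str) -> "Frontier":
        data = json.loads(blob)
        frontier = cls()
        frontier.seen = set(data["seen"])
        frontier.pending = deque(data["pending"])
        return frontier

ROBOTS = "User-agent: *\nDisallow: /private/\n"
print(allowed(ROBOTS, "webbase-sketch", "http://example.org/private/x"))
print(allowed(ROBOTS, "webbase-sketch", "http://example.org/public/x"))
```

Keeping the whole `seen` set in memory does not scale to a Web-sized URL space, which is the memory and disk pressure listed under limited resources.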

WebBase Architecture

[Figure: WebBase architecture. Components shown: WWW, Web Crawlers, Repository, Multicast Engine, Feature Repository, Retrieval Indexes, WebBase API, Clients]

31

Mobile Access Challenges

Limited Resources

Transitions Between Devices

Exploiting Context

32

Mobile Access Challenges

Limited Resources

Transitions Between Devices

Exploiting Context

Solutions:

Power Browsing

Information Tiles

Information Paging

33

Power Browsing

34

Power Browsing

Techniques

• Show only text headers

• Show URLs, anchors, titles

• Order URLs by page rank

• Summarize text

• Summarize set of pages

• Low-resolution pictures

• Display “relevant” text

• ...
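Two of these techniques, showing only text headers and ordering URLs by page rank, can be sketched with the standard-library HTML parser. This is not the PowerBrowser implementation: the class and function names are illustrative, and the `page_rank` table is assumed to come from some external ranking service.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple

class OutlineParser(HTMLParser):
    """Collect heading text and (anchor text, URL) pairs; drop body text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.headers: List[str] = []
        self.links: List[Tuple[str, str]] = []
        self._head_buf = None     # list of text chunks while inside a heading
        self._link_buf = None     # list of text chunks while inside <a href=...>
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._head_buf = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._link_buf = href, []

    def handle_data(self, data):
        if self._head_buf is not None:
            self._head_buf.append(data)
        if self._link_buf is not None:
            self._link_buf.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._head_buf is not None:
            self.headers.append(" ".join("".join(self._head_buf).split()))
            self._head_buf = None
        elif tag == "a" and self._link_buf is not None:
            text = " ".join("".join(self._link_buf).split())
            self.links.append((text, self._href))
            self._href, self._link_buf = None, None

def power_view(html: str, page_rank: Dict[str, float]):
    """Return (headers, links), with links ordered by descending page rank."""
    parser = OutlineParser()
    parser.feed(html)
    links = sorted(parser.links, key=lambda link: page_rank.get(link[1], 0.0), reverse=True)
    return parser.headers, links
```

On a small screen the headers and the top few links would be displayed first, and the full text fetched only on request.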

35

PowerBrowser - Start Screen

36

PowerBrowser - Hypertext View

37

PowerBrowser - Text View

38

PowerBrowser - History

39

IP Management Challenges

Heterogeneity

Complexity of Interactions

Varied Information Appliances

Mobile Access

Security/Privacy

40

Fundamental Problem

Safeguards (security, privacy, authentication, payment, non-repudiation, ...) are an afterthought

“Spaghetti” code for safeguards

Experience at Stanford:

• InterPay, CommPacts, Copy Detection

• Goal was interoperability

• Correctness, complexity were problems

41

Example: Simple Pay Per View

patron library bank

view(docId, account, amt)

transfer(amt, account, libAccount)
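The two calls in this exchange can be sketched directly. The sketch deliberately has no safeguards: the patron's account number passes through the library in the clear, nothing is encrypted, and no receipt is issued. The class names and the in-memory balances are assumptions made for illustration.

```python
class Bank:
    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer(self, amt, account, lib_account):
        """Move amt from the patron's account to the library's; False if funds are short."""
        if self.balances.get(account, 0) < amt:
            return False
        self.balances[account] -= amt
        self.balances[lib_account] = self.balances.get(lib_account, 0) + amt
        return True

class Library:
    def __init__(self, bank, lib_account, documents):
        self.bank = bank
        self.lib_account = lib_account
        self.documents = dict(documents)

    def view(self, doc_id, account, amt):
        """Charge the patron, then return the document."""
        if doc_id not in self.documents:
            raise KeyError(doc_id)
        # The library sees the raw account number here.
        if not self.bank.transfer(amt, account, self.lib_account):
            raise PermissionError("payment failed")
        return self.documents[doc_id]

bank = Bank({"patron-1": 10, "library": 0})
library = Library(bank, "library", {"doc-7": "full text"})
print(library.view("doc-7", "patron-1", 3))
```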

42

Example: Simple Payment

Goals

• Do not want others to see data

• Do not want library to see account number

• Need receipt from bank

patron library bank

view(docId, account, amt)

transfer(amt, account, libAccount)

43

Example: Simple Payment

Goals

• Do not want others to see data

• Do not want library to see account number

• Need receipt from bank

Result: A Mess!!

patron library bank

view(docId, account, amt)

transfer(amt, account, libAccount)

44

Declarative Safeguards for DLs

Safeguards built in at system design time

Declare goals, not mechanisms

– Players, data, ...

– Who can see what, who can do what, ...

(Note: access information can also be protected)

Declarative Infrastructure

SecureDLs

Components: IP Mgmt, Wallets, ...

45

Solution

Extended Interface Definition Language

– CORBA- or DCOM-like

Example:

    class artRecord {
        authorized(policy) setOwner(encrypted string ownerName,
                                    encrypted(bank) int price,
                                    picture pic);
        …
    }
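The same declarative idea can be mimicked in Python by attaching protection goals to fields as metadata and letting generic code derive the mechanism. This is only an analogy to the extended IDL, not the Stanford design: the helper names are invented, and reading `encrypted(bank)` as "readable only by the bank" is an assumption.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional

def encrypted(for_party: Optional[str] = None):
    """Declare that a field must be encrypted, optionally for one named party."""
    return field(metadata={"encrypted": True, "readable_by": for_party})

@dataclass
class ArtRecord:
    owner_name: str = encrypted()                  # encrypted string ownerName
    price: int = encrypted(for_party="bank")       # encrypted(bank) int price
    pic: bytes = field(default=b"", metadata={"encrypted": False})

def wire_plan(record_type) -> Dict[str, str]:
    """List, per field, the protection a generated stub would have to apply."""
    plan = {}
    for f in fields(record_type):
        if f.metadata.get("encrypted"):
            party = f.metadata.get("readable_by") or "recipient"
            plan[f.name] = f"encrypt for {party}"
        else:
            plan[f.name] = "send in clear"
    return plan

print(wire_plan(ArtRecord))
```

The declaration states who may read each field; how that is enforced (key choice, cipher, transport) stays out of the application code.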

46

Declarative Safeguards for DLs

Declarative Infrastructure

SecureDLs

Components: IP Mgmt, Wallets, ...

47

Information Preservation Challenges

Preserving the Bits

– Evolving hardware

– Evolving software

– Evolving organizations

Preserving the Meaning

48

Stanford Archival Repository

Object Identifier Signature

No Deletions (never ever!)

handle

set set

new version?
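The properties named on this slide (object identifiers derived from a signature, no deletions, handles that gain new versions) can be sketched as an append-only store. The class, its in-memory dictionaries, and the choice of SHA-256 as the signature are assumptions made for illustration, not the Stanford Archival Repository design.

```python
import hashlib
from typing import Dict, List

class ArchivalRepository:
    """Append-only store: objects are named by a signature of their bits."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}       # object id -> bits
        self._versions: Dict[str, List[str]] = {}  # handle -> ordered object ids

    def put(self, handle: str, data: bytes) -> str:
        """Store data under handle; changed content becomes a new version."""
        object_id = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(object_id, data)  # existing objects are never overwritten
        versions = self._versions.setdefault(handle, [])
        if not versions or versions[-1] != object_id:
            versions.append(object_id)
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def versions(self, handle: str) -> List[str]:
        return list(self._versions.get(handle, []))

    def latest(self, handle: str) -> bytes:
        return self.get(self._versions[handle][-1])

    # There is intentionally no delete or update method.
```

Since the identifier is computed from the content, a reader can recompute it to verify that the stored bits are intact.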

49

Repository Layers

Identity

Object Store

Complex Objects

Reliability

Indexing, Naming

Intellectual Property

50

Archiving the Web - Problem

File System

Web Server

users

51

Archiving the Web - One Solution

File System

Web Server

users

Archival Repository

52

Archiving the Web - Our Solution

File System

Web Server

InfoMonitor

users

Archival Repository
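One way to read this diagram is that InfoMonitor watches the web server's file system and copies new or changed files into the archival repository. A minimal polling sketch under that assumption follows. The function name, the SHA-256 change test, and the `archive` callback are all hypothetical, and the real InfoMonitor may work quite differently.

```python
import hashlib
import os
from typing import Callable, Dict

def scan_once(root: str,
              last_seen: Dict[str, str],
              archive: Callable[[str, bytes], None]) -> int:
    """Archive every file under root whose content changed since the last scan.

    last_seen maps file path -> digest of the content archived most recently.
    Returns the number of files archived in this pass.
    """
    archived = 0
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                data = handle.read()
            digest = hashlib.sha256(data).hexdigest()
            if last_seen.get(path) != digest:
                archive(path, data)        # hand the new version to the repository
                last_seen[path] = digest
                archived += 1
    return archived
```

Running `scan_once` periodically yields one archived version per observed change; changes that happen and are reverted between two scans are missed, which is a limitation of polling.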

53

InfoMonitor History View

54

InfoMonitor Snapshot View

Economic Concerns

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

• Interoperability

• Value Filtering

• Mobile Access

• IP Infrastructure

• Archival Repository

Stanford InterLib Technologies