Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of...

Web Archives, IDEAL, and PBL Overview

Edward A. Fox

Digital Library Research Laboratory

Dept. of Computer Science

Virginia Tech

Blacksburg, VA, USA

21 May 2015 for Qatar National Library

Outline

• A broader perspective• Key topics to consider• WARC• IDEAL pipeline• Crawling• CS5604 Spring 2015: integrated archiving and IR/DL• Other presentations• Acknowledgments

Information Life Cycle

AuthoringModifying

OrganizingIndexing

StoringRetrieving

DistributingNetworking

Retention/ Mining

AccessingFiltering

UsingCreating

5S LayersSocieties

Scenarios

Spaces

Structures

Streams

1. Edward A. Fox and Jonathan P. Leidig, editors. Digital Library Applications: CBIR, Education, Social Networks, eScience/Simulation, and GIS. Morgan & Claypool Publishers, San Francisco, March 2014, http://dx.doi.org/10.2200/S00565ED1V01Y201401ICR032

2. Edward A. Fox and Ricardo da Silva Torres, editors. Digital Library Technologies: Complex Objects, Annotation, Ontologies, Classification, Extraction, and Security. Morgan & Claypool Publishers, San Francisco, March 2014, http://dx.doi.org/10.2200/S00566ED1V01Y201401ICR033

3. Rao Shen, Marcos Andre Goncalves, and Edward A. Fox. Key Issues Regarding Digital Libraries: Evaluation and Integration. Morgan & Claypool Publishers, San Francisco, Feb. 2013, http://dx.doi.org/10.2200/S00474ED1V01Y201301ICR026

4. Edward A. Fox, Marcos Andre Goncalves, and Rao Shen. Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. Morgan & Claypool Publishers, San Francisco, July 2012, http://dx.doi.org/10.2200/S00434ED1V01Y201207ICR022

Key topics to consider

• Problems: bit rot, loss of knowledge/heritage• Community building to solve problems, provide services

• Training, collaboration

• Requirements• Use cases / scenarios / services

• Constraints: staff, equipment/storage, bandwidth, funding• Content scope, Contracts (e.g., Twitter/Gnip)

• Design• Buy vs. make, performance, level of automation

• Implementation, maintenance, operation, refinement

Realities of Web Archiving

• Ephemeral content, especially news, or due to economics• More focus on immediate use by humans, not long-term

access with computers or support for historical research• Automation is essential, but curation also is crucial• Tradeoffs required between better AI, better HCI• Only can sample from the universe of content

• Where to look? How far to explore around each “seed”?• Role of sources: news, publishers, organizations, social media

• Social scientists are concerned that the sample will be biased when later pose research questions

Tradeoffs: Need to Integrate Instead

• Archives vs. Libraries• Preservation vs. Access• Big data vs. Storage & HPC• Traditional services (searching, browsing) vs. advanced

services (personalization, recommendation, exploration, analysis, visualization)

• Adding value through automation vs. cost of development of sophisticated but often narrow tools

WARC Information

• 500MB (5x10^8 bytes) is recommended as a practical target size for WARC files, when record sizes allow.

• http://bibnum.bnf.fr/WARC/ • http://warcreate.com/• http://matkelly.com/wail/• https://www.youtube.com/watch?v

=aYqW8CI-eQ8#action=share• http://warc.readthedocs.org/en/latest/• https://github.com/internetarchive/warc

IDEAL Pipeline

• Detect event using signals from feeds• See some related student projects in Spring 2015 offering of CS4624

at http://vtechworks.lib.vt.edu/handle/10919/18655

• Collect tweets using @... and #... and plain keywords• Extract and expand URLs to be seeds for crawls• Build collection through crawling, aided by curator• Run IDEAL team solutions to get Solr, with

• Raw tweet and webpage collections on events of interest• Cleaned collections as well as WARC files of webpages• Graphs from tweets and webpages using (inferred) links• Added value for documents from information analysis/extraction• Enhanced searching and browsing for event archives

Crawling

• SeerQ• Heritrix -> WayBack Machine, Time view• Nutch• Focused crawler

• How much automation?• What is the scope and type of content?

CS5604 Cleaning of Data

• Removing (HTML) tags• Removing add-ons (e.g., advertisements, navigation)

• Text cleanup (e.g., remove stop words, stem)

• Train and apply classifiers to remove documents, or parts of documents, according to resulting labeling

CS5604 Information analysis & extraction

• Classification• Clustering• LDA• NER• Social network graphs and importance computation

CS5604 Spring 2015 Final Team Talks

Files (Alphabetical)

• 2015 IIPC GA-Mohamed3.pptx IDEAL, FC• 2015-iipc-uws.ppt.pdf,pptx UWS overview• 20150521ELISQhadoop.pptx• 20150521FoxOverviewQNL.pptx (this file)• CS4624Lecture20150428SolrKanan• CS5604Presentations2015 (8 teams)• ELISQpapersTAMUall.pdf• gruss_fox_qnl_presentation.pptx PBL• UWS_demo.mov• VTWebArchivingPresentation.pptx• WARC_ISO_28500_final_draftAccepted.doc

Files (Sequenced)

• 20150521FoxOverviewQNL.pptx (this file)• WARC_ISO_28500_final_draftAccepted.doc• 2015-iipc-uws.ppt.pdf,pptx (UWS overview)• UWS_demo.mov• VTWebArchivingPresentation.pptx• 20150521ELISQhadoop.pptx• 2015 IIPC GA-Mohamed3.pptx (IDEAL, FC)• gruss_fox_qnl_presentation.pptx (PBL)• CS5604Presentations2015 (8 teams)

• CS4624Lecture20150428SolrKanan• ELISQpapersTAMUall.pdf

Acknowledgements

Thanks go to the ELISQ team, including QU Library and the Qatar National Library, as well as the US National Science Foundation, for support through grants: DUE-1141209, IIS-1319578, IIS-0916733

This project was made possible by NPRP Grant # 4 - 029 - 1 – 007 from the Qatar National Research Fund (a member of Qatar Foundation).

Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of...

Documents

Transcript of Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of...

Becky Eidelman PBL Projects Develops Sixth AM PBL … Eidelman PBL Projects Develops Sixth AM PBL Challenge ... nebhe

Activationoftoll-likereceptors2,3,and4onhuman ... · blood lymphocytes (PBL) from healthy donors (PBL-A, PBL-B,PBL-1,PBL-2,PBL-3,PBL-4,PBL-5,PBL-6,PBL-A1, PBL-A2, PBL-A3, and PBL-A4),

Edward A. Fox fox@vt CS DLRL Virginia Tech, Blacksburg, VA, USA

Blacksburg Middle School

What Is PBL? Why PBL?

Blacksburg Community Band

Blacksburg, VA Homebuyers' Workbook

US-Korea Joint Workshop on Digital Libraries SDSC - August 10-11, 2000 Edward A. Fox and Cortney Martin Virginia Tech, Blacksburg, VA, USA Broadband Wireless.

Networked Digital Library of Theses and Dissertations Edward A. Fox fox@vt.edu CS DLRL Internet TIC Virginia Tech, Blacksburg, VA,

CTRnet: A Crisis, Tragedy, & Recovery Network () Oct.16, 2009 VCOM Research Day Blacksburg, VA 24061 USA Edward Fox (fox@vt.edu), Bidisha.

Digital libraries, computer science, and education: 10 projects Edward A. Fox fox@vt.edu Department of Computer Science Virginia Tech (VPI&SU) Blacksburg,

D4/D5 MEETING IN BLACKSBURG VA USA BLACKSBURG, VA , …

Visual Semantic Modeling of Digital Libraries Qinwei Zhu, Marcos André Gonçalves, Rao Shen, Edward A. Fox – Virginia Tech,, Blacksburg, VA, USA Lillian.

Introduction to Concept Maps Edward A. Fox and Rao Shen CS5604 Fall 2002 “Information Storage & Retrieval” Dept. of Computer Science Virginia Tech, Blacksburg,

fox@vt.edu http:// fox.cs.vt.edu Dept. of Computer Science, Virginia Tech Blacksburg, VA 24061 USA

ELISQ Discussion with QNL Director Lux 20 May 2015 Edward A. Fox Professor, Computer Science, Virginia Tech Blacksburg, VA 24061 USA fox@vt.edufox@vt.edu.

BLACKSBURG NATIONAL CAPITAL REGION BUS · Arlington Salem EXIT 140 Blacksburg Park & Ride parking.vt.edu bit.ly/ncr-shuttle BLACKSBURG NATIONAL CAPITAL REGION Monday–Friday Saturday

fox@vt.edu http:// fox.cs.vt.edu/talks/2010/ … Dept. of Computer Science, Virginia Tech Blacksburg, VA 24061 USA

Networked Digital Library of Theses and Dissertations Croatia - May 1999 Edward A. Fox Virginia Tech, Blacksburg, VA, USA fox@vt.edu.

PSU/Villanova/VT Discussion Virginia Tech’s Digital Library Research Laboratory Jan. 10, 2005 -- PSU Edward A. Fox, fox@vt.edu Virginia Tech, Blacksburg,