Integrating Data Mining and Data Management Technologies for Scholarly Inquiry


Page 1: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

2013.10.12 SLIDE 1 DID Meeting - Montreal

Integrating Data Mining and Data Management Technologies for Scholarly Inquiry

Ray R. Larson, University of California, Berkeley
Paul Watry, University of Liverpool
Richard Marciano, University of North Carolina, Chapel Hill

Page 2: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

• Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
• Goals:
  – Text mining and NLP techniques to extract content (named Persons, Places, Time Periods/Events) and associate context
• Data:
  – Internet Archive Books Collection (with associated MARC where available) ~7.2T
  – JSTOR ~1T
  – Context sources: SNAC Archival and Library Authority records
• Tools:
  – Cheshire3 – DL Search and Retrieval Framework
  – iRODS – Policy-driven distributed data storage
  – Amazon S3 storage and EC2 computing
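The extraction goal above (find named persons and places, then associate them with context records) can be pictured with a very small gazetteer lookup. This is only a sketch under stated assumptions: the slides do not say which NLP tools are used, the gazetteer entries and the `authority:...` keys are made-up placeholders rather than real SNAC or library authority identifiers, and `tag_entities` is a hypothetical helper name.

```python
import re

# Toy gazetteer: surface form -> (entity type, authority key).
# Entries and keys are illustrative placeholders, not real authority records.
GAZETTEER = {
    "charles darwin": ("Person", "authority:person/example-1"),
    "liverpool": ("Place", "authority:place/example-2"),
    "berkeley": ("Place", "authority:place/example-3"),
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Return (matched text, type, authority key, offset) for each gazetteer hit."""
    hits = []
    for name, (etype, key) in gazetteer.items():
        # Whole-word, case-insensitive match of the gazetteer name.
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.group(0), etype, key, m.start()))
    # Report hits in the order they occur in the text.
    return sorted(hits, key=lambda h: h[3])

sample = "Charles Darwin wrote to a colleague in Liverpool."
for hit in tag_entities(sample):
    print(hit)
```

A real pipeline would likely replace the dictionary with a statistical named-entity recognizer and a disambiguation step; the lookup here only shows the shape of the "extract, then attach a context key" idea.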

Page 3: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Grid-Based Digital Libraries: Needs

• Large-scale distributed storage requirements and technologies
• Organizing distributed digital collections
• Shared Metadata – standards and requirements
• Managing distributed digital collections
• Security and access control
• Collection Replication and backup
• Distributed Information Retrieval support and algorithms

Page 4: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

But…

• Haven’t Hadoop and its menagerie already solved everything?
  – Yes – many tasks can now be done with great scale-up
  – And No – most Hadoop solutions are batch oriented and geared more towards summarization than information access
  – Maybe – we are looking at replacing or supplementing the low-level data management with Hadoop or Spark tools

Page 5: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Grid/Cloud IR Issues

• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
• Very large-scale distribution of resources is (still) a challenge for sub-second retrieval
• Unlike most typical Grid/Cloud processes, IR is potentially less compute-intensive and more data-intensive
• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
• We have developed the Cheshire3 system to evaluate and manage these issues. The Cheshire3 system is actually one component in a larger Grid-based environment
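To make the precision/recall and distributed-search points concrete, here is a minimal sketch that merges ranked lists from several nodes and scores the merged result. It is not Cheshire3 code: `merge_ranked`, `precision_recall`, the document ids, and the scores are all hypothetical, and the merge assumes every node scores documents on a comparable scale, which is one of the known problems of metasearch.

```python
import heapq
from itertools import islice

def merge_ranked(node_results, k):
    """Merge per-node lists of (score, doc_id), each sorted by descending score."""
    merged = heapq.merge(*node_results, key=lambda hit: -hit[0])
    return [doc_id for _, doc_id in islice(merged, k)]

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical results from two nodes, each already ranked locally.
node_a = [(0.9, "d1"), (0.4, "d3")]
node_b = [(0.8, "d2"), (0.5, "d4")]
relevant = {"d1", "d4", "d5"}  # assumed relevance judgments

top = merge_ranked([node_a, node_b], k=3)
precision, recall = precision_recall(top, relevant)
print(top, round(precision, 3), round(recall, 3))
```

Running the same evaluation against a single-node index of the same collection would show whether distribution changed the ranking, which is the "same retrieval performance" requirement in the first bullet.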

Page 6: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Cheshire3 Environment

[Figure: Cheshire3 environment diagram, with an "or iRODS" label]

Page 7: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Cheshire3 IR Overview

• XML Information Retrieval Engine
  – 3rd Generation of the UC Berkeley Cheshire system, as co-developed at the University of Liverpool
  – Uses Python for flexibility and extensibility, but uses C/C++ based libraries for processing speed
  – Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI to name a few
  – Grid/Cloud capable. Uses distributed configuration files, workflow definitions and PVM or MPI to scale from one machine to thousands of parallel nodes
  – Free and Open Source Software
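As an illustration of the standards listed above, a CQL query can be sent to an SRU endpoint as an ordinary HTTP GET. The sketch below only builds the request URL. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) come from the SRU specification; the base URL, the index names in the query, and the helper `build_sru_url` are assumptions for the example and do not describe any particular Cheshire3 deployment.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_sru_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    # urlencode percent-escapes quotes, spaces and '=' inside the CQL string.
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and indexes, for illustration only.
cql = 'dc.title = "data mining" and dc.creator = "larson"'
url = build_sru_url("http://localhost:8080/sru/example", cql)
print(url)
```

The response to such a request would be an XML document, which fits the XML/XSLT processing mentioned in the same bullet.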

Page 8: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Cheshire3 Object Model

[Figure: Cheshire3 object model diagram]

Page 9: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Current Version

• iRODS and C3 on Amazon EC2 and S3

[Figure: architecture diagram. Labels: Amazon S3 (Bucket 1, Bucket 2), iRODS, Cache Resource, Amazon EC2, Data Ingestion, Cheshire3, Indexing, Retrieval, iCAT, Rule Engine, Data Presentation]

Page 10: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Sample demo


Page 14: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Summary

• Indexing and IR work very well in the Grid/Cloud environment, with the expected scaling behavior for multiple processes
• Still in progress:
  – We are still collecting and processing the books collection from the Internet Archive
  – We are still extracting place names, personal names, and corporate names and linking them with reference sources (such as GeoNames, VIAF, and SNAC)
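One way to picture why indexing scales across multiple processes is that each worker can index its own slice of the collection and the partial indexes can be merged afterwards. The sketch below is a generic partition-and-merge illustration, not the Cheshire3 workflow: the function names and the three sample documents are invented, and the partitions are processed sequentially here where a real run would hand each one to a separate process or node.

```python
from collections import defaultdict

def index_partition(docs):
    """Build a term -> set(doc_id) index for one partition of the collection."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(partials):
    """Union the posting sets of several partial indexes."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged

# Two hypothetical partitions of a tiny collection.
partitions = [
    {"b1": "grid storage for digital libraries"},
    {"b2": "digital libraries and text mining", "b3": "policy driven storage"},
]
partials = [index_partition(p) for p in partitions]  # one call per worker in a real run
index = merge_indexes(partials)
print(sorted(index["digital"]))  # ['b1', 'b2']
print(sorted(index["storage"]))  # ['b1', 'b3']
```

Because the partitions share no state until the merge, adding workers mainly shortens the indexing phase, while the merge cost grows with the vocabulary.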

Page 15: Integrating Data Mining and Data Management Technologies for  Scholarly Inquiry

Thank you!

Project web site: http://diggingintodata.web.unc.edu
iRODS available via https://www.irods.org
Cheshire3 available via https://github.com/cheshire3

Special thanks to John Harrison (Liverpool), Chien-Yi Hou (UNC), Shreyas and Luis Aguilar (UCB)