The Indexer’s Legacy: Promoting Access to a Million Books
![Page 1: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/1.jpg)
The Indexer’s Legacy: Promoting Access to a Million Books
Michael Huggett, Edie Rasmussen
ICDL 2010
![Page 2: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/2.jpg)
Overview
• Problem statement
• Background to study
  – Indexers and Indexes
  – From Print to Digital Book Collections
  – Searching Digital Collections
• Research Project
  – Pilot Study
  – Phase I: Building the Collection
  – Phase II: Deconstructing the Indexes
  – Phase III: Building a meta-index
  – Phase IV: Index-augmented search
![Page 3: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/3.jpg)
Digital Book Projects
• Project Gutenberg (1971+)
• Million Book Project (2002+)
  – Universal Digital Library
• Google Books Library Project (2004+)
• Open Content Alliance (2005+)
  – Universal Digital Library, Internet Archive
• And many others...
![Page 4: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/4.jpg)
Searching Digital Collections
• Combination of ‘dirty OCR’ of text plus page image
• Standard IR techniques: query leads to relevance-ranked output
• Text-level vs. passage-level retrieval (e.g. INEX Book Track)
• Adequate for many purposes
• Problems with heterogeneity of text, ambiguity of terms
![Page 5: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/5.jpg)
Problem Statement I
The “million books problem”
  – “…the human life contains only about 30,000 days; reading a book a day we would finish a million books only after 30 lifetimes of reading… No longer a distant probability, a digital representation of [the vast written record of humanity] is taking shape before us…”
  – “digitization does provide scale (or quantity) but does so at the price of rich, largely manual encoding” (Many More Than A Million, 2007)
![Page 6: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/6.jpg)
Problem Statement II
• Role of indexes: the index is one of the oldest known information retrieval devices, representing a network of interrelationships among concepts in a text
• Intellectual effort: an index represents hours of interpretation and analysis
• Intellectual content: includes information about a book’s content but also incorporates the structure of knowledge in a given field
• Standard information retrieval techniques reduce index terms (and all text terms) to a ‘bag of words’ model
![Page 7: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/7.jpg)
Research Goal
As we move from print to digital collections of scholarly works, how can we retain, extract and use the knowledge that is embedded in the indexes?
The goal of this research is to develop techniques that will help to capture, visualize and access the world’s digital knowledge through application of text processing techniques to digital indexes of legacy materials
![Page 8: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/8.jpg)
The Indexing Process
• Read → identify indexable concepts (mark) → create vocabulary → invert? → sort and format (software) → add cross references → edit for consistency
• Reduces contents of a book to its essentials (5–10%)
• Vocabulary is author’s plus indexer’s
• Goal is to facilitate access to material in the text
![Page 9: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/9.jpg)
Knowledge in Indexes
• Premises:
  – The index identifies the most significant topics in the book
  – The index expresses the topics in the author’s vocabulary and in the vocabulary of the field (i.e. that of the reader)
  – The index provides links between concepts, showing how they are related
  – As indexes on a topic are aggregated, significant concepts related to that topic, and the relationships between them, are reinforced, creating both a vocabulary and a guide to the collection
![Page 10: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/10.jpg)
Challenges
• Not all books are indexed
• Indexing conventions have changed over time
• Books in public domain are older; quality of index may be lower
• Quality of OCR, errors in text
• No markup; index structure is indicated visually (e.g. indents, punctuation)
• Matching page numbers in index to physical pages in text
![Page 11: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/11.jpg)
![Page 12: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/12.jpg)
![Page 13: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/13.jpg)
Related Research
• ‘Key ideas’ (Schilit and Kolak, Google Research, 2008)
  – Mining and linking ideas in digital books
  – Quotation extraction (quote plus context)
• ‘Searching in a book’ (Liesaputra, Witten and Bainbridge, NZDL, 2009)
• E-book usability with indexes (Noorhidawata, 2007)
• Reorganizing indexes (Chi et al., 2004)
  – Creating mini-indexes ‘on the fly’
![Page 14: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/14.jpg)
Pilot Study I
• Work on a small number of digital items
  – 3 biographies of Charles Darwin
  – 12 books on BC history from UBC University Press
• Software to parse indexes
  – From PDF to index structures
• Operator driven: scan and correct OCR errors; key indicators in database
• Parse index terms and entries by shared references
• Identify common words on shared page references
![Page 15: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/15.jpg)
Pilot Study II
• Preliminary results:
  – Measure of coherence
    • Rank terms by frequency and normalize
    • Deviation = ∑(average rank – term rank)
    • Calculated for content, index entries, index words
    • Calculated for all terms, and for shared terms only
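The coherence measure above can be illustrated with a minimal Python sketch. The slide does not say how ranks are normalized or whether the differences are signed, so the following are assumptions: ranks are scaled to (0, 1] within each book, absolute differences are used (a signed sum around the mean would always be zero), and the sum is averaged over observations. The function names `normalized_ranks` and `rank_deviation` are hypothetical.

```python
from collections import Counter

def normalized_ranks(tokens):
    """Rank terms by frequency (1 = most frequent) and scale ranks to (0, 1]."""
    counts = Counter(tokens)
    # Descending frequency; ties broken alphabetically so results are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    n = len(ordered)
    return {term: (i + 1) / n for i, term in enumerate(ordered)}

def rank_deviation(books, shared_only=False):
    """Mean absolute difference between a term's rank in each book and
    its average rank across the books that contain it (ASSUMED formula)."""
    if not books:
        return 0.0
    ranks = [normalized_ranks(tokens) for tokens in books]
    if shared_only:
        # Keep only terms that occur in every book.
        vocab = set.intersection(*(set(r) for r in ranks))
    else:
        vocab = set().union(*ranks)
    total, observations = 0.0, 0
    for term in vocab:
        term_ranks = [r[term] for r in ranks if term in r]
        average = sum(term_ranks) / len(term_ranks)
        total += sum(abs(average - rank) for rank in term_ranks)
        observations += len(term_ranks)
    return total / observations if observations else 0.0
```

Under these assumptions a value of 0 means every term holds the same normalized rank in every book, and larger values mean the books disagree more about which terms matter.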
![Page 16: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/16.jpg)
Pilot Study III
• Preliminary results:
  – Index terms show more coherence than corpus terms
  – Suggests that back-of-book (BoB) indexes are a good source of corpus-level keywords
| | Corpus | Index Entries | Index Words |
|---|---|---|---|
| All terms | 0.5361 | 0.3163 | 0.1913 |
| Shared terms | 0.3792 | 0.0129 | 0.0278 |
![Page 17: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/17.jpg)
Phase I: Building a Test Collection
• Needed:
  – General collection
    • Collection of 1000 books
    • With indexes!
    • In the public domain
  – Topic-oriented collections (5-6?)
    • Collections of 100(?) books in a topic
• GRAs to identify and download target books
• Result: a test collection for this project (and others)
![Page 18: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/18.jpg)
Phase II: Deconstructing the Indexes
• No BoB indexing standards
• No controlled vocabulary
• A few indexing conventions
  – Headings, subheadings, sub-subheadings...
  – Structure is indicated by spacing and punctuation
• Need to parse the index to identify entries and page references
• Parsing software written and tested
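As a rough illustration of the parsing task described above (not the project's actual software), the following Python sketch recovers heading hierarchies and page references from an indented-style index. It assumes clean text in which each subheading level is indented by two spaces and page references trail the heading after a comma or space; real OCR output would need error correction first. The function name `parse_index` and the output layout are hypothetical.

```python
import re

# A page reference is a number or a range such as 45-47 or 45–47.
PAGE = r"\d+(?:[-–]\d+)?"
PAGE_REF = re.compile(PAGE)
# Heading text, a separator, then a comma-separated list of page references.
ENTRY = re.compile(
    rf"^(?P<heading>.*?)(?:,\s*|\s+)(?P<pages>{PAGE}(?:,\s*{PAGE})*)\s*$"
)

def parse_index(text, indent_width=2):
    """Return a list of {'path': (heading, subheading, ...), 'pages': [...]}."""
    entries = []
    path = []  # headings from the top level down to the current line
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        level = indent // indent_width
        match = ENTRY.match(raw.strip())
        if match:
            heading = match.group("heading")
            pages = PAGE_REF.findall(match.group("pages"))
        else:
            # A heading with no locators, e.g. one that only groups subheadings.
            heading, pages = raw.strip(), []
        path = path[:level] + [heading]
        entries.append({"path": tuple(path), "pages": pages})
    return entries
```

Run-on (paragraph-style) indexes, cross references such as "see also", and headings that end in a number are not handled here and would need additional rules.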
![Page 19: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/19.jpg)
Phase II: Research Questions
• How can index structure (run-on or indented, heading hierarchies) be extracted?
• Can keyphrases be extracted (proper nouns, concepts)?
• What are the syntax and semantics of indexes?
• Can we identify the historical development of indexes? How have they changed over time?
• Can we use XML to create a useful intermediate product?
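One possible shape for an XML intermediate product is sketched below using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`index`, `entry`, `locator`, `heading`, `book`) are invented for illustration and are not taken from the project; the input is assumed to be a list of (heading path, page references) pairs.

```python
import xml.etree.ElementTree as ET

def index_to_xml(book_id, entries):
    """Build a nested <entry> tree from (path, pages) pairs.

    `entries` is a list of (tuple_of_headings, list_of_page_strings);
    the schema used here is a hypothetical one, chosen for readability.
    """
    root = ET.Element("index", {"book": book_id})
    for path, pages in entries:
        node = root
        for heading in path:
            # Reuse an existing child entry with this heading, if any.
            child = next(
                (e for e in node.findall("entry") if e.get("heading") == heading),
                None,
            )
            if child is None:
                child = ET.SubElement(node, "entry", {"heading": heading})
            node = child
        for page in pages:
            ET.SubElement(node, "locator").text = page
    return root

root = index_to_xml(
    "darwin-bio-1",  # hypothetical book identifier
    [
        (("Darwin, Charles",), ["12"]),
        (("Darwin, Charles", "voyage of the Beagle"), ["30-35"]),
    ],
)
print(ET.tostring(root, encoding="unicode"))
```

Nesting subheadings inside their parent heading keeps the hierarchy explicit, which the visual layout of a printed index only implies.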
![Page 20: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/20.jpg)
Phase III: Building a Meta-Index
• Meta-index: a digital collection-level aggregation of the BoB indexes for a digital collection
• Merging/concatenating index entries
• May be a standard index format (alphabetical, hierarchical entries), i.e., a digital browsable index
• Or may use new formats, e.g. visualizations, topic maps
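A minimal sketch of the merging step, assuming each book's index has already been reduced to a mapping from heading to page numbers. Headings are matched only by lowercasing and collapsing whitespace, which is a deliberate simplification; the function name `build_meta_index` and the book identifiers are hypothetical.

```python
from collections import defaultdict

def build_meta_index(book_indexes):
    """Merge per-book indexes into one alphabetical collection-level index.

    `book_indexes` maps book_id -> {heading: [page, ...]}.
    Returns {normalized_heading: {book_id: [page, ...]}} sorted by heading.
    """
    meta = defaultdict(dict)
    for book_id, index in book_indexes.items():
        for heading, pages in index.items():
            # Simplistic normalization: case-fold and collapse whitespace.
            key = " ".join(heading.lower().split())
            meta[key].setdefault(book_id, []).extend(pages)
    return dict(sorted(meta.items()))

books = {
    "A": {"Beagle": [30], "Finches": [101]},
    "B": {"beagle": [12, 14]},
}
meta = build_meta_index(books)
```

Here `meta["beagle"]` points into both books, which is the kind of cross-book entry a browsable meta-index would expose. Reconciling variant headings (inversions, synonyms, spelling changes over time) is the harder part and is not attempted.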
![Page 21: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/21.jpg)
[Figure: diagram with the labels Book, Index, Digital Collection, Meta-Index 1, and Meta-Index 2]
![Page 22: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/22.jpg)
Phase III: Research Questions
• Can digital versions of BoB indexes be used to facilitate access to digital collections?
• What form should these indexes take?
  – Conventional index format (alphabetical/searchable with headings and subheadings)
  – Index visualization
• How do these meta-indexes compare to a standard search engine when searching a digital collection?
• Evaluation: task-oriented evaluation with human subjects (e.g. Humanities scholars)
![Page 23: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/23.jpg)
Phase IV: Index Augmented Search
• Using the index information in new ways
  – Building an ontology in domain areas
  – Identifying concept relationships between index vocabulary and term vocabulary
  – Use for
    • Query expansion
    • Question answering
    • Summarization
    • Categorization
![Page 24: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/24.jpg)
Phase IV: Research Questions
• Based on standard text processing procedures, i.e. stemming, use of stopwords, keyphrase extraction, term weighting such as tf*idf or BM25
• How strong is the relationship between the index entry and the words on the page(s) referred to?
• Assume that for a single entry, this relationship is weak; over multiple similar entries in many books, do real relationships emerge and false ones disappear?
• Evaluation: using external collections, e.g. TREC or INEX, to measure contribution of index term relationships to retrieval performance
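The aggregation idea in the third question can be sketched as follows: for each index entry, collect the words on its referenced pages, then keep only the words that recur for the same entry in several books. This uses plain book-level support counts with a tiny illustrative stopword list, not stemming, tf*idf, or BM25; the function names, the `min_books` threshold, and the data layout are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "on", "was", "his"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def entry_word_support(books, min_books=2):
    """Count, per index entry, in how many books each word appears on a
    page that the entry refers to; keep words seen in >= min_books books.

    `books` is a list of (index, pages) pairs, where `index` maps
    entry -> [page numbers] and `pages` maps page number -> page text.
    """
    support = defaultdict(Counter)
    for index, pages in books:
        for entry, refs in index.items():
            words = set()  # each word counts at most once per book
            for page in refs:
                words.update(tokenize(pages.get(page, "")))
            support[entry.lower()].update(words)
    return {
        entry: {w: n for w, n in counts.items() if n >= min_books}
        for entry, counts in support.items()
    }
```

With a single book every word on the referenced pages looks related to the entry; raising `min_books` is a crude way to let recurring associations survive while one-off ones drop out.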
![Page 25: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/25.jpg)
Further Research
• Building themed or personalized collections (using index for book similarity measures)
• Ability to mine large multidisciplinary collections for references (historical, economic, etc.)
• Ability to mine collections and build special-format indexes and browsers (e.g. images, figures)
• Changes in topics over time, evolution of thinking on a subject
• Knowledge discovery: detecting previously undiscovered links between topics
![Page 26: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/26.jpg)
The Indexer’s Legacy…
(a) an archaic addendum to an obsolete medium?
OR
(b) value-added knowledge in electronic text that enhances access to digital collections?
![Page 27: The Indexer’s Legacy: Promoting Access to a Million Books](https://reader035.fdocuments.us/reader035/viewer/2022062804/5681493c550346895db6871b/html5/thumbnails/27.jpg)
Thank you!