Natural Language Processing for LODLAM
Presented at IGeLU 2014 by Corey A Harper, 2014-09-16
A brief intro to machine learning & data science
for Libraries
Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014
Context
Narrative
Story telling
The Library's story,
and the Archives' story,
but also…
Users’ stories
Scholars' stories
Adding context through recombinant metadata
Scholars' & Users' Stories – Tim Sherratt (@wragge)
Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/
Library Authority Data
“Include links to other URIs, so that they can discover more things.”
Short of providing and linking to URIs, this *is* authority data.
This is what our authority files are for.
Linked data is about context
authorities provide context
and yet our controlled vocabs
are nearly gone
because the interfaces to them were broken
The Death of Browse
• Next-Gen Discovery Systems don't make use of Authority Control
• “Browse” was/is broken as a UI Design
• Rich data in Authorities, disconnected from narrative, context, search
• Richer “Authority” type data outside libraries...
• “Next Gen Next Gen Discovery…
Fuzzy Wuzzy – SeatGeek
Fuzzy Wuzzy – Awesome Library from SeatGeek
https://github.com/seatgeek/fuzzywuzzy
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
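To give a feel for what fuzzy string matching does, here is a minimal sketch using only Python's standard-library difflib. It is not FuzzyWuzzy's API; the function names simple_ratio and token_sort_ratio are chosen for illustration, and the name strings are made-up examples.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Compare strings after sorting their words, so word order is ignored."""
    def normalize(s: str) -> str:
        # Lowercase, replace punctuation with spaces, then sort the tokens.
        cleaned = "".join(ch if ch.isalnum() else " " for ch in s.lower())
        return " ".join(sorted(cleaned.split()))
    return simple_ratio(normalize(a), normalize(b))


if __name__ == "__main__":
    # Inverted and direct forms of the same (illustrative) name.
    print(simple_ratio("Harper, Corey A.", "Corey A. Harper"))
    print(token_sort_ratio("Harper, Corey A.", "Corey A. Harper"))
```

Sorting tokens before comparing is one simple way to make inverted headings ("Surname, Forename") comparable with natural-order names.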
Slide courtesy of Doug Oard, Univ. of Maryland
Tools - Natural Language Processing
• DBPedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
• Zemanta: http://www.zemanta.com/?wpst=1
• Open Calais: http://www.opencalais.com/
• Open Refine: http://openrefine.org/
• DataTXT: https://dandelion.eu/products/datatxt/
• AlchemyAPI: http://www.alchemyapi.com/
• FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy
Where does this lead?
We need new interfaces
new tools
for a new kind of cataloger
for knowledge organization experts
Linked Jazz Back End
Primo PNX and Authorities
• Indexing Cross References
• New Browse Functionality
• Authority Control from Aleph / Alma
  • What about non-MARC, or non-Aleph Data?
• Matching Strings to Authorities
Enter Open Refine
http://freeyourmetadata.org/
Match strings to vocabularies…
Like LCNAF…
Or Wikipedia
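Matching strings to a vocabulary can be sketched roughly as follows. This is not how Open Refine's reconciliation service works internally; it is a standard-library approximation, and the AUTHORITY table, its example.org URIs, the reconcile function, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical authority file: preferred label -> URI (placeholder URIs).
AUTHORITY = {
    "Coltrane, John": "http://example.org/authority/1",
    "Davis, Miles": "http://example.org/authority/2",
    "Monk, Thelonious": "http://example.org/authority/3",
}


def normalize(label: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in label.lower())
    return " ".join(sorted(cleaned.split()))


def reconcile(name, authority=AUTHORITY, threshold=0.85):
    """Return (label, uri, score) for the best match, or None below threshold."""
    target = normalize(name)
    best = None
    for label, uri in authority.items():
        score = SequenceMatcher(None, target, normalize(label)).ratio()
        if best is None or score > best[2]:
            best = (label, uri, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


if __name__ == "__main__":
    for name in ("Miles Davis", "Thelonius Monk", "Ella Fitzgerald"):
        print(name, "->", reconcile(name))
```

Returning None below the threshold leaves uncertain matches for a person to review rather than linking them automatically.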
Automated Authority Control?
Open Refine RDF Skeleton
Proposed System Architecture
Hydra Modeling & Architecture
• Approaches to Provenance
  • Prov-O
  • Named Graphs
  • Named Datastreams
• “n” nyucore “records”
  • Same properties defined for each
  • Keep data sources separate
  • Merge for display in Blacklight & export to Primo
Separate Metadata Datastreams
• source_metadata, enrich_metadata
  • Reload one or both without affecting other or native metadata
• native_metadata
  • Edited only through Hydra UI
  • Partitioned from external sources
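The idea of separate datastreams merged only for display can be illustrated with a small sketch. This is plain Python, not Hydra or Fedora code; the Record class, its reload and merged methods, the merge order, and the sample values are assumptions made for illustration. Only the three stream names come from the slide.

```python
from dataclasses import dataclass, field

# Assumed merge order. A value found in several streams keeps the first
# stream in this order as its recorded provenance.
DISPLAY_ORDER = ("native_metadata", "source_metadata", "enrich_metadata")


@dataclass
class Record:
    """One object whose metadata is held in three separate streams."""
    native_metadata: dict = field(default_factory=dict)
    source_metadata: dict = field(default_factory=dict)
    enrich_metadata: dict = field(default_factory=dict)

    def reload(self, stream: str, data: dict) -> None:
        """Replace one external stream without touching the others."""
        if stream not in ("source_metadata", "enrich_metadata"):
            raise ValueError(f"{stream} cannot be reloaded from an external source")
        setattr(self, stream, dict(data))

    def merged(self) -> dict:
        """Merge streams for display as {property: [(value, stream), ...]}."""
        out = {}
        for stream in DISPLAY_ORDER:
            for prop, values in getattr(self, stream).items():
                for value in values:
                    pairs = out.setdefault(prop, [])
                    # Skip values already contributed by an earlier stream.
                    if all(existing != value for existing, _ in pairs):
                        pairs.append((value, stream))
        return out


if __name__ == "__main__":
    record = Record(
        native_metadata={"title": ["Kind of Blue"]},
        source_metadata={"title": ["Kind of Blue"], "creator": ["Davis, Miles"]},
    )
    record.reload("enrich_metadata", {"subject": ["Jazz"]})
    print(record.merged())
```

Because each value carries the name of the stream that supplied it, a display layer can show or filter by provenance, and reloading an external stream never overwrites locally edited metadata.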
Metadata Provenance
Fedora Datastreams
Blacklight User Interface
Where does this lead?
We need new interfaces
new tools
for a new kind of cataloger
for knowledge organization experts
A Role for Ex Libris
• Alma &/or Primo
  • Named Entity Recognition
  • Vocabulary Reconciliation
  • Provenance Management
• Primo Central
  • Named Entity Recognition on Full Text
  • Auto Classification
A bit louder...
we need new interfaces
we need enterprise tools
Integrated into our metadata management systems
for a new kind of cataloger
for knowledge organization experts
Simplified Workflow Proposal
More Tools – At Programming Level
• Open NLP: https://opennlp.apache.org/
• Stanford NLP software: http://nlp.stanford.edu/software/index.shtml
• Python Tools
  • scikit-learn, pandas, NLTK, SciPy, NumPy
  • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
  • http://pandas.pydata.org/
  • http://www.nltk.org/
More Data Science-ey Tools
http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html
Data Science Techniques
• Feature Extraction / Feature Engineering
• Predictive Modeling
• Probabilistic Classification – Large Multi-Class Problems
• Text Analytics
  • Vectorization
  • Bags & Sets of Words
  • TF/IDF
  • N-Grams
  • Sparse Matrices
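Several of the text analytics terms above fit into one short sketch: tokenizing, building n-grams, and storing bag-of-words counts sparsely. It uses only the standard library (in practice a library such as scikit-learn would do this); the function names and the two sample sentences are invented for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(docs, n_max=2):
    """Return (vocabulary, rows); each row is a sparse {column: count} dict."""
    vocab = {}   # term -> column index
    rows = []
    for doc in docs:
        tokens = tokenize(doc)
        terms = []
        for n in range(1, n_max + 1):
            terms.extend(ngrams(tokens, n))
        row = {}
        for term, count in Counter(terms).items():
            col = vocab.setdefault(term, len(vocab))
            row[col] = count   # only non-zero entries are stored
        rows.append(row)
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = vectorize(["the cat sat", "the cat the hat"])
    print(vocab)
    print(rows)
```

Each row is a bag of words (term counts); taking only the keys of a row gives the corresponding set of words. Storing only non-zero counts is what makes the matrix sparse.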
Simple Example – Predict Yelp Star Ratings
Fitting a Model – Naïve Bayes
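The code shown on this slide is not preserved in the transcript. As a stand-in, here is a minimal multinomial Naive Bayes classifier in pure Python with add-one smoothing. The NaiveBayes class is written for illustration, and the four reviews and their star labels are invented toy data, not Yelp data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(class) plus the sum of smoothed log P(word | class)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:   # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Invented toy reviews with star ratings as class labels.
    texts = [
        "great food friendly staff",
        "terrible service cold food",
        "amazing great experience",
        "awful terrible never again",
    ]
    stars = [5, 1, 5, 1]
    model = NaiveBayes().fit(texts, stars)
    print(model.predict("great staff"))
    print(model.predict("terrible cold service"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.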
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
$$\mathrm{IDF}(t) = 1 + \ln\left(\frac{\text{Total Document Count}}{\text{Documents Containing Term}}\right)$$
http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
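The inverse document frequency expression on this slide, one plus the natural log of the total document count over the number of documents containing the term, can be computed directly. A minimal sketch follows; the idf and tf_idf function names and the tiny tokenized corpus are invented for illustration, and raw term count is assumed as the term-frequency component.

```python
import math


def idf(term, documents):
    """1 + ln(total documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return 1 + math.log(len(documents) / containing)


def tf_idf(term, doc, documents):
    """Raw term count in one document, weighted by the term's IDF."""
    return doc.count(term) * idf(term, documents)


if __name__ == "__main__":
    # Invented corpus; each document is already a list of tokens.
    corpus = [
        ["jazz", "archive"],
        ["jazz", "library"],
        ["library", "catalog"],
        ["jazz"],
    ]
    for term in ("jazz", "library", "catalog"):
        print(term, idf(term, corpus))
```

Rare terms get a higher weight than common ones, and a term that occurs in every document gets an IDF of exactly 1.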
Where can we go from here?
• NER is just the beginning
• Feature Engineering
• Hiring Statisticians
• Clustering & Classification
• Vocabulary Pruning and Engineering
  • Manageable 10-20k Class Text Classification Problems
  • Domain Specific
• Ex Libris’ Activity in this space
Thanks!
212.998.2479
@chrpr