Text Mining Wksp Auvil
![Page 1: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/1.jpg)
Engineering Knowledge for the Humanities
Text Mining Workshop
April 26, 2008
Loretta Auvil
National Center for Supercomputing Applications (NCSA)
University of Illinois
![Page 2: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/2.jpg)
No Formulas
![Page 3: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/3.jpg)
More Visualizations
www.visualcomplexity.com
![Page 4: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/4.jpg)
NoraVis OpenLaszlo
www.noraproject.org
![Page 5: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/5.jpg)
NoraVis Backend
• Leverages D2K as a web service call for predictive modeling
  • Passing parameters for some options
• Known modeling problems:
  • Training on a very sparse set of words, so improvements in modeling can be achieved through additional semantic additions
![Page 6: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/6.jpg)
MONK
www.monkproject.org
![Page 7: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/7.jpg)
Challenges in Humanities Collaboration
• Understanding terminology and text mining capabilities
• Learning their needs
• Creating meaningful ways to display and present results
• Technology innocence
• Bridging different software tools
• Appreciating how long things take to develop
• Working collaboratively as a team
![Page 8: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/8.jpg)
How To Address these Challenges
• Educate team on data and text mining approaches
• Demonstrate approaches with working examples
• Develop use cases that drive software development
• Create an environment/infrastructure that lets us create data flows that are component based
• Deploy web services for computations
• Develop web application for setting up problem and delivery of results
![Page 9: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/9.jpg)
SEASR Project Highlights
• SEASR will employ a comprehensive environment that integrates two complementary and revolutionary technical advances, Service Oriented Architecture and Semantic Web, into a single computing architecture: Semantic Enabled Service Oriented Architecture
• SEASR will be enriched with a broad range of knowledge representation and reasoning capabilities
• SEASR addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world
![Page 10: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/10.jpg)
What does this mean for the Humanities Community?
SEASR will:
• help scholars locate and access documents of interest in the sea of large data stores
• provide scholars with enhanced data synthesis and query analysis
  • from focused data retrieval and data integration
  • to intelligent human-computer interactions for knowledge access
  • to semantic data enrichment
  • to entity and relationship discovery
  • to knowledge discovery and hypothesis generation
• empower collaboration among scholars by enhancing and innovating virtual research environments
![Page 11: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/11.jpg)
Specific Project Highlights
• Common Services Layer: Provide execution environment and supporting infrastructure that map from the problem solving layer to the resource layer
  • Designed and developed Meandre (semantic, web-driven data flow execution environment)
  • Developed the ability to define extensions for executing components in languages other than Java; extensions have already been created for Python and Common Lisp
• Problem Solving Layer: Visual environments that turn components and web services into a domain-specific problem solving environment
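To make "component-based data flow" concrete, here is a minimal sketch in plain Python. It does not use Meandre or its API; the component names, the stopword list, and the `make_flow` helper are all hypothetical and only illustrate the idea of chaining small reusable components into a flow.

```python
from collections import Counter
from functools import reduce

STOPWORDS = {"and", "of", "the"}  # hypothetical stopword list, for illustration only


def tokenize(text):
    """Component 1: split raw text into lowercase tokens."""
    return text.lower().split()


def remove_stopwords(tokens):
    """Component 2: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]


def count_words(tokens):
    """Component 3: count how often each remaining token occurs."""
    return Counter(tokens)


def make_flow(*components):
    """Chain components so that each one's output feeds the next one."""
    def run(data):
        return reduce(lambda value, component: component(value), components, data)
    return run


word_count_flow = make_flow(tokenize, remove_stopwords, count_words)
counts = word_count_flow("Men and women and kinds of men")
```

Because each component only depends on its input, the same components can be reordered or reused in other flows, which is the property the slide attributes to a component-based environment.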
![Page 12: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/12.jpg)
Workbench
![Page 13: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/13.jpg)
Community Hub
![Page 14: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/14.jpg)
Semantically Enabled SOA
![Page 15: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/15.jpg)
Semantically Enabled SOA 2
![Page 16: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/16.jpg)
A Problem from the MONK Project
• Analyze the repetition that occurs in “The Making of Americans” by Gertrude Stein
![Page 17: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/17.jpg)
Repetition in The Making of Americans
| Text Source | Making of Americans | Moby Dick | Uncle Tom’s Cabin |
|---|---|---|---|
| Total words (tokens) | 517,207 | 220,254 | 190,906 |
| Unique words (types) | 5,329 | 17,190 | 11,730 |
| Average word frequency | 97.06 | 12.81 | 16.28 |
| Total pages | ~900 | ~623 | ~530 |
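The average word frequency in the table appears to be total tokens divided by unique types. The short check below reproduces the three averages from the token and type counts on the slide; the variable and function names are illustrative.

```python
# (tokens, types) per book, copied from the table above.
corpus = {
    "The Making of Americans": (517_207, 5_329),
    "Moby Dick": (220_254, 17_190),
    "Uncle Tom's Cabin": (190_906, 11_730),
}


def average_word_frequency(tokens, types):
    """Average number of times each distinct word is used."""
    return tokens / types


averages = {
    title: round(average_word_frequency(tokens, types), 2)
    for title, (tokens, types) in corpus.items()
}

for title, value in averages.items():
    print(f"{title}: {value}")
```

On these numbers, each distinct word in The Making of Americans is reused roughly six to eight times more often than in the other two novels, which is the repetition the MONK problem sets out to analyze.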
![Page 18: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/18.jpg)
Visualization Approach from ManyEyes
Many Eyes Website: http://services.alphaworks.ibm.com/manyeyes/view/S4ZIjIsOtha6H~kYwoKjI2~
![Page 19: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/19.jpg)
Solution… came gradually
• Examine book by comparing each paragraph
• Create feature set based on moving window of n-grams (3-grams) across each paragraph
• Preprocess text
  • To Stem or Not
    – “I will throw the umbrella in the mud”
    – “Martha was throwing the umbrella in the mud”
  • To Keep Punctuation or Not
• Execute the CLOSET algorithm (from Jiawei Han et al.)
• Providing the following early results:

5:[a description of]:[1085|1087|1084|1082|1086]

4:[men and women]:[1085|1083|1084|1088]

4:[this is now|is now a]:[1087|1082|1086|1088]

3:[a description of|now a description|this is now|is now a]:[1087|1082|1086]

3:[kinds of men|of men and|men and women]:[1085|1083|1088]

3:[this is now|is now a|now a description|a description of]:[1087|1082|1086]

How do we make this meaningful to humanists??
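The steps above can be sketched in a few lines of Python. This is not the CLOSET algorithm itself (CLOSET uses a compressed FP-tree); it is a brute-force stand-in that finds the same kind of closed frequent sets of 3-grams on a toy input. The sample paragraphs, their ids, the function names, and the reading of each result line as `support:[3-grams]:[paragraph ids]` are assumptions made for illustration.

```python
import re
from collections import defaultdict


def ngrams(paragraph, n=3):
    """Set of word n-grams from a moving window (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z']+", paragraph.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def closed_patterns(paragraphs, min_support=2):
    """Brute-force closed frequent n-gram sets over paragraphs.

    Returns (support, sorted n-grams, sorted paragraph ids) tuples.
    """
    # Map each n-gram to the ids of the paragraphs that contain it.
    occurs = defaultdict(set)
    for pid, text in paragraphs.items():
        for gram in ngrams(text):
            occurs[gram].add(pid)
    frequent = {g: ids for g, ids in occurs.items() if len(ids) >= min_support}

    # Close the family of paragraph-id sets under intersection.
    idsets = {frozenset(ids) for ids in frequent.values()}
    work = list(idsets)
    while work:
        current = work.pop()
        for other in list(idsets):
            shared = current & other
            if len(shared) >= min_support and shared not in idsets:
                idsets.add(shared)
                work.append(shared)

    # A closed pattern is every n-gram that occurs in all paragraphs of an id set.
    results = []
    for ids in idsets:
        grams = sorted(g for g, where in frequent.items() if ids <= where)
        results.append((len(ids), grams, sorted(ids)))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


# Hypothetical toy paragraphs keyed by paragraph id.
paragraphs = {
    1: "This is now a description of men and women.",
    2: "This is now a description of living.",
    3: "Kinds of men and women.",
}

patterns = closed_patterns(paragraphs)
for support, grams, ids in patterns:
    print(f"{support}:[{'|'.join(grams)}]:[{'|'.join(map(str, ids))}]")
```

This sketch keeps neither stemming nor punctuation, so “throw the umbrella” and “throwing the umbrella” would count as different 3-grams; that is exactly the preprocessing trade-off the slide raises.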
![Page 20: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/20.jpg)
How to visualize… Trying existing tools
Brad Paley, TextArc
M. Wattenberg, Arc Diagrams
TimeSearcher
SpotFire
No context, No reading original text, No scale, No trends…
![Page 21: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/21.jpg)
Custom Solution - FeatureLens
FeatureLens, an early MONK (Metadata Offer New Knowledge) application, uses the machine learning approach of frequent pattern mining to identify fuzzy repetition patterns in a data collection with no initial human input.
• Organized into sections (in this case chapters)
• Rank frequent patterns by frequency and length
• Show frequent patterns of n-grams in context
• Rank frequent patterns by distribution trends, per collection and per section
• Compare multiple patterns on the same views: distributions, sections, paragraphs
• Read the text (with highlighting of patterns)
• Some options for handling scale for large data sets (e.g. each line is five paragraphs)
• Search for particular word
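Two of the features listed above, ranking patterns by frequency and length and showing their distribution per section, can be sketched as follows. This is not FeatureLens code: the pattern records reuse three of the early results shown earlier, while the paragraph-to-chapter mapping and the interpretation of “length” as the number of n-grams in a pattern are assumptions.

```python
from collections import Counter

# Patterns taken from the early results: n-grams plus the paragraphs they occur in.
patterns = [
    {"ngrams": ["men and women"], "paragraphs": [1085, 1083, 1084, 1088]},
    {"ngrams": ["a description of"], "paragraphs": [1085, 1087, 1084, 1082, 1086]},
    {"ngrams": ["this is now", "is now a"], "paragraphs": [1087, 1082, 1086, 1088]},
]

# Hypothetical mapping from paragraph id to section (chapter).
section_of = {pid: "chapter A" if pid <= 1085 else "chapter B"
              for pid in range(1082, 1089)}


def rank_patterns(items):
    """Most frequent first; ties broken by the number of n-grams in the pattern."""
    return sorted(items, key=lambda p: (-len(p["paragraphs"]), -len(p["ngrams"])))


def section_distribution(pattern, sections):
    """Count, per section, the paragraphs in which the pattern occurs."""
    return Counter(sections[pid] for pid in pattern["paragraphs"])


ranked = rank_patterns(patterns)
for pattern in ranked:
    print(pattern["ngrams"], dict(section_distribution(pattern, section_of)))
```

Per-section counts like these are the raw material for the distribution views: plotted across chapters, they show where a repeated phrase clusters.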
![Page 22: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/22.jpg)
FeatureLens: Organized into sections (chapters)
Created by Anthony Don and team at http://www.cs.umd.edu/hcil/textvis/featurelens/.
![Page 23: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/23.jpg)
FeatureLens: patterns sorted by frequency and length
![Page 24: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/24.jpg)
FeatureLens: n-gram patterns in context
![Page 25: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/25.jpg)
FeatureLens: distribution trends
![Page 26: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/26.jpg)
FeatureLens: multiple patterns
![Page 27: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/27.jpg)
The New Way to Read
• By visualizing certain patterns in this text (and, it follows, in larger collections in general), by looking at the text “from a distance” through textual analytics and visualizations, one can “read” the novel in ways formerly impossible.
• Franco Moretti has argued that the solution to truly incorporating a more global perspective in our critical literary practices is not to read more of the vast amounts of literature available to us, but to read it differently by employing “distant reading.” “We know how to read texts,” he writes, “now let’s learn how not to read them.”

Franco Moretti, Conjectures on World Literature. New Left Review, 1 (Jan.-Feb. 2000): 68.
![Page 28: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/28.jpg)
Stories buried in the repetition…
![Page 29: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/29.jpg)
Massive Digitization Projects
• What can be done with these large digital text collections?
• How can we use these large digital text collections?
• Justify the use of computers and advanced techniques to process these collections, because we (humans) can’t read this much
• The point is not to save the reader from reading the individual texts or from making an independent judgment of each document's characteristics; rather, the point is to learn from the reader's holistic impression of the text and then, having done so, to show the reader what evidence correlates with these impressions
![Page 30: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/30.jpg)
Transformational New Research Topics for Humanities
• Track patterns in morphology, syntax, and semantics across large stretches of time, space and culture
• Track topics or terminology across thousands of texts
• Track the social and economic influence of topics
• Study multi-lingual and cultural impacts
• Study literary inheritance
• Study the evolution of ideas
• and a lot more
![Page 31: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/31.jpg)
Exploratory Analysis Environments
• Provide access to text
• Focus on specific passages
• Allow for comparative reading
• Provide enriched context for text and data analysis
![Page 32: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/32.jpg)
References
• J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
• Tanya Clement, Anthony Don, Loretta Auvil, Catherine Plaisant, Greg Pape and Vered Goren. “‘Something that is interesting is interesting them’: Using text mining and visualizations to aid interpreting repetition in Gertrude Stein’s The Making of Americans”, Digital Humanities 2007.
![Page 33: Text Mining Wksp Auvil](https://reader033.fdocuments.us/reader033/viewer/2022051323/5479d548b479599a098b483e/html5/thumbnails/33.jpg)
Automated Learning Group / SEASR Team
Michael Welge
Bernie Ács
Boris Capitanu
Lily Dong
Peter Groves
Amit Kumar
Xavier Llorà
Chad Olson
Mary Pietrowicz
Duane Searsmith
Kelly Searsmith
David Tcheng