Post on 19-Jun-2020
What? Investigating what a corpus is about
Max Kemman, University of Luxembourg
October 25, 2015
Doing Digital History: Introduction to Tools and Technology
Recap from last time
What is distant reading?
What is an n-gram?
What do the Y-axis and X-axis show?
Recap - Assignment
How did the assignment go?
What did you think of the tools used?
Could this be useful for your research?
One more thing on HTML: special characters
http://www.ascii.cl/htmlcodes.htm
Find the symbol and the HTML number
é & ü -> &#233; & &#252; (HTML numbers)
é & ü -> &eacute; & &uuml; (HTML names)
In your HTML, write longue dur&#233;e to get longue durée
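As a small illustration of the HTML numbers, the sketch below converts non-ASCII characters to numeric entities and back. The helper name `to_html_number` is made up for this example; `html.unescape` comes from the Python standard library.

```python
import html

def to_html_number(text):
    """Replace every non-ASCII character with its numeric HTML entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

encoded = to_html_number("longue durée")
print(encoded)                 # longue dur&#233;e
print(html.unescape(encoded))  # longue durée
```

The number in the entity is simply the Unicode code point of the character, which is why é becomes 233 and ü becomes 252.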
One more thing: what is an algorithm?
A set of rules to follow to solve a problem
Pretty much like a cooking recipe
a = 0
while (a < 10) {
    a = a + 1
}
Today
• The W's of research
• What a corpus is about
• The entities in a corpus
• Another look at our emails
• Voyant Tools
• Next time
• Assignment
The W's of research
Thus far:
1. Abundance of sources
2. Writing for the Web
3. Digitisation and Digital Libraries
4. Big Data
5. Distant Reading
Now: we have a digital corpus, what to do with it?
Research the corpus
Now come the W's of research:
1. What - Investigating what a corpus is about
2. Where - Investigating the spatial entities in a corpus
3. When - Investigating the temporal entities in a corpus
4. Who - Investigating the social entities in a corpus
What?
The first W of interest: what is this corpus actually about?
Different methods are possible
• Find a description of the corpus to read
• Select a sample of documents to read
• Visualize the words used
What a corpus is about
What is this conference about?
Word clouds
Advantages of word clouds
• Very easy to create
• Visually pleasing
• Gives a quick overview
What does a word cloud do?
Put very simply, a word cloud does the following:
1. Count the number of occurrences per word
2. Size each word by its frequency
3. Lay out the words to form a shape
4. Optional: colorize words for distinguishing and better readability
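Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration, not how any particular word cloud tool works: the function name `word_sizes` and the font-size range are assumptions, and the layout and colouring steps are left out.

```python
from collections import Counter

def word_sizes(text, min_size=10, max_size=48):
    """Count each word and scale its font size linearly by frequency."""
    counts = Counter(text.lower().split())
    top = max(counts.values())  # assumes the text contains at least one word
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }

sizes = word_sizes("digital history digital tools digital history")
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.1f}pt")
```

The most frequent word gets the largest size and every other word is scaled relative to it.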
Layout
Unlike the Ngram viewer: no X or Y axes
The position of each word is meaningless
The meaning is in the size of the words
Counting
Word clouds visualize the frequency of words
But how to count words that vary in spelling?
• E.g. "Digital" and "digital" and "digitally", "digitize" and "digitization"
Normalization:
• Lowercase
• Tokenize
• Stemming or lemmatizing
• Stop words
Lowercase
We were on vacation in France in August 2015
we were on vacation in france in august 2015
Tokenize
we were on vacation, in france, in august 2015
we|were|on|vacation|in|france|in|august|2015
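Lowercasing and tokenizing can be sketched with a regular expression. This is one simple way to do it, assuming that a token is any run of letters or digits; real tokenizers handle hyphens, apostrophes and abbreviations more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"\w+", text.lower())

tokens = tokenize("We were on vacation, in France, in August 2015")
print("|".join(tokens))  # we|were|on|vacation|in|france|in|august|2015
```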
Stemming or lemmatizing
digitized | digital | digitization | digitizing
Stemming: digit
Lemmatizing: digitize | digital
Could be very useful, especially with Latin texts
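To make the idea of stemming concrete, here is a deliberately crude suffix stripper. It is a toy, not the Porter stemmer or any other real algorithm: the suffix list is a hypothetical one chosen only to cover the four example words.

```python
# Hypothetical suffix list for this example only, longest suffix first.
SUFFIXES = ("ization", "izing", "ized", "al")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["digitized", "digital", "digitization", "digitizing"]
stems = {toy_stem(word) for word in words}
print(stems)  # {'digit'}
```

All four words collapse to the same stem, so a word cloud would count them as one word. A lemmatizer would instead look up dictionary forms and would need a vocabulary for the language.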
Stop words
Most common words in the language: and, or, the
Sometimes: remove numbers
Not of interest (usually)
we|were|on|vacation|in|france|in|august|2015
we|were|vacation|france|august
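Stop word removal is a simple filter over the tokens. The sketch below uses a tiny hypothetical stop word list; real lists contain a few hundred words per language.

```python
# Hypothetical, very short stop word list for illustration.
STOP_WORDS = {"and", "or", "the", "on", "in"}

def remove_stop_words(tokens, drop_numbers=True):
    """Drop stop words and, optionally, tokens that consist only of digits."""
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        if drop_numbers and token.isdigit():
            continue
        kept.append(token)
    return kept

tokens = "we|were|on|vacation|in|france|in|august|2015".split("|")
print("|".join(remove_stop_words(tokens)))  # we|were|vacation|france|august
```

With a longer list, "we" and "were" would be removed as well, so the choice of list directly shapes what the word cloud shows.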
What are these grants about? (normalized)
Comparing between different parts of the corpus
Sources separated by their citation behaviour
Representing a model of the text
What if we do not know how to separate sources?
Or if we want to know what other words are related to our keywords?
Topic modelling
Documents and words can be directly observed, but topics are latent
How to represent the topics in a corpus?
(Slides on topic modelling from Pim Huijnen and Marijn Koolen)
• Statistics to find topics represented by groups of words
• Document is a mix of topics
• Topic is a mix of words
Topic modelling
Assumption: two documents with the same topics will have overlap in words
For a given corpus, the modelling process does:
1. Create word probability distribution for topics
2. Create topic probability distribution for documents
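The two distributions can be pictured with a toy example. All numbers, topic names and document names below are invented by hand for illustration; an actual topic model (for example LDA) would estimate them from the corpus. The sketch only shows how the two distributions combine into the probability of seeing a word in a document.

```python
# Hand-picked, hypothetical distributions; a real model would learn these.
topic_words = {
    "war": {"army": 0.5, "front": 0.4, "ship": 0.1},
    "trade": {"ship": 0.6, "market": 0.4},
}
doc_topics = {
    "letter_1": {"war": 0.9, "trade": 0.1},
    "letter_2": {"war": 0.2, "trade": 0.8},
}

def word_probability(word, doc):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        share * topic_words[topic].get(word, 0.0)
        for topic, share in doc_topics[doc].items()
    )

print(round(word_probability("ship", "letter_1"), 2))  # 0.15
print(round(word_probability("ship", "letter_2"), 2))  # 0.5
```

The word "ship" belongs to both topics, but it is far more likely in the document that is mostly about trade.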
Topic modelling
In short: a corpus is represented by statistical topics
This allows us to:
• Separate sources by topics
• Find related keywords
Comparing different parts of the corpus
Mendeley Research Maps
Comparing the topical similarity
Assigned documents to disciplines to map disciplines by topics
Which form of machine learning would this be?
What is the corpus about?
We can now represent the words or the topics of a corpus
But, remember: World War I ≠ "World War I"
The entities in a corpus
Thus far we know the frequencies of all the words
But what are we interested in?
What do we need for the other W's?
• Where - places
• When - dates
• Who - people
People in the corpus
Ter Braake & Fokkens - Fairly easy to discover famous people (with biographical dictionaries and Ngram viewers)
Ngrams help top-down: when you know who to search for
But how to discover who did not become famous, while prominent in their own time?
Need to find all people bottom-up by identifying all the names
Bottom-up process
Ter Braake & Fokkens
1. Identify all names in the corpus
2. Give all names an identifier
3. Disambiguate names referring to the same person
4. Compare results with a non-digital corpus
5. Visualize the results
6. Interpret!
Identifying names
Combinations of words that start with a capital
This won't work for German
Their algorithm allows for two sequential lowercase words: Johan van der Capellen
Note: built for recall, not precision
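The capital-letter heuristic can be sketched as follows. This is a guess at the general idea, not the algorithm Ter Braake & Fokkens actually used: the function name, the rule that a name needs at least two capitalised words, and the treatment of punctuation and numbers as breaks are all assumptions. The example sentences are made up.

```python
import re

def find_names(text, max_lowercase=2):
    """Collect runs of capitalised words, allowing up to `max_lowercase`
    lowercase words in a row (such as 'van der') inside a run."""
    tokens = re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text)
    names, current, pending = [], [], []

    def flush():
        # Keep only runs with at least two capitalised words.
        if len(current) > 1:
            names.append(" ".join(current))
        current.clear()
        pending.clear()

    for token in tokens:
        if token[0].isupper():
            current.extend(pending)  # lowercase words seen inside the run
            pending.clear()
            current.append(token)
        elif token.isalpha() and current and len(pending) < max_lowercase:
            pending.append(token)
        else:
            flush()  # punctuation, numbers, or too many lowercase words
    flush()
    return names

sentence = "In 1781, Johan van der Capellen was mentioned in a letter to Pieter Paulus."
print(find_names(sentence))
print(find_names("Rembrandt met Jan Six"))
```

The first call finds both names. The second returns "Rembrandt met Jan Six" as one name, which shows the kind of noise a recall-oriented rule produces.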
Recall & Precision
Recall: retrieve all relevant entities
Precision: do not retrieve irrelevant entities
For algorithms, there is usually a choice of what to optimize
• Recall of people referred to with a single name (Erasmus, Rembrandt) would lead to too much noise = lower precision
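Both measures are simple ratios over sets. In the sketch below the "relevant" and "retrieved" sets are invented for illustration; in practice the relevant set comes from manual annotation.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: three people in the text, four strings found.
relevant = {"Johan van der Capellen", "Erasmus", "Rembrandt"}
retrieved = {"Johan van der Capellen", "Rembrandt", "The Hague", "Monday"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Retrieving more candidates tends to raise recall and lower precision, which is the trade-off described above.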
Difficulties
Spelling of names (especially before the 19th century)
People with the same name
Nicknames and changing names
People with the same title
Context matters!
Named Entity Recognition
We want to identify the entities
We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2.
Named Entities
Or we want to see:
We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2.
• People: Max
• Places: France, Apt
• Organizations: Intermarche
• Dates: August 2015
• Currencies: €2
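A very rough way to get such a list is to look up known names and match simple patterns. The sketch below is a toy lookup, not a real named entity recognizer: the gazetteer and the two patterns are hand-made assumptions that only cover the example text, whereas NER tools learn from annotated data and use context.

```python
import re

# Hypothetical hand-made gazetteer and patterns for this one example.
GAZETTEER = {
    "People": ["Max"],
    "Places": ["France", "Apt"],
    "Organizations": ["Intermarche"],
}
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
PATTERNS = {
    "Dates": rf"(?:{MONTHS}) \d{{4}}",
    "Currencies": r"€\d+(?:\.\d+)?",
}

def tag_entities(text):
    """Return {label: [entities found in the text]}."""
    found = {}
    for label, names in GAZETTEER.items():
        found[label] = [
            name for name in names
            if re.search(rf"\b{re.escape(name)}\b", text)
        ]
    for label, pattern in PATTERNS.items():
        found[label] = re.findall(pattern, text)
    return found

text = ("We were on vacation in France in August 2015. I went to shop at the "
        "Intermarche. The area around Apt is really nice. Max also bought "
        "ice cream, which cost €2.")
for label, entities in tag_entities(text).items():
    print(f"{label}: {', '.join(entities)}")
```

A lookup like this cannot tell "Max" the person from "max" as an abbreviation, or "Apt" the town from the adjective, which is why context matters.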
Another look at our emails
For all 30k emails, we performed text normalisation and named entity recognition
Let's take a look at https://www.wikileaks.org/clinton-emails/emailid/8
Exercise 1: try to normalise the text
Exercise 2: try to discover the named entities: People, Places, Organisations
Normalised
See Email8-normalised.txt in Moodle under "Emails"
unclassife,us,department,state,case,f--,doc,date,release,full,hrod,clintonemailcom,sent,friday,july,pm,sullivanjj,stategov,subject,re,pakistan,bomb,ok,go,original,message,sullivan,jacob,sullivanjj,stategov,sent,fri,jul,subject,pakistan,bomb,fyi,put,follow,statement,statement,secretary,clinton,bomb,shrine,sy,ali,hujviri,lahore,shock,sadden,yesterday,attack,one,pakistan,popular,place,worship,shrine,sy,ali,hujviri,data,ganjbakhsh,lahore,claime,live,many,innocent,pakistane,extremist,shown,respect,neither,human,dignity,fundamental,religious,value,pakistani,society,violact,sanctity,rever,shrine,particularly,sinister,attempt,destabilize,pakistan,intimidate,people,attacker,will,succeed,pakistani,public,refuse,cow,violence,condemn,brutal,crime,reaffirm,commitment,support,pakistani,people,effort,defend,democracy,violent,
extremist,seek,destroy,thought,prayer,family,victim,people,pakistan
Named Entities
Try to do it by hand
NER tool: http://nlp.stanford.edu:8080/ner/
People: Sullivan, Jacob, CLINTON, Ali Hujviri
Places: Pakistan, Pakistan, Lahore, Pakistan, Lahore, Pakistan, Pakistan
Organisations: U.S. Department of State Case No, Shrine of Syed Ali Hujviri
Visualise the email
Go to http://tagcrowd.com/
Compare with and without stop words
Compare normal and normalised text
What?
So, what's the email about? Do we get different perspectives?
Voyant Tools
Go to www.voyant-tools.org/
Use Mozilla Firefox, it doesn't work in Chrome (that's what went wrong during lecture)
From Moodle: download the files for emails 6000-6019: f6-20-raw.txt and f6-20-normalised.txt
You can paste in text, or upload the file
Continue by hitting Reveal
Saving the Voyant session
It might be a good idea to copy the URL early on, as this will allow you to refresh the page if the tool crashes, or to open the tool again later on using the data and stop words you already had
Share the session: hover the mouse over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL: this opens a new browser window from which you can copy the address
Voyant windows
Look at all the windows in Voyant and see if you understand them
1. Cirrus (word cloud)
2. Reader
3. Summary
4. Trends
5. Contexts
Voyant Word Clouds
• In the Cirrus, hold the mouse on the title bar, and click the 3rd icon
• Select the stop word list you need
• Or Edit List to add more words: 1 word per line, click Save
• Check "apply globally" to activate in all windows
• Use the word cloud to detect common words we're not interested in: unclassified, department, subject, etc.
• Hit Confirm
• When editing again, the stop words are ordered alphabetically, so you might not see them at the end anymore
Voyant Summary
What is the longest email?
What are distinctive words?
Distinctive words are calculated by TF-IDF: what was that again?
Update: the distinctive words feature doesn't work now that we combined all the emails in a single text file
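As a reminder of what TF-IDF does, here is a minimal sketch using one common textbook variant (term frequency times the logarithm of the inverse document frequency). Whether Voyant uses exactly this formula is an assumption, and the second example email is invented. The last lines show why a single combined file yields no distinctive words: with one document, every word occurs in all documents and scores zero.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document (each a list of tokens)."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

emails = [
    ["subject", "pakistan", "bomb", "pakistan"],
    ["subject", "meeting", "schedule"],  # invented second email
]
separate = tf_idf(emails)
print(separate[0])  # 'subject' scores 0.0, 'pakistan' scores highest

combined = tf_idf([emails[0] + emails[1]])
print(combined[0])  # every score is 0.0
```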
Searching specific words
In the Cirrus window, you can click Terms in the top bar to get the list of words ordered by count
You can see immediately per word how it develops over time in the emails
From this list you can select a word by checking the box to the left of it
Alternatively, you can search for words per window. For example, in the Contexts window (lower-right), at the bottom is a search box where you can search for words
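What a contexts view shows can be approximated with a keyword-in-context listing: each hit of the search word with a few words on either side. The sketch below is an illustration of that idea, not Voyant's implementation; the function name and the window size are assumptions, and the tokens are taken from the normalised email shown earlier.

```python
def keyword_in_context(tokens, keyword, window=2):
    """Return each occurrence of `keyword` with `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

tokens = "statement secretary clinton bomb shrine sy ali hujviri lahore".split()
for line in keyword_in_context(tokens, "bomb"):
    print(line)  # secretary clinton [bomb] shrine sy
```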
Interpreting with Voyant
What are the biggest words?
How do they develop throughout the emails?
Does this tell you what the emails are about and how it goes?
If not: what is different?
Sharing the Voyant
You can either
• Take screenshots of what you want to show
• Share the session: hover the mouse over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL: this opens a new browser window from which you can copy the address
• The HTML snippet will give an HTML code that you can embed in your report.
• Share specific windows: for example, in the top bar of Trends, click the first icon (see image), and select to export a URL, an HTML snippet for embedding, or a PNG for including in your report
Next time
1 November: No class
8 November
When? Temporal entities and timelines
Assignment
Perform Voyant analysis of HC emails
Compare (see next slide for all the available files):
Do comparisons in separate Voyant windows
• f6-100-raw.txt vs f6-100-normalised.txt to see how text normalisation gives a different perspective
• For further comparisons, choose either the raw or the normalised text:
• f6-1000-*.txt vs f7-1000-*.txt to see how the emails are different
• If Voyant or your computer has difficulty with 1000 emails, compare f6-100-*.txt vs f7-100-*.txt
Download files from Moodle:
Emails Raw Normalised
6000-6099 f6-100-raw.txt f6-100-normalised.txt
7000-7099 f7-100-raw.txt f7-100-normalised.txt
6000-6999 f6-1000-raw.txt f6-1000-normalised.txt
7000-7999 f7-1000-raw.txt f7-1000-normalised.txt
Assignment
Work in groups of two or three
Use the tools discussed today to try and find something you find interesting. Document your steps and choices and discuss why a finding is of interest, and whether you can be certain of this finding.
Hand in the assignment in HTML, include your name and a decent profile photo
500-1000 words, in English
Possible questions you might ask of your corpora
• What are these emails about?
• Do we need to further clean the data?
• How are these corpora different?
• Does text normalisation lead to different results?
Grading
Do note: the finding itself is not the most important part
Email to max.kemman@uni.lu before the start of the next lecture
• 1 pt for free
• 3 pts for HTML
• 3 pts for documentation of your process
• 3 pts for critical reflection on your finding