What? Investigating what a corpus is about

Max Kemman, University of Luxembourg

October 25, 2015

Doing Digital History: Introduction to Tools and Technology

Recap from last time

What is distant reading?

What is an n-gram?

What do the Y-axis and X-axis show?

Recap - Assignment

How did the assignment go?

What did you think of the tools used?

Could this be useful for your research?

One more thing on HTML: special characters

http://www.ascii.cl/htmlcodes.htm

Find the symbol and the HTML number

é & ü -> � & �

&#233; & &#252; -> é & ü

In your HTML, write longue dur&#233;e to write longue durée
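The HTML number of a character is simply its Unicode code point, so it can also be looked up with a few lines of code instead of a table. A minimal Python sketch, assuming nothing beyond the standard library (the helper name `to_html_number` is made up for this illustration):

```python
import html


def to_html_number(char):
    """Return the numeric HTML code for a single character."""
    return f"&#{ord(char)};"


# Look up the HTML numbers for the two symbols from the slide.
print(to_html_number("é"))  # &#233;
print(to_html_number("ü"))  # &#252;

# Build the text you would type in your HTML file ...
encoded = "longue dur" + to_html_number("é") + "e"
print(encoded)

# ... and check what a browser would display for it.
print(html.unescape(encoded))
```

`html.unescape` does the reverse step, which is handy for checking that a code really produces the symbol you expect.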

One more thing: what is an algorithm?

A set of rules to follow to solve a problem

Pretty much like a cooking recipe

a = 0
while (a < 10) {
  a = a + 1
}

Today

• The W's of research

• What a corpus is about

• The entities in a corpus

• Another look at our emails

• Voyant Tools

• Next time

• Assignment

The W's of research

Thus far:

1. Abundance of sources

2. Writing for the Web

3. Digitisation and Digital Libraries

4. Big Data

5. Distant Reading

Now: we have a digital corpus, what to do with it?

Research the corpus

Now come the W's of research:

1. What - Investigating what a corpus is about

2. Where - Investigating the spatial entities in a corpus

3. When - Investigating the temporal entities in a corpus

4. Who - Investigating the social entities in a corpus

What?

The first W of interest: what is this corpus actually about?

Different methods are possible:

• Find a description of the corpus to read

• Select a sample of documents to read

• Visualize the words used

What a corpus is about

What is this conference about?

Word clouds

Advantages of word clouds

• Very easy to create

• Visually pleasing

• Gives a quick overview

What does a word cloud do?

Put very simply, a word cloud does the following:

1. Count the number of occurrences per word

2. Size each word by its frequency

3. Lay out the words to form a shape

4. Optional: colorize words to distinguish them and improve readability
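Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: words are split on whitespace, and the font-size range (`min_size`, `max_size`) and the linear scaling are choices made for this example, not what any particular word cloud tool does. The layout and colour steps are left out.

```python
from collections import Counter


def word_sizes(text, min_size=10, max_size=40):
    """Count each word and scale its count linearly to a font size."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; rarer words get smaller sizes.
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


sizes = word_sizes("digital history digital tools digital history")
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(word, size)
```

In this toy text "digital" occurs three times and therefore gets the largest size, which is exactly the information a reader takes from a word cloud.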

Layout

Unlike the Ngram viewer: no X or Y axes

The position of each word is meaningless

The meaning is in the size of the words

Counting

Word clouds visualize the frequency of words

But how to count words that vary in spelling?

• E.g. "Digital" and "digital" and "digitally", "digitize" and "digitization"

Normalization:

• Lowercase

• Tokenize

• Stemming or lemmatizing

• Stop words

Lowercase

We were on vacation in France in August 2015

we were on vacation in france in august 2015

Tokenize

we were on vacation, in france, in august 2015

we|were|on|vacation|in|france|in|august|2015
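Lowercasing and tokenizing can be tried out with a short Python sketch. It assumes a very simple notion of a token, namely any unbroken run of letters or digits; real tokenizers handle hyphens, apostrophes and abbreviations more carefully.

```python
import re


def normalise(text):
    """Lowercase the text and split it into word tokens."""
    lowered = text.lower()
    # \w+ matches runs of letters, digits and underscores;
    # punctuation and spaces act as separators.
    return re.findall(r"\w+", lowered)


tokens = normalise("We were on vacation, in France, in August 2015")
print("|".join(tokens))
# we|were|on|vacation|in|france|in|august|2015
```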

Stemming or lemmatizing

digitized|digital|digitization|digitizing

Stemming: digit

Lemmatizing: digitiz|digital

Could be very useful, especially with Latin texts
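To get a feeling for what stemming does, here is a deliberately crude Python sketch that chops a handful of suffixes off a word. The suffix list is an assumption chosen to fit the example words above; it is not an established stemming algorithm such as Porter's, which has many more rules.

```python
# Hypothetical suffix list for this example only, longest suffixes first.
SUFFIXES = ("ization", "izing", "ized", "ize", "al")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["digitized", "digital", "digitization", "digitizing"]
stems = [toy_stem(word) for word in words]
print(stems)  # ['digit', 'digit', 'digit', 'digit']
```

All four spellings end up as the same stem, so a word count would treat them as one word. A lemmatizer instead maps each word to a dictionary form, which needs a vocabulary for the language and cannot be done by cutting off endings alone.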

Stop words

Most common words in the language: and, or, the

Sometimes: remove numbers

Not of interest (usually)

we|were|on|vacation|in|france|in|august|2015

we|were|vacation|france|august|
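The stop word step can be sketched in Python as a simple filter. The stop word list below is a tiny made-up list for this example; real lists, such as the ones built into word cloud tools, contain hundreds of words per language.

```python
# Hypothetical, very short stop word list for illustration.
STOP_WORDS = {"and", "or", "the", "on", "in"}


def remove_stop_words(tokens, drop_numbers=True):
    """Drop stop words and, optionally, tokens made up of digits only."""
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        if drop_numbers and token.isdigit():
            continue
        kept.append(token)
    return kept


tokens = "we|were|on|vacation|in|france|in|august|2015".split("|")
filtered = remove_stop_words(tokens)
print("|".join(filtered))
# we|were|vacation|france|august
```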

What are these grants about? (normalized)

Comparing between different parts of the corpus

Sources separated by their citation behaviour

Representing a model of the text

What if we do not know how to separate sources?

Or if we want to know what other words are related to our keywords?

Topic modelling

Documents and words can be directly observed, but topics are latent

How to represent the topics in a corpus?

• Statistics to find topics represented by groups of words

• A document is a mix of topics

• A topic is a mix of words

(Slides on topic modelling from Pim Huijnen and Marijn Koolen)

Topic modelling

Assumption: two documents with the same topics will have overlap in words

For a given corpus, the modelling process does the following:

1. Create a word probability distribution for each topic

2. Create a topic probability distribution for each document
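The two distributions fit together as follows: the chance of seeing a word in a document is the sum, over all topics, of the topic's share in the document times the word's share in the topic. The Python sketch below shows only this combination step. The topics, words and probabilities are invented for the example; estimating them from a real corpus is the job of the topic modelling algorithm and is not shown.

```python
# Invented word probability distribution per topic (each sums to 1).
topic_words = {
    "war": {"army": 0.5, "front": 0.4, "treaty": 0.1},
    "diplomacy": {"treaty": 0.6, "embassy": 0.3, "army": 0.1},
}

# Invented topic probability distribution for one document (sums to 1).
doc_topics = {"war": 0.25, "diplomacy": 0.75}


def word_probability(word, doc_topics, topic_words):
    """Probability of a word in a document, as a mix over its topics."""
    return sum(
        share * topic_words[topic].get(word, 0.0)
        for topic, share in doc_topics.items()
    )


for word in ["treaty", "embassy", "army", "front"]:
    print(word, word_probability(word, doc_topics, topic_words))
```

Because this document is mostly about "diplomacy", the word "treaty" comes out as the most likely word, even though it also occurs in the "war" topic.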

Topic modelling

In short: a corpus is represented by statistical topics

This allows us to:

• Separate sources by topics

• Find related keywords

Comparing different parts of the corpus

Mendeley Research Maps

Comparing the topical similarity

Assigned documents to disciplines to map disciplines by topics

Which form of machine learning would this be?

What is the corpus about?

We can now represent the words or the topics of a corpus

But, remember: World War I ≠ "World War I"

The entities in a corpus

Thus far we know the frequencies of all the words

But what are we interested in?

What do we need for the other W's?

• Where - places

• When - dates

• Who - people

People in the corpus

Ter Braake & Fokkens - Fairly easy to discover famous people (with biographical dictionaries and Ngram viewers)

Ngrams help top-down: when you know who to search for

But how to discover who did not become famous, while prominent in their own time?

Need to find all people bottom-up by identifying all the names

Bottom-up process

Ter Braake & Fokkens

1. Identify all names in the corpus

2. Give all names an identifier

3. Disambiguate names referring to the same person

4. Compare results with a non-digital corpus

5. Visualize the results

6. Interpret!

Identifying names

Combinations of words that start with a capital

This won't work for German

Their algorithm allows for two sequential lowercase words: Johan van der Capellen

Note: built for recall, not precision
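The idea of the heuristic can be sketched in Python as follows. This is a reconstruction from the description above only, not Ter Braake & Fokkens' actual implementation: it collects runs of capitalized words, lets a run continue across at most two lowercase words if a capitalized word follows, and keeps runs with at least two capitalized words. The function name and the `max_lower` parameter are assumptions for this example.

```python
import re


def is_capitalized(token):
    return token.isalpha() and token[0].isupper()


def is_lowercase(token):
    return token.isalpha() and token.islower()


def find_name_candidates(text, max_lower=2):
    """Return runs of capitalized words, bridging up to max_lower lowercase words."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    names = []
    i = 0
    while i < len(tokens):
        if not is_capitalized(tokens[i]):
            i += 1
            continue
        name, capitals, j = [tokens[i]], 1, i + 1
        while j < len(tokens):
            if is_capitalized(tokens[j]):
                name.append(tokens[j])
                capitals += 1
                j += 1
                continue
            # Look ahead over at most max_lower lowercase words.
            k = j
            while k < len(tokens) and k - j < max_lower and is_lowercase(tokens[k]):
                k += 1
            if k > j and k < len(tokens) and is_capitalized(tokens[k]):
                name.extend(tokens[j:k + 1])
                capitals += 1
                j = k + 1
            else:
                break
        if capitals >= 2:
            names.append(" ".join(name))
        i = j
    return names


text = "In 1781 Johan van der Capellen published a pamphlet; Pieter Paulus replied."
print(find_name_candidates(text))
# ['Johan van der Capellen', 'Pieter Paulus']
print(find_name_candidates("Yesterday Rembrandt painted."))
# ['Yesterday Rembrandt']
```

The second call shows the price of aiming for recall: any capitalized word at the start of a sentence can end up glued to a name.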

Recall & Precision

Recall: retrieve all relevant entities

Precision: do not retrieve irrelevant entities

For algorithms there is usually a choice of which one to optimize

Aiming for recall of people referred to with a single name (Erasmus, Rembrandt) would lead to too much noise = lower precision
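Both measures can be computed once you have a list of what the algorithm retrieved and a list of what it should have retrieved. A minimal Python sketch, with invented example names:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of entities."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: which share of what we retrieved is relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: which share of the relevant entities did we retrieve?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the people really mentioned versus the algorithm's output.
relevant = {"Johan van der Capellen", "Pieter Paulus", "Erasmus"}
retrieved = {"Johan van der Capellen", "Pieter Paulus", "Yesterday Rembrandt", "The Hague"}

precision, recall = precision_recall(retrieved, relevant)
print("precision:", precision)
print("recall:", recall)
```

Here two of the four retrieved names are correct (precision 0.5) and two of the three real people were found (recall about 0.67). Erasmus is missed because he is referred to with a single name.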

Difficulties

Spelling of names (especially before the 19th century)

People with the same name

Nicknames and changing names

People with the same title

Context matters!

Named Entity Recognition

We want to identify the entities

We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2.

Named Entities

Or we want to see:

We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2.

• People: Max

• Places: France, Apt

• Organizations: Intermarche

• Dates: August 2015

• Currencies: €2
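A real NER tool relies on a trained language model. Purely as an illustration of what the output looks like, the Python sketch below tags the example text by looking up hand-made name lists and two regular expressions. The lists in `GAZETTEER` and the patterns in `PATTERNS` are assumptions written for this one example; they would miss almost everything in another text.

```python
import re

TEXT = (
    "We were on vacation in France in August 2015. I went to shop at the "
    "Intermarche. The area around Apt is really nice. Max also bought ice "
    "cream, which cost €2."
)

# Hypothetical hand-made name lists (a "gazetteer").
GAZETTEER = {
    "People": ["Max"],
    "Places": ["France", "Apt"],
    "Organizations": ["Intermarche"],
}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical patterns: "Month YYYY" dates and euro amounts.
PATTERNS = {
    "Dates": rf"\b(?:{MONTHS}) \d{{4}}\b",
    "Currencies": r"€\d+(?:\.\d+)?",
}


def tag_entities(text):
    """Return the entities found in the text, grouped by category."""
    found = {}
    for label, names in GAZETTEER.items():
        found[label] = [
            name for name in names
            if re.search(rf"\b{re.escape(name)}\b", text)
        ]
    for label, pattern in PATTERNS.items():
        found[label] = re.findall(pattern, text)
    return found


entities = tag_entities(TEXT)
for label, values in entities.items():
    print(f"{label}: {', '.join(values)}")
```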

Another look at our emails

For all 30k emails, we performed text normalisation and named entity recognition

Let's take a look at https://www.wikileaks.org/clinton-emails/emailid/8

Exercise 1: try to normalise the text

Exercise 2: try to discover the named entities: People, Places, Organisations

Normalised

See Email8-normalised.txt in Moodle under "Emails"

unclassife,us,department,state,case,f--,doc,date,release,full,hrod,clintonemailcom,sent,friday,july,pm,sullivanjj,stategov,subject,re,pakistan,bomb,ok,go,original,message,sullivan,jacob,sullivanjj,stategov,sent,fri,jul,subject,pakistan,bomb,fyi,put,follow,statement,statement,secretary,clinton,bomb,shrine,sy,ali,hujviri,lahore,shock,sadden,yesterday,attack,one,pakistan,popular,place,worship,shrine,sy,ali,hujviri,data,ganjbakhsh,lahore,claime,live,many,innocent,pakistane,extremist,shown,respect,neither,human,dignity,fundamental,religious,value,pakistani,society,violact,sanctity,rever,shrine,particularly,sinister,attempt,destabilize,pakistan,intimidate,people,attacker,will,succeed,pakistani,public,refuse,cow,violence,condemn,brutal,crime,reaffirm,commitment,support,pakistani,people,effort,defend,democracy,violent,extremist,seek,destroy,thought,prayer,family,victim,people,pakistan

Named Entities

Try to do it by hand

NER tool: http://nlp.stanford.edu:8080/ner/

People: Sullivan, Jacob, CLINTON, Ali Hujviri

Places: Pakistan, Pakistan, Lahore, Pakistan, Lahore, Pakistan, Pakistan

Organisations: U.S. Department of State Case No, Shrine of Syed Ali Hujviri

Visualise the email

Go to http://tagcrowd.com/

Compare with and without stop words

Compare normal and normalised text

What?

So, what's the email about? Do we get different perspectives?

Voyant Tools

Go to www.voyant-tools.org/

Use Mozilla Firefox, it doesn't work in Chrome (that's what went wrong during the lecture)

From Moodle: download the files for emails 6000-6019: f6-20-raw.txt and f6-20-normalised.txt

You can paste in text, or upload the file

Continue by hitting Reveal

Saving the Voyant session

It might be a good idea to copy the URL early on, as this will allow you to refresh the page if the tool crashes, or to open the tool again later on using the data and stop words you already had

Share the session: hover the mouse over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL: this will open a new browser window from which you can copy the address

Voyant windows

Look at all the windows in Voyant and see if you understand them

1. Cirrus (word cloud)

2. Reader

3. Summary

4. Trends

5. Contexts

Voyant Word Clouds

• In the Cirrus, hold the mouse on the title bar and click the 3rd icon

• Select the stop word list you need

• Or Edit List to add more words: 1 word per line, click Save

• Check "apply globally" to activate in all windows

• Hit Confirm

Use the word cloud to detect common words we're not interested in: unclassified, department, subject, etc.

When editing again, the stop words are ordered alphabetically, so you might not see them at the end anymore

Voyant Summary

What is the longest email?

What are distinctive words?

Distinctive words calculated by TF-IDF: what was that again?

Update: the distinctive words feature doesn't work now that we combined all the emails in a single text file
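As a reminder, TF-IDF multiplies how often a term occurs in one document (term frequency) by how rare the term is across all documents (inverse document frequency). The Python sketch below uses one common textbook variant, tf × log(N / df); the exact weighting Voyant uses may differ, and the two mini "emails" are invented. It also illustrates why a single combined file gives no distinctive words: with one document, every term occurs in all documents, so every score is zero.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Two invented, already normalised emails.
emails = [
    ["pakistan", "bomb", "statement"],
    ["pakistan", "schedule", "meeting"],
]

separate = tf_idf(emails)
print(separate[0])  # "pakistan" scores 0.0, "bomb" and "statement" score higher

combined = tf_idf([emails[0] + emails[1]])
print(combined[0])  # every score is 0.0
```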

Searching specific words

In the Cirrus window, you can click Terms in the top bar to get the list of words ordered by count

You can see immediately per word how it develops over time in the emails

From this list you can select a word by checking the box to the left of it

Alternatively, you can search for words per window. For example, in the Contexts window (lower-right), at the bottom is a search box where you can search for words

Interpreting with Voyant

What are the biggest words?

How do they develop throughout the emails?

Does this tell you what the emails are about and how the conversation develops?

If not: what is different?

Sharing the Voyant

You can either

• Take screenshots of what you want to show

• Share the session: hover the mouse over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL: this will open a new browser window from which you can copy the address. The HTML snippet will give you HTML code that you can embed in your report.

• Share specific windows: for example, in the top bar of Trends, click the first icon (see image), and select to export a URL, an HTML snippet for embedding, or a PNG for including in your report

Next time

1 November: No class

8 November: When? Temporal entities and timelines

Assignment

Perform Voyant analysis of HC emails

Do comparisons in separate Voyant windows

Compare (see next slide for all the available files):

• f6-100-raw.txt vs f6-100-normalised.txt to see how text normalisation gives a different perspective

• For further comparisons, choose either the raw or the normalised text:

• f6-1000-*.txt vs f7-1000-*.txt to see how the emails are different

• If Voyant or your computer has difficulty with 1000 emails, compare f6-100-*.txt vs f7-100-*.txt

Download files from Moodle:

Emails      Raw               Normalised
6000-6099   f6-100-raw.txt    f6-100-normalised.txt
7000-7099   f7-100-raw.txt    f7-100-normalised.txt
6000-6999   f6-1000-raw.txt   f6-1000-normalised.txt
7000-7999   f7-1000-raw.txt   f7-1000-normalised.txt

Assignment

Work in groups of two or three

Use the tools discussed today to try and find something you find interesting. Document your steps and choices and discuss why a finding is of interest, and whether you can be certain of this finding.

Hand in the assignment in HTML, include your name and a decent profile photo

500-1000 words, in English

Possible questions you might ask of your corpora

• What are these emails about?

• Do we need to further clean the data?

• How are these corpora different?

• Does text normalisation lead to different results?

Grading

Do note: the finding itself is not the most important part

• 1 pt for free

• 3 pts for HTML

• 3 pts for documentation of your process

• 3 pts for critical reflection on your finding

Email to max.kemman@uni.lu before the start of the next lecture