Quantitative Text Analysis using R and quanteda
Kenneth Benoit
University of Münster (27–28 June 2019)

Contents

1. Introductions
2. Fundamentals
3. Getting Started
4. Working with a corpus
5. Keywords-in-context
6. Importing text files
7. Tokenization
8. Creating a document-feature matrix
9. Dictionary (sentiment) Analysis
10. Textual Statistics
11. Classification
12. Topic modelling
13. Data wrangling and visualisation

Purpose of the workshop

This workshop is designed to:
  introduce the tools for quantitative text analysis using R
  focus specifically on the quanteda package
  demonstrate several major categories of analysis
  answer any questions about text analysis that you might have

Whom this workshop is for

Those familiar with text analysis methods but not in R/quanteda
Those who have tried R/quanteda but want to become more proficient
The merely "text-curious"
No real prerequisites: we're all here to learn

About me

Ken Benoit (http://kenbenoit.net)
Professor of Computational Social Science, Department of Methodology, London School of Economics and Political Science
Creator and maintainer of the quanteda R package and related tools
Founder of the Quanteda Initiative, a non-profit company aimed at advancing development and dissemination of open-source tools for text analytics
My research: Political science (party competition), measurement, text analysis

Example from my research

Example from my research (cont.)

Your turn!

1. Name?
2. Department, degree?
3. Research interests?
4. Previous experience with text analysis/R?
5. Why are you interested in the workshop?

Assumptions, Concepts, and Examples

Workflow, demystified

Basic concepts and terminology

Texts: Organized into documents
Corpus: Collection of texts, often with associated document-level metadata, what I will call "document variables" or docvars
Stems: words with suffixes removed (using a set of rules)
Lemmas: canonical word form

Word    win   winning   wins   won
Stem    win   win       win    won
Lemma   win   win       win    win

Basic concepts and terminology (cont.)

Stopwords: words that are designated for exclusion from any analysis of a text
Parts of speech: linguistic markers indicating the general category of a word's linguistic property, e.g. noun, verb, adjective, etc.
Named entities: real-world objects, such as persons, locations, organizations, products, etc., that can be denoted with a proper name, often a phrase, e.g. "University of Bergen" or "United Kingdom"
Multi-word expressions: sequences of words denoting a single concept (and that would be a single word in German), e.g. value added tax (in German: Mehrwertsteuer)

Types and tokens

tokens: sequences of characters that are grouped together as a useful semantic unit
  words
  could also include punctuation characters or symbols
  stems or lemmas
  multi-word expressions
  named entities
  usually, but not always, delimited by spaces
type: a unique token
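The distinction between tokens and types can be illustrated with a minimal base-R sketch. It is not how quanteda tokenizes: it assumes that splitting on single spaces is good enough for this one sentence, which is taken from the example used later in these slides.

```r
# Base-R illustration of tokens vs. types (not the quanteda implementation).
txt <- "This is the second document in the corpus"

# Split on single spaces: each resulting string is one token.
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]

# A type is a unique token, so the repeated "the" is counted only once.
typs <- unique(toks)

n_tokens <- length(toks)  # number of tokens
n_types  <- length(typs)  # number of types
```

Because "the" occurs twice, the sentence has one more token than it has types.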

toks <- tokens(c("A corpus is a set of documents.",
                 "This is the second document in the corpus"),
               remove_punct = TRUE)
toks

## tokens from 2 documents.
## text1 :
## [1] "A"         "corpus"    "is"        "a"         "set"       "of"
## [7] "documents"
##
## text2 :
## [1] "This"     "is"       "the"      "second"   "document" "in"
## [7] "the"      "corpus"

tokens_wordstem(toks)

## tokens from 2 documents.
## text1 :
## [1] "A"        "corpus"   "is"       "a"        "set"      "of"
## [7] "document"
##
## text2 :
## [1] "This"     "is"       "the"      "second"   "document" "in"
## [7] "the"      "corpus"

tokens_wordstem(toks) %>%
  tokens_remove(stopwords("english"))

## tokens from 2 documents.
## text1 :
## [1] "corpus"   "set"      "document"
##
## text2 :
## [1] "second"   "document" "corpus"

tokens_wordstem(toks) %>%
  tokens_remove(stopwords("english")) %>%
  types()

## [1] "corpus"   "set"      "document" "second"

Typical workflow steps

1. Lowercase
   "a corpus is a set of documents."
   "this is the second document in the corpus."
2. Remove stopwords and punctuation
   "corpus set documents"
   "second document corpus"
3. Stem
   "corpus set document"
   "second document corpus"
4. Tokenize
   [corpus, set, document]
   [second, document, corpus]
5. Create a document-feature matrix
   a "bag-of-words" conversion of documents into a matrix that counts the features (types) by document
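The first four steps can be sketched in base R. This is only an illustration under stated assumptions: the stopword list `toy_stopwords` and the one-rule "stemmer" are made up for these two sentences and are much cruder than what quanteda provides, and the sketch tokenizes before removing stopwords because that is simpler to write.

```r
# Base-R sketch of the workflow steps. The stopword list and the
# suffix rule are toy assumptions, not quanteda's own resources.
docs <- c("A corpus is a set of documents.",
          "This is the second document in the corpus.")
toy_stopwords <- c("a", "is", "of", "this", "the", "in")

preprocess <- function(x) {
  x <- tolower(x)                            # step 1: lowercase
  x <- gsub("[[:punct:]]", "", x)            # step 2a: remove punctuation
  toks <- strsplit(x, "[[:space:]]+")[[1]]   # step 4: tokenize on whitespace
  toks <- toks[!toks %in% toy_stopwords]     # step 2b: remove stopwords
  sub("ts$", "t", toks)                      # step 3: toy stem ("documents" -> "document")
}

toks_list <- lapply(docs, preprocess)
```

The result is one character vector of features per document, which is the input needed for the document-feature matrix in step 5.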

Document-feature matrix

Document 1: "A corpus is a set of documents."
Document 2: "This is the second document in the corpus."

             corpus   set   document   second
Document 1        1     1          1        0
Document 2        1     0          1        1
...
Document n        0     1          1        0
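A document-feature matrix of this kind can be built by hand, which may help to see what it contains. The sketch below uses a dense base-R matrix and assumes the tokens have already been preprocessed; quanteda itself stores the matrix in a sparse format.

```r
# Base-R sketch of a document-feature matrix (dense, for illustration only).
toks_list <- list(
  Document1 = c("corpus", "set", "document"),
  Document2 = c("second", "document", "corpus")
)

# The features are the types found across all documents, in order of appearance.
features <- unique(unlist(toks_list))

# For each document, count how often each feature occurs.
dfmat <- t(vapply(toks_list,
                  function(toks) colSums(outer(toks, features, "==")),
                  numeric(length(features))))
colnames(dfmat) <- features
```

Each row is a document and each column a feature; word order within a document is no longer recoverable, which is what "bag-of-words" refers to.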

Getting started

Installing quanteda

Install the quanteda package from CRAN:

install.packages("quanteda")

You should also install the readtext package:

install.packages("readtext")

Afterwards, load both packages:

library("quanteda")
library("readtext")
packageVersion("quanteda")

## [1] '1.4.5'

packageVersion("readtext")

## [1] '0.75'

Reproduce example in quanteda

Create a text corpus

library(quanteda)
# Create text corpus
corp <- corpus(c("A corpus is a set of documents.",
                 "This is the second document in the corpus."))

Exercise: Create this corpus and get a summary using summary(corp).

summary(corp)

## Corpus consisting of 2 documents:
##
##   Text Types Tokens Sentences
##  text1     8      8         1
##  text2     8      9         1
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:20 2019
## Notes:

Reproduce example in quanteda

Create a document-feature matrix

dfm(corp, remove_punct = TRUE, remove = stopwords("en"), stem = TRUE)

## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##        features
## docs    corpus set document second
##   text1      1   1        1      0
##   text2      1   0        1      1

Question: How does the dfm change when we change the preprocessing steps?

The quanteda Infrastructure

quanteda: Quantitative Analysis of Textual Data

6 years of development, 28 releases
8,300 commits, 270,000 downloads

Design of the package

consistent grammar
flexible for power users, simple for beginners
analytic transparency and reproducibility
compatibility with other packages
pipelined workflow using magrittr's %>%

Quanteda Initiative

UK non-profit organization devoted to the promotion of open-source text analysis software
User, technical and development support
Teaching and workshops
https://quanteda.org

Additional packages

For some exercises, we will need quanteda.corpora:

install.packages("devtools")
devtools::install_github("quanteda/quanteda.corpora")

For POS tagging, entity recognition, and dependency parsing, you should install spacyr (not covered extensively).

install.packages("spacyr")

Installation instructions: http://spacyr.quanteda.io

Course resources

Documentation:
  https://quanteda.io
  https://readtext.quanteda.io
  https://spacyr.quanteda.org
Tutorials: https://tutorials.quanteda.io
Cheatsheet: https://www.rstudio.com/resources/cheatsheets/

Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. "quanteda: An R Package for the Quantitative Analysis of Textual Data." Journal of Open Source Software 3(30): 774.

Recap: Workflow

1. Corpus
   Saves character strings and variables in a data frame
   Combines texts with document-level variables
2. Tokens
   Stores tokens in a list of vectors
   Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams() and tokens_select(), or fcm() with the window option
3. Document-feature matrix (DFM)
   Represents frequencies of features in documents in a matrix
   The most efficient structure, but it does not have information on positions of words
   Non-positional (bag-of-words) analyses are performed using many of the textstat_* and textmodel_* functions

Main function classes

Text corpus: corpus()
Tokenization: tokens()
Document-feature matrix: dfm()
Feature co-occurrence matrix: fcm()
Text statistics: textstat_*()
Text models: textmodel_*()
Plots: textplot_*()

Naming conventions and useful shortcuts

Recommended naming conventions for objects

In the quanteda style guide, we recommend naming objects consistently, e.g.:
  Corpus: corp_...
  Tokens object: toks_...
  Dfm: dfmat_...
  Text model: tmod_...

Useful RStudio keyboard shortcuts

Insert pipe operator (%>%): Shift + Cmd/Ctrl + M
Insert assignment operator (<-): Alt + -
More shortcuts for RStudio can be found here

Clarification

The following expressions result in the same output:

data_corpus_inaugural %>%
  tokens()

tokens(data_corpus_inaugural)

Working with corpus objects

Corpus functions in quanteda

corpus()
corpus_subset()
corpus_reshape()
corpus_segment()
corpus_sample()

Corpora in quanteda

data_corpus_inaugural
data_corpus_irishbudget2010
Additional corpora in the quanteda.corpora package

Using magrittr's pipe

data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  ndoc()

## [1] 2

data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  corpus_reshape(to = "sentences") %>%
  ndoc()

## [1] 198

Access number of types and tokens of corpus

ntype(data_corpus_inaugural) %>%
  head()

## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson
##             625              96             826             717
##  1805-Jefferson    1809-Madison
##             804             535

ntoken(data_corpus_inaugural) %>%
  head()

## 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson
##            1538             147            2578            1927
##  1805-Jefferson    1809-Madison
##            2381            1263

Overview of document-level variables

head(docvars(data_corpus_inaugural))

##                 Year  President FirstName
## 1789-Washington 1789 Washington    George
## 1793-Washington 1793 Washington    George
## 1797-Adams      1797      Adams      John
## 1801-Jefferson  1801  Jefferson    Thomas
## 1805-Jefferson  1805  Jefferson    Thomas
## 1809-Madison    1809    Madison     James

Exercise

1. Based on data_corpus_inaugural, create an object data_corpus_postwar (speeches since 1945).
2. Which speech has the most tokens? Which speech has the most types?

Note: You can find the documentation and examples using ? followed by the name of the function.

Solution

data_corpus_postwar <- data_corpus_inaugural %>%
  corpus_subset(Year > 1945)

# number of tokens per speech
data_corpus_postwar %>%
  ntoken() %>%
  sort(decreasing = TRUE) %>%
  head()

##     1985-Reagan     1981-Reagan 1953-Eisenhower      2009-Obama
##            2921            2790            2757            2711
##       1989-Bush     1949-Truman
##            2681            2513

# number of types per speech
data_corpus_postwar %>%
  ntype() %>%
  sort(decreasing = TRUE) %>%
  head()

##      2009-Obama     1985-Reagan     1981-Reagan 1953-Eisenhower
##             938             925             902             900
##      2013-Obama       1989-Bush
##             814             795

Adding document-level variables

# new docvar: Order
docvars(data_corpus_inaugural, "Order") <- 1:ndoc(data_corpus_inaugural)

head(docvars(data_corpus_inaugural, "Order"))

## [1] 1 2 3 4 5 6

# new docvar: PresidentFull
docvars(data_corpus_inaugural, "PresidentFull") <-
  paste(docvars(data_corpus_inaugural, "FirstName"),
        docvars(data_corpus_inaugural, "President"),
        sep = " ")

head(docvars(data_corpus_inaugural))

##                 Year  President FirstName Order     PresidentFull
## 1789-Washington 1789 Washington    George     1 George Washington
## 1793-Washington 1793 Washington    George     2 George Washington
## 1797-Adams      1797      Adams      John     3        John Adams
## 1801-Jefferson  1801  Jefferson    Thomas     4  Thomas Jefferson
## 1805-Jefferson  1805  Jefferson    Thomas     5  Thomas Jefferson
## 1809-Madison    1809    Madison     James     6     James Madison

Exercise

1. Use data_corpus_inaugural and reshape the entire corpus to the level of sentences and store the new corpus. What is the number of documents of the reshaped corpus?
2. Add a document-level variable to the reshaped corpus that counts the tokens per sentence.
3. Keep only sentences that are longer than 10 words.
4. Reshape the corpus back to the level of documents and store the corpus as data_corpus_inaugural_subset.
5. Optional: find a more efficient solution.

Solution

corp_sentences <- corpus_reshape(data_corpus_inaugural, to = "sentences")

docvars(corp_sentences, "number_tokens") <- ntoken(corp_sentences, remove_punct = TRUE)

ndoc(corp_sentences)

## [1] 5016

corp_sentence_subset <- corp_sentences %>%
  corpus_subset(number_tokens > 10)
ndoc(corp_sentence_subset)

## [1] 4267

Solution (cont.)

data_corpus_inaugural_subset <- corp_sentence_subset %>%
  corpus_reshape(to = "documents")

sum(ntoken(data_corpus_inaugural))

## [1] 149145

sum(ntoken(data_corpus_inaugural_subset))

## [1] 142617

Keywords-in-context

Keywords-in-context (KWIC)

Keywords-in-context (kwic()) returns a list of one or more keywords and their immediate context.

kw_security <- kwic(data_corpus_inaugural, pattern = "security", window = 3)

head(kw_security, 3)

##
##  [1789-Washington, 1497] government for the | security | of their union
##  [1813-Madison, 321]     seas and the       | security | of an important
##  [1817-Monroe, 1610]     may form some      | security | against these dangers

# Check number of occurrences
nrow(kw_security)

## [1] 65
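To make the idea behind kwic() concrete, here is a minimal base-R sketch. The helper `kwic_sketch()` is hypothetical and only handles a single keyword matched exactly (ignoring case); the example sentence is made up for illustration and does not come from the inaugural corpus.

```r
# Base-R sketch of keywords-in-context; kwic_sketch() is a hypothetical helper.
kwic_sketch <- function(toks, keyword, window = 3) {
  n <- length(toks)
  hits <- which(tolower(toks) == tolower(keyword))  # positions of the keyword
  pre <- vapply(hits, function(i) {
    # up to `window` tokens before the hit (none if the hit is the first token)
    if (i > 1) paste(toks[max(1, i - window):(i - 1)], collapse = " ") else ""
  }, character(1))
  post <- vapply(hits, function(i) {
    # up to `window` tokens after the hit (none if the hit is the last token)
    if (i < n) paste(toks[(i + 1):min(n, i + window)], collapse = " ") else ""
  }, character(1))
  data.frame(position = hits, pre = pre, keyword = toks[hits], post = post,
             stringsAsFactors = FALSE)
}

# Made-up example sentence, split on spaces.
toks <- strsplit("We seek security at home and security for our allies abroad", " ")[[1]]
kw <- kwic_sketch(toks, "security", window = 3)
```

As with kwic(), each row holds the position of one match together with the tokens immediately before and after it.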

KWIC with multiword expressions

Use phrase() to look up multiword expressions

kwic(data_corpus_inaugural, pattern = phrase("United States"), window = 3) %>%
  head(3)

##
##  [1789-Washington, 434:435] people of the       | United States | a Government instituted
##  [1789-Washington, 530:531] those of the        | United States | . Every step
##  [1797-Adams, 525:526]      Constitution of the | United States | in a foreign

Exercise

1. In what context were "God" and "God bless" used in US presidential inaugural speeches after 1970?
2. Look up keywords-in-context for god, god bless (wrapping the latter in phrase()) in one kwic() call.
3. BONUS: Send the results of the kwic output to textplot_xray().

Solution

kw_god <- corpus_subset(data_corpus_inaugural, Year > 1970) %>%
  kwic("god", window = 3)

kw_god_bless <- corpus_subset(data_corpus_inaugural, Year > 1970) %>%
  kwic(phrase("god bless"), window = 3)

tail(kw_god_bless)

##
##  [2005-Bush, 2304:2305]  freedom . May | God bless | you , and
##  [2009-Obama, 2699:2700] Thank you .   | God bless | you . And
##  [2009-Obama, 2704:2705] you . And     | God bless | the United States
##  [2013-Obama, 2303:2304] Thank you .   | God bless | you , and
##  [2017-Trump, 1652:1653] Thank you ,   | God bless | you , and
##  [2017-Trump, 1657:1658] you , and     | God bless | America .

Solution (cont.)

textplot_xray(kw_god, kw_god_bless)

Setting up a project and importing text files

Setting up a project in RStudio

1. File -> New Project
2. Choose an existing folder or create a new folder
3. Create folders in the project, e.g.: code, data, output
4. Open the .Rproj file: no need to specify the working directory (setwd())!

Import multiple text files

# load the readtext package
library(readtext)

# data_dir is the location of sample files on your computer.
data_dir <- system.file("extdata/", package = "readtext")

eu_data <- readtext(paste0(data_dir, "/txt/EU_manifestos/*.txt"),
                    docvarsfrom = "filenames",
                    docvarnames = c("unit", "context", "year", "language", "party"),
                    dvsep = "_",
                    encoding = "ISO-8859-1")
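What docvarsfrom = "filenames" with dvsep = "_" does can be sketched in base R for a single file name. This is an illustration of the idea, not readtext's implementation; the file name is one that appears in these slides, and `docvars_sketch` is a hypothetical object name.

```r
# Base-R sketch: derive document variables from one file name.
fname <- "EU_euro_2004_de_V.txt"

# Drop the extension, then split the remaining name on the separator "_".
parts <- strsplit(sub("\\.txt$", "", fname), "_", fixed = TRUE)[[1]]

# Attach the variable names that were passed as docvarnames.
names(parts) <- c("unit", "context", "year", "language", "party")

# Store the pieces as a one-row data frame and make the year numeric.
docvars_sketch <- as.data.frame(as.list(parts), stringsAsFactors = FALSE)
docvars_sketch$year <- as.integer(docvars_sketch$year)
```

This only works when every file name has the same number of separator-delimited parts as there are variable names, which is why consistent file naming matters.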

Structure of readtext data frame

head(eu_data)

## readtext object consisting of 6 documents and 5 docvars.
## # Description: df[,7] [6 × 7]
##   doc_id               text               unit  context  year language party
## * <chr>                <chr>              <chr> <chr>   <int> <chr>    <chr>
## 1 EU_euro_2004_de_PSE… "\"PES·PSE\".…     EU    euro     2004 de       PSE
## 2 EU_euro_2004_de_V.t… "\"Gemeinsame\".…  EU    euro     2004 de       V
## 3 EU_euro_2004_en_PSE… "\"PES·PSE\".…     EU    euro     2004 en       PSE
## 4 EU_euro_2004_en_V.t… "\"Manifesto\n\"…  EU    euro     2004 en       V
## 5 EU_euro_2004_es_PSE… "\"PES·PSE\".…     EU    euro     2004 es       PSE
## 6 EU_euro_2004_es_V.t… "\"Manifesto\n\"…  EU    euro     2004 es       V

Check encoding

file <- system.file("extdata/txt/EU_manifestos/EU_euro_2004_de_V.txt", package = "readtext")

myreadtext <- readtext(file)
encoding(myreadtext)

## readtext object consisting of 1 document and 0 docvars.
## # Description: df[,2] [1 × 2]
##   doc_id                text
##   <chr>                 <chr>
## 1 EU_euro_2004_de_V.txt "\"Gemeinsame\"..."

## Probable encoding: ISO-8859-1
## (but note: detector often reports ISO-8859-1 when encoding is actually UTF-8.)

Exercise

1. Set up an RProj in a new folder.
2. Download the ZIP file with inaugural speeches by German chancellors (https://tinyurl.com/corp-regierung)
3. Copy the folder into your RProj folder.
4. Import all text files using readtext(). (Hint: "name_of_folder/*" reads in all files)
5. Create a text corpus of this data frame.

Solution

library(readtext)
data_inaugural_ger <- readtext("data/inaugural_germany/*",
                               encoding = "UTF-8",
                               docvarsfrom = "filenames",
                               dvsep = "-",
                               docvarnames = c("Year", "Chancellor", "Party"))

data_corpus_inaugural_ger <- corpus(data_inaugural_ger)

summary(data_corpus_inaugural_ger, n = 4)

## Corpus consisting of 21 documents, showing 4 documents:
##
##                   Text Types Tokens Sentences Year Chancellor Party
##  1949-Adenauer-CDU.txt  2136   7414       292 1949   Adenauer   CDU
##  1953-Adenauer-CDU.txt  2630   9708       403 1953   Adenauer   CDU
##  1957-Adenauer-CDU.txt  2202   7608       296 1957   Adenauer   CDU
##  1961-Adenauer-CDU.txt  2484   8796       402 1961   Adenauer   CDU
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:23 2019
## Notes:

Exercise

1. Create better docnames by pasting "Year" and "Chancellor"
2. Select speeches by Gerhard Schröder and Angela Merkel.
3. Check ?textplot_xray().
4. Choose words that might only appear in some of the speeches.

Solution

docnames(data_corpus_inaugural_ger) <-
  paste(docvars(data_corpus_inaugural_ger, "Year"),
        docvars(data_corpus_inaugural_ger, "Chancellor"),
        sep = ", ")

## alternative
docnames(data_corpus_inaugural_ger) <-
  paste(data_inaugural_ger$Year, data_inaugural_ger$Chancellor, sep = " ")

corp_schroeder_merkel <- data_corpus_inaugural_ger %>%
  corpus_subset(Chancellor %in% c("Merkel", "Schroeder"))

plot_xray <- textplot_xray(kwic(corp_schroeder_merkel, "Arbeits*"),
                           kwic(corp_schroeder_merkel, "Arbeitslos*"),
                           kwic(corp_schroeder_merkel, "Migration*"))

Solution (cont.)

plot_xray

Tokenization

Tokenization of a corpus

corp_immig <- corpus(data_char_ukimmig2010)
toks_immig <- tokens(corp_immig)

head(toks_immig[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        "ONLY"         "THE"
##  [9] "BNP"          "CAN"          "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","

Remove punctuation

toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)

head(toks_nopunct[[1]], 20)

##  [1] "IMMIGRATION"  "AN"           "UNPARALLELED" "CRISIS"
##  [5] "WHICH"        "ONLY"         "THE"          "BNP"
##  [9] "CAN"          "SOLVE"        "At"           "current"
## [13] "immigration"  "and"          "birth"        "rates"
## [17] "indigenous"   "British"      "people"       "are"

Remove stopwords

toks_nostopw <- tokens_select(toks_immig, stopwords("en"), selection = "remove")
# Note: equivalent to
toks_nostopw <- tokens_remove(toks_immig, stopwords("en"))

head(toks_nostopw[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "UNPARALLELED" "CRISIS"
##  [5] "BNP"          "CAN"          "SOLVE"        "."
##  [9] "-"            "current"      "immigration"  "birth"
## [13] "rates"        ","            "indigenous"   "British"
## [17] "people"       "set"          "become"       "minority"

Customize stopword list

Stopwords for other languages: check the stopwords package

Remove a feature from the stopword list

"will" %in% stopwords("en")

## [1] TRUE

my_stopwords_en <- stopwords("en")[!stopwords("en") %in% c("will")]

"will" %in% my_stopwords_en

## [1] FALSE

head(toks_immig[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        "ONLY"         "THE"
##  [9] "BNP"          "CAN"          "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","

head(toks_nopunct[[1]], 20)

##  [1] "IMMIGRATION"  "AN"           "UNPARALLELED" "CRISIS"
##  [5] "WHICH"        "ONLY"         "THE"          "BNP"
##  [9] "CAN"          "SOLVE"        "At"           "current"
## [13] "immigration"  "and"          "birth"        "rates"
## [17] "indigenous"   "British"      "people"       "are"

head(toks_nostopw[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "UNPARALLELED" "CRISIS"
##  [5] "BNP"          "CAN"          "SOLVE"        "."
##  [9] "-"            "current"      "immigration"  "birth"
## [13] "rates"        ","            "indigenous"   "British"
## [17] "people"       "set"          "become"       "minority"

Select certain terms

toks_immig_select <- toks_immig %>%
  tokens_select(pattern = c("immig*", "migra*"), padding = TRUE)
head(toks_immig_select[[1]], 20)

##  [1] "IMMIGRATION" ""            ""            ""            ""
##  [6] ""            ""            ""            ""            ""
## [11] ""            ""            ""            ""            ""
## [16] "immigration" ""            ""            ""            ""

toks_immig_no_pad <- toks_immig %>%
  tokens_select(pattern = c("immig*", "migra*"), padding = FALSE)
head(toks_immig_no_pad[[1]], 20)

##  [1] "IMMIGRATION" "immigration" "immigration" "immigrants"  "immigration"
##  [6] "immigration" "immigration" "immigrants"  "immigrant"   "immigrant"
## [11] "immigrants"  "immigrants"  "immigration" "Immigration" "immigration"
## [16] "immigrants"  "Immigration" "Immigration" "Immigration" "immigrants"

Select certain terms and their context

# specify number of surrounding words using window
toks_immig_window <- tokens_select(toks_immig,
                                   pattern = c('immig*', 'migra*'),
                                   padding = TRUE, window = 5)
head(toks_immig_window[[1]], 20)

##  [1] "IMMIGRATION"  ":"            "AN"           "UNPARALLELED"
##  [5] "CRISIS"       "WHICH"        ""             ""
##  [9] ""             ""             "SOLVE"        "."
## [13] "-"            "At"           "current"      "immigration"
## [17] "and"          "birth"        "rates"        ","

Note (IfPol): This makes it possible to select the words around the search terms, for example to analyse the sentiment around words such as "Migration"; it is similar to kwic.
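The effect of the padding and window arguments can be imitated in base R. The helper `select_sketch()` below is hypothetical and simplified: it assumes glob patterns can be translated with utils::glob2rx() and matched case-insensitively, and it is applied only to the first eight tokens shown in the output above.

```r
# Base-R sketch of pattern selection with padding and a context window.
# select_sketch() is a hypothetical helper, not part of quanteda.
select_sketch <- function(toks, patterns, window = 0, padding = TRUE) {
  # Translate glob patterns such as "immig*" into regular expressions.
  regex <- paste(vapply(patterns, utils::glob2rx, character(1)), collapse = "|")
  hits <- which(grepl(regex, toks, ignore.case = TRUE))
  # Keep each hit plus `window` tokens on either side of it.
  keep <- unique(unlist(lapply(hits, function(i) (i - window):(i + window))))
  keep <- keep[keep >= 1 & keep <= length(toks)]
  if (padding) {
    out <- rep("", length(toks))  # empty strings preserve token positions
    out[keep] <- toks[keep]
    out
  } else {
    toks[seq_along(toks) %in% keep]
  }
}

toks <- c("IMMIGRATION", ":", "AN", "UNPARALLELED", "CRISIS", "WHICH", "ONLY", "THE")
padded   <- select_sketch(toks, c("immig*", "migra*"))
windowed <- select_sketch(toks, c("immig*", "migra*"), window = 2)
unpadded <- select_sketch(toks, c("immig*", "migra*"), padding = FALSE)
```

With padding, the vector keeps its original length, so distances between the selected tokens are preserved; without padding, only the selected tokens remain.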
Compound multiword expressions

kw_multi <- kwic(data_char_ukimmig2010,
                 phrase(c('asylum seeker*', 'british citizen*')))
head(kw_multi, 5)

##
##  [BNP, 1724:1725] the honour and benefit of        | British citizenship | has gone to people who
##  [BNP, 1958:1959] all illegal immigrants and bogus | asylum seekers      | , including their dependents .
##  [BNP, 2159:2160] region concerned . An '          | asylum seeker       | ' who has crossed dozens
##  [BNP, 2192:2193] country . Because every '        | asylum seeker       | ' in Britain has crossed
##  [BNP, 2218:2219] there are currently no legal     | asylum seekers      | in Britain today . It
Compound multiword expressions

Preserve these expressions in bag-of-words analysis:

toks_comp <- tokens(data_char_ukimmig2010) %>%
  tokens_compound(phrase(c('asylum seeker*', 'british citizen*')))

kw_comp <- kwic(toks_comp, c('asylum_seeker*', 'british_citizen*'))
head(kw_comp, 4)

##
##  [BNP, 1724] the honour and benefit of        | British_citizenship | has gone to people who
##  [BNP, 1957] all illegal immigrants and bogus | asylum_seekers      | , including their dependents .
##  [BNP, 2157] region concerned . An '          | asylum_seeker       | ' who has crossed dozens
##  [BNP, 2189] country . Because every '        | asylum_seeker       | ' in Britain has crossed

Note (IfPol): When the function finds the expressions, it joins them into a single token, as for "United Kingdom" or "Europäische Union".
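The mechanics of compounding can be sketched in base R. The helper `compound_sketch()` is hypothetical: it assumes a two-word expression, exact (case-insensitive) matches rather than glob patterns, and a made-up five-token input.

```r
# Base-R sketch of compounding a two-word expression into one token.
# compound_sketch() is a hypothetical helper, not part of quanteda.
compound_sketch <- function(toks, first, second, concatenator = "_") {
  out <- character(0)
  i <- 1
  while (i <= length(toks)) {
    is_pair <- i < length(toks) &&
      tolower(toks[i]) == first && tolower(toks[i + 1]) == second
    if (is_pair) {
      # Join the two tokens and skip the one that was merged.
      out <- c(out, paste(toks[i], toks[i + 1], sep = concatenator))
      i <- i + 2
    } else {
      out <- c(out, toks[i])
      i <- i + 1
    }
  }
  out
}

toks <- c("every", "asylum", "seeker", "in", "Britain")
compounded <- compound_sketch(toks, "asylum", "seeker")
```

Each merge shortens the token vector by one, which is why the positions reported by kwic() after compounding are lower than in the uncompounded text.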
Exercise

1. Tokenize data_corpus_irishbudget2010 and compound the following party names: fianna fáil, fine gael, and sinn féin.
2. Select only the three party names and the window of +-10 words
3. How can we extract only the full sentences in which at least one of the parties is mentioned?

Solution

# 1. Tokenize `data_corpus_irishbudget2010` and compound party names
toks_ire <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_compound(phrase(c("fianna fáil", "fine gael", "sinn féin")))

nrow(kwic(toks_ire, "fianna_fáil"))

## [1] 66

nrow(kwic(toks_ire, phrase("fianna fáil")))

## [1] 0

Solution (cont.)

# 2. Select only the three party names and the window of +-10 words
toks_ire_select <- toks_ire %>%
  tokens_keep(c("fianna_fáil", "fine_gael", "sinn_féin"), window = 10)

head(toks_ire_select[[2]], 15)

##  [1] "happening"   "today"       "."           "It"          "is"
##  [6] "happening"   ","           "however"     ","           "because"
## [11] "Fianna_Fáil" "failed"      "to"          "heed"        "the"

Note: To get the sentences that mention one of the parties, call corpus_reshape(to = "sentences") first, repeat the subsequent steps and set a very large window (e.g., 1000).

# 3. Extract only full sentences in which
# at least one party is mentioned

toks_ire_sentences <- data_corpus_irishbudget2010 %>%
  corpus_reshape(to = "sentences") %>%  # <-- important step!
  tokens() %>%
  tokens_compound(phrase(c("fianna fáil", "fine gael", "sinn féin"))) %>%
  tokens_keep(c("fianna_fáil", "fine_gael", "sinn_féin"), window = 1000)

Note (IfPol): The corpus is split at the sentence level, giving a list that contains only the sentences.
Note (IfPol): The "trick": if the window is larger than the sentence, the whole sentence is returned, so you can select all sentences that use the expressions.
N-grams

# Unigrams
tokens("insurgents killed in ongoing fighting")

## tokens from 1 document.
## text1 :
## [1] "insurgents" "killed"     "in"         "ongoing"    "fighting"

# Bigrams
tokens("insurgents killed in ongoing fighting") %>%
  tokens_ngrams(n = 2)

## tokens from 1 document.
## text1 :
## [1] "insurgents_killed" "killed_in"         "in_ongoing"
## [4] "ongoing_fighting"

Skipgrams

# Skipgrams
tokens("insurgents killed in ongoing fighting") %>%
  tokens_skipgrams(n = 2, skip = 0:1)

## tokens from 1 document.
## text1 :
## [1] "insurgents_killed" "insurgents_in"     "killed_in"
## [4] "killed_ongoing"    "in_ongoing"        "in_fighting"
## [7] "ongoing_fighting"

Note (IfPol): Used in deep learning and for finding patterns.
Note (IfPol): In the first pass no word is skipped, then one word is skipped.
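How bigrams and skipgrams relate can be shown with a small base-R sketch. The helper `skipgrams_sketch()` is hypothetical and restricted to pairs of tokens (n = 2); with skip = 0 it yields ordinary bigrams.

```r
# Base-R sketch of two-token skipgrams; skipgrams_sketch() is hypothetical.
skipgrams_sketch <- function(toks, skip = 0) {
  out <- character(0)
  for (i in seq_along(toks)) {
    for (k in skip) {
      j <- i + k + 1  # partner token, with k tokens skipped in between
      if (j <= length(toks)) {
        out <- c(out, paste(toks[i], toks[j], sep = "_"))
      }
    }
  }
  out
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")
bigrams   <- skipgrams_sketch(toks, skip = 0)    # adjacent pairs only
skipgrams <- skipgrams_sketch(toks, skip = 0:1)  # adjacent pairs and pairs one token apart
```

For this sentence the sketch yields the same four bigrams and seven skipgrams as the quanteda output shown on the slides.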
Look up tokens from a dictionary

toks <- tokens(data_char_ukimmig2010)

dict <- dictionary(list(refugee = c('refugee*', 'asylum*'),
                        worker = c('worker*', 'employee*')))
print(dict)

## Dictionary object with 2 key entries.
## - [refugee]:
##   - refugee*, asylum*
## - [worker]:
##   - worker*, employee*

dict_toks <- tokens_lookup(toks, dictionary = dict)
head(dict_toks, 2)

## tokens from 2 documents.
## BNP :
##  [1] "refugee" "worker"  "refugee" "refugee" "refugee" "refugee" "refugee"
##  [8] "refugee" "refugee" "refugee" "refugee" "refugee" "refugee" "worker"
##
## Coalition :
## [1] "refugee"

Note (IfPol): A dictionary maps a key to values (val1, val2, ...): categories with the words that indicate each category, e.g. positive: good, great, super, ...; negative: bad, awful, ...
Note (IfPol): Stopwords can be thought of as a dictionary approach.
Note (IfPol): To create a dictionary, build a named list.
Note (IfPol): glob is the pattern-matching type; any other valuetype would have to be specified explicitly, because glob is the default.
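The idea of a dictionary lookup can be sketched in base R: every token that matches one of a key's glob patterns is replaced by that key, and all other tokens are dropped. The helper `lookup_sketch()` is hypothetical, and the seven input tokens are made up for illustration.

```r
# Base-R sketch of a dictionary lookup with glob patterns.
# lookup_sketch() is a hypothetical helper, not part of quanteda.
lookup_sketch <- function(toks, dict) {
  out <- character(0)
  for (tok in toks) {
    for (key in names(dict)) {
      # Translate the key's glob patterns into one regular expression.
      regex <- paste(vapply(dict[[key]], utils::glob2rx, character(1)),
                     collapse = "|")
      if (grepl(regex, tok, ignore.case = TRUE)) out <- c(out, key)
    }
  }
  out
}

# A dictionary is a named list: key -> patterns.
dict <- list(refugee = c("refugee*", "asylum*"),
             worker  = c("worker*", "employee*"))

toks <- c("Asylum", "seekers", "and", "refugees", "compete", "with", "workers")
matched <- lookup_sketch(toks, dict)

# Count the matches per key, as a dfm of the looked-up tokens would.
counts <- vapply(names(dict), function(k) sum(matched == k), integer(1))
```

Counting the keys per document is exactly what happens when the looked-up tokens are passed on to dfm().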
The transition from tokens() to dfm()

dfm(dict_toks)

## Document-feature matrix of: 9 documents, 2 features (38.9% sparse).
## 9 x 2 sparse Matrix of class "dfm"
##               features
## docs           refugee worker
##   BNP               12      2
##   Coalition          1      0
##   Conservative       0      0
##   Greens             4      1
##   Labour             4      4
##   LibDem             6      0
##   PC                 3      0
##   SNP                1      0
##   UKIP               6      0

Summary of tokens functions

tokens()
tokens_tolower() / tokens_toupper()
tokens_wordstem()
tokens_compound()
tokens_lookup()
tokens_ngrams()
tokens_skipgrams()
tokens_select() / tokens_remove() / tokens_keep()
tokens_replace()
tokens_sample()
tokens_subset()

Recall to use ? to read the manual and examples for each function.

Note (IfPol): With a dictionary you can, for example, replace British spellings with American ones, or harmonise people's nicknames.
Exercise

1. Tokenize data_corpus_irishbudget2010
2. Convert the tokens object, remove punctuation, change to lowercase, remove stopwords, and stem
3. Get the number of types and tokens per speech

Solution

toks_ire <- data_corpus_irishbudget2010 %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_tolower() %>%
  tokens_wordstem()

ire_ntype <- ntype(toks_ire)
ire_ntoken <- ntoken(toks_ire)

Document-feature matrix

Exercise

1. Create a document-feature matrix from the tokens object above.
2. Get the 50 most frequent terms using topfeatures().

Solution

toks_ire <- tokens(data_corpus_irishbudget2010)

dfmat_ire <- dfm(toks_ire)  # dfm() transforms to lowercase by default

topfeatures(dfmat_ire)

##  the    .   to    ,   of  and   in    a   is that
## 3598 2371 1633 1548 1537 1359 1231 1013  868  804

## alternative approach without tokens() -- less control!
dfmat_ire2 <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove = stopwords("en"), stem = TRUE)

topfeatures(dfmat_ire2)

##   peopl  budget  govern  minist    year     tax  public economi     cut
##     273     272     271     204     201     195     179     172     172
##     job
##     148

Select features based on frequencies

nfeat(dfmat_ire)

## [1] 5140

dfmat_ire_trim1 <- dfm_trim(dfmat_ire, min_termfreq = 5)
nfeat(dfmat_ire_trim1)

## [1] 1265

dfmat_ire_trim2 <- dfm_trim(dfmat_ire, min_docfreq = 5)
nfeat(dfmat_ire_trim2)

## [1] 864

dfmat_ire_trim3 <- dfm_trim(dfmat_ire, max_docfreq = 0.1, docfreq_type = "prop")
nfeat(dfmat_ire_trim3)

## [1] 2716

Note (IfPol): Remove rare words, because they enlarge the matrix considerably without adding much information. Very rarely used words are "noise" (misspellings or unimportant terms), unless you regard them as an important signal, for example Irish-language words in the speeches.
Note (IfPol): min_termfreq = 5 removes words that were used fewer than 5 times.
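The difference between term frequency and document frequency, on which these trimming options rest, can be sketched with a dense base-R matrix. The counts and feature names below are made up for illustration, and the thresholds are chosen to suit this tiny example rather than copied from the slide.

```r
# Base-R sketch of frequency-based feature trimming on a toy matrix.
dfmat <- matrix(c(3, 2, 0,
                  4, 0, 0,
                  1, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("d1", "d2", "d3"),
                                c("budget", "tax", "typo")))

term_freq <- colSums(dfmat)      # total occurrences of each feature
doc_freq  <- colSums(dfmat > 0)  # number of documents containing each feature

# Keep features that occur at least twice overall (cf. min_termfreq = 2).
trim_termfreq <- dfmat[, term_freq >= 2, drop = FALSE]

# Keep features that occur in at least two documents (cf. min_docfreq = 2).
trim_docfreq <- dfmat[, doc_freq >= 2, drop = FALSE]

# Keep features found in at most half of the documents
# (cf. max_docfreq = 0.5 with docfreq_type = "prop").
trim_prop <- dfmat[, doc_freq / nrow(dfmat) <= 0.5, drop = FALSE]
```

A feature such as "tax" passes the term-frequency threshold but fails the document-frequency one, because both of its occurrences are in the same document.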
Exercise

1. Create a dfm from the two documents below, and weight it using dfm_weight().
2. For dfm_weight() try out the scheme arguments "count", "boolean" and "prop".
3. What are advantages and problems of each weighting scheme?

texts <- c("apple is better than banana", "banana banana apple much better")

Note (IfPol): Weighting normalizes the documents.
Note (IfPol): The dfm is optimized for sparse matrices.
Note (IfPol): The default dfm already contains word frequencies.
Note (IfPol): "boolean" recodes all non-zero counts as 1.
Solution

dfmat_fruits <- dfm(texts)
dfm_weight(dfmat_fruits, scheme = "count")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple is better than banana much
##   text1     1  1      1    1      1    0
##   text2     1  0      1    0      2    1

dfm_weight(dfmat_fruits, scheme = "boolean")

## Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
## 2 x 6 sparse Matrix of class "dfm"
##        features
## docs    apple is better than banana much
##   text1     1  1      1    1      1    0
##   text2     1  0      1    0      1    1

Page 86: University of Münster (27–28 June 2019) Kenneth Benoit

Solution(cont.)dfm_weight(dfmat_fruits,scheme="prop")

##Document-featurematrixof:2documents,6features(25.0%sparse).##2x6sparseMatrixofclass"dfm"##features##docsappleisbetterthanbananamuch##text10.20.20.20.20.20##text20.200.200.40.2

86

Page 87: University of Münster (27–28 June 2019) Kenneth Benoit
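The three weighting schemes can be reproduced by hand in base R, which may help when answering the third exercise question. This is a sketch on a plain matrix holding the fruit counts, not quanteda's own implementation:

```r
# Raw counts for the two fruit documents, as a plain matrix
counts <- matrix(
  c(1, 1, 1, 1, 1, 0,
    1, 0, 1, 0, 2, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(docs = c("text1", "text2"),
                  features = c("apple", "is", "better", "than", "banana", "much"))
)

w_count   <- counts                    # "count": leave the counts unchanged
w_boolean <- (counts > 0) * 1          # "boolean": 1 if the feature occurs at all
w_prop    <- counts / rowSums(counts)  # "prop": share of each document's tokens

print(w_boolean)
print(w_prop)
```

The proportions in each row sum to 1, so "prop" makes documents of different length comparable, while "boolean" discards the information that "banana" occurs twice in text2.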

Sentiment analysis

Why are we analyzing Irish budget speeches?

Sentiment analysis

Sentiment analysis using the Lexicoder Sentiment Dictionary (data_dictionary_LSD2015)

summary(data_dictionary_LSD2015)

##              Length Class  Mode
## negative     2858   -none- character
## positive     1709   -none- character
## neg_positive 1721   -none- character
## neg_negative 2860   -none- character

docvars(data_corpus_irishbudget2010, "gov_opp") <-
  ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
         "Government", "Opposition")

# tokenize and apply dictionary
toks_dict <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015)

# transform to a dfm
dfmat_dict <- dfm(toks_dict)

Note (IfPol): MPs vote in line with their party affiliation, but they speak differently, which makes the speeches interesting to analyse.

Sentiment analysis

print(dfmat_dict)

## Document-feature matrix of: 14 documents, 4 features (12.5% sparse).
## 14 x 4 sparse Matrix of class "dfm"
##                            features
## docs                        negative positive neg_positive neg_negative
##   Lenihan, Brian (FF)            188      397            2            1
##   Bruton, Richard (FG)           163      147            5            2
##   Burton, Joan (LAB)             225      266            8            3
##   Morgan, Arthur (SF)            260      249            5            2
##   Cowen, Brian (FF)              150      368            1            2
##   Kenny, Enda (FG)               104      146            3            3
##   ODonnell, Kieran (FG)           49       84            4            0
##   Gilmore, Eamon (LAB)           164      176            3            0
##   Higgins, Michael (LAB)          37       42            1            0
##   Quinn, Ruairi (LAB)             34       40            1            1
##   Gormley, John (Green)           17       56            0            0
##   Ryan, Eamon (Green)             24       78            1            2
##   Cuffe, Ciaran (Green)           38       56            0            0
##   OCaolain, Caoimhghin (SF)      154      145            1            1

Convert to a data frame using convert()

dict_output <- convert(dfmat_dict, to = "data.frame")

Estimate sentiment

dict_output$sent_score <- log(
  (dict_output$positive + dict_output$neg_negative + 0.5) /
  (dict_output$negative + dict_output$neg_positive + 0.5)
)

dict_output <- cbind(dict_output, docvars(data_corpus_irishbudget2010))
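The sentiment score is a log ratio of positive to negative dictionary matches, where negated terms switch sides and 0.5 is added to both counts so that the ratio stays defined when one of them is zero. A minimal base R sketch with hypothetical counts (the two speakers and their numbers are invented for illustration):

```r
# Hypothetical dictionary counts for two speakers (illustrative numbers only)
dict_counts <- data.frame(
  document     = c("speaker_a", "speaker_b"),
  negative     = c(40, 10),
  positive     = c(20, 30),
  neg_positive = c(2, 0),   # negated positive terms count as negative
  neg_negative = c(1, 0)    # negated negative terms count as positive
)

pos <- dict_counts$positive + dict_counts$neg_negative
neg <- dict_counts$negative + dict_counts$neg_positive

# log ratio with 0.5 added to numerator and denominator
dict_counts$sent_score <- log((pos + 0.5) / (neg + 0.5))

print(dict_counts[, c("document", "sent_score")])
```

A score above 0 means more positive than negative matches, a score below 0 the reverse, and the log makes the scale symmetric around 0.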

Plotting with ggplot2

Specify data (a data.frame)

Specify the aesthetics (x-axis, y-axis, colours, shapes etc.)

Choose a geometric object (e.g. scatterplot, boxplot)

# discover mtcars dataset
head(mtcars, 3)

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

# specify data and axes
library(ggplot2)
myplot <- ggplot(data = mtcars, aes(x = mpg, y = hp))

myplot

myplot + geom_point()

myplot +
  geom_point(aes(colour = factor(cyl))) +
  scale_colour_discrete(name = "Cylinders") +
  scale_shape_discrete(name = "Cylinders") +
  labs(x = "Miles per gallon", y = "Horsepower")

Plot sentiment

tplot_sent <- ggplot(dict_output,
                     aes(x = reorder(name, sent_score), y = sent_score, colour = gov_opp)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(x = "Speaker", y = "Estimated sentiment")

Note (IfPol): The problem here is that "ir" is listed as a negative word, so government members appear more negative than they actually are.

tplot_sent

Textual statistics

Simple frequency analysis

library("ggplot2")
ire_dfm <- data_corpus_irishbudget2010 %>%
  dfm(remove = stopwords("en"), remove_punct = TRUE, remove_symbols = TRUE)

ire_freqplot <- ire_dfm %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")

ire_freqplot

Frequency analysis for groups

ire_dfm_group <- ire_dfm %>%
  dfm_group(groups = "party")

# plot a wordcloud (not recommended...)
set.seed(34)
textplot_wordcloud(ire_dfm_group, comparison = TRUE, max_words = 40)

Relative frequency analysis (keyness)

"Keyness" assigns scores to features that occur differentially across different categories

docvars(data_corpus_irishbudget2010, "gov_opp") <-
  ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
         "Government", "Opposition")

# compare government to opposition parties by chi^2
dfmat_key <- data_corpus_irishbudget2010 %>%
  dfm(groups = "gov_opp", remove = stopwords("english"),
      remove_punct = TRUE, remove_symbols = TRUE)

tstat_key <- textstat_keyness(dfmat_key, target = "Opposition")

tplot_key <- textplot_keyness(tstat_key, margin = 0.2, n = 10)

tplot_key

Term frequency-inverse document frequency (tf-idf)

Example: We have 100 political party manifestos, each with 1000 words. The first document contains 16 instances of the word "environment"; 40 of the manifestos contain the word "environment".

The term frequency is 16

The inverse document frequency is 100/40 = 2.5, or log(2.5) = 0.398 (base-10 logarithm)

The tf-idf will then be 16 × 0.398 = 6.37

If the word had appeared in only 15 of the 100 manifestos, then the tf-idf would be 13.18

tf-idf filters out common terms

Note (IfPol): Not really usable in political science, because frequently used terms are removed or given a low weight.
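The arithmetic of the manifesto example can be checked in a few lines of base R. The helper name tfidf() is made up for this sketch, and the base-10 logarithm is chosen because it reproduces the numbers on the slide:

```r
# tf-idf for a single term: term frequency times the log of the
# inverse document frequency (base-10 log, matching the example's numbers)
tfidf <- function(tf, n_docs, doc_freq) {
  tf * log10(n_docs / doc_freq)
}

# "environment" occurs 16 times; 40 of 100 manifestos contain it
tfidf_common <- tfidf(tf = 16, n_docs = 100, doc_freq = 40)

# same term frequency, but only 15 of 100 manifestos contain the word
tfidf_rare <- tfidf(tf = 16, n_docs = 100, doc_freq = 15)

print(round(c(common = tfidf_common, rare = tfidf_rare), 2))
```

The rarer the term is across documents, the larger the weight. A term that appears in every document gets log10(1) = 0, which is what "filters out common terms" means.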

Apply tf-idf weighting

dfmat_ire_tfidf <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm_group(groups = "party") %>%
  dfm_tfidf()

Plot tf-idf

tstat_freq_tfidf <- dfmat_ire_tfidf %>%
  textstat_frequency(n = 10, groups = "party", force = TRUE)

# plot frequencies
tplot_tfidf <- ggplot(data = tstat_freq_tfidf,
                      aes(x = factor(nrow(tstat_freq_tfidf):1), y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free_y") +
  coord_flip() +
  scale_x_discrete(breaks = factor(nrow(tstat_freq_tfidf):1),
                   labels = tstat_freq_tfidf$feature) +
  labs(x = NULL, y = "tf-idf")

tplot_tfidf

Collocation analysis: a strings of words approach

tstat_col <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  textstat_collocations(min_count = 20, tolower = TRUE)

head(tstat_col, 7)

##       collocation count count_nested length   lambda        z
## 1           it is   188            0      2 3.683338 36.89575
## 2         will be   142            0      2 4.132301 35.59424
## 3     this budget   107            0      2 4.295062 31.81840
## 4         we have   119            0      2 3.382721 29.50924
## 5       more than    56            0      2 6.297167 28.82909
## 6  social welfare    70            0      2 8.081143 28.82286
## 7          in the   347            0      2 1.855052 27.87367

Note (IfPol): The higher the value, the lower the probability that the collocation occurred by chance.

Note (IfPol): Should this be a multi-word expression?

Exercise

1. Repeat the step above, but remove stopwords, and stem the tokens object.

2. Compare the most frequent collocations. What has changed?

3. Optional: Conduct a collocation analysis for the German inaugural speeches. Does it work well?

Solution

toks_ire_adjusted <- data_corpus_irishbudget2010 %>%
  tokens() %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()

tstat_col_adjusted <- toks_ire_adjusted %>%
  textstat_collocations(min_count = 5, tolower = FALSE)

head(tstat_col_adjusted, 7)

##     collocation count count_nested length   lambda        z
## 1 public servic    68            0      2 5.467132 26.76130
## 2 social welfar    65            0      2 7.168046 25.84665
## 3 child benefit    36            0      2 6.875431 21.50678
## 4      per week    25            0      2 5.828731 19.40465
## 5     next year    34            0      2 5.200994 18.86080
## 6 public sector    30            0      2 4.089028 17.70144
## 7  Labour Parti    23            0      2 7.486375 17.12913

nrow(tstat_col_adjusted)

## [1] 224

Compound collocations

Before compounding

toks_ire <- toks_ire_adjusted
kwic(toks_ire, phrase("Green Party")) %>% head(4)

## [1] pattern
## <0 rows> (or 0-length row.names)

Compound based on collocations

# get 50 most frequent collocations

# compound based on collocations
toks_ire_comp <- toks_ire %>%
  tokens_compound(tstat_col_adjusted[tstat_col_adjusted$z > 3])

After compounding

nrow(kwic(toks_ire_comp, phrase("Fianna Fáil")))

## [1] 0

nrow(kwic(toks_ire_comp, phrase("Fianna_Fáil*")))

## [1] 70

Check compounded words

toks_ire_comp %>%
  tokens_keep(pattern = "*_*") %>%
  dfm() %>%
  topfeatures()

##   fianna_fáil public_servic social_welfar     next_year child_benefit
##            61            47            41            38            30
## minist_financ      per_week public_sector   young_peopl   €_4_billion
##            30            25            25            23            20

Similarity between texts

Cosine similarity

(dfmat <- dfm(c("X Y Y Z Z Z", "X X X X Y Y Y Y Y Z Z Z Z Z Z")))

## Document-feature matrix of: 2 documents, 3 features (0.0% sparse).
## 2 x 3 sparse Matrix of class "dfm"
##        features
## docs    x y z
##   text1 1 2 3
##   text2 4 5 6

textstat_simil(dfmat, method = "cosine")

## textstat_simil object; method = "cosine"
##       text1 text2
## text1 1.000 0.975

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$

$$\cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \, \sqrt{\sum_i B_i^2}}$$

Note (IfPol): Similarity and distance between texts: one does not have to normalize for the length of the texts, because otherwise longer texts would be more "different".
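The cosine formula can be applied by hand to the X/Y/Z example in base R. The function name cosine_sim() is an assumption made for this sketch; it is not part of quanteda:

```r
# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

text1 <- c(x = 1, y = 2, z = 3)  # feature counts of the first document
text2 <- c(x = 4, y = 5, z = 6)  # feature counts of the second document

sim <- cosine_sim(text1, text2)
print(round(sim, 3))

# multiplying a document's counts by a constant leaves the similarity unchanged
sim_scaled <- cosine_sim(text1, 10 * text2)
```

Because only the direction of the count vectors matters, a document and a ten-times-longer copy of it have a cosine similarity of 1.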

Cosine similarity

library(quanteda)
dfmat <- dfm(c("I use this sentence almost twice.",
               "I include this almost sentence twice.",
               "And here we have irrelevant content."))

tstat_simil <- textstat_simil(dfmat, method = "cosine")
tstat_simil

## textstat_simil object; method = "cosine"
##       text1 text2 text3
## text1 1.000 0.857 0.143
## text2 0.857 1.000 0.143
## text3 0.143 0.143 1.000

Why is the similarity between text1/text2 and text3 not 0?

Note (IfPol): Use this for clustering texts.

Note (IfPol): Sentence 3 shares no words with sentences 1 or 2, so why are they not maximally different? Because of the punctuation mark (the full stop). Texts are therefore never maximally different: filler words, punctuation, ...
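The answer to the question can be verified without quanteda by building the count vectors by hand. This sketch assumes that each sentence is lowercased and split into seven tokens with the full stop as its own token, which is consistent with the similarities shown above:

```r
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# hand-tokenized sentences; the final "." is a token of its own
tok1 <- c("i", "use", "this", "sentence", "almost", "twice", ".")
tok2 <- c("i", "include", "this", "almost", "sentence", "twice", ".")
tok3 <- c("and", "here", "we", "have", "irrelevant", "content", ".")

vocab <- unique(c(tok1, tok2, tok3))

# count vector of a token list over the shared vocabulary
count_vec <- function(tokens) {
  as.numeric(table(factor(tokens, levels = vocab)))
}

sim_12 <- cosine_sim(count_vec(tok1), count_vec(tok2))  # six shared tokens
sim_13 <- cosine_sim(count_vec(tok1), count_vec(tok3))  # only "." is shared

# after dropping the punctuation token nothing is shared any more
sim_13_nopunct <- cosine_sim(count_vec(tok1[tok1 != "."]),
                             count_vec(tok3[tok3 != "."]))

print(round(c(sim_12, sim_13, sim_13_nopunct), 3))
```

With seven tokens per sentence, the single shared full stop gives 1/7 = 0.143, and removing punctuation brings the similarity to exactly 0.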

BONUS: Transform matrix into dyadic data frame

tstat_simil_matrix <- as.matrix(tstat_simil)
tstat_simil_matrix[lower.tri(tstat_simil_matrix)] <- NA

tstat_simil_df <- tstat_simil_matrix %>%
  reshape2::melt() %>%
  subset(Var1 != Var2) %>%
  subset(!is.na(value))

tstat_simil_df

##    Var1  Var2     value
## 4 text1 text2 0.8571429
## 7 text1 text3 0.1428571
## 8 text2 text3 0.1428571

Note (IfPol): There will be a function that does this automatically, so this step will no longer be necessary.

Note (IfPol): Useful if you want to work with the data in ggplot or dplyr.

Exercise

1. Create a dfm from data_corpus_irishbudget2010

2. Group it by party and run textstat_simil().

3. What parties are most similar?

4. Do the substantive conclusions change when removing stopwords and punctuation, and stemming the dfm?

Solution

# similarities for documents
tstat_simil1 <- data_corpus_irishbudget2010 %>%
  dfm() %>%
  dfm_group(groups = "party") %>%
  textstat_simil(method = "cosine", margin = "documents")

tstat_simil1

## textstat_simil object; method = "cosine"
##          FF    FG Green   LAB    SF
## FF    1.000 0.954 0.945 0.963 0.968
## FG    0.954 1.000 0.950 0.983 0.979
## Green 0.945 0.950 1.000 0.951 0.938
## LAB   0.963 0.983 0.951 1.000 0.980
## SF    0.968 0.979 0.938 0.980 1.000

Note (IfPol): This does not measure ideological proximity, only the similarity of texts.

Solution

tstat_simil2 <- data_corpus_irishbudget2010 %>%
  dfm(remove = stopwords("en"), stem = TRUE, remove_punct = TRUE) %>%
  dfm_group(groups = "party") %>%
  textstat_simil(method = "cosine", margin = "documents")

tstat_simil2

## textstat_simil object; method = "cosine"
##          FF    FG Green   LAB    SF
## FF    1.000 0.621 0.657 0.666 0.695
## FG    0.621 1.000 0.645 0.795 0.791
## Green 0.657 0.645 1.000 0.655 0.651
## LAB   0.666 0.795 0.655 1.000 0.810
## SF    0.695 0.791 0.651 0.810 1.000

Note (IfPol): The two most clearly oppositional parties.

Note (IfPol): Better measurement without stopwords and punctuation.

Note (IfPol): The output may look different; use tstat_simil2 %>% as.matrix()

Lexical diversity

dfmat_ire <- data_corpus_irishbudget2010 %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# estimate Type-Token Ratio
tstat_lexdiv <- textstat_lexdiv(dfmat_ire, measure = "TTR")

df_lexdiv <- cbind(tstat_lexdiv, docvars(dfmat_ire))

head(df_lexdiv, 4)

##                                document       TTR year debate number
## Lenihan, Brian (FF)   Lenihan, Brian (FF) 0.2142487 2010 BUDGET     01
## Bruton, Richard (FG) Bruton, Richard (FG) 0.2364312 2010 BUDGET     02
## Burton, Joan (LAB)     Burton, Joan (LAB) 0.2583187 2010 BUDGET     03
## Morgan, Arthur (SF)   Morgan, Arthur (SF) 0.2265236 2010 BUDGET     04
##                        foren    name party    gov_opp
## Lenihan, Brian (FF)    Brian Lenihan    FF Government
## Bruton, Richard (FG) Richard  Bruton    FG Opposition
## Burton, Joan (LAB)      Joan  Burton   LAB Opposition
## Morgan, Arthur (SF)   Arthur  Morgan    SF Opposition

Note (IfPol): The documentation lists the various measures (TTR: Type-Token Ratio, C, R, ...) with citations; it is very comprehensive.

Note (IfPol): statistics

Note (IfPol): measure = c("all") returns all measures.

Plot Type-Token Ratio

tplot_ttr <- ggplot(df_lexdiv, aes(x = reorder(document, TTR), y = TTR)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL)

tplot_ttr

Note (IfPol): Shorter texts have a higher diversity than longer texts; in longer texts one repeats oneself more often.
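The Type-Token Ratio is the number of distinct words (types) divided by the total number of words (tokens). A base R sketch on a made-up token vector shows the definition and why longer texts tend to score lower; ttr() is a hypothetical helper, not the quanteda function:

```r
# Type-Token Ratio: number of unique tokens divided by number of tokens
ttr <- function(tokens) {
  length(unique(tokens)) / length(tokens)
}

short_text <- c("the", "budget", "is", "the", "budget")  # 3 types, 5 tokens
long_text  <- rep(short_text, 2)                         # same 3 types, 10 tokens

ttr_short <- ttr(short_text)
ttr_long  <- ttr(long_text)

print(c(short = ttr_short, long = ttr_long))
```

Doubling the text adds tokens but no new types, so the ratio halves. This is one reason to be careful when comparing TTR across speeches of very different length.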

Readability

tstat_read <- data_corpus_inaugural %>%
  textstat_readability(measure = c("Flesch.Kincaid", "Dale.Chall.old"))

tplot_read <- ggplot(tstat_read, aes(x = Flesch.Kincaid, y = Dale.Chall.old)) +
  geom_smooth() +
  geom_point()

Note (IfPol): Also meant for the spoken word.

Plot relationship between readability measures

tplot_read

Note (IfPol): Score bands: 0-25, 25-50, 50-65, over 65; the higher the score, the easier the text.

Correlation between readability measures

cor.test(tstat_read$Flesch.Kincaid, tstat_read$Dale.Chall.old)

##
##  Pearson's product-moment correlation
##
## data:  tstat_read$Flesch.Kincaid and tstat_read$Dale.Chall.old
## t = 19.714, df = 56, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8920247 0.9611137
## sample estimates:
##       cor
## 0.9349112

Readability of inaugural speeches since 1945

tstat_read_subset <- data_corpus_inaugural %>%
  corpus_subset(Year > 1948) %>%
  textstat_readability(measure = "Flesch.Kincaid")

tplot_read_us <- ggplot(tstat_read_subset,
                        aes(x = reorder(document, Flesch.Kincaid), y = Flesch.Kincaid)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Readability (Flesch-Kincaid)")

tplot_read_us

For an extensive treatment, see Benoit et al. (2019)

Exercise: Wrapping it all up

1. Use data_corpus_guardian from the quanteda.corpora package using: data_corpus_guardian <- quanteda.corpora::download('data_corpus_guardian').

2. Create a tokens object and remove stopwords and punctuation

3. Keep only the term "Brexit*" and a window of 5 words.

4. Use fcm() and create a feature co-occurrence matrix.

5. Use textplot_network() to plot a network of co-occurrences of the 40 most frequent terms.

Solution

Import Guardian corpus using quanteda.corpora's download() function.

data_corpus_guardian <- download('data_corpus_guardian')

toks_brexit <- data_corpus_guardian %>%
  tokens(remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en"), padding = FALSE) %>%
  tokens_keep(pattern = "Brexit*", window = 5, case_insensitive = TRUE) %>%
  tokens_tolower()

fcmat_brexit <- toks_brexit %>%
  fcm(context = "window")

feat <- names(topfeatures(fcmat_brexit, 40))

fcmat_brexit_subset <- fcmat_brexit %>%
  fcm_select(feat)

tplot_brexit <- textplot_network(fcmat_brexit_subset)

## Registered S3 method overwritten by 'network':
##   method            from
##   summary.character quanteda

Note (IfPol): Feature co-occurrence matrix.

set.seed(134)
tplot_brexit

Note (IfPol): The plot contains a random element, hence set.seed() to obtain the same result each time. It is a cloud: depending on how you look at it, it looks different.

Supervised text classification

Supervised classification: separate documents into pre-existing categories

We need:

1. Hand-coded dataset (labeled), to be split into a training set (used to train the classifier) and a validation/test set (used to validate the classifier)

2. Method to extrapolate from hand coding to unlabeled documents (classifier): Naive Bayes, regularized regression, SVM, K-nearest neighbours, ensemble methods

3. Approach to validate classifier: cross-validation

4. Performance metric to choose best classifier and avoid overfitting: confusion matrix, accuracy, precision, recall, F1 scores

Note (IfPol): Part of quanteda.

Goals of supervised classification

1. Flexibility

2. Efficiency

3. Interpretability

Note (IfPol): Prediction of which category a text belongs to.

Unsupervised classification

Unsupervised methods scale documents based on patterns of similarity from the term-document matrix, without requiring a training step

Examples: Wordfish, topic models (to be covered soon)

Relative advantage: You do not have to know the dimension being scaled (also a disadvantage!)

Note (IfPol): Unsupervised methods are for discovering things; supervised methods are for getting answers.

Basic principles of supervised learning

1. Generalisation: A classifier learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples

2. Overfitting: A classifier learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples. This causes poor prediction/generalisation.

How to assess the performance of a classifier?

Important measures

Accuracy: $\frac{\text{Correctly classified}}{\text{Total number of cases}} = \frac{\text{true positives} + \text{true negatives}}{\text{Total number of cases}}$

Precision: $\frac{\text{true positives}}{\text{true positives} + \text{false positives}}$

Recall: $\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$

F1 score: $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

How do we get a labelled training set?

1. External sources of annotation

2. Expert annotation

3. Crowd-sourced coding

Wisdom of crowds: aggregated judgments by non-experts converge to judgments of experts, depending on difficulty of classification task (Benoit et al. 2016)

Platforms: Figure 8 (previously called CrowdFlower) or Amazon's MTurk

Naive Bayes: Probabilistic learning

Intuition: If we observe the term "fantastic" in a text, how likely is this text a positive review?

1. Determine frequency of term in positive and negative reviews (prior).

2. Assess probability of features given a particular class.

3. Get probability of a document belonging to each class (posterior).

4. Which posterior is highest?

Naive Bayes

Advantages

Simple, fast, effective

Relatively small training set required (if classes not too imbalanced)

Easy to obtain probabilities

Disadvantages

Assumption of conditional independence is problematic

If feature is not in training set, it is disregarded for the classification

Note (IfPol): Language violates these assumptions, but the method works anyway; hence the name "naive".

Naive Bayes in quanteda: Example 1

# 1. get training set
dfmat_train <- dfm(c("positive bad negative horrible", "great fantastic nice"))
class <- c("neg", "pos")

# 2. train model
tmod_nb <- textmodel_nb(dfmat_train, class)

# 3. get unlabelled test set
dfmat_test <- dfm(c("bad horrible negative awful", "nice bad great", "great fantastic"))

# 4. predict class
predict(tmod_nb, dfmat_test, force = TRUE)

## Warning: 1 feature in newdata not used in prediction.

## text1 text2 text3
##   neg   pos   pos
## Levels: neg pos
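To see what happens inside such a classifier, the same toy example can be worked through by hand in base R. This is a sketch of a multinomial Naive Bayes with add-one (Laplace) smoothing and equal class priors. Those two settings are assumptions made for the illustration, and the code is not quanteda's implementation:

```r
# Training documents as token vectors, one per class
train <- list(
  neg = c("positive", "bad", "negative", "horrible"),
  pos = c("great", "fantastic", "nice")
)
vocab <- unique(unlist(train))

# P(word | class) with add-one smoothing over the training vocabulary
word_prob <- lapply(train, function(tokens) {
  counts <- as.numeric(table(factor(tokens, levels = vocab)))
  setNames((counts + 1) / (length(tokens) + length(vocab)), vocab)
})

prior <- c(neg = 0.5, pos = 0.5)  # equal class priors (assumption)

# Classify one document: sum log probabilities and pick the largest posterior
predict_nb <- function(tokens) {
  tokens <- tokens[tokens %in% vocab]  # words unseen in training are ignored
  log_post <- sapply(names(train), function(cl) {
    log(prior[[cl]]) + sum(log(word_prob[[cl]][tokens]))
  })
  names(which.max(log_post))
}

test_docs <- list(
  text1 = c("bad", "horrible", "negative", "awful"),
  text2 = c("nice", "bad", "great"),
  text3 = c("great", "fantastic")
)
pred <- sapply(test_docs, predict_nb)
print(pred)
```

The hand calculation gives the same classes as the slide (neg, pos, pos). The word "awful" plays no role because it never appeared in the training set, which is what the warning about unused features refers to.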

Naive Bayes in quanteda: Example 2

Goal: predict whether US inaugural speech was delivered before or after 1945

# create unique ID
docvars(data_corpus_inaugural, "id") <- 1:ndoc(data_corpus_inaugural)

# create dummy (before/after 1945)
docvars(data_corpus_inaugural, "pre_post1945") <-
  ifelse(docvars(data_corpus_inaugural, "Year") > 1945, "Post 1945", "Pre 1945")

# sample 35 speeches
set.seed(123)
speeches_select <- sample(1:58, size = 35, replace = FALSE)

# create training and test sets
dfmat_train <- data_corpus_inaugural %>%
  corpus_subset(id %in% speeches_select) %>%  # speech in id
  dfm()

dfmat_test <- data_corpus_inaugural %>%
  corpus_subset(!id %in% speeches_select) %>%  # speech not in id
  dfm()

Naive Bayes in quanteda: Example 2

# number of documents in training set
ndoc(dfmat_train)

## [1] 35

# train Naive Bayes model
tmod_nb <- textmodel_nb(dfmat_train, y = docvars(dfmat_train, "pre_post1945"))

# number of documents in test set
ndoc(dfmat_test)

## [1] 23

# predict class of held-out test set
tmod_nb_predict <- predict(tmod_nb, newdata = dfmat_test,
                           force = TRUE)  # set force to TRUE or use dfm_match

## Warning: 1268 features in newdata not used in prediction.

Naive Bayes in quanteda: Example 2

# create cross-table
tab <- table(actual = docvars(dfmat_test, "pre_post1945"),
             predicted = tmod_nb_predict)

print(tab)

##            predicted
## actual      Post 1945 Pre 1945
##   Post 1945         6        0
##   Pre 1945          3       14

Assess accuracy using the caret package

library(caret)
performance <- caret::confusionMatrix(tab)

# get measures of classification performance
performance$byClass

##          Sensitivity          Specificity       Pos Pred Value
##            0.6666667            1.0000000            1.0000000
##       Neg Pred Value            Precision               Recall
##            0.8235294            1.0000000            0.6666667
##                   F1           Prevalence       Detection Rate
##            0.8000000            0.3913043            0.2608696
## Detection Prevalence    Balanced Accuracy
##            0.2608696            0.8333333
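The same measures can be computed by hand from the cross-table, which also makes the role of rows and columns explicit. The sketch below treats "Post 1945" as the positive class and reads rows as actual and columns as predicted, as the table is labelled. As far as I understand the caret documentation, confusionMatrix() reads the rows of a table as predictions, so with actual classes in the rows the Precision and Recall it reports would be swapped relative to this hand calculation; accuracy and F1 do not depend on the orientation.

```r
# Cross-table from the slide: rows = actual, columns = predicted
conf <- matrix(
  c(6,  0,
    3, 14),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual    = c("Post 1945", "Pre 1945"),
                  predicted = c("Post 1945", "Pre 1945"))
)

# "Post 1945" is treated as the positive class
tp <- conf["Post 1945", "Post 1945"]  # actual post, predicted post
fn <- conf["Post 1945", "Pre 1945"]   # actual post, predicted pre
fp <- conf["Pre 1945",  "Post 1945"]  # actual pre,  predicted post
tn <- conf["Pre 1945",  "Pre 1945"]   # actual pre,  predicted pre

accuracy  <- (tp + tn) / sum(conf)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
```

Under this reading every post-1945 speech in the test set is found (recall 1), while three pre-1945 speeches are wrongly labelled as post-1945 (precision 2/3).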

Topic models

Required packages: topicmodels and stm

install.packages("stm")
install.packages("topicmodels")

packageVersion("topicmodels")

## [1] '0.2.8'

packageVersion("stm")

## [1] '1.3.3'

Note (IfPol): quanteda is used for text processing; the result is then "fed" to another package.
What are topic models?

Algorithm to find most important "themes/frames/topics" in a corpus

Further information or training data not necessary (entirely unsupervised method)

Researcher only needs to specify number of topics (more difficult than it sounds!)

Examples of a study using topic models

Catalinac (2016)

Amy Catalinac (2016). From Pork to Policy: The Rise of Programmatic Campaigning in Japanese Elections. The Journal of Politics 78(1): 1–18.

Japanese electoral system

Prior to 1994: Single-nontransferable-vote in multimember districts (SNTV-MMD)

Since 1994: mixed-member majoritarian (MMM) electoral system

Expectations:

More pork-barrel politics under SNTV-MMD

Under SNTV-MMD, candidates facing higher levels of intraparty competition relied more on pork-barrel politics

Catalinac (2016): Research design

A Japanese candidate manifesto (figure)

Catalinac (2016): Results (figures on three slides)

Assumptions

Each document consists of a mixture of topics (drawn from LDA distribution)

Each topic is correlated more strongly with certain terms

Most topic models rely on bag-of-words: order of words within document is irrelevant

Example from Blei (2012) (figure)

Latent Dirichlet Allocation

The most common form of topic modeling is Latent Dirichlet Allocation or LDA. LDA works as follows:

1. Specify K (number of topics).

2. Each word in the dfm is assigned randomly to one of the topics (involves a Dirichlet distribution). Numbers assigned across the topics add up to 1.

3. Topic assignments for each word are updated in an iterative fashion by updating the prevalence of the word across the topics, as well as the prevalence of the topics in the document.

4. Assignment stops after user-specified threshold, or when iterations begin to have little impact on the probabilities assigned to words in the corpus.

5. Output:

   1. Identify words that are most frequently associated with each of the topics.

   2. Probability that each document within the corpus is associated with each of the topics. Probabilities add up to 1.

Source: Tutorial by Chris Bail.
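The statement that topic probabilities "add up to 1" follows from how a Dirichlet draw is constructed. The base R sketch below draws one set of topic proportions by normalizing gamma variates, a standard way to sample from a Dirichlet distribution. The helper name rdirichlet_one() and the values of K and alpha are assumptions for illustration, and this is only the random starting point, not the full LDA estimation:

```r
set.seed(1)

# One draw from a Dirichlet distribution via normalized gamma variates
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

K <- 4                                # number of topics (chosen by the researcher)
theta <- rdirichlet_one(rep(0.5, K))  # topic proportions for one document

print(round(theta, 3))
print(sum(theta))
```

Each element of theta is the share of one topic in the document. Smaller values of alpha tend to concentrate the mass on a few topics, larger values spread it more evenly.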

Example: CMU 2008 Political Blog Corpus

URL: http://www.sailing.cs.cmu.edu/main/?page_id=710

# load packages
library(quanteda)
library(readtext)
library(stm)

# url of blogposts
url_blogs <- "https://uclspp.github.io/datasets/data/poliblogs2008.zip"

# download blogposts
dat_blogs <- readtext(url_blogs)

# create a corpus
corp_blogs <- corpus(dat_blogs, text_field = "documents")

# sample 1,500 documents
set.seed(134)
corp_blogs_sample <- corpus_sample(corp_blogs, size = 1500)

STM example: create corpus

summary(corp_blogs_sample, 6)

## Corpus consisting of 1500 documents, showing 6 documents:
##
##                     Text Types Tokens Sentences  text            docname
##  poliblogs2008.csv.13132   162    322        11 13132  tpm0834400_0.text
##   poliblogs2008.csv.8362   196    324        10  8362  ha0831900_11.text
##   poliblogs2008.csv.8618   246    449        14  8618   ha0834700_3.text
##     poliblogs2008.csv.83    44    193         5  3083   at0801200_4.text
##   poliblogs2008.csv.7206   169    266        11  7206   ha0821900_5.text
##  poliblogs2008.csv.11124   416    721        27 11124   tp0829400_9.text
##        rating day blog
##       Liberal 344  tpm
##  Conservative 319   ha
##  Conservative 347   ha
##  Conservative  12   at
##  Conservative 219   ha
##       Liberal 294   tp
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/workshops/modules/* on x86_64 by kbenoit
## Created: Tue Jun 25 20:54:56 2019
## Notes:

STM example: create dfm and remove features

dfmat_blogs <- dfm(corp_blogs_sample,
                   stem = TRUE,
                   remove = stopwords("english"),
                   remove_punct = TRUE,
                   remove_numbers = TRUE)

dfmat_blogs_trim <- dfm_trim(dfmat_blogs, min_termfreq = 10, min_docfreq = 5)

Note (IfPol): For topic models, text preprocessing is extremely important: stopwords, punctuation, rarely used words (trimming).

STM example: convert to stm

# convert dfm to stm format
stm_blogs <- convert(dfmat_blogs_trim, to = "stm", docvars = docvars(dfmat_blogs))

Note (IfPol): The convert() function in quanteda converts to formats for topic models, data.frame, etc.
Running the STM

Use the stm() function. K specifies the number of topics

With K = 0 the stm package selects the number of topics automatically (based on algorithm developed by Lee and Mimno (2014))

Specify covariates with prevalence

# run structural topic model
stm_object <- stm(documents = stm_blogs$documents,
                  vocab = stm_blogs$vocab,
                  data = stm_blogs$meta,
                  prevalence = ~ rating + s(day),
                  K = 20,
                  seed = 12345)

## Beginning Spectral Initialization
##   Calculating the gram matrix...
##   Finding anchor words...
##      ....................
##   Recovering initialization...
##      ........................................
## Initialization complete.
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.352)
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.136, relative change = 2.942e-02)
## ...................................................................................................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.098, relative change = 5.204e-03)

plot(stm_object)

Estimate effects

# pick and label topics of interest
topics_of_interest <- c(3, 6, 16, 18)
topic_labels <- c("Obama", "Financial Crisis", "Bush", "Iraq")

Identify topics of interest

labelTopics(stm_object, c(3, 6))

## Topic 3 Top Words:
##       Highest Prob: media, say, think, peopl, go, know, get
##       FREX: media, matthew, rush, fox, beck, chris, guy
##       Lift: bachmann, limbaugh, beck, hardbal, matthew, luntz, o'reilli
##       Score: bachmann, beck, matthew, limbaugh, clinton, hillari, o'reilli
## Topic 6 Top Words:
##       Highest Prob: time, report, year, new, news, global, warm
##       FREX: warm, climat, ice, global, coverag, newspap, ap
##       Lift: sheet, sulzberg, temperatur, solar, warm, murdoch, pinch
##       Score: sheet, climat, ice, warm, temperatur, sulzberg, solar

labelTopics(stm_object, c(16, 18))

## Topic 16 Top Words:
##       Highest Prob: bush, iraq, war, presid, administr, said, year
##       FREX: bush, saddam, war, hussein, iraq, georg, administr
##       Lift: gonzal, wmd, saddam, shoe, hussein, powel, colin
##       Score: bush, iraq, gonzal, saddam, powel, iraqi, rice
## Topic 18 Top Words:
##       Highest Prob: one, like, can, time, just, think, even
##       FREX: film, love, movi, dowd, rememb, allen, exit
##       Lift: pole, filmmak, film, movi, documentari, tuzla, ala
##       Score: pole, film, dowd, movi, allen, mom, hollywood

Meanings of output

Highest Probability: Words with highest probability of occurring in topic

FREX: "Frequency and Exclusive": words that distinguish topic from all other topics

Lift & Score: Indicators from lda/maptpx packages

Find representative documents of topic

# note: output too long for slide
findThoughts(stm_object, texts(corp_blogs_sample),
             topics = topics_of_interest, n = 3)

Estimatee�ectofcovariates#estimateeffectstm_effects<-estimateEffect(topics_of_interest~rating+s(day),stmobj=stm_object,meta=stm_blogs$meta,uncertainty="Global")

summary(stm_effects)

173

Page 174: University of Münster (27–28 June 2019) Kenneth Benoit

Plot influence of discrete variable
plot(stm_effects, covariate = "rating",
     topics = topics_of_interest,
     model = stm_object,
     method = "difference",
     cov.value1 = "Liberal",
     cov.value2 = "Conservative",
     xlim = c(-0.1, 0.1),
     labeltype = "custom",
     custom.labels = topic_labels,  # pay attention to labelling!
     main = "Effect of Conservative vs. Liberal",
     xlab = "More Conservative ... More Liberal")

plot_effect_libcon <- recordPlot()
plot.new()


plot_effect_libcon


Plot effect of continuous variable
plot(stm_effects, covariate = "day",
     method = "continuous",
     topics = topics_of_interest,
     model = stm_object,
     labeltype = "custom",
     custom.labels = topic_labels,
     xaxt = "n",
     xlab = "Time",
     main = "Topic Prevalence Over Time")

# set the x-axis labels as month names
blog_dates <- seq(from = as.Date("2008-01-01"),
                  to = as.Date("2008-12-01"),
                  by = "month")

axis(1, blog_dates - min(blog_dates), labels = months(blog_dates))

plot_effect_time <- recordPlot()
plot.new()
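
The axis() call places one tick per month by turning dates into day offsets. A minimal sketch of that conversion, assuming (as the slide's code implies) that the day covariate is counted in days from the first date; the names `tick_positions` and `tick_labels` are introduced here for illustration only.

```r
# Monthly dates for 2008, as on the slide
blog_dates <- seq(from = as.Date("2008-01-01"),
                  to = as.Date("2008-12-01"),
                  by = "month")

# Subtracting the first date gives day offsets, which is where the ticks go
tick_positions <- as.numeric(blog_dates - min(blog_dates))
head(tick_positions, 3)  # 0 31 60 (2008 is a leap year)

# months() turns each date into its month name (in the current locale)
tick_labels <- months(blog_dates)
length(tick_labels)  # 12
```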


plot_effect_time


Create (somewhat) better/tidier plots
tidystm: https://github.com/mikajoh/tidystm

stminsights: https://github.com/cschwem2er/stminsights


References and useful descriptions
Margaret Roberts (2017): Structural Topic Models

Chris Bail (2018): Topic Modelling


Install packages
We will work with the dplyr and ggplot2 packages.

install.packages("dplyr")
install.packages("ggplot2")

Having installed the packages, we need to load them into the R environment.

library(dplyr)
library(ggplot2)


Gapminder: our sample dataset
We will use the gapminder dataset throughout this course.

install.packages("gapminder")

library(gapminder)


Getting to know a dataset
Understand the structure and variable coding of a dataset

str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num 28.8 30.3 32 34 36.1 ...
##  $ pop      : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 2
##  $ gdpPercap: num 779 821 853 836 740 ...


Print the first six rows
head(gapminder)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

We can find out more information on the data using ?gapminder.


Data wrangling with dplyr
Important functions:

select: pick columns by name

filter: keep rows matching criteria

arrange: reorder rows

mutate: add new variables

group_by: group rows by category

summarise: reduce variables to values

Data visualisation with ggplot2


Data wrangling: Sample data frame
(df <- data.frame(colour = c("green", "black", "black", "green", "black", "red"),
                  value = 1:6))

##   colour value
## 1  green     1
## 2  black     2
## 3  black     3
## 4  green     4
## 5  black     5
## 6    red     6


Filter/subset a data frame
filter(df, colour == "green")

##   colour value
## 1  green     1
## 2  green     4


Applying the pipe (%>%) operator
df %>%
  filter(colour != "red") %>%
  filter(value >= 5)

##   colour value
## 1  black     5

df %>%
  filter(colour != "red" & value >= 5)

##   colour value
## 1  black     5
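
The pipe passes the left-hand side as the first argument of the function on the right, so a piped chain is just a more readable way of writing nested calls. A small sketch using the sample data frame from the slides (the object names `piped` and `nested` are ours):

```r
library(dplyr)

# Sample data frame from the slides
df <- data.frame(colour = c("green", "black", "black", "green", "black", "red"),
                 value = 1:6)

# Piped version: read left to right
piped <- df %>%
  filter(colour != "red") %>%
  filter(value >= 5)

# Equivalent nested version: read inside out
nested <- filter(filter(df, colour != "red"), value >= 5)

identical(piped, nested)  # TRUE
```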


Exercise
1. Load the gapminder dataset (library(gapminder)).

2. Type ?gapminder and View(gapminder) to inspect the data.

3. Filter only observations from Norway (call the new object dat_nor).

4. Filter only observations from Europe or Africa (gap_europe_africa).


Solution
# load gapminder data
library(gapminder)

# variable names
names(gapminder)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

# number of rows and columns
dim(gapminder)

## [1] 1704    6


Solution
dat_nor <- gapminder %>%
  filter(country == "Norway")

nrow(dat_nor)

## [1] 12

gap_europe_africa <- gapminder %>%
  filter(continent %in% c("Europe", "Africa"))
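
The solution uses %in% rather than == to match several values. A small base R sketch of the difference, with a made-up vector `continent`:

```r
continent <- c("Africa", "Europe", "Asia", "Africa")

# %in% tests each element for membership in the set
in_set <- continent %in% c("Europe", "Africa")
in_set  # TRUE TRUE FALSE TRUE

# == recycles the right-hand side and compares position by position,
# so only the last element happens to line up with its counterpart
eq_set <- continent == c("Europe", "Africa")
eq_set  # FALSE FALSE FALSE TRUE
```

Inside filter(), the == version would therefore silently drop rows that should have been kept.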


Data visualisation with ggplot2
Basic structure:

1. Call ggplot()

2. Specify data with data

3. Specify the aesthetic mappings (e.g. which variables go on the x and y axes) with aes()

plot1 <- ggplot(data = gapminder, aes(x = year, y = lifeExp))


plot1


plot1 + geom_point()
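
A ggplot object with only data and aesthetics draws empty axes; geoms are added as layers with +. The sketch below shows this layering on a small stand-in data frame (`toy`, `base_plot` and `point_plot` are names introduced here, not from the slides).

```r
library(ggplot2)

# Small stand-in data frame with the same column names as the slide's plot
toy <- data.frame(year = c(1952, 1977, 2007),
                  lifeExp = c(49.1, 59.6, 67.0))

# Data and aesthetic mappings only: no layers yet, so only axes are drawn
base_plot <- ggplot(data = toy, aes(x = year, y = lifeExp))
length(base_plot$layers)

# Adding a geom returns a new plot object that carries one layer
point_plot <- base_plot + geom_point()
length(point_plot$layers)
```

Because + returns a new object, the original plot stays unchanged and can be reused with different geoms.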


Hands-on Exercise
1. Using gapminder, create a data frame with only observations from Europe or Africa in the year 2007

2. Create a boxplot with continent on the x-axis and lifeExp on the y-axis


Solution
dat_europe_africa_2007 <- gapminder %>%
  filter(continent %in% c("Europe", "Africa") & year == 2007)

plot_boxplot <- ggplot(dat_europe_africa_2007,
                       aes(x = continent, y = lifeExp)) +
  geom_boxplot()


plot_boxplot


plot_boxplot +
  labs(x = "Continent", y = "Life Expectancy in 2007")


General advice
Save plots as PDF (vector graphics and small files)

Set the ggplot2 theme globally

# example for saving a plot
ggsave("boxplot.pdf", plot_boxplot, width = 5, height = 7)

# example for setting a theme
ggplot2::theme_set(theme_minimal())
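
A self-contained sketch of both pieces of advice, using a made-up data frame and a temporary file so that nothing is written to the working directory (`toy`, `toy_plot` and `out_file` are hypothetical names):

```r
library(ggplot2)

# Set a global theme; theme_set() returns the previous theme so it can be restored
old_theme <- theme_set(theme_minimal())

# Made-up data for a small boxplot
toy <- data.frame(group = c("a", "a", "b", "b"), value = c(1, 2, 3, 5))
toy_plot <- ggplot(toy, aes(x = group, y = value)) + geom_boxplot()

# Write to a temporary file; width and height are in inches by default
out_file <- file.path(tempdir(), "toy_boxplot.pdf")
ggsave(out_file, toy_plot, width = 5, height = 7)
file.exists(out_file)

# Restore the previous theme
theme_set(old_theme)
```

Passing the plot object explicitly to ggsave() avoids depending on whichever plot happened to be displayed last.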


Density plots
plot_density <- ggplot(dat_europe_africa_2007,
                       aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.7) +
  labs(x = "Life Expectancy in 2007")


plot_density


plot_density +
  scale_x_continuous(limits = c(0, 100)) +
  scale_fill_discrete(name = "Continent") +
  theme(legend.position = "bottom")


Boxplots with groups
gapminder_post_1980 <- gapminder %>%
  filter(year > 1980 & continent != "Oceania")

plot_boxes <- ggplot(data = gapminder_post_1980,
                     aes(x = as.factor(year), y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Year", y = "Life expectancy")


plot_boxes


Add facets for a factor variable
plot_boxes + facet_wrap(~continent)


Arrange/reorder a dataset
gap_arranged <- gapminder %>%
  arrange(lifeExp)  # ascending

head(gap_arranged)

## # A tibble: 6 x 6
##   country      continent  year lifeExp     pop gdpPercap
##   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Rwanda       Africa     1992    23.6 7290203      737.
## 2 Afghanistan  Asia       1952    28.8 8425333      779.
## 3 Gambia       Africa     1952    30    284320      485.
## 4 Angola       Africa     1952    30.0 4232095     3521.
## 5 Sierra Leone Africa     1952    30.3 2143249      880.
## 6 Afghanistan  Asia       1957    30.3 9240934      821.


Arrange dataset
gap_arranged_desc <- gapminder %>%
  arrange(-lifeExp)  # descending

head(gap_arranged_desc)

## # A tibble: 6 x 6
##   country          continent  year lifeExp       pop gdpPercap
##   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Japan            Asia       2007    82.6 127467972    31656.
## 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
## 3 Japan            Asia       2002    82   127065841    28605.
## 4 Iceland          Europe     2007    81.8    301931    36181.
## 5 Switzerland      Europe     2007    81.7   7554661    37506.
## 6 Hong Kong, China Asia       2002    81.5   6762476    30209.
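
The minus sign is a shortcut that only works for numeric columns. dplyr's desc() gives the same descending order and also handles character and factor columns, as this small sketch on the sample data frame suggests (`by_value` and `by_colour` are names introduced here):

```r
library(dplyr)

# Sample data frame from the slides
df <- data.frame(colour = c("green", "black", "black", "green", "black", "red"),
                 value = 1:6)

# Numeric column: negation sorts in descending order
by_value <- df %>% arrange(-value)

# Character column: desc() sorts descending; value breaks ties in ascending order
by_colour <- df %>% arrange(desc(colour), value)

by_value$value    # 6 5 4 3 2 1
by_colour$colour  # "red" "green" "green" "black" "black" "black"
```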


Select variables
gapminder_select <- gapminder %>%
  select(year, country, lifeExp)

head(gapminder_select, 2)

## # A tibble: 2 x 3
##    year country     lifeExp
##   <int> <fct>         <dbl>
## 1  1952 Afghanistan    28.8
## 2  1957 Afghanistan    30.3


Reorder variables within data frame
gapminder_reordered <- gapminder %>%
  select(pop, year, everything())

head(gapminder_reordered, 2)

## # A tibble: 2 x 6
##       pop  year country     continent lifeExp gdpPercap
##     <int> <int> <fct>       <fct>       <dbl>     <dbl>
## 1 8425333  1952 Afghanistan Asia         28.8      779.
## 2 9240934  1957 Afghanistan Asia         30.3      821.
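
select() can keep, drop, and reorder columns in one step. A minimal sketch on a made-up three-column data frame (`toy`, with arbitrary column names `a`, `b`, `c`):

```r
library(dplyr)

# Made-up data frame with three columns
toy <- data.frame(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))

# Move one column to the front; everything() fills in the remaining columns
reordered <- toy %>% select(c, everything())
names(reordered)  # "c" "a" "b"

# A minus sign drops a column and keeps the rest
dropped <- toy %>% select(-b)
names(dropped)    # "a" "c"
```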


Exercise
1. Select country, continent, year, lifeExp and pop

2. Arrange gapminder ascending by year and, within year, descending by population.


Solution
gapminder_arranged <- gapminder %>%
  select(-gdpPercap) %>%
  arrange(year, -pop)

# check
View(gapminder_arranged)


Group dataset and summarise within groups
df

##   colour value
## 1  green     1
## 2  black     2
## 3  black     3
## 4  green     4
## 5  black     5
## 6    red     6

# get sum of value variable
df %>%
  summarise(sum = sum(value))

##   sum
## 1  21


Add grouping variable
df %>%
  group_by(colour) %>%
  summarise(sum = sum(value))

## # A tibble: 3 x 2
##   colour   sum
##   <fct>  <int>
## 1 black     10
## 2 green      5
## 3 red        6


Group and summarise data frame
df %>%
  group_by(colour) %>%
  summarise(total = n(), sum = sum(value))

## # A tibble: 3 x 3
##   colour total   sum
##   <fct>  <int> <int>
## 1 black      3    10
## 2 green      2     5
## 3 red        1     6
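
Several summaries can be computed per group in a single summarise() call. The sketch below adds a group mean and cross-checks the sums against base R's tapply(); `by_colour`, `mean_value` and `base_sums` are names introduced here.

```r
library(dplyr)

# Sample data frame from the slides
df <- data.frame(colour = c("green", "black", "black", "green", "black", "red"),
                 value = 1:6)

# One row per colour, with a count, a sum and a mean
by_colour <- df %>%
  group_by(colour) %>%
  summarise(total = n(),
            sum = sum(value),
            mean_value = mean(value))

# Base R cross-check of the group sums
base_sums <- tapply(df$value, df$colour, sum)

by_colour$sum  # 10 5 6
base_sums      # black 10, green 5, red 6
```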


Exercise
1. Use gapminder.

2. Group by year.

3. Calculate the average life expectancy and the maximum GDP per capita for each year.


Solution
data_gap_summarised <- gapminder %>%
  group_by(year) %>%
  summarise(life_exp_mean = mean(lifeExp),
            gdp_max = max(gdpPercap))

data_gap_summarised

## # A tibble: 12 x 3
##     year life_exp_mean gdp_max
##    <int>         <dbl>   <dbl>
##  1  1952          49.1 108382.
##  2  1957          51.5 113523.
##  3  1962          53.6  95458.
##  4  1967          55.7  80895.
##  5  1972          57.6 109348.
##  6  1977          59.6  59265.
##  7  1982          61.5  33693.
##  8  1987          63.2  31541.
##  9  1992          64.2  34933.
## 10  1997          65.0  41283.
## 11  2002          65.7  44684.
## 12  2007          67.0  49357.


Create new variables with mutate
names(gapminder)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

gapminder <- gapminder %>%
  group_by(year) %>%
  mutate(life_exp_mean = mean(lifeExp),
         diff_country_mean = lifeExp - life_exp_mean)
summary(gapminder$diff_country_mean)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -40.561  -9.583   1.293   0.000  10.240  23.612
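
Unlike summarise(), a grouped mutate() keeps every row and computes the new values within each group; the result also stays grouped until ungroup() is called. A minimal sketch on made-up values (`toy` and `centred` are hypothetical names):

```r
library(dplyr)

# Made-up data: two observations in each of two years
toy <- data.frame(year = c(1952, 1952, 2007, 2007),
                  lifeExp = c(40, 60, 70, 80))

centred <- toy %>%
  group_by(year) %>%
  mutate(life_exp_mean = mean(lifeExp),
         diff_country_mean = lifeExp - life_exp_mean) %>%
  ungroup()  # drop the grouping so later steps use the whole data

centred$diff_country_mean  # -10 10 -5 5
is_grouped_df(centred)     # FALSE
```

Because deviations are taken from each year's own mean, they sum to zero within every year, which matches the overall mean of 0.000 in the summary output.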


Further information
Wickham, Hadley and Garrett Grolemund (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol: O’Reilly.

Healy, Kieran (2019). Data Visualization: A Practical Introduction. Princeton: Princeton University Press.


Course resources
Documentation:

https://quanteda.io

https://readtext.quanteda.io

https://spacyr.quanteda.org

Tutorials: https://tutorials.quanteda.io

Cheatsheet: https://www.rstudio.com/resources/cheatsheets/

Kenneth Benoit, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. "quanteda: An R Package for the Quantitative Analysis of Textual Data." Journal of Open Source Software 3(30): 774.
