Practical Text Mining With Perl
Roger Bilisoly
Department of Mathematical Sciences
Central Connecticut State University
WILEY
A JOHN WILEY & SONS, INC., PUBLICATION
Contents
List of Figures xiii
List of Tables xv
Preface xvii
Acknowledgments xxiii
1 Introduction 1
1.1 Overview of this Book 1
1.2 Text Mining and Related Fields 2
1.2.1 Chapter 2: Pattern Matching 2
1.2.2 Chapter 3: Data Structures 3
1.2.3 Chapter 4: Probability 3
1.2.4 Chapter 5: Information Retrieval 3
1.2.5 Chapter 6: Corpus Linguistics 4
1.2.6 Chapter 7: Multivariate Statistics 4
1.2.7 Chapter 8: Clustering 5
1.2.8 Chapter 9: Three Additional Topics 5
1.3 Advice for Reading this Book 5
2 Text Patterns 7
2.1 Introduction 7
2.2 Regular Expressions 8
2.2.1 First Regex: Finding the Word Cat 8
2.2.2 Character Ranges and Finding Telephone Numbers 10
2.2.3 Testing Regexes with Perl 12
2.3 Finding Words in a Text 15
2.3.1 Regex Summary 15
2.3.2 Nineteenth-Century Literature 17
2.3.3 Perl Variables and the Function split 17
2.3.4 Match Variables 20
2.4 Decomposing Poe's "The Tell-Tale Heart" into Words 21
2.4.1 Dashes and String Substitutions 23
2.4.2 Hyphens 24
2.4.3 Apostrophes 27
2.5 A Simple Concordance 28
2.5.1 Command Line Arguments 33
2.5.2 Writing to Files 33
2.6 First Attempt at Extracting Sentences 34
2.6.1 Sentence Segmentation Preliminaries 35
2.6.2 Sentence Segmentation for A Christmas Carol 37
2.6.3 Leftmost Greediness and Sentence Segmentation 41
2.7 Regex Odds and Ends 46
2.7.1 Match Variables and Backreferences 47
2.7.2 Regular Expression Operators and Their Output 48
2.7.3 Lookaround 50
2.8 References 52
Problems 52
3 Quantitative Text Summaries 59
3.1 Introduction 59
3.2 Scalars, Interpolation, and Context in Perl 59
3.3 Arrays and Context in Perl 60
3.4 Word Lengths in Poe's "The Tell-Tale Heart" 64
3.5 Arrays and Functions 66
3.5.1 Adding and Removing Entries from Arrays 66
3.5.2 Selecting Subsets of an Array 69
3.5.3 Sorting an Array 69
3.6 Hashes 73
3.6.1 Using a Hash 74
3.7 Two Text Applications 77
3.7.1 Zipf's Law for A Christmas Carol 77
3.7.2 Perl for Word Games 83
3.7.2.1 An Aid to Crossword Puzzles 83
3.7.2.2 Word Anagrams 84
3.7.2.3 Finding Words in a Set of Letters 85
3.8 Complex Data Structures 86
3.8.1 References and Pointers 87
3.8.2 Arrays of Arrays and Beyond 90
3.8.3 Application: Comparing the Words in Two Poe Stories 92
3.9 References 96
3.10 First Transition 97
Problems 97
4 Probability and Text Sampling 105
4.1 Introduction 105
4.2 Probability 105
4.2.1 Probability and Coin Flipping 106
4.2.2 Probabilities and Texts 108
4.2.2.1 Estimating Letter Probabilities for Poe and Dickens 109
4.2.2.2 Estimating Letter Bigram Probabilities 112
4.3 Conditional Probability 115
4.3.1 Independence 117
4.4 Mean and Variance of Random Variables 118
4.4.1 Sampling and Error Estimates 120
4.5 The Bag-of-Words Model for Poe's "The Black Cat" 123
4.6 The Effect of Sample Size 124
4.6.1 Tokens vs. Types in Poe's "Hans Pfaall" 124
4.7 References 128
Problems 129
5 Applying Information Retrieval to Text Mining 133
5.1 Introduction 133
5.2 Counting Letters and Words 134
5.2.1 Counting Letters in Poe with Perl 134
5.2.2 Counting Pronouns Occurring in Poe 136
5.3 Text Counts and Vectors 138
5.3.1 Vectors and Angles for Two Poe Stories 139
5.3.2 Computing Angles between Vectors 140
5.3.2.1 Subroutines in Perl 140
5.3.2.2 Computing the Angle between Vectors 143
5.4 The Term-Document Matrix Applied to Poe 143
5.5 Matrix Multiplication 147
5.5.1 Matrix Multiplication Applied to Poe 148
5.6 Functions of Counts 150
5.7 Document Similarity 152
5.7.1 Inverse Document Frequency 153
5.7.2 Poe Story Angles Revisited 154
5.8 References 157
Problems 157
6 Concordance Lines and Corpus Linguistics 161
6.1 Introduction 161
6.2 Sampling 162
6.2.1 Statistical Survey Sampling 162
6.2.2 Text Sampling 163
6.3 Corpus as Baseline 164
6.3.1 Function vs. Content Words in Dickens, London, and Shelley 168
6.4 Concordancing 169
6.4.1 Sorting Concordance Lines 170
6.4.1.1 Code for Sorting Concordance Lines 171
6.4.2 Application: Word Usage Differences between London and Shelley 172
6.4.3 Application: Word Morphology of Adverbs 176
6.5 Collocations and Concordance Lines 179
6.5.1 More Ways to Sort Concordance Lines 179
6.5.2 Application: Phrasal Verbs in The Call of the Wild 181
6.5.3 Grouping Words: Colors in The Call of the Wild 184
6.6 Applications with References 185
6.7 Second Transition 187
Problems 188
7 Multivariate Techniques with Text 191
7.1 Introduction 191
7.2 Basic Statistics 192
7.2.1 z-Scores Applied to Poe 193
7.2.2 Word Correlations among Poe's Short Stories 195
7.2.3 Correlations and Cosines 199
7.2.4 Correlations and Covariances 201
7.3 Basic Linear Algebra 202
7.3.1 2 by 2 Correlation Matrices 202
7.4 Principal Components Analysis 205
7.4.1 Finding the Principal Components 206
7.4.2 PCA Applied to the 68 Poe Short Stories 206
7.4.3 Another PCA Example with Poe's Short Stories 209
7.4.4 Rotations 209
7.5 Text Applications 211
7.5.1 A Word on Factor Analysis 211
7.6 Applications and References 211
Problems 212
8 Text Clustering 219
8.1 Introduction 219
8.2 Clustering 220
8.2.1 Two-Variable Example of k-Means 220
8.2.2 k-Means with R 223
8.2.3 He versus She in Poe's Short Stories 224
8.2.4 Poe Clusters Using Eight Pronouns 229
8.2.5 Clustering Poe Using Principal Components 230
8.2.6 Hierarchical Clustering of Poe's Short Stories 234
8.3 A Note on Classification 235
8.3.1 Decision Trees and Overfitting 235
8.4 References 236
8.5 Last Transition 236
Problems 236
9 A Sample of Additional Topics 243
9.1 Introduction 243
9.2 Perl Modules 243
9.2.1 Modules for Number Words 244
9.2.2 The StopWords Module 245
9.2.3 The Sentence Segmentation Module 245
9.2.4 An Object-Oriented Module for Tagging 247
9.2.5 Miscellaneous Modules 248
9.3 Other Languages: Analyzing Goethe in German 248
9.4 Permutation Tests 251
9.4.1 Runs and Hypothesis Testing 252
9.4.2 Distribution of Character Names in Dickens and London 254
9.5 References 258
Appendix A: Overview of Perl for Text Mining 259
A.1 Basic Data Structures 259
A.1.1 Special Variables and Arrays 262
A.2 Operators 263
A.3 Branching and Looping 266
A.4 A Few Perl Functions 270
A.5 Introduction to Regular Expressions 271
Appendix B: Summary of R used in this Book 275
B.1 Basics of R 275
B.1.1 Data Entry 276
B.1.2 Basic Operators 277
B.1.3 Matrix Manipulation 278
B.2 This Book's R Code 279
References 283
Index 291