Text Mining - Danushka Bollegala (danushka.net/lect/dm/text_mining_1.pdf)
Text Mining
Department of Computer Science
University of Liverpool
February 27, 2019
comp 527: Text Mining February 27, 2019 1 / 49
Overview
1 Introduction
  - Some examples
  - Definition and Challenges
  - Steps in text mining
2 Preprocessing
  - Tokenisation
  - Stemming
  - Stopword Removal
  - Sentence Segmentation
3 Part-Of-Speech (POS) Tagging
  - Rule-based Methods
  - Probabilistic Models
4 References
Simple Question: Why do dogs howl at the moon?
Text Mining Around Us - Sentiment Analysis
source: https://www.jellyfish.co.uk/news-and-views/update-eu-referendum-campaigns-seem-to-be-causing-little-impact
Text Mining Around Us - Opinion Mining
source: https://phys.org/news/2018-04-brexit-debate-twitter-driven-economic.html
Text Mining Around Us - Movie Recommendation Systems
Text Mining Around Us - Document Summarization
Text Mining - Definition and Challenges
Text mining
  - process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan et al., 1999]
  - a.k.a. text data mining [Hearst, 1997], knowledge discovery from textual databases [Feldman and Dagan, 1995]
  - text analytics - application to solve business problems
Text Mining - Challenges
Unorganized form of data
  - semi-structured or unstructured
Deriving semantics from content
  - ambiguities at different levels - lexical, syntactic, semantic and pragmatic
  - Text has multiple interpretations
    - Teacher Strikes Idle Kids
    - Violinist linked to JAL crash blossoms
  - Word sense ambiguity
    - Red Tape Holds Up New Bridges
Non-standard English
  - language in Tweets
    - SOO PROUD of what U accomp.
Text Mining - Challenges
New Words
  - 850 new words were added to the dictionary at Merriam-Webster.com in 2018
    - Cryptocurrency
    - Chiweenie - a cross between a Chihuahua and a dachshund
    - Dumpster fire - a disastrous event
Idioms
  - dark horse; get cold feet
Combining information from multi-lingual texts
Integrate domain knowledge
Steps in Text Mining
source: http://openminted.eu/text-mining-101/
Text Mining - Preprocessing Steps
Tokenisation
Stemming
Stopword Removal
Sentence Segmentation
Tokenisation
Process of splitting text into words
What is a word?
  - string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks [Kucera and Francis, 1967].
Useful clue - space or tab (English)
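As a minimal sketch, the definition above can be approximated with a regular expression that accepts runs of alphanumeric characters joined by internal hyphens or apostrophes. The pattern and the name `simple_tokenise` are illustrative assumptions, not part of the lecture.

```python
import re

# Alphanumeric runs, optionally joined by internal hyphens or apostrophes.
# All other punctuation is dropped (an assumption made for this sketch).
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def simple_tokenise(text):
    """Return the words found in `text` under the definition above."""
    return WORD_PATTERN.findall(text)


print(simple_tokenise("The state-of-the-art parser didn't fail, did it?"))
# A price such as $22.50 is broken apart by this pattern.
print(simple_tokenise("$22.50"))
```

The second call shows why this definition alone is not enough: the period inside a price is treated as a separator.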
Tokenisation - Problems
Periods
  - removing them usually helps
  - but useful to retain in certain cases such as $22.50; Ed.
Hyphenation
  - useful to retain in some cases, e.g., state-of-the-art
  - better to remove in other cases, e.g., gold-import ban, 50-year-old
Single apostrophes
  - useful to remove them, e.g., isn’t, didn’t
Space may not be a useful clue all the time
Sometimes we want to treat words separated by spaces as a ‘single’ word
For example:
  - San Francisco
  - University of Liverpool
  - Danushka Bollegala
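Some of these cases can be handled with an ordered set of regular-expression alternatives followed by a merge step for multi-word expressions. The sketch below is one possible approach under stated assumptions: the pattern, the `MULTIWORD` phrase list and both function names are hypothetical and chosen only for illustration.

```python
import re

# Alternatives are tried in order, so more specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}       # abbreviations such as U.K.
  | \$\d+(?:\.\d+)?          # currency amounts such as $22.50
  | \d+\.\d+                 # decimal numbers
  | \w+(?:[-'’]\w+)*         # words, keeping internal hyphens and apostrophes
  | [^\w\s]                  # any other punctuation mark, one token each
""", re.VERBOSE)

# Hypothetical list of phrases that should be kept as single tokens.
MULTIWORD = [("San", "Francisco"), ("University", "of", "Liverpool")]


def tokenise(text):
    """Split `text` into tokens using TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


def merge_multiword(tokens, phrases=MULTIWORD):
    """Join consecutive tokens that spell out a known multi-word phrase."""
    merged, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase starts at position i, so keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = tokenise("A state-of-the-art lab at the University of Liverpool cost $22.50.")
print(merge_multiword(tokens))
```

Whether hyphenated forms should be kept or split still depends on the application, so the word alternative would need adjusting for cases such as 50-year-old.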
Regular Expressions for Tokenisation
Regular Expressions Cheatsheet
Regular Expressions for Tokenisation
Stanford Parser for Tokenisation
Tokenisation
Tokenisation turns out to be more difficult than one expects
No single solution works well
Decide what counts as a token depending on the application domain
spaCy (https://spacy.io/)
spaCy - a relatively new package for “Industrial strength NLP in Python”.
Developed by Matt Honnibal at Explosion AI
Designed with applied data scientists in mind
spaCy supports:
  - Tokenisation
  - Lemmatisation
  - Part-of-speech tagging
  - Entity recognition
  - Dependency parsing
  - Sentence recognition
  - Word-to-vector transformations
spaCy - Feature Comparison
source: https://spacy.io/usage/facts-figures
spaCy - Benchmarks
source: https://spacy.io/usage/facts-figures
spaCy - Detailed Speed Comparison
source: https://spacy.io/usage/facts-figures
Tokenization in spaCy
Tokenizes text into words, punctuation and so on.
Applies rules specific to each language
Step 1: Split raw text based on whitespace characters (text.split(' '))
Step 2: Process each substring from left to right and perform two checks:
Does the substring match a tokenizer exception rule?
  - e.g., “don’t” ==> no whitespace ==> but split into two tokens “do” and “n’t”
  - “U.K.” ==> remains as one token
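The two steps can be sketched in plain Python without spaCy itself. This is a simplified illustration, not spaCy's implementation: the exception table, the punctuation sets and the function name are assumptions. The second check used here, splitting off leading and trailing punctuation, follows the description in the spaCy documentation.

```python
# Hypothetical exception rules: substring -> tokens it should become.
EXCEPTIONS = {
    "don't": ["do", "n't"],
    "U.K.": ["U.K."],
}
PREFIX_CHARS = set('("[')
SUFFIX_CHARS = set(')"].,!?;:')


def tokenize(text):
    """Whitespace split, then exception lookup and punctuation peeling."""
    tokens = []
    for chunk in text.split():            # Step 1: split on whitespace
        leading, trailing = [], []
        while chunk:                      # Step 2: process the substring
            if chunk in EXCEPTIONS:       # check 1: tokenizer exception rule
                leading.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PREFIX_CHARS:    # check 2: split off a prefix
                leading.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIX_CHARS:   # check 2: split off a suffix
                trailing.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                leading.append(chunk)
                chunk = ""
        tokens.extend(leading + trailing)
    return tokens


print(tokenize("I don't live in the U.K."))
print(tokenize('"U.K." (don\'t)!'))
```

Checking the exception table before peeling punctuation is what keeps the final period of U.K. attached to the token.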
Tokenization in spaCy
source: https://spacy.io/usage/spacy-101
Tokenization in spaCy
source: https://spacy.io/usage/spacy-101
Tokenization in spaCy
source: https://spacy.io/usage/spacy-101
Stemming
Removal of inflectional endings from words (strip off any affixes)
  - connections, connecting, connect, connected → connect
Problems
  - Can conflate semantically different words
  - Gallery and gall may both be stemmed to gall
Lemmatization: a further step to ensure that the resulting form is a word present in a dictionary
Regular Expressions for Stemming
note that the star operator is “greedy”
the .* part of the expression tries to consume as much of the input as possible
the non-greedy version of the star operator is *?
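The difference between the two operators can be seen on a single word. The suffix list below is an assumption made for this sketch, since the expression on the original slide is not preserved in the transcript.

```python
import re

# Assumed suffix alternatives for illustration.
SUFFIXES = r"(ing|ly|ed|ious|ies|ive|es|s|ment)"

# Greedy: .* takes as much as it can and leaves the shortest suffix.
greedy = re.findall(r"^(.*)" + SUFFIXES + r"$", "processes")

# Non-greedy: .*? takes as little as it can and leaves the longest suffix.
lazy = re.findall(r"^(.*?)" + SUFFIXES + r"$", "processes")

print(greedy)  # [('processe', 's')]
print(lazy)    # [('process', 'es')]
```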
Regular Expressions for Stemming
Regular Expressions for Stemming
Problems
  - RE removes ‘s’ from ‘ponds’, but also from ‘is’ and ‘basis’
  - produces some non-words like ‘distribut’, ‘deriv’
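Both problems are easy to reproduce with a small regular-expression stemmer. The pattern and the name `regex_stem` are assumptions for this sketch, because the expression shown on the slide is not preserved in the transcript.

```python
import re

# Non-greedy stem followed by an optional suffix (assumed suffix list).
STEM_PATTERN = re.compile(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$")


def regex_stem(word):
    """Return `word` with at most one suffix from the list stripped."""
    return STEM_PATTERN.match(word).group(1)


for word in ["ponds", "is", "basis", "distributed", "derived"]:
    print(word, "->", regex_stem(word))
```

The stemmer has no notion of which words are plurals, so ‘is’ and ‘basis’ lose their final ‘s’ just as ‘ponds’ does.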
NLTK Stemmers
NLTK provides several off-the-shelf stemmers
Porter and Lancaster stemmers have their own rules for stripping affixes
Is stemming useful?
Provides some improvement for IR performance (especially for smaller documents).
Very useful for some queries, but on average does not help much.
Since the improvement is minimal, IR engines often do not use stemming.
Stopword Removal
Removal of high frequency words
The most common words, such as articles, prepositions and pronouns, do not help in identifying meaning
Figure: A stop list of 25 semantically non-selective words which are common in Reuters-RCV1
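Filtering against a stop list takes only a few lines. The list below is a small illustrative set, not the 25-word list from the figure, and the function name is an assumption.

```python
# Illustrative stop list only; a real list would be longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


print(remove_stopwords("The University of Liverpool is in the north".split()))
```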
Methods for stopword removal
Classic Method
  - removing stop-words using pre-compiled lists
Zipf’s law (Z-methods)
  - frequency of a word is inversely proportional to its rank in the frequency table
  - remove most frequent words
Mutual Information Method
  - supervised method that computes mutual information between a given term and a document class
  - low mutual information suggests low discrimination power of the term and hence it should be removed
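The frequency-based and mutual-information methods can be sketched on a toy corpus. The documents, labels and function names below are hypothetical. Mutual information is computed from document-level presence or absence of the term, which is one common formulation and an assumption here.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical): each document is a list of tokens.
DOCS = [["the", "goal"], ["the", "match"], ["the", "vote"], ["the", "election"]]
LABELS = ["sport", "sport", "politics", "politics"]


def most_frequent_terms(docs, k):
    """Frequency-based candidates: the k terms at the top of the rank table."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {term for term, _ in counts.most_common(k)}


def mutual_information(docs, labels, term, target):
    """Mutual information (in bits) between term presence and class membership."""
    n = len(docs)
    # Count documents for each (term present?, in target class?) combination.
    cells = Counter((term in doc, label == target) for doc, label in zip(docs, labels))
    mi = 0.0
    for (has_term, in_class), n_tc in cells.items():
        n_t = sum(v for (t, _), v in cells.items() if t == has_term)
        n_c = sum(v for (_, c), v in cells.items() if c == in_class)
        mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


print(most_frequent_terms(DOCS, 1))
print(mutual_information(DOCS, LABELS, "the", "sport"))
print(mutual_information(DOCS, LABELS, "goal", "sport"))
```

In this toy corpus "the" occurs in every document, so it carries no information about the class and is a stopword candidate under both methods.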
Sentence Segmentation
Divide text into sentences
Involves identifying sentence boundaries between words in different sentences
a.k.a. sentence boundary detection, sentence boundary disambiguation, sentence boundary recognition
Useful and necessary for various NLP tasks such as
  - sentiment analysis
  - relation extraction
  - question answering systems
  - knowledge extraction
Sentence boundary detection algorithms
Heuristic methods
Statistical classification trees [Riley, 1989]
  - probability of a word occurring before or after a boundary, case and length of words
Neural Networks [Palmer and Hearst, 1997]
  - POS distribution of preceding and following words
Maximum entropy model [Mikheev, 1998]
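A heuristic method, the simplest of the approaches listed above, can be sketched as a rule that ends a sentence at ".", "!" or "?" when the next word starts with a capital letter, unless the word before the mark is a known abbreviation. The abbreviation list and the function name are assumptions for illustration.

```python
import re

# Hypothetical abbreviation list; real systems use much larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "ed.", "e.g.", "i.e.", "etc."}

# Candidate boundary: end punctuation followed by whitespace and a capital.
BOUNDARY = re.compile(r"[.!?](?=\s+[A-Z])")


def split_sentences(text):
    """Split `text` at candidate boundaries that do not follow an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to the abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith paid $22.50 for it. He was happy! Was it worth it?"))
```

The rule fails when an abbreviation really does end a sentence, which is one motivation for the statistical methods.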
Sentence Segmentation - Using spaCy
Sentence Segmentation - Using spaCy
Part-of-Speech Tagging (POS)
Task of assigning POS tags (Nouns, Verbs, Adjectives, Adverbs, ...) to words
POS tags provide a lot of information about a word
  - knowing whether a word is a noun or a verb gives information about neighbouring words
  - nouns are preceded by determiners and adjectives; verbs by nouns
  - useful for Named entity recognition; Machine Translation; Parsing; Word sense disambiguation
Given a word, we assume it can belong to only one of the POS tags.
POS Tagging problem
  - Given a sentence $S = w_1 w_2 \ldots w_n$ consisting of $n$ words, determine the corresponding tag sequence $P = P_1 P_2 \ldots P_n$
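To make the input and output of the problem concrete, the sketch below implements a most-frequent-tag baseline: each word receives the tag it was seen with most often in training, so a sentence of n words always yields n tags. This baseline is an illustration chosen here, not a method from the lecture, and the training data, default tag and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical), Penn Treebank-style tags.
TRAINING = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("play", "NN"), ("was", "VBD"), ("nice", "JJ")],
    [("dogs", "NNS"), ("play", "VBP")],
    [("children", "NNS"), ("play", "VBP")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it occurs with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Return one tag per word; unknown words get the default tag."""
    return [model.get(word.lower(), default) for word in words]


model = train_unigram_tagger(TRAINING)
print(tag(["The", "dog", "barks"], model))
print(tag(["the", "play", "was", "nice"], model))
```

The second call tags "play" as a verb even though it is a noun in that sentence, because the baseline ignores context. Context is what rule-based and probabilistic taggers add.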
POS Tagging - Challenges
POS Tagging - Tagset
Figure: Penn Treebank POS Tags
POS Tagging - Brown Corpus
Brown Corpus - standard corpus used for POS tagging task
first text corpus of American English
published in 1963-1964 by Francis and Kucera
consists of 1 million words (500 samples of 2000+ words each)
Brown corpus is PoS tagged with Penn TreeBank tagset.
≈ 11% of the word types are ambiguous with regard to POS
≈ 40% of the word tokens are ambiguous
ambiguity for common words, e.g. that
  - I know that he is honest = preposition (IN)
  - Yes, that play was nice = determiner (DT)
  - You can’t go that far = adverb (RB)
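The gap between the type and token figures arises because ambiguous words tend to be frequent ones. A small sketch of how the two rates are computed is given below. The tagged tokens are a hypothetical sample built from the example sentences, and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical tagged sample based on the first two example sentences.
TAGGED = [
    ("I", "PRP"), ("know", "VBP"), ("that", "IN"), ("he", "PRP"),
    ("is", "VBZ"), ("honest", "JJ"),
    ("that", "DT"), ("play", "NN"), ("was", "VBD"), ("nice", "JJ"),
]


def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_by_type = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_type[word.lower()].add(tag)
    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {word for word, tags in tags_by_type.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_type)
    token_rate = sum(word.lower() in ambiguous for word, _ in tagged_tokens) / len(tagged_tokens)
    return type_rate, token_rate


type_rate, token_rate = ambiguity_rates(TAGGED)
print(f"ambiguous types: {type_rate:.2f}, ambiguous tokens: {token_rate:.2f}")
```

Here one word type out of nine is ambiguous, but it accounts for two of the ten tokens.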
Automatic POS Tagging
Symbolic
  - Rule-based
  - Transformation-based
Probabilistic
  - Hidden Markov Models
  - Maximum Entropy Markov Models
  - Conditional Random Fields
Automatic POS Tagging - Brill Tagger
Automatic POS Tagging - Brill Tagger
Automatic POS Tagging: Brill Tagger - Example
Automatic POS Tagging: Brill Tagger - Sample Final Rules
Next Week
Probabilistic Models for POS Tagging
Relation Extraction
Question and Answering