Text Mining - An Overview (Workshop DM-DW, 21Ott04) - non...

A. Tagarelli: Text Mining - An Overview, UNICAL, 21/10/2004

TEXT MINING
An Overview of Concepts, Techniques and Applications

Ing. Andrea Tagarelli

Workshop Data Warehousing and Data Mining


Tutorial goals

• Introduce you to the major aspects of the Knowledge Discovery process when the available data is textual and unstructured
• Provide a systematization of the many concepts around this area, along the following lines:
  – the process
  – the methods
  – the applications
• Important issues that will not be covered in this tutorial:
  – The problem of Dimensionality Reduction
  – Text categorization
    – Learning techniques
    – Evaluation
  – Visualization techniques
  – Tools for Text Mining
  – etc.


Tutorial Outline

1. Introduction
  – Motivations
  – Basic concepts in Knowledge Discovery from textual data
2. Deeper into Text Mining: Text Representation
  – Functions
  – Models
  – Storage techniques
  – Index term Identification
  – Index term Weighting
3. Applications







The reason for Text Mining

[Figure: bar chart of the amount of information (percentage, 0 to 100) held in collections of text versus structured data]


Text sources

• A non-exhaustive list:
  – Web pages
  – E-books
  – Email
  – News articles
  – Insurance claims
  – Patent portfolios
  – IRC
  – Technical documents
  – Scientific articles
  – Customer complaint letters
  – Contracts
  – Transcripts of phone calls with customers


Problems with textual data (I)

• The known KDD problems and challenges extend to textual data
  – large (textual) data collections
  – high dimensionality
  – overfitting
  – changing data and knowledge
  – noisy data
  – understandability of mined patterns
  – etc.


Problems with textual data (II)

• But there are new problems
  – Text is not designed to be used by computers
  – Complex and poorly defined structure and semantics
  – Harder still: ambiguity
    – in speech, morphology, syntax, semantics, pragmatics
  – Multilingualism
    – lack of reliable and general translation tools


What is Text Mining?

• People's first thought:
  – Make it easier to find things on the Web.
  – But this is information retrieval!
• The foundation of most commercial “text mining” products is all this “improperly named” stuff
  – Information retrieval engine
  – Web spider/search
  – Text classification
  – Information extraction (only sometimes)
• The metaphor of extracting ore from rock:
  – Does make sense for extracting documents of interest from a huge pile.
  – But does not reflect the notions of Data Mining in practice. Rather:
    – finding patterns across large collections
    – discovering heretofore unknown information


Text Mining: definitions

• Text mining is mainly about extracting information and knowledge from text
• 2 definitions:
  – Any operation related to gathering and analyzing text from external sources for business intelligence purposes
  – Discovery of knowledge previously unknown to the user in text
• Text mining is the process of:
  compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers and to discover relationships between related facts that span wide domains of inquiry.


Text Mining: contributing areas

[Figure: Text Mining shown as drawing on four contributing areas: Data Mining, Information Extraction, Information Retrieval, and Natural Language Processing]


Tutorial Outline

1. Introduction and basic concepts
  – Motivations
  – Basic concepts in Knowledge Discovery from textual data
2. Deeper into Text Mining: Text Representation
  – Functions
  – Models
  – Storage techniques
  – Index term Identification
  – Index term Weighting
3. Applications


Text Representation: Major Functions

• Indicative
  – reveals elements of the content, upon which the relevancy of the complete original text can be decided
  – useful for
    – Document browsing systems
    – Document retrieval systems
• Informative
  – represents a real substitute for the content of the full text (or part of it), without references to its original text
  – useful for
    – Question-answering systems
    – Document retrieval systems


Text Representation: Browsing systems

• Usually part of hypertext and hypermedia systems
• Allow users to skim text collections in the search for valuable information
• Users need not
  – generate descriptions of what they want or
  – specify in advance the topics of interest
  but can just indicate documents they find relevant
• Useful when a user
  – has no clear need
  – cannot express his need accurately
  – or is a casual user of the information


Text Representation: Question-answering systems

• Also known as Information Extraction systems
• Retrieve specific information from the documents, by extracting or inferring answers from the text representation
• Template types:
  – Slots in a template are typically filled by a substring from the document
  – Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself
  – Some slots may allow multiple fillers
  – Assumes slots are always in a fixed order
• Extraction patterns:
  – Specify an item to extract for a slot, e.g. using a regular expression pattern
  – May require a preceding (pre-filler) pattern to identify the proper context, and a succeeding (post-filler) pattern to identify the end of the filler
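
As a rough illustration of an extraction pattern, the sketch below fills one template slot with a regular expression made of a pre-filler context, a filler, and a post-filler. The slot name `price`, the pattern itself, and the sample sentence are hypothetical and chosen only for this example.

```python
import re

# Hypothetical slot "price": pre-filler context, filler, post-filler.
PRICE_PATTERN = re.compile(
    r"(?:price|cost)s?\s+(?:is|of)\s+"   # pre-filler: identifies the proper context
    r"(?P<filler>\$\d+(?:\.\d{2})?)"     # filler: the substring copied into the slot
    r"(?=\s|[.,;]|$)"                    # post-filler: marks the end of the filler
)

def fill_slot(text):
    """Return the slot filler found in text, or None if the pattern does not match."""
    match = PRICE_PATTERN.search(text)
    return match.group("filler") if match else None

# A one-slot template filled from a sample sentence.
template = {"price": fill_slot("The listed price is $149.99, shipping excluded.")}
print(template)
```

A slot whose pattern does not match simply stays empty (`None`), which mirrors a template with an unfilled slot.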


Text Representation: Information Retrieval systems (I)

• Select documents from a collection in response to a user’s query
  – Search request is formulated in natural language
• Rank these documents according to their relevance to the query
  – Matching between document representation and query representation
• Return a list of possibly relevant texts, the representations of which best match the request representation


Text Representation: Information Retrieval systems (II)

• Retrieval models
  – Boolean model
  – Vector space model
  – Probabilistic model
  – Network model
  – Logic-based model
• Differences with respect to
  – representation of textual contents,
  – representation of information needs,
  – and their matching







Text Representation: Boolean model (I)

• Compares the boolean query statement with the term sets used to identify the textual content (index terms)
• Query has the form of an expression containing
  – index terms
  – boolean operators (AND, OR and NOT) defined upon the terms
• Document matches the condition or not
• This model is employed in many commercial systems
  – Professional searchers still like boolean queries: you know exactly what you’re getting


Text Representation: Boolean model (II)

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• Term-document incidence matrix (1 if play contains word, 0 otherwise):

             Antony and   Julius   The
             Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
  Antony         1           1        0         0        0         1
  Brutus         1           1        0         1        0         0
  Caesar         1           1        0         1        1         1
  Calpurnia      0           1        0         0        0         0
  Cleopatra      1           0        0         0        0         0
  mercy          1           0        1         1        1         1
  worser         1           0        1         1        1         0

• To answer the query:
  – Idea: query satisfaction = overlap measure
  – take the vectors for Brutus, Caesar and Calpurnia (complemented) and combine them with a bitwise AND
  – 110100 AND 110111 AND 101111 = 100100
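
The bitwise evaluation of the query can be sketched in a few lines. The bit vectors are the Brutus, Caesar and Calpurnia rows of the incidence matrix; the helper name `matching_plays` and the encoding of each row as a Python integer are choices made for this illustration.

```python
# Term-document incidence vectors, one bit per play:
# leftmost bit = "Antony and Cleopatra", rightmost bit = "Macbeth".
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "brutus":    0b110100,
    "caesar":    0b110111,
    "calpurnia": 0b010000,
}
ALL_DOCS = (1 << len(PLAYS)) - 1  # mask 111111, used to complement a vector

def matching_plays(bits):
    """Translate a result bit vector back into play titles."""
    n = len(PLAYS)
    return [play for i, play in enumerate(PLAYS) if (bits >> (n - 1 - i)) & 1]

# Brutus AND Caesar AND NOT Calpurnia
not_calpurnia = ALL_DOCS & ~INCIDENCE["calpurnia"]
result = INCIDENCE["brutus"] & INCIDENCE["caesar"] & not_calpurnia
print(format(result, "06b"), matching_plays(result))
```

The result vector 100100 selects the first and fourth plays, i.e. Antony and Cleopatra and Hamlet.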


Text Representation: Boolean model (III): Problems

• Very rigid: AND means all; OR means any
• Difficult to express complex user requests
• Difficult to control the number of documents retrieved
  – All matched documents will be returned
• Difficult to rank output
  – All matched documents logically satisfy the query
• Difficult to perform relevance feedback
  – If a document is identified by the user as relevant or irrelevant, how should the query be modified?


Text Representation: Boolean model (IV): Problems

• Overlap measure doesn’t consider:
  – term frequency in document
  – term scarcity in collection (document mention frequency)
  – length of documents


Text Representation: Vector Space model (I)

• Documents and queries are represented in an m-dimensional vector space
  – Index term set of the collection: V = {w1, …, wm}
  – Document: d = [d1, d2, …, dm]
    – binary or weighted components
  – Collection: D = {d1, d2, …, dN}
• The relevancy of a document to a query is computed as a proximity measure


Text Representation: Vector Space model (II)

• Desiderata for proximity
  – If d1 is near d2, then d2 is near d1
  – If d1 is near d2, and d2 near d3, then d1 is not far from d3
  – No document is closer to d than d itself
• Distance between vectors d1 and d2 is the length of the vector |d1 - d2|
  – Euclidean distance
• Why is this not a great idea?
• We still haven’t dealt with the issue of length normalization
  – Long documents would be more similar to each other by virtue of length, not topic
• However, we can implicitly normalize by looking at angles


Text Representation: Vector Space model (III)

• Cosine similarity
  – Distance between vectors d1 and d2 is captured by the cosine of the angle θ between them.
  – Note that this is similarity, not distance
  – The denominator involves the lengths of the vectors
  – So the cosine measure is also known as the normalized inner product

[Figure: document vectors d1 and d2 in the space of terms t1, t2, t3, separated by the angle θ]

\[
sim(d_j, d_k) = \frac{\mathbf{d}_j \cdot \mathbf{d}_k}{|\mathbf{d}_j|\,|\mathbf{d}_k|}
             = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}
\]
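
A minimal sketch of the cosine measure, assuming documents are given as plain lists of term weights over the same index terms. The three toy vectors are made up for illustration; they show that a document and a ten-times-longer copy of it are far apart in Euclidean terms yet have cosine similarity 1.

```python
import math

def cosine_similarity(d_j, d_k):
    """Normalized inner product of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(d_j, d_k))
    norm_j = math.sqrt(sum(a * a for a in d_j))
    norm_k = math.sqrt(sum(b * b for b in d_k))
    if norm_j == 0 or norm_k == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_j * norm_k)

def euclidean_distance(d_j, d_k):
    """Length of the difference vector |d_j - d_k|."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d_j, d_k)))

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same term proportions, ten times longer
other_doc = [0.0, 0.0, 3.0]    # shares no term with the other two

print(euclidean_distance(short_doc, long_doc))
print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```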


Text Representation: Vector Space model (IV)

• A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length
• This maps vectors onto the unit sphere (the unit circle in two dimensions):

\[
|\mathbf{d}_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} = 1
\]

• Then, longer documents don’t get more weight
• For normalized vectors, the cosine is simply the dot product:

\[
\cos(\mathbf{d}_j, \mathbf{d}_k) = \mathbf{d}_j \cdot \mathbf{d}_k
\]


Text Representation: Vector Space model (V): Advantages

• Allows simple and efficient implementation for large document collections

• Query becomes a vector in the same space as the documents

• Can consider both local (tf) and global (idf) word occurrence frequencies

• Provides partial matching and natural measure of scores/ranking – no longer Boolean

• Tends to work quite well in practice despite the simplifying assumptions


Text Representation: Vector Space model (VI): Problems

• Missing syntactic information (e.g., phrase structure, word order, proximity information)
• Missing semantic information (e.g., word sense)
• Assumption of term independence
  – “Bag-of-words” model
• Assumption that term vectors are pair-wise orthogonal
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
  – Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently







Text Representation: Storage of text representations (I)

• Desiderata for a data structure:
  – Ability to represent concepts and relationships
  – Ability to support the location of these concepts in the document collection
• Inverted index
  – For each term, stores the ids of all documents that are indexed by that term
  – The complete inverted index is
    – first represented by an array of indexed documents
    – then transposed (to obtain a term-document matrix)


Text Representation: Storage of text representations (II)

• Each document is parsed to extract words, and these are saved with the Document ID

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

(Term, Doc #) pairs in parse order:
  I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1,
  so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Text Representation: Storage of text representations (III)

• After all documents have been parsed, the inverted file is sorted by terms

(Term, Doc #) pairs sorted by term:
  ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2,
  julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2

Text Representation: Storage of text representations (IV)

• Multiple term entries in a single document are merged
• (Local) frequency information is added

  Term        Doc #   Freq
  ambitious     2       1
  be            2       1
  brutus        1       1
  brutus        2       1
  caesar        1       1
  caesar        2       2
  capitol       1       1
  did           1       1
  enact         1       1
  hath          2       1
  I             1       2
  i'            1       1
  it            2       1
  julius        1       1
  killed        1       2
  let           2       1
  me            1       1
  noble         2       1
  so            2       1
  the           1       1
  the           2       1
  told          2       1
  was           1       1
  was           2       1
  with          2       1
  you           2       1

Text Representation: Storage of text representations (V)

• The file is commonly split into a Dictionary (Term, N docs, Tot Freq) and a Postings file (Doc #, Freq); each dictionary entry points to its postings

  Dictionary                         Postings
  Term        N docs   Tot Freq      (Doc #, Freq)
  ambitious     1         1          (2, 1)
  be            1         1          (2, 1)
  brutus        2         2          (1, 1) (2, 1)
  caesar        2         3          (1, 1) (2, 2)
  capitol       1         1          (1, 1)
  did           1         1          (1, 1)
  enact         1         1          (1, 1)
  hath          1         1          (2, 1)
  I             1         2          (1, 2)
  i'            1         1          (1, 1)
  it            1         1          (2, 1)
  julius        1         1          (1, 1)
  killed        1         2          (1, 2)
  let           1         1          (2, 1)
  me            1         1          (1, 1)
  noble         1         1          (2, 1)
  so            1         1          (2, 1)
  the           2         2          (1, 1) (2, 1)
  told          1         1          (2, 1)
  was           2         2          (1, 1) (2, 1)
  with          1         1          (2, 1)
  you           1         1          (2, 1)
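
The construction steps (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched as follows. This is an illustrative sketch rather than a production indexer: the `tokenize` rule (lowercase everything, keep apostrophes) is an assumption, so the pronoun "I" is stored as "i" here, and the function names are hypothetical.

```python
import re
from collections import Counter

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Assumption: lowercase everything and keep apostrophes inside tokens.
    return re.findall(r"[a-z']+", text.lower())

def build_index(docs):
    """Return (dictionary, postings) built from a {doc_id: text} mapping."""
    postings = {}
    for doc_id, text in docs.items():
        # Counter merges repeated terms within a document and yields local frequencies.
        for term, freq in Counter(tokenize(text)).items():
            postings.setdefault(term, []).append((doc_id, freq))
    dictionary = {}
    for term in sorted(postings):
        plist = sorted(postings[term])          # postings ordered by document id
        postings[term] = plist
        # Dictionary entry: (number of documents, total frequency).
        dictionary[term] = (len(plist), sum(freq for _, freq in plist))
    return dictionary, postings

dictionary, postings = build_index(DOCS)
for term in ("brutus", "caesar", "killed"):
    print(term, dictionary[term], postings[term])
```

Keeping the dictionary small and separate from the postings means a term lookup touches only the compact structure before the (possibly long) postings list is read.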


Text Representation: Storage of text representations (VII)

• n-gram structure
  – breaks terms into smaller string units of n characters
  – allows searching for morphologically different terms
• Signature file
  – contains signatures (bit patterns) representing the index terms
  – documents are split into logical blocks, each containing a fixed number of index terms
    – Hashed word signatures in the same block are OR’ed together
    – Block signatures are then concatenated to create the document signature
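
Both structures can be sketched briefly. The hashing scheme below (three bits per word derived from an MD5 digest, 16-bit signatures, blocks of three terms) is an assumption made for the example, not a scheme taken from the slides, and all function names are hypothetical.

```python
import hashlib

def char_ngrams(term, n=3):
    """Break a term into overlapping substrings of n characters."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def word_signature(term, width=16, bits_set=3):
    """Hash a term to a bit pattern with at most bits_set bits set (assumed scheme)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    signature = 0
    for byte in digest[:bits_set]:
        signature |= 1 << (byte % width)
    return signature

def block_signature(terms, width=16):
    """OR together the word signatures of the terms in one logical block."""
    signature = 0
    for term in terms:
        signature |= word_signature(term, width)
    return signature

def document_signature(terms, block_size=3, width=16):
    """Concatenate (here: list) the signatures of consecutive fixed-size blocks."""
    blocks = [terms[i:i + block_size] for i in range(0, len(terms), block_size)]
    return [block_signature(block, width) for block in blocks]

def may_contain(block_sig, term, width=16):
    """False means the term is definitely absent; True may be a false positive."""
    sig = word_signature(term, width)
    return (block_sig & sig) == sig

print(char_ngrams("mining"))
doc_sig = document_signature(["brutus", "killed", "me", "caesar", "was", "ambitious"])
print(any(may_contain(block, "caesar") for block in doc_sig))
```

Morphologically related terms such as "compress" and "compression" share most of their n-grams, which is what makes n-gram lookup tolerant of word-form variation; signature matching, by contrast, can report false positives but never misses a term that is present.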







Text Representation: Index term Identification (I)

• What terms in a document do we index?
  – All words or only “important” ones?
  – A few words are very common
    – the 2 most frequent words (e.g., “the”, “of”) can account for about 10% of word occurrences
  – Most words are very rare
    – Half the words in a corpus appear only once; these are called hapax legomena (Greek for “read only once”)
• Search for terms that capture text semantics
  – avoiding intensive manual processing (hand-coding)


Text Representation: Index term Identification (II)

• Involves:
  – Feature definition
    – Text is usually represented as a “bag of words”, i.e. a collection of independent concepts
      • concepts are words, word stems, word phrases
      • concept weights can be binary or frequency-based
    – Paragraph, sentence, word order is disrupted
    – Syntactic structure is broken
  – Feature selection and extraction
    – reduce concept space dimensionality


Text Representation: Index term Identification (III)

• Feature selection and extraction
  – Lexical and Morphological analysis
    – processing of punctuation, numbers, case folding, etc.
    – removal of stopwords
    – stemming, lemmatization
    – part-of-speech tagging
  – Discourse Semantics analysis: Anaphora
    – literal anaphor/pronominal anaphor
      • “This notebook weighs even less than its predecessor”
    – textual ellipsis
    – referential meronymy
  – Pragmatics
    – assertive, commissive, directive, declarative, expressive, interrogative sentences


Text Representation: Selection of natural language index terms (I)

Sample text: Cooper’s concordance of Wordsworth was published in 1911. The applications of full-text retrieval are legion: they include résumé scanning, litigation support and searching published journals on-line.

Possible index terms:
  Cooper’s vs. Cooper vs. Coopers
  Full-text vs. full text vs. {full, text} vs. fulltext
  résumé vs. resume

• Punctuation
  – Ne’er: use language-specific, handcrafted “locale” to normalize
  – State-of-the-art: break up hyphenated sequence
  – U.S.A. vs. USA: use locale
• Numbers
  – e.g. 3/12/91, Mar. 12, 1991, 55 B.C., B-52, 100.2.86.144
  – Generally, don’t index as text
  – Creation dates for docs
• Case folding
  – Reduce all letters to lower case
    – exception: upper case in mid-sentence
    – e.g. General Motors, Fed vs. fed, SAIL vs. sail


Text Representation: Selection of natural language index terms (II)

• Spell correction
  – Look for all words within (maximum) edit distance k (Insert/Delete/Replace) at query time
    – Data Minino → Data Mining (edit distance: 1)
    – eterogeneiti → heterogeneity (edit distance: 2)
  – Expensive: slows the query (by up to a factor of 100)
    – Invoke only when the index returns zero matches
  – What if documents contain misspellings?
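
The edit distance over Insert/Delete/Replace operations can be computed with the standard dynamic-programming recurrence. The sketch below is one straightforward way to do it; the `corrections` helper and the small vocabulary are hypothetical and only illustrate the "all words within distance k" lookup.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and replacements turning source into target."""
    previous = list(range(len(target) + 1))        # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]

def corrections(word, vocabulary, k):
    """All vocabulary words within edit distance k of word (brute-force scan)."""
    return [v for v in vocabulary if edit_distance(word, v) <= k]

print(edit_distance("Data Minino", "Data Mining"))
print(edit_distance("eterogeneiti", "heterogeneity"))
print(corrections("minino", ["mining", "minion", "mineral"], k=1))
```

The brute-force scan over the vocabulary is exactly why this step is expensive at query time and is best reserved for queries that return no matches.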


Text Representation: Selection of natural language index terms (III)

• Removal of Stopwords
  – Terms that are so common that they’re ignored for indexing
  – Function words that serve grammatical purposes and don’t refer to objects or concepts
    – e.g., the, a, an, of, to …
  – language-specific


Text Representation: Selection of natural language index terms (IV)

• Stemming
  – Reduce terms to their “roots” before indexing
    – mainly eliminate plurals, tenses, gerund forms, prefixes and suffixes
    – automate(s), automatic, automation all reduced to automat
  – Language-dependent

Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed:  for exampl compres and compres are both accept as equival to compres.


Text Representation: Selection of natural language index terms (V)

• Porter’s stemmer algorithm
  – Most common algorithm for stemming English
  – Conventions + 5 phases of reductions
    – phases applied sequentially
    – each phase consists of a set of commands
    – sample convention: of the rules in a compound command, select the one that applies to the longest suffix
    – sample rule: if word ends with “ation” replace with “ate”
  – Sample rules:
    – sses → ss
    – ies → i
    – ational → ate
    – tional → tion
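
The "longest suffix wins" convention can be illustrated with the four sample rules alone. This is deliberately not the Porter stemmer: the real algorithm spreads its rules over several phases and attaches conditions to them, none of which is modelled here, and the function name is hypothetical.

```python
# The four sample rules only; this is NOT the full Porter stemmer.
SAMPLE_RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest_rule(word, rules=SAMPLE_RULES):
    """Apply the rule whose suffix is the longest one matching the end of word."""
    matching = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matching:
        return word                      # no rule applies: leave the word unchanged
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[:-len(suffix)] + repl

for word in ("caresses", "ponies", "relational", "conditional", "mining"):
    print(word, "->", apply_longest_rule(word))
```

"relational" ends with both "ational" and "tional"; selecting the longer suffix yields "relate" rather than the incorrect "relation"-style reduction.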


Text Representation: Selection of natural language index terms (VI)

• Lemmatization
  – Reduce inflectional/variant forms to base form
    – am, are, is → be
    – car, cars, car's, cars' → car
  – the boy's cars are different colors → the boy car be different color


Text Representation: Selection of natural language index terms (VII)

• Part-of-speech Tagging
  – Labelling each word in a sentence with its proper grammatical category
    The representative put chairs on the table →
    The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN
    (AT = determiner/article, NN = noun singular, VBD = verb past tense, NNS = noun plural, IN = preposition)
  – There are several tag sets, differing mainly in granularity and complexity:
    – Brown tag set, Penn Treebank tag set
  – and several approaches:
    – Markov models, Transformation-based learning, Decision trees


Text Representation: Selection of natural language index terms (VIII)

• Part-of-speech Tagging
  – (some) Public taggers
    – Eric Brill’s tagger: http://www.cs.jhu.edu/~brill/ (Perl and C implementations)
    – TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
    – MontyTagger: http://web.media.mit.edu/~hugo/montytagger/ (Python and Java implementations)
    – QTAG: http://web.bham.ac.uk/O.Mason/software/tagger/ (Java implementation)
  – Makes sense as an intermediate task for others
    – e.g. with shallow parsing for
      – creating linguistically motivated index terms
      – detecting slot-filler candidates in Information Extraction
      – detecting answer candidates in Question Answering


Text Representation: Assignment of controlled language index terms

• Thesaurus
  – Generalizes terms that have related meaning, but unrelated surface forms, into more uniform index terms
  – Puts words that are synonyms and are intersubstitutable into equivalence classes
• Words may have many senses: polysemous words
  – Word Sense Disambiguation techniques are needed
• Index such equivalences, or expand query?







Text Representation: Index Term Weighting (I)

• Distribution patterns of words give significant information about the property of being content bearing
• Zipf’s law
  – Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f)
  – Zipf (1949) “discovered” that:

\[
f \propto \frac{1}{r}
\]

  – If the probability of the word of rank r is p_r and N is the total number of word occurrences:

\[
p_r = \frac{f}{N} = \frac{A}{r}, \qquad A \approx 0.1
\]


Text Representation: Index Term Weighting (II)

• Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing

• Most discriminative concepts have low to medium frequency


Text Representation: Index Term Weighting (III)

• Need for considering the frequency of a word in a document
• Weighting by term frequency (tf)
  – It still doesn’t consider:
    – Term scarcity in collection (document mention frequency)
    – Length of documents and queries (not normalized)
• Weighting should depend on the term overall
  – Suggests looking at collection frequency (cf)
  – but document frequency (df) may be better


Text Representation: Index Term Weighting (IV)

• tf x idf measure combines:
  – term frequency (tf)
    – measure of term density in a document
  – inverse document frequency (idf)
    – measure of informativeness of a term: its rarity across the whole corpus
    – could simply be the inverse of the raw number of documents the term occurs in (idf_i = 1/df_i)
    – but by far the most commonly used version is:

\[
idf_i = \log(n / df_i)
\]


Text Representation: Index Term Weighting (V)

• Assign a tf.idf weight to each term i in each document d
  – increases with the number of occurrences within a document
  – increases with the rarity of the term across the whole corpus

\[
w_{i,d} = tf_{i,d} \times \log(n / df_i)
\]

where
  tf_{i,d} = frequency of term i in document d
  n = total number of documents
  df_i = number of documents that contain term i
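
A small sketch of the weighting w_{i,d} = tf_{i,d} × log(n / df_i), assuming each document has already been reduced to a list of index terms. The toy collection is made up, and the natural logarithm is an assumption, since the base of the log is not stated.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {doc_id: list of terms}. Returns {doc_id: {term: weight}}."""
    n = len(docs)                       # total number of documents
    df = Counter()                      # df[t] = number of documents containing t
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)             # tf[t] = frequency of t in this document
        # Natural log is an assumption; the base only rescales all weights.
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

docs = {
    "d1": ["caesar", "brutus", "caesar"],
    "d2": ["caesar", "noble"],
    "d3": ["mercy", "noble", "mercy"],
}
weights = tf_idf(docs)
for doc_id, w in weights.items():
    print(doc_id, {t: round(x, 3) for t, x in w.items()})
```

In d1, "brutus" occurs once but appears in only one document, so it outweighs "caesar", which occurs twice but is spread over two of the three documents. A term that occurs in every document would get weight 0.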


Text Representation: Index Term Weighting (VI)

• Length normalization
  – Documents have different sizes
  – Long and verbose texts usually
    – use the same terms repeatedly
    – have numerous different terms
  – Variations in length can be normalized to compensate for the effect that
    – the tf factors are large for long texts and small for short ones, obscuring the real term importance

Normalized term frequency (by the maximum tf in the document):

\[
\frac{tf_{i,d}}{\max_j tf_{j,d}}
\]

Normalized term frequency (by the length of the tf vector):

\[
\frac{tf_{i,d}}{\sqrt{\sum_j (tf_{j,d})^{2}}}
\]

Normalized tf.idf weight:

\[
\frac{tf_{i,d} \cdot idf_i}{\sqrt{\sum_j (tf_{j,d} \cdot idf_j)^{2}}}
\]
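
The two kinds of length normalization (dividing by the maximum term frequency, and dividing by the Euclidean length of the weight vector) can be sketched as below. The function names and sample frequencies are illustrative; `cosine_normalize` works equally on raw tf values or on tf.idf weights.

```python
import math

def max_tf_normalize(tf):
    """Divide each term frequency by the largest frequency in the document."""
    peak = max(tf.values())
    return {t: f / peak for t, f in tf.items()}

def cosine_normalize(weights):
    """Divide each weight by the Euclidean length of the document's weight vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)            # all-zero vector: nothing to rescale
    return {t: w / length for t, w in weights.items()}

short_tf = {"caesar": 4, "brutus": 2, "noble": 1}
long_tf = {t: 10 * f for t, f in short_tf.items()}   # same text repeated ten times

print(max_tf_normalize(short_tf))
print(max_tf_normalize(long_tf))
print(cosine_normalize({"caesar": 3.0, "brutus": 4.0}))
```

After either normalization the short document and its ten-fold repetition receive the same weights, which is the effect the normalization is meant to achieve.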


Tutorial Outline

3. Applications


Applications (I)

• The main application areas cover two aspects:
  – Knowledge discovery
    – mining proper
  – Information distillation
    – mining on the basis of some pre-established document structure, to identify documents relevant to a target information need
• Typical usage:
  – Extract relevant information from documents
  – Classify and manage documents according to their content
  – Organize repositories of document-related meta-information for search and retrieval


Applications (II)

• Text summarization
• Word sense disambiguation
• Hierarchical categorization of Web pages
• Text filtering
  – CRM & marketing (e.g., cross-selling, recommendation)
    – Product recommendation
  – Information delivery at organizations for Knowledge Management
    – Personalizing information access
  – Filtering news items in Usenet newsgroups
  – Detecting spam messages


Applications: Text Summarization

• Generate a summary of a text’s content
  – short text: essential and coherent
  – Use profiles to structure the important content in semantically well-defined fields
• Mostly applied to ease information access, e.g.
  – most useful keywords are extracted from a set of documents (e.g., a cluster) to describe it
  – documents in a collection are abstracted to avoid reading the full content
  – documents retrieved from search are summarized to allow the user a faster identification of those relevant to the query
• High-level summary or survey of all main points?
• Approaches based on the size of the text unit used in the summary
  – Keyword summaries
  – Sentence summaries


Applications: Word Sense Disambiguation

• Assign a word the right sense with respect to the context in which the word appears
• An effective approach:
  – Choosing word meanings from an existing sense inventory by exploiting measures of semantic relatedness
• WSD is an example of the more general issue of resolving natural language ambiguities
• For instance:
  – “bank” may have (at least) two senses in English:
    – “the Bank of England” (a financial institution)
    – “the bank of river Thames” (a hydraulic engineering artefact)
  – which of the above senses does the occurrence of “bank” have in “last week I borrowed some money from the bank”?


Applications: Hierarchical Categorization of Web pages

• Under hierarchical catalogues (hosted by popular Web portals), a searcher may
  – first navigate in the hierarchy of categories
  – and then restrict his/her search to a particular category of interest
• Category-pivoted categorization should allow new categories to be added and obsolete ones to be deleted
• Peculiarities:
  – Hypertextual nature of the documents
    – Hyperlink analysis
  – Hierarchical structure of the category set
    – Decomposing the classification as a branching decision at an internal node


Applications: Text Filtering (I)

• Classify a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
  – typical case: a newsfeed (producer: news agency, consumer: newspaper)
• Desiderata of a filtering system
  – should block the delivery of the documents the consumer is likely not interested in
    – filtering can be seen as a case of single-label Text Categorization (TC)
  – may be installed at the producer end
    – to route the documents to the interested consumers only
    – builds and updates a “profile” for each consumer


Applications: Text Filtering (II)

  – or at the consumer end
    – to block the delivery of documents deemed uninteresting
    – a single “profile” is needed
• Adaptive filtering
  – a profile is initially specified by the user
  – and is updated by using feedback information provided by the user on the relevance of the delivered messages


Applications: Customer Relationship Management

• Incorporates both the distillation and discovery aspects of TM
• Designed specifically to help companies better understand what their customers want and what they think about the company itself
• Method:
  1. Select a suitable set of documents and convert them to a common standard format
  2. Extract relevant features and derive a database of documents which are grouped according to the similarity of their content, by exploiting clustering techniques
  3. Use categorization tools to assign new incoming customer feedback to the identified categories


Applications: Product Recommendation

• Content-based
  – According to a personal profile accounting for
    – a set of categories (DVD, computer games, music, etc.) and subcategories (genres)
  – Starting with preferred items
    – authors, titles, brands
  – Recommendation of new releases
    – of course, this is based not on text content but on the purchasing history
• Collaborative or social
  – According to other customers' purchases
    “Customers who bought this book also bought…”
  – Based on
    – previous annotations by other users
    – and generating a user segmentation
• A trend is combining both ideas

• A trend is combining both ideas


Applications: Detecting Spam

• Spam email is, more properly, unsolicited bulk email
• It has caused considerable damage to
  – Internet Service Providers
  – Internet users (connection costs)
  – and the whole Internet backbone
• Spam detection is a Text Categorization problem
  – Two classes: spam and legitimate email
  – It is relatively easy to
    – Represent messages as vectors of concept weights
    – Perform some feature selection
    – Learn a classifier
  – but evaluation is not so simple, because it is a problem in which misclassification costs and class distribution are not symmetric
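
To make the asymmetry concrete, the sketch below scores two hypothetical filters with a cost-weighted error rate in which blocking a legitimate message is penalized more than letting a spam message through. The cost values (9 and 1), the labels, and the predictions are all assumptions chosen for illustration, not figures from the tutorial.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of messages classified correctly."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

def weighted_error(true_labels, predicted_labels, fp_cost=9.0, fn_cost=1.0):
    """Cost-weighted error rate; fp_cost and fn_cost are assumed values.

    fp_cost: cost of blocking a legitimate message (false positive).
    fn_cost: cost of delivering a spam message (false negative).
    """
    total_cost = 0.0
    worst_cost = 0.0                     # cost if every message were misclassified
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == "legitimate":
            worst_cost += fp_cost
            if predicted == "spam":
                total_cost += fp_cost
        else:
            worst_cost += fn_cost
            if predicted == "legitimate":
                total_cost += fn_cost
    return total_cost / worst_cost if worst_cost else 0.0

truth = ["legitimate"] * 8 + ["spam"] * 2
blocks_one_legit = ["spam"] + ["legitimate"] * 7 + ["spam"] * 2
misses_one_spam = ["legitimate"] * 8 + ["legitimate", "spam"]

for name, predicted in (("blocks one legitimate", blocks_one_legit),
                        ("misses one spam", misses_one_spam)):
    print(name, accuracy(truth, predicted), round(weighted_error(truth, predicted), 4))
```

Both filters make exactly one mistake and therefore have the same accuracy, yet their cost-weighted errors differ by the ratio of the two costs, which is why plain accuracy is a poor yardstick here.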
