Text Mining - An Overview (Workshop DM-DW, 21Ott04) - non...

A. Tagarelli: Text Mining - An Overview. UNICAL, 21/10/2004

TEXT MINING
An Overview of Concepts, Techniques and Applications

Ing. Andrea Tagarelli

Workshop Data Warehousing and Data Mining

Tutorial goals

• Introduce you to the major aspects of the Knowledge Discovery process when the available data is textual and unstructured
• Provide a systematization of the many concepts in this area, along the following lines:
  – the process
  – the methods
  – the applications
• Important issues that will not be covered in this tutorial:
  – The problem of dimensionality reduction
  – Text categorization
    – Learning techniques
    – Evaluation
  – Visualization techniques
  – Tools for Text Mining
  – etc.

Tutorial Outline

1. Introduction
  – Motivations
  – Basic concepts in Knowledge Discovery from textual data
2. Deeper into Text Mining: Text Representation
  – Functions
  – Models
  – Storage techniques
  – Index term identification
  – Index term weighting
3. Applications

The reason for Text Mining

[Figure: chart labelled "Amount of information", with a percentage scale from 0 to 100, comparing collections of text with structured data]

Text sources

• A non-exhaustive list:
  – Web pages
  – E-books
  – Email
  – News articles
  – Insurance claims
  – Patent portfolios
  – IRC
  – Technical documents
  – Scientific articles
  – Customer complaint letters
  – Contracts
  – Transcripts of phone calls with customers

Problems with textual data (I)

• The known KDD problems and challenges extend to textual data
  – large (textual) data collections
  – high dimensionality
  – overfitting
  – changing data and knowledge
  – noisy data
  – understandability of mined patterns
  – etc.

Problems with textual data (II)

• But there are new problems
  – Text is not designed to be used by computers
  – Complex and poorly defined structure and semantics
  – Much harder still: ambiguity
    – in speech, morphology, syntax, semantics, pragmatics
  – Multilingualism
    – lack of reliable and general translation tools

What is Text Mining?

• People’s first thought:
  – Make it easier to find things on the Web.
  – But this is information retrieval!
• The foundation of most commercial “text mining” products is all this “improperly named” stuff
  – Information retrieval engine
  – Web spider/search
  – Text classification
  – Information extraction (only sometimes)
• The metaphor of extracting ore from rock:
  – Does make sense for extracting documents of interest from a huge pile.
  – But does not reflect notions of Data Mining in practice. Rather:
    – finding patterns across large collections
    – discovering heretofore unknown information

Text Mining: definitions

• Text mining is mainly about extracting information and knowledge from text
• 2 definitions:
  – Any operation related to gathering and analyzing text from external sources for business intelligence purposes
  – Discovery of knowledge in text that was previously unknown to the user
• Text mining is the process of:
  compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers and to discover relationships between related facts that span wide domains of inquiry.

Text Mining: contributing areas

[Figure: diagram relating Text Mining to its contributing areas: Data Mining, Information Extraction, Information Retrieval, Natural Language Processing]


Text Representation
Major Functions

• Indicative
  – reveals elements of the content, upon which the relevancy of the complete original text can be decided
  – useful for
    – Document browsing systems
    – Document retrieval systems
• Informative
  – represents a real substitute of the content of the full text (or part of it), without references to its original text
  – useful for
    – Question-answering systems
    – Document retrieval systems

Text Representation
Browsing systems

• Usually part of hypertext and hypermedia systems
• Allow users to skim text collections in the search for valuable information
• Users need not
  – generate descriptions of what they want or
  – specify in advance the topics of interest
  but can just indicate documents they find relevant
• Useful when a user
  – has no clear need
  – cannot express the need accurately
  – or is a casual user of the information

Text Representation
Question-answering systems

• Also known as Information Extraction systems
• Retrieve specific information from the documents, by extracting or inferring answers from the text representation
• Template types:
  – Slots in a template are typically filled by a substring from the document
  – Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself
  – Some slots may allow multiple fillers
  – Slots are assumed to be always in a fixed order
• Extraction patterns:
  – Specify an item to extract for a slot, e.g. using a regular expression pattern
  – May require a preceding (pre-filler) pattern to identify the proper context, and a succeeding (post-filler) pattern to identify the end of the filler
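The extraction-pattern idea can be sketched with regular expressions. This is only an illustration under stated assumptions, not the method of any particular system: the slot names (`position`, `salary`), the sample text and the pre-filler/post-filler patterns are all hypothetical.

```python
import re

# Hypothetical job-posting text; slots and patterns are illustrative only.
text = "Position: Senior Analyst. Salary: 45000 EUR per year."

# Each slot has a pre-filler (context before the item), a filler pattern
# (the item to extract) and a post-filler (context that ends the filler).
patterns = {
    "position": re.compile(r"Position:\s*(?P<filler>[A-Za-z ]+?)\."),
    "salary": re.compile(r"Salary:\s*(?P<filler>\d+)\s*EUR"),
}

def fill_template(text, patterns):
    """Return a dict mapping each slot to the extracted substring (or None)."""
    template = {}
    for slot, pattern in patterns.items():
        match = pattern.search(text)
        template[slot] = match.group("filler") if match else None
    return template

template = fill_template(text, patterns)
```

Here each slot is filled by a substring of the document; a slot with a fixed set of pre-specified fillers would instead map the matched text to one of those values.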

Text Representation
Information Retrieval systems (I)

• Select documents from a collection in response to a user’s query
  – Search request is formulated in natural language
• Rank these documents according to their relevance to the query
  – Matching between document representation and query representation
• Return a list of possibly relevant texts, the representations of which best match the request representation

Text Representation
Information Retrieval systems (II)

• Retrieval models
  – Boolean model
  – Vector space model
  – Probabilistic model
  – Network model
  – Logic-based model
• They differ with respect to
  – representation of textual contents,
  – representation of information needs,
  – and their matching


Text Representation
Boolean model (I)

• Compares the Boolean query statement with the term sets used to identify the textual content (index terms)
• Query has the form of an expression containing
  – index terms
  – Boolean operators (AND, OR and NOT) defined upon the terms
• A document either matches the condition or it does not
• This model is employed in many commercial systems
  – Professional searchers still like Boolean queries: you know exactly what you’re getting

Text Representation
Boolean model (II)

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• Term-document incidence matrix (1 if the play contains the word, 0 otherwise):

            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony          1          1        0         0        0         1
Brutus          1          1        0         1        0         0
Caesar          1          1        0         1        1         1
Calpurnia       0          1        0         0        0         0
Cleopatra       1          0        0         0        0         0
mercy           1          0        1         1        1         1
worser          1          0        1         1        1         0

• To answer the query:
  – Idea: query satisfaction = overlap measure
  – take the vectors for Brutus, Caesar and Calpurnia (complemented) and combine them with a bitwise AND
  – 110100 AND 110111 AND 101111 = 100100, i.e. Antony and Cleopatra and Hamlet
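The bitwise evaluation of this query can be reproduced in a few lines. The sketch below assumes each matrix row is stored as an integer whose leftmost bit corresponds to the first play; that encoding is a choice made here for illustration.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the term-document incidence matrix, one bit per play
# (leftmost bit = first play).
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

n = len(plays)
mask = (1 << n) - 1  # keeps the complement within n bits

# Brutus AND Caesar AND NOT Calpurnia
result = (incidence["Brutus"]
          & incidence["Caesar"]
          & (~incidence["Calpurnia"] & mask))

# Decode the result vector back into play titles.
matches = [play for i, play in enumerate(plays)
           if (result >> (n - 1 - i)) & 1]
```

The mask is needed because Python integers are unbounded, so a plain `~` would set every higher bit as well.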

Text Representation
Boolean model (III): Problems

• Very rigid: AND means all; OR means any
• Difficult to express complex user requests
• Difficult to control the number of documents retrieved
  – All matched documents will be returned
• Difficult to rank output
  – All matched documents logically satisfy the query
• Difficult to perform relevance feedback
  – If a document is identified by the user as relevant or irrelevant, how should the query be modified?

Text Representation
Boolean model (IV): Problems

• Overlap measure doesn’t consider:
  – term frequency in document
  – term scarcity in collection (document mention frequency)
  – length of documents

Text Representation
Vector Space model (I)

• Documents and queries are represented in an m-dimensional vector space
  – Index term set of the collection: V = {w1, …, wm}
  – Document: d = [d1, d2, …, dm]
    – binary or weighted components
  – Collection: D = {d1, d2, …, dN}
• The relevancy of a document to a query is computed as a proximity measure

Text Representation
Vector Space model (II)

• Desiderata for proximity
  – If d1 is near d2, then d2 is near d1
  – If d1 is near d2, and d2 near d3, then d1 is not far from d3
  – No document is closer to d than d itself
• Distance between vectors d1 and d2 is the length of the vector |d1 - d2|
  – Euclidean distance
• Why is this not a great idea?
• We still haven’t dealt with the issue of length normalization
  – Long documents would be more similar to each other by virtue of length, not topic
• However, we can implicitly normalize by looking at angles

Text Representation
Vector Space model (III)

• Cosine similarity
  – The closeness of vectors d1 and d2 is captured by the cosine of the angle θ between them
  – Note that this is similarity, not distance

$$
\mathrm{sim}(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j|\,|d_k|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\ \sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}
$$

  – The denominator involves the lengths of the vectors
  – So the cosine measure is also known as the normalized inner product

[Figure: document vectors d1 and d2 in the space of terms t1, t2, t3, separated by the angle θ]
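A minimal sketch of the cosine measure on plain Python lists. The example vectors and the choice to return 0.0 for an all-zero vector are assumptions made here, not part of the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, twice the length
d3 = [0.0, 0.0, 3.0]   # shares no terms with d1

sim_same_direction = cosine_similarity(d1, d2)
sim_orthogonal = cosine_similarity(d1, d3)
euclidean_d1_d2 = math.dist(d1, d2)
```

The pair d1, d2 illustrates the length issue: their Euclidean distance is non-zero although they have identical term proportions, while their cosine is 1 up to floating-point rounding.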

Text Representation
Vector Space model (IV)

• A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length
• This maps vectors onto the unit sphere (the unit circle in two dimensions):

$$
|d_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} = 1
$$

• Then, longer documents don’t get more weight
• For normalized vectors, the cosine is simply the dot product:

$$
\cos(d_j, d_k) = d_j \cdot d_k
$$
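The normalization step can be checked numerically. In this sketch the two document vectors are made-up values with the same term proportions but different lengths.

```python
import math

def normalize(v):
    """Scale v to unit length; an all-zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

short_doc = [3.0, 4.0, 0.0]
long_doc = [30.0, 40.0, 0.0]   # same term proportions, ten times longer

u = normalize(short_doc)
w = normalize(long_doc)
cosine = dot(u, w)             # for unit vectors the dot product is the cosine
```

Normalizing once at indexing time means that each query-document comparison afterwards costs only a dot product.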

Text Representation
Vector Space model (V): Advantages

• Allows simple and efficient implementation for large document collections

• Query becomes a vector in the same space as the documents

• Can consider both local (tf) and global (idf) word occurrence frequencies

• Provides partial matching and natural measure of scores/ranking – no longer Boolean

• Tends to work quite well in practice despite the simplifying assumptions

Text Representation
Vector Space model (VI): Problems

• Missing syntactic information (e.g., phrase structure, word order, proximity information)
• Missing semantic information (e.g., word sense)
• Assumption of term independence
  – “Bag-of-words” model
• Assumption that term vectors are pair-wise orthogonal
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
  – Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently


Text Representation
Storage of text representations (I)

• Desiderata for a data structure:
  – Ability to represent concepts and relationships
  – Ability to support the location of these concepts in the document collection
• Inverted index
  – For each term, stores the ids of all documents that are indexed by that term
  – The complete inverted index is
    – first represented by an array of indexed documents
    – then transposed (to obtain a term-document matrix)

Text Representation
Storage of text representations (II)

• Each document is parsed to extract words, and these are saved with the Document ID

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term      Doc #     Term       Doc #
I         1         so         2
did       1         let        2
enact     1         it         2
julius    1         be         2
caesar    1         with       2
I         1         caesar     2
was       1         the        2
killed    1         noble      2
i'        1         brutus     2
the       1         hath       2
capitol   1         told       2
brutus    1         you        2
killed    1         caesar     2
me        1         was        2
                    ambitious  2

Text Representation
Storage of text representations (III)

• After all documents have been parsed, the inverted file is sorted by terms

Term       Doc #     Term     Doc #
ambitious  2         julius   1
be         2         killed   1
brutus     1         killed   1
brutus     2         let      2
caesar     1         me       1
caesar     2         noble    2
caesar     2         so       2
capitol    1         the      1
did        1         the      2
enact      1         told     2
hath       2         was      1
I          1         was      2
I          1         with     2
i'         1         you      2
it         2

Text Representation
Storage of text representations (IV)

• Multiple term entries in a single document are merged
• (Local) frequency information is added

Term       Doc #  Freq     Term     Doc #  Freq
ambitious  2      1        julius   1      1
be         2      1        killed   1      2
brutus     1      1        let      2      1
brutus     2      1        me       1      1
caesar     1      1        noble    2      1
caesar     2      2        so       2      1
capitol    1      1        the      1      1
did        1      1        the      2      1
enact      1      1        told     2      1
hath       2      1        was      1      1
I          1      2        was      2      1
i'         1      1        with     2      1
it         2      1        you      2      1

Text Representation
Storage of text representations (V)

• The file is commonly split into a Dictionary and a Postings file

Dictionary                       Postings
Term       N docs  Tot Freq      (Doc #, Freq)
ambitious  1       1             (2, 1)
be         1       1             (2, 1)
brutus     2       2             (1, 1) (2, 1)
caesar     2       3             (1, 1) (2, 2)
capitol    1       1             (1, 1)
did        1       1             (1, 1)
enact      1       1             (1, 1)
hath       1       1             (2, 1)
I          1       2             (1, 2)
i'         1       1             (1, 1)
it         1       1             (2, 1)
julius     1       1             (1, 1)
killed     1       2             (1, 2)
let        1       1             (2, 1)
me         1       1             (1, 1)
noble      1       1             (2, 1)
so         1       1             (2, 1)
the        2       2             (1, 1) (2, 1)
told       1       1             (2, 1)
was        2       2             (1, 1) (2, 1)
with       1       1             (2, 1)
you        1       1             (2, 1)
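The construction steps for the two example documents (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched as follows. The tokenizer is an assumption made here: it lower-cases everything, so the slide's "I" appears as "i", and it keeps apostrophes inside tokens.

```python
import re
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Simplification: lower-case everything and keep apostrophes inside tokens.
    return re.findall(r"[a-z']+", text.lower())

# Parse each document and merge repeated terms per document into
# (term, doc id) -> local frequency.
freq = Counter((term, doc_id)
               for doc_id, text in docs.items()
               for term in tokenize(text))

# Sort by term, then split into postings (term -> [(doc id, freq), ...])
# and a dictionary (term -> (number of docs, total frequency)).
postings = {}
for (term, doc_id), f in sorted(freq.items()):
    postings.setdefault(term, []).append((doc_id, f))

dictionary = {term: (len(plist), sum(f for _, f in plist))
              for term, plist in postings.items()}
```

Looking up a query term is then a dictionary access followed by a scan of its postings list, rather than a scan of every document.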

Text Representation
Storage of text representations (VII)

• n-gram structure
  – breaks terms into smaller string units of n characters
  – allows searching for morphologically different terms
• Signature file
  – contains signatures (bit patterns) representing the index terms
  – documents are split into logical blocks, each containing a fixed number of index terms
    – Hashed word signatures in the same block are OR’ed together
    – Block signatures are then concatenated to create the document signature
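As a small illustration of the n-gram structure, the sketch below breaks terms into character trigrams and compares two terms by the overlap of their trigram sets. The Jaccard overlap used for the comparison is an assumption introduced here; the slides do not name a specific measure.

```python
def char_ngrams(term, n=3):
    """All overlapping character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the two terms' n-gram sets (0.0 to 1.0)."""
    ga, gb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

trigrams = char_ngrams("mining")
related = ngram_overlap("compressed", "compression")
unrelated = ngram_overlap("compressed", "mining")
```

Morphological variants such as "compressed" and "compression" share most of their trigrams, which is what lets an n-gram index match them without a stemmer.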


Text Representation
Index term Identification (I)

• What terms in a document do we index?
  – All words or only “important” ones?
  – A few words are very common
    – the 2 most frequent words (e.g., “the”, “of”) can account for about 10% of word occurrences
  – Most words are very rare
    – about half the distinct words in a corpus appear only once; these are called hapax legomena (Greek for “said only once”)
• Search for terms that capture text semantics
  – avoiding intensive manual processing (hand-coding)
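Word-frequency statistics of this kind are easy to compute. The sketch below uses a tiny made-up corpus, so its proportions are not representative of the figures quoted above; it only shows how the counts are obtained.

```python
import re
from collections import Counter

# Hypothetical toy corpus; real corpora are far larger.
corpus = ("the noble brutus hath told you caesar was ambitious "
          "the evil that men do lives after them the good is oft interred")

counts = Counter(re.findall(r"[a-z']+", corpus.lower()))
total = sum(counts.values())

# Share of all word occurrences taken by the two most frequent words.
top_two = counts.most_common(2)
top_two_share = sum(c for _, c in top_two) / total

# Words that occur exactly once (hapax legomena) and their share of the vocabulary.
hapax = sorted(w for w, c in counts.items() if c == 1)
hapax_share_of_vocabulary = len(hapax) / len(counts)
```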

Text Representation
Index term Identification (II)

• Involves:
  – Feature definition
    – Text is usually represented as a “bag of words”, i.e. a collection of independent concepts
      • concepts are words, word stems, word phrases
      • concept weights can be binary or frequency-based
    – Paragraph, sentence and word order is disrupted
    – Syntactic structure is broken
  – Feature selection and extraction
    – reduce concept space dimensionality

Text Representation
Index term Identification (III)

• Feature selection and extraction
  – Lexical and morphological analysis
    – processing of punctuation, numbers, case folding, etc.
    – removal of stopwords
    – stemming, lemmatization
    – part-of-speech tagging
  – Discourse semantics analysis: anaphora
    – literal anaphor/pronominal anaphor
      • “This notebook weighs even less than its predecessor”
    – textual ellipsis
    – referential meronymy
  – Pragmatics
    – assertive, commissive, directive, declarative, expressive, interrogative sentences

Text Representation
Selection of natural language index terms (I)

Example text: Cooper’s concordance of Wordsworth was published in 1911. The applications of full-text retrieval are legion: they include résumé scanning, litigation support and searching published journals on-line.

Variants to reconcile:
  – Cooper’s vs. Cooper vs. Coopers
  – Full-text vs. full text vs. {full, text} vs. fulltext
  – résumé vs. resume

• Punctuation
  – Ne’er: use a language-specific, handcrafted “locale” to normalize
  – State-of-the-art: break up the hyphenated sequence
  – U.S.A. vs. USA: use locale
• Numbers
  – Examples: 3/12/91, Mar. 12, 1991, 55 B.C., B-52, 100.2.86.144
  – Generally, don’t index as text
  – Creation dates for docs
• Case folding
  – Reduce all letters to lower case
  – Exception: upper case in mid-sentence
    – General Motors, Fed vs. fed, SAIL vs. sail
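Some of these normalization choices can be sketched in code. The function below is only one possible policy, assumed here for illustration: it lower-cases everything (ignoring the mid-sentence upper-case exception), drops possessive 's, splits hyphenated sequences, strips accents and does not index numbers.

```python
import re
import unicodedata

def strip_accents(token):
    """Remove combining marks, e.g. 'résumé' -> 'resume'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_tokens(text):
    """Lower-case, drop possessive 's, split on hyphens, strip accents."""
    text = text.lower().replace("\u2019", "'")   # unify curly apostrophes
    # Runs of letters, optionally with one internal apostrophe; digits,
    # hyphens and other punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    result = []
    for tok in tokens:
        if tok.endswith("'s"):
            tok = tok[:-2]
        result.append(strip_accents(tok))
    return result

# "r\u00e9sum\u00e9" spells "résumé" with precomposed accented letters.
tokens = normalize_tokens(
    "Cooper's concordance: full-text retrieval and r\u00e9sum\u00e9 scanning")
```

Whatever policy is chosen, the same normalization has to be applied to queries, otherwise query terms will not match the indexed forms.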

Text Representation
Selection of natural language index terms (II)

• Spell correction
  – Look for all words within (maximum) edit distance k (Insert/Delete/Replace) at query time
    – Data Minino → Data Mining (edit distance: 1)
    – eterogeneiti → heterogeneity (edit distance: 2)
  – Expensive and slows the query (up to a factor of 100)
  – Invoke only when the index returns zero matches
  – What if documents contain misspellings?
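The edit distances quoted above can be verified with the standard dynamic-programming formulation of Levenshtein distance. The vocabulary in this sketch is a hypothetical list of index terms.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and replaces turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # replace or match
        prev = curr
    return prev[-1]

def candidates(word, vocabulary, k):
    """Vocabulary terms within edit distance k of word."""
    return [t for t in vocabulary if edit_distance(word, t) <= k]

vocabulary = ["mining", "heterogeneity", "retrieval"]  # hypothetical index terms
suggestions = candidates("minino", vocabulary, 1)
```

Comparing the query word against every vocabulary term, as `candidates` does, is the expensive part, which is why the slide suggests invoking correction only when the index returns no matches.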

Text Representation
Selection of natural language index terms (III)

• Removal of Stopwords
  – Terms that are so common that they’re ignored for indexing
  – Function words that serve grammatical purposes and don’t refer to objects or concepts
    – e.g., the, a, an, of, to …
  – language-specific
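Stopword removal amounts to filtering tokens against a list. The list below is a deliberately tiny, hypothetical English one; practical lists are longer and, as noted, language-specific.

```python
# Hypothetical, deliberately tiny English stopword list.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "are", "on"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "The applications of full-text retrieval are legion".split()
index_terms = remove_stopwords(tokens)
```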

Text Representation
Selection of natural language index terms (IV)

• Stemming
  – Reduce terms to their “roots” before indexing
    – mainly eliminates plurals, tenses, gerund forms, prefixes and suffixes
    – automate(s), automatic, automation are all reduced to automat
  – Language-dependent

Example:
  Original: for example compressed and compression are both accepted as equivalent to compress.
  Stemmed:  for exampl compres and compres are both accept as equival to compres.

Text Representation
Selection of natural language index terms (V)

• Porter’s stemmer algorithm
  – Commonest algorithm for stemming English
  – Conventions + 5 phases of reductions
    – phases are applied sequentially
    – each phase consists of a set of commands
    – sample convention: of the rules in a compound command, select the one that applies to the longest suffix
    – sample rule: if a word ends with “ation”, replace it with “ate”
  – Sample rules:
    – sses → ss
    – ies → i
    – ational → ate
    – tional → tion
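The longest-suffix convention can be illustrated with just the four sample rules listed above. This is not Porter's full algorithm: the real stemmer has many more rules spread over its phases, plus conditions on the remaining stem, and grouping these four rules into a single command is a simplification made here.

```python
# Only the four sample rules quoted above; the full algorithm has many more
# rules plus conditions on the remaining stem that are omitted here.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest_suffix_rule(word, rules=RULES):
    """Apply the rule whose suffix is the longest one matching the word."""
    matching = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[:len(word) - len(suffix)] + repl

stems = [apply_longest_suffix_rule(w)
         for w in ["caresses", "ponies", "relational", "conditional", "cat"]]
```

"relational" matches both "ational" and "tional"; choosing the longer suffix yields "relate" rather than "relation".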

Text Representation
Selection of natural language index terms (VI)

• Lemmatization
  – Reduce inflectional/variant forms to base form
    – am, are, is → be
    – car, cars, car's, cars' → car
  – the boy's cars are different colors → the boy car be different color

Text Representation
Selection of natural language index terms (VII)

• Part-of-speech Tagging
  – Labelling each word in a sentence with its proper grammatical category
      The representative put chairs on the table →
      The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN
      (AT = determiner/article, NN = noun, singular, VBD = verb, past tense, NNS = noun, plural, IN = preposition)
  – There are several tag sets, differing mainly in granularity and complexity:
    – Brown tag set, Penn Treebank tag set
  – and several approaches:
    – Markov models, Transformation-based learning, Decision trees
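Tagger output in the word/TAG notation shown above is straightforward to consume downstream. The sketch below only parses an already tagged sentence (it does not perform tagging) and keeps the nouns as candidate index terms; restricting candidates to nouns is an assumption made for illustration.

```python
tagged = "The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN"

def parse_tagged(sentence):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in sentence.split()]

pairs = parse_tagged(tagged)

# Nouns (singular NN or plural NNS) as candidate index terms.
nouns = [word for word, tag in pairs if tag in {"NN", "NNS"}]
```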

Text Representation
Selection of natural language index terms (VIII)

• Part-of-speech Tagging
  – (Some) public taggers
    – Eric Brill’s tagger: http://www.cs.jhu.edu/~brill/ (Perl and C implementations)
    – TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
    – MontyTagger: http://web.media.mit.edu/~hugo/montytagger/ (Python and Java implementations)
    – QTAG: http://web.bham.ac.uk/O.Mason/software/tagger/ (Java implementation)
  – Makes sense as an intermediate task for others
    – e.g. with shallow parsing for
      – creating linguistically motivated index terms
      – detecting slot-filler candidates in Information Extraction
      – detecting answer candidates in Question Answering


Text Representation: Assignment of controlled language index terms

• Thesaurus
– Generalizes terms that have related meaning, but unrelated surface forms, into more uniform index terms
– Puts words that are synonyms and are intersubstitutable into equivalence classes
• Words may have many senses: polysemous words
– Word Sense Disambiguation techniques are needed
• Index such equivalences, or expand query?
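The two options in the last question can be contrasted with a small sketch. The thesaurus below is a made-up example; `normalize_for_index` replaces every term by the head of its equivalence class at indexing time, while `expand_query` leaves the index untouched and adds the synonyms to the query instead.

```python
# Hypothetical thesaurus: each equivalence class is keyed by its controlled index term.
THESAURUS = {
    "car": {"car", "automobile", "auto"},
    "physician": {"physician", "doctor"},
}
# Reverse lookup: surface form -> controlled index term.
TERM_TO_CLASS = {t: head for head, members in THESAURUS.items() for t in members}


def normalize_for_index(tokens):
    """Option 1: index the equivalence class instead of the surface form."""
    return [TERM_TO_CLASS.get(t, t) for t in tokens]


def expand_query(tokens):
    """Option 2: keep the index as-is and add all synonyms to the query."""
    expanded = []
    for t in tokens:
        head = TERM_TO_CLASS.get(t)
        expanded.extend(sorted(THESAURUS[head]) if head else [t])
    return expanded


print(normalize_for_index(["cheap", "automobile"]))
print(expand_query(["cheap", "automobile"]))
```

Neither option handles polysemous words; without word sense disambiguation, a term is mapped to its class regardless of the sense in which it is used.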


Tutorial Outline

1. Introduction
– Motivations
– Basic concepts in Knowledge Discovery from textual data
2. Deeper into Text Mining: Text Representation
– Functions
– Models
– Storage techniques
– Index term Identification
– Index term Weighting
3. Applications


Text Representation: Index Term Weighting (I)

• Distribution patterns of words give significant information about the property of being content bearing
• Zipf’s law
– Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f)
– Zipf (1949) “discovered” that: $f \propto \frac{1}{r}$
– If the probability of the word of rank r is $p_r$ and N is the total number of word occurrences: $p_r = \frac{f}{N} = \frac{A}{r}$, with $A \approx 0.1$
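A rank-frequency table makes the law easy to check on a corpus. The sketch below is illustrative only: it uses whitespace tokenization (an assumption) and reports the product r · p_r, which Zipf's law predicts to be roughly constant (the A above) on a large corpus; on a toy text the fit will be poor.

```python
from collections import Counter


def rank_frequency_table(text):
    """Return rows of (rank, word, frequency, p_r, rank * p_r)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())  # N: total number of word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        p_r = freq / total
        rows.append((rank, word, freq, p_r, rank * p_r))
    return rows


for row in rank_frequency_table("the cat and the dog and the bird"):
    print(row)
```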


Text Representation: Index Term Weighting (II)

• Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing

• Most discriminative concepts have low to medium frequency


Text Representation: Index Term Weighting (III)

• Need for considering frequency of a word in a document
• Weighting term frequency (tf)
– It still doesn’t consider:
  – Term scarcity in the collection (document mention frequency)
  – Length of documents and queries (not normalized)
• Weighting should also depend on the overall distribution of the term in the collection
– Suggests looking at collection frequency (cf)
– but document frequency (df) may be better
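The difference between cf and df is easiest to see on a tiny collection. In this sketch (whitespace tokenization and the three sample documents are assumptions), cf counts every occurrence of a term in the collection, while df counts each document at most once.

```python
from collections import Counter


def collection_and_document_frequency(docs):
    """Return (cf, df): total occurrences per term, and number of documents containing it."""
    cf, df = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        cf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts once per term
    return cf, df


docs = ["insurance insurance insurance policy", "try the policy", "try again"]
cf, df = collection_and_document_frequency(docs)
print(cf["insurance"], df["insurance"])  # 3 1
print(cf["try"], df["try"])              # 2 2
```

Here "insurance" has the higher cf but occurs in a single document, so it discriminates that document better than "try"; df captures this, cf does not.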


Text Representation: Index Term Weighting (IV)

• tf x idf measure combines:
– term frequency (tf)
  – measure of term density in a document
– inverse document frequency (idf)
  – measure of informativeness of a term: its rarity across the whole corpus
  – could just be the inverse of the raw count of documents the term occurs in ($idf_i = 1/df_i$)
  – but by far the most commonly used version is: $idf_i = \log(n / df_i)$


Text Representation: Index Term Weighting (V)

• Assign a tf.idf weight to each term i in each document d
– increases with the number of occurrences within a document
– increases with the rarity of the term across the whole corpus

$$w_{i,d} = tf_{i,d} \times \log(n / df_i)$$

– $tf_{i,d}$ = frequency of term i in document d
– $n$ = total number of documents
– $df_i$ = number of documents that contain term i
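The weight w(i,d) = tf(i,d) × log(n / df(i)) can be computed directly from token counts. This is a minimal sketch under two assumptions not fixed by the slide: whitespace tokenization and the natural logarithm (the base only rescales all weights).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(n / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)  # total number of documents
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # number of documents that contain each term
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # frequency of each term in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


for w in tf_idf(["text mining text", "data mining"]):
    print(w)
```

A term that occurs in every document ("mining" here) gets weight 0, since log(n / n) = 0.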


Text Representation: Index Term Weighting (VI)

• Length normalization
– Documents have different sizes
– Long and verbose texts usually
  – use the same terms repeatedly
  – have numerous different terms
– Variations in length can be normalized to compensate for the effect that the tf factors are large for long texts and small for short ones, obscuring the real term importance

$$\frac{tf_{i,d}}{\max_j tf_{j,d}} \qquad \frac{tf_{i,d}}{\sqrt{\sum_j (tf_{j,d})^2}} \qquad \frac{tf_{i,d} \cdot idf_i}{\sqrt{\sum_j (tf_{j,d} \cdot idf_j)^2}}$$
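Two usual normalizations, dividing by the maximum tf in the document and dividing by the Euclidean length of the weight vector (cosine normalization), can be sketched as below. The function names are hypothetical; `cosine_normalize` works on any weight dict, so it covers both raw tf and tf × idf weights.

```python
import math


def max_tf_normalize(tf):
    """Divide each term frequency by the largest term frequency in the document."""
    largest = max(tf.values())
    return {t: f / largest for t, f in tf.items()}


def cosine_normalize(weights):
    """Divide each weight by the Euclidean length of the document's weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # all-zero vector: nothing to normalize
    return {t: w / norm for t, w in weights.items()}


tf = {"text": 3, "mining": 4}
print(max_tf_normalize(tf))
print(cosine_normalize(tf))
```

After cosine normalization every document vector has length 1, so a long document no longer dominates a short one merely because its raw counts are larger.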


Tutorial Outline

1. Introduction
– Motivations
– Basic concepts in Knowledge Discovery from textual data
2. Deeper into Text Mining: Text Representation
– Functions
– Models
– Storage techniques
– Index term Identification
– Index term Weighting
3. Applications


Applications (I)

• The main application areas cover two aspects:
– Knowledge discovery
  – mining proper
– Information distillation
  – mining on the basis of some pre-established document structure, to identify documents relevant to a target information
• Typical usage:
– Extract relevant information from documents
– Classify and manage documents according to their content
– Organize repositories of document-related meta-information for search and retrieval


Applications (II)

• Text summarization
• Word sense disambiguation
• Hierarchical categorization of Web pages
• Text filtering
– CRM & marketing (e.g., cross-selling, recommendation)
– Product recommendation
– Information delivery at organizations for Knowledge Management
– Personalizing information access
– Filtering news items in Usenet newsgroups
– Detecting spam messages


Applications: Text Summarization

• Generate a summary of a text’s content
– short text: essential and coherent
– Use profiles to structure the important content in semantically well-defined fields
• Mostly applied to ease information access, e.g.
– most useful keywords are extracted from a set of documents (e.g., a cluster) to describe it
– documents in a collection are abstracted to avoid reading the full content
– documents retrieved from search are summarized to allow the user a faster identification of those relevant to the query
• High-level summary or survey of all main points?
• Approaches based on size of the text unit used in the summary
– Keyword summaries
– Sentence summaries
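The two text-unit sizes can be illustrated with a simple frequency-based extractive sketch. Everything here is an assumption made for the example: the toy stopword list, the regular-expression tokenizer, and the scoring rule (sum of word frequencies, which favours long sentences); it is not the method of any particular summarizer.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for"}  # toy list


def content_words(text):
    """Lowercased alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def keyword_summary(text, k=3):
    """Keyword summary: the k most frequent content words."""
    return [w for w, _ in Counter(content_words(text)).most_common(k)]


def sentence_summary(text, k=1):
    """Sentence summary: the k highest-scoring sentences, in original order."""
    freq = Counter(content_words(text))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in content_words(s)),
        reverse=True,
    )
    chosen = set(ranked[:k])
    return [s for s in sentences if s in chosen]


doc = "Text mining turns text into knowledge. Cats sleep."
print(keyword_summary(doc, 1))
print(sentence_summary(doc, 1))
```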


Applications: Word Sense Disambiguation

• Assign a word the right sense with respect to the context in which the word appears
• An effective approach:
– Choosing word meanings from an existing sense inventory by exploiting measures of semantic relatedness
• WSD is an example of the more general issue of resolving natural language ambiguities
• For instance:
– “bank” may have (at least) two senses in English:
  – “the Bank of England” (a financial institution)
  – “the bank of river Thames” (a hydraulic engineering artefact)
– which of the above senses does the occurrence of “bank” have in “last week I borrowed some money from the bank”?
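The “bank” example can be worked through with a simplified Lesk-style sketch, which picks the sense whose gloss shares the most words with the context. Gloss overlap is only a crude stand-in for a real semantic relatedness measure, and the two-sense inventory, its glosses, and the stopword list are all written for this illustration.

```python
# Hypothetical two-sense inventory for "bank"; the glosses are made up for this example.
SENSES = {
    "financial institution": "an institution that accepts deposits and lends money",
    "river bank": "sloping land beside a river or other body of water",
}
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "i", "from", "some"}


def words(text):
    """Set of lowercased non-stopword tokens."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(context, senses=SENSES):
    """Return the sense whose gloss overlaps most with the context words."""
    ctx = words(context)
    return max(senses, key=lambda s: len(ctx & words(senses[s])))


print(disambiguate("last week I borrowed some money from the bank"))
# prints: financial institution
```

The context word "money" occurs in the first gloss and no context word occurs in the second, so the financial sense is selected.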


Applications: Hierarchical Categorization of Web pages

• Under hierarchical catalogues (hosted by popular Web portals), a searcher may
– first navigate in the hierarchy of categories
– and then restrict his/her search to a particular category of interest
• Category-pivoted categorization should allow new categories to be added and obsolete ones to be deleted
• Peculiarities:
– Hypertextual nature of the documents
  – Hyperlink analysis
– Hierarchical structure of the category set
  – Decomposing the classification as a branching decision at an internal node


Applications: Text Filtering (I)

• Classify a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
– typical case: a newsfeed (producer: news agency, consumer: newspaper)
• Desiderata of a filtering system
– should block the delivery of the documents the consumer is likely not interested in
  – filtering can be seen as a case of single-label text categorization (TC)
– may be installed at the producer end
  – to route the documents to the interested consumers only
  – builds and updates a “profile” for each consumer


Applications: Text Filtering (II)

– or at the consumer end
  – to block the delivery of documents deemed uninteresting
  – a single “profile” is needed
• Adaptive filtering
– a profile is initially specified by the user
– and is updated by using feedback information provided by the user on the relevance of the delivered messages
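One way to picture adaptive filtering is a profile kept as a term-weight vector and nudged by each relevance judgement. The sketch below is a Rocchio-style update chosen for illustration, not a method stated on the slide; the function names and the `rate` parameter are assumptions.

```python
def update_profile(profile, doc_vector, relevant, rate=0.5):
    """Move the profile toward a relevant document, or away from a non-relevant one."""
    sign = 1.0 if relevant else -1.0
    updated = dict(profile)  # leave the caller's profile untouched
    for term, weight in doc_vector.items():
        updated[term] = updated.get(term, 0.0) + sign * rate * weight
    return updated


def score(profile, doc_vector):
    """Dot product between profile and document; deliver when it exceeds a threshold."""
    return sum(profile.get(t, 0.0) * w for t, w in doc_vector.items())


profile = {"football": 1.0}                 # initial profile specified by the user
doc = {"football": 1.0, "transfer": 1.0}    # a delivered message
print(score(profile, doc))                  # 1.0
profile = update_profile(profile, doc, relevant=True)
print(score(profile, doc))                  # 2.0
```

After positive feedback the same kind of message scores higher, because the profile has picked up the extra term "transfer".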


Applications: Customer Relationship Management

• Incorporates both the distillation and discovery aspects of TM
• Designed to specifically help companies better understand what their customers want and what they think about the company itself
• Method:
1. Select a suitable set of documents and convert them to a common standard format
2. Extract relevant features and derive a database of documents which are grouped according to the similarity of their content, by exploiting clustering techniques
3. Use categorization tools to assign new incoming customer feedback to the identified categories
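Step 3 can be sketched as nearest-centroid assignment with cosine similarity. This assumes the clusters from step 2 are already available (the two clusters below are invented sample data), uses raw term counts as features, and is only one of many possible categorization tools.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector with raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(texts):
    """Summed term counts of a cluster (cosine ignores the overall scale)."""
    total = Counter()
    for t in texts:
        total.update(vectorize(t))
    return total


def assign(feedback, clusters):
    """Return the name of the cluster whose centroid is most similar to the feedback."""
    centroids = {name: centroid(texts) for name, texts in clusters.items()}
    v = vectorize(feedback)
    return max(centroids, key=lambda name: cosine(v, centroids[name]))


# Hypothetical clusters, standing in for the output of step 2.
clusters = {
    "delivery": ["late delivery again", "delivery took two weeks"],
    "billing": ["wrong invoice amount", "charged twice on invoice"],
}
print(assign("my delivery is late", clusters))
```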


Applications: Product Recommendation

• Content-based
– According to a personal profile accounting for
  – a set of categories (DVD, computer games, music, etc.) and subcategories (genres)
– Starting with preferred items
  – authors, titles, brands
– Recommendation of new releases
  – of course it is not based on text content, but on the purchasing history
• Collaborative or social
– According to other customers’ purchases
  – “Customers who bought this book also bought…”
– Based on
  – previous annotations by other users
  – and generating a user segmentation

• A trend is combining both ideas
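The “Customers who bought this book also bought…” idea reduces, in its simplest form, to counting co-purchases. The sketch below assumes purchase histories are available as lists of item identifiers (the baskets shown are invented) and ignores everything a production recommender would add, such as popularity correction or user segmentation.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """For each item, count how often every other item was bought by the same customer."""
    counts = {}
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def also_bought(item, counts, k=2):
    """Top-k items most often bought together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(k)]


baskets = [["A", "B"], ["A", "B", "C"], ["A", "D"]]  # hypothetical purchase histories
counts = co_purchase_counts(baskets)
print(also_bought("A", counts, k=1))  # ['B']
```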


Applications: Detecting Spam

• Spam email is, more properly, unsolicited bulk email
• It has been causing considerable damage to
– Internet Service Providers
– Internet users (connection costs)
– and the whole Internet backbone
• Spam detection is a Text Categorization problem
– Two classes: spam and legitimate email
– It is relatively easy to
  – Represent messages as vectors of concept weights
  – Perform some feature selection
  – Learn a classifier
– but evaluation is not so simple, because it is a problem in which misclassification costs and class distribution are not symmetric
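The asymmetry can be made explicit with a cost-weighted error rate, in which blocking a legitimate message counts more than letting a spam message through. This is a sketch under stated assumptions: the labels "spam"/"ham" and the cost values (9 to 1) are arbitrary choices for illustration, not figures from the tutorial.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) with `positive` as the target class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # legitimate message blocked
        else:
            if truth == positive:
                fn += 1  # spam message delivered
            else:
                tn += 1
    return tp, fp, fn, tn


def weighted_error(y_true, y_pred, fp_cost=9.0, fn_cost=1.0):
    """Error rate where each legitimate message weighs `fp_cost` and each spam `fn_cost`."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    incurred = fp_cost * fp + fn_cost * fn
    maximum = fp_cost * (fp + tn) + fn_cost * (tp + fn)
    return incurred / maximum


y_true = ["spam", "ham", "ham", "ham"]
y_pred = ["ham", "ham", "ham", "spam"]
print(weighted_error(y_true, y_pred))                            # 10/28, about 0.357
print(weighted_error(y_true, y_pred, fp_cost=1.0, fn_cost=1.0))  # plain error rate: 0.5
```

With equal costs the function reduces to the ordinary error rate, so the gap between the two printed values shows how much the cost assumption changes the assessment of the same classifier.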