Predictive Analysis of Text: Concepts, Features, and Instances
Sandeep Avula, [email protected]


  • Predictive Analysis of Text: Concepts, Features, and Instances

Sandeep Avula, [email protected]

  • Bias vs. Variance

    Fig source: http://scott.fortmann-roe.com/docs/BiasVariance.html

  • Overfitting

    How would you fit a curve/line through these points?

  • Overfitting

    Fig source: https://algotrading101.com/blog/1543426/what-is-curve-fitting-overfitting-in-trading-optimization

  • Curse of Dimensionality

    Fig source: https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335

  • Optimization

    Fig source: http://vrand.com/education.html

  • 8

    • Objective: developing and evaluating computer programs that automatically detect a particular concept in natural language text

    Predictive Analysis of Text

  • 9

Predictive Analysis: basic ingredients

    1. Training data: a set of positive and negative examples of the concept we want to automatically recognize

    2. Representation: a set of features that we believe are useful in recognizing the desired concept

    3. Learning algorithm: a computer program that uses the training data to learn a predictive model of the concept

  • 10

Predictive Analysis: basic ingredients

    4. Model: a function that describes a predictive relationship between feature values and the presence of the concept

    5. Test data: a set of previously unseen examples used to estimate the model’s effectiveness

    6. Performance metrics: a set of statistics used to measure the predictive effectiveness of the model

  • 11

Predictive Analysis: training and testing

    [Diagram: training: labeled examples → machine learning algorithm → model; testing: new, unlabeled examples → model → predictions]
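    A minimal sketch of this train-then-predict loop in Python with scikit-learn; the tiny dataset, labels, and variable names are illustrative assumptions, not from the slides:

```python
# Sketch of the training/testing loop: labeled examples -> learning
# algorithm -> model, then model -> predictions on unlabeled examples.
# The toy dataset below is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = [
    "flu symptoms and treatment options",
    "stock market rally continues",
    "vaccine side effects reported",
    "election results announced today",
]
labels = ["health", "other", "health", "other"]

vectorizer = CountVectorizer()                     # representation: bag-of-words
X_train = vectorizer.fit_transform(labeled_texts)

model = LogisticRegression().fit(X_train, labels)  # training

X_new = vectorizer.transform(["new report on a flu outbreak"])
print(model.predict(X_new))                        # testing: predictions
```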

  • 12

Predictive Analysis: concept, instances, and features

    (rows = instances, columns = features, rightmost column = concept label)

    color  size   sides  equalsides  ...  label
    red    big    3      no          ...  yes
    green  big    3      yes         ...  yes
    blue   small  inf    yes         ...  no
    blue   small  4      yes         ...  no
    ...    ...    ...    ...         ...  ...
    red    big    3      yes         ...  yes

  • 13

Predictive Analysis: training and testing

    [Diagram: training: labeled examples → machine learning algorithm → model; testing: new, unlabeled examples → model → predictions]

    Training data (labeled examples):

    color  size   sides  equalsides  ...  label
    red    big    3      no          ...  yes
    green  big    3      yes         ...  yes
    blue   small  inf    yes         ...  no
    blue   small  4      yes         ...  no
    ...    ...    ...    ...         ...  ...
    red    big    3      yes         ...  yes

    Test data (new, unlabeled examples):

    color  size   sides  equalsides  ...  label
    red    big    3      no          ...  ???
    green  big    3      yes         ...  ???
    blue   small  inf    yes         ...  ???
    blue   small  4      yes         ...  ???
    ...    ...    ...    ...         ...  ???
    red    big    3      yes         ...  ???

  • 14

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What is a good feature representation for a task?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 15

    • Learning algorithms can recognize some concepts better than others

    • What are some properties of concepts that are easier to recognize?

Predictive Analysis: concepts

  • 16

    • Option 1: can a human recognize the concept?

Predictive Analysis: concepts

  • 17

    • Option 1: can a human recognize the concept?

    • Option 2: can two or more humans recognize the concept independently and do they agree?

Predictive Analysis: concepts

  • 18

    • Option 1: can a human recognize the concept?

    • Option 2: can two or more humans recognize the concept independently and do they agree?

    • Option 2 is better.

• In fact, models are sometimes evaluated as if they were an independent assessor

    • How does the model’s performance compare to the performance of one assessor with respect to another?

    ‣ One assessor produces the “ground truth” and the other produces the “predictions”

Predictive Analysis: concepts

  • 19

Predictive Analysis: measures of agreement: percent agreement

(rows: assessor 1, columns: assessor 2)

           yes   no
    yes     A     B
    no      C     D

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

    % agreement = (? + ?) / (? + ? + ? + ?)

  • 20

           yes   no
    yes     A     B
    no      C     D

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

    % agreement = (A + D) / (A + B + C + D)

Predictive Analysis: measures of agreement: percent agreement

  • 21

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

Predictive Analysis: measures of agreement: percent agreement

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    % agreement = ???

  • 22

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

Predictive Analysis: measures of agreement: percent agreement

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    % agreement = (5 + 75) / 100 = 80%
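    A quick check of this computation in Python (a minimal sketch; the function name is ours, not from the slides):

```python
# Percent agreement from the 2x2 contingency table cells A, B, C, D.
def percent_agreement(a, b, c, d):
    return (a + d) / (a + b + c + d)

print(percent_agreement(5, 5, 15, 75))  # 0.8, matching (5 + 75) / 100
```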

  • 23

    • Problem: percent agreement does not account for agreement due to random chance.

    • How can we compute the expected agreement due to random chance?

    • Option 1: assume unbiased assessors

    • Option 2: assume biased assessors

Predictive Analysis: measures of agreement: percent agreement

  • 24

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    ??    ??     50
    no     ??    ??     50
    total  50    50

  • 25

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

  • 26

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

    random chance % agreement = ???

  • 27

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

    random chance % agreement = (25 + 25) / 100 = 50%

  • 28

Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to random chance

• P(a) = percent of observed agreement

    • P(e) = percent of agreement due to random chance

    • K = (P(a) - P(e)) / (1 - P(e))
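    Putting the definition into a few lines of Python (a sketch; the numbers are taken from the examples above, where P(a) = 0.80 and unbiased P(e) = 0.50):

```python
# Cohen's kappa: agreement corrected for expected chance agreement.
def kappa(p_a, p_e):
    return (p_a - p_e) / (1 - p_e)

print(kappa(0.80, 0.50))  # 0.6 for the unbiased-assessor example
```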

  • 29

Observed agreement:

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to unbiased chance

    Expected agreement due to unbiased chance:

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

  • 30

    • Option 2: biased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    biased chance % agreement = ???

  • Predictive Analysis: probability calculation

• When throwing a die twice, what is the probability that the first is an even number and the second is a multiple of 3?

    Even numbers: 2, 4, 6. Multiples of 3: 3, 6.

    31 (Slide borrowed from Heejun Kim)

  • Predictive Analysis: probability calculation

• When throwing a die twice, what is the probability that the first is an even number and the second is a multiple of 3?

    Even numbers: 2, 4, 6 → P(even) = 3/6 = 1/2

    Multiples of 3: 3, 6 → P(multiple of 3) = 2/6 = 1/3

    P(both) = 1/2 × 1/3 = 1/6

    32 (Slide borrowed from Heejun Kim)
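    The product rule above is easy to sanity-check by simulation; a minimal Python sketch (the trial count is arbitrary):

```python
# Estimate P(first die even AND second die a multiple of 3) by simulation.
import random

trials = 100_000
hits = sum(
    1 for _ in range(trials)
    if random.randint(1, 6) % 2 == 0 and random.randint(1, 6) % 3 == 0
)
print(hits / trials)  # close to 1/6 = 0.1667
```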

  • 33

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to biased chance

  • 34

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to biased chance
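    Worked out from each assessor's marginals in the table above, using the standard biased-chance calculation for Cohen's kappa:

    P(e) = (10/100 × 20/100) + (90/100 × 80/100) = 0.02 + 0.72 = 0.74

    K = (0.80 - 0.74) / (1 - 0.74) = 0.06 / 0.26 ≈ 0.23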

  • 35

• INPUT: unlabeled data, annotators, coding manual
    • OUTPUT: labeled data

    1. using the latest coding manual, have all annotators label some previously unseen portion of the data (~10%)

    2. measure inter-annotator agreement (Kappa)

    3. IF agreement < X, THEN:

    ‣ refine the coding manual, using disagreements to resolve inconsistencies and clarify definitions

    ‣ return to 1

    • ELSE

    ‣ have annotators label the remainder of the data independently, and EXIT

Predictive Analysis: data annotation process

  • 36

    • What is good (Kappa) agreement?

    • It depends on who you ask

    • According to Landis and Koch, 1977:

    ‣ 0.81 - 1.00: almost perfect

‣ 0.61 - 0.80: substantial

    ‣ 0.41 - 0.60: moderate

    ‣ 0.21 - 0.40: fair

    ‣ 0.00 - 0.20: slight

    ‣ < 0.00: no agreement

Predictive Analysis: data annotation process

  • Predictive Analysis: kappa agreement: chance-corrected % agreement

Case  P(a)  P(e)  Kappa
    1     0.5   0.1
    2     0.5   0.2
    3     0.5   0.3
    4     0.5   0.4
    5     0.5   0.5

    • Kappa agreement simulation

37 (Slide borrowed from Heejun Kim)

  • Predictive Analysis: kappa agreement: chance-corrected % agreement

Case  P(a)  P(e)  Kappa
    1     0.5   0.1   0.44
    2     0.5   0.2   0.375
    3     0.5   0.3   0.29
    4     0.5   0.4   0.17
    5     0.5   0.5   0

    • Kappa agreement simulation

38 (Slide borrowed from Heejun Kim)
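    The table is straightforward to reproduce with the kappa formula (a sketch; rounding matches the slide's values):

```python
# Reproduce the kappa simulation table: fixed P(a) = 0.5, varying P(e).
def kappa(p_a, p_e):
    return (p_a - p_e) / (1 - p_e)

for p_e in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(p_e, round(kappa(0.5, p_e), 3))  # 0.444, 0.375, 0.286, 0.167, 0.0
```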

  • 39

    • Question: requests information about the course content

    • Answer: contributes information in response to a question

    • Issue: expresses a problem with the course management

    • Issue Resolution: attempts to resolve a previously raised issue

    • Positive Ack: positive sentiment about a previous post

    • Negative Ack: negative sentiment about a previous post

    • Other: serves a different purpose

Predictive Analysis: data annotation process

  • 40

Predictive Analysis: data annotation process

  • 41

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • What is a good feature representation for this task?

    • How should I divide the data into training and test sets?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 42

    • For many text-mining applications, turning the data into instances for training and testing is fairly straightforward

    • Easy case: instances are self-contained, independent units of analysis

    ‣ topic categorization: instances = documents

    ‣ opinion mining: instances = product reviews

    ‣ bias detection: instances = political blog posts

    ‣ emotion detection: instances = support group posts

Predictive Analysis: turning data into (training and test) instances

  • 43

Topic Categorization: predicting health-related documents

    (rows = instances/documents, columns = word features, rightmost column = concept label)

    w_1  w_2  w_3  ...  w_n  label
    1    1    0    ...  0    health
    0    0    0    ...  0    other
    0    0    0    ...  0    other
    0    1    0    ...  1    other
    ...  ...  ...  ...  ...  ...
    1    0    0    ...  1    health
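    A minimal sketch of producing such w_1 ... w_n binary word features with scikit-learn (the two example documents are illustrative assumptions):

```python
# Turn documents (instances) into binary bag-of-words features:
# one column per vocabulary word, 1 if the word occurs in the document.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the flu vaccine is effective", "the senate passed the bill"]
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the w_1 ... w_n columns
print(X.toarray())                         # one feature row per document
```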

  • 44

Opinion Mining: predicting positive/negative movie reviews

    (rows = instances/reviews, columns = word features, rightmost column = concept label)

    w_1  w_2  w_3  ...  w_n  label
    1    1    0    ...  0    positive
    0    0    0    ...  0    negative
    0    0    0    ...  0    negative
    0    1    0    ...  1    negative
    ...  ...  ...  ...  ...  ...
    1    0    0    ...  1    positive

  • 45

Bias Detection: predicting liberal/conservative blog posts

    (rows = instances/posts, columns = word features, rightmost column = concept label)

    w_1  w_2  w_3  ...  w_n  label
    1    1    0    ...  0    liberal
    0    0    0    ...  0    conservative
    0    0    0    ...  0    conservative
    0    1    0    ...  1    conservative
    ...  ...  ...  ...  ...  ...
    1    0    0    ...  1    liberal

  • 46

    • A not-so-easy case: relational data

    • The concept to be learned is a relation between pairs of objects

Predictive Analysis: turning data into (training and test) instances

  • 47

Predictive Analysis: example of relational data: Brother(X,Y)

    (example borrowed and modified from the Witten et al. textbook)

  • 48

Predictive Analysis: example of relational data: Brother(X,Y)

    (rows = instances/pairs, columns = features, rightmost column = concept label)

    name_1  gender_1  mother_1  father_1  name_2  gender_2  mother_2  father_2  brother
    steven  male      peggy     peter     graham  male      peggy     peter     yes
    ian     male      grace     ray       brian   male      grace     ray       yes
    anna    female    pam       ian       nikki   female    pam       ian       no
    pippa   female    grace     ray       brian   male      grace     ray       no
    steven  male      peggy     peter     brian   male      grace     ray       no
    ...
    anna    female    pam       ian       brian   male      grace     ray       no

  • 49

    • A not-so-easy case: relational data

    • Each instance should correspond to an object pair (which may or may not share the relation of interest)

    • May require features that characterize properties of the pair

Predictive Analysis: turning data into (training and test) instances

  • 50

Predictive Analysis: example of relational data: Brother(X,Y)

    (can we think of a better feature representation?)

    name_1  gender_1  mother_1  father_1  name_2  gender_2  mother_2  father_2  brother
    steven  male      peggy     peter     graham  male      peggy     peter     yes
    ian     male      grace     ray       brian   male      grace     ray       yes
    anna    female    pam       ian       nikki   female    pam       ian       no
    pippa   female    grace     ray       brian   male      grace     ray       no
    steven  male      peggy     peter     brian   male      grace     ray       no
    ...
    anna    female    pam       ian       brian   male      grace     ray       no

  • 51

Predictive Analysis: example of relational data: Brother(X,Y)

    (rows = instances/pairs, columns = features, rightmost column = concept label)

    gender_1  gender_2  sameparents  brother
    male      male      yes          yes
    male      male      yes          yes
    female    female    yes          no
    female    male      yes          no
    male      male      no           no
    ...
    female    male      no           no
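    A small sketch of deriving these pair features from per-person records (the dictionary structure and field names are illustrative assumptions):

```python
# Build the gender_1 / gender_2 / sameparents pair features for Brother(X, Y).
people = {
    "steven": {"gender": "male",   "mother": "peggy", "father": "peter"},
    "graham": {"gender": "male",   "mother": "peggy", "father": "peter"},
    "anna":   {"gender": "female", "mother": "pam",   "father": "ian"},
}

def pair_features(x, y):
    px, py = people[x], people[y]
    return {
        "gender_1": px["gender"],
        "gender_2": py["gender"],
        "sameparents": px["mother"] == py["mother"]
                       and px["father"] == py["father"],
    }

print(pair_features("steven", "graham"))  # both male, sameparents True
```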

  • 52

    • A not-so-easy case: relational data

    • There is still an issue that we’re not capturing! Any ideas?

    • Hint: In this case, should the predicted labels really be independent?

Predictive Analysis: turning data into (training and test) instances

  • 53

Brother(A,B) = yes

    Brother(B,C) = yes

    Brother(A,C) = no

    (an inconsistent set of independent predictions: in this setting brotherhood is transitive, so the third label cannot be right if the first two are)

Predictive Analysis: turning data into (training and test) instances

  • 54

    • In this case, what we would really want is:

    ‣ a method that does joint prediction on the test set

    ‣ a method whose joint predictions satisfy a set of known properties about the data as a whole (e.g., transitivity)

Predictive Analysis: turning data into (training and test) instances

  • 55

    • There are learning algorithms that incorporate relational constraints between predictions

    • However, they are beyond the scope of this class

    • We’ll be covering algorithms that make independent predictions on instances

    • That said, many algorithms output prediction confidence values

    • Heuristics can be used to disfavor inconsistencies

Predictive Analysis: turning data into (training and test) instances
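    A hedged sketch of one such heuristic (ours, not from the slides): in an inconsistent triple of predictions, flip the label the model was least confident about.

```python
# Illustrative post-processing only: enforce transitivity over independent
# pairwise Brother predictions by flipping the least confident label.
preds = {  # pair -> (predicted label, model confidence); toy values
    ("A", "B"): ("yes", 0.9),
    ("B", "C"): ("yes", 0.8),
    ("A", "C"): ("no", 0.6),
}

inconsistent = (preds[("A", "B")][0] == "yes"
                and preds[("B", "C")][0] == "yes"
                and preds[("A", "C")][0] == "no")
if inconsistent:
    weakest = min(preds, key=lambda pair: preds[pair][1])
    label, conf = preds[weakest]
    preds[weakest] = ("no" if label == "yes" else "yes", conf)

print(preds)  # ("A", "C") flipped to "yes", restoring transitivity
```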

  • 56

    • Examples of relational data in text-mining:

    ‣ information extraction: predicting that a word-sequence belongs to a particular class (e.g., person, location)

    ‣ topic segmentation: segmenting discourse into topically coherent chunks

Predictive Analysis: turning data into (training and test) instances

  • 57

    • Examples of relational data in text-mining:

    ‣ information extraction: predicting that a word-sequence belongs to a particular class (e.g., person, location)

    ‣ topic segmentation: segmenting discourse into topically coherent chunks

Predictive Analysis: turning data into (training and test) instances

    e.g., President Barack Obama gives farewell speech in Chicago – as it happened

  • 58

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What is a good feature representation for this task?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 59

    • We want our model to “learn” to recognize a concept

    • So, what does it mean to learn?

Predictive Analysis: training and test data

  • 60

    • The machine learning definition of learning:

    A machine learns with respect to a particular task T, performance metric P, and experience E, if the system improves its performance P at task T following experience E. -- Tom Mitchell

Predictive Analysis: training and test data

  • 61

• We want our model to improve its generalization performance!

    • That is, its performance on previously unseen data!

    • Generalize: to derive or induce a general conception or principle from particulars. -- Merriam-Webster

    • In order to test generalization performance, the training and test data cannot be the same.

    • Why?

Predictive Analysis: training and test data

  • 62

Training Data + Representation: what could possibly go wrong?

  • 63

    • While we don’t want to test on training data, models usually perform the best when the training and test set are derived from the same “probability distribution”.

    • What does that mean?

Predictive Analysis: training and test data

  • 64

Predictive Analysis: training and test data

    [Diagram: Data (positive and negative instances) to be divided into Training Data and Test Data; how to split is the open question]

  • 65

Predictive Analysis: training and test data

    [Diagram: one particular split of the positive and negative instances into Training Data and Test Data]

    • Is this a good partitioning? Why or why not?

  • 66

Predictive Analysis: training and test data

    [Diagram: random samples of the data (positive and negative instances) form the Training Data and the Test Data]

  • 67

Predictive Analysis: training and test data

    [Diagram: random samples of the data form the Training Data and the Test Data]

    • On average, random sampling should produce comparable data for training and testing
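    A minimal sketch with scikit-learn; stratification keeps the class proportions similar across the two sets (toy data, illustrative names):

```python
# Random, stratified train/test split: both sets keep a similar
# proportion of positive and negative examples.
from sklearn.model_selection import train_test_split

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(y_train, y_test)  # similar pos/neg proportions in both
```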

  • 68

Predictive Analysis: training and test data

    • If you want to predict stock price by analyzing tweets, how should the training and test data be separated?

    [Timeline: t_1 t_2 t_3 t_4 t_5]

    Slide borrowed from Heejun Kim

  • 69

Predictive Analysis: training and test data

    • If you want to predict stock price by analyzing tweets, how should the training and test data be separated?

    [Timeline: training data drawn from the earlier periods (t_1 ... t_4), test data from the latest period (t_5)]

    Slide borrowed from Heejun Kim
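    A minimal sketch of such a time-based split (the timestamps and records are illustrative assumptions):

```python
# Time-ordered split: train on the past, test on the most recent period,
# so the evaluation mimics actually predicting the future.
records = [  # (time, tweet text, stock movement); toy data
    ("t1", "tweets about earnings", "up"),
    ("t2", "tweets about layoffs", "down"),
    ("t3", "tweets about a merger", "up"),
    ("t4", "tweets about a lawsuit", "down"),
    ("t5", "tweets about guidance", "up"),
]

train, test = records[:-1], records[-1:]  # t1..t4 vs t5
print(len(train), "training records,", len(test), "test record")
```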

  • 70

    • Models usually perform the best when the training and test set have:

    ‣ a similar proportion of positive and negative examples

    ‣ a similar co-occurrence of feature-values and each target class value

Predictive Analysis: training and test data

  • 71

Predictive Analysis: training and test data

    • Caution: in some situations, partitioning the data randomly might inflate performance in an unrealistic way!

    • How the data is split into training and test sets determines what we can claim about generalization performance

    • The appropriate split between training and test sets is usually determined on a case-by-case basis

  • 72

    • Spam detection: should the training and test sets contain email messages from the same sender, same recipient, and/or same timeframe?

    • Topic segmentation: should the training and test sets contain potential boundaries from the same discourse?

    • Opinion mining for movie reviews: should the training and test sets contain reviews for the same movie?

    • Sentiment analysis: should the training and test sets contain blog posts from the same discussion thread?

Predictive Analysis: discussion

  • 73

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What type of learning algorithm should I use?

    • What is a good feature representation for this task?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 74

    • Linear classifiers

    • Decision tree classifiers

    • Instance-based classifiers

Predictive Analysis: three types of classifiers

  • 75

    • All types of classifiers learn to make predictions based on the input feature values

    • However, different types of classifiers combine the input feature values in different ways

• Chapter 3 in the book refers to a trained model as a knowledge representation

Predictive Analysis: three types of classifiers

  • 76

Predictive Analysis: linear classifiers: perceptron algorithm

  • 77

Predictive Analysis: linear classifiers: perceptron algorithm

    output = w_0 + (w_1 × f_1) + (w_2 × f_2) + ... + (w_n × f_n)

    (the w_i are parameters learned by the model; the sign of the output gives the predicted value, e.g., 1 = positive, 0 = negative)

  • 78

test instance:

    f_1  f_2  f_3
    0.5  1    0.2

    model weights:

    w_0  w_1  w_2  w_3
    2    -5   2    1

    Predictive Analysis: linear classifiers: perceptron algorithm

    What is the output? (predicted value: e.g., 1 = positive, 0 = negative)

  • 79

test instance:

    f_1  f_2  f_3
    0.5  1    0.2

    model weights:

    w_0  w_1  w_2  w_3
    2    -5   2    1

    output = 2.0 + (0.5 × -5.0) + (1.0 × 2.0) + (0.2 × 1.0)

    output = 1.7

    output > 0 → prediction = positive

    Predictive Analysis: linear classifiers: perceptron algorithm
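    The same computation in a few lines of Python (a sketch; the 0-threshold decision rule is the standard perceptron convention):

```python
# Perceptron prediction: bias w_0 plus the weighted sum of feature values;
# predict positive when the output is above 0.
weights = [2.0, -5.0, 2.0, 1.0]  # w_0, w_1, w_2, w_3
features = [0.5, 1.0, 0.2]       # f_1, f_2, f_3

output = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
print(output)                                    # 1.7
print("positive" if output > 0 else "negative")  # positive
```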

  • 80

Predictive Analysis: linear classifiers: perceptron algorithm

    (two-feature example borrowed from the Witten et al. textbook)

  • 81

Predictive Analysis: linear classifiers: perceptron algorithm

    (source:http://en.wikipedia.org/wiki/File:Svm_separating_hyperplanes.png)

  • 82

Predictive Analysis: linear classifiers: logistic regression

    (source:https://en.wikipedia.org/wiki/Logistic_regression#/media/File:Logistic-curve.svg)

[Figure: the logistic (sigmoid) curve; predict positive when the output probability ≥ 0.5]

    Slide borrowed from Heejun Kim

  • 83

Predictive Analysis: linear classifiers: perceptron algorithm

    • Would a linear classifier do well on positive (black) and negative (white) data that looks like this?

    [Figure: black/white scatter plot over features x1 and x2, axis ticks at 0.5 and 1.0]

  • 84

    • Linear classifiers

    • Decision tree classifiers

    • Instance-based classifiers

Predictive Analysis: three types of classifiers

  • 85

Predictive Analysis: example of decision tree classifier: Brother(X,Y)

    sameparents?
    ├─ no → brother = no
    └─ yes → gender_1?
        ├─ female → brother = no
        └─ male → gender_2?
            ├─ female → brother = no
            └─ male → brother = yes

  • 86

Predictive Analysis: decision tree classifiers

    • Draw a decision tree that would perform perfectly on this training data!

    [Figure: the black/white scatter plot over x1 and x2 from the previous slides]

  • 87

Predictive Analysis: example of decision tree classifier

    X1 > 0.5?
    ├─ no → X2 > 0.5?
    │   ├─ no → white
    │   └─ yes → black
    └─ yes → X2 > 0.5?
        ├─ no → black
        └─ yes → white

    [Figure: the corresponding black/white scatter plot over x1 and x2]

    Slide borrowed from Heejun Kim
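    A sketch of the same idea with scikit-learn; the four prototype points standing in for the figure's clusters are an assumption:

```python
# A small decision tree fit on XOR-like points: no single line separates
# the classes, but two axis-aligned splits classify them perfectly.
from sklearn.tree import DecisionTreeClassifier

X = [[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]]
y = ["white", "black", "black", "white"]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[0.9, 0.9], [0.1, 0.9]]))  # ['white' 'black']
```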

  • 88

    • Linear classifiers

    • Decision tree classifiers

    • Instance-based classifiers

Predictive Analysis: three types of classifiers

  • 89

Predictive Analysis: instance-based classifiers

    • predict the class associated with the most similar training examples

    [Figure: black/white scatter plot over x1 and x2, with an unlabeled test point marked "?"]

  • 90

Predictive Analysis: instance-based classifiers

    • predict the class associated with the most similar training examples

    [Figure: the same plot, with the test point "?" shown relative to its nearest training examples]

  • 91

    • Assumption: instances with similar feature values should have a similar label

    • Given a test instance, predict the label associated with its nearest neighbors

• There are many different similarity/distance metrics for comparing training and test instances

    • There are many ways of combining labels from multiple training instances

Predictive Analysis: instance-based classifiers
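    A minimal k-nearest-neighbors sketch with scikit-learn (the training points and k = 3 are illustrative assumptions):

```python
# Instance-based classification: label a test point by the labels of its
# nearest training examples.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.2, 0.3], [0.3, 0.2], [0.8, 0.9], [0.9, 0.8]]
y_train = ["white", "white", "black", "black"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[0.85, 0.85]]))  # ['black'], the majority of its neighbors
```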

  • 92

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What is a good feature representation for this task?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • Topics posted

    • Fake News detection

    • Gender bias on Twitter

    • News to predict stock price

    • Categorize posts on reddit forums

    • Topic categorization for emails

    • Chapel Hill restaurant recommender

    • Detect cyber-bullying on social media

    • Tweet sentiment

    • Personality trait recognition