Predictive Analysis of Text: Concepts, Features, and Instances
Sandeep Avula, [email protected]


  • Predictive Analysis of Text: Concepts, Features, and Instances

Sandeep Avula, [email protected]

  • Bias vs. Variance

    Fig source: http://scott.fortmann-roe.com/docs/BiasVariance.html

  • Overfitting

    How would you fit a curve/line through these points?

  • Overfitting

    Fig source: https://algotrading101.com/blog/1543426/what-is-curve-fitting-overfitting-in-trading-optimization

  • Curse of Dimensionality

    Fig source: https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335

  • Optimization

    Fig source: http://vrand.com/education.html

  • 8

    • Objective: developing and evaluating computer programs that automatically detect a particular concept in natural language text

    Predictive Analysis of Text

  • 9

Predictive Analysis: basic ingredients

    1. Training data: a set of positive and negative examples of the concept we want to automatically recognize

    2. Representation: a set of features that we believe are useful in recognizing the desired concept

    3. Learning algorithm: a computer program that uses the training data to learn a predictive model of the concept

  • 10

Predictive Analysis: basic ingredients

    4. Model: a function that describes a predictive relationship between feature values and the presence of the concept

    5. Test data: a set of previously unseen examples used to estimate the model’s effectiveness

    6. Performance metrics: a set of statistics used to measure the predictive effectiveness of the model

  • 11

Predictive Analysis: training and testing

    [Diagram: training: labeled examples → machine learning algorithm → model; testing: new, unlabeled examples → model → predictions]
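    A minimal sketch of this train-then-predict loop in Python with scikit-learn; the tiny dataset, labels, and variable names are illustrative assumptions, not from the slides:

```python
# Sketch of the training/testing loop: labeled examples -> learning
# algorithm -> model, then model -> predictions on unlabeled examples.
# The toy dataset below is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = [
    "flu symptoms and treatment options",
    "stock market rally continues",
    "vaccine side effects reported",
    "election results announced today",
]
labels = ["health", "other", "health", "other"]

vectorizer = CountVectorizer()                     # representation: bag-of-words
X_train = vectorizer.fit_transform(labeled_texts)

model = LogisticRegression().fit(X_train, labels)  # training

X_new = vectorizer.transform(["new report on a flu outbreak"])
print(model.predict(X_new))                        # testing: predictions
```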

  • 12

Predictive Analysis: concept, instances, and features

    (rows = instances, columns = features, rightmost column = concept label)

    color  size   sides  equalsides  ...  label
    red    big    3      no          ...  yes
    green  big    3      yes         ...  yes
    blue   small  inf    yes         ...  no
    blue   small  4      yes         ...  no
    ...    ...    ...    ...         ...  ...
    red    big    3      yes         ...  yes

  • 13

Predictive Analysis: training and testing

    [Diagram: training: labeled examples → machine learning algorithm → model; testing: new, unlabeled examples → model → predictions]

    Training data (labeled examples):

    color  size   sides  equalsides  ...  label
    red    big    3      no          ...  yes
    green  big    3      yes         ...  yes
    blue   small  inf    yes         ...  no
    blue   small  4      yes         ...  no
    ...    ...    ...    ...         ...  ...
    red    big    3      yes         ...  yes

    Test data (new, unlabeled examples):

    color  size   sides  equalsides  ...  label
    red    big    3      no          ...  ???
    green  big    3      yes         ...  ???
    blue   small  inf    yes         ...  ???
    blue   small  4      yes         ...  ???
    ...    ...    ...    ...         ...  ???
    red    big    3      yes         ...  ???

  • 14

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What is a good feature representation for a task?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 15

    • Learning algorithms can recognize some concepts better than others

    • What are some properties of concepts that are easier to recognize?

Predictive Analysis: concepts

  • 16

    • Option 1: can a human recognize the concept?

Predictive Analysis: concepts

  • 17

    • Option 1: can a human recognize the concept?

    • Option 2: can two or more humans recognize the concept independently and do they agree?

Predictive Analysis: concepts

  • 18

    • Option 1: can a human recognize the concept?

    • Option 2: can two or more humans recognize the concept independently and do they agree?

    • Option 2 is better.

• In fact, models are sometimes evaluated as if they were an independent assessor

    • How does the model’s performance compare to the performance of one assessor with respect to another?

    ‣ One assessor produces the “ground truth” and the other produces the “predictions”

Predictive Analysis: concepts

  • 19

Predictive Analysis: measures of agreement: percent agreement

(rows: assessor 1, columns: assessor 2)

           yes   no
    yes     A     B
    no      C     D

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

    % agreement = (? + ?) / (? + ? + ? + ?)

  • 20

           yes   no
    yes     A     B
    no      C     D

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

    % agreement = (A + D) / (A + B + C + D)

Predictive Analysis: measures of agreement: percent agreement

  • 21

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

Predictive Analysis: measures of agreement: percent agreement

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    % agreement = ???

  • 22

    • Percent agreement: percentage of instances for which both assessors agree that the concept occurs or does not occur

Predictive Analysis: measures of agreement: percent agreement

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    % agreement = (5 + 75) / 100 = 80%
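    A quick check of this computation in Python (a minimal sketch; the function name is ours, not from the slides):

```python
# Percent agreement from the 2x2 contingency table cells A, B, C, D.
def percent_agreement(a, b, c, d):
    return (a + d) / (a + b + c + d)

print(percent_agreement(5, 5, 15, 75))  # 0.8, matching (5 + 75) / 100
```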

  • 23

    • Problem: percent agreement does not account for agreement due to random chance.

    • How can we compute the expected agreement due to random chance?

    • Option 1: assume unbiased assessors

    • Option 2: assume biased assessors

Predictive Analysis: measures of agreement: percent agreement

  • 24

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    ??    ??     50
    no     ??    ??     50
    total  50    50

  • 25

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

  • 26

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

    random chance % agreement = ???

  • 27

    • Option 1: unbiased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

    random chance % agreement = (25 + 25) / 100 = 50%

  • 28

Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to random chance

• P(a) = percent of observed agreement

    • P(e) = percent of agreement due to random chance

    • K = (P(a) - P(e)) / (1 - P(e))
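    Putting the definition into a few lines of Python (a sketch; the numbers are taken from the examples above, where P(a) = 0.80 and unbiased P(e) = 0.50):

```python
# Cohen's kappa: agreement corrected for expected chance agreement.
def kappa(p_a, p_e):
    return (p_a - p_e) / (1 - p_e)

print(kappa(0.80, 0.50))  # 0.6 for the unbiased-assessor example
```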

  • 29

Observed agreement:

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to unbiased chance

    Expected agreement due to unbiased chance:

           yes   no   total
    yes    25    25     50
    no     25    25     50
    total  50    50

  • 30

    • Option 2: biased assessors

Predictive Analysis: kappa agreement: chance-corrected % agreement

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    biased chance % agreement = ???

  • Predictive Analysis: probability calculation

• When throwing a die twice, what is the probability that the first is an even number and the second is a multiple of 3?

    Even numbers: 2, 4, 6. Multiples of 3: 3, 6.

    31 (Slide borrowed from Heejun Kim)

  • Predictive Analysis: probability calculation

• When throwing a die twice, what is the probability that the first is an even number and the second is a multiple of 3?

    Even numbers: 2, 4, 6 → P(even) = 3/6 = 1/2

    Multiples of 3: 3, 6 → P(multiple of 3) = 2/6 = 1/3

    P(both) = 1/2 × 1/3 = 1/6

    32 (Slide borrowed from Heejun Kim)
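    The product rule above is easy to sanity-check by simulation; a minimal Python sketch (the trial count is arbitrary):

```python
# Estimate P(first die even AND second die a multiple of 3) by simulation.
import random

trials = 100_000
hits = sum(
    1 for _ in range(trials)
    if random.randint(1, 6) % 2 == 0 and random.randint(1, 6) % 3 == 0
)
print(hits / trials)  # close to 1/6 = 0.1667
```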

  • 33

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to biased chance

  • 34

           yes   no   total
    yes     5     5     10
    no     15    75     90
    total  20    80

    Predictive Analysis: kappa agreement: chance-corrected % agreement

    • Kappa agreement: percent agreement after correcting for the expected agreement due to biased chance
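    Worked out from each assessor's marginals in the table above, using the standard biased-chance calculation for Cohen's kappa:

    P(e) = (10/100 × 20/100) + (90/100 × 80/100) = 0.02 + 0.72 = 0.74

    K = (0.80 - 0.74) / (1 - 0.74) = 0.06 / 0.26 ≈ 0.23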

  • 35

• INPUT: unlabeled data, annotators, coding manual
    • OUTPUT: labeled data

    1. using the latest coding manual, have all annotators label some previously unseen portion of the data (~10%)

    2. measure inter-annotator agreement (Kappa)

    3. IF agreement < X, THEN:

    ‣ refine the coding manual, using disagreements to resolve inconsistencies and clarify definitions

    ‣ return to 1

    • ELSE

    ‣ have annotators label the remainder of the data independently, and EXIT

Predictive Analysis: data annotation process

  • 36

    • What is good (Kappa) agreement?

    • It depends on who you ask

    • According to Landis and Koch, 1977:

    ‣ 0.81 - 1.00: almost perfect

‣ 0.61 - 0.80: substantial

    ‣ 0.41 - 0.60: moderate

    ‣ 0.21 - 0.40: fair

    ‣ 0.00 - 0.20: slight

    ‣ < 0.00: no agreement

Predictive Analysis: data annotation process

  • Predictive Analysis: kappa agreement: chance-corrected % agreement

Case  P(a)  P(e)  Kappa
    1     0.5   0.1
    2     0.5   0.2
    3     0.5   0.3
    4     0.5   0.4
    5     0.5   0.5

    • Kappa agreement simulation

37 (Slide borrowed from Heejun Kim)

  • Predictive Analysis: kappa agreement: chance-corrected % agreement

Case  P(a)  P(e)  Kappa
    1     0.5   0.1   0.44
    2     0.5   0.2   0.375
    3     0.5   0.3   0.29
    4     0.5   0.4   0.17
    5     0.5   0.5   0

    • Kappa agreement simulation

38 (Slide borrowed from Heejun Kim)
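    The table is straightforward to reproduce with the kappa formula (a sketch; rounding matches the slide's values):

```python
# Reproduce the kappa simulation table: fixed P(a) = 0.5, varying P(e).
def kappa(p_a, p_e):
    return (p_a - p_e) / (1 - p_e)

for p_e in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(p_e, round(kappa(0.5, p_e), 3))  # 0.444, 0.375, 0.286, 0.167, 0.0
```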

  • 39

    • Question: requests information about the course content

    • Answer: contributes information in response to a question

    • Issue: expresses a problem with the course management

    • Issue Resolution: attempts to resolve a previously raised issue

    • Positive Ack: positive sentiment about a previous post

    • Negative Ack: negative sentiment about a previous post

    • Other: serves a different purpose

Predictive Analysis: data annotation process

  • 40

Predictive Analysis: data annotation process

  • 41

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • What is a good feature representation for this task?

    • How should I divide the data into training and test sets?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 42

    • For many text-mining applications, turning the data into instances for training and testing is fairly straightforward

    • Easy case: instances are self-contained, independent units of analysis

    ‣ topic categorization: instances = documents

    ‣ opinion mining: instances = product reviews

    ‣ bias detection: instances = political blog posts

    ‣ emotion detection: instances = support group posts

Predictive Analysis: turning data into (training and test) instances

  • 43

Topic Categorization: predicting health-related documents

    (rows = instances/documents, columns = word features, rightmost column = concept label)

    w_1  w_2  w_3  ...  w_n  label
    1    1    0    ...  0    health
    0    0    0    ...  0    other
    0    0    0    ...  0    other
    0    1    0    ...  1    other
    ...  ...  ...  ...  ...  ...
    1    0    0    ...  1    health
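    A minimal sketch of producing such w_1 ... w_n binary word features with scikit-learn (the two example documents are illustrative assumptions):

```python
# Turn documents (instances) into binary bag-of-words features:
# one column per vocabulary word, 1 if the word occurs in the document.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the flu vaccine is effective", "the senate passed the bill"]
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the w_1 ... w_n columns
print(X.toarray())                         # one feature row per document
```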

  • 44

Opinion Mining: predicting positive/negative movie reviews

    (rows = instances/reviews, columns = word features, rightmost column = concept label)

    w_1  w_2  w_3  ...  w_n  label
    1    1    0    ...  0    positive
    0    0    0    ...  0    negative
    0    0    0    ...  0    negative
    0    1    0    ...  1    negative
    ...  ...  ...  ...  ...  ...
    1    0    0    ...  1    positive

  • 45

Bias Detection: predicting liberal/conservative blog posts

    (rows = instances/posts, columns = word features, rightmost column = concept label)

    w_1  w_2  w_3  ...  w_n  label
    1    1    0    ...  0    liberal
    0    0    0    ...  0    conservative
    0    0    0    ...  0    conservative
    0    1    0    ...  1    conservative
    ...  ...  ...  ...  ...  ...
    1    0    0    ...  1    liberal

  • 46

    • A not-so-easy case: relational data

    • The concept to be learned is a relation between pairs of objects

Predictive Analysis: turning data into (training and test) instances

  • 47

Predictive Analysis: example of relational data: Brother(X,Y)

    (example borrowed and modified from the Witten et al. textbook)

  • 48

Predictive Analysis: example of relational data: Brother(X,Y)

    (rows = instances/pairs, columns = features, rightmost column = concept label)

    name_1  gender_1  mother_1  father_1  name_2  gender_2  mother_2  father_2  brother
    steven  male      peggy     peter     graham  male      peggy     peter     yes
    ian     male      grace     ray       brian   male      grace     ray       yes
    anna    female    pam       ian       nikki   female    pam       ian       no
    pippa   female    grace     ray       brian   male      grace     ray       no
    steven  male      peggy     peter     brian   male      grace     ray       no
    ...
    anna    female    pam       ian       brian   male      grace     ray       no

  • 49

    • A not-so-easy case: relational data

    • Each instance should correspond to an object pair (which may or may not share the relation of interest)

    • May require features that characterize properties of the pair

Predictive Analysis: turning data into (training and test) instances

  • 50

Predictive Analysis: example of relational data: Brother(X,Y)

    (can we think of a better feature representation?)

    name_1  gender_1  mother_1  father_1  name_2  gender_2  mother_2  father_2  brother
    steven  male      peggy     peter     graham  male      peggy     peter     yes
    ian     male      grace     ray       brian   male      grace     ray       yes
    anna    female    pam       ian       nikki   female    pam       ian       no
    pippa   female    grace     ray       brian   male      grace     ray       no
    steven  male      peggy     peter     brian   male      grace     ray       no
    ...
    anna    female    pam       ian       brian   male      grace     ray       no

  • 51

Predictive Analysis: example of relational data: Brother(X,Y)

    (rows = instances/pairs, columns = features, rightmost column = concept label)

    gender_1  gender_2  sameparents  brother
    male      male      yes          yes
    male      male      yes          yes
    female    female    yes          no
    female    male      yes          no
    male      male      no           no
    ...
    female    male      no           no
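    A small sketch of deriving these pair features from per-person records (the dictionary structure and field names are illustrative assumptions):

```python
# Build the gender_1 / gender_2 / sameparents pair features for Brother(X, Y).
people = {
    "steven": {"gender": "male",   "mother": "peggy", "father": "peter"},
    "graham": {"gender": "male",   "mother": "peggy", "father": "peter"},
    "anna":   {"gender": "female", "mother": "pam",   "father": "ian"},
}

def pair_features(x, y):
    px, py = people[x], people[y]
    return {
        "gender_1": px["gender"],
        "gender_2": py["gender"],
        "sameparents": px["mother"] == py["mother"]
                       and px["father"] == py["father"],
    }

print(pair_features("steven", "graham"))  # both male, sameparents True
```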

  • 52

    • A not-so-easy case: relational data

    • There is still an issue that we’re not capturing! Any ideas?

    • Hint: In this case, should the predicted labels really be independent?

Predictive Analysis: turning data into (training and test) instances

  • 53

Brother(A,B) = yes

    Brother(B,C) = yes

    Brother(A,C) = no

    (an inconsistent set of independent predictions: in this setting brotherhood is transitive, so the third label cannot be right if the first two are)

Predictive Analysis: turning data into (training and test) instances

  • 54

    • In this case, what we would really want is:

    ‣ a method that does joint prediction on the test set

    ‣ a method whose joint predictions satisfy a set of known properties about the data as a whole (e.g., transitivity)

Predictive Analysis: turning data into (training and test) instances

  • 55

    • There are learning algorithms that incorporate relational constraints between predictions

    • However, they are beyond the scope of this class

    • We’ll be covering algorithms that make independent predictions on instances

    • That said, many algorithms output prediction confidence values

    • Heuristics can be used to disfavor inconsistencies

Predictive Analysis: turning data into (training and test) instances
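    A hedged sketch of one such heuristic (ours, not from the slides): in an inconsistent triple of predictions, flip the label the model was least confident about.

```python
# Illustrative post-processing only: enforce transitivity over independent
# pairwise Brother predictions by flipping the least confident label.
preds = {  # pair -> (predicted label, model confidence); toy values
    ("A", "B"): ("yes", 0.9),
    ("B", "C"): ("yes", 0.8),
    ("A", "C"): ("no", 0.6),
}

inconsistent = (preds[("A", "B")][0] == "yes"
                and preds[("B", "C")][0] == "yes"
                and preds[("A", "C")][0] == "no")
if inconsistent:
    weakest = min(preds, key=lambda pair: preds[pair][1])
    label, conf = preds[weakest]
    preds[weakest] = ("no" if label == "yes" else "yes", conf)

print(preds)  # ("A", "C") flipped to "yes", restoring transitivity
```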

  • 56

    • Examples of relational data in text-mining:

    ‣ information extraction: predicting that a word-sequence belongs to a particular class (e.g., person, location)

    ‣ topic segmentation: segmenting discourse into topically coherent chunks

Predictive Analysis: turning data into (training and test) instances

  • 57

    • Examples of relational data in text-mining:

    ‣ information extraction: predicting that a word-sequence belongs to a particular class (e.g., person, location)

    ‣ topic segmentation: segmenting discourse into topically coherent chunks

Predictive Analysis: turning data into (training and test) instances

    e.g., President Barack Obama gives farewell speech in Chicago – as it happened

  • 58

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What is a good feature representation for this task?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 59

    • We want our model to “learn” to recognize a concept

    • So, what does it mean to learn?

Predictive Analysis: training and test data

  • 60

    • The machine learning definition of learning:

    A machine learns with respect to a particular task T, performance metric P, and experience E, if the system improves its performance P at task T following experience E. -- Tom Mitchell

Predictive Analysis: training and test data

  • 61

• We want our model to improve its generalization performance!

    • That is, its performance on previously unseen data!

    • Generalize: to derive or induce a general conception or principle from particulars. -- Merriam-Webster

    • In order to test generalization performance, the training and test data cannot be the same.

    • Why?

Predictive Analysis: training and test data

  • 62

Training Data + Representation: what could possibly go wrong?

  • 63

    • While we don’t want to test on training data, models usually perform the best when the training and test set are derived from the same “probability distribution”.

    • What does that mean?

Predictive Analysis: training and test data

  • 64

Predictive Analysis: training and test data

    [Diagram: Data (positive and negative instances) to be divided into Training Data and Test Data; how to split is the open question]

  • 65

Predictive Analysis: training and test data

    [Diagram: one particular split of the positive and negative instances into Training Data and Test Data]

    • Is this a good partitioning? Why or why not?

  • 66

Predictive Analysis: training and test data

    [Diagram: random samples of the data (positive and negative instances) form the Training Data and the Test Data]

  • 67

Predictive Analysis: training and test data

    [Diagram: random samples of the data form the Training Data and the Test Data]

    • On average, random sampling should produce comparable data for training and testing
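    A minimal sketch with scikit-learn; stratification keeps the class proportions similar across the two sets (toy data, illustrative names):

```python
# Random, stratified train/test split: both sets keep a similar
# proportion of positive and negative examples.
from sklearn.model_selection import train_test_split

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(y_train, y_test)  # similar pos/neg proportions in both
```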

  • 68

Predictive Analysis: training and test data

    • If you want to predict stock price by analyzing tweets, how should the training and test data be separated?

    [Timeline: t_1 t_2 t_3 t_4 t_5]

    Slide borrowed from Heejun Kim

  • 69

Predictive Analysis: training and test data

    • If you want to predict stock price by analyzing tweets, how should the training and test data be separated?

    [Timeline: training data drawn from the earlier periods (t_1 ... t_4), test data from the latest period (t_5)]

    Slide borrowed from Heejun Kim
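    A minimal sketch of such a time-based split (the timestamps and records are illustrative assumptions):

```python
# Time-ordered split: train on the past, test on the most recent period,
# so the evaluation mimics actually predicting the future.
records = [  # (time, tweet text, stock movement); toy data
    ("t1", "tweets about earnings", "up"),
    ("t2", "tweets about layoffs", "down"),
    ("t3", "tweets about a merger", "up"),
    ("t4", "tweets about a lawsuit", "down"),
    ("t5", "tweets about guidance", "up"),
]

train, test = records[:-1], records[-1:]  # t1..t4 vs t5
print(len(train), "training records,", len(test), "test record")
```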

  • 70

    • Models usually perform the best when the training and test set have:

    ‣ a similar proportion of positive and negative examples

    ‣ a similar co-occurrence of feature-values and each target class value

Predictive Analysis: training and test data

  • 71

Predictive Analysis: training and test data

    • Caution: in some situations, partitioning the data randomly might inflate performance in an unrealistic way!

    • How the data is split into training and test sets determines what we can claim about generalization performance

    • The appropriate split between training and test sets is usually determined on a case-by-case basis

  • 72

    • Spam detection: should the training and test sets contain email messages from the same sender, same recipient, and/or same timeframe?

    • Topic segmentation: should the training and test sets contain potential boundaries from the same discourse?

    • Opinion mining for movie reviews: should the training and test sets contain reviews for the same movie?

    • Sentiment analysis: should the training and test sets contain blog posts from the same discussion thread?

Predictive Analysis: discussion

  • 73

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What type of learning algorithm should I use?

    • What is a good feature representation for this task?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • 74

    • Linear classifiers

    • Decision tree classifiers

    • Instance-based classifiers

Predictive Analysis: three types of classifiers

  • 75

    • All types of classifiers learn to make predictions based on the input feature values

    • However, different types of classifiers combine the input feature values in different ways

• Chapter 3 in the book refers to a trained model as a knowledge representation

Predictive Analysis: three types of classifiers

  • 76

Predictive Analysis: linear classifiers: perceptron algorithm

  • 77

Predictive Analysis: linear classifiers: perceptron algorithm

    output = w_0 + (w_1 × f_1) + (w_2 × f_2) + ... + (w_n × f_n)

    (the w_i are parameters learned by the model; the sign of the output gives the predicted value, e.g., 1 = positive, 0 = negative)

  • 78

test instance:

    f_1  f_2  f_3
    0.5  1    0.2

    model weights:

    w_0  w_1  w_2  w_3
    2    -5   2    1

    Predictive Analysis: linear classifiers: perceptron algorithm

    What is the output? (predicted value: e.g., 1 = positive, 0 = negative)

  • 79

test instance:

    f_1  f_2  f_3
    0.5  1    0.2

    model weights:

    w_0  w_1  w_2  w_3
    2    -5   2    1

    output = 2.0 + (0.5 × -5.0) + (1.0 × 2.0) + (0.2 × 1.0)

    output = 1.7

    output > 0 → prediction = positive

    Predictive Analysis: linear classifiers: perceptron algorithm
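    The same computation in a few lines of Python (a sketch; the 0-threshold decision rule is the standard perceptron convention):

```python
# Perceptron prediction: bias w_0 plus the weighted sum of feature values;
# predict positive when the output is above 0.
weights = [2.0, -5.0, 2.0, 1.0]  # w_0, w_1, w_2, w_3
features = [0.5, 1.0, 0.2]       # f_1, f_2, f_3

output = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
print(output)                                    # 1.7
print("positive" if output > 0 else "negative")  # positive
```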

  • 80

Predictive Analysis: linear classifiers: perceptron algorithm

    (two-feature example borrowed from the Witten et al. textbook)

  • 81

Predictive Analysis: linear classifiers: perceptron algorithm

    (source:http://en.wikipedia.org/wiki/File:Svm_separating_hyperplanes.png)

  • 82

Predictive Analysis: linear classifiers: logistic regression

    (source:https://en.wikipedia.org/wiki/Logistic_regression#/media/File:Logistic-curve.svg)

[Figure: the logistic (sigmoid) curve; predict positive when the output probability ≥ 0.5]

    Slide borrowed from Heejun Kim

  • 83

Predictive Analysis: linear classifiers: perceptron algorithm

    • Would a linear classifier do well on positive (black) and negative (white) data that looks like this?

    [Figure: black/white scatter plot over features x1 and x2, axis ticks at 0.5 and 1.0]

  • 84

    • Linear classifiers

    • Decision tree classifiers

    • Instance-based classifiers

Predictive Analysis: three types of classifiers

  • 85

Predictive Analysis: example of decision tree classifier: Brother(X,Y)

    sameparents?
    ├─ no → brother = no
    └─ yes → gender_1?
        ├─ female → brother = no
        └─ male → gender_2?
            ├─ female → brother = no
            └─ male → brother = yes

  • 86

Predictive Analysis: decision tree classifiers

    • Draw a decision tree that would perform perfectly on this training data!

    [Figure: the black/white scatter plot over x1 and x2 from the previous slides]

  • 87

Predictive Analysis: example of decision tree classifier

    X1 > 0.5?
    ├─ no → X2 > 0.5?
    │   ├─ no → white
    │   └─ yes → black
    └─ yes → X2 > 0.5?
        ├─ no → black
        └─ yes → white

    [Figure: the corresponding black/white scatter plot over x1 and x2]

    Slide borrowed from Heejun Kim
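    A sketch of the same idea with scikit-learn; the four prototype points standing in for the figure's clusters are an assumption:

```python
# A small decision tree fit on XOR-like points: no single line separates
# the classes, but two axis-aligned splits classify them perfectly.
from sklearn.tree import DecisionTreeClassifier

X = [[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]]
y = ["white", "black", "black", "white"]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[0.9, 0.9], [0.1, 0.9]]))  # ['white' 'black']
```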

  • 88

    • Linear classifiers

    • Decision tree classifiers

    • Instance-based classifiers

Predictive Analysis: three types of classifiers

  • 89

Predictive Analysis: instance-based classifiers

    • predict the class associated with the most similar training examples

    [Figure: black/white scatter plot over x1 and x2, with an unlabeled test point marked "?"]

  • 90

Predictive Analysis: instance-based classifiers

    • predict the class associated with the most similar training examples

    [Figure: the same plot, with the test point "?" shown relative to its nearest training examples]

  • 91

    • Assumption: instances with similar feature values should have a similar label

    • Given a test instance, predict the label associated with its nearest neighbors

• There are many different similarity/distance metrics for comparing training and test instances

    • There are many ways of combining labels from multiple training instances

Predictive Analysis: instance-based classifiers
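    A minimal k-nearest-neighbors sketch with scikit-learn (the training points and k = 3 are illustrative assumptions):

```python
# Instance-based classification: label a test point by the labels of its
# nearest training examples.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.2, 0.3], [0.3, 0.2], [0.8, 0.9], [0.9, 0.8]]
y_train = ["white", "white", "black", "black"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[0.85, 0.85]]))  # ['black'], the majority of its neighbors
```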

  • 92

    • Is a particular concept appropriate for predictive analysis?

    • What should the unit of analysis be?

    • How should I divide the data into training and test sets?

    • What is a good feature representation for this task?

    • What type of learning algorithm should I use?

    • How should I evaluate my model’s performance?

Predictive Analysis: questions

  • Topics posted

    • Fake News detection

    • Gender bias on Twitter

    • News to predict stock price

    • Categorize posts on reddit forums

    • Topic categorization for emails

    • Chapel Hill restaurant recommender

    • Detect cyber-bullying on social media

    • Tweet sentiment

    • Personality trait recognition