Natural Language Processing to Improve Student Engagement featuring Dr. Rebecca Passonneau


Transcript of Natural Language Processing to Improve Student Engagement featuring Dr. Rebecca Passonneau

Natural Language Processing to Improve Student Engagement

Becky Passonneau

November 30, 2017

Collaborators

Pyramid content evaluation: Ani Nenkova (Columbia, U Penn; 2004)

Automated scoring by unigram overlap: Ani Nenkova, Aaron Harnly, Owen Rambow (Columbia; 2005)

Automated scoring by distributional semantics: Emily Chen, Ananya Poddar, Gaurav Gite (Columbia; 2013 - 2016)

Comparison to educational rubric (main ideas): Dolores Perin (Teachers College; 2013 - 2016)

Automated pyramid and scoring by triple extraction and similarity graphs based on WordNet: Qian Yang (Tsinghua; PSU; 2016), Alisa Krivokapic (Columbia; 2016)

Automated pyramid and scoring by parsing, distributional semantics, and novel bin packing algorithm: Yanjun Gao (Penn State; 2017)

2

Asking students to summarize the main ideas of a text helps their reading and writing

3

Psychologists have posited three cognitive processes involved in summarization:

● selection of important ideas

● generalization to omit detail

● inference of implicit connections

4

Summaries that are equally good will have some ideas in common, and some differences

Very much like a Venn diagram

5

Summaries that are equally good will have some ideas in common, and some content differences

[Figure: Venn diagram of Ideas 1-12 distributed across overlapping summaries]

Very much like a Venn diagram

6

Designing a reliable rubric to measure how many important ideas each summary contains is labor intensive and potentially subjective

7

Summaries are concise

● Each idea is expressed once
○ Selection of important ideas
○ Omission of unnecessary detail

● Content evaluation task has two steps
○ Define a standard from expert summaries -- the distinct ideas weighted by importance
○ Compare the summaries to the standard -- quantify the proportion of important ideas

8

Pyramid summary content annotation builds a content model of distinct ideas from summaries written by a wise crowd (size N)

9

[Figure: Venn diagram of Content Units (CUs) 1-12 across the wise-crowd summaries]

What is pyramid content analysis?

Importance of ideas (content units, or CUs)

● Emerges from the wise crowd

● Distinguishes quality of ideas by quantity of occurrence

● Simple but effective

Pyramid summary content annotation builds a content model of distinct ideas from N reference summaries written by a wise crowd (size N)

10


A list of all the distinct ideas or Content Units (CUs), and their weights, i.e., how many summaries each occurs in

TEXT: WHAT IS MATTER

CU 1: Matter is classified by physical and chemical properties (W=3)

CU 3: All matter has energy (W=2)

. . .

CU 12: Matter can be a solid, liquid or gas (W=1)

What is pyramid content analysis?

[Figure: A Pyramid of CUs from a wise crowd of 5, layered by weight from W=5 at the top to W=1 at the bottom]

What is wise crowd content analysis?

Application of Pyramid Content Model

● In a new summary, find all the phrases that mention a model CU

● Sum the weights of the mentioned CUs

● Normalize the sum

12

[Figure: a new summary matches CUs of weight 5, 4, and 2]

Raw sum = 5 + 4 + 2 = 11

What is wise crowd content analysis?

Normalization

● A summary can express each CU once at most

● Sum the weights of the identified CUs

● Normalize the sum in one of two ways:

○ QUALITY: Divide by the maximum possible sum of weights for the same number of CUs

Did the summary mention mostly important ideas?

○ COVERAGE: Divide by the maximum possible sum of weights for the average number of CUs in the reference summaries

Did the summary mention most of the important ideas?

13
What is wise crowd content analysis?
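The scoring just described reduces to a raw weighted sum and one of two normalizing denominators. Below is a minimal Python sketch under those assumptions; the function name and all weights are made up for illustration and are not the talk's data.

```python
def pyramid_scores(matched_weights, model_weights, avg_ref_cus):
    """Hypothetical sketch of wise crowd (pyramid) content scoring.

    matched_weights: weights of the model CUs found in the new summary (each CU at most once)
    model_weights:   weights of every CU in the content model
    avg_ref_cus:     average number of CUs in the wise-crowd reference summaries
    """
    raw = sum(matched_weights)
    best_first = sorted(model_weights, reverse=True)

    # QUALITY: divide by the best possible sum for the same number of CUs
    quality = raw / sum(best_first[:len(matched_weights)])

    # COVERAGE: divide by the best possible sum for the average reference CU count
    coverage = raw / sum(best_first[:avg_ref_cus])
    return raw, quality, coverage

# The matched weights 5, 4, 2 echo the "Raw sum = 11" example above;
# the full set of model weights here is invented.
print(pyramid_scores([5, 4, 2], [5, 4, 4, 3, 2, 2, 1, 1, 1], 4))
```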

● 9 reference summaries

● All content models with m summaries, for m ∈ [1,9]

● All pairs of summaries A, B where A > B using 9 reference summaries

● Result
○ The variance around scores for A and B diverges given 4 to 5 references

● Conclusion
○ No misranking with 5 references

14
How reliable is wise crowd content analysis?

How reliable is it? Can misranking errors occur?

Five additional reliability tests

15

● Number of reference summaries for probability of misranking to be ≤ 0.1: 5

● Number of reference summaries for scores to correlate with gold standard scores: 5

● Interannotator agreement on content model (10 different pairs of models): 0.71 to 0.89

● Interannotator agreement on application of content model to new summaries (5 models): 0.77 to 0.81

● Correlation of scores of 16 systems using different content models: 0.71 to 0.96

How reliable is wise crowd content analysis?

Key differences between manual and automated methods:

Humans

● Consider a few alternative segmentations

● Sameness of meaning is a binary (yes-no) judgement

Automated methods

● Consider many possible segmentations

○ Simpler decisions

○ Many more of them

● Metric for similarity of meaning is graded from 0 to 1

● Must select the optimal segmentations and meaning similarities

16
How did we automate wise crowd content analysis?

Human segmentation into “ideas” and similarity

Sentence: Matter can be measured because it contains volume and mass

CU106: Matter has volume and mass (W=4)
Ref Sum 1: because it contains both volume and mass

Ref Sum 2: it takes up space defined as volume and contains . . . mass

Ref Sum 3: Matter is anything that has mass and takes up space (volume)

Ref Sum 4: Matter contains volume and mass

17
How did we automate wise crowd content analysis?

Three Automated Methods

● No large scale machine learning required

● All components are pre-trained

● Requires only 5 wise-crowd summaries on the same summarization task

18
How did we automate pyramid content analysis?

Three Automated Methods

● PyrScore: Requires existing manual content model

○ Brute force segmentation -- considers all possibilities

○ Distributional (statistical) semantics

● PEAK:

○ Open Information Extraction tools extract subj-pred-obj triples

○ Symbolic semantics (WordNet)

● PyrEval:

○ Sentence decomposition into clauses

○ Distributional (statistical) semantics

19
How did we automate pyramid content analysis?

PyrScore Segmentation: Brute Force

● Calculates all ngram segmentations of each sentence in a new summary

All | matter | has | energy | volume | and | mass 7 unigrams

All matter | has | energy | volume | and | mass 5 unigrams + 1 bigram

All | matter has | energy | volume | and | mass 5 unigrams + 1 bigram

. . .

All matter has | energy | volume | and | mass 4 unigrams + 1 trigram

. . .

All matter has energy volume and mass 1 7-gram

20
How did we automate wise crowd content analysis?
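Every segmentation of a sentence into contiguous ngrams corresponds to a choice of break points between adjacent tokens, so the 7-token sentence above has 2^6 = 64 candidates. A minimal enumeration sketch in Python (illustrative only, not PyrScore's actual code):

```python
from itertools import combinations

def all_segmentations(tokens):
    """Yield every split of the token list into contiguous ngrams."""
    n = len(tokens)
    for k in range(n):  # k = number of internal break points
        for breaks in combinations(range(1, n), k):
            bounds = (0, *breaks, n)
            yield [" ".join(tokens[i:j]) for i, j in zip(bounds, bounds[1:])]

tokens = "All matter has energy volume and mass".split()
segmentations = list(all_segmentations(tokens))
print(len(segmentations))   # 64 candidate segmentations
print(segmentations[0])     # one 7-gram: the whole sentence
print(segmentations[-1])    # 7 unigrams
```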

PyrScore Semantics

● Generates a latent vector representation of each phrase in a CU

CU106: Matter has volume and mass (W=4)
because it contains both volume and mass

it takes up space defined as volume and contains a certain amount of material defined as mass

Matter is anything that has mass and takes up space (volume)

Matter contains volume and mass

21
How did we automate wise crowd content analysis?

● Latent semantics:

○ Weighted Textual Matrix Factorization (WTMF; Guo and Diab, 2012)

○ Assigns small weight to unseen words

○ Word vectors trained offline

PyrScore Scoring

● Generates a WTMF vector representation of each CU phrase

● Generates a WTMF vector representation of each segment in a new summary

● Similarity to CU is the average cosine similarity to all phrases in the CU

● Optimal assignment of candidate ngrams to CUs

○ A maximum weighted independent set problem

○ Applies a greedy algorithm (WMIN; Sakai et al 2003)

22
How did we automate wise crowd content analysis?
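A minimal sketch of the similarity step described above, assuming phrase and segment vectors have already been produced offline (by WTMF in PyrScore; plain NumPy arrays stand in for them here). It shows only the averaging of cosine similarities, not the WMIN greedy selection.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def segment_to_cu_similarity(segment_vec, cu_phrase_vecs):
    """Similarity of one candidate segment to a CU: the average cosine
    similarity to all of the CU's contributor phrases."""
    return float(np.mean([cosine(segment_vec, p) for p in cu_phrase_vecs]))

# Toy vectors, purely for illustration
seg = np.array([0.1, 0.7, 0.2])
cu = [np.array([0.2, 0.6, 0.1]), np.array([0.0, 0.8, 0.3])]
print(segment_to_cu_similarity(seg, cu))
```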

PEAK (Pyramid Evaluation by Automated Knowledge Extraction)

● Segmentation: Applies Open Information Extraction tools to extract Subj-Pred-Obj (SPO) triples from sentences

Matter can be detected and measured because it contains volume and mass

Subj(Matter) Pred(Detected and measured) Obj(because it contains volume and mass)

Subj(Matter) Pred(contains) Obj(volume and mass)

. . .

● Semantics: Uses explicit representation of meaning (random walks over WordNet)

23
How did we automate wise crowd content analysis?

PEAK Aligns SPO Triples

● From different reference summaries to construct the model

● Uses a hypergraph

○ Triples are hyperedges of SPO nodes

○ Edges between nodes are semantic similarity

● Each CU is a weighted triple

24
How did we automate wise crowd content analysis?

PEAK Aligns SPO Triples

● Each CU is a weighted triple

● New summary is a list of triples

● Edges in bipartite graph added from CUs to SPOs if semantic similarity ≥ 0.50

● Uses the Munkres-Kuhn algorithm with CU weights as edge costs

25
How did we automate wise crowd content analysis?
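One plausible reading of the matching step above, sketched in Python. SciPy's linear_sum_assignment (a Hungarian-style solver) stands in for the Munkres-Kuhn step named in the talk, and the similarity matrix and CU weights below are invented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def peak_raw_score(similarity, cu_weights, threshold=0.50):
    """similarity[i, j]: semantic similarity between CU i and SPO triple j of the new summary."""
    # Favor high-similarity, high-weight pairs; pairs below the threshold get no benefit
    cost = -similarity * np.asarray(cu_weights, dtype=float)[:, None]
    cost[similarity < threshold] = 0.0
    rows, cols = linear_sum_assignment(cost)
    return sum(cu_weights[i] for i, j in zip(rows, cols) if similarity[i, j] >= threshold)

sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.7, 0.4],
                [0.1, 0.2, 0.3]])
print(peak_raw_score(sim, [5, 3, 2]))   # CU 0 and CU 1 are matched: 5 + 3 = 8
```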

PyrEval extends PyrScore

● Builds full pyramid using new weighted independent set algorithm

● Decomposes sentences into syntactically meaningful units (roughly clauses)

● Uses the same distributional semantics
○ WTMF performs better than Word2Vec
○ WTMF performs better than GloVe

● Uses the same scoring algorithm

26

PyrEval constructs a pyramid by a novel set allocation method

● Nested sets

○ Every sentence has a set of segmentations, only one of which can be selected

○ Every CU is a set of segments, each from a different summary

○ Every pyramid layer is a set of CUs

27
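A minimal sketch of the nested-set structure just described, using illustrative Python dataclasses (these are not PyrEval's actual classes): a sentence owns alternative segmentations, a CU collects segments from different summaries, and a pyramid layer collects CUs of one weight.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    text: str
    summary_id: int        # which reference summary the segment came from
    sentence_id: int       # which sentence of that summary

@dataclass
class Sentence:
    # Alternative segmentations of the sentence; only one may be selected
    segmentations: List[List[Segment]]

@dataclass
class ContentUnit:
    # Segments expressing the same idea, each from a different summary
    segments: List[Segment] = field(default_factory=list)

    @property
    def weight(self) -> int:
        return len({s.summary_id for s in self.segments})

@dataclass
class PyramidLayer:
    weight: int
    cus: List[ContentUnit] = field(default_factory=list)
```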

EDUA: Emergent discovery of units of attraction

28

● Constructs a graph

○ Nodes are segments

○ Edges weighted by force of “attraction” (e.g., semantic similarity)

● Edge types

○ Dashed edges: attraction(n_i, n_j) > α
○ Solid edges: connect segments into CUs
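A minimal construction sketch for the attraction graph described above, assuming some pairwise similarity function is available; networkx is used only for illustration and the alpha value is made up.

```python
import networkx as nx

def attraction_graph(segments, similarity, alpha=0.55):
    """Nodes are candidate segments; an edge is added wherever the pairwise
    attraction (e.g., semantic similarity) exceeds the threshold alpha."""
    g = nx.Graph()
    g.add_nodes_from(range(len(segments)))
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            w = similarity(segments[i], segments[j])
            if w > alpha:
                g.add_edge(i, j, weight=w)
    return g
```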

Assignment of segments to a CU obeys constraints

● Maximize the weighted average similarity within each pyramid layer

● Capacity of each layer y given segments x

● Relative size of each layer

● No empty layers

● One segmentation per sentence; at most one CU per segment

29

PyrEval and humans construct similar pyramids

● CUs: 69 (PyrEval) versus 60 (Annotator 1) or 46 (Annotator 2)

● Similar distribution
○ PyrEval: 1 w5, 2 w4, 7 w3, 22 w2, 37 w1

○ Annotator 1: 3 w5, 7 w4, 13 w3, 15 w2, 22 w1

● Example same weight
○ PyrEval (w5): Physical props can occur without changing the identity or nature of the matter

○ Annotator 1 (w5): Physical props can be observed without changing the identity of the matter

● Example different weight
○ PyrEval (w4): Unlike physical change, chemical change occurs when the chemical properties of the matter have changed and a new substance is produced

○ Manual (w3): The difference between a physical change and a chemical change is that a chemical change creates a new substance

30

A Rubric for Contextualized Curricular Support

● From a study of 16 community college classrooms

● 120 students wrote summaries of a middle school text, What is Matter?
○ Read the passage

○ Answered main ideas questions

○ Wrote the summary

● Researchers identified 14 main ideas

● Main ideas score of a summary: % of main ideas
○ Included partial credit

○ Interrater reliability: Pearson correlation: 0.92

31
What assessment rubric did we compare it to?

Pearson correlations of automated and manual methods

32

Method      Correlation
PyrScore    0.95
PEAK        0.82
PyrEval     0.87

What were the results?

Pearson correlations of 120 Main Ideas scores and automated methods

33

Method      Correlation with Main Ideas scores (n = 120)
PyrScore    0.83
PEAK        0.70

What were the results?
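For reference, correlations like those in these tables can be computed with scipy.stats.pearsonr; the manual and automated score vectors below are invented, not the study's data.

```python
from scipy.stats import pearsonr

manual_scores    = [0.40, 0.55, 0.62, 0.71, 0.80]   # e.g., Main Ideas rubric scores
automated_scores = [0.35, 0.58, 0.60, 0.75, 0.78]   # e.g., automated content scores

r, p_value = pearsonr(manual_scores, automated_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```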

Content scores are transparent, can support feedback

● Does the summary have enough important ideas, given its length? (Quality score)

● Does the summary have enough important ideas, given the set of possible important ideas? (Coverage score)

● Does the summary have a good balance of both? (Comprehensive score)

● Which important ideas were expressed?

● Which important ideas were missed?

34

Conclusion

● Wise Crowd Content Analysis

○ Works well to identify important ideas

○ Importance emerges from the wise crowd

○ Correlates with an independently developed main ideas rubric

○ Requires only 5 reference summaries

● Fully automated methods: PyrEval and PEAK

○ Pretrained methods, with parameter tuning on a small development set

○ Perform less well if sentences are very complex (e.g., automatic summarizers on newswire)

○ Potential to inform revision

35
Conclusion

What’s Next? Content assessment of essays

● Same ideas are referenced multiple times in the same essay, through multiple means

○ Paraphrase, definite descriptions (“the evidence shown here”), deictic pronouns (“This indicates . . .”)

○ Will require more complex methods to detect “the same” idea

● Discourse structure and function

○ Interrelations among ideas within the text

○ Discursive versus argumentative

36
What’s next?