Segmentation in NLP: A quick survey of techniques and applications Tony Davis (3M Health Information...

Segmentation in NLP:A quick survey of

techniques and applicationsTony Davis (3M Health Information Systems)

DC NLP meetup13 August 2014

What is segmentation?• Dividing a document, broadly defined as a sequential information-bearing

structure, into parts that have some internal coherence– Well known examples: book chapters, episodes of TV miniseries

(segments are contiguous, strictly ordered, disjoint, and exhaustive)– More examples: sections and subsections of a textbook, topically-based

segments defined by an index, plays in a sports broadcast(these depart from the prototype; they’re hierarchical, overlapping, “fuzzy”)

– Notice that these examples are more or less “preset”, at the time the document is created, but ad hoc segments can be useful, too

– Labeling the segments with good terms is an additional aspect of segmentationExamples: chapter titles, section headers in a clinical report, descriptions of plays in a football game

Why is segmentation useful?

• Segmentation guides and improves the user experience

• Segmentation affects meaning

What are we trying to do with automatic segmentation?

• Create “good”/”useful” segments, label them appropriately, and relate them to one another

• What makes a good segment?– Highly dependent on type of content (baseball game,

cooking show, feature-length movie, talk show)– Thus, the mix of techniques for automatic segmentation will

vary• Relating segments to one another?

– Topical similarity: the same people speaking about the same things, similar players or similar kinds of plays, common themes, …

– Continuity of narrative (subplots in a TV comedy series, continuation of a project on home repair show)

Topic-based segmentation• Topic segmentation: dividing a document into topically coherent segments

– Possibly a partition of the document (exhaustive, non-overlapping segments)– But ad hoc segments may be more useful in some circumstances– Labeling the segments with good terms is often treated separately

• Several approaches used:– Discourse features

• Some signal a topic shift; others a continuation• Highly domain-specific

– Similarity measures between adjacent blocks of text• Use document similarity measures, as in TextTiling (Hearst, 1994) or Choi’s

algorithm (Choi, 2000)• Posit boundaries at points where similarity is low

– Lexical chains: repeated occurrences of a term (or of closely related terms)• Again, posit boundaries where cohesion is low (few lexical chains cross the

boundary (e.g., Galley, et al., 2003)

Topic segmentation – text similarity• Choi’s algorithm (Choi, 2000)

– Construct a similarity matrix of all sentence pairs– “Sharpen” the distinctions by replacing similarity values with their ranks in the

local region– Coherent regions appear as brighter

square areas along the diagonal;segment boundaries lie at theircorners

Freddy Y.Y. Choi. 2000. Advances in domain independent linear text segmentation. Proceedings of NAACL 2000, 109-117.

Topic segmentation – lexical chains• LCSeg (Galley, et al., 2003)

– Based on lexical chains: repeated occurrences of a term in a document– Each chain receives a score based on the number of terms within it, and its

length (shorter chains get a higher score)– Cohesion at each potential boundary is measured by comparing scores of

chains spanning that boundary and chains on either side near that boundary– Segments are postulated where cohesion is low

Michel Galley, Kathleen McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse Segmentation of Multi-Party Conversation. Proceedings of the 41st Annual Meeting of the ACL, 562-569

Topic segmentation – modeling term influence and relatedness

• Model both the influence of a term beyond the sentence it occurs in and semantic relatedness among terms– The range of a term’s influence extends beyond the sentence it occurs in, but

how far? (relevance intervals)– Semantic relatedness among terms (contextually mediated graphs)

• Apply this model to topic-based segmentation

• And now, a short diversion about streaming media…

Geetu Ambwani and Anthony R. Davis. 2010. Contextually–Mediated Semantic Similarity Graphs for Topic Segmentation. Proceedings of the Workshop on Graph=based Methods in NLP, 48th Annual Meeting of the ACL.

Automatic segmentation of streaming media

• Streaming media presents particular difficulties– Its timed nature makes navigation cumbersome, so a useful system

should extract relevant intervals of documents rather than present a list of documents to the user

– Speech recognition is error-prone, so the system must compensate for noisy input

• To address these issues, use a mix of statistical and symbolic NLP– Create relevant intervals for each useful term, using term co-occurrence

information (the “COW model”)– Term co-occurrence measured using pointwise mutual information (PMI)

within a corpusPMI(X,Y) =

P(X,Y)/P(X)P(Y)

“Original” StreamSage system architecture

media archiveco-occurringword (COW)

model

largetext

corpus

speech recognition

SR output

topicsegmentation

WSD

anaphorresolution

queryexpansion

relevanceinterval

calculation

virtualdocumentcollection

compoundinterval

calculation

userinterfaces forapplications

Relevance Intervals (ad hoc segments)• Each RI is a contiguous segment of audio/video deemed relevant to a

term• RIs are calculated for all content words (after lemmatization) and multi-

word expressions• RI basis: sentence containing the term

Each RI is expanded forward and backward to capture relevant material, using the techniques including:

1. Topic boundary detection by changes in COW values across sentences2. Topic boundary detection via discourse markers3. Synonym-based query expansion4. Anaphor resolution

• Nearby RIs for the same term are merged• Each RI is assigned a magnitude, reflecting its likely importance to a

user searching on that term, based on the number of occurrences of the term in the RI, and the COW values of other words in the RI with the term

Relevance Intervals: an Example• Index term: squatter

Paul Bew is professor of Irish politics at Queens University in Belfast.In South Africa the government is struggling to contain a growing demand for land from its black citizens.Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg.In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property.Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days.NPR’s Kenneth Walker has a report.Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa.“Must give us a place…”

• We build an RI for squatter around each of these sentences…


Paul Bew is professor of Irish politics at Queens University in Belfast.In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand]Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg.In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand]Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days.NPR’s Kenneth Walker has a report.Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa.“Must give us a place…” [topic segment boundary]

• We build an RI for squatter around each of these sentences…


Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment boundary]In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand]Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg.In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand]Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. [merge nearby intervals]NPR’s Kenneth Walker has a report. [merge nearby intervals]Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa.“Must give us a place…” [topic segment boundary]

• Two occurrences of squatter produce a complete merged interval.

Merging Ris for multiple terms

Iran

Russia

Occurrence of Original Term

Note that this can only be done at query time,So it needs to be fairly quick and simple.

Merging RIs

Iran

Russia

ActivationSpreading


Merging RIs

Iran

Russia

Russia and Iran


Evaluating RI Quality

• Evaluation of retrieval effectiveness in timed media raises further issues:– Building a gold-standard is painstaking, and potentially quite subjective– It’s necessary to measure how closely the system’s RIs match the gold standard’s– What’s a reasonable baseline?

• We created a gold standard of about 2300 RIs with about 200 queries on about 50 documents (NPR, CNN, ABC, and business webcasts), and rated each RI on a scale of 1 (highly relevant) to 3 (marginally relevant).

• Testing performed on speech recognizer output

Evaluating RI Quality

• Measure amounts of extraneous and missed contentmissed by

systemideal RI

system RIextraneousIn system

Evaluating RI QualityComparison of percentages of median extraneous and missed content over all queries between system using COWs and system using only sentences with query terms present

relevance most relevant relevant most relevantand relevant

marginallyrelevant

systemextra miss extra miss extra miss extra miss

withCOWs 9.3 12.7 39.2 27.5 29.9 21.6 61.7 15.0

withoutCOWs 0.0 57.7 0.3 64.7 0.0 63.4 18.8 60.2

How to spot shifts in topic

• Yesterday, I took my dog to the park.• While there, I took him off the leash to get some

exercise.• After 2 minutes, Spot began chasing a squirrel.• ______________(Topic Shift)________________• I realized I needed to go grocery shopping.• So I went off to Trader Joe’s.• Unfortunately, they were out of cashews.

Calculating connection strengths for edges

• Construct a graph in which each node represents a term and a sentence, iff the sentence is contained in an RI for that term

Calculating connection strengths for edges• Label each sentence with the terms that have Ris including that

sentence; think of these as nodes in a graph


Now we’ll connect these nodes to one another,with edges that have weights

Connection strength formula

and in general, for terms a and b in sentences i and i + 1 respectively:

Build a graph of the terms in the document, sentence by sentence

Filtering edges in the graph

A real example (from CNN)

• S_190 We've got to get this addressed and hold down health care costs.

• S_191 Senator Ron Wyden the optimist from Oregon, we appreciate your time tonight.

• S_192 Thank you. • ______________(Topic Shift)______________• S_193 Coming up, the final day of free health

clinic in Kansas City, Missouri.

Graph representation of documentsAfter pruning low-weight edges, regions near a boundary tend to

be sparse

Data and parameter settings

• Two data sets– Concatenated New York Times articles

• 186 pseudodocuments, each containing 20 articles– Closed-captions for 13 TV shows

• News, talk shows, documentaries, lifestyle shows• “Noisier” than news articles

• Parameter settings– Edge-pruning threshold: remove all edges with a connection strength below a

set threshold (we tried a couple and usually used 0.5)– “Normalized novelty”: on the two sides of a potential boundary, the number

of nodes labeled with the same terms, normalized by the total number of terms

– Threshold (given normalized novelty): posit a boundary between sentences where normalized novelty falls below a set threshold (we typically used 0.6)

Systems comparedC99 Implementation of Choi (2000), from MorphAdorner*SN Our system, using a single node for each term occurrence (no

extension)FE Our system, using an extension of a fixed number of sentences for

each term from the sentence it occurs inSS Our system, using Ris without “hard” boundaries determined by

the modified Choi algorithmSS+C Our full segmentation system, incorporating “hard” boundaries

determined by the modified Choi algorithm

* morphadorner.northwestern.edu/morphadorner/-textsegmenter

Evaluation metrics• How well does the hypothesized set of boundaries match the

true (reference) set?• Pk (Beeferman, et al. 1997) and WindowDiff (Pevzner & Hearst,

2002)– Both compare hypothesis to reference segmentation within a sliding window– Pk is the proportion of windows in which hypothesis and reference disagree on

the number of boundaries– WindowDiff tallies the difference in the number of boundaries in each window– Both commonly used instead of precision and recall, because they take

approximate matching into account– They have drawbacks of their own, however

Doug Beeferman, Adam Berger, and John Lafferty. 1997. Text Segmentation Using Exponential Models. Proceedings of EMNLP 2

Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28:1

Evaluation metrics• Pk and WindowDiff in action: comparing two segmentations

Evaluation metrics• Pk and WindowDiff: sliding window is half the average reference segment

size

Evaluation metrics• Pk and WindowDiff in action: window slides one sentence forward

Evaluation metrics• Pk and WindowDiff in action: still no disagreement between the two

segmentations within the window

Evaluation metrics• Pk and WindowDiff in action: one black mark against the hypothesis

segmentation, which posits a boundary where the reference doesn’t

Evaluation metrics• Pk and WindowDiff in action: the same mistake counts again against the

hypothesis (note that mistakes closer to reference boundaries appear in fewer windows, and are therefore penalized less)

Results on pseudodocuments

system precision recall F Pk WindowDiff

C99 0.404 0.569 0.467 0.338 0.360

SN 0.096 0.112 0.099 0.570 0.702

FE 0.265 0.140 0.176 0.478 0.536

SS 0.566 0.383 0.448 0.292 0.317

SS+C 0.578 0.535 0.537 0.262 0.283

185 documents, containing 20 NYT articles eachNumber of boundaries not specified to systems

Results on pseudodocuments86 documents out of the 186, for which C99 and SS+C bothfind at least 20 boundaries; top 20 boundaries selected

system prec = rec = F Pk WindowDiff

C99 0.530 0.222 0.231

SS+C 0.643 0.192 0.201

TV show closed-captions: inter-annotator agreement on segmentation

• Annotators note major and minor boundaries, using 1-5 rating scale• As expected, IAA is low, so we create a reference annotation

TV show closed-captions: inter-annotator agreement on segmentation

• Pk values between pairs of annotators: all boundaries and major boundaries

TV show closed-captions: segmentation• Accuracy is low, which is unsurprising given the low IAA

system precision recall F Pk WindowDiff

All topic boundaries

C99 0.197 0.186 0.184 0.476 0.507

SS+C 0.315 0.208 0.240 0.421 0.462

Major topic boundaries only

C99 0.170 0.296 0.201 0.637 0.812

SS+C 0.271 0.316 0.271 0.463 0.621

Using segment-level metadata for sports broadcast segmentation

• Play-by-play data for sports– Available for several sports in near real-time– Particularly useful for sports that are inherently segmented

into discrete plays– Advantages of consistency and quality

• Many other types of segment level metadata, depending on type of content– DVD chapters– Advertising breaks in TV shows– Screenplays

Segment-level metadata for sports• Closed captions can be problematic

– Sometimes describe clearly what’s going on…

>> Kenny: THEY START FROM THEIR OWN 35. AND THIS IS GRANT WHO FUMBLED MOMENTS AGO. HAD A TERRIFIC HALF OF THE SEASON

• Here is data for this play from Stats, Inc. (near real-time):

<play quarter="1" play-start-time="14:33" play-stop-time="14:27" down="1" distance="10" away-score-before="7" away-score-after="7" home-score-before="0" home-score-after="0" yardline="GB35" end-yardline="GB43" possession="GB" end-possession="GB" details="Ryan Grant rush to the left for 8 yards to the GB43. Tackled by Darryl Tapp."/>

Segment-level metadata for sports• Alignment of this type of metadata with the video is fairly easily achieved,

because the game clock is displayed on the video

• Stats, Inc. data for play, with added video times:

<play quarter="1" play-start-time="14:33" play-stop-time="14:27" video-start-time="0:12:41" video-stop-time="0:12:45” down="1" distance="10" away-score-before="7" away-score-after="7" home-score-before="0" home-score-after="0" yardline="GB35" end-yardline="GB43" possession="GB" end-possession="GB" details="Ryan Grant rush to the left for 8 yards to the GB43. Tackled by Darryl Tapp."/>

SEA 7 14:33 Q1GB 0 DN 1 YD 10

Finding interesting plays in a football game

Segmenting clinical documents

• Clinical documents are divided into sections, but they can be messy, and the section boundaries aren’t always clear

• Segmentation affects semantics; family medical history is different from patient’s current diagnosis

• Medical coding (and ultimately reimbursement) depends on accurately discerning the structure of clinical reports

• Thus, automated medical coding requires, among other things, good segmentation (“regioning”)

Where segmentation fits into automated medical coding

ClinicalDocument

……………………………

………………………

Auto-code diagnosis

Auto-code historical

Auto-code procedure

Ignore

Billing codes:V543.12432.11………………………

Billing codesComputer Assisted

Coding Module

code_diagnosis?

no_coding?

code_procedure?

code_historical?……………………

…………………………

ExtractedSections

AssessCodability Tagging ofSections

Machine learning for clinical document segmentation

• Features used include:– Common section header terms– Formatting and punctuation (e.g., all caps, colon at end of line)– Semantic type of terms within a sentence (e.g., body parts,

medications, procedures, findings)• For each sentence, generate a label:

– new_region or continuing_region– Region use labels can be hypothesized at the same time, or

added later• Results are close to human levels of accuracy on many

kinds of documents

Summing up

• Segmentation problems are ubiquitous in NLP• Motivations are both user and machine driven• Many kinds of information to exploit,

depending on the data and the application• Resources available to get started

Segmentation in NLP: A quick survey of techniques and applications Tony Davis (3M Health Information...

Documents

Transcript of Segmentation in NLP: A quick survey of techniques and applications Tony Davis (3M Health Information...