September 5, 2001 Melanie Martin - AI Seminar 1

AI Seminar

Our web page is at:

www.cs.nmsu.edu/~gradrep

Under “Events” in left frame

Identifying Ideological Point of View: Part II

Melanie Martin

September 5, 2001

Outline of this presentation
– Where are we???
– Ideology
– Statistical NLP and Machine Learning
– Discourse features
– Internet
– Conclusion

Where are we???

Let’s recall what we want to do:

Build a system that could take information from web pages and Usenet newsgroups on a given topic and segment, classify, or cluster it by ideological point of view.

The Proposed System

[Diagram components: User inputs topic; Search Engine; Internet (Web pages, Usenet); Topic Clustering, Filtering; Set of documents on topic; Ideological Clustering; Docs on topic clustered by IPV]

Where are we???

What do we need?

– A computationally feasible definition of ideological point of view

– A search engine, possibly with additional processing, to produce a collection of documents on the topic specified by the user

Where are we???

What else do we need?

– A module to cluster documents by ideological point of view

– A user interface

– A way to evaluate the system

Where are we???

Why do we need this? Some examples using Google:
– query: back pain ~2,220,000
  • scoliosis ~121,000
– query: lyme disease ~163,000
– query: zoning shopping center ~65,100
  • (add) clark county nv ~299
– query: un racism conference ~74,000

Ideology

Working definition from van Dijk: “Ideologies are the fundamental beliefs of a group and its members.”
– instantiated as Us vs. Them
– predefined ideologies will not work across domains
– want to avoid researcher bias
– definition likely needs more work

Ideology

Linguistics
– van Dijk (1998)
– Blommaert & Verschueren (1998)
– Wang (1993)
– Wortham & Locher (1996)

Ideology

The Systems
– Ideology Machine - 1965 to 1973 - Abelson et al.
– Politics - 1979 - Carbonell
– Pauline - 1987 - Hovy
– Tracking Point of View in Narrative - 1994 - Wiebe
– Spin Doctor - 1994 - Sack
– Terminal Time - 2000 - Mateas et al.

Ideology

Some issues
– Evaluation!!!
– Hard-coded knowledge
– Domain dependence
– Cognitive plausibility
– More precise definitions

Statistical NLP and ML

Two techniques we will consider
– Latent Semantic Analysis
– Probabilistic Classification

Statistical NLP and ML

Issues
– clustering versus classification
  • categories may not be predefined
  • may want to take a variety of features into account
– favor learning over hard-coding knowledge
– supervised versus unsupervised
  • cost of annotated training data

Statistical NLP and ML

Latent Semantic Analysis
– text represented as a matrix
  • entries are weighted frequency of word in context
– semantic space obtained through SVD
  • words appearing in similar context have similar feature vectors
– characterizes semantic content of words in context
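
As a rough illustration of the LSA steps listed above, here is a minimal sketch using NumPy. The toy corpus, the log-scaled weighting, and the choice of k = 2 dimensions are assumptions made for the example, not details from the talk.

```python
import numpy as np

# Toy corpus (hypothetical); each document counts as one "context".
docs = [
    "taxes cut jobs growth",
    "taxes cut growth economy",
    "forest river wildlife protect",
    "river wildlife protect habitat",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document matrix of raw word counts.
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# One simple weighting (an assumption): log-scaled frequency.
weighted = np.log1p(counts)

# SVD, truncated to the k largest singular values, gives the semantic space.
k = 2
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
term_vectors = U[:, :k] * s[:k]    # one row per word
doc_vectors = Vt[:k, :].T * s[:k]  # one row per document


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


same_context = cosine(term_vectors[index["taxes"]], term_vectors[index["jobs"]])
other_context = cosine(term_vectors[index["taxes"]], term_vectors[index["river"]])
```

In this toy space, words that share contexts ("taxes", "jobs") end up with nearly parallel vectors, while words from unrelated contexts ("taxes", "river") are close to orthogonal. Document vectors could be clustered the same way without predefined categories.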

Statistical NLP and ML

Why LSA is a good choice here
– semantics is a key component of ideological discourse
– clustering without need for predefined categories
– already shown useful for:
  • summarization (Ando 2000)
  • text segmentation (Choi 2001)
  • measuring text coherence (Foltz 1998)

Statistical NLP and ML

We want to look a little more closely at Ando’s work
– uses term, sentence, and document vectors
– modified SVD algorithm
– interesting interface

Multi-document summarization by visualizing topical content. Rie Kubota Ando, Branimir Boguraev, Roy Byrd, and Mary Neff. ANLP/NAACL '00 Workshop on Automatic Summarization

Statistical NLP and ML

Another option is a probabilistic classifier
– assigns most probable class to an object based on a probability model
– can we get around predefined classes?

Statistical NLP and ML

Probability model
– defines joint distribution of variables
  • set of feature variables and a class variable
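
To make the idea of a joint distribution over feature variables and a class variable concrete, here is a minimal sketch. It uses a naive Bayes factorization with add-one smoothing, which is one simple choice of model and not one named in the talk. The feature names and the class labels "A" and "B" are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (feature_list, class_label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in examples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocab.add(f)
    return class_counts, feature_counts, vocab


def log_joint(features, label, model):
    """log P(label, features) under the naive Bayes factorization."""
    class_counts, feature_counts, vocab = model
    score = math.log(class_counts[label] / sum(class_counts.values()))
    denom = sum(feature_counts[label].values()) + len(vocab)
    for f in features:
        if f in vocab:  # features never seen in training are ignored
            score += math.log((feature_counts[label][f] + 1) / denom)
    return score


def classify(features, model):
    """Assign the most probable class under the model."""
    return max(model[0], key=lambda label: log_joint(features, label, model))


# Hypothetical annotated training data.
training = [
    (["us_positive", "them_negative"], "A"),
    (["us_positive", "disclaimer"], "A"),
    (["neutral_report", "quote"], "B"),
    (["neutral_report", "statistic"], "B"),
]
model = train(training)
predicted = classify(["us_positive", "quote"], model)
```

This sketch still assumes the classes are known in advance, which is exactly the limitation raised above.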

Wiebe and Bruce (1995) got around the issue of not knowing the classes in advance by breaking up the problem and using a series of classifiers

Statistical NLP and ML

We need to come up with a set of features…our next topic

Which features to use can then be decided statistically, using the goodness of fit of graphical models
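
As a small illustration of a goodness-of-fit test, the sketch below computes the likelihood-ratio statistic G² for the simplest possible model: one feature and the class are independent. A large G² means the independence model fits poorly, so the feature carries information about the class. This is only the two-variable case and is an assumption about how such a test might be set up; the tables are made-up counts.

```python
import math


def g_squared(table):
    """Likelihood-ratio statistic G^2 for independence in a two-way table.

    table[i][j] is the observed count for feature value i and class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                expected = row_totals[i] * col_totals[j] / n
                g2 += 2.0 * observed * math.log(observed / expected)
    return g2


# Made-up counts: rows are feature present / absent, columns are two classes.
informative = g_squared([[30, 10], [10, 30]])
uninformative = g_squared([[20, 20], [20, 20]])
```

For a 2x2 table the statistic has one degree of freedom, so a value above 3.84 rejects independence at the 5% level.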

Statistical NLP and ML

Both methods seem to have a lot of potential

LSA would be easier to implement
– possibly a baseline for evaluation of probabilistic classifiers

Less linguistic knowledge gain likely with LSA

Discourse features

If we use probabilistic classifiers we need features, so we look at:
– linguistics
– previous systems
– discourse theory
– literary theory

Discourse features

From linguistics and discourse:

General strategy of most ideological discourse (van Dijk’s Ideological Square):
– Emphasize positive things about Us
– Emphasize negative things about Them
– De-emphasize negative things about Us
– De-emphasize positive things about Them

Discourse features

How are these strategies instantiated in discourse? (van Dijk)
– What is there:
  • argument structure
  • syntactic patterns
  • style and non-literal language
  • actor descriptions
  • thematic structure
  • topoi (standardized topics)

Discourse features

– What is not there:
  • implication
  • presupposition
  • inference
  • goals and plans

Discourse features

Disclaimers, selected examples:
– Apparent Negation: I have nothing against X, but...
– Apparent Concession: They may be very smart, but...
– Apparent Empathy: They may have had problems, but...
– Apparent Effort: We do everything we can, but...

Positive self-representation and face keeping
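
Disclaimers like these have a recognizable surface form, so one plausible use is as binary features for a classifier. The sketch below is only an assumption about how that might look: the regular expressions are hypothetical, cover just the four example phrasings above, and would miss most real variation.

```python
import re

# Hypothetical surface patterns, one per disclaimer type listed above.
DISCLAIMER_PATTERNS = {
    "apparent_negation": re.compile(r"\b(?:I|we) have nothing against\b.*?\bbut\b", re.I),
    "apparent_concession": re.compile(r"\bthey may be\b.*?\bbut\b", re.I),
    "apparent_empathy": re.compile(r"\bthey may have had\b.*?\bbut\b", re.I),
    "apparent_effort": re.compile(r"\bwe do everything we can\b.*?\bbut\b", re.I),
}


def disclaimer_features(sentence):
    """Map a sentence to one boolean feature per disclaimer type."""
    return {
        name: bool(pattern.search(sentence))
        for name, pattern in DISCLAIMER_PATTERNS.items()
    }


features = disclaimer_features("They may be very smart, but they ignore the facts.")
```

Each sentence yields a small feature dictionary that could be combined with other lexical or discourse features.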

Discourse features

Some discourse theories from Computational Linguistics

– Mann & Thompson (RST) (1988)
– Grosz & Sidner (G&S) (1986)
– Morris & Hirst (Lexical chains) (1991)

Discourse features

Issues

– implementation
  • G&S, RST
– finite number of fixed primitives
  • RST
– domain specific
  • RST depends on training

Discourse features

A reasonable first approach: Lexical Chains (Morris & Hirst)

Sequences of related words spanning a topical unit in the text
– based on lexical cohesion
– encapsulates context
– helps identify key phrases

Discourse features

Idea of Algorithm
– read next word
  • if candidate
    – check chains within suitable span
      » check thesaurus or WordNet
      » check other knowledge sources
    – if found
      » include in chain
      » recalculate chain
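
The algorithm outline above can be sketched roughly as follows. This is a simplified illustration, not Morris & Hirst's actual procedure: the stopword list, the tiny relatedness table standing in for a thesaurus or WordNet, and the `max_span` cutoff are all hypothetical.

```python
# Words that are never chain candidates (hypothetical, very short list).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "on", "will", "for"}

# Toy relatedness table standing in for a thesaurus or WordNet lookup.
RELATED_SETS = [
    {"tax", "taxes", "revenue", "budget"},
    {"river", "water", "stream", "flood"},
]


def related(a, b):
    """True if two words are identical or share a relatedness set."""
    return a == b or any(a in s and b in s for s in RELATED_SETS)


def build_chains(words, max_span=10):
    """Return lexical chains as lists of (position, word) pairs."""
    chains = []
    for pos, word in enumerate(words):
        word = word.lower()
        if word in STOPWORDS:  # not a candidate
            continue
        for chain in chains:
            within_span = pos - chain[-1][0] <= max_span
            if within_span and any(related(word, w) for _, w in chain):
                # Include in chain; the chain's last position moves forward,
                # a simple stand-in for the "recalculate chain" step.
                chain.append((pos, word))
                break
        else:
            chains.append([(pos, word)])  # no match found: start a new chain
    return chains


text = "the tax budget will cut revenue and the river flood raised water".split()
chains = build_chains(text)
long_chains = [[w for _, w in chain] for chain in chains if len(chain) > 1]
```

Chains with more than one member give candidate topical units, and their words could serve as lexical features.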

Discourse features

Lexical chains could help us in:
– topic segmentation
– intentional structure
– lexical features for a classifier

Discourse features

Lexical chains are easy to implement, but are unlikely to be sufficient…

For the next approximation: RST
– Marcu’s implementation incorporating G&S
– Mostly used for summarization and generation
– Would help get at the argument structure of the text

Discourse features

RST Basics
– about 23 rhetorical relations
  • account for discourse coherence
  • link adjacent spans of text
– 5 schemas
  • defined in terms of relations
  • specify how spans can co-occur
– nucleus and satellite spans
– end up with tree structure

Discourse features

Would most likely use RST to generate features for a classifier or as input to a pattern recognizer

Nucleus spans help pick out the more important segments of text

Produces a tree that gives the rhetorical structure of the text
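
To show what such a tree might look like as a data structure, here is a minimal sketch. The `Span` class, the example text, and the way nuclei are collected are assumptions for illustration, not Marcu's representation. The relation labels are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    role: str                               # "nucleus" or "satellite"
    text: Optional[str] = None              # set on leaf spans only
    relation: Optional[str] = None          # relation linking the children
    children: Optional[List["Span"]] = None


def nuclear_text(node):
    """Collect leaf text reachable from the root through nucleus spans only."""
    if node.role != "nucleus":
        return []
    if node.children is None:
        return [node.text]
    collected = []
    for child in node.children:
        collected.extend(nuclear_text(child))
    return collected


# Hypothetical analysis of a three-clause text.
tree = Span(
    role="nucleus",
    relation="evidence",
    children=[
        Span(
            role="nucleus",
            relation="concession",
            children=[
                Span(role="satellite", text="We do everything we can,"),
                Span(role="nucleus", text="but the problem persists."),
            ],
        ),
        Span(role="satellite", text="Last year's figures show it."),
    ],
)
important = nuclear_text(tree)
```

Following only nucleus spans from the root keeps the central claim and drops the supporting satellites. The relation labels along the way could be emitted as classifier features.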

Internet

We would like to mine the structure of the Internet
– see if there is a correspondence with groups
– improved IR by topic
– figure out what search engine to use as a base for our system

Internet

Issues
– topic or query disambiguation
– what is a minimal unit
– how to use the structure of the web
  • finding authorities
  • communities and subgraphs
– Evaluation!!!

Internet

Kleinberg (1997)
– link based model
– hub - links to many related authorities
– authority
– iterative weighting algorithm that converges (rapidly in practice)
– can disambiguate authorities by sense
– can be used to trawl for cyber communities
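
The iterative weighting can be sketched in a few lines. This is a minimal version of the hub and authority updates on a made-up link graph; the page names, the fixed iteration count, and the dictionary-based graph format are assumptions for the example.

```python
import math


def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of the authority weights of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both weight vectors so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph: three hub-like pages pointing at two targets.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
```

In a real setting the graph would be the neighborhood of the pages a search engine returns for the user's topic, and the top-weighted authorities would be kept for further processing.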

Conclusion

It seems that such a system can be built
– find a good search engine
– use Kleinberg’s algorithm to improve collection of documents retrieved
– use LSA and/or a probabilistic classifier to handle the ideological point of view
– with a probabilistic classifier use linguistic and discourse features
– develop evaluation methodology

The End

Thanks for listening!

If you want to know more, my Comprehensive Exam paper is at:

www.CS.NMSU.Edu/~mmartin/courses/comps_all.html