Post on 17-Dec-2015
Malcolm Clark
Supervisors: Professor Patrik O'Brian Holt, Dr Ian Ruthven
Genre Analysis of Structured E-mails for Corpus Profiling
Workshop on Corpus Profiling for NLP/IR
Presentation Outline
• Introduction
• The Problems
• Information Retrieval (IR), Genre and Perception
• Experiment – Research Questions, Setup, How do People use Textual Features?
• Conclusions
• Contributions and Implications
• Future Work
• Focuses on IR and cognitive psychology.
• Corpora contain ‘exemplar’ documents, called genres, which are useful for profiling corpora.
• E-mail exchanges have socially constructed communicative behaviours, which exist to improve the efficiency of a community of practice and can be used for profiling corpora.
• Investigate these types of genres and how people use e-mails, in terms of genre and perception, for filtering.
Introduction
• Identifying genres for profiling a corpus
• Filter correct types of documents to user by genre:
  • E-mail filtering
  • Understanding user tasks
• Rapidly understand a text without the necessity for parsing the whole document?
The Problems
The Project Examines:
• The value of structure.
• How form or layout is perceived in structured texts.
• Whether constructivist (recognition) or ecological (action afforded) approaches apply, or whether both are used.
• If and how the objects of a community of practice (COP) can be comprehended and exploited.
• How readers react to genre features in document collections.
Information Retrieval
Division of IR into computer science lab experiments vs ‘user-orientated’ social studies
Järvelin (2006)
Genre – Background
[Figure: components of a typical genre, after Orlikowski and Yates (1994)]
• Purpose (communicative purpose): topics, themes, arguments, discourse structure
• Form (readily observable features): structural features, communication medium, language or symbol system (formality, specialised vocabulary)
Corpus – Genre Example from E-mail: Call for Papers
Titles:
Topics (list)
Header: Title etc.
Abstract
Dates and submission
What? Social institutions/sites.
When? Human ‘agents’ draw on genre rules to engage in organizational communication.
How? Produced, reproduced, or modified.
But how are they perceived and used?
Genre – What are Communities of Practice (COP)?
Two prominent fields in perception research:
Human Perceptual Systems
Final goal?
Recognition (constructivist) or action (ecological)
• How do human beings use genre features, and what do they perceive?
• How can genre categorization be performed by using current skimming methods?
• How do genres evolve in communities of practice (e.g. e-mail)?
• How are the document genres and structural attributes used?
Experiment Pilot - Research Questions:
By eye tracking, i.e. recording the position and movement of the eye:
• Collect and analyse the empirical data produced by experiments in an e-mail community of practice.
• Locate the strategies and features for profiling corpora, e.g. centred blocks of text, invariant cues. Taking into account: features, strategies etc.
• How do humans view genre?
Experiment Pilot - How do People Use Texts?
Experiment Pilot
• Method: 4 blocks of 16 images (4 genres in each two blocks).
• Measurements
  • Number of genres identified correctly (purpose)
  • Structured vs non-structured form (form)
  • Genre identification response time (form)
  • Strategies and distinguishing features (purpose and form)
• Variables
  • Purpose/type of genre
  • Form in 4 representations...
Pilot - Setup
CFP - Content AND Structure
CFP – Structure and No Content
CFP – Content No Structure
CFP – No Content AND No Structure
• Task and procedure
  • Participants are shown 64 images.
  • They vocally identify each image.
  • The eye tracker records the features and strategies used.
• Data recorded
  • X/Y locations of saccades and fixations
  • Features and strategies
  • Desktop video recording (Wink)
  • Timed and vocal responses
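Raw X/Y gaze samples have to be grouped into fixations before viewing strategies can be read off them. The sketch below illustrates one common approach, dispersion-threshold identification (I-DT). It is an illustration only: the slides do not say which algorithm the eye tracker or the analysis used, and the names `Fixation`, `detect_fixations`, `max_dispersion` and `min_samples`, as well as the threshold values, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # one gaze sample as (x, y) in screen pixels


@dataclass
class Fixation:
    start: int  # index of the first sample in the fixation
    end: int    # index of the last sample (inclusive)
    x: float    # centroid of the samples
    y: float


def dispersion(points: Sequence[Point]) -> float:
    """Spread of a window: (max x - min x) + (max y - min y)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples: Sequence[Point],
                     max_dispersion: float = 30.0,  # assumed threshold (pixels)
                     min_samples: int = 5           # assumed minimum duration
                     ) -> List[Fixation]:
    """Group consecutive, tightly clustered samples into fixations."""
    fixations: List[Fixation] = []
    i, n = 0, len(samples)
    while i + min_samples <= n:
        j = i + min_samples
        if dispersion(samples[i:j]) <= max_dispersion:
            # Grow the window while the samples stay close together.
            while j < n and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            window = samples[i:j]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append(Fixation(i, j - 1, cx, cy))
            i = j
        else:
            i += 1  # not a fixation start; slide the window by one sample
    return fixations
```

Samples that fall between two consecutive fixations can then be treated as saccades, and the fixation centroids can be compared against layout regions such as centred blocks or date lines.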
Setup
• Number of genres identified correctly (purpose)
  • 11.5 per block out of 16.
  • Unstructured vs structured: 41.6% vs 72.9%
  • Original (87.5%), original with no content (77%), content with no structure (68%), neither (27%)
• Structured vs non-structured form, average response time (sec): 2.22 vs 2.72
• How was it done? Clues to strategies:
  • Skimmed shape: left-aligned (Sem) or centred (CFP) layout, and blocks of text/numerics
  • No structure, or no structure and no content: wide spirals of scanning behaviour, possibly looking for keywords?
Results after 5 Participants
Results – Distinguishing features
Genre    Features
CFP      Dates, centred blocks
Cinema   Block of numerical content
ITS      Inconclusive (participants ignore them?)
Lib      List of book(s) info at bottom
Nl       Paragraph/summary of item, then URL
Ord      Left alignment, currency
Sem      Inconclusive
Spam     Keywords (LOTTO/address) and uppercase emboldened text
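One way such a table of distinguishing features might be put to work for filtering is as a lookup from observed cues to candidate genres. The sketch below is hypothetical: the cue labels (`dates`, `centred_blocks` and so on) are invented shorthand for the table entries, the scoring rule is an assumption, and detecting the cues in a real e-mail is left out entirely.

```python
from typing import Dict, List, Set, Tuple

# Cue labels are hypothetical shorthand for the features in the table.
GENRE_CUES: Dict[str, Set[str]] = {
    "CFP": {"dates", "centred_blocks"},
    "Cinema": {"numeric_block"},
    "ITS": set(),  # inconclusive in the pilot, so no cues are listed
    "Lib": {"book_list_at_bottom"},
    "Nl": {"summary_then_url"},
    "Ord": {"left_aligned", "currency"},
    "Sem": set(),  # inconclusive in the pilot
    "Spam": {"keywords", "uppercase_bold"},
}


def rank_genres(observed: Set[str]) -> List[Tuple[str, float]]:
    """Rank genres by the fraction of their cues present in `observed`."""
    scored = []
    for genre, cues in GENRE_CUES.items():
        if not cues:
            continue  # nothing to match against for inconclusive genres
        score = len(cues & observed) / len(cues)
        if score > 0:
            scored.append((genre, score))
    # Highest score first; ties are broken alphabetically by genre name.
    return sorted(scored, key=lambda gs: (-gs[1], gs[0]))
```

For example, `rank_genres({"dates", "centred_blocks", "currency"})` ranks CFP (all cues present) ahead of Ord (one of two cues present).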
• Genre is largely overlooked, but momentum is building.
• Our approach is useful for filtering e-mails and identifying features for characterising datasets.
• Purpose and form are very useful for using texts.
• Clues to perception processes were found, but familiarity needs to be added to the mix.
• Train a machine to emulate human behaviour and understand textual input without reading the whole text?
Conclusions
• Development of a language/perception theory/framework of:
  • How people use different types of texts.
  • Modelling user tasks and behaviour in relation to genre and perception.
• Extend laboratory IR/user-orientated IR approach
  • From: algorithms and machines.
  • To: a user-oriented and contextual level.
Contributions and Implications
• Focus on narrowing down my work domains.
• Investigate domains:
  • Academic document collections: CSIRO Enterprise
  • Legal documents – Enron
  • Weblogs – TREC Blog
  • Web domains – Wikipedia
• Consider multi-genres, e.g. course books, and large documents, e.g. a social work report
Future Work
• Useful features for profiling corpora.
• Adds another type of filtering to large data collections to take advantage of genre, e.g. news, biographical etc.
• Genre benefits organisations financially and administratively, e.g. through rapid retrieval of information.
• Embrace genre and perception to understand and examine these structures!
Motivation
• Model the findings based on FERRET and McFRUMP’s Predictor and Substantiator.
• Our system: Genre Retrieval and Understanding Memory Program or GRUMP.
• Similar features to Clark and Watt (2007)?
Evaluation System
Skimming & Categorisation
Skimming
• Used to identify the main points in a text much quicker than normal reading, without having to understand every word.
• Normally used when a reader has a large amount of text to read within a limited time.
Categorisation
• Automatically labelled or classified.
• No need for manual organisation, labelling or sorting.
Evaluation System – How it Works
[Figure: Texts → McFRUMP Parser → Abstracts; Queries → Query Parser → Case frame patterns; Abstracts and Case frame patterns → Case Frame Matcher → Relevant Texts]
Figure taken from Mauldin 1991
McFRUMP parser contains the Predictor/Substantiator, Scripts etc
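The matching stage of this pipeline can be illustrated with a toy example. The sketch below assumes that each text has already been reduced to a list of case frames (slot/value dictionaries) and that a query has been parsed into a pattern in which `None` acts as a wildcard. It does not reproduce the McFRUMP parser or Mauldin's actual matcher; `matches`, `retrieve` and the slot names are hypothetical.

```python
from typing import Dict, List, Optional

CaseFrame = Dict[str, str]            # slot name -> filler, e.g. {"act": "arrest"}
Pattern = Dict[str, Optional[str]]    # None means "any filler is acceptable"


def matches(pattern: Pattern, frame: CaseFrame) -> bool:
    """True if every slot in the pattern is present and compatible in the frame."""
    for slot, wanted in pattern.items():
        if slot not in frame:
            return False
        if wanted is not None and frame[slot] != wanted:
            return False
    return True


def retrieve(pattern: Pattern, abstracts: Dict[str, List[CaseFrame]]) -> List[str]:
    """Return the ids of texts whose abstract contains a matching case frame."""
    return [
        text_id
        for text_id, frames in abstracts.items()
        if any(matches(pattern, frame) for frame in frames)
    ]


# Hand-written abstracts standing in for parser output (illustrative data only).
abstracts = {
    "t1": [{"act": "arrest", "object": "John Doe"}],
    "t2": [{"act": "announce", "object": "workshop"}],
}
relevant = retrieve({"act": "arrest", "object": None}, abstracts)
```

Here `relevant` contains only `"t1"`, because the pattern requires an arrest act but accepts any object.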
Evaluation System – Script Example
Using Schank’s (1981, ch. 3) Conceptual Dependency theory of Scripts, Plans and Goals and DeJong’s (1982) FRUMP, make different genre scripts:
John Doe was arrested last Saturday morning after holding up the New Haven Savings Bank
$ARREST SCRIPT
• Police arrive at suspect location
• Suspect apprehended
• Taken to police station
• Charged
• Incarcerated or bailed
Using this type of script format to understand stories, genre rules/features can be specified in scripts to understand texts.
Modify script with genre rules
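A rough sketch of how a script extended with genre rules might be represented is given below. Each script lists expectations, and a text "substantiates" an expectation when a simple pattern is found in it. This is only loosely inspired by the Predictor/Substantiator idea and is not how FRUMP or McFRUMP is implemented; the `Script` class, the regular expressions and the `$CFP` script are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Script:
    name: str
    expectations: Dict[str, str]  # expectation label -> regex that substantiates it

    def substantiate(self, text: str) -> List[str]:
        """Return the expectations for which evidence is found in the text."""
        return [
            label
            for label, pattern in self.expectations.items()
            if re.search(pattern, text, re.IGNORECASE)
        ]


# Story script, following the $ARREST example (patterns are illustrative).
ARREST = Script("$ARREST", {
    "suspect apprehended": r"\barrest(ed)?\b|\bapprehended\b",
    "charged": r"\bcharged\b",
    "incarcerated or bailed": r"\bincarcerated\b|\bjailed\b|\bbail(ed)?\b",
})

# Hypothetical genre script built from the call-for-papers features.
CFP = Script("$CFP", {
    "topics list": r"\btopics\b",
    "dates": r"\bdeadline\b|\bimportant dates\b",
    "submission": r"\bsubmissions?\b|\bsubmit\b",
})


def best_script(text: str, scripts: List[Script]) -> Optional[str]:
    """Pick the script with the most substantiated expectations, if any."""
    count, name = max((len(s.substantiate(text)), s.name) for s in scripts)
    return name if count > 0 else None
```

Applied to the example sentence about John Doe, only the `$ARREST` script finds any evidence, so `best_script` selects it. A text mentioning topics, a deadline and submissions would select `$CFP` instead.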