Informal email in the workplace
Outline
Overview of the topic
Paper #1 - "The Dynamics of Electronic Mail as a Communication Medium"
Paper #2 - "tRuEcasIng"
Paper #3 - "Periods, capitalized words, etc."
Proposed system
Q & A
2
E-mail Formality in the Workplace
E-mail etiquette is an important issue for communication in an organization
Many etiquette guides warn that an overly informal e-mail can create a negative perception of the sender
“People will notice you but for the wrong reasons”
At the same time, many people use an informal tone to demonstrate a level of trust, understanding or friendship
In addition, some guides claim that an informal tone may encourage a response
3
How do we define Formality?
There are many etiquette guides
Do certain rules have a larger negative “penalty”?
Do certain rules have a greater potential for
gaining/establishing a relationship?
Are rules of formality global across all people?
4
Formality at Enron
Let’s look at a few examples of formality in the
Enron database
First, we’ll look at a formal example
5
Formal e-mail example
Hello everyone,
My husband Ryan is going to be participating in the American Heart Walk on November 3. He is walking on behalf of our daughter, Sydney, who was born in May of this year with Congenital Heart Disease.
[...]
Please help this worthwhile cause if you can.
Thanks,
Vicki Versen
6
“Less” formal example
i suck-hope youve made more money in natgas last 3
weeks than i have...
mkt shudbe getting bearish feb forward-cuz we
already have the weather upon us-fuelswitching
and the rest shud invert the whole curve not just dec
cash to jan andfeb forward????
have a good weekend john
7
Ranges of formality in the data
I started becoming obsessed looking at these
So far, I’ve only scratched the surface
It is clear that there is a wide range of formality in
the organization
It is also clear that there is a wide range of
formality across e-mails from the same sender
There are examples very similar to the previous two
which come from the same sender
8
Explanation
How can we account for these differences?
Can we find an explanation for this behavior?
Does this behavior tell us another side of the story
that will enhance or change the findings of other
research?
We’ll talk about these more after the Papers
9
Paper #1 – Habil & Rafik-Galea
“The Dynamics of Electronic Mail as a Communication
Medium”
An overview of how e-mail is used in the workplace
Discussion of formality and the differences between
written and spoken communication
Not an extremely scientific paper
No CompLing techniques
However, I feel it serves a need to begin the discussion
10
Paper #1 (cont.)
Many of the statements in the paper do not seem to
be well founded, but they help to frame differences
in formality
A common use for e-mail is short notes and
responses
“Senders of e-mail typically behave as if the
medium is like speech”
The above statement likely cannot be globally
applied, but there are certainly examples of this
11
Paper #1 – Features of Formality
Proper capitalization
Proper punctuation
Absence of ‘…’, exclamation marks, etc.
Absence of contractions
Absence of 1st and 2nd person pronouns
Absence of slang
Absence of informal tone
“you know”, “so”, “I mean”, “sort of”
More examples on the next Slide…
12
Paper #1 – Features (cont)
Absence of abbreviations
Complete sentences
Standard spellings
As opposed to “thru”, “cuz”, “thanx”, “thx”
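The feature list above could be turned into simple counts along these lines. This is only a rough sketch: the word lists and the function name `informality_counts` are hypothetical, the paper itself gives just a handful of examples, and the filler “so” is left out because it is too ambiguous to match by string.

```python
import re

# Hypothetical word lists for illustration; the paper only names a few examples.
NONSTANDARD_SPELLINGS = {"thru", "cuz", "thanx", "thx"}
FILLERS = ("you know", "i mean", "sort of")
PRONOUNS_1ST_2ND = {"i", "me", "my", "we", "us", "our", "you", "your"}

def informality_counts(text):
    """Count simple surface markers of informality in one e-mail body."""
    lower = text.lower()
    tokens = re.findall(r"[a-z']+", lower)
    # Naive sentence split: punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    return {
        "lowercase_sentence_starts": sum(1 for s in sentences if s[0].islower()),
        "ellipses": len(re.findall(r"\.{3,}|…", text)),
        "exclamations": text.count("!"),
        # Also matches possessives such as "john's"; good enough for a sketch.
        "contractions": sum(1 for t in tokens if re.fullmatch(r"[a-z]+'[a-z]+", t)),
        "pronouns_1st_2nd": sum(1 for t in tokens if t in PRONOUNS_1ST_2ND),
        "nonstandard_spellings": sum(1 for t in tokens if t in NONSTANDARD_SPELLINGS),
        "fillers": sum(lower.count(f) for f in FILLERS),
    }
```

Slang, abbreviations and incomplete sentences are not covered here; they would likely need a lexicon or a parser rather than string matching.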
13
Paper #1 – Data
The data comes from two organizations in Malaysia
E-mails are intended to show communication which is
horizontal (between employees of equal position)
and vertical (sent to an employee of a higher or
lower position in the company)
14
Paper #1 - Findings
The purpose of the paper as stated is to “identify
and discuss instances of the email messages being
‘formal’ or ‘conversational’”
The paper does begin a discussion yet there is little
discovered about “how and why?”
The data shown is merely 3 samples of formal and
informal e-mails
15
Paper #1 – Summary
Potential explanations are discussed, but no hard
numbers
This paper can help us get started on analysis and
selecting features
16
Preview of Papers 2 and 3
Why were Papers 2 and 3 chosen?
Several other features of formality seemed simple
to extract
Capitalization is an important issue and didn’t seem
trivial
Since we are dealing with Enron, we want this to be
robust within that domain
Ex : “PCE”, “Prentice”, etc.
17
Paper #2 – Lita et al
“tRuEcasIng”
Process of restoring case information to badly-cased
or non-cased text
Statistical language model
Several other applications as well:
Corpora cleaning
Named Entity Recognition
Machine Translation
Automatic Speech Recognition
18
Paper #2 – Problems to solve
Ambiguity can be a significant problem
Several common words like “pond” and “now” might
actually need to be uppercase
Examples :
“us rep. james pond showed up riding an it and going
to a now meeting”
“US Rep. James Pond showed up riding an IT and going
to a NOW meeting”
19
Paper #2 – Baseline and Approach
The baseline used is a simple unigram model
The approach builds a statistical language model
Probabilities include :
Trigrams, bigrams and unigrams
A trellis is constructed which is very similar to a
Hidden Markov Model
Probabilities are computed at the sentence level
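To make the baseline concrete, here is a minimal sketch of a unigram truecaser: each word is restored to the surface form seen most often in cased training text. The function names and the whitespace tokenization are assumptions for illustration, not the paper's implementation, and the trigram trellis is not shown.

```python
from collections import Counter, defaultdict

def train_unigram_truecaser(cased_sentences):
    """Record, for each lowercased word, its most frequent cased form."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            forms[token.lower()][token] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in forms.items()}

def truecase(text, model):
    """Replace each token with its most frequent cased form.

    Tokens never seen in training are passed through unchanged.
    """
    return " ".join(model.get(tok.lower(), tok) for tok in text.split())
```

A model like this ignores context, so a name such as “Pond” is lowercased whenever “pond” is the more frequent form. That is the ambiguity the sentence-level language model is meant to resolve.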
20
Paper #2 - Results
Tested against four different test sets
Significant reduction of error compared to the
baseline (unigram model)
On current news stories, the accuracy is ~98%
22
Paper #2 – Future work
Could be applied to:
Accent marks
Punctuation
Additional features could be added or adapted for
improvement
23
Paper #3 - Mikheev
“Periods, Capitalized Words, etc.”
Approach for several aspects of text normalization :
Sentence Boundary Disambiguation (SBD)
Disambiguation of capitalized words
Identification of abbreviations
24
Paper #3 – Preview of Approach
Before going any further, sorry for picking such a
long paper
Coverage will be brief since time is short
Previous work has worked with local contexts
Mikheev proposes a Document-Centered Approach
(DCA) in order to derive information from the entire
document
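A toy illustration of the document-centered idea, assuming a naive sentence splitter and whitespace tokens: a capitalized word at the start of a sentence is ambiguous, so we look at how the same word is written in unambiguous (mid-sentence) positions elsewhere in the same document. The function names are hypothetical and this is far simpler than Mikheev's actual cascade.

```python
import re
from collections import Counter

def _clean(token):
    """Strip surrounding punctuation from a token."""
    return token.strip(".,;:!?\"'()")

def classify_sentence_starts(document):
    """Label sentence-initial capitalized words as proper, common or unknown."""
    # Naive split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s.split() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    mid_lower, mid_upper = Counter(), Counter()
    for words in sentences:
        for tok in (_clean(w) for w in words[1:]):  # unambiguous positions only
            if tok[:1].islower():
                mid_lower[tok.lower()] += 1
            elif tok[:1].isupper():
                mid_upper[tok.lower()] += 1
    labels = {}
    for words in sentences:
        first = _clean(words[0])
        if not first[:1].isupper():
            continue
        key = first.lower()
        if mid_upper[key] and not mid_lower[key]:
            labels[first] = "proper"
        elif mid_lower[key] and not mid_upper[key]:
            labels[first] = "common"
        else:
            labels[first] = "unknown"  # no evidence, or conflicting evidence
    return labels
```

Words left as "unknown" are where the list resources and the later strategies in the cascade would have to take over.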
25
Paper #3 – Building Resources
To use this approach, support resources must be
generated
These resources can be built from raw (unlabeled)
texts
Development resources were created from the New
York Times corpus, but these could also be scraped
from the Internet
26
Paper #3 – Building resources (cont)
List 1 - Common word list
All lowercase words
Threshold was used to prevent source errors in spelling or capitalization
List 2 - Frequent sentence starter list of common words
For all words starting sentences, they are added to the list if they also belong to the common word list
Not perfect, but it provides the 200 most frequent common words that start sentences
27
Paper #3 – Building resources (cont)
List 3 - Frequent proper names list
Single word proper names that also coincide with the
list of common words
Captures words like ‘China’ which are also present as
common words like ‘china’
Again, the 200 most frequent instances are on the list
List 4 - Abbreviations list
Collected by applying abbreviation guessing heuristics
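A rough sketch of how Lists 1 to 3 might be derived from raw tokenized sentences. The threshold value, the use of mid-sentence capitalization as the proper-name signal, and the name `build_resources` are assumptions. The abbreviation list is omitted because the slides do not describe its guessing heuristics.

```python
from collections import Counter

def build_resources(sentences, min_count=2, top_n=200):
    """Build (common_words, sentence_starters, proper_names) from token lists."""
    lower_counts, starter_counts, mid_cap_counts = Counter(), Counter(), Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if not tok.isalpha():
                continue
            if tok.islower():
                lower_counts[tok] += 1
            elif tok[0].isupper() and i > 0:
                # Capitalized in mid-sentence: treated as a proper-name candidate.
                mid_cap_counts[tok.lower()] += 1
            if i == 0:
                starter_counts[tok.lower()] += 1
    # List 1: lowercase words above a frequency threshold (filters source errors).
    common = {w for w, c in lower_counts.items() if c >= min_count}
    # List 2: most frequent sentence starters that are also common words.
    starters = [w for w, _ in starter_counts.most_common() if w in common][:top_n]
    # List 3: most frequent proper-name candidates that coincide with common words.
    proper = [w for w, _ in mid_cap_counts.most_common() if w in common][:top_n]
    return common, starters, proper
```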
28
Paper #3 - Strategies
A cascade of strategies is applied in a specific
order
These strategies use the list resources
Each of these strategies provides different
coverage :
Sequence strategy
Frequent-list lookup strategy
Single-Word Assignment
Quotes, Brackets and “After Abbr.” Heuristic
29
Paper #3 - Results
Results are competitive with other machine learning
and rule-based systems when comparing SBD,
Capitalized words and Abbreviations
Incorporating the DCA method into a POS tagger
significantly reduced error rate
Robust with respect to domain shift and new lexica
30
Paper #3 - Limitations
Processing relies on “well behaved” (non-noisy) text
Not expected to perform well for single cased texts
Short documents -> Not enough clues
Long documents -> Too many clues
Potential solution for short documents is to make use
of a “caching module” to propagate features from
one document to the next
31
Paper #3 – Other testing
Tested on a corpus of Russian news
Different language
Short documents (1-2 paragraphs)
32
Paper #3 – Interesting quote
“We deliberately shaped our approach so that it
largely does not rely on precompiled statistics,
because the most interesting events are inherently
infrequent and hence are difficult to collect reliable
statistics for”
33
Questions
Using the dataset from Jabbari et al, are senders
more likely to be formal in business emails (as
opposed to personal)?
Are certain positions in the company more likely to
be formal?
Are senders more likely to be formal when sending
to a person of higher position?
Are senders more likely to be formal with more
people on a thread (“Broadcast”)?
35
Questions (cont)
How likely are senders to use informal
communication on first email contact?
What is the average number of emails before
communication switches from formal to informal?
How often does communication “switch”
from informal to formal?
Does formal communication become more or less
prominent during the media coverage of the
scandal?
36
Questions (cont)
Are senders likely to echo the style of the person
they respond to?
Is there much shift over time in an individual’s
formality?
Do “informal connections” support the findings of
Social Network Analysis and other research?
37
Research Issues
It seems that the range of what is considered
“maximum formality” or “minimum formality” is
different across the company
Each sender has their own range of formality
38
Research Issues – Gold Standard?
Since each sender has a range, annotator
agreement on “overall formality” seems impossible
Annotator cannot classify as “Formal/Informal”
Even a 5 point scale is not reasonable
Best annotation is likely a count of each “informal
speech act”
39
Analysis of Formality
Content features to be used :
Capitalization issues
Punctuation issues (‘…’, exclamation points)
Contractions
Complete sentences
Q : Since email length differs, how can I normalize these features?
Q : Should each of these dimensions be tracked discretely or calculated into a full score?
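One possible answer to both questions, sketched under assumptions: express each feature as a rate per 100 tokens so that long and short e-mails are comparable, keep the per-dimension rates, and optionally combine them into a weighted total. The per-100-token unit, the default weight of 1.0 and the function names are illustrative choices, not something settled in this proposal.

```python
def normalize_per_100_tokens(counts, n_tokens):
    """Convert raw feature counts into rates per 100 tokens."""
    if n_tokens <= 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * value / n_tokens for name, value in counts.items()}

def total_score(rates, weights=None):
    """Combine per-dimension rates into one score; unlisted weights default to 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * rate for name, rate in rates.items())
```

Storing the individual rates alongside the total keeps both options open: dimensions can still be analyzed separately after a combined score has been computed.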
40
Analysis of formality (cont)
How to capture capitalization issues?
Possibly create a hybrid solution of both papers (Lita et
al, Mikheev)
Might be able to create DCA resources both from a
“clean” corpus and also the “cleanest” data in Enron
Domain-specific capitalization will likely be critical
Ex : ‘Lay’ (Kenneth) vs. ‘lay’
41
Analysis of Formality (cont)
Some features seem more difficult to normalize
since they can occur at most once :
Greeting
Sign-off
Q : Should these be used to determine formality or
used for comparison after analysis has been
completed?
42
Comparison metrics
Once I can quantify formality, two important metrics
will be needed:
Average formality across the organization
Average formality across each sender
Comparing each e-mail against these averages with
respect to standard deviation will help us determine
messages which are more or less formal
Significant differences across sender formality will be
used for most questions
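The comparison described above amounts to a z-score against two baselines. A minimal sketch, assuming each e-mail has already been reduced to a single formality score and using the population standard deviation:

```python
from collections import defaultdict
from statistics import mean, pstdev

def z_scores(emails):
    """emails: list of (sender, score) pairs.

    Returns (sender, score, z_vs_organization, z_vs_sender) for each e-mail.
    """
    all_scores = [score for _, score in emails]
    org_mu, org_sd = mean(all_scores), pstdev(all_scores)
    by_sender = defaultdict(list)
    for sender, score in emails:
        by_sender[sender].append(score)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_sender.items()}

    def z(x, mu, sd):
        # A sender with a single e-mail (or identical scores) has sd == 0.
        return (x - mu) / sd if sd > 0 else 0.0

    return [
        (sender, score, z(score, org_mu, org_sd), z(score, *stats[sender]))
        for sender, score in emails
    ]
```

An e-mail with a large positive or negative z-score against its own sender's mean is unusual for that person, even if it looks ordinary against the organization as a whole.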
43
Data capture and Results
Metrics of formality will be stored in a new
database table so that relationships can easily be
analyzed against other data (times, recipients,
business vs. personal, etc)
The data members of this table will capture each
selected dimension of formality and possibly a total
score
Should be simple to start generating initial reports
and answering research questions
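A sketch of what such a table could look like, using SQLite purely for illustration. The table name, column names and choice of dimensions are hypothetical; the real schema would follow the existing Enron database and whichever features are finally selected.

```python
import sqlite3

def create_formality_table(conn):
    """Create a table with one row of formality metrics per message."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS email_formality (
            message_id           TEXT PRIMARY KEY,  -- joins to the existing message table
            capitalization       REAL,
            punctuation          REAL,
            contractions         REAL,
            incomplete_sentences REAL,
            total_score          REAL
        )
        """
    )

def store_scores(conn, message_id, scores, total):
    """Insert or overwrite the metrics for one message."""
    conn.execute(
        "INSERT OR REPLACE INTO email_formality VALUES (?, ?, ?, ?, ?, ?)",
        (
            message_id,
            scores.get("capitalization"),
            scores.get("punctuation"),
            scores.get("contractions"),
            scores.get("incomplete_sentences"),
            total,
        ),
    )
    conn.commit()
```

Keying the table on the message identifier lets the metrics be joined against times, recipients and the business vs. personal labels without changing the existing tables.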
44