2003 - Discovery of Working Activities by Email Analysis


    Discovery of Working Activities by Email Analysis

    Yueyu Fu & Hong Zhang

    School of Library and Information Science

    Indiana University, Bloomington

    Email {yufu | honzhang} @indiana.edu

    April 30, 2003

    Abstract

    Email has become one of the most widely used computer applications. As the

    number of emails we exchange increases at a high rate, so does the number of uses for

    email. Email data patterns may also reveal other useful information, such as personal

    activities. However, no existing visualization tool adequately serves this purpose.

    Our goal is to explore the working activities involved in email communication using a

    Treemap algorithm. The results show that the Treemap layout can successfully present the various

    activities involved in the email flow.

    Introduction

    Email has become one of the most widely used computer applications. As the

    number of emails we exchange increases at a high rate, so does the number of uses for

    email. Although email was originally designed as a communication tool, it is

    currently being used for a number of additional functions including personal archiving

    and task management. In addition, email data patterns may give us other useful

    information such as personal activities. However, no existing visualization tool

    adequately serves these purposes.


    There have been a number of studies aimed at exploring the current uses of email

    and identifying the problems commonly encountered by users dealing with large numbers

    of emails. They have focused on two features of email: threads and time. Timestore

    organizes emails by time and sender in a two-dimensional grid. It focuses on time-based

    email archiving and retrieval. Outlook 2000 and NEC's VisualMail also offer time-based

    views. However, such a view can be messy and hard to understand if there are too many

    emails. Threading helps manage conversation history and track the status

    of a conversation in email. Usually, a thread is defined as a series of emails sharing the

    same subject line, where prefixes such as "Re:" and "Fw:" are ignored. The time attribute of

    an email is very important in both visualization approaches. However, neither of these

    approaches reveals the various activity trends hidden in email communications. In this

    paper, we propose a visualization of an email dataset to help users perform this kind of task.
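The thread definition above (same subject line, with reply and forward prefixes ignored) can be sketched in a few lines of Python. This is only an illustration, not code from any of the systems mentioned: the names `thread_key` and `group_threads`, the case-insensitive comparison, the extra "Fwd:" prefix, and the sample subjects are all assumptions.

```python
import re
from collections import defaultdict

# One or more leading reply/forward prefixes such as "Re:", "RE:", "Fw:" or "Fwd:".
# Treating "Fwd:" as a prefix and ignoring case are assumptions of this sketch.
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def thread_key(subject):
    """Return the subject with reply/forward prefixes removed, in lower case."""
    return _PREFIX.sub("", subject).strip().lower()


def group_threads(emails):
    """Group (email_id, subject) pairs into threads keyed by normalized subject."""
    threads = defaultdict(list)
    for email_id, subject in emails:
        threads[thread_key(subject)].append(email_id)
    return dict(threads)


# Hypothetical sample data, not taken from the paper's dataset.
emails = [
    (1, "Build fails on startup"),
    (2, "Re: Build fails on startup"),
    (3, "Fw: Re: build fails on startup"),
    (4, "Meeting notes"),
]
threads = group_threads(emails)
```

With the sample data, the first three messages fall into one thread and the fourth forms a thread of its own.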

    Visualization Goal

    The outcome of our project is going to be used to explore working activity by

    analyzing email flow.

    User Analysis

    The intended audience of this project is anyone who communicates extensively with

    others by email. For instance, programmers email each other to resolve programming bugs,

    and researchers ask for help locating research papers by email. The users can be globally

    distributed and of various genders, ages, professions, lifestyles, and technical skills, but

    all of them know how to use email and operate a mouse for interaction.


    To use our visualization tool, users have to be able to identify patterns, color hues,

    and intensity levels. They must also be able to distinguish between different morphological

    elements such as words, shapes, and images. In addition, their computers need

    adequate graphics capability, a color monitor, and browser software that allows visualization

    tools such as Java applets to work properly.

    Task Analysis

    Well-organized emails may give people a clearer idea of their work progress.

    Therefore it is not unusual for people to spend a lot of time arranging their mailboxes. They

    categorize emails, create folders, and file emails with the same subject in one folder. Most

    email systems provide features such as creating folders, moving an email to a specific folder,

    and deleting an email or folder. Those functions allow people to clean up their mailboxes and

    organize them to some degree. But this approach has the following disadvantages:

    - The organization work is time-consuming.

    - People have to figure out how to label their folders. The name of each folder

      serves as a reminder of the email contents in that folder. But if the name is too long,

      people can see only part of it because of limited space. And if people use an

      abbreviation in place of the full name, they risk forgetting

      what the abbreviation means.

    - The number of emails in a folder and the relationships between them are hard to see.

    Based on the above, we propose a new system that provides the following features:

    - Each email is represented as a rectangle in the Treemap.

    - Emails with the same subject are grouped together.

    - Colors represent different senders.


    - The sender, subject, and time of an email are shown when the mouse moves over

      the rectangle for that email, so people need not open each

      email to see this information.

    - Clicking the rectangle assigned to an email opens a pop-up window

      containing the email content.

    - The control panel allows the user to manipulate the Treemap to obtain an optimal

      view.

    Data Mining

    The dataset of our project is provided by Mr. Jason Baumgartner, the instructor of

    the course Information Visualization. This dataset consists of 1695 emails, each of which

    contains a subject title, content body, sender's name, sender's email address, sent type,

    receiver's name, receiver's email address, and received type. The senders and receivers

    of those emails are software developers who communicated extensively by email during

    software development. The email contents are closely related to the problems and

    progress encountered during the development of new software. We hope to discover the working

    activities of those developers by analyzing those emails.

    The dataset is stored as a table in Microsoft Access. Each record has nine fields:

    identification number, subject, content, sender name, sender address, sent type, receiver

    name, receiver address, and received type. Because our goal is to visualize emails

    to explore workflow, attributes such as subject, body content, and sender are

    important. By quickly browsing the table contents, we found three characteristics of our

    dataset. First, the subject title of each email is a good representation of its body


    content. Second, the developers discussed several topics, and the

    number of emails involved in each discussion may indicate how much interest the developers

    have in each topic. Third, the dataset presents an obvious hierarchical structure.

    Based on these three points, we decided to categorize the data according to the subject

    field. By going through the whole dataset, we found that the first sub-level contains

    three categories: subjects with "Fluency", subjects with "Knownspace-teama", and subjects

    without either "Fluency" or "Knownspace-teama". We ran queries on both the original

    and derived tables. Finally, the dataset was divided into tables with a tree-like structure

    (See Figure 1). It has three first-level nodes, two of which contain three and five

    second-level nodes, respectively. The number of third-level nodes contained by the

    eight second-level ones is 2, 3, 0, 2, 2, 2 and 3, respectively. This structure should

    lend itself well to visualization with a Treemap.

    Each of these derived tables maintains the same format as the original table. Our

    algorithm has been redesigned so that it can retrieve the field contents of each of those tables

    automatically.
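The first-level split by subject described above can be illustrated with a small Python sketch. It is not the Microsoft Access queries used in the project: the function name `first_level_category`, the case-insensitive matching, the rule that "Fluency" wins when both keywords appear, the label "Other", and the sample subjects are all assumptions made for illustration.

```python
from collections import Counter


def first_level_category(subject):
    """Assign a subject line to one of the three first-level categories."""
    text = subject.lower()  # case-insensitive matching is an assumption
    if "fluency" in text:
        return "Fluency"
    if "knownspace-teama" in text:
        return "Knownspace-teama"
    return "Other"  # hypothetical label for subjects with neither keyword


# Hypothetical subject lines, not taken from the paper's dataset.
subjects = [
    "[Fluency] widget layout bug",
    "Re: [Fluency] widget layout bug",
    "[Knownspace-teama] nightly build",
    "Lunch on Friday?",
]
category_counts = Counter(first_level_category(s) for s in subjects)
```

Counting emails per category in this way gives the kind of per-topic volume that the text suggests may indicate developer interest.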


    Figure 1. Hierarchical Structure of Dataset

    Visualization & Interaction

    To pursue our design goals, we use a Treemap layout to visualize our dataset.

    The Treemap implementation we chose for this project was developed by Christophe Bouthier.

    It uses a space-filling technique to map a tree structure into nested rectangles, with each

    rectangle representing a node. A rectangular area is first allocated to hold the representation

    of the tree, and this area is then subdivided into a set of rectangles that represent the top

    level of the tree. This process continues recursively on the resulting rectangles to

    represent each lower level of the tree, each level alternating between vertical and

    horizontal subdivision. According to Ben Shneiderman, the Treemap layout is best suited to

    hierarchies in which the content of the leaf nodes and the structure of the hierarchy are of


    primary importance, and the content information associated with internal nodes is largely

    derived from their children. However, Treemaps should not be used to convey

    the hierarchical structure of a very large data set.
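The recursive subdivision described above, with the split direction alternating at each level, can be sketched as follows. This is a minimal Python illustration of the classic slice-and-dice idea and is not the code of the Treemap package used in this project. The `Node` class, the `slice_and_dice` function, the 80 by 40 canvas, and the sample tree are hypothetical, and every leaf is given the same size, as in our visualization.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    size: float = 1.0  # used for leaves only; constant here, as in the paper
    children: List["Node"] = field(default_factory=list)

    def weight(self):
        """A leaf weighs its own size; an internal node weighs the sum of its children."""
        if not self.children:
            return self.size
        return sum(child.weight() for child in self.children)


def slice_and_dice(node, rect, vertical=True, out=None):
    """Lay out `node` inside rect = (x, y, width, height); return (name, rect) pairs.

    Assumes every leaf size is positive, so the total weight is never zero.
    """
    if out is None:
        out = []
    out.append((node.name, rect))
    x, y, w, h = rect
    total = node.weight()
    offset = 0.0  # fraction of the rectangle already handed to earlier children
    for child in node.children:
        frac = child.weight() / total
        if vertical:   # cut the width: children sit side by side
            child_rect = (x + offset * w, y, frac * w, h)
        else:          # cut the height: children are stacked
            child_rect = (x, y + offset * h, w, frac * h)
        slice_and_dice(child, child_rect, not vertical, out)  # alternate direction
        offset += frac
    return out


# Hypothetical tree: category A holds two emails, B and C are single emails.
root = Node("root", children=[
    Node("A", children=[Node("a1"), Node("a2")]),
    Node("B"),
    Node("C"),
])
layout = dict(slice_and_dice(root, (0.0, 0.0, 80.0, 40.0)))
```

Here category A receives the left half of the canvas because it holds two of the four leaves, and its two emails are stacked inside it because the direction flips at the next level.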

    To use the Treemap package, each node of the data tree must implement the

    treemap.TMNode interface. Once each node implements the TMNode interface, the program only

    needs to create a new Treemap and pass it the root of the data tree in the constructor.

    We used the file directory structure to construct the hierarchy of the dataset. Each

    folder represents a category. Folders may contain any number of sub-folders. The

    bottom-level folder contains a text file, which lists all the record IDs belonging to that

    category. The Treemap can then provide any number of views of the data tree passed

    as a parameter. Each view can be configured independently of the others, and if the data

    tree changes, all views are updated.
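The folder-based hierarchy described above can be read into a tree with a short recursive walk. The sketch below is in Python rather than Java and does not use the TMNode interface; it only illustrates the directory convention. The function name `build_tree`, the `.txt` extension, the whitespace-separated integer IDs, and the folder and file names in the demo are assumptions.

```python
import os
import tempfile


def build_tree(path):
    """Return a nested dict for the category folder at `path`.

    Sub-folders become child categories; any .txt file in a folder is read
    as a whitespace-separated list of integer record IDs (an assumed format).
    """
    node = {"name": os.path.basename(path), "children": [], "record_ids": []}
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            node["children"].append(build_tree(full))
        elif entry.endswith(".txt"):
            with open(full) as handle:
                node["record_ids"].extend(int(tok) for tok in handle.read().split())
    return node


# Demo on a throwaway directory; the folder and file names are hypothetical.
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "Fluency", "bugs")
    os.makedirs(leaf)
    with open(os.path.join(leaf, "ids.txt"), "w") as handle:
        handle.write("12 57 103\n")
    tree = build_tree(os.path.join(root, "Fluency"))
```

Each record ID in a bottom-level folder would then correspond to one leaf rectangle in the Treemap.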

    Interfaces

    Two interfaces were created to explore the dataset. They applied the

    Treemap algorithm in different ways. Both interfaces represented the hierarchical structure

    in a Treemap layout, but each one used a different color coding scheme.


    Figure 2. Interface I: individual activities

    Interface I (See Figure 2) was designed to discover the individual activities

    involved in the lifecycle of the software development. Software development is

    teamwork. Exploring interaction trends in the email communication can help us understand the

    development process better and may provide guidance for further software

    development. To visualize these personal activities, a color was assigned to represent the

    emails from each of the most active persons in the email flow. The emails from the inactive

    senders were represented by another color. A threshold was applied to select the most

    active senders based on the volume of emails they sent. In a Treemap, the size of a

    node is usually determined by a numeric attribute associated with that node. For our data


    set, the size of the node is set to a constant because there is no other suitable attribute

    associated with the nodes.
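The sender-based color coding of Interface I can be sketched as a threshold on message counts. This Python fragment is an illustration only, not the project's Java code: the function `assign_sender_colors`, the threshold value, the palette, the gray color for inactive senders, and the sample senders are all hypothetical.

```python
from collections import Counter


def assign_sender_colors(senders, threshold, palette, inactive_color="gray"):
    """Map each sender to a color.

    Senders with at least `threshold` emails each get their own palette color,
    most active first; all remaining senders share `inactive_color`.
    """
    counts = Counter(senders)
    active = [name for name, n in counts.most_common() if n >= threshold]
    if len(active) > len(palette):
        raise ValueError("palette has too few colors for the active senders")
    colors = {name: palette[i] for i, name in enumerate(active)}
    return {name: colors.get(name, inactive_color) for name in counts}


# Hypothetical sender column, one entry per email.
senders = ["ann", "bob", "ann", "cy", "bob", "ann", "dee"]
sender_colors = assign_sender_colors(senders, threshold=2, palette=["red", "blue", "green"])
```

In this example two senders reach the threshold and receive distinct colors, while the two senders with a single email each share the inactive color.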

    Figure 3. Interface II: component distribution

    Interface II (See Figure 3) was designed to explore the critical human efforts

    involved in the lifecycle of the software development. Developing software requires a great deal of

    team-based intellectual effort. It would be interesting to see how these efforts are

    distributed across the whole development process. The interface also reveals the major parts of the

    process visually. We hope this will help improve project management when planning

    project progress and assigning staff. In this Treemap layout, the size of the nodes


    is set to a constant for the same reason as above. The color of each node depends on

    the category that the email belongs to.

    Interactions

    The Treemap provides a control panel that lets the user freely change how

    emails are organized (See Figure 4).

    Figure 4. User Interface Overview

    When the mouse moves over a node, a tooltip containing information about the node

    appears (See Figure 5). Users can see the detailed information by left-clicking the node

    (See Figure 6). A pop-up window then shows the whole email message. Each

    email message has the subject, the sender, and the body. A potential problem with this


    interaction is that email bodies are not well formatted, so the display may sometimes

    be messy.

    Figure 5. Interactive Function: Tooltip


    Figure 6. Interactive Function: Pop-up Window

    Discussion

    The Treemap layout is designed to handle large data sets. In this visualization, it

    can handle a much larger data set than the one we are using now. Though the

    nodes will become smaller and smaller, the goals of the visualization won't be affected:

    the patterns hidden in the data set can still be discovered. However, as the data set gets

    larger, the topics involved may also increase in number and become more diverse. This can make it

    difficult to categorize the data into meaningful groups, a step that depends on human judgment.

    Comparing the classic Treemap with the squarified Treemap, the latter is better at

    dealing with large data sets. From our results, in the classic Treemap, the


    individual cells are hard to identify and sometimes do not even display well. The

    cells can become cluttered so that only a black area can be seen. A desirable extension of the

    current system is to provide additional options for analyzing the data. Also, providing a

    legend to explain the color coding scheme would help users interpret the graphs.
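One way to see why cells in the classic layout become hard to identify is to compare cell aspect ratios. The Python arithmetic below is a rough illustration under stated assumptions: an 800 by 600 canvas, 100 equal-sized cells in one group, and a plain grid used as a stand-in for a layout that keeps cells close to square. It does not implement the squarified algorithm itself, and all function names are hypothetical.

```python
import math


def aspect_ratio(width, height):
    """Ratio of the longer side to the shorter side (1.0 means a square)."""
    return max(width / height, height / width)


def strip_cell(width, height, n):
    """Cell size when n equal cells are cut side by side across the full width."""
    return (width / n, height)


def grid_cell(width, height, n):
    """Cell size when n equal cells are arranged in a near-square grid."""
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    return (width / cols, height / rows)


strip_ratio = aspect_ratio(*strip_cell(800, 600, 100))  # each cell is 8 x 600
grid_ratio = aspect_ratio(*grid_cell(800, 600, 100))    # each cell is 80 x 60
```

Under these assumptions each strip cell is 75 times as tall as it is wide, while each grid cell has a ratio of about 1.33. Once cell borders are drawn, slivers only a few pixels wide can merge into a dark block.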

    Acknowledgement

    Thanks to Jason Baumgartner, who provided the data set and helped us in the

    development of this project.

    References

    1. C. Bouthier, Treemap visualization package, 2001.

    2. J. Baumgartner, Y. Zou, and K. Börner, Space Filling or Treemap Algorithms at

    http://iv.slis.indiana.edu/treemap.html

    3. S. L. Rohall, D. Gruen, P. Moody and S. Kellerman, Email Visualizations to Aid

    Communications, Late-Breaking Topics, Proceedings of the IEEE Symposium on

    Information Visualization, October 22-23, 2001, San Diego, CA., pp. 12-15.

    4. S. Sudarsky and R. Hjelsvold, Visualizing Electronic Mail, International Conference

    on Information Visualization, 10-12 July 2002, London, England, UK., pp. 3-9.

    5. Y. A. Kim and M. Shin, Project Report, Retrieved from

    www.cs.umd.edu/class/spring2001/cmsc838b/Project/Kim_Shin/FinalReport.doc