Mining Social Networks for Personalized Email Prioritization Shinjae Yoo, Yiming Yang, Frank Lin,...

Mining Social Networks for Personalized Email Prioritization
Shinjae Yoo, Yiming Yang, Frank Lin, Il-Chul Moon [KDD ’09]
Advisor: Dr. Koh Jia-Ling
Reporter: Che-Wei, Liang
Date: 2009/08/25

  • Outline

    Introduction
    Social Clustering
    Measuring Social Importance
    Semi-supervised Importance Propagation
    Experiments
    Conclusions and Future Work

  • Introduction

    Email
    One of the most prevalent personal and business communication tools
    Asynchronous

  • Introduction

    Information overload problem
    Need to develop systems that automatically learn personal priorities for each user
    Identify personally important messages that need the user's attention

  • Introduction

    This paper
    Creates a new collection of anonymized personal email data with importance levels
    Proposes a fully personalized methodology for technical development and evaluation
    Develops a supervised classification framework for modeling personal priorities over messages and predicting importance levels for new messages

  • Motivation

    Sender information is one of the most indicative features
    Messages sent by members of the same group tend to share similar priority levels
    Capturing sender groups would be informative for predicting the importance of messages

    If a sender has no labeled instances, that sender's importance can be inferred from other members of the same group, found by unsupervised clustering

  • Personalized Social Network

    For each user, a personalized social network is constructed from the email data of that user
    Practicality
    Personalization

    Email contact network, represented by a graph G = (V, E)
    V: email contacts (users)
    E: message sending among users, unweighted (E_ij = 1 if there is a message from user i to user j, E_ij = 0 otherwise)
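The graph definition above can be sketched in a few lines of Python. This is only an illustration: the input format (a list of `(sender, recipients)` pairs) and the function name `build_contact_graph` are assumptions made for the example, not something given in the slides.

```python
def build_contact_graph(messages):
    """Build the unweighted, directed contact graph G = (V, E).

    `messages` is a list of (sender, recipients) pairs; this input
    format is an assumption made for the example.
    """
    contacts = sorted(
        {s for s, _ in messages} | {r for _, rs in messages for r in rs}
    )
    index = {c: n for n, c in enumerate(contacts)}
    E = [[0] * len(contacts) for _ in contacts]
    for sender, recipients in messages:
        for r in recipients:
            # Unweighted: repeated messages between the same pair still give 1.
            E[index[sender]][index[r]] = 1
    return contacts, E


# Toy data (hypothetical contacts).
messages = [("alice", ["bob", "carol"]), ("bob", ["alice"]), ("alice", ["bob"])]
contacts, E = build_contact_graph(messages)
```

Here `E[i][j]` is 1 exactly when contact `i` has sent at least one message to contact `j`, matching the E_ij definition on the slide.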

  • Clustering

    Newman clustering
    Has been used successfully to find social structures
    Defines edge-betweenness
    A link with a high score is a crucial link between the boundary nodes of two clusters
    Deleting links with high edge-betweenness scores leaves disconnected components, which are taken as the clusters
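A minimal sketch of one splitting step of this kind of edge-betweenness clustering, assuming an undirected, unweighted graph without self-loops stored as a dict of neighbor sets. The helper names (`edge_betweenness`, `components`, `split_once`) and the toy graph are hypothetical, and the stopping rule (stop as soon as the number of components grows) is a simplification chosen for the example.

```python
from collections import deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an undirected, unweighted graph."""
    score = {}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path shares back from the farthest nodes.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    return score  # every node pair is counted from both ends; ranking is unaffected


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def split_once(adj):
    """Delete highest-betweenness edges until one more component appears."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:
            break  # no edges left to remove
        a, b = max(score, key=score.get)
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)


# Toy graph: two triangles joined by the bridge C-D.
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
clusters = split_once(adj)
```

On the toy graph the bridge C-D carries every shortest path between the two triangles, so it is removed first and the triangles come out as the two clusters.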

  • Measuring Social Importance

    Link relations provide useful information about the centrality of each contact

  • Measuring Social Importance

    In-degree centrality

    Out-degree centrality

    Total-degree centrality
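As a small illustration, the three degree measures can be read directly off an adjacency matrix in which `E[i][j] = 1` means contact `i` sent a message to contact `j`. The sketch below returns raw counts; whether and how the scores are normalized is not stated in the slides, so no normalization is applied here. The function name and the toy matrix are hypothetical.

```python
def degree_centralities(E):
    """Return (in-degree, out-degree, total-degree) lists of raw counts."""
    n = len(E)
    out_deg = [sum(E[i]) for i in range(n)]                       # row sums
    in_deg = [sum(E[j][i] for j in range(n)) for i in range(n)]   # column sums
    total = [i + o for i, o in zip(in_deg, out_deg)]
    return in_deg, out_deg, total


# Toy matrix: contact 0 writes to 1 and 2, contact 1 writes to 0.
E = [[0, 1, 1],
     [1, 0, 0],
     [0, 0, 0]]
in_deg, out_deg, total = degree_centralities(E)
```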

  • Measuring Social Importance

    Clustering coefficient
    Measures connectivity among the neighborhood of the node

    Clique count
    Clique: a fully connected sub-graph
    A large clique count for node v means that
    it connects to large and well-connected sub-graphs
    it is located in the center of those sub-graphs
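Both measures can be sketched on an undirected graph stored as a dict of neighbor sets. Two assumptions are made here that the slides do not confirm: the clustering coefficient of a node with fewer than two neighbors is set to 0, and "clique count" is read as the number of maximal cliques that contain the node. The function names are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves linked."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined case mapped to 0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def maximal_cliques(adj, R=frozenset(), P=None, X=frozenset()):
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    if P is None:
        P = frozenset(adj)
    if not P and not X:
        yield R
    for v in list(P):
        yield from maximal_cliques(adj, R | {v}, P & adj[v], X & adj[v])
        P = P - {v}
        X = X | {v}


def clique_count(adj, v):
    """Number of maximal cliques containing v (one reading of 'clique count')."""
    return sum(1 for clique in maximal_cliques(adj) if v in clique)


# Toy graph: triangle A-B-C plus a pendant edge C-D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
cc_c = clustering_coefficient(adj, "C")
cliques_c = clique_count(adj, "C")
```

In the toy graph, node C has three neighbors of which only A and B are linked, so its clustering coefficient is 1/3, and it belongs to the two maximal cliques {A, B, C} and {C, D}.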

  • Measuring Social Importance

    Betweenness centrality
    The fraction of shortest paths between pairs of other nodes that go through node i, that is, the ratio $\sigma_{jk}(i) / \sigma_{jk}$ accumulated over pairs j, k

    $\sigma_{jk}$: number of shortest paths between j and k
    $\sigma_{jk}(i)$: number of shortest paths between j and k that go through i
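A minimal sketch of this computation, assuming an unweighted graph stored as a dict of neighbor sets. It sums the ratio of shortest paths through `i` over all ordered pairs (j, k) and applies no normalization, since the normalization used in the slides is not recoverable; the function names are hypothetical.

```python
from collections import deque


def bfs_counts(adj, s):
    """Distances from s and the number of shortest paths to each reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness(adj, i):
    """Sum over ordered pairs (j, k) of the share of shortest paths through i."""
    info = {s: bfs_counts(adj, s) for s in adj}
    dist_i, sigma_i = info[i]
    total = 0.0
    for j in adj:
        dist_j, sigma_j = info[j]
        for k in adj:
            if len({i, j, k}) < 3:
                continue  # i, j, k must be distinct
            if i not in dist_j or k not in dist_j or k not in dist_i:
                continue  # unreachable pair
            # i lies on a shortest j-k path iff the distances add up.
            if dist_j[i] + dist_i[k] == dist_j[k]:
                total += sigma_j[i] * sigma_i[k] / sigma_j[k]
    return total


# Toy graph: the path A - B - C.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
b_mid = betweenness(adj, "B")
b_end = betweenness(adj, "A")
```

On the path A - B - C, the only shortest path between A and C runs through B, so B scores 2.0 (one for each direction) and the end nodes score 0.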

  • Measuring Social Importance

    HITS Authority
    Hyperlink-Induced Topic Search, also known as hubs and authorities
    Measures the global importance of a node

    Definition: given the N-by-N adjacency matrix $X$, the authority scores can be calculated by finding the principal eigenvector $r$ of the matrix $X^{T}X$, where $r$ satisfies $X^{T}X\,r = \lambda r$ and $\lambda$ is the largest eigenvalue
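One common way to approximate the principal eigenvector is power iteration, sketched below under the assumption that `X[i][j] = 1` means node `i` links to node `j`. The slides do not say which numerical method the authors used, so the method, the function name `hits_authority`, and the fixed iteration count are illustrative choices.

```python
def hits_authority(X, iterations=100):
    """Approximate the principal eigenvector of X^T X by power iteration."""
    n = len(X)
    r = [1.0] * n
    for _ in range(iterations):
        # Hub step: h = X r, then authority step: r = X^T h.
        hub = [sum(X[i][j] * r[j] for j in range(n)) for i in range(n)]
        r = [sum(X[i][j] * hub[i] for i in range(n)) for j in range(n)]
        norm = sum(v * v for v in r) ** 0.5
        if norm == 0:
            return r  # graph without links: all scores are zero
        r = [v / norm for v in r]
    return r


# Toy graph: node 0 links to 1 and 2, node 1 links to 2.
X = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
authority = hits_authority(X)
```

Node 2 receives links from both other nodes and ends up with the highest authority score, while node 0, which receives no links, scores 0.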

  • Measuring Social Importance

    PCC analysis
    Pearson correlation coefficient (PCC)
    Compute the PCC of each social metric with the human-labeled importance levels of email messages
    The PCC indicates how useful each metric is for predicting the importance of email messages
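The PCC itself is a standard statistic; a small sketch follows. The sample values are made up purely to show the call and are not results from the paper, and returning 0 for a constant input is a convention chosen for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # convention for a constant input, chosen for this sketch
    return cov / (sd_x * sd_y)


# Hypothetical data: one social metric value and one importance label per message.
metric_scores = [0.1, 0.4, 0.2, 0.8, 0.9]
importance_labels = [1, 3, 2, 4, 5]
pcc = pearson(metric_scores, importance_labels)
```

A value of `pcc` near 1 would suggest that the metric rises with the labeled importance; a value near 0 would suggest the metric carries little linear signal.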

  • Semi-supervised Importance Propagation

    Semi-supervised Importance Propagation (SIP)
    Propagate the importance values of labeled email messages (the training examples) to other messages and the corresponding contact persons

  • SIP Algorithm

    Treat each importance label (1 to 5) as a category
    Use an M-by-1 vector $x_k$ to indicate the labels of messages: $x_{k,i} = 1$ if message i belongs to category k, $x_{k,i} = 0$ otherwise

    Importance is propagated from messages to persons (receivers)

    Importance is propagated from persons (senders) to messages
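The label-indicator vectors described on this slide can be sketched as follows. The sketch covers only the construction of the vectors, not the propagation equations, which are not reproduced in this transcript. Representing an unlabeled message as `None` (so that it gets 0 in every category) is an assumption, as is the function name.

```python
def label_indicators(labels, levels=(1, 2, 3, 4, 5)):
    """Return {k: x_k}, where x_k[i] = 1 if message i has importance level k.

    labels[i] is the importance level of message i, or None if the
    message is unlabeled (an assumption made for this example).
    """
    return {k: [1 if lab == k else 0 for lab in labels] for k in levels}


# Five messages: three labeled (levels 4, 3, 2) and two unlabeled.
labels = [4, None, 3, 2, None]
x = label_indicators(labels)
```

Each of the five vectors has length M (here 5), and a labeled message contributes a single 1 across all of them.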

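  • Propagation Example

    [Figure: messages labeled with importance 4, 3, and 2, with unlabeled messages and persons marked "?"; importance flows from messages to persons (receivers) and from persons (senders) to messages]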

  • SIP Algorithm

    The importance values of the contact persons are updated at each time step t

  • SIP Algorithm

    Each update is a linear transformation of the previous importance vector
    If $C$ is irreducible and t is large, the vector stabilizes at the principal eigenvector of $C$
    The irreducibility property is not always guaranteed
    When it holds, the principal eigenvector is insensitive to the starting vector
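The insensitivity to the starting vector can be seen in a small numerical sketch. The matrix `C` below is a made-up, strictly positive (hence irreducible) matrix chosen so that plain power iteration converges; it is not the propagation matrix from the paper, and the function name `iterate` is hypothetical.

```python
def iterate(C, y, steps=200):
    """Repeatedly apply y <- C y, rescaling so the entries sum to 1."""
    n = len(C)
    for _ in range(steps):
        y = [sum(C[i][j] * y[j] for j in range(n)) for i in range(n)]
        total = sum(y)
        y = [v / total for v in y]
    return y


# Hypothetical strictly positive matrix (every column sums to 1).
C = [[0.5, 0.2, 0.1],
     [0.3, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

from_first = iterate(C, [1.0, 0.0, 0.0])   # all weight on the first person
from_last = iterate(C, [0.0, 0.0, 1.0])    # all weight on the last person
```

Both runs end at the same vector, which means that whatever information was encoded in the starting vector has been washed out by the iteration.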

  • SIP Algorithm

    Finally, the SIP method is defined iteratively
    $E_k$ is irreducible, so $y_k$ stabilizes when t is large
    $y_k$ consists of the expected importance score of each person after iterative SIP

  • Experiments

    Features
    Basic features: tokens in the from, to, cc, title, and body text, represented by a v-dimensional vector
    Social-network-based features
    An m-dimensional sub-vector represents the NC (Newman clustering) features
    A 7-dimensional sub-vector represents the social importance (SI) features
    A 5-dimensional sub-vector represents the five SIP scores per user
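The feature layout above amounts to concatenating four sub-vectors. The sketch below shows only that layout; the function name, the ordering of the blocks, and all values are hypothetical, and the sizes v = 4 and m = 3 are arbitrary toy choices.

```python
def build_feature_vector(basic, nc, si, sip):
    """Concatenate basic, NC, SI, and SIP sub-vectors into one feature vector."""
    if len(si) != 7:
        raise ValueError("expected 7 social importance (SI) values")
    if len(sip) != 5:
        raise ValueError("expected 5 SIP scores")
    return list(basic) + list(nc) + list(si) + list(sip)


# Hypothetical values for one message.
basic = [1, 0, 2, 0]                          # v = 4 token features
nc = [0, 1, 0]                                # m = 3 NC features
si = [3, 1, 4, 0.33, 2, 2.0, 0.85]            # 7 SI metrics
sip = [0.05, 0.10, 0.20, 0.40, 0.25]          # 5 SIP scores
features = build_feature_vector(basic, nc, si, sip)
```

The resulting vector has v + m + 7 + 5 entries, here 19.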

  • Conclusions and Future Work

    Future work
    Collect more data from a larger number of users over a longer time period
    Compare different clustering algorithms and graph-mining techniques with respect to effectiveness

    * How can we effectively learn user-specific models for accurate prediction of personalized importance using only small amounts of labeled training data and limited observations on personal communications with others?

    * Presents the first study of the personalized email prioritization (PEP) problem based on personal importance judgments by multiple users, using several statistical classification and clustering methods. Creates a new dataset of anonymized email messages from each user, used for training and testing.

    * A global social network may include noisy features and de-emphasize personalization.

    * Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. It determines two values for a page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.

    * User 1: clustering coefficient, clique count, and HITS authority.