
Improving Latent User Models in Online Social Media

Adit Krishnan
Dept of Computer Science, University of Illinois at Urbana-Champaign
[email protected]

Ashish Sharma
Dept of Computer Science, Indian Institute of Technology Kharagpur
[email protected]

Hari Sundaram
Dept of Computer Science, University of Illinois at Urbana-Champaign
[email protected]

ABSTRACT

Modern social platforms are characterized by the presence of rich user-behavior data associated with the publication, sharing and consumption of textual content. Users interact with content and with each other in a complex and dynamic social environment while simultaneously evolving over time. To effectively characterize users and predict their future behavior in such a setting, it is necessary to overcome several challenges. Content heterogeneity and temporal inconsistency of behavior data result in severe sparsity at the user level. In this paper, we propose a novel mutual-enhancement framework to simultaneously partition and learn latent activity profiles of users. We propose a flexible user partitioning approach to effectively discover rare behaviors and tackle user-level sparsity.

We extensively evaluate the proposed framework on massive datasets from real-world platforms including Q&A networks and interactive online courses (MOOCs). Our results indicate significant gains over state-of-the-art behavior models (15% avg.) in a varied range of tasks, and our gains are further magnified for users with limited interaction data. The proposed algorithms are amenable to parallelization, scale linearly in the size of datasets, and provide flexibility to model diverse facets of user behavior. An updated, published version of this paper can be found here [18].

KEYWORDS

Behavior Models; Latent Factor Models; Knowledge Exchange Networks; MOOCs; Data Skew

1 INTRODUCTION

This paper addresses the problem of developing robust statistical representations of participant behavior and engagement in online knowledge-exchange networks. Examples of knowledge-exchange networks include interactive MOOCs (Massive Open Online Courses), where participants interact with lecture content and peers via course forums, and community Q&A platforms such as Stack-Exchanges1. We would like statistical representations to provide insight into dominant behavior distributions, and to understand how an individual's behavior evolves over time. Understanding and profiling time-evolving participant behavior is important in several applications; for instance, pro-actively identifying unsatisfactory student progress in MOOCs may lead to redesign of the online learning experience. The dynamic nature and diversity of user activity in such learning environments pose several challenges to profiling and predicting behavior.

Behavior skew poses a significant challenge in identifying informative patterns of user engagement with interactive social-media platforms.

1 https://stackexchange.com

[Figure 1 plots: (a) user percentage (0-60) by dominant action type (Comment, Like, Question, Answer, Edit) in Ask-Ubuntu; (b) tag count (×1000, 0-20) for Ask-Ubuntu content tags 14.04, Drivers, Graphics, SSH, BootLoader.]

Figure 1: Illustration of (a) Dominant Action Skew and (b) Content Skew in our largest Stack-Exchange, Ask-Ubuntu [9, 35].

In the Ask-Ubuntu2 Q&A network, most users primarily engage by commenting on posts, as seen in Figure 1(a). Subject experts who invest most of their time editing or answering questions are relatively infrequent. The extent of behavior skew is compounded by the presence of popular and niche subject areas (Figure 1(b)). For instance, users who comment on popular topics vastly outnumber those who edit or answer questions on niche subject areas.

Inconsistent participation and user-level data sparsity are other prominent challenges in most social media platforms [3, 13, 38]. In community Q&A websites and MOOCs, a minority of participants dominate activity in a classic power-law distribution [3], as observed in Figure 2(a). Additionally, an overwhelming majority of users record activity on less than 10% of observed days in our datasets (Figure 2(b)). Temporal inconsistency in user participation renders evolutionary user-modeling approaches [30, 31] ineffective for sparse or bursty participants in social learning platforms.

Despite several relevant lines of work, including user evolution modeling [30], behavior factorization [50], sparsity-aware tensor factorization [13] and contextual text mining [31], there are few systematic studies addressing these pervasive challenges in modeling user behavior across diverse platforms and applications. Evolutionary activity-sequence-based user modeling approaches [30] do not explicitly account for sparse or bursty users, and are best suited to temporally consistent user activity.

2 https://askubuntu.com/

[Figure 2 plots: (a) user percentage (0-70) by temporal presence (% days with activity, 0-20); (b) user percentage (0-40) by interaction count (0-20), for the Ask-Ubuntu Stack Exchange and the Comp Sci-1 Coursera MOOC.]

Figure 2: Temporal Consistency and User Interaction Volume (η ≈ 3) are highly skewed in Stack-Exchange/Coursera

arXiv:1711.11124v2 [cs.SI] 7 Feb 2019


Matrix factorization methods have been adapted to extract dynamic user representations and account for evolution of user interests. Jiang et al. [13] develop a multi-facet tensor factorization approach for evolutionary analysis, employing facet-wise regularizer matrices to tackle sparsity. [50] discovers topical interest profiles via simultaneous factorization of action-specific matrices. Quadratic scaling imposes restrictive computational limits on these methods. Generative behavior models are controlled by introducing Dirichlet priors in user profile assignments [25, 31, 32]. However, this setting is limited in its ability to model skew, and could merge infrequent behavior profiles with common ones. Furthermore, the behaviors learned could be contaminated by the presence of several inactive users.

In contrast to these approaches, we propose to simultaneously partition and profile users in a unified latent framework to adapt to varying degrees of skew and data sparsity in our network. Our user-partitioning scheme builds upon preferential attachment models [1, 29] to explicitly discount common activity profiles and favor exploration of diverse user partitions, learning fine-grained representations of behavioral signals. Mutual enhancement of behavior profiling and user partitioning can also bridge temporal inconsistencies or sparsity in participant activity. Users exhibiting similar engagement patterns are jointly grouped and profiled within partitions. Furthermore, our latent behavior profiles can be flexibly defined to integrate several facets of user behavior and evolution, hence generalizing to diverse social platforms and applications.

The main contributions of this paper are:

• Partitioning and Profiling: We simultaneously generate flexible user partitions and profile user behavior within partitions in a unified mutually-enhancing latent framework. Our partitioning scheme can adapt to varying levels of behavior skew, effectively uncover fine-grained or infrequent engagement patterns, and address user-level sparsity.

• Generalizability: Our model is generalizable to diverse platforms and applications. User profiles can be flexibly defined to integrate several facets of user behavior, social activity and temporal evolution of interests, providing comprehensive user representations.

• User Evolution: We formalize our evolutionary profiles to integrate the time-evolving content-action associations observed in user activity and social dynamics.

• Efficiency: We provide several optimizations for efficient model inference (see Section 5) and scale linearly in the size of our datasets, compared to quadratic-time scaling in tensor factorization approaches.

Extensive experiments over large Coursera3 datasets as well as Stack-Exchange websites indicate that our approach strongly outperforms state-of-the-art baselines. We perform three prediction tasks: certificate completion prediction (MOOCs), reputation prediction (Stack-Exchange), and behavior distribution prediction. For certificate prediction, we outperform baselines on the AUC measure by 6.26%-15.97%; reputation prediction by 6.65%-21.43%; and behavior prediction on MOOCs (12%-25%) and Stack-Exchanges (9.5%-22%).

3 https://coursera.org

On experiments related to activity sparsity, we see magnified gains on participants who offer limited data (10.2%-27.1%). We also examine the effects of reducing behavior skew: our approach still outperforms baselines on data with reduced skew. Our scalability analysis shows that the model scales well and is amenable to a parallel implementation. Finally, we study the effects of model parameters and obtain stable performance over a broad range of parameter values, indicating that in practice our model requires minimal tuning.

We organize the rest of the paper as follows. In the next section, we formalize our data abstraction and user representation. In Section 3, we describe the datasets used, and in Section 4, we present our approach. Section 5 describes a collapsed Gibbs sampler to infer model parameters, and experimental results are presented in Section 6. We discuss related work in Section 7 and conclude the paper in Section 8.

2 PROBLEM DEFINITION

We study networks where participants seek primarily to gain and exchange knowledge (e.g. MOOCs, Stack-Exchanges). In these networks, participants act (e.g. "post", "play video", "answer") on content and communicate with other participants. Content may either be participant-generated (e.g. in a forum) or external (e.g. a MOOC lecture). Interactions with content encode latent knowledge and intent of participants; for instance, answering or editing published content is indicative of subject expertise. Furthermore, social exchanges between participants play an important role in profiling their activity.

Let U denote the set of all participants on the network. These participants employ a finite set of distinct actions A to interact with content generated from vocabulary V. Atomic participant activity is referred to as an interaction. We define each interaction d as a tuple d = (a, W, t), where the participant performs action a ∈ A on content W = {w | w ∈ V} at a normalized time-stamp t ∈ [0, 1]. We denote the set of all interactions of participant u as Du. Thus the entire collection of interactions in the network is given by D = ∪u∈U Du.

Inter-participant links are represented by a directed multigraph G = (U, E). A directed labeled edge (u, v, l) ∈ E is added for each interaction du ∈ Du of user u (e.g. "answer") that is in response to an interaction dv ∈ Dv of user v (e.g. "ask question"), with edge label l ∈ L indicating the specific nature of the social exchange (e.g. "answer" → "question").

Our model infers a set of latent activity profiles R, where each profile r ∈ R encodes a specific evolutionary pattern of user behavior and social engagement. Observable facets of user behavior, namely (Du, Lu), u ∈ U, are drawn from profile r ∈ R with likelihood p(Du, Lu | r), which we abbreviate as p(u | r). Each user is then represented by the vector of likelihoods over latent profiles, i.e. Pu = [p(u | r) ∀ r ∈ R].
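The abstraction above can be illustrated with minimal data structures (a hypothetical sketch; class and function names are ours, not part of the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interaction:
    """One atomic interaction d = (a, W, t)."""
    action: str         # a ∈ A
    content: frozenset  # W ⊆ V, the word set of the content
    time: float         # normalized time-stamp t ∈ [0, 1]

def user_representation(profile_likelihoods):
    """P_u = [p(u | r) for r ∈ R], the vector representing user u."""
    return list(profile_likelihoods.values())

# Example: a user with two interactions, under |R| = 3 latent profiles.
d1 = Interaction("answer", frozenset({"ssh", "keys"}), 0.25)
d2 = Interaction("comment", frozenset({"drivers"}), 0.80)
D_u = {d1, d2}
P_u = user_representation({"r1": 0.6, "r2": 0.3, "r3": 0.1})
assert 0.0 <= d1.time <= 1.0 and len(P_u) == 3
```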

3 DATASET DESCRIPTION

We study datasets from two diverse real-world learning platforms. They encompass rich temporal behavior data in conjunction with textual content, and a community element whereby participants form social ties with each other.



Platform    Action          Description
MOOC        Play            First lecture segment view
            Rewatch         Repeat lecture segment view
            Clear Concept   Back and forth movement, pauses
            Skip            Unwatched lecture segment
            Create Thread   Create a forum thread with a question
            Post            Reply to existing threads
            Comment         Comment on existing posts
Stack Ex.   Question        Posting a question
            Answer          Authoring answer to a question
            Comment         Comment on a question/answer
            Edit            Modify posted content
            Favorite        Liking posted content

Table 1: User Action Description (Coursera/Stack-Exchange)

Stack-Exchange Q&A Networks: Stack-Exchanges are Q&A websites covering broad domains of public interest. Users interact by asking and answering questions, and by editing, liking and commenting on published content (Table 1). Furthermore, users communicate by reacting to other users' activity, specifically liking, editing, favoriting and answering, hence setting up Editing, Liking and Answering links between pairs of users, indicative of their shared interests and knowledge. We apply our model on several Stack-Exchange websites from varied domains (Table 2).

Coursera MOOC Platform: Coursera MOOCs feature a structured learning environment driven both by lecture content and by communication between students and instructors via multiple course forums. Patterns of lecture viewing obtained from video clickstreams provide valuable cues on student learning behavior [30, 40], in addition to forum activity [26]. We combine these two sources to define the action set of students (Table 1). Lecture segment content is extracted from subtitle files. Students engage in social exchanges by commenting on or upvoting content from their peers. We study several MOOC datasets described in Table 2.

Platform   Dataset       #Users    #Interactions   ηt      SN
Coursera   Math          10,796    162,810         -2.90   0.69
           Nature        6,940     197,367         -2.43   0.70
           Comp Sci-1    26,542    834,439         -2.51   0.67
           Comp Sci-2    10,796    165,830         -2.14   0.73
Stack-Ex   Ask-Ubuntu    220,365   2,075,611       -2.81   0.65
           Android       28,749    182,284         -2.32   0.56
           Travel        20,961    277,823         -2.01   0.66
           Movies        14,965    150,195         -2.17   0.67
           Chemistry     13,052    175,519         -2.05   0.63
           Biology       10,031    138,850         -2.03   0.71
           Workplace     19,820    275,162         -2.05   0.59
           Christianity  6,417     130,822         -1.71   0.64
           Comp. Sci.    16,954    183,260         -2.26   0.62
           Money         16,688    179,581         -1.72   0.63

Table 2: Preliminary Analysis of Behavior Skew and Temporal Inconsistency of participant activity in our datasets

To quantify data sparsity in our datasets, we compute the power-law index (ηt) that best describes the fraction of users against the number of weeks with activity. A more negative index indicates that fewer users are consistently active over time. Behavioral skew can be quantified by grouping participants by their dominant action type (e.g. users who mostly comment), and computing the normalized entropy (SN) of the resulting distribution of users across actions. In large Stack-Exchanges such as Ask-Ubuntu, 'Answer' is the dominant action for less than 5% of users while 'Comment' accounts for over 60% (Figure 1). In MOOCs, 'Play' is the most common action and forum interactions are rare (~10-15% participation), resulting in fewer social links. It is interesting to observe that large Stack-Exchanges have more inactive users in comparison to niche domains of discussion (Ask-Ubuntu vs. Money, Table 2).
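As a sketch, the normalized entropy SN described above can be computed as follows (helper names are ours; fitting the power-law index ηt would be done separately, e.g. by a least-squares fit on log-log counts):

```python
import math
from collections import Counter

def normalized_entropy(dominant_actions):
    """S_N: Shannon entropy of the user distribution over dominant
    action types, normalized by log of the number of distinct types,
    so 1.0 means a perfectly balanced platform and values near 0 mean
    heavy behavior skew."""
    counts = Counter(dominant_actions)
    n = sum(counts.values())
    probs = [c / n for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(counts)) if len(counts) > 1 else 0.0

# Heavily skewed platform: most users' dominant action is "comment".
skewed = ["comment"] * 60 + ["answer"] * 5 + ["edit"] * 5
balanced = ["comment", "answer", "edit", "question"] * 10
assert normalized_entropy(skewed) < normalized_entropy(balanced)
```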

4 OUR APPROACH

In this section, we motivate our behavior profiling framework (Section 4.1), proceeding in two simultaneous mutually-enhancing steps: discovering diverse homogeneous partitions of users in the network (Section 4.2), and learning latent evolutionary profiles characterizing facets of their content interactions and social exchanges (Section 4.3).

4.1 Motivation

We develop our profiling framework with two key objectives. First, to account for behavior skew (participants are unevenly distributed across varying behavior patterns) as well as temporal inconsistency and sparsity in user-level data. Second, to learn informative evolutionary profiles simultaneously characterizing engagement with content and social exchanges between users.

Modeling inherent behavior skew necessitates the development of profiling approaches that adapt to the observed data and learn informative representations of user activity. Generative behavior models are traditionally controlled by introducing Dirichlet priors in the topic assignment process [25, 31, 32]. However, this setting is limited in its ability to model inherent topical skew, where some outcomes significantly outnumber others. In the context of user behavior, it is necessary to explicitly account for the presence of skew in the proportions of behavior patterns, and to effectively separate users to learn discriminative evolutionary profiles of activity. There are two key advantages to mutually enhancing user partitioning and profile learning over conventional profile assignments:

• Tackling Behavior Skew: Conventional topic assignments tend to merge infrequent behavior patterns with common ones, resulting in uninformative profiles. Our approach explicitly discounts common profiles and favors diverse user partitions. Profile variables assigned to these partitions learn informative representations of infrequent behaviors.

• Temporal Inconsistency and Sparsity: Our profile assignment process enforces common profiles across inconsistent and active users within homogeneous partitions. As our inference process converges, users with limited data are probabilistically grouped with the most similar users based on available interaction data.

We now proceed to formalize our user partitioning scheme.


Symbol : Description

D, U, A, V : Set of all content interactions, users, actions and content vocabulary
Du, Lu : Content interactions and social links observed for user u ∈ U
d = (a, W, t) : Interaction involving action a on content W at time t
(s, u, l), (u, y, l′) ∈ Lu; l, l′ ∈ L : l-label inward link from source s, l′-label outward link to target y; L is the predefined link label set
R, K : The set of evolution profiles, and the set of behavior topics
ϕVk, ϕAk : Multinomial word and action distributions in behavior k ∈ K
ϕKr : Multinomial distribution over behaviors for profile r ∈ R
αrk, βrk : Parameters of the temporal Beta distribution for behavior k in profile r
ϕLr,r′ : Multinomial distribution over link labels for links directed from profile r to r′
γ, δ, G0 : Scale parameter, discount parameter and base distribution of the Pitman-Yor process
a, na, ra; N, Ar : Table ID, # seated users, profile served on it; # total seated users and # tables serving profile r ∈ R
αV, αA, αK, αL : Dirichlet-Multinomial priors for ϕVk, ϕAk, ϕKr and ϕLr,r′, ∀ k ∈ K and r, r′ ∈ R

Table 3: Notations used in this paper

4.2 Skew-Aware User Partitioning

We first introduce a basic preferential attachment model based on the Pitman-Yor process [29] to generate skewed partitions of integers. We proceed to develop our profile-driven user partitioning scheme, building upon the Chinese Restaurant perspective [1] of the Pitman-Yor process to group similar users within partitions. Our approach explicitly discounts common behavior profiles to generate diverse user partitions and learn subtle variations and infrequent behavior patterns in the network. Additionally, it jointly profiles sparse and temporally inconsistent users with best-fit partitions.

4.2.1 Basic Preferential Attachment. The Pitman-Yor process [29] (a generalization of the Dirichlet process [43]) induces a distribution over integer partitions, characterized by a concentration parameter γ, a discount parameter δ, and a base distribution G0. An interpretable perspective is provided by the Chinese Restaurant seating process [1] (CRP), where users entering a restaurant are partitioned across tables. Each user is either seated at one of several existing tables [1, . . . , A], or assigned a new table A + 1 as follows,

p(a | u) ∝ (na − δ) / (N + γ),   a ∈ [1, A] (existing table);
           (γ + Aδ) / (N + γ),   a = A + 1 (new table)      (1)

where na is the number of users seated at existing table a ∈ [1, A], A + 1 denotes a new table, and N = Σa∈[1,A] na is the total number of participants seated across all tables. Equation (1) thus induces a preferential seating arrangement proportional to the current size of each partition. The concentration (γ) and discount (δ) parameters govern the formation of new tables.

This simplistic assignment thus captures the "rich get richer" power-law characteristic of online networks [3]. However, a significant drawback is its inability to account for similarities among users seated together. Our approach enforces a profile-aware seating arrangement to generate homogeneous user partitions.
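The seating rule in Equation (1) can be sketched as a small sampler (a hypothetical illustration; function and variable names are ours). The common denominator (N + γ) cancels, so unnormalized weights suffice:

```python
import random

def seat_user(table_counts, gamma, delta, rng=random):
    """Sample a table index per Eq. (1): existing table a has weight
    (n_a − δ); a new table has weight (γ + A·δ). Returning an index
    equal to len(table_counts) means 'open a new table'."""
    A = len(table_counts)
    weights = [n - delta for n in table_counts] + [gamma + A * delta]
    x = rng.random() * sum(weights)
    for a, w in enumerate(weights):
        x -= w
        if x < 0:
            return a
    return A

rng = random.Random(7)
tables = []  # counts n_a per table
for _ in range(500):
    a = seat_user(tables, gamma=1.0, delta=0.5, rng=rng)
    if a == len(tables):
        tables.append(1)   # new table seats its first user
    else:
        tables[a] += 1
# "Rich get richer": a few tables absorb most of the 500 users.
assert sum(tables) == 500
```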

4.2.2 Profile-Driven Preferential Attachment. Let us now assume the presence of a set of evolutionary profiles R describing temporal patterns of user engagement and their social ties. Each user u ∈ U is associated with a set of time-stamped interactions Du and social links Lu. The likelihood of generating these observed facets via profile r ∈ R is given by p(Du, Lu | r), which we abbreviate to p(u | r).

Thus, to continue the restaurant analogy above, we "serve" a table-specific profile ra ∈ R to participants seated at each table a ∈ [1, A]. When we seat participant u at a new table A + 1, a profile variable rA+1 ∈ R is drawn on the new table to describe u,

rA+1 ∼ p(u | r) p(r)

where p(r) is parameterized by the base distribution G0 of the Pitman-Yor process, acting as a prior over the set of profiles. We set p(r) to a uniform distribution to avoid bias. A user u in our model is thus seated at table a as follows,

p(a | u) ∝ (na − δ) / (N + γ) × p(u | ra),                  a ∈ [1, A];
           (γ + Aδ) / (N + γ) × (1/|R|) Σr∈R p(u | r),      a = A + 1.      (2)

Thus, the likelihood p(r | u) of assigning a specific profile r to participant u is obtained by summing over the likelihoods of being seated at existing tables serving r, and the likelihood of being seated at a new table A + 1 with the profile draw rA+1 = r,

p(r | u) = ( Σa∈[1,A], ra=r (na − δ) / (N + γ) ) p(u | r) + (1/|R|) · (γ + Aδ) / (N + γ) · p(u | r)      (3)

         = ( (Nr − Ar δ) / (N + γ) + (γ + Aδ) / (|R|(N + γ)) ) p(u | r)      (4)

where Ar is the number of existing tables serving profile r, and Nr is the total number of participants seated at tables serving r.

Discount Factor: The extent of skew is jointly controlled both by the number of users exhibiting similar activity patterns, encoded by p(u | r), and by the setting of the discount parameter δ. Common profiles are likely to be drawn on several tables, so their probability masses are discounted by the product Ar δ in Equation (4). A higher setting of δ favors exploration by seating users at new tables and generating diverse partitions, learning subtle variations in profiles rather than merging them.

Temporal Inconsistency: Users offering limited evidence are likely to be assigned to popular profiles that well explain their limited interaction data. Our user partitioning scheme enforces a common profile assignment across users sharing a partition, thus ensuring proximity of their likelihood distributions over the latent space of the inferred evolutionary profiles.



Algorithm 1 Profile-Driven Preferential Attachment
1: for each user u ∈ U do
2:   Sit at existing table a ∈ [1, A] ∝ (na − δ) / (N + γ) × p(u | ra)
3:   Sit at new table A + 1 ∝ (γ + Aδ) / (N + γ) × Σr∈R p(u | r) × (1/|R|)
4:   if A + 1 is chosen then
5:     Draw rA+1 ∝ p(u | r) × (1/|R|), rA+1 ∈ R

Simplistic preferential assignment (Equation (1)) can be interpreted as a specialization of our model in which all evolutionary profiles r ∈ R are equally likely for every user. Our model effectively generalizes Equation (1) to generate partitions of typed entities rather than integers. The resulting seating arrangement in our model can be shown to be exchangeable, similar to that in [1], and is hence amenable to efficient inference. Our partitioning approach can be extended to several diverse applications, depending on the precise formulation of p(u | r). In the next subsection, we formalize our definition of the evolutionary profiles r ∈ R.
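Algorithm 1 above can be sketched in code as follows. This is an illustrative sketch under our own names, not the authors' inference code: the likelihood p(u | r) is passed in as a precomputed dict, whereas in the full model it comes from the evolutionary profiles of Section 4.3.

```python
import random

def seat_user_profiled(tables, profiles, p_u_given_r, gamma, delta, rng=random):
    """Algorithm 1. `tables` is a list of (n_a, r_a) pairs.
    Existing table a: weight (n_a − δ) · p(u | r_a).
    New table: weight (γ + A·δ) · mean over r of p(u | r);
    if chosen, its profile r_{A+1} is drawn ∝ p(u | r)."""
    profiles = list(profiles)
    A = len(tables)
    weights = [(n - delta) * p_u_given_r[r] for n, r in tables]
    weights.append((gamma + A * delta) *
                   sum(p_u_given_r[r] for r in profiles) / len(profiles))
    x = rng.random() * sum(weights)
    a = A
    for i, w in enumerate(weights):
        x -= w
        if x < 0:
            a = i
            break
    if a == A:  # open a new table and draw its profile ∝ p(u | r)
        r_new = rng.choices(profiles,
                            weights=[p_u_given_r[r] for r in profiles])[0]
        tables.append((1, r_new))
    else:
        n, r = tables[a]
        tables[a] = (n + 1, r)
    return a

rng = random.Random(0)
tables = []
likelihood = {"expert": 0.9, "lurker": 0.1}  # this user looks like an expert
for _ in range(200):
    seat_user_profiled(tables, likelihood, likelihood, 1.0, 0.5, rng)
assert sum(n for n, _ in tables) == 200
```

Tables serving the "expert" profile end up holding most of the seats, illustrating how the likelihood term steers similar users into shared partitions.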

4.3 Evolutionary Activity Profiling

We now formalize the notion of latent behaviors and evolutionary activity profiles. A behavior (or behavioral topic) k ∈ K is jointly described by ϕVk, a |V|-dimensional multinomial distribution over words, and ϕAk, an |A|-dimensional multinomial distribution over the set of actions. The combined probability of observing an action a (e.g. "play", "post") on content W = {w | w ∈ V} (e.g. a sentence on "Probability") conditioned on behavior k is given by,

p(a, W | k) ∝ ϕAk(a) Πw∈W ϕVk(w).      (5)

Each observed interaction is assumed to be drawn from a single behavior k ∈ K, thus learning consistent action-content associations.

Evolutionary profiles are temporal mixtures over behaviors. We describe each activity profile r ∈ R jointly by a K-dimensional multinomial distribution over the K latent behaviors, parameterized by ϕKr, and a Beta distribution over normalized time t ∈ [0, 1], specific to each behavior k ∈ K and parameterized by {αrk, βrk}. Each component ϕKr(k) of the multinomial distribution is the likelihood of observing behavior k in profile r. The αrk, βrk parameters of the Beta distributions capture the temporal trend of behavior k within profile r.

We draw an interaction d = (a, W, t) within profile r by first drawing a behavior k in proportion to ϕKr(k). We then draw action a and content W conditioned on behavior k, and time t conditioned on both r and k. Thus, the likelihood p(d | r) of observing interaction d in profile r is obtained by marginalizing over behaviors,

p(d | r) ∝ Σk ϕKr(k) × p(a, W | k) × p(t | r, k).      (6)

where p(a, W | k) is computed as in Equation (5), and p(t | r, k) through the corresponding Beta distribution Beta(t; αrk, βrk),

p(t | r, k) = t^(αrk − 1) (1 − t)^(βrk − 1) / B(αrk, βrk).      (7)

The Beta distribution offers us flexibility in modeling temporal association. Prior behavior models [13, 30] used static time slicing to describe user evolution. Choosing an appropriate temporal granularity is challenging: a single granularity may be inadequate to model heterogeneous user activity. Since we analyze behavior data recorded over finite intervals, the parameterized Beta distribution is capable of learning flexible continuous variations over normalized time-stamps via parameter estimation [45].

We can now compute the likelihood of observing the entire set of interactions Du of a user u ∈ U conditioned on profile r as,

p(Du | r) ∝ Πd∈Du p(d | r)      (8)

The above process is summarized in Algorithm 2. In addition to the interaction set D_u, we now proceed to exploit inter-participant link structure in the construction of activity profiles.

4.3.1 Modeling Participant Links. We create a directed multigraph between pairs of profiles to incorporate the nature of participant ties. Edge labels l ∈ L describe the nature of exchanges (e.g. "question", "answer") connecting the users who perform them, and the directionality of an exchange is indicative of the implicit social relationship between their profiles. We thus associate a multinomial distribution ϕ^L_{r,r'} with each ordered pair of profiles (r, r'), setting up |R|² distributions in all.

The network structure also enables participants with inconsistent interaction data to take advantage of the more extensive interaction data of their neighbors. The probability p(L_u | r_u) of the set of links L_u associated with user u is proportional to the likelihood of independently observing each link, based on the current profile assignment r_u ∈ R and the profiles of the users linked to u,

p(L_u | r_u) ∝ ∏_{(s,u,l) ∈ L_u} ϕ^L_{r_s, r_u}(l) × ∏_{(u,y,l) ∈ L_u} ϕ^L_{r_u, r_y}(l)    (9)
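A minimal sketch of eq. (9), with a hypothetical profile assignment map and toy link data:

```python
import numpy as np

# Hypothetical setup: R profiles, L link labels.
R, L = 3, 2
rng = np.random.default_rng(1)
phi_L = rng.dirichlet(np.ones(L), size=(R, R))  # phi^L_{r,r'}: one multinomial per ordered pair

def p_links_given_profile(ru, inward, outward, profile_of):
    """Eq. (9): product of label likelihoods over u's inward and outward links."""
    p = 1.0
    for (s, l) in inward:                 # link (s, u, l): source s -> u
        p *= phi_L[profile_of[s], ru, l]
    for (y, l) in outward:                # link (u, y, l): u -> target y
        p *= phi_L[ru, profile_of[y], l]
    return p

profile_of = {0: 0, 1: 2}                 # current profile assignments of u's neighbors
lik = p_links_given_profile(1, inward=[(0, 1)], outward=[(1, 0)], profile_of=profile_of)
```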

Figure 3: Graphical model illustrating our simultaneous partitioning and profiling framework: Evolutionary Activity Profiling over user behavior and social links, with Profile-Driven Preferential Attachment.



Algorithm 2 Generative process for drawing facets D_u, L_u from profile r ∈ R assigned to the partition containing user u ∈ U

1:  for each behavior k ∈ K do
2:    Draw word distribution ϕ^V_k ~ Dir(α_V)
3:    Draw action distribution ϕ^A_k ~ Dir(α_A)
4:  for each profile r ∈ R do
5:    Draw distribution over behaviors ϕ^K_r ~ Dir(α_K)
6:    for each profile r' ∈ R do
7:      Choose link distribution ϕ^L_{r,r'} ~ Dir(α_L)
8:  for each behavior interaction d = (a, W, t) ∈ D_u do
9:    Choose behavior k ~ Multi(ϕ^K_r)
10:   for each word w ∈ W_d do
11:     Draw w ~ Multi(ϕ^V_k)
12:   Draw action a ~ Multi(ϕ^A_k)
13:   Draw normalized time t ~ Beta(α_rk, β_rk)
14: for each inward link (s, u, l) ∈ L_u do
15:   Let r_s denote the source user's profile
16:   Draw (s, u, l) ~ Multi(ϕ^L_{r_s, r})
17: for each outward link (u, y, l) ∈ L_u do
18:   Let r_y denote the target user's profile
19:   Draw (u, y, l) ~ Multi(ϕ^L_{r, r_y})
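The generative process of Algorithm 2 can be sketched as follows; the sizes and prior values are illustrative, not those used in our experiments:

```python
import numpy as np

# Hypothetical toy sizes: behaviors, profiles, vocabulary, actions, link labels.
K, R, V, A, L = 4, 3, 10, 3, 2
rng = np.random.default_rng(2)

phi_V = rng.dirichlet(0.5 * np.ones(V), size=K)              # lines 1-3: per-behavior words
phi_A = rng.dirichlet((50.0 / A) * np.ones(A), size=K)       # lines 1-3: per-behavior actions
phi_K = rng.dirichlet((50.0 / K) * np.ones(K), size=R)       # lines 4-5: per-profile mixture
phi_L = rng.dirichlet((50.0 / L) * np.ones(L), size=(R, R))  # lines 6-7: link-label dists
ab = np.ones((R, K, 2))                                      # Beta(alpha_rk, beta_rk) = (1, 1)

def draw_interaction(r, n_words=3):
    """Lines 8-13: draw one interaction d = (a, W, t) for a user with profile r."""
    k = rng.choice(K, p=phi_K[r])
    W = rng.choice(V, size=n_words, p=phi_V[k])
    a = rng.choice(A, p=phi_A[k])
    t = rng.beta(*ab[r, k])
    return a, W, t

def draw_link_label(r_src, r_dst):
    """Lines 14-19: draw the label of a link between two profiles."""
    return rng.choice(L, p=phi_L[r_src, r_dst])

a, W, t = draw_interaction(r=0)
l = draw_link_label(r_src=0, r_dst=1)
```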

where ϕ^L_{r_s, r_u}(l) is the likelihood of an l-labeled inward link to user u emerging from a source user s with profile r_s, and ϕ^L_{r_u, r_y}(l) is that of the analogous outward link to user y. We thus encode observed social links as implicit relationships between the respective evolutionary profiles.

We combine social ties L_u and content interactions D_u (eqs. (8) and (9)) to compute the joint conditional probability p(u | r),

p(u | r) ∝ p(D_u | r) × p(L_u | r)    (10)

The above equation provides the likelihood of describing user u ∈ U with a chosen profile r ∈ R. The generative process corresponding to eq. (10) is summarized by Algorithm 2.

In this section, we motivated our partitioning and profiling framework to tackle the issues of skew and sparsity in social learning environments. Next, we describe a collapsed Gibbs-sampling approach [21] for model inference, where we iteratively sample user seating arrangements and update profile parameters to reflect the resulting set of user partitions, until mutual convergence.

5 MODEL INFERENCE

In this section we describe a collapsed Gibbs-sampling [21] approach for model inference, analyze its computational complexity, and propose an efficient parallel batch-sampler to scale to large datasets.

5.1 Inference via Gibbs Sampling

We use collapsed Gibbs sampling, a widely used Markov-Chain Monte-Carlo (MCMC) algorithm, to sample user seating and learn profiles by iteratively sampling the latent profile variable r_u for each user u ∈ U, the latent behavior-topic assignments k_d for interactions d ∈ D_u, and the table a_u serving the sampled profile r_u on which user u is seated.

Symbol | Description
n^(w)_k, n^(a)_k, n^(.)_k | Number of times word w and action a were assigned to topic k, and the respective marginals
n^(k)_r, n^(.)_r | Number of interactions of users in r assigned topic k; total interactions of all users in r
n^(l)_{r,r'}, n^(.)_{r,r'} | Number of l-labeled links from users in profile r to users in r'; total links between r and r'

Table 4: Gibbs-sampler count variables

5.1.1 Initialization: Randomized initialization of user partitions and the corresponding profile assignments could result in longer convergence times. We speed up convergence by exploiting the content tags and action distributions of users to generate a coherent initial seating arrangement: users with similar action and content-tag distributions are seated together to form homogeneous partitions.

5.1.2 Sampling User Partitions: The likelihood of generating interaction d = (a, W, t) ∈ D_u from behavior k ∈ K (Equation (5)) is given by,

p(a, W | k) ∝ (n^(a)_k + α_A) / (n^(.)_k + |A| α_A) × ∏_{w ∈ W} (n^(w)_k + α_V) / (n^(.)_k + |V| α_V)    (11)

Thus the likelihood of observing interaction d = (a, W, t) in profile r ∈ R (Equation (6)) is,

p(d | r) ∝ ∑_{k ∈ K} (n^(k)_r + α_K) / (n^(.)_r + |K| α_K) × p(a, W | k) × p(t | r, k)    (12)

where p(t | r, k) is computed as in eq. (7). The link likelihood for source profile p, target profile p' and label l is computed as,

ϕ^L_{p,p'}(l) = (n^(l)_{p,p'} + α_L) / (n^(.)_{p,p'} + |L| α_L)    (13)
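The collapsed count-ratio computations of eqs. (11)–(12) can be sketched with toy counts in the style of Table 4. The sizes and priors here are hypothetical, and we interpret the n^(.)_k marginal in each term as the corresponding per-topic action or word-token total:

```python
import numpy as np

# Hypothetical toy counts and symmetric priors.
K, V, A = 2, 4, 2
aV, aA, aK = 0.01, 25.0, 25.0
rng = np.random.default_rng(3)
n_wk = rng.integers(0, 20, size=(V, K))   # n^(w)_k: word w assigned to topic k
n_ak = rng.integers(0, 20, size=(A, K))   # n^(a)_k: action a assigned to topic k
n_kr = rng.integers(0, 20, size=K)        # n^(k)_r for one profile r
n_r = n_kr.sum()                          # n^(.)_r

def p_aw_given_k(a, W, k):
    """Eq. (11): collapsed likelihood from counts plus Dirichlet smoothing."""
    term_a = (n_ak[a, k] + aA) / (n_ak[:, k].sum() + A * aA)
    term_w = np.prod((n_wk[W, k] + aV) / (n_wk[:, k].sum() + V * aV))
    return term_a * term_w

def p_d_given_r(a, W, p_t):
    """Eq. (12): mixture over behaviors; p_t[k] stands in for the Beta term p(t | r, k)."""
    return sum((n_kr[k] + aK) / (n_r + K * aK) * p_aw_given_k(a, W, k) * p_t[k]
               for k in range(K))

lik = p_d_given_r(a=0, W=np.array([1, 2]), p_t=np.ones(K))
```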

Thus p(u | r), u ∈ U (eq. (10)) can be obtained as the product of eq. (12) over d ∈ D_u and eq. (13) over L_u. Given p(u | r), we can sample profile r_u for user u as in eq. (3),

P(r_u = r | u, a_{−u}, r_{−u}, k_{−u}) ∝ ( (N_r − A_r δ) / (N + γ) + (γ + A δ) / (|R|(N + γ)) ) × p(u | r)    (14)

where a_{−u}, r_{−u}, k_{−u} indicate the seating and profile assignments of all other users and the behavior assignments for their interactions, N_r is the number of users currently assigned profile r, and A_r is the number of tables serving r. Behavior assignments for each interaction d ∈ D_u are sampled in proportion to eq. (12) with r = r_u, the chosen profile for u. The user is then seated either at an existing table a ∈ [1, A] serving r_u, or at a new table A + 1 with r_{A+1} = r_u, where n_a is the number of users seated at table a:

Existing table a ∈ [1, A]:  p(a) ∝ (n_a − δ) / (N + γ) if r_a = r_u, else 0
New table A + 1:            p(A + 1) ∝ (γ + δA) / (N + γ) × 1/|R|, with r_{A+1} = r_u

Note that N = |U| − 1, i.e. all users except u.
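The table-assignment step can be sketched as follows, assuming hypothetical table occupancies n_a and served profiles r_a:

```python
import numpy as np

rng = np.random.default_rng(4)
delta, gamma, R = 0.5, 1.0, 3   # discount, concentration, number of profiles

def sample_table(n_a, r_a, ru):
    """Seat u at an existing table serving ru, or open a new table with r_{A+1} = ru."""
    A = len(n_a)
    N = sum(n_a)                                      # all users except u
    w = [(n - delta) / (N + gamma) if r == ru else 0.0
         for n, r in zip(n_a, r_a)]                   # existing tables
    w.append((gamma + delta * A) / (N + gamma) / R)   # new table
    w = np.array(w)
    return rng.choice(A + 1, p=w / w.sum())

# Three hypothetical tables with 4, 2 and 3 seated users, serving profiles 0, 1 and 0.
table = sample_table(n_a=[4, 2, 3], r_a=[0, 1, 0], ru=0)
```

Tables serving a different profile get zero weight, so u can only join a table serving r_u or open a fresh one.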

5.1.3 Parameter Estimation: All counts (Table 4) corresponding to the previous behavior and profile assignments of u are decremented and updated based on the newly drawn assignments. At the end of each sampling iteration, the Multinomial-Dirichlet priors α_V, α_A, α_K and α_L are updated by fixed-point iteration [28], and the profile parameters (α_rk, β_rk) are updated by the method of moments [45].
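The method-of-moments update for (α_rk, β_rk) matches the Beta mean and variance to the sample moments of the normalized timestamps assigned to behavior k in profile r; a minimal sketch on synthetic timestamps:

```python
import numpy as np

def beta_method_of_moments(t):
    """Estimate Beta(alpha, beta) from normalized timestamps t via sample mean/variance."""
    m, v = np.mean(t), np.var(t)
    common = m * (1 - m) / v - 1          # valid when v < m(1 - m)
    return m * common, (1 - m) * common

# Timestamps concentrated early in [0, 1] should recover alpha < beta.
rng = np.random.default_rng(5)
alpha_hat, beta_hat = beta_method_of_moments(rng.beta(2.0, 5.0, size=5000))
```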



All time-stamps are rounded to two decimal places, and the values of p(t | r, k) for all t ∈ [0, 1], r ∈ R, k ∈ K are cached at the end of each sampling iteration. This avoids R × K scaling for p(u | r) in Equation (14) by replacing computations with fetch operations. Pitman-Yor parameters can be estimated via auxiliary variable sampling, with hyperparameters set to the recommended values in [39, 42].
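The caching scheme can be sketched as a per-iteration lookup table over the 101 rounded timestamps; the Beta parameters below are hypothetical:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Hypothetical (alpha_rk, beta_rk) for R profiles x K behaviors, all set to (2, 3).
R, K = 3, 4
ab = np.ones((R, K, 2)) * [[2.0, 3.0]]

# Build the cache once per sampling iteration: Beta pdf on a 0.01 grid.
grid = np.round(np.linspace(0, 1, 101), 2)
cache = {(r, k): beta_dist.pdf(grid, *ab[r, k]) for r in range(R) for k in range(K)}

def p_t(t, r, k):
    """Replace a pdf evaluation with an O(1) fetch on the rounded timestamp."""
    return cache[(r, k)][int(round(t * 100))]
```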

5.2 Computational Complexity

In each Gibbs iteration, Equations (11) and (12) involve |D| × (K + R) computations. Equation (14) requires an additional |U| × R computations. The R × K scaling for p(u | r) in Equation (14) is avoided by restricting temporal precision as described in Section 5.1. The first product term of Equation (14) is cached for each r ∈ R and updated only when the tables of profile r are altered. On the whole, our algorithm is linear in |D| + |U|, scaled by R + K, in both time and space complexity (results in Figure 6). We efficiently scale to massive datasets by parallelizing our algorithm across users via batch-sampling, described in the next subsection.

5.3 Parallelizing Model Inference

The Gibbs sampler described above samples each user's seating P(r_u = r | u, a_{−u}, r_{−u}, k_{−u}) conditioned on all other users, which necessitates iterating over U. Instead, seating arrangements can be simultaneously sampled in batches U' ⊂ U conditioned on all users outside the batch, i.e. P(R_{U'} = R | U', a_{U−U'}, r_{U−U'}, k_{U−U'}).

Batch sampling is most efficient when each batch U' ⊂ U is chosen such that its users entail comparable computational loads. We approximate the computational load of user u as proportional to |D_u| + |L_u| to decide batch splits a priori for each sampling iteration.
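One simple way to realize such load-balanced splits is a greedy longest-processing-time heuristic; the user loads below are hypothetical:

```python
def balanced_batches(loads, n_batches):
    """Greedy longest-processing-time split: assign each user (heaviest first)
    to the currently lightest batch, approximating equal computational load."""
    batches = [[] for _ in range(n_batches)]
    totals = [0] * n_batches
    for u in sorted(loads, key=loads.get, reverse=True):
        i = totals.index(min(totals))
        batches[i].append(u)
        totals[i] += loads[u]
    return batches

# loads[u] approximates |D_u| + |L_u| for hypothetical users
loads = {"u1": 90, "u2": 10, "u3": 50, "u4": 45, "u5": 5}
batches = balanced_batches(loads, n_batches=2)  # two batches of total load 100 each
```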

All assignment counts can be updated at the end of the sampling process for one batch. Note that social links between users within a given batch U' are ignored, since their profiles are drawn simultaneously. However, in practice the batch size is small in comparison to |U|, rendering this loss negligible.

6 EXPERIMENTAL RESULTS

In this section, we present our experimental results on datasets from MOOCs as well as Stack-Exchange. We first introduce the set of competing baselines. Then, in Section 6.2, we discuss the prediction tasks used to evaluate the different behavioral representation models. In Section 6.3, we demonstrate the impact of data sparsity on prediction tasks, and in Section 6.4, we discuss the effects of behavior skew. Next, we present results on the parameter sensitivity and scalability of our model in Sections 6.5 and 6.6, and conclude with a discussion of the limitations of our approach.

6.1 Baseline Methods

We compare our model (CMAP) with three state-of-the-art behavior models and two standard baselines.

LadFG [30]: LadFG is a dynamic latent factor model which uses temporal interaction sequences and demographic information of participants to build latent representations. We provide LadFG action-content data from interactions and all available user demographic information.

BLDA [31]: BLDA is an LDA-based extension that captures actions in conjunction with text. It represents users as a mixture over latent content-action topics.

FEMA [13]: FEMA is a multifaceted sparsity-aware matrix factorization approach which employs regularizer matrices to tackle sparsity. The facets in our datasets are users, words and actions. We set the User-User and Word-Word regularizers to link counts and co-occurrence counts, respectively. We could not run FEMA on the Ask-Ubuntu and Comp Sci-1 datasets owing to very high memory and compute requirements: the regularizer matrices in FEMA scale as |U|², whereas our model scales as |U|.

DMM (text only) [49]: We apply DMM to the textual content of all interactions to learn topics. Users are represented by the probabilistic proportions of the learned topics in their interaction content.

Logistic Regression Classifier (LRC) [19]: LRC uses logistic regression to train a classifier model for prediction. Input features include the topics (obtained from DMM) that the user interacts with, and the respective action proportions for each topic.

We initialize the Dirichlet priors for ϕ^V_k, ϕ^A_k, ϕ^K_r and ϕ^L_{r,r'} following the common strategy [9, 48] (α_X = 50/|X|, X ∈ {A, L, K}, and α_V = 0.01), and all Beta parameters α_rk, β_rk are initialized to 1. We found CRP parameter initialization at δ = 0.5, γ = 1 to perform consistently well across datasets. All models were implemented in Python, and experiments were performed on an x64 machine with 2.4GHz Intel Xeon cores and 16 GB of memory.

6.2 Prediction Tasks

In this section, we identify three prediction tasks, discuss evaluation metrics, and compare results with the competing baseline methods. We focus on two platform-specific user characterization tasks and a common content-action prediction task.

Certificate Prediction: Students maintaining a minimum cumulative grade over course assignments are awarded certifications by Coursera. Connecting behavior to performance may help improve online educational experiences. We attempt to predict students receiving certificates based on their behavioral data in each MOOC.

User Reputation Prediction: For Stack-Exchanges, we predict whether participants have a high reputation. Participants receive a reputation score based on the quality of their participation. We define high reputation as the top quartile of all reputation scores on that Stack-Exchange.

Behavior Prediction: We predict the distribution of participant behavior across content and actions, in all Stack-Exchange and MOOC datasets. Specifically, for each dataset, we extract T = 20 topics from the text of all interactions using DMM [49], and assign each participant interaction to the most likely topic learned by DMM.

We use standard classifiers and evaluation metrics to evaluate all models. The prediction tasks use a linear-kernel Support Vector Machine (SVM) classifier (with default parameters) in sklearn (http://scikit-learn.org/), and we compute results for each dataset through 10-fold cross validation.
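The evaluation protocol can be sketched as follows; the features here are random stand-ins for the learned user representations, not actual model output:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: 200 users with 16-dim representations and binary
# labels (certificate / high-reputation) loosely driven by the first feature.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

clf = SVC(kernel="linear")                       # default-parameter linear-kernel SVM
scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
mean_auc = scores.mean()                         # AUC averaged over the 10 folds
```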



Method | Precision | Recall | F1-score | AUC
LRC | 0.76 ± 0.04 | 0.71 ± 0.05 | 0.74 ± 0.04 | 0.72 ± 0.03
DMM | 0.77 ± 0.03 | 0.74 ± 0.04 | 0.75 ± 0.03 | 0.74 ± 0.03
LadFG | 0.81 ± 0.02 | 0.78 ± 0.02 | 0.79 ± 0.02 | 0.79 ± 0.02
FEMA | 0.78 ± 0.03 | 0.75 ± 0.04 | 0.76 ± 0.03 | 0.78 ± 0.03
CMAP | 0.86 ± 0.02 | 0.81 ± 0.03 | 0.83 ± 0.02 | 0.84 ± 0.02
BLDA | 0.80 ± 0.04 | 0.75 ± 0.03 | 0.77 ± 0.03 | 0.77 ± 0.04

Table 5: Certification Prediction (µ ± σ across MOOCs). CMAP outperforms baselines by 6.26-15.97% AUC.

Method | Precision | Recall | F1-score | AUC
LRC | 0.73 ± 0.04 | 0.69 ± 0.04 | 0.72 ± 0.03 | 0.73 ± 0.03
DMM | 0.69 ± 0.05 | 0.65 ± 0.04 | 0.66 ± 0.04 | 0.70 ± 0.04
LadFG | 0.86 ± 0.03 | 0.75 ± 0.03 | 0.79 ± 0.02 | 0.80 ± 0.03
FEMA | 0.79 ± 0.04 | 0.73 ± 0.03 | 0.77 ± 0.03 | 0.79 ± 0.04
CMAP | 0.85 ± 0.02 | 0.83 ± 0.03 | 0.84 ± 0.02 | 0.86 ± 0.02
BLDA | 0.75 ± 0.04 | 0.71 ± 0.04 | 0.74 ± 0.03 | 0.74 ± 0.04

Table 6: Reputation Prediction (µ ± σ across Stack-Exchanges). CMAP outperforms baselines by 6.65-21.43% AUC.

LRC is not used in behavior prediction since it does not build a user representation. We evaluated the performance of all methods in the certificate and reputation prediction tasks via Precision, Recall, F1-Score and Area-Under-Curve (AUC). For the behavior prediction task, we measure the Root Mean Squared Error (RMSE) of the predicted user interaction proportions for (topic, action) pairs against the actual interaction proportions of users.
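The RMSE metric for behavior prediction can be sketched as follows; the proportion matrices below are hypothetical:

```python
import numpy as np

def behavior_rmse(pred, actual):
    """RMSE between predicted and actual interaction proportions over (topic, action) pairs."""
    return np.sqrt(np.mean((pred - actual) ** 2))

# Hypothetical proportions for one user over 3 topics x 2 actions (entries sum to 1 overall)
actual = np.array([[0.30, 0.10], [0.25, 0.15], [0.10, 0.10]])
pred = np.array([[0.28, 0.12], [0.24, 0.16], [0.12, 0.08]])
err = behavior_rmse(pred, actual)
```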

We show strong results across prediction tasks. In the certificate prediction task, our method improves on the baselines by 6.26-15.97% AUC, averaged across MOOCs (cf. Table 5). In the reputation prediction task over Stack-Exchanges, we improve on all baselines by 6.65-21.43% AUC (cf. Table 6). In the behavior prediction task, our method improves on competing baselines by 12%-25% RMSE on MOOCs and 9.5%-22% on Stack-Exchange datasets (cf. Table 7).

6.3 Effects of Data Sparsity

To study the gains of our algorithm in characterizing users with different levels of activity, we split the participants in each dataset into four equal partitions based on their number of interactions (Quartiles 1-4, 1 being least active). We then sample participants from each quartile and evaluate the prediction performance (AUC) of all methods.
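The quartile split can be sketched as follows; the interaction counts below are hypothetical:

```python
import numpy as np

# Split users into activity quartiles by interaction count (quartile 1 = least active).
counts = {f"u{i}": c for i, c in enumerate([1, 2, 3, 5, 8, 13, 21, 34])}
edges = np.percentile(list(counts.values()), [25, 50, 75])
quartile = {u: 1 + int(np.searchsorted(edges, c, side="left"))
            for u, c in counts.items()}
```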

Our model shows magnified gains (Figure 4) in prediction performance over baseline models in the first and second quartiles,

Method | DMM | LadFG | FEMA | CMAP | BLDA
MOOC | 4.9 ± 0.4 | 4.2 ± 0.3 | 4.1 ± 0.2 | 3.6 ± 0.2 | 4.4 ± 0.4
Stack-Ex | 8.6 ± 0.6 | 7.9 ± 0.4 | 7.5 ± 0.3 | 6.7 ± 0.4 | 7.4 ± 0.5

Table 7: Behavior Prediction (RMSE (×10⁻²), µ ± σ). CMAP outperforms baselines in MOOCs (12%-25%) and Stack-Exchanges (9.5%-22%).

which correspond to sparse or inactive users. BLDA performs the weakest in Quartile-1, since it relies on interaction activity to build user representations. Our model effectively bridges gaps in sparse or inconsistent participant data by exploiting similar active users within user partitions.

6.4 Effects of Behavior Skew

We study the effect of behavioral skew on the prediction results by subsampling, in half, users who predominantly perform the two most common activities in our two largest datasets, Ask-Ubuntu (Comments and Questions) and Comp Sci-1 (Play and Skip), while retaining all other users. This reduces the overall skew in the data. Baseline models are expected to perform better after de-skewing. All models degrade on Ask-Ubuntu owing to the significant content loss in the de-skewing process.

Method | Ask-Ubuntu Original | Ask-Ubuntu Deskewed | Comp Sci-1 (MOOC) Original | Comp Sci-1 (MOOC) Deskewed
LRC | 0.671 | 0.656 | 0.713 | 0.734
DMM | 0.647 | 0.611 | 0.684 | 0.672
LadFG | 0.734 | 0.718 | 0.806 | 0.830
BLDA | 0.706 | 0.683 | 0.739 | 0.788
CMAP | 0.823 | 0.746 | 0.851 | 0.839

Table 8: CMAP outperforms baselines (AUC) on the original and de-skewed datasets. The performance gap reduces with de-skewing.

We also investigate the performance gains achieved by our approach on our most skewed and sparse Stack-Exchange (Ask-Ubuntu) vs. the least skewed (Christianity; Table 2). On average, our model outperforms baselines by 13.3% AUC on Ask-Ubuntu vs. 10.1% on the Christianity Stack-Exchange in the Reputation Prediction task.

Method | DMM | LRC | LadFG | FEMA | CMAP | BLDA
Ask-Ubuntu | 0.647 | 0.671 | 0.734 | - | 0.823 | 0.706
Christianity | 0.684 | 0.720 | 0.842 | 0.818 | 0.856 | 0.791

Table 9: Performance gains in Reputation Prediction for the most skewed/sparse dataset (Ask-Ubuntu) vs. the least (Christianity)

[Figure 4: four panels (1st Quartile, least active; 2nd Quartile; 3rd Quartile; 4th Quartile, most active) plotting AUC against datasets 1-14 for CMAP, FEMA, LadFG and BLDA.]

Figure 4: Effects of activity sparsity on prediction tasks (AUC) for Stack-Exchanges (datasets 1-10) and MOOCs (datasets 11-14). CMAP has the greatest performance gains in Quartile-1 (sparse users); the performance gap reduces for very active users (Quartile-4).



6.5 Parameter Sensitivity Analysis

Our model is primarily impacted by three parameter values: the number of profiles R, the number of behaviors K, and the discount parameter δ. We find results to be stable over a broad range of parameter values, indicating that in practice our model requires minimal tuning (Figure 5). It is worth noting that R primarily impacts the granularity of the discovered activity profiles, while K impacts the resolution of content-action associations. Dirichlet parameters and other hyper-parameters have negligible impact on the profiles and behaviors learned. We set R = 20, δ = 0.5 and K = 50 for all datasets. Our inference algorithm converges within 1% AUC in fewer than 400 sampling iterations across all datasets.

Figure 5: Mean performance (AUC) and 95% confidence interval while varying model parameters one at a time: δ, R, K. Stability is observed over broad ranges of parameter values.

6.6 Scalability Analysis

We compared the runtimes and memory consumption of our serial and batch-sampling (with 8 cores) inference algorithms with the other models, for different volumes of interaction data obtained from random samples of the Ask-Ubuntu Stack-Exchange. BLDA is the fastest among the compared models; our 8x batch sampler is comparable to BLDA in runtime. FEMA was the least scalable, owing to the |U|² growth of its User-User regularizer matrix. Figure 6 shows the comparisons between the algorithms.

Figure 6: Effects of dataset size on algorithm runtime (×10⁴ sec) and memory consumed (GB) for CMAP, CMAP-P8x, LadFG, FEMA and BLDA. BLDA is the fastest among the compared models.

6.7 Limitations

We identify two limitations. First, we make no assumptions about the structure of knowledge (e.g. knowledge of "probability" is useful for understanding "statistical models"); incorporating knowledge structure, perhaps in the form of an appropriate prior, would help in better understanding participants with low activity. Second, we assume a bounded time range, so our model is inapplicable to streaming data.

7 RELATED WORK

We categorize research related to our problem into four groups: Contextual Text Mining, Behavior Modeling, Temporal Behavior Dynamics, and Platform-specific work.

Contextual Text Mining: Generative models which combine contextual information with text have found success in generating discriminative combined trends. Topics Over Time [45] is a latent generative model over the text and time-stamps of documents. Other temporal content models have also been proposed [27, 47]. Link-based models attempt to extract a static view of author communities and content [5, 23, 37]. Short-text approaches [33, 49] address content sparsity, absent the publishing and consuming behaviors of users. Skew-aware models have also been developed for morphological structure analysis [10], topic modeling [16, 20, 39], dependency parsing [44] and query expansion [15, 24]. BLDA [31] is most closely related to our work, since it attempts to integrate user actions with textual content, in the absence of a temporal factor.

Behavior Modeling: Matrix factorization has been a popular approach to study and predict human behavior. Past models have studied two-dimensional relations such as user-item affiliation [14, 51], and higher-dimensional data via tensor models in web search and recommender systems [11, 41]. These models extract static views of user behavior.

Temporal Behavior Dynamics: This line of work integrates the temporal evolution of users with behavior modeling approaches. Previous works exploit historical user behavior data to predict future activity in online media [7, 22], recommender applications [8], and academic communities [46]. [30] builds temporally-discretized latent representations of evolutionary learner behavior. These approaches do not explicitly address data sparsity at the user level. Jiang et al. [13] have proposed a sparsity-aware tensor factorization approach to study user evolution. Their model, however, faces scalability challenges on massive real-world datasets, and relies on external regularizer data.

Platform-specific work: Characterization of users via generated content and actions has been studied in both settings, MOOCs [6, 26, 34] and community Question-Answering [12, 25]. [2] develops an engagement taxonomy for learner behavior. [4, 36] integrate social data to study dropout in MOOCs. More recently, adversarial learning has been applied to address the long tail in Neural Collaborative Filtering [17]. In contrast to these approaches, our objective is to learn robust and generalizable representations to study participant behavior in diverse interactive social learning platforms.

8 CONCLUSION

In this paper, we address the challenge of characterizing user behavior on social learning platforms in the presence of data sparsity and behavior skew. We proposed a CRP-based Multi-facet Activity Profiling model (CMAP) to profile user activity with both content interactions and social ties. Our experimental results on diverse real-world datasets show that we outperform state-of-the-art baselines on prediction tasks. We see strong gains for participants with low activity, our algorithms scale well, and they perform well even on de-skewed data.



We identify three rewarding future directions: developing incremental models for streaming data; incorporating priors to structure knowledge; and allowing for infinite action spaces.

REFERENCES[1] David J Aldous, Illdar A Ibragimov, and Jean Jacod. 2006. Ecole d’Ete de Probabilites

de Saint-Flour XIII, 1983. Vol. 1117. Springer.[2] Ashton Anderson, Daniel Hu�enlocher, Jon Kleinberg, and Jure Leskovec. 2014.

Engaging with massive online courses. In Proceedings of the 23rd internationalconference on World wide web. ACM, 687–698.

[3] Albert-Laszlo Barabasi and Reka Albert. 1999. Emergence of scaling in randomnetworks. science 286, 5439 (1999), 509–512.

[4] Jaroslav Bayer, Hana Bydzovska, Jan Geryk, Tomas Obsivac, and LubomirPopelinsky. 2012. Predicting Drop-Out from Social Behaviour of Students. Inter-national Educational Data Mining Society (2012).

[5] Jonathan Chang and David M Blei. 2009. Relational topic models for documentnetworks. In International conference on arti�cial intelligence and statistics. 81–88.

[6] Carleton Co�rin, Linda Corrin, Paula de Barba, and Gregor Kennedy. 2014. Visual-izing pa�erns of student engagement and performance in MOOCs. In Proceedingsof the fourth international conference on learning analytics and knowledge. ACM,83–92.

[7] Peng Cui, Shifei Jin, Linyun Yu, Fei Wang, Wenwu Zhu, and Shiqiang Yang.2013. Cascading outbreak prediction in networks: a data-driven approach. InProceedings of the 19th ACM SIGKDD international conference on Knowledgediscovery and data mining. ACM, 901–909.

[8] Peng Cui, Fei Wang, Shaowei Liu, Mingdong Ou, Shiqiang Yang, and Lifeng Sun.2011. Who should share what?: item-level social in�uence prediction for usersand posts ranking. In Proceedings of the 34th international ACM SIGIR conferenceon Research and development in Information Retrieval. ACM, 185–194.

[9] Qiming Diao, Jing Jiang, Feida Zhu, and Ee-Peng Lim. 2012. Finding bursty topicsfrom microblogs. In Proceedings of the 50th Annual Meeting of the Association forComputational Linguistics: Long Papers-Volume 1. Association for ComputationalLinguistics, 536–544.

[10] Sharon Goldwater, Mark Johnson, and �omas L Gri�ths. 2006. Interpolatingbetween types and tokens by estimating power-law generators. In Advances inneural information processing systems. 459–466.

[11] Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu, and Can Zhu.2013. Personalized recommendation via cross-domain triadic factorization. InProceedings of the 22nd international conference on World Wide Web. ACM, 595–606.

[12] Zongcheng Ji, Fei Xu, Bin Wang, and Ben He. 2012. �estion-answer topic modelfor question retrieval in community question answering. In Proceedings of the21st ACM international conference on Information and knowledge management.ACM, 2471–2474.

[13] Meng Jiang, Peng Cui, Fei Wang, Xinran Xu, Wenwu Zhu, and Shiqiang Yang.2014. Fema: �exible evolutionary multi-faceted analysis for dynamic behavioralpa�ern discovery. In Proceedings of the 20th ACM SIGKDD international conferenceon Knowledge discovery and data mining. ACM, 1186–1195.

[14] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization tech-niques for recommender systems. Computer 42, 8 (2009).

[15] Adit Krishnan, P Deepak, Sayan Ranu, and Sameep Mehta. 2017. Leveragingsemantic resources in diversi�ed query expansion. World Wide Web (2017), 1–27.

[16] Adit Krishnan, Aravind Sankar, Shi Zhi, and Jiawei Han. 2017. UnsupervisedConcept Categorization and Extraction from Scienti�c Document Titles. arXivpreprint arXiv:1710.02271 (2017).

[17] Adit Krishnan, Ashish Sharma, Aravind Sankar, and Hari Sundaram. 2018. AnAdversarial Approach to Improve Long-Tail Performance in Neural CollaborativeFiltering. In Proceedings of the 27th ACM International Conference on Informationand Knowledge Management. ACM, 1491–1494.

[18] Adit Krishnan, Ashish Sharma, and Hari Sundaram. 2018. Insights from the Long-Tail: Learning Latent Representations of Online User Behavior in the Presenceof Skew and Sparsity. In Proceedings of the 27th ACM International Conference onInformation and Knowledge Management. ACM, 297–306.

[19] Jure Leskovec, Daniel Hu�enlocher, and Jon Kleinberg. 2010. Predicting pos-itive and negative links in online social networks. In Proceedings of the 19thinternational conference on World wide web. ACM, 641–650.

[20] Robert V Lindsey, William P Headden III, and Michael J Stipicevic. 2012. A phrase-discovering topic model using hierarchical pitman-yor processes. In Proceedingsof the 2012 Joint Conference on Empirical Methods in Natural Language Processingand Computational Natural Language Learning. Association for ComputationalLinguistics, 214–222.

[21] Jun S Liu. 1994. �e collapsed Gibbs sampler in Bayesian computations withapplications to a gene regulation problem. J. Amer. Statist. Assoc. 89, 427 (1994),958–966.

[22] Lu Liu, Jie Tang, Jiawei Han, Meng Jiang, and Shiqiang Yang. 2010. Miningtopic-level in�uence in heterogeneous networks. In Proceedings of the 19th ACM

international conference on Information and knowledge management. ACM, 199–208.

[23] Yan Liu, Alexandru Niculescu-Mizil, and Wojciech Gryc. 2009. Topic-link LDA:joint models of topic and author community. In proceedings of the 26th annualinternational conference on machine learning. ACM, 665–672.

[24] Hao Ma, Michael R Lyu, and Irwin King. 2010. Diversifying �ery SuggestionResults.. In AAAI, Vol. 10.

[25] Zongyang Ma, Aixin Sun, �an Yuan, and Gao Cong. 2015. A Tri-Role TopicModel for Domain-Speci�c �estion Answering.. In AAAI. 224–230.

[26] Jenny Mackness, Sui Mak, and Roy Williams. 2010. �e ideals and reality ofparticipating in a MOOC. In Proceedings of the 7th International Conference onNetworked Learning 2010. University of Lancaster.

[27] Yasuko Matsubara, Yasushi Sakurai, B Aditya Prakash, Lei Li, and ChristosFaloutsos. 2012. Rise and fall pa�erns of information di�usion: model andimplications. In Proceedings of the 18th ACM SIGKDD international conference onKnowledge discovery and data mining. ACM, 6–14.

[28] �omas Minka. 2000. Estimating a Dirichlet distribution. (2000).[29] Jim Pitman and Marc Yor. 1997. �e two-parameter Poisson-Dirichlet distribution

derived from a stable subordinator. �e Annals of Probability (1997), 855–900.[30] Jiezhong Qiu, Jie Tang, Tracy Xiao Liu, Jie Gong, Chenhui Zhang, Qian Zhang,

and Yufei Xue. 2016. Modeling and predicting learning behavior in MOOCs. InProceedings of the Ninth ACM International Conference on Web Search and DataMining. ACM, 93–102.

[31] Minghui Qiu, Feida Zhu, and Jing Jiang. 2013. It is not just what we say, but howwe say them: Lda-based behavior-topic model. In Proceedings of the 2013 SIAMInternational Conference on Data Mining. SIAM, 794–802.

[32] Qiang �, Cen Chen, Christian S Jensen, and Anders Skovsgaard. 2015. Space-Time Aware Behavioral Topic Modeling for Microblog Posts. IEEE Data Eng. Bull.38, 2 (2015), 58–67.

[33] Xiaojun �an, Chunyu Kit, Yong Ge, and Sinno Jialin Pan. 2015. Short andSparse Text Topic Modeling via Self-Aggregation.. In IJCAI. 2270–2276.

[34] Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daume III, and Lise Getoor. 2013.Modeling learner engagement in MOOCs using probabilistic so� logic. In NIPSWorkshop on Data Driven Education, Vol. 21. 62.

[35] Fatemeh Riahi, Zainab Zolaktaf, Mahdi Sha�ei, and Evangelos Milios. 2012.Finding expert users in community question answering. In Proceedings of the21st International Conference on World Wide Web. ACM, 791–798.

[36] Carolyn Penstein Rose, Ryan Carlson, Diyi Yang, Miaomiao Wen, Lauren Resnick,Pam Goldman, and Jennifer Sherer. 2014. Social factors that contribute to a�ri-tion in MOOCs. In Proceedings of the �rst ACM conference on Learning@ scaleconference. ACM, 197–198.

[37] Yiye Ruan, David Fuhry, and Srinivasan Parthasarathy. 2013. E�cient communitydetection in large networks using content and links. In Proceedings of the 22ndinternational conference on World Wide Web. ACM, 1089–1098.

[38] Hassan Saif, Yulan He, and Harith Alani. 2012. Alleviating data sparsity fortwi�er sentiment analysis. CEUR Workshop Proceedings (CEUR-WS. org).

[39] Issei Sato and Hiroshi Nakagawa. 2010. Topic models with power-law usingPitman-Yor process. In Proceedings of the 16th ACM SIGKDD international confer-ence on Knowledge discovery and data mining. ACM, 673–682.

[40] Tanmay Sinha, Patrick Jermann, Nan Li, and Pierre Dillenbourg. 2014. Your click decides your fate: Inferring information processing and attrition behavior from MOOC video clickstream interactions. arXiv preprint arXiv:1407.7131 (2014).

[41] Jian-Tao Sun, Hua-Jun Zeng, Huan Liu, Yuchang Lu, and Zheng Chen. 2005. CubeSVD: a novel approach to personalized web search. In Proceedings of the 14th international conference on World Wide Web. ACM, 382–390.

[42] Yee Whye Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 985–992.

[43] Yee Whye Teh. 2011. Dirichlet process. In Encyclopedia of machine learning. Springer, 280–287.

[44] Hanna Wallach, Charles Sutton, and Andrew McCallum. 2008. Bayesian modeling of dependency trees using hierarchical Pitman-Yor priors. In ICML Workshop on Prior Knowledge for Text and Language Processing. 15–20.

[45] Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 424–433.

[46] Xiaolong Wang, Chengxiang Zhai, and Dan Roth. 2013. Understanding evolution of research themes: a probabilistic generative model for citations. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1115–1123.

[47] Jaewon Yang and Jure Leskovec. 2011. Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 177–186.

[48] Hongzhi Yin, Yizhou Sun, Bin Cui, Zhiting Hu, and Ling Chen. 2013. LCARS: a location-content-aware recommender system. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 221–229.

[49] Jianhua Yin and Jianyong Wang. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 233–242.

[50] Zhe Zhao, Zhiyuan Cheng, Lichan Hong, and Ed H Chi. 2015. Improving user topic interest profiles by behavior factorization. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1406–1416.

[51] Xiaodong Zheng, Hao Ding, Hiroshi Mamitsuka, and Shanfeng Zhu. 2013. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1025–1033.
