On MySpace Account Spans and Double Pareto-Like ...ribeiro/pdf/ribeiro_MySpaceTechRep2010.pdfOn...

On MySpace Account Spans and DoublePareto-Like Distribution of Friends

UMass Technical Report UM-CS-2010-001

Bruno Ribeiro1, William Gauvin2, Benyuan Liu2, and Don Towsley1

1Computer Science Department 2Computer Science DepartmentUniversity of Massachusetts Amherst University of Massachusetts Lowell

Amherst, MA, 01003 Lowell, MA 01854{ribeiro, towsley}@cs.umass.edu {wgauvin,bliu}@cs.uml.edu

Abstract—In this work we study the activity span of MySpaceaccounts and its connection to the distribution of the number offriends. The activity span is the time elapsed since the creationof the account until the user’s last login time. We observeexponentially distributed activity spans. We also observer thatthe distribution of the number of friends over accounts withthe same activity span is well approximated by a lognormalwith a fairly light tail. These two findings shed light into thepuzzling (yet unexplained) shape of the distribution of friendsin MySpace when plotted in log-log scale: Two straight lineswith different parameters joined by an inflection point (knee).We argue that the inflection point is more likely related to theinflection point of Reed’s (Double Pareto) Geometric BrownianMotion with Exponential Stopping Times model than to theDunbar number hypothesis of online social networks, whichargues, without proof, that the inflection point is due to theDunbar number (a theoretical limit on the number of people thata human brain can sustain active social contact with). While weanswer many questions, we leave many others open.

I. I NTRODUCTION

MySpace is one of the largest on-line social networks todate with approximately 200 million accounts (users) geo-graphically distributed around the globe. In this work wecollect 400,000 randomly sampled MySpace accounts. Anunbiased estimate of the distribution of the number of friendsof MySpace users can be seen in log-log scale in Figure 1. Theshape of the distribution seen in Figure 1 agrees with previousunbiased estimates for MySpace [4]. Our findings in this workshed light into the puzzling (yet unexplained) shape of thedistribution of friends seen in Figure 1: two Pareto-shapeddistributions with different parameters joined by an inflectionpoint (knee). The choice of MySpace for our study comes fromtwo valuable records available in most of MySpace accounts:the date in which the account was created and the user’slast login date. By randomly sampling MySpace accounts weobserve that: (1) Using this data we find activity spans to beexponentially distributed. This phenomenon may explain muchof the shape of the friends distribution shown in Figure 1.The activity span is the time elapsed since the creation of the

This work was supported by ARO MURI W911NF-08-1-0233.

10-5

10-4

10-3

10-2

10-1

1

1 10 102 103 104 105 106

CC

DF

Number of friends

Fig. 1. Empirical Complementary cumulative distribution of thenumber of friends in MySpace.

account until the user’s last login time. (2) Inspired by previousworks on the Double Pareto distribution [10], [11] we alsoobserve that the distribution of the number of friend amongaccounts with roughly the same activity span can be welldescribed by a lognormal distribution whose average numberof friends grows according to the square root of the account’sspan. Our findings provide a sound alternative hypothesis forthe emergence of a “knee” in the distribution of friends in on-line social networks: The mixture of multiple lognormal dis-tributions with exponentially distributed averages. By notingthat no such “knee” exists in the distribution of friend overaccounts with roughly the same activity span, we challengean alternative hypothesis (formulated in [1] for the CyWorldon-line social network) in which the “knee” is seen as aconsequence of the Dunbar’s number, a theoretical cognitivelimit of the number of people (about147.8) with whomone can maintain stable social relationships [5]. Our work isexploratory in nature. We answer a number of questions, butalso leave a number of other interesting questions open.

This work is organized as follows. In Section II we reviewthe data collected from MySpace. In Section III we analyze thedata collected from MySpace. Section IV connects the DoublePareto distribution with the analysis provided in Section III,

2

showing possible connections between the number of friendsin a MySpace account and Geometric Brownian Motion.

II. M EASUREMENT METHODOLOGY

Unfortunately, studying such a large and active social net-work has its drawbacks. The massive number of user accountscombined with MySpace’s stringent rules on crawling itsnetwork forces researchers to rely on statistics from incom-plete datasets. We collect data from MySpace by samplingaccount profiles uniformly at random. An entry in our datasetis comprised of user ID, IDs of all user friends, the date inwhich the account was created, and the user’s last login date.Our data was collected in two phases: During the first phase,denoted “fast probing”, we obtained a (time) snapshot of theMySpace graph. In 4 days we randomly sampled 1 millionIDs where 70,000+ correspond tovalid public accounts. Thismeasurement had to be shut down due to complaints fromMySpace. During the second phase, denoted “slow probing”,we obtained 312,713valid public MySpace accounts over aperiod of 7 months.

The data collected during the “fast probing” phase is usedfor our snapshot-sensitive analysis, e.g. the activity spandistribution. As we are not too interested in the tail of thesedistributions, we believe that 70,000+ samples suffice to obtaingood estimates. The data collected during the “slow probing”phase is used to obtain the distribution of friends of accountswith the same activity span. The results obtained from thefast probing phase is also used to double check the resultsobtained in the slow probing phase. In this preliminary workwe hypothesize that private profiles (profiles from which wecannot obtain friends information) do not affect our results. Weleave as future work the task to collect data that can verifythis hypotheses.

One of the challenges of this work is to perform statisticalanalysis using relatively few samples. The quality of ourconclusions depends directly on the quality of our estimates.In our experiments we sample nearly0.25% of all validaccounts. Appendix A analyzes the impact of the incompletedata over our estimates. In what follows we describe thestatistics obtained using this data.

III. D ATA ANALYSIS

In this section we focus on the impact of activity spans onthe distribution of friends. In what follows we look at accountactivity span and friends distribution. While the friends dis-tribution of social networks has been extensively studied inthe literature, including a MySpace study [4] (that, like ourwork, presents an unbiased estimate of the distribution of thenumber of friends), we show crucial statistical propertiesthathave escaped the attention of previous works.

A. Activity span distribution

In this section we analyze three statistics collected in ourexperiment:

• Activity span: Time between the creation of an accountand the last time the user logged in.

• Age: Time between the creation of an account and whenit is probed(recorded in our trace). Age is also studied in [15].

• Inter-login time: Time between two consecutive loginsinto the same account.

Figure 2 shows the complementary cumulative distributionfunction (CCDF) of MySpace activity spans where they-axisis shown in log scale. We see that the majority of accounts inMySpace are active for a very short period of time. The CCDFof activity spans divide into two parts. The first part withactivity spans

3

be the reason behind the two distinctive modes in the accountactivity span distribution.

0 4 8 13 19 25 31 37 43 49 55 61 67

Age (in months)

Frac

tion

of a

ccou

nts

0.00

00.

005

0.01

00.

015

0.02

00.

025

0.03

0

Fig. 3. Fraction of MySpace accounts with age = ( - ). After an exponential growth from 2003(MySpace’s launch) to 2005, the number of new accounts transitionsto linear growth.

We believe that account activity spans are one of the mostimportant statistics that one can obtain from an OSN suchas MySpace. We also argue that account ages are not asrelevant. This is because friends are not automatically addedinto MySpace accounts. Users must log into their accounts inorder to add or accept friends. Therefore, an account createdand later abandoned cannot play a significant role in the friendsdistribution after it is abandoned. Also note that MySpace doesnot delete accounts due to inactivity.

The activity span distribution brings us to another question:How frequently do users log into their accounts? Note thatone could generate the same activity span statistics if userslogged in just once. In order to answer this question we need toestimate the time between two consecutive logins into the sameaccount (inter-login time). Assuming that our probes arriveat points in time that are distributed uniformly at random,we can estimate the inter-login time distribution using theaccount’s last login time and the time of the probing. It isclear that we are more likely to probe long inter-login timesthan short ones. This sampling phenomenon is known as theinspection paradox [12]. Appendix B presents a maximumlikelihood estimator that is used to obtain the CCDF of inter-login times (Figure 4). In order to speedup calculation of ourestimates we assume that there are no inter-login times greaterthan 3 years. We believe this assumption to be reasonableas MySpace had existed for only 6 years at the time ofour measurements and we observed that fewer than20% ofMySpace accounts in our trace were created more than 3 yearsfrom the time of our measurements. Unfortunately, we canonly rely on our estimates as we do not have access to theground truth. However, the results shown in Figure 4 seem toagree with our intuition. First, we observe that most accountsare logged in quite frequently. This is a sign that users log

0 200 400 600 800 1000

0.0

0.2

0.4

0.6

0.8

1.0

Time (in days)

CCDF

Estimated inter-arrival times

Empirical -

Fig. 4. CCDF of estimated inter-login times. Note that a heavy tail isexpected as many users abandon their MySpace accounts. The sharpdrop at the end of the tail is due to an artificial constraint that thereare no inter-login times greater than 3 years.

in quite often during the account activity span. Also, mostaccounts that are not active during the span of one year arenot likely to be active in less than three years. This is expectedas accounts inactive for more than one year are likely to havebeen abandoned. Figure 4 also shows the distribution obtainedfrom the difference between the time of the probing and theaccount’s last login time, which is the input data used in ourestimator.

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

1 10 100 1000 10000 100000

CC

DF

(con

ditio

ned

on th

e ac

tivity

spa

n)

Number of friends

Less than 1 month of activity1 month of activity

2 months of activity10 months of activity20 months of activity30 months of activity

Fig. 5. Log-log plot of the CCDF of friends for accounts with activityspans of{

4

-4 -2 0 2 4

-3-2

-10

12

3

(a) "Normalized" sample quantiles

Theore

tical quantile

s

-4 -2 0 2 4

-3-2

-10

12

3

(b) "Normalized" sample quantiles

Theore

tical quantile

s

Fig. 6. QQ-plots of the distributions of friends within accounts withthe same number of months of activity span. These graphs plot 63curves that correspond to the distributions of 3 months of activityuntil up to 65. The theoretical quantiles are given by the t-Studentdistribution (tests if the samples come from a standard Normal).

{1, 2} seem to have an intermediate shape between the shapeof

5

0

2

4

6

8

10

12

(

6

looked at the last login times. Our work, on the other hand,uses another metric (the activity span) to understand thedistribution of the number of friends.

VI. SUMMARY & CONCLUSIONS

In this work we studied the activity span of MySpaceaccounts and its connection to the distribution of the number offriends. We observed exponentially distributed activity spansand that the distribution of friends over accounts with thesame activity spans can be well approximated by a lognormalwith a fairly light tail. These new findings shed light into thepuzzling (yet unexplained) shape of the distribution of friendsin MySpace when plotted in log-log scale: Two Pareto-shapeddistributions with different parameters joined by an inflectionpoint (knee). We argued that the inflection point is morelikely related to the inflection point of Reed’s (Double Pareto)Geometric Brownian Motion with Exponential Stopping Timesmodel than to the Dunbar number hypothesis of online socialnetworks in [1]. While our work answers many question, weleave many others open, such as reason behind the puzzlingconstant standard deviation of the friends distribution condi-tioned on an activity span. Another related open question isifReed’s Geometric Brownian Motion with exponential stoppingtimes model can be changed to accommodate lognormaldistributions with parameters(µ

√T , σ).

APPENDIX

APPENDIX ATHE IMPACT OF SAMPLING ON OUR ESTIMATES

This section is dedicated to explain the methodology usedto substantiated our claims and describe the implications ofworking with incomplete (sampled) data. The following expo-sition is quite straightforward but needed to ensure us thatourconclusions are sound. Fitting distributions to sampled data issomewhat of a controversial topic [6]. In the complex networksliterature heavy-tailed distributions are often found in observed(incomplete) data: links in Web pages [2], [7], file sizes [10],among many others. This comes as no surprise as, according tothe theory of stable laws, heavy-tailed distributions are easilygenerated from a number of stochastic processes.

A. Truncated tail

The study of heavy-tailed distributions requires a briefwarning about the tail of the distribution. In most, if not all,scenarios these tails are truncated. A good example is thedistribution of the energy of earthquakes. While, from thesampled data already collected, such distribution appearsto beheavy-tailed, it is clear that the tail of the distribution is nottruly “heavy” as there is a limit to the amount of energy thatcan be released from the Earth’s interior [9]. Our application isno exception and has an obvious truncation point (the numberof users in MySpace). In what follows we refer to the “tail”of our distributions as all points that are “far from zero”but smaller than the truncation point. While there is greatinaccuracy in measuring the tail [6], and our measurements

are no exception, there is still much that can be said about thetail. In what follows we show how this is possible.

In what follows we provide a detailed analysis of theestimation error and the maximum likelihood estimate of ourdata.

B. Estimation error

Let θi be the fraction of MySpace accounts withi friendsandθ = {θi|i = 1, . . . }. Let

Θd =∞∑

i=d+1

θi,

be the fraction of accounts with more thand friends. LetY = {Yi}Ni=1 be the (incomplete) raw data obtained fromN sampled MySpace accounts. We define thesampled dis-tribution to be the distribution of friends obtained from theincomplete datasetY. Let Td(Y) be an unbiased estimate ofΘd, i.e., E[Td(Y)] = Θd. Also let

T ⋆d = argminTd

E[(Θd − Td(Y))2],

i.e., estimatorT ⋆ has the smallest mean squared error amongall unbiased estimators (we assume Hajék regularity [8]). Inwhat follows we answer the following questions:

(1) How accurate canT ⋆ be?(2) What is the most likely shape of the original distribu-

tion?For now we assume that the only information aboutθ con-tained in the accounts is the number of friends. If accountsdisplay only the number of friends (not their IDs) and aswe sample accounts independently at random, sampling ac-counts is equivalent to sampling degrees directly fromθ. Let#(Y == d) denote the number of sampled accounts withdfriends. We have

P [#(Y == d) = k] = θkd(1 − θd)N−k.From the above it is easy to see that the following inequalityholds

E[

(Θd − T (Y))2]

≥ Θd(1 − Θd)N

. (3)

The above inequality is a straightforward application of theCraḿer-Rao inequality [3]. AndT ⋆ obtains the sampled dis-tribution making the bound in eq. (3) tight. The above analysisis quite trivial but it has a remarkable impact on our abilitytodraw conclusions from our sampled data. In order to exemplifythe implications of equation (3) over the accuracy of ourestimates, we assume, for the sake of argument, thatΘd = d−µ

with µ ≥ 1 andd = 1, . . . , i.e., distributionθ is Pareto withscale=1 and shapeµ ≥ 1. The empirical distribution of thenumber of friends in our MySpace traces has shape parameterµ̂ = 1.47 at the tail. In order to simplify our analysis we useanother metric of accuracy, the normalized root mean squarederror (or NRMSE):

NRMSE(T ) =

√

E[

(Θd − T (Y))2]

Θd≥

√

dµ − 1N

,

7

recall thatN is the number of sampled MySpace accounts. Theinequality in the equation above comes from equation (3). TheNRMSE is a metric that gives the average error of estimatorTas a fraction of the quantity being estimated. WithN = 4×105(400,000 sampled accounts) we have

NRMSE(T ) ≥√

(dµ − 1)/(4 × 105).

Thus, any unbiased estimate ofΘ10,000 has an averageNRMSE of at least1.38 ·Θ10,000. This reasonably large erroris one of the reasons why fitting a distribution to the tail ofa sampled distribution is a controversial topic. Please referto Gong et al. [6] for an interesting look at the difficulty inestimating the tail of Web file size distributions. An interestingquestion for future work is whether the poor accuracy ofT ⋆

implies that we cannot be confident about the shape of theoriginal distribution. In what follows we estimate the likelyshape of the distribution.

C. Maximum likelihood estimation

Here we find the most likely non-parametric distribution thatgenerated the sampled data. We wish to findθ̂ that maximizesthe probability thatY = y, i.e., we wish to find the maximumlikelihood estimate

θ̂ = argmaxθ

P [Y = y |θ].

The maximum likelihood estimate of the above Binomialrandom variable iŝθd = 1/#(y == d), i.e., θ̂ is the sampleddistribution ofy. Note that#(y == d) denotes the number ofsampled accounts withd friends. Figure 1 shows the empiricaldistribution of friends in our MySpace traces. The rather trivialconclusion is that the empirical curve seen in Figure 1 is themost likely distribution of friends in MySpace.

D. Changes to the original (incomplete) data

Zero friends: When someone joins MySpace they have thecreator of MySpace “Tom” as their friend. Thus, there are veryfew accounts with zero friends. For the sake of simplicity weignore accounts with zero friends.

APPENDIX BUSER INTER-LOGIN TIME DISTRIBUTION

Let Y be the time (in days) between when an account isprobed and the last time it was logged in. If the difference intime is less than 24 hours thenY = 1, if the difference in timeis between 24 and 48 hours thenY = 2, and so forth. Assumethat, collectively, users login an infinite number of times.LetX be the time (in days) between two consecutive logins of anuser. In what follows we assume that accounts do not go stale(in reality many users abandon their accounts).

If we assume that the time we sample the account isdistributed uniform at random, the probability of landing onan interarrival time of sizex is

P [Y = i|X = j] ={

0 if j < i1/i otherwise

. (4)

The probability that we will sample an intervalX of sizej is

jP [X = j]∑∞

k=1 kP [X = k]. (5)

Putting equations (4) and (5) together we have

P [Y = i] =

∞∑

j=i

1

j

jP [X = j]∑∞

k=1 kP [X = k]=

P [X ≥ i]E[X]

Thus we can recursively calculateP [X ≥ i] from:

E[X] =1

P [Y = 1], and ,

P [X ≥ i] = E[X]P [Y = i].As we only have an estimate ofP [Y = i] and not its true

value, the above estimate is subject to sampling noise. Indeed,using the above estimator in our dataset we obtain a number ofnegativeP [X = j] values. In order to obtain better estimates,we use the maximum log-likelihood estimator

argmax{P [X=j]}

∑

∀iyi

1 − ∑i−1j=1 P [X = j]∑∞

k=1 kP [X = k],

whereyi is the number of samples ofY with valuei. We alsoenforce the constraints0 ≤ P [X = j] ≤ 1, j = 1, 2, . . . and∑

∀j P [X = j] = 1.

REFERENCES

[1] Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Sue Moon, andHawoong Jeong. Analysis of topological characteristics ofhuge onlinesocial networking services. InWWW ’07: Proceedings of the 16thinternational conference on World Wide Web, pages 835–844, New York,NY, USA, 2007. ACM.

[2] Albert-Lászĺo Barab́asi and Ŕeka Albert. Emergence of scaling inrandom networks.Science, 286(5439):509–512, October 1999.

[3] George Casella and Roger Berger.Statistical Inference. DuxburyResource Center, June 2001.

[4] James Caverlee and Steve Webb. A large-scale study of MySpace: Ob-servations and implications for online social networks. InProceedingsfrom the 2nd International Conference on Weblogs and Social Media(AAAI), 2008.

[5] R. Dunbar. Coevolution of neocortex size, group size andlanguage inhumans.Behavioral and Brain Sciences, 16(4):681–735, 1993.

[6] Weibo Gong, Yong Liu, Vishal Misra, and Don Towsley. On the tails ofWeb file size distributions. InProceedings of 39-th Allerton Conferenceon Communication, Control, and Computing, Oct. 2001.

[7] Bernardo Huberman and Lada Adamic. Growth dynamics of the World-Wide Web. Nature, pages 130–130, 1999.

[8] I.A. Ibragimov and R.Z. Khasminskii.Statistical Estimation. AsymptoticTheory. Springer, New York, 1981.

[9] L. Knopoff and Y. Kagan. Analysis of the theory of extremesas appliedto earthquake problems.Journal of Geophysical Research, 82:5647–5657.

[10] Michael Mitzenmacher. Dynamic models for file sizes and double paretodistributions.Internet Mathematics, 1(3), 2003.

[11] William J. Reed. The pareto, zipf and other power laws.EconomicsLetters, 74(1):15–19, December 2001.

[12] Sheldon M. Ross. The inspection paradox.Probability in the Engineer-ing and Informational Sciences, 17(01):47–51, 2003.

[13] Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot,Christos Faloutsos, and Jure Leskove. Mobile call graphs: beyondpower-law and lognormal distributions. InKDD ’08: Proceeding of the14th ACM SIGKDD international conference on Knowledge discoveryand data mining, pages 596–604, New York, NY, USA, 2008. ACM.

[14] Didier Sornette. Critical Phenomena in Natural Sciences: Chaos,Fractals, Selforganization and Disorder: Concepts and Tools. Springer,April 2006.

8

[15] Mojtaba Torkjazi, Reza Rejaie, and Walter Willinger. Hot today,gone tomorrow: on the migration of MySpace users. InWOSN ’09:

Proceedings of the 2nd ACM workshop on Online social networks, pages43–48, New York, NY, USA, 2009. ACM.

On MySpace Account Spans and Double Pareto-Like ...ribeiro/pdf/ribeiro_MySpaceTechRep2010.pdfOn...

Documents

Transcript of On MySpace Account Spans and Double Pareto-Like ...ribeiro/pdf/ribeiro_MySpaceTechRep2010.pdfOn...