[IEEE 2012 International Conference on Future Generation Communication Technology (FGCT) - London, United Kingdom (2012.12.12-2012.12.14)]


Not Every Friend on a Social Network Can be Trusted: Classifying Imposters Using Decision Trees

Simon Fong, Yan Zhuang, Jiaying He

Department of Computer and Information Science

University of Macau

Macau SAR

{ccfong, syz}@umac.mo

Abstract—Alarming news recently revealed in the media indicates that 8.7 percent of Facebook users are fake, which amounts to more than 83 million accounts worldwide. This huge number of fake users with unverified profiles translates into potential dangers ranging from espionage, identity theft and information misuse to privacy compromises for users and their families. With the popularity of online social networks (OSNs), it is now easy to footprint a potential target from information trawled from the Web. Anyone can simply pose as somebody else without the claimed information ever being checked for genuineness; for example, it is easy to impersonate another person's identity on an OSN by supplying fake photos and false names, which go unchecked by Facebook unless a complaint is raised. In this paper, a preliminary experiment on applying decision tree classification algorithms to identify imposters among a pool of Facebook "friends" is presented. The classification approach is similar to that of separating spam from legitimate emails, except that the attributes of a user's account are taken into consideration instead of text-mining the message contents. An accuracy of 92.1% is demonstrated to be achievable with these classification techniques.

Keywords—Fake users; classification algorithms; social network computing.

I. INTRODUCTION

Online social networks (OSNs) such as Facebook allow users to build an online presence and, thereby, a virtual community in which users interact, share and post personal information. OSNs have already woven themselves into the fabric of our online lifestyle, and this trend is expected to keep growing in the years ahead. At the same time there is a growing concern about fake users, who can easily pass themselves off as somebody else by using photos and profiles either snatched from a real person (without his or her knowledge) or generated artificially. In this paper, these fake accounts are generally called imposters: accounts that claim to be somebody they are not. Usually the identities of real people are used; profile information such as photos, addresses and affiliations can easily be obtained online from websites and search engines.

The motives of these imposters can be anything but benevolent. An account may be created purely for fun or as a prank, but often the ultimate purpose behind a bogus account is malicious. Imposters are keen to phish individual naive users into phony relationships that lead to sex scams or human trafficking; to the mass of users they float invitations to online gambling or pornographic websites, or simply promote the marketing of some products.

By posing as a socially trusted figure, for example, an imposter can lure other OSN users into the scams mentioned above, because those users place their trust in the reputable figure the imposter is pretending to be. The damage to the reputable figure is grave too, as scam messages or mischievous invitations appear to originate from him, infuriating a broad outreach of OSN users.

Fake accounts that breed scammers and imposters constitute a serious security problem. A press report 1 published by CNET on August 1, 2012 indicated that 8.7% of Facebook users are fake, a percentage that corresponds to an estimated 83.09 million accounts. Facebook spokesman Frederic Wolens assured that the company has a policy for removing accounts that are signed up with fake names or names impersonated from others. Quoted from his statement to the press:

“When a person reports an account for this reason, we run an automated system against the reported account,” he explained. “If the system determines that the account is suspicious, we show a notice to the account owner the next time he or she logs in warning the person that impersonating someone is a violation of Facebook's policies and may even be a violation of local law. This notice also asks the person to confirm his or her identity as the true account owner within a specified period of time through one of several methods, including registering and confirming a mobile phone number. If the person can't do this or doesn't respond, the account is automatically disabled.”

This security measure, however, relies upon detection and complaints made by users who are being impersonated, which implies that it is often too late and the damage has already been inflicted. Fundamentally the problem lies in the lack of a preventive measure and in the fact that the information used to register a social network account is not verified at all. It would be useful to have an early alarm that alerts us to whether a user who tries to add us as a friend on an OSN is suspicious. In other words, how do we know whether somebody on an OSN who tries to befriend us possesses his or her real identity, or one that is forged from somebody else?

To this end, researchers have attempted to build automated detection tools for finding fake accounts, a task that would otherwise be labor-intensive and costly if done manually.

1 http://news.cnet.com/8301-1023_3-57484991-93/facebook-8.7-percent-are-fake-users/


A piece of software that can screen the characteristics of an OSN user and raise an alarm when the information fits a predefined model of imposters would be a very useful front-line guard. The contribution of this article is a data mining classification model designed specifically to distinguish suspected imposters from the mass of users. The paper is organized as follows. Section II briefly reviews similar detection tools for sorting out malicious online users. Our proposed model is introduced in detail in Section III. An experiment using empirical data from a Facebook account, conducted to illustrate the effectiveness of the classification model, is described in Section IV. The last section concludes the work.

II. RELATED WORK

Using classification models to identify malicious content online has a long history. A classifier model is constructed from a collection of training data with labels indicating whether each sample is malicious or normal. The trained model can then be used to recognize which group a new piece of testing data belongs to. Following this design principle, a number of applications have been developed for similar security problems. Examples range from using decision trees to distinguish different groups of Twitter messages [1], to classifying unsolicited bulk online messages (spam) [2], to detecting different emotions in online articles [3].

These applications, however, work by examining the contents of the messages, i.e. text mining, instead of the attributes of the users and the patterns of their activities. Suspected online imposters, who tend to be stealthy and syndicate-like, may not generate a volume of posts sufficient for text mining their contents. Building on the recognition power of decision trees, the same capability should be applied to the scrutiny of a user's profile.

Earlier, we proposed an estimate of trust level called Trust-rank [4] for computationally inferring how much an OSN user can be trusted, given his social distance from you as well as from your peers and other authority figures. From the social distance, quantitatively represented by the Trust-rank, an OSN user can tell whether another user is relatively safe (if his Trust-rank is high) or not. Another tool called SybilRank was developed recently [5] for ranking OSN users by the perceived likelihood of being fake, using the social graph. Many other researchers adopted similar social graph properties, guessing that a user is fake if he has disproportionately few connections to non-fake users. However, this assumption is far from reliable, as shown in the latter section of this paper. We observe the opposite: fake users can have a large number of connections to non-fake users given the very deceiving "coat" that they put on, for example a fabricated profile of an innocent, next-door-looking girl. Many male users then want to connect to her as friends, like bees swarming to honey.

There are alternative approaches for detecting fake accounts without the hassle of building a massive social graph. Some utilize machine learning methods to learn the features of fake users. For instance, Zhi Yang et al. applied a support vector machine (SVM) classifier [6] to detect suspicious accounts by comparing features such as the frequency of friend requests, other types of requests, and other activities. It is assumed that one of the common objectives of fake accounts is to expand their social networks by aggressively sending many unsolicited requests to friends of friends. The authors concluded that the SVM is computationally efficient and that its performance is on par with human experts, with a very low false alarm rate of about 1%. A similar machine learning approach was attempted by Stringhini et al. [7] using Bayesian filters to detect spammers on social networks; the spammers' online mischievous behaviors were collectively surmised and a representative model was established.

III. OUR PROPOSED METHOD

The abovementioned machine learning techniques, however, operate as a black box, even though they can attain very high accuracy. Another drawback is that, on some social networks (at least on Facebook), friend request frequency is difficult to uncover: there is simply no information available about how many friend requests an OSN user has sent out, as a matter of privacy. Lacking this vital information, even accurate methods are of little use. Our method is different in that only profile information is used as features (instead of friend requests). It capitalizes on decision trees, which are known to be human interpretable; a decision tree can be decomposed into rules of the form IF-THEN-ELSE that can readily be programmed into automated software. Moreover, building a decision tree is simpler than building a social graph. Unlike a social graph, which depicts the whole picture of the social network and the inter-relations of every individual user within it, a decision tree only contains the significant paths, in terms of conditional checks, that lead to a conclusion. Decision trees are more scalable and computationally efficient, because only the abstract and effective representation is retained after training over the data.
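To illustrate the kind of human-readable output a decision tree yields, the following is a minimal sketch, in Java, of how a small fragment of a trained tree could be transcribed into IF-THEN-ELSE rules. The attribute names and threshold values here are hypothetical placeholders for illustration, not the actual rules induced from our data.

// Hypothetical rule fragment transcribed from a trained decision tree.
// Attribute names and thresholds are illustrative only.
public final class ImposterRules {
    public static String classify(boolean avatarFoundOnline,
                                   double friendIncreaseRatePerDay,
                                   double profileCompleteness) {
        if (avatarFoundOnline) {
            // The avatar photo also appears elsewhere on the Web.
            if (friendIncreaseRatePerDay > 30.0) {
                return "fake";
            }
            return profileCompleteness < 0.3 ? "fake" : "real";
        }
        // The avatar appears to be original.
        return "real";
    }
}

Each root-to-leaf path of the tree becomes one such nested conditional, which is what makes the model directly programmable into an automated screening tool.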

A. Features Modeling

The first step in constructing a decision tree, according to the classical knowledge discovery methodology, is to pre-process and prepare the training data. Identifying suitable features for the training data is a central part of this step, such that they can sufficiently characterize a general model from the available data. For our domain of fake OSN user detection, the following list of attributes is considered. These are basic demographic information and simple counts of activities that are usually publicly available on an OSN.

Attribute #1: Age. It is noticed that the age specified in the online profile of a fake user tends to fall within a certain age group, 17-25, consistent with posing as a young lady.

Attribute #2: Gender. This is a strong indicator, as we found that most imposters hijack appealing photos of young ladies. They disguise themselves as seductive females in order to hook male users.

Attribute #3: College degree. This may not be a strong clue. However, education background is usually found in a user's profile, and imposters may use this information to inflate their social status.

Attribute #4: Avatar photo. A head portrait photo is commonly found in almost all OSN user accounts.


The true identity of the photo, or its ownership, however, is never vetted. In other words, almost anyone can upload a photo of anybody else without being caught. Most OSNs rely on a reactive type of security measure: only upon receiving a complaint from some users will corrective actions be taken to verify whether the photo is authentic. In our case, this feature is a binary attribute indicating whether the avatar photo in use can easily be found through a search engine.

The avatar photo is also used as confirmation of whether the user is fake. During the preparation of training data, the class target field needs to be filled in for each record. In this stage, the avatar photo from each record/user is uploaded to Google Photo Search to look for photos found online that appear identical to the avatar. It is assumed that fake users snatch photos online, usually of young and appealing ladies, such as photos of female models in seductive poses. Often these photos are widely circulated on the Internet and can be found across multiple websites. As a result, if an avatar photo matches one of these photos, one can reasonably conclude that the account behind the avatar is fake; at least, technically, the verdict states the fact that the account holder used one of these photos as his/her avatar, with a good chance that the user is trying to pretend to be that good-looking lass.

As a case example, a Facebook account called "**** Sweet" impersonates a Taiwanese female model known as "Xiao Bu". Figure 1 shows a screenshot from the imposter's account. On his/her wall the imposter mingled with a number of users who might have mistaken the imposter for a real person with the looks of Xiao Bu, commenting on and praising her beauty. The imposter thereby actively engaged with other OSN users online, with dozens of comments exchanged on almost every photo posted on the wall. This is a typical case of assuming a false identity using somebody else's photo that can easily be found on the Internet. Xiao Bu, however, has a public page on Facebook at http://www.facebook.com/milk0817, with over 11,834 Likes (as of 3/11/2012). Without knowledge of this page, one may be fooled into believing that the imposter looks like Xiao Bu. Figure 2 shows a manual way to verify whether the avatar in question is genuine: upload the same photo to Google's Photo Search.

Figure 1. Screen capture of Xiao Bu's photo being posted on another Facebook account with a different name.

Figure 2. Screen capture of the many search results returned from Google Photo Search, showing that the likely source is the Taiwanese model Xiao Bu.

Attribute #5: Personal information in the profile. It is believed that authentic users post more information in their personal profiles than imposters, who are secretive by nature. Of course this may not be an absolute truth, because some real users have privacy concerns and may not post as much information as a pretender. Any bias in this variable will be reflected by the data themselves, given a pool of samples in the training data.

Attribute #6: Authentic pictures. This attribute measures how many pictures on an OSN user account are authentic compared with those that can be found in duplicate on the Internet. Authentic pictures are defined in this context as pictures that likely belong to the owner. The approximate way to find out is to search for them on the Internet: if a photo cannot be found from another source, it is likely original and the ownership claimed by the user is likely true. As a counter example, imposters often post, in addition to their avatar, other photos that are also pinched from other sources on the Internet.

Attribute #7: Advertisement. This is a strong indicator, as one of the adverse motives of imposters is to spam (convey unsolicited messages to) their network of friends. An assumed profile of somebody who looks trustworthy is bait to recruit a wide network of friends. Subsequently such users tag their friends or float invitations that link to advertisements for certain products or services. A legitimate OSN user would not normally be so bent on advertising or advocating a certain product. Checking the amount of advertising initiated by a user therefore serves as an effective approach to determining whether he is an imposter.

Attribute #8: Profile completeness. Completing a profile on Facebook requires filling in information across a number of sections: the schools the user used to attend, working address, home address, coach fellow, family details, basic profile information, self-assessment, other homepages, contact information, favorite proverbs and mottos, etc. Imposters may not have the full details of the identity they assume.


Often they just fill in (perhaps arbitrarily) a birth year that suggests a young lady, plus one or two links that lead visitors to malicious websites, such as pornography. The completeness of the profile is represented as a percentage computed from the number of blank fields against the total number of required information fields.

Attribute #9: Number of friends. From our experience, fake users on Facebook usually have many friends. These friends might never have met or become acquainted with the imposter offline, but they were added because of the visual attraction of the deceiving photos when the imposter broadcast friend requests to the mass of users. A more precise measure is the rate of increase in the number of friends. Taking into consideration how long ago the user joined Facebook and set up the account, and the current number of friends, the average rate of increase per day can be estimated. As an extreme example, if a user started his account a month ago and has over 1,000 friends on the list, the rate is 1,000/30. Such a high increase rate, more than 30 new friends per day, is far from realistic for an ordinary person. It is a known characteristic that imposters work aggressively on expanding their friend base (for spamming) by adding as many friends as possible, even without really knowing them.
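As a sketch of the rate computation just described (assuming the membership length is available in days; the method name is ours, not part of any API):

// Average number of new friends added per day since the account was created.
// Example from the text: 1,000 friends after 30 days gives a rate of about 33.3,
// far above what an ordinary user would accumulate.
static double friendIncreaseRate(int currentFriendCount, long membershipDays) {
    if (membershipDays <= 0) {
        return currentFriendCount; // avoid division by zero for brand-new accounts
    }
    return (double) currentFriendCount / membershipDays;
}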

Attribute #10: Length of membership. How long a user has been on a social network relates to his or her trustworthiness. It is generally perceived that senior users on OSNs who have posted and behaved responsibly are relatively credible. True enough, many imposters do not last long; their presence vanishes as soon as their scams are exposed or when policing actions are taken by the OSN administrators.

Attribute #11: Gender of the majority of mutual friends. This may not be a strong indicator, but it is a reasonable gauge that most of the OSN users lured into adding the alleged imposter (who usually uses an avatar of an appealing young girl) are male; opposite sexes attract. Upon receiving a friend request, one can only see information about the friends held in common, even when using the Facebook Developer API, and among these mutual friends a predominance of male users keen to add, or be added by, the imposter is observed.

Attribute #12: Comments on other posts. Genuine OSN users are more likely and open to interacting with other users, commenting and talking about everything under the sun. So are legitimate commercial users or administrators who disseminate information, even about their company's products and services, to a wide range of users. In contrast, imposters may prefer to remain stealthy, for their purpose is to syndicate and to stalk.

Attribute #13: Others. This attribute is a composite of other supplementary 'known' factors taken from a recent study 2. Fake account users tend to abuse photo tagging, doing so on average about 100 times more often than real users; they place roughly 136 tags for every four photos, whereas a real account holder tags only about once among every four photos. Some 58% of fake Facebook accounts list that they are interested in both men and women, while only about 6% of legitimate accounts publish the same. In addition, according to another study 3, phony profiles tend to stand out by the sheer volume of their "Friends": on average they possess a remarkable 726 Facebook friends, while real users have only about 130. Nearly 70% of the posers claim to have attended college, while about 40% of legitimate users' profiles include a college education. There are other features that can influence the judgment of a fake user as well. If system information can be obtained at the backend server, accounts created in large numbers from the same IP address may not be 'real'; a user who has not logged in to the OSN for a long time and does not update the content of his homepage may not be 'real' either.

2 http://www.webpronews.com/facebook-likely-has-83-million-fake-users-2012-08

B. Decision Tree Construction

Five different decision tree algorithms are employed in the experiment. What they have in common is the dual train-and-test step in building a decision tree, as shown in Figure 3. A decision tree is characterized by piecing together a set of IF-THEN-ELSE-like conditional paths, starting from the root of the tree and ending in target class labels, which are called leaves. When the tree is built, the set of rules matures and stabilizes, and the set of paths is then ready to be used for testing new instances. An instance is a record that contains values for the corresponding attributes, such as those described in the preceding section. The trained decision tree basically maps the attribute values through the conditional tests in each node, and a verdict is finally reached by passing through the set of rules.

Figure 3. A typical train-and-test process for a decision tree.

Without being exhaustive, the five decision tree algorithms are briefly introduced as follows. The algorithms are implemented in Java on WEKA 4.

J48: This is WEKA's implementation of the classical C4.5 decision tree (in pruned and unpruned variants). It uses a recursive partitioning method, a greedy search, to divide the training samples while selecting an attribute to be nominated as a splitting node. The criterion is the highest information gain obtained by converting a suitable attribute into a splitting node. As nodes are produced in each recursive step, the decision tree grows level by level until all the attributes are exhausted.

REPTree: The name is coined from its description as a fast decision tree learner. Similar to J48, REPTree constructs a decision and

3 http://www.darkreading.com/insider-threat/167801100/security/client-security/232600186/how-to-spot-a-fake-facebook-profile.html
4 http://www.cs.waikato.ac.nz/ml/weka/


regression tree by using information gain. It prunes the tree to its smallest possible size by using reduced-error pruning.

RandomTree: As the name suggests, a certain number of attributes is randomly selected at each node when building the decision tree. It is a streamlined alternative to J48 that does not evaluate the information gain of every attribute at each split, considering only a random subset of attributes instead. Speed is its advantage: RandomTree builds a tree very quickly, which may make it suitable for real-time online applications.

ADTree: The full name is alternating decision tree. ADTree is a boosting method built on the decision tree learning algorithm, and its prediction accuracy is higher than that of a general decision tree. The current ADTree construction algorithm can effectively deal with small data sets, but for mass data processing it can be very slow.

FT: This is an improved version of the decision tree that incorporates predictive function leaves (hence the name Functional Tree) to produce refined prediction results. The base classifier is a regression model similar to that of CART. FT uses the statistics in every node to map the attribute values to a predicted class; by doing this, it incurs a high computational cost and must store the 'counts' of each attribute value in each node during the training phase.

IV. EXPERIMENT

Empirical data from the author's Facebook account 5 are used in the experiment to verify the efficacy of the data mining classification method in classifying fake and real users. The data were collected longitudinally over half a year, during which up to 915 friends accumulated on the account. A simple and consistent rule was applied when adding users, who potentially come from all walks of life, mostly in Macau: send a friend request to any user who appears on the Facebook friend recommendation list whenever the number of mutual friends is equal to or greater than 30.

This gives a fair mix of samples regardless of their backgrounds and other details. The social connections are based purely and quantitatively on the number of mutual friends, although Facebook may apply internal algorithms that recommend friends who have something in common with, or are otherwise related to, the active user.

Along the way, both fake users and real users are systematically recruited into the friend list, and their profile information is then extracted to build the training dataset for constructing a decision tree. In the next step, soon after the profile information is extracted and the attribute values are collected for each record, we fill in the a-priori class value stating whether the user is fake or real. Of course, the fundamental users who existed before the sample recruitment are screened real friends, including family members, relatives, colleagues, buddies and acquaintances who have already been met and are known offline. For the others, manual evaluation is used; the method, using Google Image Search, was introduced in the earlier section. Photos of the newly added friends are checked for matches with those that might have circulated widely

5 http://www.facebook.com/fongsimon

online. By comparing and contrasting the information specified in the user's account with the information told about the counterpart elsewhere on the Internet, we can judge the authenticity of the information by referring to the majority, and hence determine whether the user is lying about the profile information as well as about the authenticity of the photos posted.

For instance, in the case of Xiao Bu, the Facebook account nicknamed "**** Sweet" was deemed fake because the majority of information about Xiao Bu indicates that the person in the photo is a Taiwanese model, whereas the person impersonating Xiao Bu on Facebook claimed to be a Macanese student living in Macau. The claimed information contradicts the majority of the information found on the Internet.

Furthermore, an interesting phenomenon is observed in this pool of fake users, who have been verified by Google Photo Search and the intuitive logic described above. The fake users tend to cluster together under a clustering algorithm that essentially emulates the link length within a social group by measuring the similarity between each pair of users. The distance measured is, for example, the Euclidean distance, which takes multi-dimensional attributes into account. An online graph visualization program called Meurs Challenger 6 is used to generate a snapshot of the social connections in the experimental data.
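As a sketch of the similarity measure behind this clustering view (assuming each user is reduced to a numeric attribute vector; feature scaling is omitted for brevity):

// Euclidean distance over the multi-dimensional attribute vectors of two users.
// A smaller distance indicates more similar profiles, emulating a shorter
// social "link length" between the pair.
static double euclideanDistance(double[] userA, double[] userB) {
    if (userA.length != userB.length) {
        throw new IllegalArgumentException("attribute vectors must have the same length");
    }
    double sum = 0.0;
    for (int i = 0; i < userA.length; i++) {
        double diff = userA[i] - userB[i];
        sum += diff * diff;
    }
    return Math.sqrt(sum);
}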

As shown in Figure 4, there are four groups of users on the friend list. The group on the far left is the group of authentic users who are real and known to the author personally. To the right of it, three sub-clusters form a large blob. The blob is the result of the newly added friends according to the 30-mutual-friend rule; in general, this large blob of users are not genuine friends known personally by any means. Within the blob, interestingly enough, there are three sub-groups: one is a group of 'commercial' users who are there to promote their products (at eight o'clock), one is a group of possibly real users vetted by Google Image Search (at ten o'clock), and the last (at four o'clock) is a bunch of fake users who impersonate somebody popular by stealing their photos from the Internet. The groups in Figure 4 are labeled with callouts for clear illustration.

Figure 4. A social graph that shows four distinct groups of Facebook users.

6 http://www.q1000.ro/challenger

(Callout labels in Figure 4: real and acquainted friends; real but not acquainted friends; fake friends (imposters); real but commercial friends.)


With the dataset containing both the labeled real and fake users in place, five different decision tree classifiers are trained. The training is done with 10-fold cross-validation to ensure the model is properly and sufficiently evaluated. Three major performance indicators are reported here: accuracy, defined as the number of correctly classified instances over the total number of instances; the time taken to train the full model; and the size of the resultant decision tree. These three performance indicators have implications for decision support software design in an OSN environment. Accuracy is of course mandatory, as the prime purpose is to classify fake and real users correctly; the time taken to train a decision tree model corresponds to the training speed at the algorithm level; and the tree size has implications for the amount of run-time memory required. If the classifier is programmed into a mobile app, run-time memory consumption would be stringent.
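A minimal sketch of this train-and-evaluate step on WEKA's Java API is given below, using J48 as the example classifier. It assumes the labeled profiles are stored in an ARFF file (the file name is a placeholder); accuracy comes from WEKA's ten-fold cross-validation, training time is measured around buildClassifier, and the tree size is read from the trained model.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FakeUserClassifier {
    public static void main(String[] args) throws Exception {
        // Load the labeled training data; the last attribute is the real/fake class.
        Instances data = DataSource.read("osn_profiles.arff"); // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();

        // Training time for building the full model.
        long start = System.currentTimeMillis();
        tree.buildClassifier(data);
        long trainMillis = System.currentTimeMillis() - start;

        // 10-fold cross-validation for the accuracy estimate.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
        System.out.println("Training time: " + trainMillis + " ms");
        System.out.println("Tree size (nodes): " + tree.measureTreeSize());
    }
}

The same loop can be repeated with the other four classifiers (REPTree, RandomTree, ADTree, FT) to collect the three indicators for each algorithm.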

The performances of the five classification algorithms in terms of the three measures are tabulated in Table I. For the sake of easy comparison, however, the values in Table I, which are on different scales (percentage for accuracy, length of time, and tree size), should be normalized. The values are converted to a standard scale between 0 and 1, as follows, and are shown in Table II.

TABLE I. PERFORMANCES OF CLASSIFIERS

TABLE II. NORMALIZED PERFORMANCES OF CLASSIFIERS

The normalized accuracy is defined as $\hat{A}_i = \frac{A_i - A_{\min}}{A_{\max} - A_{\min}}$. The normalized speed $\hat{S}_i$, for which a greater value is better, is defined from the inverse of the normalized time taken as $\hat{S}_i = \frac{T_{\max} - T_i}{T_{\max} - T_{\min}}$. Likewise, the compactness of a decision tree, $\hat{C}_i$, is in inverse proportion to the tree size, defined as $\hat{C}_i = \frac{N_{\max} - N_i}{N_{\max} - N_{\min}}$, where $i$ is the index of the algorithm, $A$ is the accuracy, $T$ is the time taken, and $N$ is the number of nodes representing the tree size. As a result, the relative performances of the five algorithms can be computed and put side by side for easy comparison. A tri-bar chart, shown in Figure 5, summarizes their performances and hence their applicability for classifying fake OSN users.
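The normalization can be computed directly from the raw measurements; a short sketch follows (the arrays are ordered by algorithm index, and the raw numbers would be taken from Table I):

// Min-max normalization of the three performance indicators so that a higher
// score is always better: accuracy is scaled directly, while time taken and
// tree size are inverted (smaller raw values score closer to 1).
static double[] normalizeHigherIsBetter(double[] raw) {
    double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
    for (double v : raw) { min = Math.min(min, v); max = Math.max(max, v); }
    double[] norm = new double[raw.length];
    for (int i = 0; i < raw.length; i++) {
        norm[i] = (max == min) ? 1.0 : (raw[i] - min) / (max - min);
    }
    return norm;
}

static double[] normalizeLowerIsBetter(double[] raw) {
    double[] norm = normalizeHigherIsBetter(raw);
    for (int i = 0; i < norm.length; i++) {
        norm[i] = 1.0 - norm[i]; // equals (max - raw[i]) / (max - min)
    }
    return norm;
}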

From Figure 5 we can see that classifiers such as REPTree, RandomTree and FT do not qualify, because one of their performance indicators scores zero (worst among all). For example, REPTree has perfect normalized scores for speed and compactness, but its accuracy is the lowest of all; FT has the highest accuracy but takes the longest time to train a model; RandomTree is one of the fastest but incurs the biggest tree size. In contrast, J48 and ADTree are good candidates that achieve a good balance across the three metrics.

Figure 5. Comparison chart for applying different classification algorithms for finding fake users, with respect to three performance indicators.

V. CONCLUSION

Classifying fake users on an online social network has always been a challenging computational task. In this paper we proposed a feature-based data mining method, specifically decision trees, for automatically separating fake users from normal ones. Unlike social graphs, decision trees require much less memory space, yet they can abstract the patterns well, in terms of conditional tests on the feature attributes that lead to a verdict. Their accuracies range from 70.3% to 92.1%, depending on the choice of algorithm. The efficacy has been validated, as a case study, using empirical data collected longitudinally from the author's Facebook account.

REFERENCES

[1] J. Fiaidhi, O. Mohammed, S. Mohammed, S. Fong, and T-H. Kim, “Mining Twitterspace for Information: Classifying Sentiments Programmatically using Java”, IEEE Seventh International Conference on Digital Information Management (ICDIM 2012), 22-24 August 2012, Macau, pp.303-308.

[2] S. Mohammed, O. Mohammed, J. Fiaidhi, S. Fong, and T-H. Kim, “Classifying Unsolicited Bulk Email (UBE) using Python Machine Learning Techniques”, International Journal of Hybrid Information Technology, ISSN: 1738-9968, SERSC, 2012.

[3] S. Fong, "Measuring Emotions from Online News and Evaluating Public Models from Netizens' Comments: A Text Mining Approach", Journal of Emerging Technologies in Web Intelligence (JETWI), (Invited Paper), Academy Publisher, ISSN 1798-0461, Volume 4, Issue 1, February 2012, Oulu, Finland, pp.60-66.

[4] R. Tang, L. Lu, Y. Zhuang, and S. Fong, “Not Every Friend on a Social Network Can be Trusted: An Online Trust Indexing Algorithm”, Workshop of IEEE/WIC/ACM Web Intelligence 2012 (WI'12).

[5] Q. Cao, M. Sirivianos, X. Yang and T. Pregueiro, “Aiding the Detection of Fake Accounts in Large Scale Social Online Services”, Symposium on Networked Systems Design and Implementation, 2012, pp.1-14.

[6] Z. Yang, C. Wilson, X. Wang, T. Gao, B. Y. Zhao, and Y. Dai. “Uncovering Social Network Sybils in the Wild”, In IMC, 2011.

[7] G. Stringhini, C. Kruegel and G. Vigna, “Detecting spammers on social networks”, In Proc. of ACSAC (Austin, TX, December 2010).
