Jun Zhang - University of Kentuckyjzhang/CS689/PPDM-Chapter9.pdf · 2015. 3. 30. ·...
Jun Zhang
Department of Computer Science
University of Kentucky, USA
Outline
• Social Network Background
• Privacy Challenges
• Data Privacy and Data Utility
• Clustering-Based and Heuristic Algorithms
• Case Study
• Conclusions
What is a “social network”?
• Information circulation: blog, photo, news
• Information sharing: data, friendship, communication, professional
A social network is a computerized interactive structure whose purpose is to promote information circulation and sharing, aided by computer devices and internet media.
Section 1: Background
1971: Beginning of the Internet
As a precursor of the current Internet, ARPANET connected just 18 academic and governmental partners.
Source: Richard T. Griffiths. History of the Internet, Internet for Historians
2011, everyone’s Internet
What is social networking?
Many people are virtually connected in one way or another.
Growing Popularity
• More than 160 major social network websites in the world [1]
• In the 3rd quarter of 2014, Facebook had 1.35 billion active users, 728 million of whom logged in daily; each account had an average of 130 friends [3]
• In 2008, new Facebook accounts were created mainly by users aged 35 and older [4] (270% increase)
• At the beginning of 2008, Twitter had only 0.5 million users; by the end of 2008, the number had grown to 4.43 million [5] (752% increase). In the 4th quarter of 2014, it had 288 million active users
• Top global social networks (2015): Facebook 1,415 million, QQ 829 million, WhatsApp 700 million, Qzone 629 million, Facebook Messenger 500 million, WeChat 468 million, LinkedIn 347 million, Skype 300 million, Google+ 300 million, Instagram 300 million, Baidu Tieba 300 million, Twitter 288 million
What are they doing online?
Why research social networks?
• Essential reason: a social network is an abstract but effective representation of real society.
• This abstract representation can help us understand social development, economic depression, information circulation, epidemic spread, etc.
• A better understanding of social networks can promote social benefits and support policy-making.
Representation of real society
Example 1 [1][2]:
• A social network can reflect real trends in society in some cases.
• The correlation between social networks and real society means wide demographic involvement in social networks.
[1]. http://openlook.org/blog/2007/12/21/cb-1195/ (in Korean), [2]. Lee et al. Googling hidden interactions: Web search engine based weighted network construction, 2007.
Revolutions and Social Network
[1]. http://socialtuts.cim, [2]. News.mgid.com. [3]. Hellotrade.com.
We are close to each other
Six Degrees of Separation (small-world theory): in a society, any two persons can be connected through a chain of no more than 6 friends.
Example 2: In 1967, Dr. Stanley Milgram at Harvard University verified it with a seminal experiment.
On Facebook, the average degree of separation between users is 4.74, and it is shrinking quickly.
Source: http://en.wikipedia.org/wiki/Six_Degrees_of_Separation_%28film%29
Academic Interest
Why study Privacy Preservation?
• User engagement
Wide demographic engagement benefits social networks, but malicious users such as hackers can exploit it.
• Information circulation
Information circulation advances data utilization, but information can be damaged in the process of circulation.
• Information sharing
Information sharing can promote mutual benefits. But how can the level and boundary of sharing be controlled (e.g., against pirating)?
Privacy in social networks
• Case 1:
Dr. Sweeney at CMU was able to pinpoint the medical record of the Governor of Massachusetts, USA, by linking the state's public voter registration table (right circle) with the anonymized insurance information from GIC (left circle).
Source: Sweeney. k-anonymity: a model for protecting privacy, 2002.
Challenge 1: How to store/handle social data
• Huge volume: 1.415 billion active accounts on Facebook, 50% of which log in daily; each account has an average of 130 friends
• Heterogeneity: numerical (age, salary, frequency), discrete (political affiliation), string (blog, comment), multimedia (photo, audio, video), relational data (friendship, membership)
• Rapid update
Section 2: Challenge
Challenge 2: How to represent background information
Any information can be used as background information to aid privacy attacks:
• topological structure
• profiles
• friendship
• membership
• contact frequency
• more…
Challenge 3: How to measure data loss
• To preserve privacy, it is necessary to modify/perturb the original social networks.
• In the process of modification, some original information/patterns will be lost.
• How can data loss be measured mathematically? What is the baseline for data loss?
Challenge 4: How to design algorithms
• Globalization of data utility
• Balance of data privacy and data utility
• Multi-utilization
What is Data Privacy?
• An open problem
• In the field of social networks, private information is any information that can be used to link a social network user to a real-world identity.
• An attack is any projection that can establish such a link between a social network user and a real identity.
Section 3: Privacy & Utility
Purpose
Cut off any information connection
Naïve Privacy
Directly identifiable information:
• Identification numbers (passport, driver's license)
• Real name, address, affiliation
• Special experiences (chairman of a department)
Node Degree Attack
[Figure: anonymized graph with nodes A, B, C, D, E, F]
Background: Bob is friends with everyone; in other words, Bob is the most popular person in this group.
Attacker: “Haha, C is Bob.”
Source: Backstrom et al. WWW ’07, Hay et al. VLDB ’08, Liu & Terzi SIGMOD ’08, Narayanan SP ’08, ’09
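The node degree attack can be sketched in a few lines of Python. The adjacency list below is hypothetical (the slide's figure is not available in this transcript); it only assumes that exactly one node is linked to all others, matching the background knowledge that Bob is friends with everyone.

```python
# Hypothetical anonymized graph: labels are pseudonyms, edges are friendships.
GRAPH = {
    "A": {"C"},
    "B": {"C", "D"},
    "C": {"A", "B", "D", "E", "F"},
    "D": {"B", "C"},
    "E": {"C", "F"},
    "F": {"C", "E"},
}


def candidates_by_degree(graph, known_degree):
    """Return the pseudonyms whose degree matches the attacker's background knowledge."""
    return sorted(node for node, friends in graph.items() if len(friends) == known_degree)


# Background knowledge: Bob is friends with everyone else, so his degree is n - 1.
bob_candidates = candidates_by_degree(GRAPH, len(GRAPH) - 1)
print(bob_candidates)  # a single candidate means Bob is re-identified
```

When several nodes share the queried degree, the attacker is left with a candidate set instead of a single node, which is the idea behind degree-based anonymization.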
Neighborhood Attack
[Figure: anonymized graph with nodes A, B, C, D, E, F]
Background: Bob has 2 friends, Alice and Carl, who know each other. Bob has another 2 friends, Dunn and Lily, who do not know each other. Lily has many other friends, but Bob knows none of them.
Inference: C is Bob, B is Lily, and A is Dunn.
Source: Zhou et al. Preserving Privacy in Social Networks Against Neighborhood Attacks. ICDE 2008.
Membership Attack
[Figure: graph with nodes A, B, C, D, E, F and two groups, College Congress and Hiking Club]
Background: A={male, 32}, B={male, 16}, D={male, 20}, E={female, 21}, F={female, 22}, C={??, ??}.
Inference: “C should be male and around 20.”
Source: Zheleva & Getoor PinKDD ’07, Korolova et al. CIKM ’08
Link Label Attack
[Figure: anonymized graph with labeled links between nodes A, B, C, D, E, F]
Background: Bob is a hearing-impaired person.
Attacker: “Haha, B is Bob.”
Source: Cormode et al., VLDB 2008.
Association Attack
“I don’t want my friends to know that Peter is my brother.”
Sybil Attack
Source: Backstrom et al., WWW 2007.
Photo: http://www.problogger.net/archives/2007/06/28/what-social-networking-sites-do-you-use-how-do-they-benefit-your-blog/
Theory: k = (2+δ) log(n) sybil nodes can breach most node identities.
Experiment: in a real social network with 4 million nodes, just 7 sybil nodes can locate 2,400 real identities.
What is data utility?
• Another open question
• Data utility is any knowledge or pattern obtained from analyzing social networks that facilitates the understanding of social networks or society.
Purpose
Maintain usefulness of data
Degree distribution
Source: [1] Newman et al. Email networks and the spread of computer viruses. 2002 (photo). [2] Albert et al. Error and attack tolerance of complex networks. Nature 2000.
Giant community
Source: Newman et al. Email networks and the spread of computer viruses. 2002
Shortest paths (data utility)
Source: [1] Liu et al. Privacy Preservation in Social Networks with Sensitive Edge Weights, SDM 2009. [2] Das et al. Anonymizing Edge-Weighted Social Network Graphs, ICDE 2010.
Eigenvalue
Source: Ying et al. On Randomness Measures for Social Networks, SDM 2009.
SQL query
• SELECT COUNT(DISTINCT node.major)
  FROM social network G
  (count the number of majors in G)
• SELECT AVG(node.age)
  FROM social network G
  WHERE node.interest = 'computer game'
  (compute the average age of users interested in computer games)
• more… (data utility and SQL query)
Clustering-based algorithm
Original attributes (groups: College Congress, Hike Club):
A={male, 49}, B={male, 16}, C={??, ??}, D={male, 21}, E={female, 22}, F={female, 20}
Generalized attributes (groups: College Congress, Hike Club):
A={*, 49}, B={*, 16}, C={??, ??}, D={male, [16-30]}, E={female, [16-30]}, F={female, [16-30]}
Source: Campan PinKDD ’08, Hay VLDB ’08, Cormode VLDB ’08, VLDB ’09
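The generalization step can be illustrated with a minimal Python sketch. It is a generic version, not the exact scheme on the slide: it assumes that every record in a cluster is replaced by the same summary, with ages widened to the cluster's own [min-max] range (the slide uses a broader [16-30] bin) and gender suppressed to * when the cluster is mixed. The function name and record layout are hypothetical.

```python
def generalize_cluster(records):
    """Map every record in one cluster to the same generalized (gender, age range) pair.

    `records` maps a node name to a (gender, age) tuple.
    """
    genders = {gender for gender, _ in records.values()}
    ages = [age for _, age in records.values()]
    # Keep the gender only if all members agree; otherwise suppress it.
    gender = genders.pop() if len(genders) == 1 else "*"
    age_range = f"[{min(ages)}-{max(ages)}]"
    return {name: (gender, age_range) for name in records}


# Attribute values taken from the slide; the choice of cluster is an assumption.
cluster = {"D": ("male", 21), "E": ("female", 22), "F": ("female", 20)}
print(generalize_cluster(cluster))
```

After generalization the members of a cluster are indistinguishable by their attributes, at the cost of coarser values for analysis.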
Section 4: Algorithms
Heuristic
[Figure: graph with nodes A, B, C, D, E, F, G, H, I]
Original degree sequence: d(G)={4, 4, 2, 2, 2, 1, 1, 1, 1}, v(G)={D, C, E, F, B, A, I, G, H}
Goal: the number of nodes with the same degree should be more than 3.
Anonymized degree sequence: d(G)={4, 4, 4, 4, 2, 2, 2, 2, 2}, v(G)={D, C, E, F, B, A, I, G, H}
Source: Liu et al. Towards Identity Anonymization on Graphs, SIGMOD 2008.
Heuristic
[Figure: graph with nodes A, B, C, D, E, F, G, H, I]
d(G)={4, 4, 2, 2, 2, 1, 1, 1, 1}
Goal: the number of nodes with the same degree should be more than 3 (k=3).
DA(d[i,j]) = cost of the optimal solution that makes d[i], ..., d[j] satisfy the goal
I(d[i,j]) = cost of the optimal solution that makes d[i], ..., d[j] all have the same degree:
$I(d[i,j]) = \sum_{l=i}^{j} \bigl(d(i) - d(l)\bigr)$
For $i < 2k$: $DA(d[1,i]) = I(d[1,i])$
For $i \ge 2k$: $DA(d[1,i]) = \min\Bigl\{ \min_{k \le t \le i-k} \bigl\{ DA(d[1,t]) + I(d[t+1,i]) \bigr\},\; I(d[1,i]) \Bigr\}$
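The recurrence can be turned into a short dynamic program. The Python sketch below is an illustration under one stated assumption: it reads the goal as "every degree value is shared by at least k nodes", the usual definition of k-degree anonymity. It works only on the degree sequence and does not check that the result can be realized as a graph; the function names are hypothetical.

```python
def group_cost(d, i, j):
    """I(d[i..j]): total increase needed to raise d[i..j] (0-based, inclusive) to d[i]."""
    return sum(d[i] - d[l] for l in range(i, j + 1))


def anonymize_degrees(degrees, k):
    """Return (total increase, anonymized sequence) with every group of size >= k."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    if n < k:
        raise ValueError("need at least k nodes")
    inf = float("inf")
    cost = [inf] * (n + 1)   # cost[m] = DA of the first m degrees
    split = [0] * (n + 1)    # start index of the last group in that solution
    cost[0] = 0
    for m in range(k, n + 1):
        # The last group is d[t..m-1]; it must contain at least k degrees.
        for t in range(0, m - k + 1):
            c = cost[t] + group_cost(d, t, m - 1)
            if c < cost[m]:
                cost[m], split[m] = c, t
    # Walk the split points backwards to rebuild the anonymized sequence.
    out = d[:]
    m = n
    while m > 0:
        t = split[m]
        for l in range(t, m):
            out[l] = d[t]
        m = t
    return cost[n], out


total, sequence = anonymize_degrees([4, 4, 2, 2, 2, 1, 1, 1, 1], k=3)
print(total, sequence)
```

Prefixes shorter than k keep an infinite cost, so they are never chosen as a split point; this plays the same role as the restriction k ≤ t ≤ i−k in the recurrence. Under the "at least k" reading, the sketch may return a cheaper sequence than one built from larger groups.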
Weight privacy
In a weighted social network, a large weight probably implies a close personal relationship, which many people do not want made public.
Section 5: Case Study
Social Network Weight Privacy
In a social network, the frequency of communication (chats, messages, e-mails) between users can serve as edge weights. If you have several online friends or girlfriends, you probably do not want to disclose how often you communicate with each of them.
[Figure: example network with edge weights 100, 20, 40, 35]
Social Network in real life
In order to get his child enrolled in a good school, a parent wants to get connected with the Principal.
Spend the minimum amount of money, get the maximum amount of benefit.
[Figure: paths from Parents to Principal through Person 1, Person 2, ..., Person n, with edge costs ¥500, ¥100, ¥300, ¥200, ¥400, ¥500]
Data Privacy and Data Utility
In the following presentation:
• Data Privacy: all edge weights
• Data Utility: shortest paths and their lengths, chosen for their rich applications
Goals:
• Perturb weights as much as possible.
• Keep the shortest paths (and lengths) as close to the original ones as possible.
Weight Privacy in Business
[Figure: supply network from New Supplier to Walmart through Agents A, B, C, D, E; edge weights 40, 10, 43, 60, 85, 50, 90, 70, 66, 48; unit = million dollars/month]
Find the cheapest supply chain from New Supplier to Walmart. This is a shortest-path problem.
Challenges
Theorem: There does NOT exist a perfect scheme that modifies all weights but maintains all shortest paths (and lengths). *
* For the formal proposition and mathematical proof, see Proposition 1 in our paper.
Challenges:
• Data Utility (i.e., the shortest paths and lengths) is a global property.
• Data Privacy (i.e., individual weights) is local information.
How can we carefully change local weights without an unacceptable impact on shortest paths and lengths?
Gaussian Perturbation
[Figure: original supply network from New Supplier to Wal-Mart through Agents 1, 2, 3, 4, 5; edge weights 40, 10, 43, 60, 85, 50, 90, 70, 66; unit = million/month]
$w^*_{i,j} = w_{i,j}\,(1 - x_{i,j})$, where $x_{i,j}$ is a random number drawn from the Gaussian distribution $N(0, \sigma^2)$.
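The perturbation rule is simple enough to sketch directly. The Python snippet below assumes that each edge draws its own independent x from N(0, σ²); the edge names are hypothetical placeholders, since the topology of the slide's figure is not available here.

```python
import random


def gaussian_perturb(weights, sigma, seed=None):
    """Return perturbed weights w* = w * (1 - x), with x ~ N(0, sigma^2) per edge."""
    rng = random.Random(seed)
    return {edge: w * (1.0 - rng.gauss(0.0, sigma)) for edge, w in weights.items()}


# Hypothetical edges; only the weight values are taken from the slide.
weights = {
    ("New Supplier", "Agent 1"): 40.0,
    ("Agent 1", "Agent 5"): 10.0,
    ("Agent 5", "Wal-Mart"): 43.0,
}
perturbed = gaussian_perturb(weights, sigma=0.15, seed=1)
for edge, w in weights.items():
    print(edge, w, round(perturbed[edge], 2))
```

Because each edge is perturbed independently of the rest of the graph, the procedure needs no global information, which is also why it cannot guarantee that shortest paths are preserved.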
[Figure: the supply network after Gaussian perturbation; edge labels 35, 32, 33, 70, 70, 65, 67, 70, 70, 48, 36; unit = million/month]
• Privacy: Almost all weights are changed.
• Utility: The shortest path between New Supplier and Wal-Mart is unchanged, and its length is 99.
HOW TO MODIFY WEIGHTS AND KEEP SP
• Gaussian Perturbation
Along a path, individual edge weights may be changed in either a negative or a positive direction, so the total change over the path may be very close to zero.
97.7% of the xi,j are below 2σ and 99.9% are below 3σ (one-sided); equivalently, about 95.4% and 99.7% of them lie within 2σ and 3σ of zero.
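The cancellation argument can be made concrete with a short derivation. It assumes that the perturbations $x_e$ on different edges are independent draws from $N(0, \sigma^2)$ and that all weights are non-negative; the symbols $p$ (a path) and $w_e$ (the weight of edge $e$ on it) are introduced here for illustration.

```latex
\begin{align*}
L &= \sum_{e \in p} w_e, \qquad L^* = \sum_{e \in p} w_e (1 - x_e), \\
L^* - L &= -\sum_{e \in p} w_e x_e, \qquad \mathbb{E}[L^* - L] = 0, \\
\operatorname{Var}(L^* - L) &= \sigma^2 \sum_{e \in p} w_e^2
  \;\le\; \sigma^2 \Bigl(\sum_{e \in p} w_e\Bigr)^2 = \sigma^2 L^2 .
\end{align*}
```

Under these assumptions the standard deviation of the change in path length is at most σL, and it is strictly smaller whenever the path has more than one edge with positive weight.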
Analysis on Gaussian perturbation
Claim 2: Let L be the length of a path in the original network and L* be the length of the corresponding path in the perturbed network.
1. Approximately 68% L satisfy ,
2. Approximately 98% L satisfy
3. Approximately 99.7% L satisfy
for a given value of σ
* For the formal theorem/corollary and mathematical proofs, see Theorem 2 and Corollary 3 in our paper, respectively.
Analysis on Gaussian perturbation
Claim 3: Let $d_{i,j}$ be the length of the shortest path between node i and node j, and $d^{second}_{i,j}$ be the length of the second shortest path between the same node pair.
For two given nodes i and j, if the ratio $\beta_{i,j} = (d^{second}_{i,j} - d_{i,j}) / d_{i,j}$ is greater than 2σ, the shortest path is highly likely to be preserved after Gaussian perturbation. *
* For the formal theorem/corollary and mathematical proofs, see Theorem 2 and Corollary 3 in our paper, respectively.
Recall Claim 2. Approximately 98% L satisfy
If the lengths of the shortest path and the second shortest path differ by a large amount, the shortest path is very likely to be preserved after the perturbation.
An example
[Figure: example graph; the shortest path between v1 and v6 has length 21 and the second shortest path has length 30]
Gaussian perturbation with σ = 0.15:
β1,6 = (30-21)/21 = 0.429 ≥ 2σ = 0.30, so the shortest path between v1 and v6 is very likely to be preserved after Gaussian perturbation.
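The check in this example is easy to reproduce. The Python sketch below computes the ratio β from the two path lengths and compares it with 2σ; the function names are hypothetical, and the test is only the heuristic from Claim 3, not a guarantee.

```python
def beta(shortest, second_shortest):
    """Relative gap between the second shortest and the shortest path length."""
    return (second_shortest - shortest) / shortest


def likely_preserved(shortest, second_shortest, sigma):
    """Heuristic from Claim 3: the shortest path is likely kept if beta > 2 * sigma."""
    return beta(shortest, second_shortest) > 2 * sigma


print(round(beta(21, 30), 3))            # relative gap for the example
print(likely_preserved(21, 30, 0.15))    # compares the gap with 2 * 0.15
```

With a larger σ, for instance 0.25, the same pair of paths no longer passes the check, because 2σ = 0.5 exceeds the gap.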
RESULTS WITH GAUSSIAN PERTURBATION
[Figure: three panels, σ=0.1 on EIES, σ=0.15 on EIES, σ=0.2 on EIES]
At x-axis 0.15, for example, the dashed point (length) is 0.8699 and the solid point (weight) is 0.8565. This means that, with the Gaussian algorithm, 85.65% of the w*i,j fall within wi,j (1 ± 0.15), and 86.99% of the d*i,j fall within di,j (1 ± 0.15).
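The quantity plotted on the y-axis can be computed as follows. This is a minimal Python sketch of the metric as described in the text, the fraction of perturbed values that stay within a relative tolerance of the originals; the function name and the sample numbers are hypothetical and are not EIES data.

```python
def fraction_within(original, perturbed, tolerance):
    """Fraction of values with |perturbed - original| <= tolerance * original."""
    hits = sum(1 for o, p in zip(original, perturbed) if abs(p - o) <= tolerance * o)
    return hits / len(original)


# Hypothetical weights before and after perturbation.
before = [10, 20, 30, 40]
after = [11, 26, 29, 40]
print(fraction_within(before, after, 0.15))
```

The same function applies to edge weights (w versus w*) and to shortest path lengths (d versus d*).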
Greedy Perturbation: Discussion
• Gaussian Perturbation is quick and independent of the global structure, but it cannot always keep the same shortest paths, even when σ is not large.
• We propose a Greedy Perturbation that keeps the exact shortest paths and makes sure that their lengths stay close to the original ones.
Edge Categorization
[Figure: three copies of a six-node graph (V1 to V6), highlighting the shortest paths p1,6, p4,6 and p3,6]
H={p1,6 , p4,6 , p3,6}.
Constraints: the shortest paths in H cannot be changed after perturbation.
[Figure: the six-node graph with edge weights 6, 9, 6, 7, 5, 13, 25, 10, 10; relative to the shortest paths p1,6, p4,6 and p3,6, each edge is marked as non-visited, partially-visited or all-visited]
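The three edge categories can be computed by counting how many of the target shortest paths traverse each edge. The Python sketch below assumes the natural reading of the names: a non-visited edge lies on none of the paths in H, an all-visited edge lies on all of them, and a partially-visited edge lies on some but not all. The edge list and the three paths are hypothetical, since the topology of the figure is not available in this transcript.

```python
def categorize_edges(edges, paths):
    """Label each undirected edge by how many target shortest paths traverse it."""
    counts = {frozenset(e): 0 for e in edges}
    for path in paths:
        for u, v in zip(path, path[1:]):
            counts[frozenset((u, v))] += 1
    labels = {}
    for edge, count in counts.items():
        if count == 0:
            labels[edge] = "non-visited"
        elif count == len(paths):
            labels[edge] = "all-visited"
        else:
            labels[edge] = "partially-visited"
    return labels


# Hypothetical graph and target set H = {p1,6, p4,6, p3,6}.
EDGE_LIST = [("V1", "V2"), ("V2", "V5"), ("V5", "V6"), ("V4", "V5"),
             ("V3", "V2"), ("V1", "V4"), ("V3", "V6")]
H = [["V1", "V2", "V5", "V6"], ["V4", "V5", "V6"], ["V3", "V2", "V5", "V6"]]

labels = categorize_edges(EDGE_LIST, H)
for edge, label in labels.items():
    print(sorted(edge), label)
```

Edges are stored as frozensets so that (V2, V5) and (V5, V2) count as the same undirected edge.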
Non-Visited Edge
Claim 3: For a non-visited edge, increasing its weight will NOT change any of the shortest paths (or their lengths) in H. *
* For the formal statement, see Proposition 7.
[Figure: the six-node graph with one non-visited edge increased from weight 7 to 10]
P1,6 (no change)
P4,6 (no change)
P3,6 (no change)
All-Visited Edge
Claim 4: For an all-visited edge, decreasing its weight will NOT change any of the shortest paths in H, but it will decrease the lengths of those shortest paths. *
* For the formal statement, see Proposition 8.
[Figure: the six-node graph with one all-visited edge decreased from weight 10 to 5]
P1,6 (no change)
P4,6 (no change)
P3,6 (no change)
Greedy Edge Perturbation (2)
Claim 5: For a partially-visited edge, if we want to increase its weight by t, we should guarantee that the shortest paths that go through it still go through it after perturbation. *
* How to guarantee this (i.e., which constraints to impose on the weight increase) is shown as Proposition 9 in our paper.
[Figure: the six-node graph with the partially-visited edge between V2 and V5 increased from weight 5 to 16]
P1,6 (may change to P-1,6)
P4,6 (no change)
P3,6 (may change to P-3,6)
P-1,6 is the shortest path between V1 and V6 in G- (G with the edge between V2 and V5 deleted).
Constraint: the weight increment t should be smaller than the difference between di,j and d-i,j, i.e. $t < d^-_{i,j} - d_{i,j}$.
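The constraint on the increment can be evaluated with two shortest-path computations per node pair. The Python sketch below is an illustration only: the graph is hypothetical, the helper names are invented for this example, and it computes the bound d⁻ − d for each pair whose shortest path uses the edge, then takes the smallest one.

```python
import heapq


def shortest_length(edges, source, target, skip=None):
    """Dijkstra on an undirected weighted graph; `skip` is an edge to leave out."""
    adj = {}
    for (u, v), w in edges.items():
        if skip is not None and {u, v} == set(skip):
            continue
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def increase_slack(edges, edge, pairs_through_edge):
    """Smallest gap d^- - d over the pairs whose shortest path uses `edge`."""
    return min(
        shortest_length(edges, i, j, skip=edge) - shortest_length(edges, i, j)
        for i, j in pairs_through_edge
    )


# Hypothetical weighted graph (not the one in the figure).
EDGES = {
    ("V1", "V2"): 6, ("V2", "V5"): 5, ("V5", "V6"): 10, ("V1", "V4"): 9,
    ("V4", "V5"): 7, ("V2", "V3"): 6, ("V3", "V6"): 25, ("V3", "V5"): 13,
}
slack = increase_slack(EDGES, ("V2", "V5"), [("V1", "V6"), ("V3", "V6")])
print(slack)  # any increment t strictly below this value keeps both paths
```

In this hypothetical graph the pair (V3, V6) is the binding one, so the edge between V2 and V5 can absorb only a small increase before that pair switches to another route.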
Partially-Visited Edge
Claim 6: For a partially-visited edge, if we want to decrease its weight by t, we should guarantee that the shortest paths that do not go through it will not change after perturbation. *
* How to guarantee this (i.e., which constraints to impose on the weight decrease) is shown as Proposition 10 in our paper.
[Figure: the six-node graph with the partially-visited edge between V2 and V5 decreased from weight 5 to 2]
P1,6 (no change)
P4,6 (may change to P+4,6)
P3,6 (no change)
P+4,6 is the shortest path between V4 and V6 that goes through the edge between V2 and V5.
Constraint: the weight decrement t should be smaller than the difference between d+i,j and di,j; otherwise a path through the edge would become shorter than the current shortest path.
Greedy Algorithm
1. Increase the weights of non-visited edges and decrease the weights of all-visited edges.
2. Sort all partially-visited edges in descending order of the number of shortest paths going through them.
3. For a given partially-visited edge, whether to increase or decrease its weight depends on the comparison between the real length and the current (perturbed) length.
4. For a given partially-visited edge, the modified value is chosen as the boundary value of the constraint inequalities.
* For the detailed algorithm, please refer to Algorithm 1 in our paper.
RESULTS WITH GREEDY PERTURBATION (1)
For example, at x-axis 0.15, the dashed-line point (length) is 60% and the solid-line point (weight) is 54%. This means that, after the greedy perturbation, 54% of the perturbed edge weights w*i,j fall within wi,j (1 ± 0.15) and 60% of the perturbed shortest path lengths d*i,j fall within di,j (1 ± 0.15), while the shortest paths of all targeted pairs in H are exactly preserved.
RESULTS WITH GREEDY PERTURBATION (2)
RESULTS WITH GREEDY PERTURBATION (3)
Discussion on Experiments
Method | Data Utility | Data Privacy
Gaussian Perturbation | Lengths of the shortest paths are better preserved, but the exact shortest paths cannot be guaranteed. | Low
Greedy Algorithm | Lengths are not as well preserved as with Gaussian perturbation, but the shortest paths are exactly maintained. | High
Case Study Remarks
(What do we want to do?) Preserve weight privacy and shortest-path utility.
(Why do we want to do it?) Weights in some social settings are sensitive and confidential.
(How do we do it?) Gaussian perturbation and greedy perturbation are proposed to balance data utility and data privacy under different conditions.
(Is what we do applicable?) The two strategies appear to meet our goals.
Conclusion
• Social networks and social network research are promising.
• Privacy issues in social network analysis should be emphasized.
• Social network privacy preservation, data utility, and social network analysis algorithms need further research and study.
Section 6: Conclusion
Funding Agencies and Student Researchers
• US National Science Foundation
• Kentucky Science and Engineering Foundation
• US National Institutes of Health
Privacy-Preserving Social Network with
Sensitive Information
Contact Information:
Dr. Jun Zhang (张骏)
E-mail: [email protected]
[email protected] (Chinese)
http://www.cs.uky.edu/~jzhang
Phone: 13540021323 (mobile phone in China)