Jun Zhang - University of Kentuckyjzhang/CS689/PPDM-Chapter9.pdf · 2015. 3. 30. ·...


Jun Zhang

Department of Computer Science

University of Kentucky, USA

Outline

• Social Network Background
• Privacy Challenges
• Data Privacy and Data Utility
• Clustering-Based and Heuristic Algorithms
• Case Study
• Conclusions

What is a "social network"?

• Information circulation: blogs, photos, news
• Information sharing: data, friendship, communication, professional contacts

A social network is a computerized interactive structure whose purpose is to promote the circulation and sharing of information, aided by computing devices and Internet media.

Section 1: Background


1971, Beginning of Internet

As a precursor of today's Internet, ARPANET connected just 18 academic and governmental partners.

Source: Richard T. Griffiths. History of the Internet, Internet for Historians


2011, everyone’s Internet


What is social networking

Many people are virtually connected in one way or another


Growing Popularity

• More than 160 major social network websites in the world [1]
• In the 3rd quarter of 2014, Facebook had 1.35 billion active users, 728 million of whom logged in daily; each account had an average of 130 friends [3]
• In 2008, new Facebook accounts came mainly from users aged 35 and older [4] (270% increase)
• At the beginning of 2008, Twitter had only 0.5 million users; by the end of 2008 it had 4.43 million [5] (752% increase). In the 4th quarter of 2014, it had 288 million active users
• Top global social networks (2015): Facebook 1,415 million, QQ 829 million, WhatsApp 700 million, Qzone 629 million, Facebook Messenger 500 million, WeChat 468 million, LinkedIn 347 million, Skype 300 million, Google+ 300 million, Instagram 300 million, Baidu Tieba 300 million, Twitter 288 million


What are they doing online

Why research social networks?

• Essential reason: a social network is an abstract but effective representation of real society.
• This abstract representation can help us understand social development, economic depression, information circulation, epidemic spread, etc.
• A better understanding of social networks can promote social benefits and support policy-making.


Representation of real society

Example 1 [1][2]:

• In some cases a social network can reflect real trends in society.
• The correlation between social networks and real society implies broad demographic involvement in social networks.

[1]. http://openlook.org/blog/2007/12/21/cb-1195/ (in Korean), [2]. Lee et al. Googling hidden interactions: Web search engine based weighted network construction, 2007.


Revolutions and Social Network

[1]. http://socialtuts.cim, [2]. News.mgid.com. [3]. Hellotrade.com.


We are close to each other

Six Degrees of Separation (small-world theory): in a society, any two persons can be connected through a chain of no more than six acquaintances.

Example 2: In 1967, Dr. Stanley Milgram at Harvard University supported it with a seminal experiment.

On Facebook, the average degree of separation between users is 4.74, and it is shrinking quickly.

Source:http://en.wikipedia.org/wiki/Six_Degrees_of_Separation_%28film%29


Academic Interest


Why study Privacy Preservation?

• User engagement

Wide demographic engagement benefits social networks, but malicious users such as hackers can exploit it.

• Information circulation

Information circulation advances data utilization, but information can be damaged in the process of circulation.

• Information sharing

Information sharing can promote mutual benefits. But how can the level and boundary of sharing be controlled, for example to prevent pirating?


Privacy in social network

• Case 1:

Source: Sweeney. k-anonymity: a model for protecting privacy, 2002.

Dr. Sweeney at CMU was able to pinpoint the Governor's medical record by linking the public voter registration table of Massachusetts, USA (right circle) with the anonymized GIC insurance data (left circle).


Privacy in social network


Challenge 1: How to store/handle social data

• Huge volume: 1.415 billion active Facebook accounts, 50% of which log in daily; each account has an average of 130 friends
• Heterogeneity: numerical (age, salary, frequency), discrete (political affiliation), string (blog, comment), multimedia (photo, audio, video), relational data (friendship, membership)
• Rapid updates

Section 2: Challenge


Challenge 2: How to represent background knowledge

Any information can serve as background knowledge that aids a privacy attack:

• topological structure
• profiles
• friendship
• membership
• contact frequency
• more…


Challenge 3: How to measure data loss

• To preserve privacy, the original social network must be modified or perturbed.
• In the process of modification, some original information and patterns are lost.
• How can data loss be measured mathematically? What is the baseline for data loss?


Challenge 4: How to design algorithms

• Globalization of data utility
• Balance of data privacy and data utility
• Multi-utilization


What is Data Privacy?

• An open problem
• In the field of social networks, private information is any information that can be used to link a social network user to a real-world identity.
• An attack is any mapping that establishes such a link between a social network user and a real identity.

Section 3: Privacy & Utility


Purpose

Cut off any information connection

Naïve Privacy

Directly identifiable information:

• Identifying numbers (passport, driver's license)
• Real name, address, affiliation
• Special experiences (e.g., chairman of a department)


Node Degree Attack

[Figure: anonymized graph with nodes A-F]

Background: Bob is friends with everyone; in other words, Bob is the most popular person in this group.

Attacker: "Haha, C is Bob."

Source: Backstrom et al. WWW '07; Hay et al. VLDB '08; Liu & Terzi SIGMOD '08; Narayanan SP '08, '09
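The degree attack above can be sketched in a few lines. This is only an illustration: the edge list is a hypothetical stand-in for the figure (which is not preserved in the transcript), and `degree_candidates` is a name introduced here. The attacker knows Bob is friends with all five other members, so only a node of degree 5 can be Bob.

```python
from collections import defaultdict

def degree_candidates(edges, known_degree):
    """Return the anonymized nodes whose degree matches the attacker's knowledge."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d == known_degree)

# Hypothetical anonymized graph (assumption: C is linked to every other node).
edges = [("C", "A"), ("C", "B"), ("C", "D"), ("C", "E"), ("C", "F"),
         ("A", "B"), ("E", "F")]

# Background knowledge: Bob is friends with all 5 other members.
candidates = degree_candidates(edges, known_degree=5)
print(candidates)
```

When exactly one candidate remains, removing names alone has not protected Bob's identity; degree anonymization aims to make such candidate sets larger.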


Neighborhood Attack

[Figure: anonymized graph with nodes A-F]

Background: Bob has two friends, Alice and Carl, who know each other. Bob has two other friends, Dunn and Lily, who do not know each other. Lily has many other friends, but Bob knows none of them.

Inference: C is Bob, B is Lily, and A is Dunn.

Source: Zhou et al. Preserving Privacy in Social Networks Against Neighborhood Attacks. ICDE 2008.


Membership Attack

[Figure: nodes A-F grouped into two communities, College Congress and Hiking Club]

Background: A={male, 32}, B={male, 16}, E={female, 21}, F={female, 22}, D={male, 20}, C={??, ??}

Inference: "C should be male and around 20."

Source: Zheleva & Getoor PinKDD '07; Korolova et al. CIKM '08


Link Label Attack

[Figure: anonymized graph with nodes A-F and labeled links]

Background: Bob is a hearing-impaired person.

Attacker: "Haha, B is Bob."

Source: Cormode et al., VLDB 2008.


Association Attack

"I don't want my friends to know Peter is my brother."

Sybil Attack

Source: Backstrom et al., WWW 2007.

Photo: http://www.problogger.net/archives/2007/06/28/what-social-networking-sites-do-you-use-how-do-they-benefit-your-blog/

[Figure: sybil nodes attached around the target user Bob]

Theory: k=(2+δ)log(n) sybil nodes can breach most node identities.

Experiment: in a real social network with 4 million nodes, just 7 sybil nodes can locate 2,400 real identities.


What is data utility?

• Another open question
• Data utility is any knowledge or pattern obtained from analyzing a social network that facilitates the understanding of social networks or of society.


Purpose

Maintain usefulness of data

Degree distribution

Source: [1] Newman et al. Email networks and the spread of computer viruses. 2002 (photo). [2] Albert et al. Error and attack tolerance of complex networks. Nature 2000.

Giant community

Source: Newman et al. Email networks and the spread of computer viruses. 2002


Shortest paths (data utility)

Source: [1] Liu et al. Privacy Preservation in Social Networks with Sensitive Edge Weights, SDM 2009. [2] Das et al. Anonymizing Edge-Weighted Social Network Graphs, ICDE 2010.

Eigenvalue

Source: Ying et al. On Randomness Measures for Social Networks, SDM 2009.

SQL query

• SELECT COUNT(DISTINCT node.major)
  FROM social network G
  (counts the number of majors in G)

• SELECT AVG(node.age)
  FROM social network G
  WHERE node.interest = 'computer game'
  (computes the average age of users interested in computer games)

• more… (data utility and SQL query)
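The two aggregate queries above can be tried on a toy node table. This is a minimal sketch using Python's built-in sqlite3 module; the table name `node`, its columns, and the sample rows are all assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# In-memory database holding a hypothetical node table for a social network G.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id TEXT PRIMARY KEY, major TEXT, age INTEGER, interest TEXT)"
)
rows = [
    ("A", "CS", 32, "computer game"),
    ("B", "Math", 16, "hiking"),
    ("D", "CS", 20, "computer game"),
    ("E", "Biology", 21, "computer game"),
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)

# Utility query 1: how many distinct majors appear in G?
num_majors = conn.execute("SELECT COUNT(DISTINCT major) FROM node").fetchone()[0]

# Utility query 2: average age of users interested in computer games.
avg_age = conn.execute(
    "SELECT AVG(age) FROM node WHERE interest = ?", ("computer game",)
).fetchone()[0]

print(num_majors, avg_age)
```

Comparing such query answers on the original and the anonymized network is one simple way to quantify how much utility an anonymization step has cost.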

Clustering-based algorithm

Communities: College Congress, Hike Club

Before generalization:
A={male,49}, B={male,16}, C={??,??}, D={male,21}, E={female,22}, F={female,20}

After generalization:
A={*,49}, B={*,16}, C={??,??}, D={male,[16-30]}, E={female,[16-30]}, F={female,[16-30]}

Source: Campan PinKDD '08; Hay VLDB '08; Cormode VLDB '08, VLDB '09
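The generalization idea can be sketched as follows. This is an assumption-laden illustration, not the cited algorithms: the helper `generalize`, the choice of cluster, and the rule "replace a mixed categorical value by `*` and a numeric value by the cluster's min-max range" are all introduced here, and the resulting ranges are tighter than the [16-30] bucket shown on the slide.

```python
def generalize(cluster):
    """Replace every record in a cluster by the cluster-wide generalization.

    cluster maps a node name to (sex, age). Sex becomes "*" when the cluster
    mixes values; age becomes the (min, max) range over the cluster.
    """
    sexes = {sex for sex, _ in cluster.values()}
    ages = [age for _, age in cluster.values()]
    sex = next(iter(sexes)) if len(sexes) == 1 else "*"
    age_range = (min(ages), max(ages))
    return {name: (sex, age_range) for name in cluster}

# Hypothetical grouping of three of the slide's records into one cluster.
hike_club = {"D": ("male", 21), "E": ("female", 22), "F": ("female", 20)}
anonymized = generalize(hike_club)
print(anonymized)
```

After generalization every member of the cluster carries the same attribute values, so attributes alone no longer distinguish D, E, and F from one another.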

Section 4: Algorithms


Heuristic

[Figure: example graph with nodes A-I]

v(G)={D,C,E,F,B,A,I,G,H}
d(G)={4,4,2,2,2,1,1,1,1}

Goal: the number of nodes with the same degree should be more than 3.

Source: Liu et al. Towards Identity Anonymization on Graphs, SIGMOD 2008.


Heuristic

[Figure: example graph with nodes A-I]

v(G)={D,C,E,F,B,A,I,G,H}

Original: d(G)={4,4,2,2,2,1,1,1,1}

Goal: the number of nodes with the same degree should be more than 3.

Anonymized: d(G)={4,4,4,4,2,2,2,2,2}

Source: Liu et al. Towards Identity Anonymization on Graphs, SIGMOD 2008.


Heuristic

[Figure: example graph with nodes A-I]

d(G)={4,4,2,2,2,1,1,1,1}

Goal: the number of nodes with the same degree should be more than 3 (k=3).

DA(d[i,j]) = cost of the optimal solution that makes d[i], ..., d[j] satisfy the goal

I(d[i,j]) = cost of the optimal solution that gives d[i], ..., d[j] the same degree:

$$I(d[i,j]) = \sum_{\ell=i}^{j} \bigl(d(i) - d(\ell)\bigr)$$

For $i < 2k$:

$$DA(d[1,i]) = I(d[1,i])$$

For $i \ge 2k$:

$$DA(d[1,i]) = \min\Bigl\{ \min_{k \le t \le i-k} \bigl[ DA(d[1,t]) + I(d[t+1,i]) \bigr],\; I(d[1,i]) \Bigr\}$$
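The recurrence can be turned into a short dynamic program. The sketch below is an illustration, not the authors' code: it assumes the goal is read as "each degree value is shared by at least k nodes" (which is what the bounds k ≤ t ≤ i−k imply), it only anonymizes the degree sequence (building a graph that realizes the new sequence is a separate step), and the name `k_degree_anonymize` is introduced here.

```python
def k_degree_anonymize(degrees, k):
    """Return (anonymized degree sequence, total cost) using the DA/I recurrence."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    if n < k:
        raise ValueError("need at least k degrees")

    prefix = [0]
    for x in d:
        prefix.append(prefix[-1] + x)

    def cost(i, j):
        # I(d[i, j]) with 1-based inclusive indices: raise d[i..j] to d[i].
        return d[i - 1] * (j - i + 1) - (prefix[j] - prefix[i - 1])

    inf = float("inf")
    da = [inf] * (n + 1)      # da[i] = DA(d[1, i])
    split = [0] * (n + 1)     # split[i] = best t, or 0 for a single group
    for i in range(k, n + 1):
        da[i] = cost(1, i)
        # The range is empty while i < 2k, matching the first case of the recurrence.
        for t in range(k, i - k + 1):
            candidate = da[t] + cost(t + 1, i)
            if candidate < da[i]:
                da[i] = candidate
                split[i] = t

    # Walk the split points backwards and raise each group to its largest degree.
    result = d[:]
    i = n
    while i > 0:
        t = split[i]
        for pos in range(t, i):
            result[pos] = d[t]
        i = t
    return result, da[n]

anonymized, total_cost = k_degree_anonymize([4, 4, 2, 2, 2, 1, 1, 1, 1], k=3)
print(anonymized, total_cost)
```

Under this reading with k=3 the cheapest grouping is three groups of three. Other readings of the goal, or a heuristic rather than the optimal recurrence, can produce a different anonymized sequence.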


Weight privacy

In a weighted social network, a large weight probably implies a close personal relationship, which many people do not want made public.

Section 5: Case Study

Social Network Weight Privacy

In a social network, the frequency of communication (chats, messages, e-mails) between users can serve as edge weights. If you have several online friends or girlfriends, you probably do not want to disclose how often you communicate with each of them.

[Figure: a user linked to four contacts with communication-frequency weights 100, 20, 40, and 35]

Social Network in real life

In order to get his child enrolled in a good school, a parent wants to get connected with the Principal.

Goal: spend the minimum amount of money and get the maximum benefit.

[Figure: paths from Parents to Principal through Person 1, Person 2, ..., Person n, with edge costs ¥500, ¥100, ¥300, ¥200, ¥400, ¥500]

Data Privacy and Data Utility

In the following presentation:

• Data Privacy: the weights of all edges
• Data Utility: shortest paths and their lengths, chosen for their rich applications

Objectives:

• Perturb the weights as much as possible.
• Keep the shortest paths (and their lengths) as close to the original ones as possible.

Weight Privacy in Business

[Figure: supply-chain graph from New Supplier to Walmart through Agents A-E; edge weights in million dollars per month]

Find the cheapest supply chain from New Supplier to Walmart. This amounts to finding the shortest path.

Challenges

Theorem: There does NOT exist a perfect scheme that modifies all weights yet maintains all shortest paths (and their lengths). *

* For the formal proposition and mathematical proof, see Proposition 1 in our paper.

Challenges: Data utility (i.e., the shortest paths and their lengths) is a global property. Data privacy (i.e., individual weights) is local information.

How can we carefully change local weights without an unacceptable impact on shortest paths and lengths?

Gaussian Perturbation

[Figure: original supply-chain graph from New Supplier to Wal-Mart through Agents 1-5; edge weights in million/month]

$$w^*_{i,j} = w_{i,j}\,(1 - x_{i,j})$$

Here $x_{i,j}$ is a random number drawn from the Gaussian distribution $N(0, \sigma^2)$.
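A minimal sketch of this perturbation rule, assuming one independent draw per edge. The function name `gaussian_perturb`, the optional `seed` parameter, and the edge names in the example are introduced here for illustration; the weights in the dictionary are hypothetical.

```python
import random

def gaussian_perturb(weights, sigma, seed=None):
    """Return a new {edge: weight} dict with w* = w * (1 - x), x ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # a seed makes the perturbation reproducible
    return {edge: w * (1 - rng.gauss(0, sigma)) for edge, w in weights.items()}

# Hypothetical edge weights (unit: million/month).
weights = {
    ("New Supplier", "Agent 1"): 40,
    ("Agent 1", "Agent 2"): 10,
    ("Agent 2", "Wal-Mart"): 43,
    ("New Supplier", "Agent 3"): 60,
}

perturbed = gaussian_perturb(weights, sigma=0.15, seed=1)
for edge, w in perturbed.items():
    print(edge, round(w, 2))
```

Because each edge is perturbed independently, the method needs no knowledge of the global graph structure. A larger σ hides the weights better but moves path lengths further from their original values.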

[Figure: the same supply-chain graph after Gaussian perturbation; edge weights in million/month]

• Privacy: almost all weights are changed.
• Utility: the shortest path between New Supplier and Wal-Mart is unchanged, and its length is 99.

HOW TO MODIFY WEIGHTS AND KEEP SP

• Gaussian Perturbation

Along a path, some edge weights are increased and others decreased, so the total change over the whole path may be very close to zero.

About 95.4% of the $x_{i,j}$ lie within 2σ of zero and 99.7% within 3σ (one-sided: 97.7% lie below 2σ and 99.9% below 3σ).

Analysis on Gaussian perturbation

Claim 2: Let L be the length of a path in the original network and L* the length of the corresponding path in the perturbed network. For a given value of σ:

1. Approximately 68% of L satisfy [inequality missing in the transcript],

2. Approximately 98% of L satisfy [inequality missing in the transcript],

3. Approximately 99.7% of L satisfy [inequality missing in the transcript].

* For the formal theorem and corollary with their mathematical proofs, see Theorem 2 and Corollary 3 in our paper, respectively.
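The inequalities themselves are not preserved in the transcript. One way to see where bounds of this kind can come from is sketched below, assuming the $x_e$ of the edges on a path $P$ are drawn independently. This is a hedged reconstruction, not the paper's statement, and Theorem 2 there remains the authoritative form.

```latex
% Path P with non-negative edge weights w_e and independent x_e ~ N(0, \sigma^2)
L^* = \sum_{e \in P} w_e (1 - x_e) = L - \sum_{e \in P} w_e x_e,
\qquad
L^* - L \sim N\!\Bigl(0,\; \sigma^2 \sum_{e \in P} w_e^2\Bigr).

% Since \sqrt{\sum_e w_e^2} \le \sum_e w_e = L, the standard deviation of
% L^* - L is at most \sigma L. Hence, for c = 1, 2, 3:
\Pr\bigl[\, L(1 - c\sigma) \le L^* \le L(1 + c\sigma) \,\bigr]
\;\ge\; \Pr\bigl[\, |Z| \le c \,\bigr], \qquad Z \sim N(0, 1).
```

Under this reading the three cases correspond to c = 1, 2, 3, with lower bounds of roughly 68%, 95%, and 99.7% from the standard normal distribution.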


Analysis on Gaussian perturbation

Claim 3: Let $d_{i,j}$ be the length of the shortest path between node i and node j, and $d^{second}_{i,j}$ the length of the second shortest path between the same node pair.

For two given nodes i and j, if the ratio $\beta_{i,j} = (d^{second}_{i,j} - d_{i,j}) / d_{i,j}$ is greater than 2σ, the shortest path is very likely to be preserved after Gaussian perturbation. *

Recall Claim 2: approximately 98% of L satisfy [inequality missing in the transcript].

If the shortest path and the second shortest path differ greatly in length, the shortest path is very likely to be preserved after the perturbation.

* For the formal theorem and corollary with their mathematical proofs, see Theorem 2 and Corollary 3 in our paper, respectively.

An example

The shortest path between v1 and v6 has length 21.

The second shortest path has length 30.

Gaussian perturbation with σ = 0.15:

β1,6 = (30-21)/21 = 0.429 ≥ 2σ = 0.30, so the shortest path between v1 and v6 is very likely to be maintained after Gaussian perturbation.
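The check in this example is easy to script. The helper names `beta` and `likely_preserved` are introduced here for illustration; the test is the β > 2σ rule of thumb from the claim, which indicates a high probability rather than a guarantee.

```python
def beta(d_shortest, d_second):
    """Relative gap between the second shortest and the shortest path length."""
    return (d_second - d_shortest) / d_shortest

def likely_preserved(d_shortest, d_second, sigma):
    """Rule of thumb: the shortest path is likely kept when beta exceeds 2*sigma."""
    return beta(d_shortest, d_second) > 2 * sigma

# Values from the example: lengths 21 and 30, sigma = 0.15.
print(round(beta(21, 30), 3), likely_preserved(21, 30, sigma=0.15))
```

With the same two lengths, a larger σ such as 0.25 would push 2σ above β, and the rule of thumb would no longer apply.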


RESULTS WITH GAUSSIAN PERTURBATION

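[Figure: three panels showing results on the EIES data set for σ=0.1, σ=0.15, and σ=0.2]

At x = 0.15, for example, the dashed curve (length) is at 0.8699 and the solid curve (weight) is at 0.8565. This means that, under Gaussian perturbation, 85.65% of the perturbed weights $w^*_{i,j}$ fall within $w_{i,j}(1 \pm 0.15)$, and 86.99% of the perturbed shortest-path lengths $d^*_{i,j}$ fall within $d_{i,j}(1 \pm 0.15)$.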

Greedy Perturbation: Discussion

• Gaussian perturbation is quick and independent of the global structure, but it cannot always keep the same shortest paths, even when σ is not large.

• We propose a greedy perturbation that keeps the exact shortest paths and makes sure that their lengths stay close to the original ones.

Edge Categorization

[Figure: three copies of the graph V1-V6, highlighting the shortest paths p1,6, p4,6, and p3,6]

H={p1,6 , p4,6 , p3,6}.

Constraint: the shortest paths in H must not change after perturbation.

Relative to H, each edge is non-visited (on none of the shortest paths in H), partially-visited (on some of them but not all), or all-visited (on every one of them).

[Figure: the graph V1-V6 with edges marked as non-visited, partially-visited, or all-visited for the shortest paths p1,6, p4,6, and p3,6; edge weights 6, 9, 6, 7, 5, 13, 25, 10, 10]
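The three edge categories can be computed directly from the paths in H. The sketch below is illustrative only: the edge list and the three paths are hypothetical (the figure's topology is not recoverable from the transcript), and `categorize_edges` is a name introduced here.

```python
def categorize_edges(edges, paths):
    """Label each undirected edge by how many of the given paths traverse it."""
    def path_edges(path):
        return {frozenset(pair) for pair in zip(path, path[1:])}

    used = [path_edges(p) for p in paths]
    categories = {}
    for u, v in edges:
        hits = sum(frozenset((u, v)) in edge_set for edge_set in used)
        if hits == 0:
            categories[(u, v)] = "non-visited"
        elif hits == len(used):
            categories[(u, v)] = "all-visited"
        else:
            categories[(u, v)] = "partially-visited"
    return categories

# Hypothetical six-node graph and target paths H = {p1,6, p4,6, p3,6}.
edges = [("V1", "V2"), ("V3", "V2"), ("V2", "V5"), ("V4", "V5"), ("V5", "V6"),
         ("V1", "V3"), ("V3", "V4"), ("V2", "V4"), ("V4", "V6")]
H = [["V1", "V2", "V5", "V6"], ["V4", "V5", "V6"], ["V3", "V2", "V5", "V6"]]

categories = categorize_edges(edges, H)
for edge, label in categories.items():
    print(edge, label)
```

In this toy graph the edge (V5, V6) lies on all three paths, (V2, V5) on two of them, and (V1, V3) on none, so the three categories are all represented.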

Non-Visited Edge

Claim 3: For a non-visited edge, increasing its weight will NOT change any shortest path (or its length) in H. *

* For the formal statement, see Proposition 7.

[Figure: the graph V1-V6 in which the weight of a non-visited edge is increased from 7 to 10]

P1,6 (no change)

P4,6 (no change)

P3,6 (no change)

All-Visited Edge

Claim 4: For an all-visited edge, decreasing its weight will NOT change any shortest path in H, but it will decrease the lengths of the corresponding shortest paths. *

* For the formal statement, see Proposition 8.

[Figure: the graph V1-V6 in which the weight of an all-visited edge is decreased from 10 to 5]

P1,6 (no change)

P4,6 (no change)

P3,6 (no change)

Greedy Edge Perturbation (2)

Claim 5: For a partially-visited edge, if we want to increase its weight by t, we must guarantee that the shortest paths that go through it still go through it after the perturbation. *

* The constraints on the weight increase that guarantee this are given as Proposition 9 in our paper.

[Figure: the graph V1-V6 in which the weight of a partially-visited edge is increased from 5 to 16]

P1,6 (probably changes to $P^{-}_{1,6}$)

P4,6 (no change)

P3,6 (probably changes to $P^{-}_{3,6}$)

$P^{-}_{1,6}$ is the shortest path between V1 and V6 in $G^{-}$ (G with the edge between V2 and V5 deleted).

Constraint: the weight increment t should be smaller than the difference between $d_{i,j}$ and $d^{-}_{i,j}$.

Partially-Visited Edge

Claim 6: For a partially-visited edge, if we want to decrease its weight by t, we must guarantee that the shortest paths that do not go through it do not change after the perturbation. *

* The constraints on the weight decrease that guarantee this are given as Proposition 10 in our paper.

[Figure: the graph V1-V6 in which the weight of a partially-visited edge is decreased from 5 to 2]

P1,6 (no change)

P4,6 (probably changes to $P^{+}_{4,6}$)

P3,6 (no change)

$P^{+}_{4,6}$ is the shortest path between V4 and V6 that goes through the edge (V2, V5).

Constraint: the weight decrement t should be smaller than the difference between $d^{+}_{i,j}$ and $d_{i,j}$; otherwise the path through the edge would become the shortest one.

Greedy Algorithm

1. Increase the weights of non-visited edges and decrease the weights of all-visited edges.

2. Sort all partially-visited edges in descending order of the number of shortest paths going through them.

3. For a given partially-visited edge, whether to increase or decrease its weight depends on the comparison between the real length and the current (perturbed) length.

4. For a given partially-visited edge, the modified value is chosen as the boundary value of the constraint inequalities.

* For the detailed algorithm, please refer to Algorithm 1 in our paper.
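Step 1 of the list above can be sketched with a small Dijkstra routine that checks the shortest paths in H really stay the same. This covers only step 1, not Algorithm 1 of the paper. The graph, its weights, the fixed amounts `up` and `down`, and the function names are all assumptions made for illustration.

```python
import heapq

def shortest_path(weights, source, target):
    """Dijkstra on an undirected graph {(u, v): w}; target must be reachable."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, prev, heap = {source: 0}, {}, [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

def greedy_step_one(weights, pairs, up=3, down=5):
    """Raise non-visited edges by `up`, lower all-visited edges by `down`.

    Assumes `down` is smaller than every all-visited weight.
    """
    paths = [shortest_path(weights, s, t)[0] for s, t in pairs]
    used = [{frozenset(e) for e in zip(p, p[1:])} for p in paths]
    perturbed = dict(weights)
    for (u, v), w in weights.items():
        hits = sum(frozenset((u, v)) in edge_set for edge_set in used)
        if hits == 0:
            perturbed[(u, v)] = w + up        # non-visited edge
        elif hits == len(used):
            perturbed[(u, v)] = w - down      # all-visited edge
    return paths, perturbed

# Hypothetical weighted graph; H holds the targeted node pairs.
weights = {("V1", "V2"): 6, ("V3", "V2"): 6, ("V2", "V5"): 5, ("V4", "V5"): 7,
           ("V5", "V6"): 10, ("V1", "V3"): 9, ("V3", "V4"): 10,
           ("V2", "V4"): 13, ("V4", "V6"): 25}
H = [("V1", "V6"), ("V4", "V6"), ("V3", "V6")]

paths, perturbed = greedy_step_one(weights, H)
new_paths = [shortest_path(perturbed, s, t)[0] for s, t in H]
print(paths == new_paths)
```

Partially-visited edges are left untouched here; steps 2 to 4 handle them one at a time under the constraints stated in the claims for partially-visited edges.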


RESULTS WITH GREEDY PERTURBATION (1)

For example, at x = 0.15, the dashed curve (length) is at 60% and the solid curve (weight) is at 54%. This means that, after the greedy perturbation, 54% of the perturbed edge weights $w^*_{i,j}$ fall within $w_{i,j}(1 \pm 0.15)$ and 60% of the perturbed shortest-path lengths $d^*_{i,j}$ fall within $d_{i,j}(1 \pm 0.15)$, while the shortest paths of all targeted pairs in H are exactly preserved.

RESULTS WITH GREEDY PERTURBATION (2)

RESULTS WITH GREEDY PERTURBATION (3)

Discussion on Experiments

Method | Data Utility | Data Privacy
Gaussian Perturbation | Lengths of the shortest paths are better preserved, but the exact shortest paths are not guaranteed to be maintained. | Low
Greedy Algorithm | Lengths are less well preserved than with Gaussian perturbation, but the shortest paths are exactly maintained. | High

Case Study Remarks

(What do we want to do?) Preserve weight privacy and shortest-path utility.

(Why do we want to do it?) Weights in some social settings are sensitive and confidential.

(How do we do it?) Gaussian perturbation and greedy perturbation are proposed to balance data utility and data privacy under different conditions.

(Is what we do applicable?) The two strategies appear to meet our goals.

Conclusion

• Social networks and social network research are promising.
• Privacy issues in social network analysis deserve emphasis.
• Social network privacy preservation, data utility, and social network analysis algorithms need further research.

Section 6: Conclusion


Funding Agencies and Student Researchers

• US National Science Foundation
• Kentucky Science and Engineering Foundation
• US National Institutes of Health

Privacy-Preserving Social Network with Sensitive Information

Contact Information:

Dr. Jun Zhang (张骏)

E-mail: jzhang@cs.uky.edu

jzhang111@msn.com (for correspondence in Chinese)

http://www.cs.uky.edu/~jzhang

Phone: 13540021323 (mobile, China)