Ph.D. defense: semantic social network analysis

SEMANTIC SOCIAL NETWORK ANALYSIS
Guillaume Erétéo
Ph.D. Thesis defense
supervisors:
Michel Buffa, Kewi/I3S, UNSA/CNRS
Fabien Gandon, Edelweiss, INRIA Sophia Antipolis
Patrick Grohan, Orange Labs

description

This thesis helps analyze the characteristics of the heterogeneous social networks that emerge from the use of web-based social applications, with an original contribution that combines Social Network Analysis with Semantic Web frameworks. Social Network Analysis (SNA) provides graph algorithms to characterize the structure of a social network and its strategic positions. Semantic Web frameworks make it possible to represent and exchange knowledge across web applications with a rich typed graph model (RDF), a query language (SPARQL) and schema definition frameworks (RDFS and OWL). In this thesis, we merge both models in order to go beyond the mining of the flat link structure of social graphs by integrating a semantic processing of the network typing and the emerging knowledge of online activities. In particular, we investigate how (1) to bring online social data to ontology-based representations, (2) to conduct a social network analysis that takes advantage of the rich semantics of such representations, and (3) to semantically detect and label communities of online social networks and social tagging activities. These results are implemented in Mnemotix technologies (http://twitter.com/mnemotix).


OUTLINE

1.  Context and Scientific Objectives
2.  State of the Art on Social Network Analysis & Semantic Social Networks
3.  SemSNA: Analysing Social Networks with Semantic Web Frameworks
4.  Community Detection: SemTagP, Semantic Tag Propagation

CONTEXT ISICIL: Information Semantic Integration through Communities of Intelligence onLine

•  enterprise 2.0
•  semantic web
•  business intelligence

•  multidisciplinary: ergonomists, sociologists, mathematicians, ontologists, computer scientists

•  ANR-08-CORD-011

SEMANTIC INTRANET OF PEOPLE

"the use of emergent social software platforms within companies, or between companies and their partners or customers" [McAfee 2006]

represent, exchange and analyse data across applications to deliver information in a way that matters to people and to their communities. [Berners-Lee et al., 2001]

SCIENTIFIC OBJECTIVES extend social network analysis with semantic formalisms to reveal and exploit the rich social structures embedded in the emerging social data of web 2.0 applications:

•  how to represent, link and access online social networks across applications?

•  how to enable classical operators of social network analysis to consider the semantics of these networks?

•  how can these semantics be exploited to create new algorithms?


SOCIAL NETWORK ANALYSIS graph algorithms to characterize the structure of a social network, strategic positions/actors, and the distribution of networking activities.

applications:

•  monitor information flow
•  foster communication
•  focus notifications in information systems
•  create project teams
•  identify experts

SOCIAL NETWORKS AND GRAPHS actors are represented by nodes and relations by edges: G = (V, E), n = |V|, m = |E|

[Figure: example graph with numeric labels and typed relations: collaborate, colleague, manages, follows, sameInterest]

NETWORK STRUCTURE e.g. density and diameter highlight cohesion of the network

$\mathrm{diam}(G) = \mathrm{length}(g(e_1, e_2))$ such that $\forall e_3, e_4 \in E_G,\ \mathrm{length}(g(e_3, e_4)) \le \mathrm{length}(g(e_1, e_2))$

[Scott 2000]

[Zachary 1977]
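As a minimal sketch of these two indicators, the Python below computes the density and the diameter of a small undirected graph. The example graph, the helper names and the density formula 2m / (n(n-1)) (a usual definition for simple undirected graphs, not shown on the slide) are assumptions made for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def density(adj):
    """Share of the n(n-1)/2 possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(neigh) for neigh in adj.values()) // 2
    return 2 * m / (n * (n - 1))

def diameter(adj):
    """Length of the longest geodesic (shortest path) in the graph."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)

# hypothetical path graph a - b - c - d
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(density(adj), diameter(adj))
```

A sparse chain like this one has a low density and a diameter equal to its length, which is the kind of weak cohesion the slide refers to.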

STRATEGIC POSITIONS & ACTORS degrees reveal local popularities

[Shaw 1954]

STRATEGIC POSITIONS & ACTORS directed degree differentiates support and influence

$D_{in}(y) = \{\, x ;\ \exists (x, y) \in E \,\}$ [Nieminen 1973]

STRATEGIC POSITIONS & ACTORS n-degree for variable neighborhood

[Garrison 1960] [Pitts 1965]

STRATEGIC POSITIONS & ACTORS betweenness centrality reveals intermediaries & brokers

[Freeman 1977]

highly strategic position in communication [Shimbel 1953] [Cohn & Marriott 1958] [Burt 1992]
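To make the notion of intermediary concrete, here is a small sketch of Freeman's betweenness: for every pair (from, to), it adds the share of geodesics between them that go through b. The undirected example graph and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def geodesic_counts(adj, source):
    """BFS returning (distance, number of shortest paths) from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                sigma[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                sigma[nxt] += sigma[node]
    return dist, sigma

def betweenness(adj, b):
    """Sum over unordered pairs of the fraction of geodesics going through b."""
    info = {v: geodesic_counts(adj, v) for v in adj}
    dist_b, sigma_b = info[b]
    total = 0.0
    for s, t in combinations([v for v in adj if v != b], 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s or b not in dist_s or t not in dist_b:
            continue  # s, t and b are not all connected
        if dist_s[b] + dist_b[t] == dist_s[t]:  # b lies on a geodesic
            total += sigma_s[b] * sigma_b[t] / sigma_s[t]
    return total

# hypothetical 4-cycle a - b - c - d - a
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(betweenness(square, "b"))
```

In the 4-cycle, only the pair (a, c) has geodesics through b, and b carries one of the two, so b is a broker for half of that communication.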

STRATEGIC POSITIONS & ACTORS closeness centrality measures reachability

[Leavitt 1951]
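A short sketch of closeness follows. The slide's formula is not available in this transcript, so the code assumes one common definition, the inverse of the sum of geodesic distances from a node to the nodes it can reach; the graph and names are hypothetical.

```python
from collections import deque

def closeness(adj, node):
    """Inverse of the sum of geodesic distances from node (assumed definition)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in adj[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return 1 / total if total else 0.0

# hypothetical path graph a - b - c
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(closeness(adj, "a"), closeness(adj, "b"))
```

The middle node b reaches everyone in one hop, so its closeness is higher than that of the end nodes.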

ONLINE SOCIAL DATA ARE MORE COMPLEX TO REPRESENT multiple & spread roles, context, profile, etc. distributed across applications


LINK STRUCTURE IS NOT ENOUGH who has the best betweenness centrality?

[Figure: network with edges labelled: knows in passing, has met, works with, has supervisor]

SEMANTICS MATTER! how can we consider different types of relations?

[Figure: same network with edges labelled: knows in passing, has met, works with, has supervisor]

RESOURCE DESCRIPTION FRAMEWORK make assertions and describe resources with triples (subject, predicate, object) like "the subject, verb and object of an elementary sentence" [Berners-Lee 2001]

ONTOLOGY

"a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application"

[Gruber 1993] [Gruber 2009]

RESOURCE DESCRIPTION FRAMEWORK SCHEMA

set of primitives to define the classes of a domain of knowledge, taxonomical relations, and the classes of resources to which properties apply

SPARQL PROTOCOL AND RDF QUERY LANGUAGE query language, protocol and format to send queries and exchange results across the web


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name
WHERE {
  ?person rdf:type foaf:Agent .
  ?person foaf:firstName ?name .
}

CLASSIC SNA ON SEMANTIC WEB rich graph representations reduced to simple un-typed graphs for analysis

[Paolillo & Wright 2006]

foaf:knows

foaf:interest

[San Martin & Gutierrez 2009]


d(guillaume) = 5

[Figure: guillaume linked to Gérard, Fabien, Mylène, Michel and Yvonne; edge label: coworker]

d<family>(guillaume) = ?

[Figure: same network; edge label: coworker]

d<family>(guillaume) = 3

[Figure: same network, with a hierarchy of relation types: knows, colleague, parent, sibling, mother, father, brother, sister; edge label: coworker]
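The idea behind d<family>(guillaume) = 3 can be sketched as a degree that only counts relations of a given type or of one of its subtypes. The hierarchy layout (family above parent and sibling, knows at the top), the edge list and the function names below are assumptions chosen to reproduce the values 5 and 3 shown on the slides; they are not taken from the thesis data.

```python
# hypothetical property hierarchy: property -> direct super-property
SUB_PROPERTY_OF = {
    "colleague": "knows", "family": "knows",
    "parent": "family", "sibling": "family",
    "mother": "parent", "father": "parent",
    "brother": "sibling", "sister": "sibling",
}

def is_subtype(prop, rel):
    """True if prop is rel or a (transitive) sub-property of rel."""
    while prop is not None:
        if prop == rel:
            return True
        prop = SUB_PROPERTY_OF.get(prop)
    return False

# hypothetical typed edges (subject, property, object)
EDGES = [
    ("guillaume", "father", "Gérard"),
    ("guillaume", "mother", "Mylène"),
    ("guillaume", "sister", "Yvonne"),
    ("guillaume", "colleague", "Fabien"),
    ("guillaume", "colleague", "Michel"),
]

def degree(node, rel):
    """Number of edges touching node whose property is rel or a subtype."""
    return sum(1 for s, p, o in EDGES
               if node in (s, o) and is_subtype(p, rel))

print(degree("guillaume", "knows"), degree("guillaume", "family"))
```

Asking for `knows` falls back to the untyped degree, while asking for `family` keeps only the father, mother and sister links.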

columns: directed network | weighted network | labelled network | parameterized operators | network size

Graph Theory: ✔ ✔ ✔ | 10^6 nodes, 10^7 edges
[Brandes 2009]: ✔ ✔ ✔ | 10^4 nodes
[Paolillo & Wright 2006]: ✔ ✔ | ~10^4 nodes, ~10^5 edges
[San Martin & Gutierrez 2009]: ✔ ✔ | ~10^4 nodes, ~10^5 edges


SEMANTIC SNA FRAMEWORK exploit the semantics of social networks and parameterize SNA operators

•  parameterized SNA operators
•  SPARQL formalization of operators
•  SemSNA ontology: annotate social data with results of analyses

PARAMETERIZED DENSITY proportion of the maximum possible number of properties of type <rel> (or subtype)

•  number of actors of a given type (or subtype)
•  number of pairs of resources linked by a property of type <rel> (or subtype)
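The two counts above can be combined into a density as in the following sketch. The slide's formula is missing from this transcript, so the denominator used here (every ordered pair made of one actor of the domain type and one actor of the range type) is an assumption, as are the data and the function names; subtypes of actor types are left out for brevity.

```python
# hypothetical data: actor types, property hierarchy and typed edges
ACTOR_TYPE = {"alice": "Person", "bob": "Person", "acme": "Organization"}
SUB_PROPERTY_OF = {"friend": "knows", "family": "knows"}
EDGES = [
    ("alice", "friend", "bob"),
    ("bob", "family", "alice"),
    ("alice", "memberOf", "acme"),
]

def is_subtype(prop, rel):
    """True if prop is rel or a (transitive) sub-property of rel."""
    while prop is not None:
        if prop == rel:
            return True
        prop = SUB_PROPERTY_OF.get(prop)
    return False

def nb_actors(actor_type):
    """Number of actors of a given type."""
    return sum(1 for t in ACTOR_TYPE.values() if t == actor_type)

def nb_pairs(rel):
    """Number of distinct pairs linked by a property of type rel (or subtype)."""
    return len({(s, o) for s, p, o in EDGES if is_subtype(p, rel)})

def density(rel, domain_type, range_type):
    # assumed maximum: all ordered (domain actor, range actor) pairs
    return nb_pairs(rel) / (nb_actors(domain_type) * nb_actors(range_type))

print(density("knows", "Person", "Person"), density("friend", "Person", "Person"))
```

Because `friend` and `family` are both sub-properties of `knows` in this toy hierarchy, the `knows` density is higher than the density of either subtype alone.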

PARAMETERIZED N-DEGREE number of paths of properties of type <rel> (or subtype) having y at one end and with a length smaller than or equal to dist

parameterized path: a list of nodes of a graph G each linked to the next by a relation of type <rel> (or subtype)

PARAMETERIZED DIAMETER length of the longest geodesic in the network for a property of type <rel> (or subtype)

geodesic: a shortest path between two resources for a given relation of type <rel> (or subtype)
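The parameterized path, n-degree and diameter can be sketched by first filtering the graph on the relation type and then running a breadth-first search. This is a simplified reading: edges are traversed in both directions and the n-degree counts the nodes reachable within dist rather than the paths themselves. The hierarchy, edges and names are hypothetical.

```python
from collections import deque

# hypothetical hierarchy (property -> super-property) and typed edges
SUB_PROPERTY_OF = {"colleague": "knows", "friend": "knows"}
EDGES = [("a", "colleague", "b"), ("b", "colleague", "c"), ("c", "friend", "d")]

def is_subtype(prop, rel):
    while prop is not None:
        if prop == rel:
            return True
        prop = SUB_PROPERTY_OF.get(prop)
    return False

def adjacency(rel):
    """Undirected adjacency restricted to properties of type rel (or subtype)."""
    adj = {}
    for s, p, o in EDGES:
        if is_subtype(p, rel):
            adj.setdefault(s, set()).add(o)
            adj.setdefault(o, set()).add(s)
    return adj

def distances(adj, source):
    """Geodesic lengths from source along the filtered edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def n_degree(node, rel, dist):
    """Nodes reachable from node in at most dist steps of type rel."""
    adj = adjacency(rel)
    if node not in adj:
        return 0
    return sum(1 for d in distances(adj, node).values() if 0 < d <= dist)

def diameter(rel):
    """Longest geodesic for relations of type rel (or subtype)."""
    adj = adjacency(rel)
    return max((max(distances(adj, v).values()) for v in adj), default=0)

print(diameter("colleague"), diameter("knows"), n_degree("a", "knows", 2))
```

Restricting the analysis to `colleague` shortens the longest geodesic, because the `friend` link to d is ignored.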

SPARQL FORMALIZATION OF PARAMETERIZED OPERATORS

•  SPARQL is designed to query RDF data
•  CORESE: semantic search engine implementing semantic web languages using graph-based representations
   •  automatic processing of semantic inference (e.g. subsumption)
   •  graph querying extension (e.g. paths)

[Corby et al 2004] [Corby 2008]

SPARQL FORMALIZATION parameterized density

SELECT merge count(?x) as ?nbactor
WHERE {
  ?x rdf:type param[type]
}

SELECT cardinality(?p) as ?card
WHERE {
  { ?p rdf:type rdf:Property
    filter(?p ^ param[rel]) }
  UNION
  { ?p rdfs:subPropertyOf ?parent
    filter(?parent ^ param[rel]) }
}

SPARQL FORMALIZATION parameterized n-degree

SELECT ?y count(?x) as ?degree
WHERE {
  { ?x (param[rel])*::$path ?y
    filter(pathLength($path) <= param[dist]) }
  UNION
  { ?y param[rel]::$path ?x
    filter(pathLength($path) <= param[dist]) }
}
GROUP BY ?y

SPARQL FORMALIZATION parameterized diameter

SELECT pathLength($path) as ?length
WHERE {
  ?y s (param[rel])*::$path ?to
}
ORDER BY desc(?length)
LIMIT 1

[Table: SPARQL formalizations of further operators: component, degree, in-degree, diameter, number of geodesics between from and to, number of geodesics between from and to going through b, closeness centrality, betweenness centrality]

Ipernity

ANALYSED DATASET ipernity.com dataset extracted in RDF:

61 937 actors & 494 510 relationships:

–  18 771 family links between 8 047 actors
–  136 311 friend links involving 17 441 actors
–  339 428 favorite links for 61 425 actors
–  2 874 170 comments from 7 627 actors
–  795 949 messages exchanged by 22 500 actors

INTERPRETATIONS OF RESULTS validated with managers of ipernity.com

•  friendOf, favorite, message, comment: small diameter, high density
•  family, as expected: large diameter, low density
•  favorite: highly centralized around the Ipernity animator
•  friendOf, family, message, comment: power law of degrees and betweenness centralities, different strategic actors
•  knows: analyze all relations using subsumption

PERFORMANCES & LIMITS

relation | time | projections
Knows | 0.71 s | 494 510
Favorite | 0.64 s | 339 428
Friend | 0.31 s | 136 311
Family | 0.03 s | 18 771
Message | 1.98 s | 795 949
Comment | 9.67 s | 2 874 170

relation | time | projections
Knows | 20.59 s | 989 020
Favorite | 18.73 s | 678 856
Friend | 1.31 s | 272 622
Family | 0.42 s | 37 542
Message | 16.03 s | 1 591 898
Comment | 28.98 s | 5 748 340

Shortest paths used to calculate

relation | path length | time | projections
Knows | <= 2 | 14m 50.69s | 100 000
Knows | <= 2 | 2h 56m 34.13s | 1 000 000
Knows | <= 2 | 7h 19m 15.18s | 2 000 000
Favorite | <= 2 | 5h 33m 18.43s | 2 000 000
Friend | <= 2 | 1m 12.18s | 1 000 000
Friend | <= 2 | 2m 7.98s | 2 000 000
Family | <= 2 | 27.23s | 1 000 000
Family | <= 2 | 2m 9.73s | 3 681 626
Family | <= 3 | 1m 10.71s | 1 000 000
Family | <= 4 | 1m 9.06s | 1 000 000

SEMSNA SCHEMA annotating the networks with analysis results

[Figure: network annotated with analysis results, e.g. high centrality]

SEMSNA AN ONTOLOGY OF SNA http://ns.inria.fr/semsna/2009/06/21/voc

SemSNA CORE

[Figure: example network (Guillaume, Gérard, Fabien, Mylène, Michel, Yvonne, Ivan, Peter, Philippe) with relations colleague, supervisor, mother and father, annotated with a Degree resource; labels: isDefinedForProperty colleague, hasCentralityDistance 2, value 4]

columns: directed networks | weighted networks | labelled network | parameterized operators | network size

Graph Theory: ✔ ✔ ✔ | 10^6 nodes, 10^7 edges
[Brandes 2009]: ✔ ✔ ✔ | 10^4 nodes
[Paolillo & Wright 2006]: ✔ ✔ | ~10^4 nodes, ~10^5 edges
[San Martin & Gutierrez 2009]: ✔ ✔ | ~10^4 nodes, ~10^4 - 10^5 edges
SEMSNA: ✔ … ✔ ✔ | 10^4 nodes, ~10^5 edges

SEMSNA: CONCLUSION

•  the directed typed graph structure of RDF/S is well suited to represent social knowledge & socially produced metadata across applications and networks

•  parameterized SNA operators & SPARQL formalization enable us to exploit the diversity and the semantic structure of social data

•  the SemSNA ontology organizes and structures social data


DISTRIBUTION OF ACTIVITIES? e.g. ADEME's Ph.D. thesis funding and collaborations

COMMUNITY DETECTION helps understand the distribution of actors and activities in a social network

STATE-OF-THE-ART ALGORITHMS strategy: mine the link structure in order to detect densely connected groups of actors

HIERARCHICAL ALGORITHMS output a dendrogram: a hierarchical tree of denser and denser communities from top to bottom.

•  agglomerative algorithms start from the leaves, and group nodes in larger and larger communities: [Donetti & Munoz 2004] [Zhou & Lipowsky 2004] [Xu et al 2007] [Newman 2004]

•  divisive algorithms start from the root of the tree, and split nodes into denser and denser communities: [Girvan & Newman 2002] [Radicchi et al 2004]

HEURISTIC-BASED ALGORITHMS heuristics related to the community structure of networks and to community characteristics:

•  similarity with electrical networks [Wu 2004]

•  random walk [Dongen 2000] [Pons et al 2005]

•  label propagation [Raghavan et al 2007]

MODULARITY MEASURES COMMUNITY PARTITION QUALITY fraction of the edges that fall within communities minus the expected such fraction if edges were distributed at random [Newman 2004]

$Q = \frac{1}{m} \sum_{i, j \in V,\ c_i = c_j} \left[ A_{ij} - \frac{d_{<i>} d_{<j>}}{m} \right]$

With:
•  m the number of edges of the network
•  d<i> the degree of vertex i
•  Aij the number of edges between i and j
•  ci the community of i
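The modularity formula can be evaluated directly on a small graph, as in this sketch. It reads the network as directed (as RDF graphs are), taking d<i> as the out-degree of i and d<j> as the in-degree of j; this reading, the example graph and the function name are assumptions made for illustration.

```python
from collections import Counter

def modularity(edges, community):
    """Q = (1/m) * sum over same-community pairs (i, j) of [A_ij - d_i * d_j / m]."""
    m = len(edges)
    a = Counter(edges)                      # A_ij: number of edges from i to j
    d_out = Counter(i for i, _ in edges)    # assumed reading of d<i>
    d_in = Counter(j for _, j in edges)     # assumed reading of d<j>
    q = 0.0
    for i in community:
        for j in community:
            if community[i] == community[j]:
                q += a[(i, j)] - d_out[i] * d_in[j] / m
    return q / m

# hypothetical graph: two directed triangles with no link between them
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
print(modularity(edges, split), modularity(edges, merged))
```

Putting each triangle in its own community scores higher than merging everything into one community, whose modularity is zero because the observed and expected fractions cancel out.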

LABEL PROPAGATION / RAK [Raghavan et al 2007]

(1) assign a unique random label to each node.
(2) each node n replaces its label by the label most used by its neighbours.
(3) if at least one node changed its label, go to step 2.
(4) else nodes that share the same label form a community.

opportunity: replace random labels by tags in order to exploit not only the link structure but also the semantics of actors' vocabulary!
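The four steps can be sketched as follows. To keep the example reproducible, the sketch departs from the slide in two labelled ways: node names serve as the initial unique labels instead of random ones, and ties are broken by taking the largest label instead of a random one. The graph and function name are hypothetical.

```python
from collections import Counter

def label_propagation(adj, max_rounds=100):
    labels = {node: node for node in adj}            # (1) one unique label per node
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):                     # (2) update nodes one by one
            counts = Counter(labels[n] for n in adj[node])
            if not counts:
                continue                             # isolated node keeps its label
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[node] not in candidates:
                labels[node] = max(candidates)       # deterministic tie-break (assumption)
                changed = True
        if not changed:                              # (3) loop until no label changes
            break
    groups = {}
    for node, lab in labels.items():                 # (4) same label = same community
        groups.setdefault(lab, []).append(node)
    return sorted(sorted(group) for group in groups.values())

# hypothetical graph: two triangles joined by the edge c - d
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(label_propagation(adj))
```

The `max_rounds` guard is only a safety net; on this graph the labels stabilise after two rounds.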

FOLKSONOMIES each tag may represent a community of interest

[Figure: social tagging, from a flat folksonomy to a thesaurus; tags: pollution, pollutions du sol, polluant, énergie; relations: has narrower, related]

[Limpens 2010]

TAG PROPAGATION exploit folksonomy for label assignment

[Figure: network of nodes a to g, labelled with the tags wiki, sweetwiki, isicil, inria and mediawiki]

"interaction creates similarity, while similarity creates interaction" [Mika 2005]

TAG PROPAGATION wiki: 1, sweetwiki: 1, mediawiki: 1

[Figure: same network of nodes a to g, labelled with the tags wiki, sweetwiki, isicil, inria and mediawiki]

SEMANTIC TAG PROPAGATION wiki: 3, sweetwiki: 1, mediawiki: 1

[Figure: same network of nodes a to g; tag hierarchy: wiki skos:narrower sweetwiki, wiki skos:narrower mediawiki]
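The jump from wiki: 1 to wiki: 3 can be reproduced with a small counting sketch: each occurrence of a narrower tag also counts for its broader tag. The one-level treatment of skos:narrower, the dictionary layout and the function name are simplifying assumptions.

```python
from collections import Counter

# skos:narrower links of the example: wiki has narrower sweetwiki and mediawiki
NARROWER = {"wiki": {"sweetwiki", "mediawiki"}}

def semantic_tag_counts(neighbor_tags):
    """Count neighbours' tags, crediting broader tags with their narrower tags."""
    counts = Counter(neighbor_tags)
    semantic = Counter(counts)
    for broad, narrow_tags in NARROWER.items():
        extra = sum(counts[tag] for tag in narrow_tags)
        if extra:
            semantic[broad] += extra
    return semantic

tags = ["wiki", "sweetwiki", "mediawiki"]
print(Counter(tags))
print(semantic_tag_counts(tags))
```

With plain counting the three tags tie, whereas the semantic count lets `wiki` win and become the propagated label.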

SEMANTIC TAG PROPAGATION 2 communities labelled with wiki & isicil

[Figure: nodes a to g split into two communities, one labelled wiki and one labelled isicil; tag hierarchy: wiki skos:narrower sweetwiki, wiki skos:narrower mediawiki]

ALGORITHM SEMTAGP

Algorithm SemTagP(RDFGraph network, Type relationType)
  DO
    old_network = copy(network)
    // propagate tags (i.e. compute new partitions)
    FOREACH user IN network.users
      user.tag = mostUsedNeighborTag(user, relationType)
    END FOREACH
  WHILE modularity(network) > modularity(old_network)
  RETURN old_network

PARAMETERIZED SPARQL QUERY delegate all the semantic processing to a semantic graph engine to exploit semantic relations between tags and to parameterize the analyzed relation

SELECT ?user ?tag ?y
WHERE {
  ?user param[rel] ?neighbour .
  {
    { ?neighbour scot:hasTag ?tag }
    UNION
    { ?neighbour scot:hasTag ?tag2 .
      ?tag skos:narrower ?tag2
      filter(exists{ ?x scot:hasTag ?tag }) }
  }
}
ORDER BY ?user ?tag

PROBLEM « bad » generalizations

•  ubiquitous tags
•  too broad tags
•  semantic errors

[Figure: example tag: environment]

SOLUTION user control to disable semantic relations with given tags, which strengthens other narrower tags

[Figure: example tag: nanotechnology]

APPLIED TO ADEME PH.D. NETWORK

•  1,853 agents
   •  1,597 academic supervisors
   •  256 ADEME engineers
•  13,982 relationships
   •  10,246 rel:worksWith
   •  3,736 rel:colleagueOf
•  6,583 tags
•  3,570 skos:narrower relations between 2,785 tags

MODULARITY COMPARISONS X axis: propagation iterations, Y axis: modularity


MODULARITY LIMITS

•  "the 'optimal partition', imposed by mathematics, does not necessarily capture the actual community structure of the network", confirmed by experiments

•  modularity optimization might miss important substructures when:
   •  modules are very fuzzy
   •  modules have more than $\sqrt{2m}$ edges (which is the case for half of ADEME's detected communities)

•  perspectives: measuring the average quality of each community

[Fortunato & Barthélemy 2007]

RESULT

1. pollution
2. sustainable development
3. energy
4. chemistry
5. air pollution
6. metals
7. biomass
8. wastes

[Figure legend: • engineer • supervisor • community; node size = degree]

« POLLUTION » AREA

SEMTAGP: CONCLUSION

•  SemTagP: semantic community detection and controlled labelling

•  applied to reveal the distribution of ADEME Ph.D. funding

•  many perspectives to integrate more semantics:
   •  investigate other semantics, e.g. skos:related, skos:closeMatch
   •  propagate tags through different types of relations
   •  propagate multiple tags and detect overlapping communities

CONCLUSION

CONTRIBUTIONS

• leveraging online social networks to ontology-based representations

• extending social network analysis to ontology-based representations

• semantic community detection and labelling


PERSPECTIVES

•  scaling to large networks: sampling, parallel, iterative algorithms

•  considering temporal data in the analysis: representing and analysing temporal data

•  enriching social activities with SemSNA results: better management of resources and relationships

QUESTIONS

International conference
•  Erétéo, G., Gandon, F., Corby, O., Buffa, M., "Analysis of a Real Online Social Network Using Semantic Web Frameworks". ISWC2009, Washington D.C., USA.
•  Erétéo, G., Gandon, F., Corby, O., Buffa, M., "Semantic Social Network Analysis". Web Science 2009, Athens, Greece.

Book chapter
•  Erétéo, G., Buffa, M., Gandon, F., Leitzelman, M., Limpens, F., Sander, P., "Semantic Social Network Analysis, a concrete case". Handbook of Research on Methods and Techniques for Studying Virtual Communities: Paradigms and Phenomena. A book edited by Ben Kei Daniel, IGI Global 2011.

National conference
•  Leitzelman, M., Erétéo, G., Grohan, P., Herledan, F., Buffa, M., Gandon, F., "De l'utilité d'un outil de veille d'entreprise de seconde génération". Poster in IC2009, Hammamet, Tunisia.

Workshop
•  Erétéo, G., Buffa, M., Gandon, F., Leitzelman, M., Limpens, F., "Leveraging Social data with Semantics", W3C Workshop on the Future of Social Networking, Barcelona, Spain.
•  Erétéo, G., Buffa, M., Gandon, F., Grohan, P., Leitzelman, M., Sander, P., "A State of the Art on Social Network Analysis and its Applications on a Semantic Web", SDoW2008, Karlsruhe, Germany.