Mining and Supporting Community Structures in Sensor Network Research

UCLA | LANL CENTER FOR EMBEDDED NETWORKED SENSING Mining and supporting community structures in sensor network research Alberto Pepe (University of California at Los Angeles) Marko A. Rodriguez (Los Alamos National Laboratory) CENS Friday Seminar | May 2, 2008


Transcript of Mining and Supporting Community Structures in Sensor Network Research

  • Mining and supporting community structures in sensor network research Alberto Pepe (University of California at Los Angeles) Marko A. Rodriguez (Los Alamos National Laboratory) CENS Friday Seminar | May 2, 2008
  • Outline.
    • Studying Collaboration at CENS
      • Introduction to Data Practices
      • Detection of Structural Communities
      • Data Set and Methods
      • Results
    • Supporting Collaboration at CENS
      • Introduction to the Semantic Web
      • Semantic Networks and Graph Databases
      • Analyzing Semantic Networks
      • Demo
    [Speaker labels on the slide: Alberto, Marko]
  • Data practices group.
    • Background research questions:
      • What are CENS data?
      • What context data is necessary to support interpretation during re-use?
      • How can we automate the capture of context data?
      • How can we link scholarly and scientific data into meaningful aggregations/chains?
      • What are the social and academic settings that yield the production of scientific and engineering data/knowledge?
  • Current study.
    • Question: how do collaboration communities differ from socioacademic communities?
    • Method: comparative analysis of the coauthorship network's community structure and selected socioacademic community structures (e.g., academic department, affiliation, country of origin, academic position)
    Rodriguez, M.A., Pepe, A., On the relationship between the structural and socioacademic communities of a coauthorship network, Journal of Informetrics, in press, 2008.
  • Steps of the study.
    • Gather bibliographic and socioacademic data.
    • Generate coauthorship network.
    • Determine structural communities in the coauthorship network.
    • Test for statistical independence between the structural and socioacademic communities.
  • Gather data.
    • Population data:
      • Collected from eScholarship repository
      • 291 CENS and non-CENS authors
      • Multi-institutional and interdisciplinary
      • 560 manuscripts (379 conference papers, 163 journal articles)
      • Published over a ten-year period (1998-2007)
      • Gathered academic department, academic affiliation, country of origin, and academic position
  • Generate coauthorship network.
    • Example bibliographic record (BibTeX):
        @article{
          author={Marko A. Rodriguez and Alberto Pepe},
          title={On the relationship },
          journal={Journal of Informetrics},
          year=2008,
          editor={Leo Egghe},
        }
    • Resulting edge: Alberto --coauthor-- Marko
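    A minimal sketch of this step, assuming each bibliographic record has already been parsed into a list of author names. The `records` structure and the second record are hypothetical, added only for illustration; this is not necessarily how the authors built their network.

```python
from collections import Counter
from itertools import combinations

# Hypothetical parsed records: one list of author names per manuscript.
records = [
    ["Marko A. Rodriguez", "Alberto Pepe"],
    ["Alberto Pepe", "Author X", "Author Y"],  # invented for illustration
]

def coauthorship_edges(records):
    """Count, for each unordered author pair, the manuscripts they share."""
    edges = Counter()
    for authors in records:
        # sorted() makes (a, b) and (b, a) the same key; set() drops repeats.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges

edges = coauthorship_edges(records)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} --coauthor-- {b} (shared manuscripts: {weight})")
```

    Keeping the count as an edge weight is a design choice of this sketch; an unweighted network would simply ignore it.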
  • CENS population statistics. Socioacademic communities
  • Study model.
    • Alberto: Affiliation: UCLA; Department: IS; Origin: Italy; Position: PhD Student
    • Marko: Affiliation: LANL; Department: CS; Origin: USA; Position: PostDoc
    • Edge: Alberto --coauthor-- Marko
  • Structural communities.
    • Structural communities are cliquish subgraphs: groups of vertices that are densely connected to one another but sparsely connected to the rest of the network.
    Girvan, M., & Newman, M. E. J., Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99, 7821, 2002.
  • Community detection methods.
    • edge betweenness [1]
    • walktrap (random walks) [2]
    • spinglass [3]
    • leading eigenvector [4]
    [1] Girvan, M., & Newman, M. E. J., Community structure in social and biological networks, Proceedings of the National Academy of Sciences, 99:7821, 2002.
    [2] Pons, P., & Latapy, M., Computing communities in large networks using random walks, Journal of Graph Algorithms and Applications, 10:2, 2006.
    [3] Reichardt, J., & Bornholdt, S., Statistical mechanics of community detection, Physical Review E, 74 (016110), 2006.
    [4] Newman, M. E. J., Finding community structure in networks using the eigenvectors of matrices, Physical Review E, 74, 2006.
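    To make the first method concrete, here is a small sketch of one edge-betweenness split in the spirit of [1]: repeatedly remove the edge that carries the most shortest paths until the network falls apart into more components. The toy graph (two triangles joined by a bridge) and all function names are assumptions for illustration; the study itself presumably used a library implementation.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every undirected edge (Brandes-style)."""
    score = defaultdict(float)
    for source in adj:
        dist = {source: 0}
        paths = defaultdict(float)  # number of shortest paths from source
        paths[source] = 1.0
        preds = defaultdict(list)   # predecessors on shortest paths
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = defaultdict(float)
        for w in reversed(order):   # accumulate from the farthest vertices back
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # Every unordered vertex pair was counted from both endpoints.
    return {edge: s / 2.0 for edge, s in score.items()}

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in group:
                    group.add(w)
                    queue.append(w)
        seen |= group
        groups.append(group)
    return groups

def split_once(edges):
    """Remove highest-betweenness edges until the component count grows."""
    adj = defaultdict(set)
    for a, b in edges:  # assumes no self-loops
        adj[a].add(b)
        adj[b].add(a)
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        a, b = max(scores, key=scores.get)
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)

# Toy network: two triangles joined by the bridge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
groups = split_once(edges)
print(sorted(sorted(g) for g in groups))
```

    In this toy network the bridge lies on every shortest path between the two triangles, so it is removed first and the two triangles emerge as communities.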
  • Coauthorship network map. 27 structural communities detected in the CENS network (leading eigenvector, LEV).
  • Coauthorship network statistics.
    • Typical clustering coefficients:
    • mathematics: 0.34
    • physics: 0.56
    • biology: 0.60
    • less-cliquish, sparse collaboration patterns
    • CENS community fragmented in research agenda
    • Newman, M. E. J., The structure and function of complex networks, SIAM Review, 45, 167, 2003.
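    For readers unfamiliar with the measure, the sketch below computes a clustering coefficient for a small network. The slide does not say which definition it uses; this sketch assumes the Watts-Strogatz form (the mean of the per-vertex coefficients), and the toy adjacency map is invented for illustration.

```python
from itertools import combinations

def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph.

    adj maps each vertex to the set of its neighbours. Vertices with
    fewer than two neighbours contribute 0 (a common convention).
    """
    total = 0.0
    for vertex, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue
        # Count the neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Toy network: a triangle a-b-c with one extra vertex d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(round(average_clustering(adj), 3))
```

    A value near 1 means that an author's collaborators tend to collaborate with each other; lower values indicate the less cliquish, sparser pattern noted above.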
  • Chi square test.
    • The chi square test determines whether two nominal/categorical properties are statistically independent.
    • Alberto: Community: A; Affiliation: UCLA; Department: IS; Origin: Italy; Position: PhD Student
    • Marko: Community: B; Affiliation: LANL; Department: CS; Origin: USA; Position: PostDoc
    • Edge: Alberto --coauthor-- Marko
  • Chi square analysis. N.B. A p-value greater than 0.05 means the two properties are treated as statistically independent. Algorithms compared: leading eigenvector (LEV), walktrap (WT), edge betweenness (EB), spinglass (SG).
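    The test can be sketched as follows: cross-tabulate structural community against a socioacademic property, then compare observed counts with the counts expected under independence. The contingency table below is hypothetical (it is not the CENS data), and the sketch returns only the statistic and degrees of freedom; turning them into a p-value needs a chi-square distribution function from a statistics library.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    table is a list of rows of observed counts; every row and column
    total is assumed to be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = structural community (A, B),
# columns = department (IS, CS).
aligned = [[10, 0],
           [0, 10]]
mixed = [[5, 5],
         [5, 5]]

print(chi_square(aligned))
print(chi_square(mixed))
```

    With one degree of freedom the 0.05 critical value is about 3.84, so the first table would count as dependent (community tracks department) and the second as independent.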
  • Anecdotal example.
  • Remarks.
    • Findings:
      • Community structure is representative of department and affiliation.
      • Academic position and country of origin are independent of the structural community of the scholar.
    • Generalization:
      • Policy recommendations to increase interdisciplinarity
      • Extension to other coauthorship networks and other socioacademic (demographic) variables
      • Useful to predict or infer topological/socioacademic configuration when data is scarce
  • Metadata reuse.
    • Metadata can be used to support scholarly collaboration.
  • Everything is metadata. [Figure: semantic network linking the vertices Borgman, Pepe, Article1, Article2, JCDL, Italy, UCLA, CENS, and Sensor Networks through the edge labels writtenBy, member, country, attended, hasLab, cites, topic, researches, and contains.]
  • Introduction to the Semantic Web.
    • The World Wide Web is used to link documents, where documents are given universal identifiers/locators called URIs (e.g. URL).
      • The structure is machine processable, but the documents/elements are primarily human processable.
    • The Semantic Web is used to link data, where data is given universal identifiers/locators called URIs (e.g. URL).
      • The structure and the data are both human and machine processable.
    T. Berners-Lee, J. Hendler. Publishing on the Semantic Web. Nature, 410(6832):1023-1024, April 2001.
  • The Uniform Resource Identifier.
    • Resource = Anything.
      • Anything that can be identified.
        • Some discrete entity.
    • The Uniform Resource Identifier (URI):
      • <scheme name> : <hierarchical part> [ ? <query> ] [ # <fragment> ]
        • http://www.lanl.gov
        • urn:uuid:550e8400-e29b-41d4-a716-446655440000
        • urn:issn:0892-3310
        • http://www.lanl.gov#MarkoRodriguez
          • prefix it to make it easier on the eyes -- lanl:MarkoRodriguez
    • The Semantic Web
      • first identify it, then relate it!
    W3C/IETF. URIs, URLs, and URNs: Clarifications and recommendations 1.0, September 2001.
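    As a quick illustration of the URI parts and of the prefix shorthand, the sketch below splits the slide's example URIs with Python's standard `urllib.parse.urlsplit` and expands `lanl:MarkoRodriguez` back to its full form. The `PREFIXES` table and the `expand` helper are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Assumed prefix table: "lanl:" abbreviates the LANL namespace from the slide.
PREFIXES = {"lanl": "http://www.lanl.gov#"}

def expand(short_name):
    """Expand a prefixed name such as lanl:MarkoRodriguez to a full URI."""
    prefix, local = short_name.split(":", 1)
    return PREFIXES[prefix] + local

examples = [
    "http://www.lanl.gov",
    "urn:issn:0892-3310",
    expand("lanl:MarkoRodriguez"),
]
for uri in examples:
    parts = urlsplit(uri)
    # scheme, then the hierarchical part, then the optional fragment
    print(parts.scheme, parts.netloc or parts.path, parts.fragment)
```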
  • The undirected network.
    • There is the undirected network of common knowledge.
      • Sometimes called an undirected single-relational network.
      • e.g. vertex i and vertex j are related.
    • The semantics of the edge denote the network type.
      • e.g. friendship network, collaboration network, etc.
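    [Figure: vertices i and j joined by an undirected edge.]
  • Example undirected network. [Figure: undirected network over the vertices Herbert, Marko, Aric, Ed, Zhiwu, Alberto, Jen, Johan, Luda, Stephan, and Whenzong.]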
  • The directed network.
    • Then there is the directed network of common knowledge.
      • Sometimes called a directed single-relational network.
      • For example, vertex i is related to vertex j , but j is not related to i .
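    [Figure: a directed edge from vertex i to vertex j.]
  • Example directed network. [Figure: directed network over the vertices Muskrat, Bear, Fish, Fox, Meerkat, Lion, Human, Wolf, Deer, Beetle, and Hyena.]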
  • The semantic network.
    • Finally, there is the semantic network
      • Sometimes called a directed multi-relational network.
      • For example, vertex i is related to vertex j by the semantic s , but j is not related to i by the semantic s .
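    [Figure: a directed edge labeled s from vertex i to vertex j.]
  • Example semantic network. [Figure: semantic network over the vertices SantaFe, Marko, NewMexico, Ryan, California, UnitedStates, LANL, Cells, Atoms, Oregon, and Arnold, with the edge labels livesIn, worksWith, cityOf, originallyFrom, stateOf, locatedIn, hasLab, madeOf, researches, southOf, hasResident, governerOf, and northOf.]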
  • The technologies of the Semantic Web.
    • Resource Description Framework (RDF): The foundation technology of the Semantic Web. RDF is a distributed, semantic network data model. In RDF, URIs and literals (e.g. ints, doubles, strings) are related to one another in triples.
    • RDF Schema (RDFS) and the Web Ontology Language (OWL): The ontology is to the Semantic Web as the schema is to the relational database.
      • Anything of rdf:type lanl:Human can lanl:drive anything of rdf:type lanl:Car .
    • Triple-Store : The triple-store is to semantic networks what the relational database is to the data table.
      • a.k.a. semantic repository, graph database, RDF database.
  • RDF and RDFS. [Figure: an ontology layer (lanl:Human, lanl:Food, and lanl:isEating with rdfs:domain and rdfs:range) above an instance layer (lanl:marko and lanl:cookie, with rdf:type and lanl:isEating edges).]
    • RDF is not a syntax. It's a data model. Various syntaxes exist to encode RDF, including RDF/XML, N-TRIPLE, TRiX, N3, etc.
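    The labels on this slide suggest the triples used below; the exact wiring of the lost figure is an assumption. The sketch stores triples as plain Python tuples and applies the RDFS domain and range rules, under which using a property lets a reasoner infer the types of its subject and object.

```python
RDF_TYPE, DOMAIN, RANGE = "rdf:type", "rdfs:domain", "rdfs:range"

# Ontology layer (assumed from the slide's labels).
ontology = {
    ("lanl:isEating", DOMAIN, "lanl:Human"),
    ("lanl:isEating", RANGE, "lanl:Food"),
}

# Instance layer (assumed from the slide's labels).
instance = {
    ("lanl:marko", "lanl:isEating", "lanl:cookie"),
}

def infer_types(ontology, instance):
    """Apply the rdfs:domain and rdfs:range rules to the instance triples."""
    inferred = set()
    for s, p, o in instance:
        for prop, rule, cls in ontology:
            if prop != p:
                continue
            if rule == DOMAIN:
                inferred.add((s, RDF_TYPE, cls))  # the subject is typed by the domain
            elif rule == RANGE:
                inferred.add((o, RDF_TYPE, cls))  # the object is typed by the range
    return inferred

for triple in sorted(infer_types(ontology, instance)):
    print(triple)
```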
  • RDF, RDFS, and OWL. [Figure: ontology and instance layers. Vertices: lanl:fluffy, lanl:marko, lanl:bob, lanl:Pet, lanl:Human, owl:Restriction, the blank node _:0123, and the literal 1; edge labels: lanl:hasOwner, rdf:type, rdfs:domain, rdfs:range, rdfs:subClassOf, owl:onProperty, and owl:maxCardinality.]
  • General-purpose modeling. [Figure: List, Map, and Set structures modeled as semantic networks, with the edge labels next, item, key, value, entry, and el.]
  • General-purpose computing. [Figure: a Program and a Virtual Machine modeled as a semantic network, with the labels next, value, test, PC, item, heap, stack, el, true, and false.]
    Rodriguez, M.A., General-Purpose Computing on a Semantic Network Substrate, in review, Journal of Web Semantics, LA-UR-07-2885, April 2007.
  • A web of data and process. [Figure: hosts 127.0.0.1, 127.0.0.0, 127.0.0.2, and 127.0.0.3.]
  • The triple-store.
    • Example query (simplified; namespace prefixes omitted):
        SELECT ?a ?c
        WHERE { ?a type human . ?a wrote ?b . ?b type article .
                ?c wrote ?b . ?c type human . FILTER (?a != ?c) }
    • There are two primary ways to distribute information on the Semantic Web.
      • 1.) publish a serialized RDF document on a web server.
      • 2.) expose a public interface to an RDF triple-store.
    • The triple store is to semantic networks what the relational database is to data tables.
      • Storing and querying triples in a triple store.
      • SPARQL/Update query language.
        • like SQL, but for triple-stores.
    • Example updates (simplified; namespace prefixes omitted):
        INSERT { ?a coauthor ?c }
        WHERE { ?a type human . ?a wrote ?b . ?b type article .
                ?c wrote ?b . ?c type human . FILTER (?a != ?c) }
        DELETE { ?s ?p ?o }
        WHERE { ?s ?p ?o }
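    What the coauthor update on this slide does can be mimicked in a few lines of Python over an in-memory set of triples. This is only an illustration of the graph pattern, not a triple-store; the names `alberto`, `marko`, and `article1` are placeholder data.

```python
# A tiny in-memory "triple-store": a set of (subject, predicate, object).
triples = {
    ("alberto", "type", "human"),
    ("marko", "type", "human"),
    ("article1", "type", "article"),
    ("alberto", "wrote", "article1"),
    ("marko", "wrote", "article1"),
}

def coauthor_pairs(triples):
    """Match the slide's pattern: two distinct humans who wrote the same article."""
    humans = {s for s, p, o in triples if p == "type" and o == "human"}
    articles = {s for s, p, o in triples if p == "type" and o == "article"}
    wrote = [(s, o) for s, p, o in triples if p == "wrote"]
    return {
        (a, c)
        for a, b in wrote
        for c, b2 in wrote
        if b == b2 and a != c
        and a in humans and c in humans and b in articles
    }

# Equivalent of the INSERT: add one coauthor triple per matched pair.
inserted = {(a, "coauthor", c) for a, c in coauthor_pairs(triples)}
triples |= inserted

for triple in sorted(inserted):
    print(triple)
```

    Because the pattern matches each pair in both orders, the coauthor relation is inserted in both directions.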
  • Triple-store vs. relational database.
    • Request: "Give me all pairs of people that are friends, but who work for collaborating companies. Now!"
    • Triple-store (SPARQL interface):
        SELECT ?x1 ?x2
        WHERE { ?x1 lanl:hasFriend ?x2 .
                ?x2 lanl:worksFor ?x3 .
                ?x3 lanl:collaboratesWith ?x4 .
                ?x4 lanl:hasEmployee ?x1 . }
    • Relational database (SQL interface):
        SELECT friendTable.personId1, friendTable.personId2
        FROM personTable, authorTable, articleTable, friendTable, hasEmployeeTable, organizationTable, worksForTable, collaboratesWithTable
        WHERE personTable.id = authorTable.personId
          AND personTable.id = friendTable.personId1
          AND friendTable.personId2 = worksForTable.personId
          AND worksForTable.orgId = collaboratesWithTable.orgId2
          AND collaboratesWithTable.orgId2 = personTable.id
  • Triple-store and graph-analysis.
    • Nearly all network analysis algorithms can be decomposed into a graph traversal problem.
      • Spreading activation and the energy diffusion.
      • PageRank and the random walker.
      • Geodesics and the breadth-depth search.
    • The relational database is not optimized for graph traversal.
      • Indexes are not appropriate for graph traversal.
      • Every traversal is a table join.
    • The triple-store is better suited to graph analysis.
      • Although the triple-store is optimized for graph pattern matching, it still handles graph traversal better than the relational database.
      • Hybrid statement/linked-list databases are good at both pattern matching and traversal.
    • Graph analysis can be used for ranking and recommendation.
    Rodriguez, M.A., "A Multi-Relational Network to Support the Scholarly Communication Process", International Journal of Public Information Systems, volume 2007, issue 1, pages 13-29, ISSN: 1653-4360, LA-UR-06-2416, March 2007.
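    As an example of analysis by traversal, the sketch below finds a geodesic (shortest path length) with a breadth-first search over a set of triples, optionally following only one edge label. The citation triples and the function name are hypothetical.

```python
from collections import defaultdict, deque

def geodesic(triples, start, goal, predicate=None):
    """Length of the shortest directed path from start to goal, or None.

    If predicate is given, only edges with that label are followed.
    """
    out_edges = defaultdict(set)
    for s, p, o in triples:
        if predicate is None or p == predicate:
            out_edges[s].add(o)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            return dist[v]
        for w in out_edges[v]:
            if w not in dist:  # first visit is the shortest in a BFS
                dist[w] = dist[v] + 1
                queue.append(w)
    return None

# Hypothetical scholarly triples.
triples = {
    ("article1", "cites", "article2"),
    ("article2", "cites", "article3"),
    ("article1", "writtenBy", "pepe"),
}

print(geodesic(triples, "article1", "article3", predicate="cites"))
print(geodesic(triples, "article3", "article1", predicate="cites"))
```

    Each step of the search is a neighbour lookup rather than a table join, which is the contrast the slide draws with the relational database.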
  • Modeling the scholarly community.
    • Agents: humans and groups.
    • Artifacts: articles, books, journals, proceedings, conferences, datasets, software, websites, [sensors, deployments].
    • Relationships: citations, authorship, publisher, contains, attends, coauthor, members.
    Rodriguez, M.A., Bollen, J., Van de Sompel, H., A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, 2007 ACM/IEEE Joint Conference on Digital Libraries, pages 278-287, Vancouver, Canada, ACM/IEEE Computing, doi:10.1145/1255175.1255229, LA-UR-07-0665, June 2007.
  • Demonstration.
  • Conclusion.
    • Thank you for coming. Good life.