CINET: A CyberInfrastructure for Network Science
S. M. Shamimul Hasan, on behalf of the CINET team
Technical Report #15-060
Network Dynamics and Simulation Science Lab (NDSSL), Virginia Bioinformatics Institute
Virginia Tech
CINET Team
• Virginia Tech: Keith Bisset, Abhijin Adiga, Edward Fox, Maleq Khan, Chris Kuhlman, Henning Mortveit, Madhav Marathe, Samarth Swarup, Anil Vullikanti
• Indiana University: Geoff Fox, Judy Qiu, Stephen Wu
• SUNY Albany: S. S. Ravi
• Jackson State University: Richard Aló, Chris Cassidy
• University of Houston Downtown: Ongard Sirisaengtaksin
• Argonne National Lab and U. Chicago: Pete Beckman
• VT Students: S. M. Shamimul Hasan, Md Hasanuzzaman, S M Arifuzzaman, Maksudul Alam, Sherif Abdelhamid, Zalia Shams, Tirtha Bhattacharjee
• Persistent Systems: Harsha, Gaurav, Tanmay, Rakhi, Abhijeet, Niranjan and Team
CINET: Team (cont.)
• Several evaluators are incorporating CINET into courses
  – S. S. Ravi at the University at Albany, SUNY
  – Edward Fox at Virginia Tech
  – Anil Vullikanti at Virginia Tech
  – Henning Mortveit at Virginia Tech
  – Aravind Srinivasan at University of Maryland
  – Albert Esterline (NCAT)
• Other evaluators planning to use CINET in research
  – Zsuzsanna Fagyal at UIUC
  – Matt Macauley at Clemson University
  – T. M. Murali at Virginia Tech
Network
“Network is a group or system of interconnected people or things” - Oxford Dictionaries
“Network science is the study of network representations of physical, biological, and social phenomena” - National Research Council
Network Science
• Research in network science has been increasing very rapidly in the last decade, in many different scientific fields.
• Networks can be very large: ~10^8 nodes, ~10^10 edges, requiring HPC for analysis
• There is a need for middleware, i.e., an interface layer
  o Domain experts don't need to become experts in graph theory, data mining, and high-performance computing
  o Provides an abstraction layer that allows separation of innovation above and below this layer
CINET: Vision
• Self-sustainable
  – Users can contribute new networks, data, algorithms, hardware, and research results
• Self-manageable
  – End users will be insulated from the complexities of resource allocation, scheduling, cross-platform interactions, and other low-level concerns
• Repeatable Science
  – The exact version of a model that produced a result is kept
  – All model input parameters are captured
  – Any system configuration information is captured
  – All input data versions are kept
  – The entire set of configuration information for an experiment (multiple runs) should be accessible by providing a URL
  – Encourage users of the system to include pointers to results in published work
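The repeatable-science requirements above can be sketched as a small record type. This is only an illustration under stated assumptions: the class name `ExperimentRecord`, its field names, and the base URL are hypothetical and are not taken from CINET.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical record of everything needed to repeat an experiment."""
    model_version: str          # exact version of the model that produced the result
    parameters: dict            # all model input parameters
    system_config: dict         # system configuration information
    input_data_versions: dict   # version of every input data set

    def fingerprint(self) -> str:
        # Serialize with sorted keys so equal records always hash the same.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    def url(self, base: str = "https://example.org/experiments") -> str:
        # The base URL is a placeholder, not a real CINET endpoint.
        return f"{base}/{self.fingerprint()}"


record = ExperimentRecord(
    model_version="1.0",
    parameters={"measure": "pagerank", "alpha": 0.85},
    system_config={"resource": "cluster-a"},
    input_data_versions={"karate": "v1"},
)
print(record.url())
```

Deriving the URL from the content means a published pointer identifies exactly one configuration; any change to a parameter or data version yields a different pointer.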
System Architecture
• Provides 150+ networks, 18 graph generators, and 80+ measures
• New, improved UI for Granite
• Components (apps) that allow researchers to interact with CINET: visualization of networks, adding networks, adding structural analysis tools
• Structural analysis using Galib, NetworkX, and SNAP
• Version 1.0 of a Python-based DSL for computing complex workflows
• Resource manager 1.0 completed: allows multiple computational and analytical resources to be used and selected
• Website with additional resources (course notes, etc.)
Version 2.0
Digital Library
Digital Library:
• Support network science research
• Manage continuously produced, large-scale scientific output
• Provide simulation-specific services to support science
• Manage large network graphs and workflow of content collections
Digital Library
Data:
  – List of networks & metadata
  – List of measures & metadata
  – Parameters for measures
  – List of generators & metadata
  – Parameters for generators
Services:
  – Memoization: Record details of every experiment run
  – Incentivization: Report how many times a particular graph was used
  – Browsing and Searching: graphs, measures, results
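A minimal sketch of how the memoization and incentivization services might fit together, assuming results are keyed by network, measure, and parameters. The class `MemoizationService` and its methods are hypothetical names for illustration; the slides do not describe the real interface.

```python
import json
from collections import Counter


class MemoizationService:
    """Hypothetical cache of experiment results plus per-graph usage counts."""

    def __init__(self):
        self._results = {}            # memoization: one entry per distinct experiment
        self.graph_usage = Counter()  # incentivization: how often each graph was used

    @staticmethod
    def _key(network, measure, params):
        # Sorted keys make the cache key independent of parameter order.
        return json.dumps([network, measure, params], sort_keys=True)

    def run(self, network, measure, params, compute):
        self.graph_usage[network] += 1
        key = self._key(network, measure, params)
        if key not in self._results:
            # Only compute when this exact experiment has not been run before.
            self._results[key] = compute()
        return self._results[key]


service = MemoizationService()
first = service.run("karate", "degree_stats", {"normalized": True}, lambda: {"max": 17})
again = service.run("karate", "degree_stats", {"normalized": True}, lambda: {"max": -1})
print(first is again, service.graph_usage["karate"])
```

The second call returns the stored result without invoking its `compute` callback, while the usage counter still records both requests.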
Transactional Data
• The following data is stored in the database
  – Users
  – Details of network analyses run by users, including the parameters set for each
  – Details of generator analyses run by users, including the parameters set for each
• The following is stored in the file system
  – Output files of network and generator analyses
• A mapping exists between the data stored in the database and the data stored in the file system
Performance Improvements
• Blackboard is used ONLY for placing job requests
• Simpler and fewer components
• Components are fully distributed
  – Web app, blackboard, and brokers exist on separate VMs
• Brokers no longer need to poll for data; the blackboard container notifies them directly.
Resource Manager
• Decides which resource is best for a given job request
  – Through a set of defined rules
• Tracks the health of and load on compute resources
  – And considers this knowledge in determining the best resource(s)
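The rule-based selection described above could look roughly like the following sketch. The `Resource` fields, the capacity rule, and the "least loaded wins" rule are assumptions chosen for illustration; the actual rule set of the CINET resource manager is not given in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical view of a compute resource as tracked by a resource manager."""
    name: str
    healthy: bool   # result of the latest health check
    load: float     # current load, 0.0 (idle) to 1.0 (saturated)
    max_edges: int  # largest network this resource is assumed to handle


def pick_resource(resources: List[Resource], job_edges: int) -> Optional[Resource]:
    # Rule 1: ignore unhealthy resources.
    # Rule 2: ignore resources too small for the requested network.
    candidates = [r for r in resources if r.healthy and r.max_edges >= job_edges]
    if not candidates:
        return None
    # Rule 3: among the remaining resources, prefer the least loaded one.
    return min(candidates, key=lambda r: r.load)


pool = [
    Resource("workstation", healthy=True, load=0.1, max_edges=1_000_000),
    Resource("cluster-a", healthy=True, load=0.7, max_edges=200_000_000),
    Resource("cluster-b", healthy=False, load=0.0, max_edges=200_000_000),
]
print(pick_resource(pool, job_edges=5_000).name)
print(pick_resource(pool, job_edges=86_600_000).name)
```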
Granite: Structural Analysis of Complex Networks
Graph Analysis Resources and Challenges
• Resources:
  – Static analysis tools: provide efficient implementations of various graph measures or algorithms (e.g., Galib, NetworkX)
  – Large collection of data sets (of networks)
• Challenge 1: How can we make an analytic engine that will
  – Reduce programming overhead
  – Reuse existing resources
• Challenge 2: Provide a simple computational interface for domain experts to use available resources and program interactively
CINET - Granite
• Granite allows users to run various network measures on a variety of networks
  – Measures can be either static (e.g., degree distribution, clustering coefficient) or dynamic (e.g., disease diffusion)
  – Network size can range from tiny (10s of nodes) to very large (100s of millions of nodes)
• Granite automatically picks the best implementation of the specified measure
• Granite automatically picks the most appropriate compute resource
• Granite includes modules from three graph algorithm libraries:
  – Galib (developed at NDSSL)
  – NetworkX (developed at Los Alamos National Lab)
  – SNAP (developed at Stanford University)
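To make the two static measures named above concrete, here is a small pure-Python sketch of degree distribution and local clustering coefficient on an undirected graph. It deliberately does not call Galib, NetworkX, or SNAP; the function names are illustrative, and this is not how Granite implements these measures.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree value to the number of nodes having that degree."""
    return dict(Counter(len(nbrs) for nbrs in adj.values()))


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# A triangle (1, 2, 3) with a pendant node 4 attached to node 3.
adj = build_adjacency([(1, 2), (1, 3), (2, 3), (3, 4)])
print(degree_distribution(adj))
print(clustering_coefficient(adj, 3))
```

Node 3 has three neighbors but only one connected pair among them, so its clustering coefficient is 1/3, while nodes 1 and 2 sit in a closed triangle and score 1.0.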
Graph Libraries
Graph Centrality Measures in CINET
• Degree list <Node-ID, Degree>
• Degree statistics
• Degree distribution
• Average neighbor degree
• Hub-authority
• PageRank
• Clustering coefficient distribution
• Streaming-based CC distribution (approx.)
• Betweenness centrality
• Closeness centrality
• Degree centrality
• Eigenvalue centrality
• k-core
• k-crust
• k-corona
• k-clique coefficient
• Core number
• Ro distribution
• Coreness of nodes <ID, coreness>
• CC list <Node-ID, CC>
• External-memory CC algorithm (exact)
• Parallel CC algorithm
• Generate degree sequence
• Closeness centrality - weighted
• Ro distribution
• Closeness vitality - unweighted
• Closeness vitality - weighted
• Communicability centrality
• In-degree centrality
• Out-degree centrality
Graph Shortest Path and Connectivity Measures in CINET
• Number of connected components
• Component graph
• Component size distribution
• Strongly connected component
• Weakly connected component
• Bi-connected component
• Check bi-connectivity
• BFS tree / forest
• BFS predecessor list
• BFS successor list
• Partitioning by BFS traversal
• DFS predecessor list
• DFS successor list
• DFS: nodes in post-order visits
• DFS tree
• Articulation point
• Bridge edges
• Diameter
• Center
• Periphery
• Check connectivity
• Eccentricity
• Radius
• DFS: nodes in pre-order visits
• Check if graph is a DAG
• Topological sort
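Several of the measures listed above (number of connected components, component size distribution, connectivity check) can be derived from one BFS traversal. The following is a plain-Python sketch with illustrative function names, not the Galib, NetworkX, or SNAP implementation that CINET uses.

```python
from collections import Counter, deque


def connected_components(adj):
    """Return the connected components of an undirected graph via BFS.

    `adj` maps each node to an iterable of its neighbors.
    """
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(sorted(component))
    return components


def component_size_distribution(adj):
    """Map each component size to the number of components of that size."""
    return dict(Counter(len(c) for c in connected_components(adj)))


def is_connected(adj):
    return len(connected_components(adj)) == 1


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(connected_components(graph))
print(component_size_distribution(graph))
```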
Weighted Shortest Path and Motif Counting
Weighted shortest path related
• Minimum spanning tree
• Single source shortest path
• Shortest path tree/forest
• Weighted diameter (exact and approx.)
• Average pairwise distance (exact and approx.)
• Distribution of pair-wise distance (exact and approx.)
Subgraph / Motif counting
• Count triangle
• Clique counts (specialized)
• Graph transitivity
• All maximal clique
• Clique number
• Largest clique containing a node
Flow
• Maximum flow
• Minimum cut
Other Measures
• Shuffle edges
• Degree-assortative shuffle
• Age-assortative shuffle
• Compare graphs
• Remove nodes
• Remove edges
• Remove high degree nodes (top x%)
• Remove high degree nodes (degree >= x)
• Check if a degree sequence is graphical
• Compare graphs
• Isolated nodes
• Vertex cover
• Dominating set
• Minimum edge dominating set
• Check graph consistency
• Check if bipartite graph
• Check if chordal graph
• Maximal independent set
• Number of common neighbors
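As one worked example from the list above, "check if a degree sequence is graphical" asks whether some simple undirected graph has exactly the given degrees. The sketch below uses the Havel-Hakimi reduction; the slides do not say which algorithm CINET uses for this measure, so treat the choice as an assumption.

```python
def is_graphical(degrees):
    """Havel-Hakimi test: can `degrees` be realized by a simple undirected graph?"""
    seq = sorted(degrees, reverse=True)
    if any(d < 0 for d in seq):
        return False
    while seq and seq[0] > 0:
        d = seq.pop(0)            # take the largest remaining degree
        if d > len(seq):
            return False          # not enough other nodes to connect to
        for i in range(d):
            seq[i] -= 1           # connect it to the d next-largest nodes
            if seq[i] < 0:
                return False
        seq.sort(reverse=True)
    return True


print(is_graphical([3, 3, 2, 2, 1, 1]))
print(is_graphical([3, 3, 1, 1]))
```

The first sequence can be realized by a simple graph; the second cannot, because the two degree-3 nodes would each need both degree-1 nodes plus each other, which overfills the degree-1 nodes.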
Simple Generative Models of Networks in CINET
• Random graph generators
• Erdos-Renyi random graph
• G(n, p) graph
• G(n, p) component
• G(n, m) graph
• G(n, r) graph
• Watts-Strogatz small-world graph
• Waxman random graph
• Chung-Lu
• Havel-Hakimi
• Preferential Attachment
• Small world
• Circle
• Star
• Chain
• Lattice
• Deterministic graph generators
• Binary tree graph
• Star
• Wheel
• Grid
• Torus
• Hypercube
• Petersen
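For readers unfamiliar with the G(n, p) model listed above: each of the n(n-1)/2 possible undirected edges is included independently with probability p. A minimal standard-library sketch follows; the function name and `seed` parameter are illustrative, and this quadratic-time version is only suitable for small n, unlike a production generator.

```python
import random
from itertools import combinations


def gnp_random_graph(n, p, seed=None):
    """Return the edge list of a G(n, p) graph on nodes 0..n-1."""
    rng = random.Random(seed)  # a private generator makes runs reproducible
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


edges = gnp_random_graph(n=10, p=0.3, seed=42)
print(len(edges), "edges out of", 10 * 9 // 2, "possible")
```

Passing a fixed seed ties the generated network to its input parameters, which fits the repeatable-science goal stated earlier in the deck.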
Currently Available Networks
• 150+ small and large networks
  – Sizes vary from 100 edges to 110M edges
  – Social contact networks
    • Chicago, Washington DC, Detroit, New York, Seattle
  – Multi-modal urban transportation networks (e.g., subway, cars, buses)
    • Portland, OR
  – Adolescent friendship networks
    • High school in New River Valley
  – Blog and other online networks
    • Slashdot, Epinions
  – Infrastructure networks
    • Ad hoc and mesh, phone call, electrical power
  – Biological networks
Networks in CINET (cont.)
Types of Networks
• Web graph
• Autonomous System/Internet
• Road/transport networks
• Collaboration networks
• Co-appearance networks
• Social networks
• Biological networks
• Infrastructure (e.g., power)
• Others
Original Sources
• Stanford SNAP
• Pajek Dataset
• http://www-personal.umich.edu/~mejn/netdata/
• Some other publicly available sources
List of Networks
Autonomous System/Internet
• Autonomous systems - Oregon-1 - 010331
• Autonomous systems - Oregon-1 - 010407
• Autonomous systems - Oregon-1 - 010414
• Autonomous systems - Oregon-1 - 010421
• Autonomous systems - Oregon-1 - 010428
• Autonomous systems - Oregon-1 - 010505
• Autonomous systems - Oregon-1 - 010512
• Autonomous systems - Oregon-1 - 010519
• Autonomous systems - Oregon-1 - 010526
• Autonomous systems - Oregon-2 - 010331
• Autonomous systems - Oregon-2 - 010407
• Autonomous systems - Oregon-2 - 010414
• Autonomous systems - Oregon-2 - 010421
• Autonomous systems - Oregon-2 - 010428
• Autonomous systems - Oregon-2 - 010505
• Autonomous systems - Oregon-2 - 010512
• Autonomous systems - Oregon-2 - 010519
• Autonomous systems - Oregon-2 - 010526
• The Internet Topology Zoo - AboveNet
• The Internet Topology Zoo - AGIS
Web Graph
• California Web Graph
• EPA Web Graph
• EuroSiS web mapping study
• Web Graph of Berkeley and Stanford
Collaboration Graph
• Condensed Matter collaboration network
• Condensed Matter collaborations 1999
• Condensed Matter collaborations 2003
• Condensed Matter collaborations 2005
• CS PhD supervision relation graph
• Erdos Collaboration Network
• General Relativity and Quantum Cosmology collaboration network
• High-Energy Theory Collaboration Network 2001
• High-Energy Theory Collaboration Network 2003
• Network Science Collaboration
• Phenomenology Collaboration Network
Social, Proximity and Infrastructure Networks
• Dolphins' Social Network in NZ
• Brightkite Friendship network
• Enron Email Data with Manager-Subordinate Relationship Metadata
• Enron email Network
• Enron Giant Component
• Epinions Social Network
• Giant Component of Brightkite Network
• Giant Component of Epinions Networks
• Giant Component of Gowalla Network
• Giant Component of Max Planck's Facebook Network
• Giant Component of Slashdot0811 Network
• Giant Component of Slashdot0902 Network
• Gowalla friendship network
• Hypertext 2009 dynamic contact network
• Hyves Social Network
• Infectious SocioPatterns - 2009-04-28
• Infectious SocioPatterns - 2009-04-29
• Karate network
• LiveJournal Social Network
• Max Planck - Flickr Social Network
• Miami Chung-Lu
• Miami Contact Network
• Portland Contact Network
• Primary School Cumulative Networks 1
• Primary School Cumulative Networks 2
• Seattle Contact Network
• Slashdot Social Network 2008
• Slashdot Social Network 2009
• Youtube Social Network
Road/Transport/Infrastructure Networks
• Airlines
• California transportation network
• Pennsylvania transportation network
• Texas transportation network
• US Air Lines
• US Power Grid
• Western States Power Grid
List of Networks (Contd.)
Biological Networks
• C. Elegans Neural Network
• Yeast PPI network
Co-appearance/co-purchase Networks
• Les Miserables
• Network Glossary
• Politics books
• Word adjacencies
Games/Sports Networks
• American College Football Network
• Soccer WorldCup'98
Others/misc. Networks
• Dynamic Java code
• Small World Network
Making Granite Self-Sustainable: Concept of Services and Apps
User Management
• User can request an account. The account is operational only after an Admin activates it.
• Admin can activate or deactivate accounts.
• User can change password.
• All the entities (Networks, Measures, Generators, Analyses) have owners.
User Management
Add Network
• User can add a network by uploading a network file
• Uploaded network is validated
• For valid networks, the numbers of edges and nodes are automatically calculated
• Networks are converted into .gph & .nx formats
• User can specify metadata for the uploaded network
• User can specify whether the network is
  – Public: available to all users for analysis
  – Private: available only to the owner; this is the default option
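The validate-then-count step above can be illustrated with a short sketch. It assumes the upload is a whitespace-separated undirected edge list with `#` comment lines; that input format and the function name `validate_edge_list` are assumptions, since the slides do not specify the accepted file format or describe the .gph and .nx formats.

```python
def validate_edge_list(text):
    """Validate an edge-list upload and return its node and edge counts.

    Raises ValueError on the first malformed line.
    """
    nodes, edges = set(), set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(
                f"line {line_no}: expected two node IDs, got {len(parts)} fields"
            )
        u, v = parts
        nodes.update((u, v))
        # Store each undirected edge once, regardless of endpoint order.
        edges.add((min(u, v), max(u, v)))
    return {"nodes": len(nodes), "edges": len(edges)}


upload = "# tiny example\n1 2\n2 3\n\n2 1\n"
print(validate_edge_list(upload))
```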
Add Network
Visualization
• CINETViz app fully integrated in Granite.
• User can submit a visualization job for a network.
• Visualization process is scalable and abstracted from the backend through middleware (blackboard & brokers)
• Once a visualization job is completed, the user can view and download the generated visualization.
• Visualization has 2 user interfaces in Granite
  – Quick view while selecting a network for analysis
  – Detailed view in the Visualization tab
Features – Visualization
Visualization of Networks (Contd.)
Karate Club Network Miami Graph
Visualization of Networks (Contd.)
Amazon Co-purchase Network
CINET website
• Central location of CINET
• Portal for course materials
• Web address: http://www.vbi.vt.edu/ndssl/cinet
Graph Dynamical Systems Calculator (GDSC)
• Provide a web application that enables users to compute dynamics for their systems.
• Evaluate arbitrary (small) graphs, a range of vertex functions, and update schemes.
• GDSC is an application in CINET.
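The three ingredients named above (a graph, a vertex function, an update scheme) can be shown in a few lines. This sketch uses binary states, a threshold vertex function, and two common update schemes; all names are illustrative and nothing here reflects GDSC's actual interface.

```python
def threshold(k):
    """Vertex function: state becomes 1 if at least k of (self + neighbors) are 1."""
    def f(own_state, neighbor_states):
        return 1 if own_state + sum(neighbor_states) >= k else 0
    return f


def synchronous_step(adj, state, f):
    """Parallel update: every vertex reads the OLD state of its neighbors."""
    return {v: f(state[v], [state[u] for u in adj[v]]) for v in adj}


def sequential_step(adj, state, f, order):
    """Sequential update: vertices update one at a time and see earlier updates."""
    new_state = dict(state)
    for v in order:
        new_state[v] = f(new_state[v], [new_state[u] for u in adj[v]])
    return new_state


# Path graph 0 - 1 - 2 with only vertex 0 initially active.
path = {0: [1], 1: [0, 2], 2: [1]}
start = {0: 1, 1: 0, 2: 0}
print(synchronous_step(path, start, threshold(1)))
print(sequential_step(path, start, threshold(1), order=[0, 1, 2]))
```

With the same graph and vertex function, the synchronous scheme activates only vertex 1 in one step, while the sequential scheme in order 0, 1, 2 propagates the activation all the way to vertex 2. This sensitivity to the update scheme is why it is a separate input.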
Overview
Future Work
• Add graph modification algorithms
  – Remove edges
  – Swap edges
• Add data model to manage system workflow
• Domain specific language
• Registry Service
Digital Library to Support Computational Epidemiology Datasets
Synthetic Information Based Epidemiological Laboratory (SIBEL)
The Problem
• Computational epidemiology employs computer models and informatics tools to reason about the spatio-temporal spread of diseases.
• Studies are conducted, in general, through the use of a simulation and require information on the population structure, agent behavior, disease transmission, and a model of the disease.
• The heterogeneous content includes metadata, text, tables, spreadsheets, experimental descriptions, and large result files.
NDSSL’s networked epidemiology data repository
Category                    Data                                  Size      Representation
Synthetic Population        Household, Person Activity            566 GB    Relational
Social Network and Output   Contact Network, Simulation Output    1.84 TB   File
Experiment                  Experiment                            240 GB    Relational
The Problem (cont.)
• Data access and digital library services in current setups are cumbersome due to heterogeneity and fragmentation across datasets.
• There is no accepted framework that allows unified access to such content.
• The diversity of models, data sources, data representations, and modalities that are collected, used, and modified motivates the development of a digital library (DL) framework to support computational epidemiology.
• We propose a data mapping framework for digital library systems for computational epidemiology datasets.
• The proposed framework provides a unified view to access and query complete epidemiology workflow data.
Unified View to Access and Query Complete Epidemiology Workflow Data
Resource Description Framework (RDF)
• Directed labeled graphs
• Model elements
  – Resource: the things being described by RDF expressions.
  – Property: a specific aspect, characteristic, attribute, or relation used to describe a resource.
  – Statement: a statement in RDF consists of resource + property + value, i.e., subject + predicate + object.
RDF Example
• For the statement “Shamimul Hasan is the creator of the web page www.vt.edu/~shasan2”
• We have the RDF statement:
  Subject (resource): www.vt.edu/~shasan2
  Predicate (property): creator
  Object (literal): “Shamimul Hasan”
• Node and arc diagram:
  www.vt.edu/~shasan2 --creator--> “Shamimul Hasan”
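The subject, predicate, object structure of the example statement can be mimicked with plain tuples. This is a teaching sketch only: it uses no RDF library, the `match` helper is a hypothetical stand-in for a triple-pattern query, and real RDF would identify the resource and property with full URIs.

```python
# One RDF-style statement as a (subject, predicate, object) tuple.
triples = {
    ("www.vt.edu/~shasan2", "creator", "Shamimul Hasan"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Who is the creator of this page?" becomes a pattern with an unbound object.
for _, _, creator in match(triples, subject="www.vt.edu/~shasan2", predicate="creator"):
    print(creator)
```

Leaving a position unbound is the same idea a SPARQL triple pattern expresses with a variable.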
Framework
• Data mapping gives us the flexibility to switch between various databases and execute queries on them.
Experimental Study
• We considered a real-time epidemiology simulation study conducted in the Seattle area. The study assumed that influenza is transmitted in various regional populations through person-to-person contact.
• We use the D2RQ Mapping Language to convert relational and file data to RDF graphs, Virtuoso Open-Source Edition 6.1.6 as the RDF data engine, and the SPARQL query language.
Experimental Study (cont.)
Databases                      RDF Graph Size (GB)   Number of Triples   RDF Graph Generation Time (Minutes)
Seattle Synthetic Population   177                   661,848,662         317
Output                         3.10                  12,979,996          6
Experiment                     0.01                  66,654              0.37
Experimental Study (cont.)
SPARQL query runtime in seconds, by approach:
Query                                                          Bottom-up   Top-down
How many people of a particular demographic are sick?          0.04        7.18
Find who infected whom of a particular demographic             0.38        9.18
How many people get infected on a particular simulation day?   0.03        5.76
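The gap between the two approaches is easier to read as a ratio. The short calculation below uses only the runtimes reported in the table; the abbreviated query labels are ours.

```python
# (bottom-up seconds, top-down seconds) as reported in the table above.
runtimes = {
    "sick by demographic": (0.04, 7.18),
    "who infected whom": (0.38, 9.18),
    "infected on a given day": (0.03, 5.76),
}

# Ratio of top-down to bottom-up runtime for each query.
speedups = {
    query: top_down / bottom_up
    for query, (bottom_up, top_down) in runtimes.items()
}

for query, ratio in speedups.items():
    print(f"{query}: bottom-up is about {ratio:.1f}x faster")
```

On these three queries the bottom-up approach is faster by a factor of roughly 24 to 192.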
References
• Sherif Hanie El Meligy Abdelhamid, Md. Maksudul Alam, Richard Aló, Shaikh Arifuzzaman, Peter H. Beckman, Tirtha Bhattacharjee, Md Hasanuzzaman Bhuiyan, Keith R. Bisset, Stephen Eubank, Albert C. Esterline, Edward A. Fox, Geoffrey Fox, S. M. Shamimul Hasan, Harshal Hayatnagarkar, Maleq Khan, Chris J. Kuhlman, Madhav V. Marathe, Natarajan Meghanathan, Henning S. Mortveit, Judy Qiu, S. S. Ravi, Zalia Shams, Ongard Sirisaengtaksin, Samarth Swarup, Anil Kumar S. Vullikanti, Tak-Lon Wu: CINET 2.0: A CyberInfrastructure for Network Science. eScience 2014: 324-331
• S. M. Shamimul Hasan, Sandeep Gupta, Edward A. Fox, Keith R. Bisset, Madhav V. Marathe: Data mapping framework in a digital library with computational epidemiology datasets. JCDL 2014: 449-450
• S. M. Shamimul Hasan, Keith R. Bisset, Edward A. Fox, Kevin Hall, Jonathan Leidig, Madhav V. Marathe: An Extensible Digital Library Service to Support Network Science. ICCS 2013: 419-428
• Sherif Elmeligy Abdelhamid, Richard Aló, S. M. Arifuzzaman, Peter H. Beckman, Md Hasanuzzaman Bhuiyan, Keith R. Bisset, Edward A. Fox, Geoffrey Charles Fox, Kevin Hall, S. M. Shamimul Hasan, Anurodh Joshi, Maleq Khan, Chris J. Kuhlman, Spencer J. Lee, Jonathan Leidig, Hemanth Makkapati, Madhav V. Marathe, Henning S. Mortveit, Judy Qiu, S. S. Ravi, Zalia Shams, Ongard Sirisaengtaksin, Rajesh Subbiah, Samarth Swarup, Nick Trebon, Anil Vullikanti, Zhao Zhao: CINET: A cyberinfrastructure for network science. eScience 2012: 1-8
• Resource Description Framework (RDF) developed by World Wide Web Consortium (W3C) - http://bit.ly/1aXP5k2
Student Activity
• Please visit the Granite website: http://ndssl.vbi.vt.edu/apps/cinet/
• Launch App
• Login
  – Username: demo
  – Password: demo1234
• Start a New Analysis with the “Karate” network and the “PageRank” measure.
• Check the analysis report.
Many Thanks!
Additional Slides
Extensible Memoization Service
• Query a set of digital objects that exactly match a metadata pattern
• Utilization
  – Education
  – students
  – Baseline scenarios
  – Comparisons, body base, similar regions
Architecture
Architecture (Cont.)
Architecture (Cont.)
• Small: |G| < 100,000
  – Example: RND-G(n,p) Random Graph 1 (nodes: 1,000, edges: 4,971)
• Medium: 100,000 ≤ |G| < 10,000,000
  – Example: RND-G(n,p) Random Graph 500 (nodes: 500,000, edges: 5.00E+06)
• Large: |G| ≥ 10,000,000
  – Example: Seattle contact network (nodes: 3,207,037, edges: 8.66E+07)
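The size thresholds above translate directly into a small classifier. The slides do not define |G|; this sketch assumes |G| = nodes + edges, which agrees with all three listed examples (using the edge count alone would classify them the same way).

```python
def network_category(nodes, edges):
    """Classify a network as small, medium, or large.

    Assumption: |G| is taken to be nodes + edges; the slides do not define it.
    """
    size = nodes + edges
    if size < 100_000:
        return "small"
    if size < 10_000_000:
        return "medium"
    return "large"


examples = {
    "RND-G(n,p) Random Graph 1": (1_000, 4_971),
    "RND-G(n,p) Random Graph 500": (500_000, 5_000_000),
    "Seattle contact network": (3_207_037, 86_600_000),
}
for name, (n, m) in examples.items():
    print(name, "->", network_category(n, m))
```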
Network Category
Performance
• Shadowfax (Virginia Tech)
• 912 cores, 5 TB RAM, 80 TB storage, 7168 CUDA cores
• 100+ networks
• 100+ measures
Performance