Computational Methods for Testing Adequacy and Quality of Massive Synthetic Proximity Social Networks
Huadong Xia, Christopher Barrett, Jiangzhuo Chen, Madhav Marathe
IEEE BDSE2013
Network Dynamics and Simulation Science Laboratory, Virginia Tech
NDSSL TR-13-153
We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by DTRA Grant HDTRA1-11-1-0016, DTRA CNIMS Contract HDTRA1-11-D-0016-0001, NIH MIDAS Grant 2U01GM070694-09, NSF PetaApps Grant OCI-0904844, NSF NetSE Grant CNS-1011769.
Acknowledgement
• Background and Contributions
• Methods: Network Synthesis
• Comparison of Large Scale Networks
• Conclusions
Outline
• Pandemics cause substantial social, economic and health impacts
– The 1918 flu pandemic killed 50-100 million people, or 3 to 5 percent of the world population.
– …
– SARS 2003, H1N1 2009, Avian flu (H7N9) 2013
• Mathematical and computational models have played an important role in understanding and controlling epidemics
– Controlled experiments are not permitted for ethical reasons.
– Models help us understand the space-time dynamics of epidemics.
Importance of Computational Epidemiological Models
• Heterogeneous
• Spatial-temporal features of populations
• Massive, irregular, dynamic and unstructured
• Social contact networks are usually synthesized
Networked Epidemiology
(Figure From the Internet)
Volume
Facts in Delhi:
13.85M Population
2.67M Households
>200M Contacts
2.64M Locations
The Four V’s in Networked Epidemiology
Velocity
• Interactions change every second
• Node status changes every second
• They are modeled at minute scale
Variety
• Demographics
• Geographic
• Temporal Feature
• Virus Infectivity
• … …
Veracity
• Data: Do we collect enough raw data to render a clear picture?
• Method: Do we extract all useful information out of the available raw data?
(Figure panels: 7am, 9am, 3pm, 8pm)
• The veracity of the network one makes depends on:
– Time available to make such a network (human, computational)
– The data available to make the network
– The specific question that one would like to investigate
• Networks at different levels of detail may be constructed for the same region.
• How do we evaluate networks that span large regions?
– How do we compare two networks constructed for the same population?
– When is the synthesized network adequate?
Social Contact Network Modeling and Analysis
• Propose a number of network measurements to understand and compare urban scale social contact networks, which are extremely large, dynamic and unstructured.
• Quantitatively explore adequacy standards for modeling proximity networks.
Contributions
Synthetic Populations and Their Contact Networks
Goal: Determine who is where, and when.
Process:
• Create a statistically accurate baseline population
• Assign each individual to a home
• Estimate their activities and where these take place
• Determine individuals’ contacts and locations throughout a day
• Networks capture social interaction pertinent to the disease
• We focus on flu-like diseases, and the appropriate network is a social contact network based on proximity relationships.
What Is a Network
Edge attributes:
• activity type: shop, work, school
• (start time 1, end time 1)
• (start time 2, end time 2)
• …
Locations, vertex attributes:
• (x,y,z)
• land use
• …
People, vertex attributes:
• age
• household size
• gender
• income
• …
Two Sets of Data Sources and Generation Methods for Delhi Synthetic Population and Network
Data & Methods | the coarse network | the detailed network
data: demographics | India census 2001 | India census 2001 + micro-data (India Human Development Survey - UMD)
data: geographic data | LandScan 2007 | MapMyIndia
data: activity | generic activity templates | Thane travel survey; residential contact survey
method: people | distribution | distribution/IPF
method: locations | density | real locations + homes along roads
method: activity schedules | categorized templates | decision tree + templates; configuration model
method: activity locations | gravity model
Population for the coarse network Population for the detailed network
Population Synthesis
(Figure: people labeled by gender and age, e.g. M47, F22, F4; panels "Split into HHs" and "Extract individuals")
• Metrics
– Entity level: the population, built infrastructure and their layout
– Collective level: validate against aggregate statistics
– Network level: structural properties
– Epidemic dynamics level: policy effects
How to Compare Two Networks
Individual level age-gender structure
Comparison for Synthetic Populations
Household level demographic structure
Entropy: 1.35 vs. 1.02
the Coarse Network
Precision of Location Distribution
the Detailed Network
LandScan Grid | Synthetic Locations | Real Locations
Note: First row: the coarse network; second row: the detailed network
Temporal Visiting Degree in Randomly Selected Locations
GPL: Structural Properties
• The people-location network GPL: the degrees of a large portion of non-home locations follow a power-law-like distribution.
Disease Spread in a Social Network
• Within-host disease model: SEIR
• Between-host disease model:
– probabilistic transmissions along edges of the social contact network
– from infectious people to susceptible people
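The two model layers above can be illustrated with a minimal sketch. It assumes a static contact list, one fixed per-contact daily transmission probability, and fixed incubation and infectious durations; the slides instead use heterogeneous durations and a time-varying network. The function name `simulate_seir` and all parameters are hypothetical.

```python
import random

def simulate_seir(contacts, seeds, p_transmit, incubation_days,
                  infectious_days, days, rng):
    """Discrete-time SEIR on a contact network (illustrative only).

    contacts: dict mapping each person to a list of neighbours.
    Assumption: fixed durations and one transmission probability
    per contact per day.
    """
    state = {v: "S" for v in contacts}
    timer = {}  # days left in the current E or I state
    for v in seeds:
        state[v] = "I"
        timer[v] = infectious_days
    for _ in range(days):
        # Between-host: infectious people may transmit along edges
        # to susceptible neighbours.
        newly_exposed = []
        for v, s in state.items():
            if s != "I":
                continue
            for u in contacts[v]:
                if state[u] == "S" and rng.random() < p_transmit:
                    newly_exposed.append(u)
        # Within-host: progress E -> I -> R when a timer runs out.
        for v in list(timer):
            timer[v] -= 1
            if timer[v] == 0:
                if state[v] == "E":
                    state[v] = "I"
                    timer[v] = infectious_days
                else:
                    state[v] = "R"
                    del timer[v]
        for u in newly_exposed:
            if state[u] == "S":
                state[u] = "E"
                timer[u] = incubation_days
    return state
```

With `p_transmit=1.0` on a short chain of contacts, every person ends in state R; with `p_transmit=0.0`, only the seed does.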
Epidemic Simulations to Study the Delhi Population
• Disease model
– Flu similar to H1N1 in 2009: assume R0 = 1.35, 1.40, 1.45, 1.60 (only the results for R0 = 1.35 are shown; the others are similar)
– SEIR model: heterogeneous incubation and infectious durations
– 10 random seeds every day
• Interventions
– Vaccination: implemented at the beginning of the epidemic; compliance rate 25%
– Antiviral: implemented when 1% of the population is infectious; covers 50% of the population; effective for 15 days
– School closure: implemented when 1% of the population is infectious; compliance rate 60%; lasts for 21 days
– Work closure: implemented when 1% of the population is infectious; compliance rate 50%; lasts for 21 days
• Five configurations in total (including the base case). Each configuration is simulated for 300 days with 30 replicates.
Comparison in Epidemic Simulations
• Impact on epidemic dynamics (R0=1.35):
– The coarse network uses generic activity schedules, in which people travel much more frequently. Therefore, the two networks show very different epidemic dynamics in the base case.
• Similarities of the two networks:
– Vaccination is still the most effective strategy.
– Pharmaceutical interventions are more effective than non-pharmaceutical ones.
– School closure is more effective than work closure.
• Differences between the two networks:
– Severity is significantly different.
– In delaying the outbreak, school closure is more effective than antiviral in the coarse network, whereas the opposite holds in the detailed network.
Epidemic Simulation Results: Interventions
Categories: Metrics
Underlying Synthetic Population: Household Structure; Location Layout; Duration of Activities; Number of Daily Activities; Travel Distance; Radius of Gyration
GPL: Temporal Degree of Random Locations; Degree of People-Location Graphs
GP: Degree; Clustering Coefficient; Contact Duration; Shortest Path
Epidemic Dynamics: No Interventions; Pharmaceutical Interventions; Non-Pharmaceutical Interventions
Metrics Review
• Novel methodologies for creating a realistic social contact network for a typical urban area in developing countries
• Comparison to a coarser network suggests:
– Similarity reflects generic properties of social contact networks
– Region-specific features are captured in the detailed model
– The epidemic dynamics of the region are strongly influenced by the activity patterns and demographic structure of local residents
– A higher resolution social contact network helps us make better public health policy
• A realistic representation of social networks requires adequate empirical input. We propose the following criteria of adequacy:
– Does the new input decrease the uncertainty of the system?
– Does the new input significantly change epidemics and intervention policy?
Conclusions
• Calibrate R0 to be 1.35
• Vulnerability is defined as the normalized number of infections over 10,000 runs of random simulations
• The vulnerability distribution of the detailed network is flat compared to that of the coarse network, and the detailed network is less vulnerable due to less frequent travel.
Epidemic Simulation Results: Vulnerability
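One possible reading of the vulnerability definition is a per-person infection frequency: the number of runs in which a person was infected, divided by the number of runs. The sketch below assumes that reading; the function name `vulnerability` and the input layout (one collection of infected person IDs per run) are hypothetical.

```python
from collections import Counter

def vulnerability(runs, people):
    """Fraction of simulation runs in which each person was infected.

    runs: list with one collection of infected person IDs per run.
    Assumption: "normalized" means divided by the number of runs.
    """
    counts = Counter()
    for infected in runs:
        counts.update(set(infected))  # count each person once per run
    return {v: counts[v] / len(runs) for v in people}
```

A flat distribution of these values across people would mean that no group is infected much more often than the rest.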
• Case study:
– Delhi (NCT-I): a representative South Asian city that had never been studied before.
• Statistics:
– 13.85 million people in 2001; 22 million in 2011
– Most populous metropolis: 2nd in India; 4th in the world
– 573 square miles, 9 regions (refer to the pic)
– The Yamuna river runs through the urban area.
• Unique socio-cultural characteristics:
– Large slum area
– Tropical weather
– Environmental hygiene
Delhi: National Capital Territory of India
Two Versions of Delhi Networks
• The coarse network:
– Based on very limited data
– Generic methodology applicable to any region in the world
• The detailed network:
– Requires household level micro sample data and other detailed data, not available for all countries
• Improvement in results is expected:
– to evaluate the network generation model;
– to understand the importance of different levels of detail.
• Population generation
Input: Joint distribution of age and gender of the population in Delhi (from the India census 2001)
Algorithm:
– Normalize the counts in the joint distribution of age and gender into a joint probability table
– Create 13.85 million individuals one by one.
For each individual:
  Randomly select a cell c according to the cell probabilities in the joint table.
  Create a person with the age and gender corresponding to cell c.
End
Output: 13.85 million individuals, each associated with the disaggregate attributes of gender and age.
V1: Synthetic Population Generation
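The V1 procedure amounts to weighted sampling from the normalized age-gender table. The sketch below assumes that cells are keyed by (age group, gender) pairs; the function name `synthesize_individuals` and the cell keys are hypothetical, and real census counts are not reproduced here.

```python
import random

def synthesize_individuals(joint_counts, n, rng):
    """Draw n people from a joint age-gender count table.

    joint_counts: dict mapping (age_group, gender) -> census count.
    Assumption: cells are keyed by (age_group, gender) tuples.
    """
    cells = list(joint_counts)
    total = sum(joint_counts.values())
    # Normalize counts into a joint probability table.
    probs = [joint_counts[c] / total for c in cells]
    # Select one cell per individual according to the cell probabilities.
    picks = rng.choices(cells, weights=probs, k=n)
    return [{"age_group": age, "gender": gender} for age, gender in picks]
```

Sampling all individuals in one `choices` call gives the same distribution as creating them one by one, since the draws are independent.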
• Demographic Data: basic census data + India Micro-Sample
– India Census 2001
– Micro sample for household structure: India Human Development Survey 2005 by the University of Maryland and the National Council of Applied Economic Research, which gives for each household sample: hh size, hh head’s age, hh income, house types, animal care; and for each individual in the hh: demographic details, religion, work, marital status, relationship to head, etc.
• Activity Data: Thane travel survey + residential contacts survey
– Activity templates from 2001 Household Travel Survey statistics for Thane, India, and 2005-2009 school attendance statistics from the UNESCO Institute of Statistics (UIS)
o Activity templates are extracted with CART, and assigned to the synthetic population with a decision tree.
– Survey on residential area contacts in India, conducted by NDSSL
o Approximately 40% of adults in India do not travel to work. The survey focused on them.
o Collected people’s age, gender, and contact durations/frequencies near their home.
• Location Data: MapMyIndia data
– Ward-wise statistics for population and households.
– Coordinates for locations such as schools, shopping centers, hotels, etc.
– Infrastructure such as roads, railway stations, land use, etc.
– Boundary for each city, town and ward.
Data Input
• Same methodology as we used for US populations:
Input:
– total # of households
– aggregate distribution of demographic properties from Census: hh size, householder’s age
– household micro-samples
Output: Synthetic population with household structure. Each individual is assigned an age and gender.
Algorithm:
1. Estimate the joint distribution of household size and householder’s age:
1) construct a joint table of hh size and householder’s age: fill in the # of samples for each cell
2) multiply the total # of households by the distributions to calculate marginal totals for the table
3) run IPF to get a converged joint table
4) normalize: divide the count in each cell by the (total # of samples) to get the probability of each cell (illustrated in next slide)
2. Create the synthetic households and population:
1) randomly select a cell with the probability in the joint table
2) select a household sample h uniformly at random from all samples associated with that cell
3) create a synthetic household H, so that H has the same members as h and each member of H has the same demographic attributes as the corresponding member of h
4) repeat steps 2.1-2.3 until the # of synthetic households equals the total # of households from Census.
V2: synthetic population creation method
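Step 2 of the V2 method can be sketched as follows, taking the normalized joint table from step 1 as given. The function name `synthesize_households`, the cell keys, and the micro-sample layout (a list of member dicts per sampled household) are hypothetical; the sketch also assumes every cell with positive probability has at least one micro-sample.

```python
import random

def synthesize_households(cell_probs, samples_by_cell, n_households, rng):
    """Create synthetic households by copying sampled households.

    cell_probs: dict cell -> probability (from the IPF-fitted table).
    samples_by_cell: dict cell -> list of households, each a list of
        member dicts. Assumption: no positive-probability cell is empty.
    """
    cells = list(cell_probs)
    weights = [cell_probs[c] for c in cells]
    households = []
    while len(households) < n_households:
        # 2.1: select a cell with the probability in the joint table.
        cell = rng.choices(cells, weights=weights, k=1)[0]
        # 2.2: select a household sample uniformly at random in that cell.
        template = rng.choice(samples_by_cell[cell])
        # 2.3: copy the members and their demographic attributes.
        households.append([dict(member) for member in template])
    return households
```

Copying each member dict keeps the synthetic households independent of the micro-sample records they were drawn from.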
IPF example
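Target marginals
Row totals: 20, 30, 35, 15
Column totals: 35, 40, 25

Start (header: target column totals; row labels: target row totals)
        35     40     25
20       6      6      3
30       8     10     10
35       9     10      9
15       3     14      8

In each Row Adjustment table the header shows the current column sums and the row labels show the target row totals. In each Column Adjustment table the header shows the target column totals and the row labels show the current row sums.

Iteration 1, Row Adjustment
        29.62  39.61  30.76
20       8.00   8.00   4.00
30       8.57  10.71  10.71
35      11.25  12.50  11.25
15       1.80   8.40   4.80

Iteration 1, Column Adjustment
        35     40     25
20.78    9.45   8.08   3.25
29.65   10.13  10.82   8.71
35.06   13.29  12.62   9.14
14.51    2.13   8.48   3.90

Iteration 2, Row Adjustment
        34.81  40.09  25.10
20       9.10   7.77   3.13
30      10.25  10.95   8.81
35      13.27  12.60   9.13
15       2.20   8.77   4.03

Iteration 2, Column Adjustment
        35     40     25
20.02    9.15   7.76   3.12
30.00   10.30  10.92   8.77
35.01   13.34  12.57   9.09
14.98    2.21   8.75   4.02

Iteration 3 (finished), Row Adjustment
        34.99  40.00  25.00
20       9.14   7.75   3.11
30      10.30  10.92   8.78
35      13.34  12.57   9.09
15       2.21   8.76   4.02

Iteration 3 (finished), Column Adjustment
        35     40     25
20.00    9.14   7.75   3.11
30.00   10.30  10.92   8.77
35.00   13.34  12.57   9.09
15.00    2.21   8.76   4.02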
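The worked example can be reproduced with a short iterative proportional fitting routine. This is a minimal sketch, not the authors' code: the function name `ipf`, the stopping tolerance and the iteration cap are assumptions, and it assumes all row and column sums stay positive.

```python
def ipf(table, row_targets, col_targets, tol=0.01, max_iter=100):
    """Iterative proportional fitting on a 2-D table (illustrative).

    Alternately rescales rows and columns so their sums match the
    target marginals. Assumption: every row and column sum is positive.
    """
    table = [[float(x) for x in row] for row in table]
    for _ in range(max_iter):
        # Row adjustment: scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            table[i] = [x * target / s for x in table[i]]
        # Column adjustment: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / s
        # Columns now match exactly; stop once rows match as well.
        if all(abs(sum(r) - t) < tol for r, t in zip(table, row_targets)):
            break
    return table

start = [[6, 6, 3], [8, 10, 10], [9, 10, 9], [3, 14, 8]]
fitted = ipf(start, row_targets=[20, 30, 35, 15], col_targets=[35, 40, 25])
```

Dividing each fitted cell by the grand total of the table would then turn the counts into cell probabilities.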
V2: household distribution – a snapshot
• Households are distributed along real streets/community blocks.
• V2 avoids placing households on rivers, lakes, green land, etc. (V1 distributes them uniformly within each 1 mile × 1 mile block)
• Activity templates generation
Flowchart: Generating Activity Sequences based on Thane Survey for Delhi-V2
(Flowchart)
Data sources: frequency distribution of reported activity sequences; demographics of the Thane sample population; UIS statistics; frequency distribution of trips (trip start time, trip length)
Outcome: 1) demographics; 2) activity template (activity sequence, activity duration)
Other flowchart labels: commute categories; activity sequences; sampling; decision tree
• Motivation for the residential contact network:
– Approximately 40% of adults in India do not travel to work. The network models interactions among them around their homes (within the residential area).
• Survey data collected:
– age and gender of stay-at-home people: node label
– contact durations/frequencies of each person near their home: edge label/node degree
• Formal question: generate a random network s.t.
– a degree distribution is given for a set of nodes
– a label is given for each node
– Assumption: the network tends to be homophilous (nodes with similar labels are connected with higher probability)
• Method:
– Configuration model with the added feature of node homophily.
– Refer to the next slide for details.
Generation of the Residential Network
For each edge-type in (long-dur, mid-dur, short-dur), do:
1. Initialize each node with a degree drawn i.i.d. from the degree distribution according to its label (age/gender).
2. Form a list of “stubs”: connections of nodes that haven’t been matched with neighbors. Call it stubList.
3. Pick a starting node v0 randomly.
4. For each of v0’s stubs, choose an element v1 from the stubList as follows:
1) v1 is chosen randomly from the stubList;
2) if v1 is the same as v0 or already connected to v0, go to 4.1);
3) with probability p (>0.5), test whether v1 is similar to v0; if not, go to 4.1) and repeat the selection;
4) create an edge between v0 and v1; its duration is computed randomly based on the edge-type (long, mid or short duration).
Done.
Random Network Generation: configuration model with the added feature of node homophily.
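The stub-matching steps above can be sketched for a single edge type as follows. Several details are assumptions rather than part of the slides: nodes are processed in random order, each stub gets at most `max_tries` matching attempts (unmatched stubs are dropped so the loop terminates), and edge durations are omitted. The names `homophilous_configuration_model`, `degree_sampler` and `similar` are hypothetical.

```python
import random

def homophilous_configuration_model(labels, degree_sampler, similar, p,
                                    rng, max_tries=100):
    """Configuration model with a homophily test (illustrative).

    labels: dict node -> label, e.g. an (age, gender) class.
    degree_sampler(label, rng) -> int draws a node degree.
    similar(label_a, label_b) -> bool is the homophily test.
    Assumption: a stub unmatched after max_tries attempts is dropped.
    """
    degrees = {v: degree_sampler(lbl, rng) for v, lbl in labels.items()}
    stubs = [v for v, d in degrees.items() for _ in range(d)]
    neighbours = {v: set() for v in labels}
    order = list(labels)
    rng.shuffle(order)  # assumption: visit starting nodes in random order
    for v0 in order:
        while v0 in stubs:
            stubs.remove(v0)  # consume one stub of v0
            for _ in range(max_tries):
                if not stubs:
                    break
                v1 = rng.choice(stubs)
                # Reject self-loops and duplicate edges.
                if v1 == v0 or v1 in neighbours[v0]:
                    continue
                # With probability p, require v1 to be similar to v0.
                if rng.random() < p and not similar(labels[v0], labels[v1]):
                    continue
                stubs.remove(v1)
                neighbours[v0].add(v1)
                neighbours[v1].add(v0)
                break
    return neighbours
```

Setting `p` close to 1 makes almost every edge join similar nodes, while `p = 0` reduces the procedure to a plain configuration model without self-loops or duplicate edges.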