
Risk Analysis DOI: 10.1111/j.1539-6924.2012.01790.x

A Bayesian Method to Mine Spatial Data Sets to Evaluate the Vulnerability of Human Beings to Catastrophic Risk

Lianfa Li,1,2,∗ Jinfeng Wang,1 Hareton Leung,2 and Sisi Zhao1

Vulnerability of human beings exposed to a catastrophic disaster is affected by multiple factors that include hazard intensity, environment, and individual characteristics. The traditional approach to vulnerability assessment, based on the aggregate-area method and unsupervised learning, cannot incorporate spatial information; thus, vulnerability can be only roughly assessed. In this article, we propose Bayesian network (BN) and spatial analysis techniques to mine spatial data sets to evaluate the vulnerability of human beings. In our approach, spatial analysis is leveraged to preprocess the data; for example, kernel density analysis (KDA) and accumulative road cost surface modeling (ARCSM) are employed to quantify the influence of geofeatures on vulnerability and relate such influence to spatial distance. The knowledge- and data-based BN provides a consistent platform to integrate a variety of factors, including those extracted by KDA and ARCSM, to model vulnerability uncertainty. We also consider the model's uncertainty and use the Bayesian model average and Occam's Window to average the multiple models obtained by our approach for robust prediction of the risk and vulnerability. We compare our approach with other probabilistic models in the case study of seismic risk and conclude that our approach is a good means of mining spatial data sets for evaluating vulnerability.

KEY WORDS: Bayesian network; data mining; spatial analysis; vulnerability

1. INTRODUCTION

Vulnerability of human beings to catastrophic risk denotes the degree to which an individual is subject to damage arising from a catastrophic disaster. A great degree of vulnerability will result in considerable damage to the individuals exposed to catastrophic hazards. In China, the most recent recorded event was the 8 Mw Wenchuan earthquake of May 12, 2008.

1LREIS, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China.

2Department of Computing, The Hong Kong Polytechnic University, Hong Kong.

∗Address correspondence to Lianfa Li, LREIS, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China; [email protected].

That earthquake resulted in casualties of about 300,000 people (more than 90,000 dead or missing) and property loss of U.S. $20 billion.(1,2) Another catastrophic event was the tsunami occurring on December 26, 2004, beneath the Indian Ocean west of Sumatra, Indonesia, with the loss of more than 300,000 lives and the displacement of over 1,000,000 people.(3)

These two catastrophic events, which caused a huge loss of lives and properties, resulted from the sudden, emergent nature of such catastrophic hazards; vulnerability to unpredictable natural disasters is the primary reason for such huge casualties and loss. It is undeniable that it is difficult to monitor and predict natural hazards precisely because of our limited knowledge of their mechanisms and underlying causes. However, given historical patterns of natural hazards, we can obtain useful information about vulnerability to catastrophic disasters using adaptive evaluation methods and increasing spatiotemporal data sets.



But in the case of the two disasters detailed earlier, vulnerability was so poorly evaluated that people had no time to prepare for the sudden occurrence of these catastrophic disasters, which in turn caused huge casualties. If we had had a precise evaluation of vulnerability to catastrophic risk, even given the occurrence of such disasters, more loss could have been avoided and more lives could have been saved.

However, vulnerability is affected by multiple uncertain factors that include hazard intensity, environment, and individual characteristics;(4) thus, evaluation of vulnerability presents ongoing challenges. Many existing approaches to vulnerability assessment are based on the aggregate-area method and unsupervised learning, that is, observations with assumed latent variables,(5−9) making these approaches unable to fully use spatial information. Vulnerability, therefore, is evaluated only roughly, for the following reasons:

• First, in the aggregate-area-based method, related factors are aggregated into census-designated statistical areas. The size of the aggregated area has a significant influence on the result of statistical analysis (different conclusions might be obtained where there are aggregated areas of different sizes).(10) Consequently, approaches based on the aggregate area cannot satisfy requirements that depend on the specific location of vulnerability, such as rescue of human beings, accurate assessment, and planning for the disaster.

• Second, some existing approaches to vulnerability assessment, such as fuzzy comprehensive evaluation,(11,12) are based primarily on domain knowledge and anecdotal evidence gathered by local experts, in which the empirical data used in the model are difficult to validate. The result obtained thereby indicates simply where the vulnerability is high or low at the census-based statistical area but does not show the vulnerability of the landscape at a fine resolution.

• Third, in these approaches, techniques of spatial analysis have seldom been used, and this has made it difficult to incorporate spatial information. Spatial analysis can detect patterns of geofeatures or events occurring in geographic space according to their spatial or nonspatial attributes.(13,14) Techniques of spatial analysis such as kernel density analysis (KDA) and accumulative road cost surface modeling (ARCSM) can quantify such spatial patterns, which are then used with other quantitative factors in vulnerability assessment. This extends the set of predictive factors and can improve vulnerability prediction.

In this article, we propose a spatial data mining approach to modeling the vulnerability of human beings exposed to catastrophic hazards. Data mining is the process of extracting patterns from data. It is an important tool for transforming data into useful information. In terms of vulnerability assessment, spatial data mining processes geospatial data to obtain the prediction models (structure and parameters) that relate predictive factors to risk and vulnerability. Technically, we use a Bayesian network (BN) to perform the uncertainty analysis and computation of vulnerability. Our approach is based on a grid data set supported by a geographical information system (GIS). Using the GIS-supported grid data set, spatial analysis such as KDA and ARCSM can be applied to derive new predictive factors from the related geofeatures, obtain a table of all the cells in the grid data set (with each as a sample for uncertainty inference), amass knowledge from the BN to predict new cases, and visualize the results in GIS. In this article, we make the following contributions:

• Use spatial analysis techniques such as KDA and ARCSM to derive new predictive factors from related geofeatures that are used with other factors to predict vulnerability. The aggregate-area method does not consider the differences among individuals. Our method can preserve the original number of observations without the need to either aggregate or average them. Thus, it is less biased than the aggregate-area-based methods. Given the derivation of additional predictive factors, the set of predictors can be strengthened and the model's performance can be improved.

• Propose a data mining approach to vulnerability assessment. A BN is the major framework for integrating spatiotemporal data and domain knowledge from different sources. We generalize the generic framework of the BN and spatial modeling techniques for vulnerability assessment.


Our approach may target a more specific location of vulnerability since it is based on a fine scale. Also, our approach is able to distinguish among the different environmental and individual characteristics that contribute to the vulnerability of the exposed objects. In this regard, our approach should perform better than many existing approaches.

• Consider the uncertainties of predictive models in vulnerability assessment. We employ heuristic-learning algorithms to study local topologies based on the BN framework and then use a Bayesian model average (BMA) to get a robust prediction. We use several typical predictive models to assess vulnerability and examine their applicability.

The following outlines the structure of this article. Section 2 briefly describes the Bayesian framework for vulnerability assessment. Section 3 presents our modeling methods for vulnerability assessment. Section 4 gives an introduction to our study case and then presents the results. Section 5 discusses the implications of our approach. Section 6 presents our conclusions.

2. BAYESIAN FRAMEWORK FOR VULNERABILITY ASSESSMENT

This section mainly describes the factors involved (Section 2.1) and our Bayesian framework for vulnerability assessment (Section 2.2).

2.1. Factors Involved

There are two sets of factors involved in vulnerability analysis:

1. Factors related to exposure are directly related to the occurrence of natural disasters. For instance, movements of the earth's crust may cause earthquakes; extremely intense wind may cause typhoons; heavy rainfall may cause floods and/or landslides.

2. Factors related to vulnerability refer to environmental and system-resistance factors. Environmental factors characterize the environment that breeds the disasters. They are able to mitigate or amplify the destructive power of a hazard. For instance, a good water-soil conservation capability can mitigate the destructive effect of a mudslide.

System-resistance factors are the characteristics of an individual or system that mitigate the damage of a natural disaster. For example, a house with a lightweight steel structure may better withstand the destructive power of an earthquake than one constructed of wood or brick; people living closer to a shelter may reduce their vulnerability to damage from a disaster; education about the causes and nature of tsunamis can increase citizens' vigilance against this type of disaster.

In table 1 of our previous article,(15) we reported on exposure-related and vulnerability-related factors and the empirical modeling methods for five natural disasters: earthquake, flood, typhoon, mudslide, and avalanche.

2.2. BN of Vulnerability Assessment

A BN is a kind of directed acyclic graph with conditional probabilistic dependence:

$$B_S = G(V, E), \tag{1}$$

where $B_S$ is the network structure; $V$ is the set of random variables (r.v.); $E \subseteq V \times V$ is the set of directed edges that indicate the probabilistically conditional dependency relationships between r.v. nodes and satisfy the Markov property;(16) and

$$B_P = \{\gamma_u : \Omega_u \times \Omega_{\pi_u} \to [0, 1] \mid u \in V\} \tag{2}$$

is a set of assessment functions, where the state space $\Omega_u$ is the finite set of values of $u$; $\pi_u$ is the set of parent nodes of $u$; if $X$ is a set of variables, $\Omega_X$ is the Cartesian product of the state spaces of all the variables in $X$; and $\gamma_u$ uniquely defines the joint probability distribution $P(u \mid \pi_u)$ of $u$ conditional on its parent set, $\pi_u$.

A BN is based on the Bayesian inference principle of the a posteriori probability (that is, belief) of a hypothesis from the evidence. For vulnerability assessment, evidence derives from exposure- or vulnerability-related factors, and the hypothesis refers to the state of risk, expressed as damage states from low to high levels. Let $t$ be such a hypothesis variable of damage risk; its state space $\Omega_r$ would have, say, seven states, $\Omega_r$ = {none, slight, light, moderate, heavy, major, destroyed}.


[Fig. 1. Generic Bayesian modeling framework for vulnerability assessment. Random nodes: R (release scenario), MP (modeling parameters of exposure), E (exposure intensity and derived hazards), ARCS (accumulated road cost surface), HC (human characteristics such as sex, income, age, residential environment, and health state), and HD (human damage, the target variable with states none, slight, light, moderate, major, destroyed); V (vulnerability) is a utility node. Single and double lines with arrows mark links of single and (possibly) multiple variables; the exposure, system-resistance, and vulnerability groupings are indicated.]

In a specified BN, if some factors are known, the a posteriori probability or belief of the target variable $t$ being in a certain state can be estimated by calculating the marginal probability:

$$\mathrm{Belief}(x, t) = \sum_{u_i \in V,\, u_i \neq t} p(u_1, u_2, \ldots, t, \ldots, u_n), \tag{3}$$

where $x$ is an entity, namely, a cell of the grid surface that represents the persons exposed to the disaster; $t$ is the value of a certain state of the damage, $t \in \Omega_r$; and $p(u_1, \ldots, u_n) = \prod_{u_i \in V} p(u_i \mid \pi_{u_i})$ is the joint probability over $V$.
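The marginalization in Equation (3) can be made concrete with a toy example. The following sketch uses a hypothetical three-node network (pga → hd ← cost) with made-up CPTs, none of which come from the paper, and computes Belief(x, t) for each damage state by summing the joint probability over the non-target variables:

```python
# A minimal sketch of Equation (3): the belief in damage state t is the
# joint probability marginalized over all non-target variables. The
# network and its CPTs below are hypothetical, not taken from the paper.
from itertools import product

states = {"pga": ["low", "high"], "cost": ["low", "high"],
          "hd": ["none", "moderate", "destroyed"]}

# Hypothetical priors and CPT p(hd | pga, cost)
p_pga = {"low": 0.7, "high": 0.3}
p_cost = {"low": 0.6, "high": 0.4}
p_hd = {("low", "low"):   {"none": 0.90, "moderate": 0.08, "destroyed": 0.02},
        ("low", "high"):  {"none": 0.80, "moderate": 0.15, "destroyed": 0.05},
        ("high", "low"):  {"none": 0.50, "moderate": 0.35, "destroyed": 0.15},
        ("high", "high"): {"none": 0.30, "moderate": 0.40, "destroyed": 0.30}}

def belief(t):
    """Marginal probability of hd == t, summing the joint over pga and cost."""
    total = 0.0
    for g, c in product(states["pga"], states["cost"]):
        total += p_pga[g] * p_cost[c] * p_hd[(g, c)][t]
    return total

for t in states["hd"]:
    print(t, round(belief(t), 4))
```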

To construct a BN, we suggest the following four steps: (1) identification of the factors to quantify the problem framework; (2) establishment of the interdependencies between the r.v. nodes; (3) assignment of the states or quantities to nodes; and (4) assessment of the conditional probabilities.

Fig. 1 describes the BN framework for risk assessment of human beings. In this framework, R denotes the release scenario, MP the set of exposure-related factors, E the exposure intensity and its derived hazards (random nodes), HC human characteristics related to the vulnerability of human beings (random nodes), ARCS the accumulative road cost surface (a random node), HD the casualty damage caused to human beings (a random node), and V the vulnerability index (a utility node) that will be defined in Equation (13) of Section 3.6. HD is the target variable; we aim to estimate its probabilities in various states. In this framework, we use capitalized italicized letters to represent a univariate and capitalized roman-type letters to represent a set of multiple variables. So R, ARCS, and HD denote univariates, whereas MP, E, and HC denote sets of multiple variables. Also in this figure, the double line with an arrow represents the link from multiple random nodes to the arrowed node or set of nodes, whereas the single line represents the link from a single random node to the arrowed node or set of nodes.

We construct the framework according to domain knowledge,(5,9,11,17,18) that is, an extension of our previous framework in risk assessment for buildings.(15) In this framework, it is assumed that release scenarios (R) and related parameters (MP) of exposure have deterministic influences on the exposure intensity or derived hazards (E); human damage (HD) is assumed to be determined by exposure (E), the accumulated road cost surface (ARCS), and human characteristics (HC). For the sets of multiple variables (MP, E, HC), there may be interdependencies between these variables in addition to their dependencies described in Fig. 1. Similarly, we can construct their interdependencies according to domain knowledge or learning from the training data.


Fig. 2. Data mining procedure for vulnerability assessment.

Once an initial BN framework is constructed, we can refine the interdependencies between local r.v.s in MP, E, and HC and estimate the conditional assessment parameters.

3. MINING TECHNIQUES FOR VULNERABILITY ASSESSMENT

3.1. Data Set Format and Mining Procedure

Fig. 2 shows the procedure for vulnerability assessment. Our techniques are based on the grid data set. To obtain the grid data set from multiple heterogeneous sources, we apply preprocessing steps, e.g., converting the vector data and resampling the grid into the target grid at the standardized resolution and projection. We perform these steps in a GIS environment (Fig. 2a), such as ArcGIS or SuperMap. The following sections describe the major techniques of the procedure.

In our approach, spatial analysis and BNs are used to make the vulnerability assessment. In the grid data set (Fig. 2b), each cell corresponds to an independent sample for training or a new unit that has predictive factors and the target variable. We can extract the table of multiple attributes (Fig. 2c) from the data set. The grid-based format enables us to collect a variety of data from different sources and to integrate it within a data mining system using geospatial techniques such as rasterization, resampling, reprojection, KDA, and ARCSM. Rasterization, resampling, and reprojection are traditional and relatively mature techniques, but KDA and ARCSM have only recently been used in modeling as new quantification techniques of spatial analysis. Thus, our description of the methodology focuses on the KDA and ARCSM techniques of spatial analysis and on Bayesian modeling.
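As an illustration of the grid-to-table step (Fig. 2b-c), the sketch below stacks a few co-registered raster layers so that each cell becomes one sample row with one column per predictive factor; the layer names and values are hypothetical:

```python
# Sketch of turning co-registered grid layers into the attribute table of
# Fig. 2(c): every cell becomes one row (sample), every layer one column.
# Layer names and values are illustrative only.
import numpy as np

rows, cols = 4, 5                      # a toy 4 x 5 grid
rng = np.random.default_rng(0)
layers = {                             # hypothetical co-registered factors
    "pga":  rng.uniform(50, 850, (rows, cols)),
    "kdf":  rng.uniform(0, 5, (rows, cols)),
    "cost": rng.uniform(0, 100, (rows, cols)),
}

names = list(layers)
table = np.column_stack([layers[n].ravel() for n in names])
print(names)          # column header of the attribute table
print(table.shape)    # (20, 3): one row per cell, one column per factor
```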

In the spatial data mining environment, we not only use GIS to manage and visualize spatial data but also employ spatial analysis techniques that may be loaded on GIS, such as KDA and ARCSM, to extract predictive factors from relevant geofeatures. Although KDA and ARCSM may be loaded on GIS, they are distinct from basic GIS functionality. As the first law of geography states: "Everything is related to everything else, but near things are more related than distant things."(19) Natural disasters, as geospatial events, occur in geographical areas and should also have such spatial characteristics; thus, Tobler's first law of geography is applicable to them.


We use KDA and ARCSM in vulnerability assessment to consider the influence of points (such as pollution sources) or polylines (such as an earthquake's active faults) on their surroundings; the influence gradually decreases as the distance from the source increases. KDA models such influences and derives influence factors from the geofeatures.(20,21) ARCSM calculates the accumulative cost for each cell to reach its neighboring shelters; this cost, determined by relevant geographical factors, has an important influence on the vulnerability of the individuals exposed to the hazards.(22−24)

3.2. Spatial Analysis Techniques

Spatial analysis is used to deal with spatial or nonspatial attributes of geographic features to identify generic spatial patterns. These techniques include, but are not limited to, local/global spatial autocorrelation, G-statistics, KDA, ARCSM, and the like. In vulnerability analysis, spatial analysis can be used to quantify discrete or qualitative factors for use in combination with other quantitative factors. For example, spatial autocorrelation is employed to detect similarity or dissimilarity of the damage or loss from the disaster, and G-statistics are used to detect hot spots of high risk. In our approach, KDA and ARCSM are especially applicable for processing qualitative spatial features such as rivers or faults and quantifying them to be used in combination with other analytic factors. Sections 3.2.1 and 3.2.2 mainly describe these two techniques. Spatial autocorrelation and G-statistics may be useful, but the factors derived from them cannot be used directly in vulnerability assessment. Thus, our methodology primarily uses KDA and ARCSM as quantification techniques.

3.2.1. Kernel Density Analysis

As a technique of spatial analysis, KDA transforms a sample of observations recorded as geographically referenced points or polylines into a continuous surface, quantifying the intensity of individual observations over space.(10,25) Points or polylines closer to the center of the target entity are weighted more heavily than those away from the entity, which embodies Tobler's first law of geography. The kernel weights vary within the kernel's "sphere of influence" according to their distance from the central point or polyline as the intensity is estimated: the surface value is highest at the location of the central target geofeature and diminishes with distance from the geofeature. The density estimate at a unit $z$ using the KD function is:

$$\mathrm{Density}(z) = \frac{1}{n}\sum_{i=1}^{n} Z_i \cdot K_\lambda(z, Z_i), \tag{4}$$

where $n$ is the number of sample units, $z$ is any unit in the geographical area, $Z_i$ is the value of the $i$th sample unit, and $K_\lambda(z, Z_i)$ is the kernel density (KD) function. For the KD function, we can use the normal function.(15)

The bandwidth or search radius, λ, is affected by empirical knowledge and the goal. A large λ is more generalized over the entire study area, whereas a small λ means more localization over the area. Since our goal is to reflect the influence of relevant factors on damage caused to individuals exposed to a disaster, in practice λ can be set according to the largest influence range of the relevant factors.

As a smoothing technique, KDA can derive continuous surface data from geofeatures. This facilitates the integration of quantitative factors in combination with the qualitative factors deriving from some geofeatures, such as points (e.g., pollution sources) and polylines (e.g., rivers or faults). Another advantage of KDA is that it avoids the drawback of the "aggregate" approach, in which the estimated average exposure in a particular region serves as a surrogate for the actual exposure of individuals. Individual exposure levels cannot accurately be inferred from aggregated data; KDA helps to preserve the original number and intensity of observations without the need either to aggregate or average them.(15,26)

However, although KDA is generally applicable for smoothing geofeatures, given certain geographical features such as gorges and chasms (which might result in noncontiguous geofeatures), it may be imprudent to use a continuity-based method such as KDA. This problem is common, especially in the case of floods. Thus, if there are considerable unique features such as gorges, we need to identify them on the grid and assign the corresponding cells values different from those of the surrounding continuous surfaces.
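A minimal sketch of Equation (4) follows, spreading the influence of two hypothetical point geofeatures over a grid with a normal (Gaussian) kernel, as the paper suggests; the coordinates, intensities, and bandwidth are illustrative assumptions:

```python
# Sketch of Equation (4): a Gaussian kernel spreads the influence of point
# geofeatures (e.g., pollution sources) over a grid; the surface peaks at
# each feature and decays with distance. All inputs are hypothetical.
import numpy as np

pts = np.array([[10.0, 12.0], [30.0, 25.0]])   # feature locations (x, y)
vals = np.array([1.0, 2.0])                    # Z_i, feature intensities
lam = 8.0                                      # bandwidth / search radius

xx, yy = np.meshgrid(np.arange(50.0), np.arange(40.0))
density = np.zeros_like(xx)
for (px, py), z in zip(pts, vals):
    d2 = (xx - px) ** 2 + (yy - py) ** 2
    # normal (Gaussian) kernel, as suggested in the paper
    density += z * np.exp(-d2 / (2.0 * lam ** 2)) / (2.0 * np.pi * lam ** 2)
density /= len(pts)                            # the 1/n factor in Eq. (4)
print(density.max(), density.min())
```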

3.2.2. Accumulative Road Cost Surface Modeling

ARCSM is used to estimate the cost, i.e., the "difficulty," of reaching a shelter from a living place when a catastrophic event happens.


The lower the cost, the more likely it is that individuals vulnerable to damage can escape and avoid it, thus reducing their vulnerability. ARCSM consists of two models:

• Road cost surface model. This model evaluates the road cost of each cell of the grid based on given factors:

$$c(x) = e(x)\,w_e + d(x)\,w_d + a(x)\,w_a + s(x)\,w_s + \sum_{i=1}^{k} o_i(x)\,w_{o_i}, \tag{5}$$

where $x$ denotes the cell, $e(\cdot)$ the obstacle factor directly related to the exposure, $d(\cdot)$ the road surface density (e.g., the road area / the region area), $a(\cdot)$ the road accessibility determined by distance to the roads, $s(\cdot)$ the slope, and $o_i$ other factors related to the road cost; $w_e$, $w_d$, $w_a$, $w_s$, and $w_{o_i}$, respectively, denote the weights of the relevant factors. If a factor presents a more difficult obstacle for traffic, it is weighted more heavily.

• Accumulated cost surface model. Given the cost surface from Equation (5) and the optional serviceable shelters, the accumulated cost surface model estimates the least accumulated cost with which the neighborhood shelters of each cell can be reached:

$$Ac(x) = \min \sum_{j=x}^{NS(x)} c(j), \tag{6}$$

where $x$ denotes the cell, $Ac(\cdot)$ the accumulated cost, $NS(x)$ the neighborhood shelters around $x$, and $c(j)$ the road cost of cell $j$; the sum runs over the cells $j$ on a path from $x$ to a shelter, and the minimum is taken over such paths. We can calculate $Ac(\cdot)$ by optimal operational algorithms.

Accumulated road cost surface modeling is especially useful for evaluating the damage that may result to human lives: a lower accumulated road cost means that vulnerable persons can more easily reach shelters and thus avoid potential harm. The indicator extracted using ARCSM can also be used with the factors extracted by KDA and other quantitative factors for assessing the vulnerability of human beings.
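The least accumulated cost of Equation (6) can be computed with a multi-source shortest-path search over the grid graph; the paper only mentions "optimal operational algorithms," so the Dijkstra variant below is one plausible choice, and the cost surface and shelter cells are hypothetical:

```python
# Sketch of Equation (6): least accumulated road cost from every cell to
# its nearest shelter, computed with a multi-source Dijkstra over the
# 4-neighbor grid graph. The cost surface c(x) and shelters are toy inputs.
import heapq
import numpy as np

cost = np.array([[1, 1, 4, 4],        # c(x): road cost of each cell
                 [1, 2, 4, 1],
                 [1, 1, 1, 1]], dtype=float)
shelters = [(0, 0), (2, 3)]           # shelter cells

acc = np.full(cost.shape, np.inf)     # Ac(x), least accumulated cost
heap = []
for s in shelters:                    # multi-source initialization
    acc[s] = cost[s]
    heapq.heappush(heap, (acc[s], s))

while heap:
    a, (i, j) = heapq.heappop(heap)
    if a > acc[i, j]:
        continue                      # stale heap entry
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < cost.shape[0] and 0 <= nj < cost.shape[1]:
            cand = a + cost[ni, nj]
            if cand < acc[ni, nj]:
                acc[ni, nj] = cand
                heapq.heappush(heap, (cand, (ni, nj)))

print(acc)                            # lower Ac(x) = easier escape
```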

3.3. Discretization

Discretization is employed to transform continuous variables into discrete variables to be used in the BN, since the BN contains only qualitative or discrete variables for modeling and inference. We obtain the discrete intervals according to domain knowledge or use a discretization algorithm to discretize continuous factors such as elevation, slope, and the KD of faults or rivers. Discretization is significant since it enables us to use quantitative factors along with qualitative or discrete factors to strengthen the model's predictability of vulnerability in the BN.

If the domain knowledge for discrete intervals is unclear, the discretization algorithm can be used to make an automatic division. The algorithm is designed according to the "recursion" idea of Fulton et al. (1995)(27) and the minimal description length (MDL) stopping criterion in Fayyad and Irani's algorithm.(28) This discretization method was reported in our 2010 article;(15) readers can refer to that article for technical details.
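For readers without access to that article, the sketch below illustrates recursive entropy-based splitting in the spirit of Fayyad and Irani; for brevity, the MDL stopping criterion is replaced here by a minimum-gain threshold, which is an assumption of this sketch rather than the authors' exact rule:

```python
# Simplified supervised discretization: recursively pick the cut point
# that maximizes information gain on the class labels; stop when the gain
# falls below a threshold (a stand-in for the MDL criterion in the paper).
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    base, best = entropy(y), (None, 0.0)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        w = i / len(x)
        gain = base - w * entropy(y[:i]) - (1 - w) * entropy(y[i:])
        if gain > best[1]:
            best = ((x[i - 1] + x[i]) / 2.0, gain)
    return best

def discretize(x, y, min_gain=0.05):
    cut, gain = best_cut(x, y)
    if cut is None or gain < min_gain:
        return []
    left, right = x < cut, x >= cut
    return (discretize(x[left], y[left], min_gain) + [cut]
            + discretize(x[right], y[right], min_gain))

slope = np.array([2.0, 4.0, 7.0, 12.0, 18.0, 26.0, 33.0, 41.0])
risk = np.array([0, 0, 0, 1, 1, 1, 2, 2])    # hypothetical damage classes
print(sorted(discretize(slope, risk)))       # learned interval boundaries
```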

3.4. Probabilistic Mining Models

This section briefly introduces the probabilistic models compared in this article.

3.4.1. Bayesian Network

Constructing a BN involves different search algorithms for building the network topology. Table I lists the major methods for construction, inference, and prediction of a BN.

3.4.1.1. Learning the network topology. Learning the BN topology requires establishing a score to measure the network's quality. There are three kinds of score measures that bear a close resemblance: the Bayesian approach, the information criterion, and the minimum description length. In this study, we used the Bayesian approach, which uses the a posteriori probability of the learned structure given the training instances as a quality measure. The Bayesian approach can achieve a good result as it is unaffected by the specific structure, unlike other measures.(29)

Once a Bayesian quality measure is selected, we apply an algorithm to search the space of network structures to find a network topology with a high quality score, $Q(B_S, D)$. We can apply different heuristic or general-purpose search strategies, as listed in Table I. The heuristic algorithms include K2, hill climbing (HC), and TAN (tree-augmented naïve Bayes); the general-purpose algorithms include Tabu, simulated annealing (SA), and genetic algorithm (GA).(29)


Table I. Methods for Construction, Inference, and Prediction of BN

Step | Type | Methods
Structure | Domain-knowledge-based | Construct the BN according to domain or empirical knowledge
Structure | Dependency-analysis-based | Conditional independence (CI)(16)
Structure | Search-and-scoring-based(29): quality measures | Bayesian approach, information criterion approach, and minimum description length approach
Structure | Search-and-scoring-based(29): learning methods | Heuristic search strategies: K2, hill climbing (HC), TAN, etc.; general-purpose search strategies: Tabu, simulated annealing (SA), genetic algorithm (GA), etc.
Parameter learning | Domain-knowledge-based | Reports, statistics, and experienced models
Parameter learning | Distribution-based | Dirichlet-based parameter estimator
Parameter learning | With missing data(16) | Expectation maximization, Gibbs sampling
Inference | Exact inference | Joint probability, naïve Bayesian, graph reduction, and polytree(54)
Inference | Approximate inference | Forward simulation, random simulation(54)
Prediction | Three types of reasoning | Causal, diagnostic, and intercausal(16)

The six search strategies used in our study are briefly described below; a minimal sketch of the Bayesian quality score computation follows the list:

• K2(30) adds arcs with a fixed topological ordering of variables. In our implementation of K2, the ordering in the data set is initially set as a naïve Bayes (NB) network in which the target class variable (loss or damage risk) is made the first in the ordering,(29) since we know little about the relationships between local variables of the training data set.

• HC(31) adds and deletes arcs with no fixed ordering of variables. This procedure is iterated until the highest value of the local Bayes score is obtained.

• In TAN Bayes,(32) the tree is formed by calculating the maximum-weight spanning tree using Chow and Liu's algorithm.(33)

• Tabu(29) search is an optimized variant of HC. This search algorithm applies a Markov blanket correction to the network structure after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node when the optimal network is acquired.

• SA(29) randomly generates a candidate network $B'_S$ close to the current network $B_S$. It accepts the candidate if it is better than the current one. Otherwise, it accepts the candidate with probability

$$e^{t_i \cdot (Q(B'_S, D) - Q(B_S, D))}, \tag{7}$$

where $t_i$ is the temperature at iteration $i$. The temperature starts at $t_0$ and is slowly decreased with each iteration.

• GA.(34) Let D be the set of BN structures for a fixed domain with $n$ variables and let the alphabet S be {0, 1}; a BN structure can then be represented by an $n \times n$ connectivity matrix $C$ with elements

$$c_{ij} = \begin{cases} 1, & \text{if } i \text{ is a parent of } j, \\ 0, & \text{otherwise.} \end{cases}$$

In GA, we represent an individual of the population by the character string


$\{c_{11}c_{21}\ldots c_{n1}c_{12}c_{22}\ldots c_{n2}\ldots c_{1n}c_{2n}\ldots c_{nn}\}$ (also known as a chromosome). GA searches the structure space to find the individual with the best "genetic material" by means of cross-over and mutation operators.
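The sketch below illustrates the Bayesian quality measure that drives these searches: the log K2 score of Cooper and Herskovits for a single node given a candidate parent set, computed from counts. The toy data and variable names are hypothetical; a full search would sum such node scores while exploring candidate structures:

```python
# Sketch of the Bayesian quality measure Q(B_S, D): the log K2 score for
# one node, log prod_j [ (r-1)! / (N_ij + r - 1)! * prod_k N_ijk! ],
# computed from counts in the training data. Toy data only.
from math import lgamma
from collections import defaultdict

def log_k2_node(data, node, parents, card):
    """Log K2 score of `node` given `parents`, with cardinalities `card`."""
    r = card[node]
    counts = defaultdict(lambda: [0] * r)   # parent config -> state counts
    for row in data:
        j = tuple(row[p] for p in parents)
        counts[j][row[node]] += 1
    score = 0.0
    for n_jk in counts.values():
        n_j = sum(n_jk)
        score += lgamma(r) - lgamma(n_j + r)       # (r-1)!/(N_ij+r-1)!
        score += sum(lgamma(n + 1) for n in n_jk)  # prod_k N_ijk!
    return score

# Toy data: each row maps variable name -> discrete state index
data = [{"pga": 0, "hd": 0}, {"pga": 0, "hd": 0},
        {"pga": 1, "hd": 1}, {"pga": 1, "hd": 1}, {"pga": 1, "hd": 0}]
card = {"pga": 2, "hd": 2}

# Compare hd with and without pga as a parent; higher is better
print(log_k2_node(data, "hd", ["pga"], card))
print(log_k2_node(data, "hd", [], card))
```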

3.4.1.2. Learning the parameters. Once the optimal structure has been found, the parameters of the BN (the conditional probability table (CPT) for each node, $B_P$) can be determined according to domain knowledge or learned from the database of instances. We used the simple Bayesian estimator, which assumes that the conditional probability of each r.v. node corresponding to its parent instantiation conforms to the Dirichlet distribution with local parameter independence, $D(\alpha_1, \ldots, \alpha_i, \ldots, \alpha_\tau)$, with $\alpha_i$ being the hyperparameter for state $i$. This Bayesian estimator directly produces estimates of conditional probability from the data set, and it can be used directly in data-driven learning. The Bayesian estimator rather than BMA is used in this step because BMA is not yet part of the standard data analysis tool kit, as its implementation presents several difficulties.(35)

For this estimation method, the assumption of local parameter independence may not be realistic. This can result in slow and biased parameter learning. We use the classification tree to improve the analysis while avoiding this assumption.(16)
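In count form, the Dirichlet-based estimator reduces to smoothing the observed frequencies with the hyperparameters; a minimal sketch with hypothetical counts and a uniform prior:

```python
# Sketch of the Dirichlet-based parameter estimator: each CPT entry is
# (N_ijk + alpha_k) / (N_ij + sum_k alpha_k). Counts are hypothetical.
import numpy as np

# counts[j][k] = N_ijk: times the node is in state k under parent config j
counts = np.array([[18.0, 2.0, 0.0],
                   [5.0, 9.0, 6.0]])
alpha = 1.0                                   # uniform Dirichlet prior
cpt = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
print(cpt)   # each row is p(node state | parent config j); rows sum to 1
```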

3.4.2. Other Probabilistic Models

3.4.2.1. Logistic regression (LR). LR is a technique of probability estimation based on maximum likelihood estimation. In this model, let $Y$ be the risk level as the target/dependent variable (e.g., $Y = 1$ indicating "high risk" and $Y = 0$ "low risk"), and let $X_i$ ($i = 1, 2, \ldots, n$) be the predictive factors. LR assumes that $Y$ follows a Bernoulli distribution and that the link function relating $X_i$ and $Y$ is the logit or log-odds:

$$p(Y = 1 \mid X, \beta) = \mu = \frac{1}{1 + e^{-X\beta}}, \tag{8}$$

where $X = (1, X_1, X_2, \ldots, X_n)$ and $\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)^T$; $\beta$ can be estimated by the maximum likelihood estimator that makes the observed data most likely.(36)

3.4.2.2. Naïve Bayes. NB assumes that the predictive factors are conditionally independent given the target/dependent variable ($Y$ = risk level):

$$P(Y \mid X_1, \ldots, X_n) = \frac{p(Y)}{P(X)} \prod_i P(X_i \mid Y). \tag{9}$$

NB assumes that within each class, the numeric predictors are normally distributed. One can represent such a distribution in terms of its mean (μ) and standard deviation (σ) and thus can estimate the probability of an observed value from such estimates.

3.4.2.3. Normalized Gaussian radial basis function (RBF) network. The normalized Gaussian RBF network uses the k-means clustering algorithm to provide the basis functions and learns either an LR (discrete class problems) or a linear regression (numeric class problems) on top of that. Symmetric multivariate Gaussians are fit to the data from each cluster. If the class is nominal, it uses the given number of clusters per class. It standardizes all numeric attributes to zero mean and unit variance.

3.4.2.4. Multilayer perceptron (MPer). The MPer is a computational model that tries to simulate the structural and/or functional aspects of biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases, an artificial neural network (ANN) is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. In more practical terms, neural networks are nonlinear statistical data-modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.
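For orientation, the baseline models of this subsection can be fit with standard libraries; the sketch below uses scikit-learn on toy data (scikit-learn has no normalized Gaussian RBF network, so that model is omitted from this sketch):

```python
# Sketch of fitting the baseline probabilistic models with scikit-learn,
# assuming a discretized attribute table X and damage labels y; the data
# here are synthetic toy values, not the study's data set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                  # hypothetical predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # hypothetical risk level

models = {"LR": LogisticRegression(max_iter=1000),
          "NB": GaussianNB(),
          "MPer": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                random_state=0)}
for name, m in models.items():
    m.fit(X, y)
    # predicted probability of the high-risk class for the first cell
    print(name, m.predict_proba(X[:1])[0, 1])
```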

3.5. Evaluation and Modeling Uncertainty

3.5.1. Evaluation Measures

We use four scalar measures, i.e., pd, balance, precision, and ROC area; a computational sketch of the four measures follows the list below.

• Pd refers to the detection probability of high risk: it measures the proportion of correctly predicted positive instances among the actually positive ones. If a method achieves a higher pd, it can detect more positive instances (more cell units of high risk detected).


• Balance between pd and pf (pf refers to the probability of false alarms; a good method should have a low pf):

$$\mathrm{balance} = 1 - \sqrt{\frac{pf^2 + (1 - pd)^2}{2}}. \tag{10}$$

• Precision refers to the proportion of true positives among the instances predicted as positive, but it cannot measure the proportion of correctly predicted positive instances among the actually positive ones. Good precision does not always mean a good pd. A method with high precision but a low pd is less useful since it cannot detect more significant positive instances (fewer units of "high loss" risk detected).

• ROC area is the area between the horizontal axis and the receiver operating characteristic (ROC) curve; it gives a comprehensive scalar value representing the model's expected performance. The ROC area is between 0.5 and 1, where a value close to 0.5 indicates a less precise model and a value close to 1.0 a more precise one.
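Treating one damage level as the positive class, the four measures can be computed from a confusion matrix and predicted scores, as in the sketch below (labels and scores are toy values):

```python
# Sketch of the four evaluation measures for one damage level treated as
# the positive class: pd (recall), pf (false-alarm rate), balance
# (Equation (10)), precision, and ROC area. Toy labels and scores.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 1])
score = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7])  # p(high risk)

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

pd_ = tp / (tp + fn)                      # detection probability
pf = fp / (fp + tn)                       # false-alarm probability
balance = 1 - np.sqrt((pf ** 2 + (1 - pd_) ** 2) / 2)
precision = tp / (tp + fp)
print(pd_, pf, balance, precision, roc_auc_score(y_true, score))
```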

3.5.2. Uncertainty of Models

To mitigate the sampling bias and model uncertainties (also avoiding the overfitting problem), we use BMA and Occam's Window(35,37−39) to produce a robust prediction of the seismic risk.

Assume $r$ to be the target variable of risk, $D$ to be the training data set, and $M_k$ to be the $k$th model of the BN. We can then obtain the averaged value of the probability of the target variable being in a certain state using BMA:

$$pr(r \mid D) = \sum_{k=1}^{K} p(r \mid M_k, D)\, p(M_k \mid D), \tag{11}$$

where $K$ is the number of models selected and

$$p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)} \tag{12}$$

is the weight of the Bayes factor, that is, the ratio of marginal likelihoods or of posterior odds to prior odds for different models. We use the BN's inference algorithms (Table I) to obtain $p(D \mid M_k)$ and assume that the prior probability of each model, $p(M_k)$, is the same.

While BMA can average the predictions of the learning algorithms (models) (Table I), we can also use Occam's Window to select the qualified models and remove the poorly qualified ones, thus improving the computational efficiency.

Occam's Window(35) has two principles: (1) if a model receives much less support (e.g., a ratio of 20:1) than the model with maximum posterior probability, then it should be dropped; (2) complex models that receive less support than their simpler counterparts should be dropped.

We use the six search algorithms of Table I to get the local structures of qualitative factors and use BMA and Occam's Window to average the qualified models, thus decreasing the models' bias and improving their robustness.
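A compact sketch of Equations (11) and (12) combined with the first Occam's Window principle: given hypothetical log marginal likelihoods and per-model predictions, weakly supported models (posterior-odds ratio below 1:20 against the best model, with equal priors) are dropped and the rest are averaged:

```python
# Sketch of Equations (11)-(12) with Occam's Window, principle 1. The
# log marginal likelihoods log p(D | M_k) and the per-model predictions
# p(r | M_k, D) below are hypothetical inputs.
import numpy as np

log_ml = np.array([-120.0, -121.5, -128.0, -135.0])   # log p(D | M_k)
preds = np.array([[0.70, 0.30],                        # p(r | M_k, D)
                  [0.65, 0.35],
                  [0.55, 0.45],
                  [0.40, 0.60]])

# Drop models supported 20 times less than the best one (equal priors)
keep = log_ml - log_ml.max() > -np.log(20.0)

w = np.exp(log_ml[keep] - log_ml[keep].max())          # Equation (12)
w /= w.sum()                                           # model weights
p_bma = w @ preds[keep]                                # Equation (11)
print(keep, w, p_bma)
```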

3.6. Vulnerability Assessment

Vulnerability assessment estimates the susceptibility of individuals exposed to natural hazards(9) and relates it to the potential degree of damage to those individuals. The higher the vulnerability, the more damage the individual may experience.

There is no precise definition of vulnerability. In this study, we regard vulnerability as a comprehensive damage index. In the BN framework (Fig. 1), the target variable is the damage state of human beings, which is affected by multiple factors such as exposure intensity, physical or ecological environment, accumulative road cost, and individual characteristics. Using the BN, we can integrate a variety of exposure- and vulnerability-related factors from different sources to estimate the probabilities that a human being is in a certain damage state. Each state (none, slight, light, moderate, major, or destroyed) of the target variable, HD, has a corresponding damage factor range, a central damage factor, and a cost coefficient (Table II). The damage factor is the fraction or percentage of the damage caused to an individual exposed to a natural disaster. The cost coefficients are used for calculating the vulnerability index.

Table II. Different States of the Target Variable, Damage Factors, and Cost Coefficients

Damage State | Damage Factor Range (%) | Central Damage Factor (%) | Cost Coefficient
None | 0 | 0 | 0.1
Slight | 0–1 | 0.5 | 0.2
Light | 1–10 | 5 | 0.3
Moderate | 10–30 | 20 | 0.5
Major | 60–100 | 80 | 0.75
Destroyed | 100 | 90 | 0.9


We assume that the cost coefficient is nonnegative, ranging from 0 to 1; a condition with a larger damage factor has a larger cost coefficient. Table II gives a general definition of the cost coefficients; users can adjust them according to the study goal and empirical knowledge.

Integrating the product of the probability of the vulnerable individual being in a certain damage state and the corresponding cost coefficient, we obtain the vulnerability index of the individual as a utility node of the BN (the diamond node in Fig. 1):

$$Vul(x) = \int_{t=\mathrm{Min}}^{t=\mathrm{Max}} Cost(t)\,\mathrm{Belief}(x, t)\,dt \approx \sum_{t=\mathrm{none}}^{t=\mathrm{destroyed}} Cost(t)\,\mathrm{Belief}(x, t), \tag{13}$$

where $x$ denotes a unit to be estimated, e.g., a cell in a grid data set that represents the exposed persons; $t$ is the damage state of the target variable HD ($t \in \Omega_r$ = {none, slight, light, moderate, major, destroyed}); $Cost$ is the cost coefficient; and $\mathrm{Belief}(x, t)$ is the likelihood or posterior probability of $x$ being in damage state $t$.
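Equation (13) is a simple expected-cost computation once the BN has produced the belief vector; a minimal sketch with a hypothetical belief vector and the Table II coefficients (taking 0.9 for the destroyed state):

```python
# Sketch of Equation (13): the vulnerability index of a cell is the BN
# belief vector over damage states weighted by the Table II cost
# coefficients. The belief vector here is hypothetical BN output.
import numpy as np

states = ["none", "slight", "light", "moderate", "major", "destroyed"]
cost = np.array([0.1, 0.2, 0.3, 0.5, 0.75, 0.9])          # Table II
belief = np.array([0.35, 0.25, 0.15, 0.12, 0.08, 0.05])   # Belief(x, t)

vul = float(cost @ belief)                                # Vul(x)
print(round(vul, 4))
```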

4. THE STUDY CASE OF SEISMIC RISK

The 2008 Wenchuan earthquake is selected as our study case. In Section 4.1, we introduce the study region and goal. Section 4.2 describes the data set. Section 4.3 presents the hazard analysis, which simulates the peak ground acceleration (PGA) distribution under different scenarios. Section 4.4 describes the Bayesian modeling. Section 4.5 uses different probability models to compute the risk probability, compares them, and presents the vulnerability index produced with our robust approach, together with uncertainty and sensitivity analysis.

4.1. Study Region and Goal

The study region of interest (ROI; Fig. 3a) is a rectangular region located at Dujiangyan, Sichuan province of China, between north latitude 30°57′57.318″ and 31°1′12.768″ and between east longitude 103°35′19.657″ and 103°41′7.6″. The study region is close to the epicentral area of the catastrophic May 12, 2008, Wenchuan earthquake.

Fig. 3. (a) The study region of interest (ROI) and (b) the ROI’s background of seismicity.

Fig. 4. Steps of probabilistic seismic hazard analysis (PSHA).


The 2008 Wenchuan earthquake, also known as the Great Sichuan Earthquake, was a deadly earthquake measuring 8.0 Ms and 7.9 Mw that occurred at 14:28:01.42 CST (02:28:01.42 EDT) on May 12, 2008, in Sichuan province of China. The earthquake killed at least 68,000 people. It resulted from the interactive movements of the Indian and Eurasian plates in opposite directions. The seismicity of central and eastern Asia results from the northward movement of the Indian plate at a rate of 5 cm/year and its collision with Eurasia, resulting in the uplift of the Himalayan and Tibetan plateaus and associated earthquake activity. The earthquake occurred along the Longmenshan fault, a thrust structure along the border of the Indo-Australian and Eurasian plates. Seismic activity was concentrated on its mid-fracture (known as the Yingxiu-Beichuan fracture). The rupture lasted close to 120 seconds, with the majority of the energy released in the first 80 seconds. Starting from Wenchuan, the rupture propagated toward the northeast at an average speed of 3.1 km/s along an azimuth of about 49°, rupturing a total length of about 300 km. The maximum displacement amounted to 9 m.

Fig. 3(b) represents the seismicity context of the 2008 Wenchuan earthquake as described in the earlier paragraph, where the ROI is close to the active faults that have a history of seismic activity. The specific goal of our study case is to use the historic seismic catalog to simulate, under this seismicity context (Fig. 3b), the probabilistic seismic hazard risk, i.e., PGA (peak ground acceleration) values in the ROI at two levels of exceedance probability, and then to use the two release scenarios to conduct a vulnerability analysis of human beings in the ROI. In this study case, besides the simulation of the release scenario, we use critical covariates, such as distances to rivers and to the active faults, the accumulative road cost surface, and so on, to estimate the potential vulnerability of human beings exposed to such scenarios. The assessment output of vulnerability is informative for the relevant agencies or insurance companies in making suitable plans and preparedness measures against possible damage from future earthquakes.

4.2. Data Set

The data set is based on the grid format. Each cell of the grid corresponds to a certain number of human beings and their vulnerability. The factors involved were initially selected according to domain knowledge(40−43) and obtained from the National Geomatics Center of China, using the resampling technique. The data are acquired as follows:

• Factors related to exposure include release scenario (rs), magnitude (m), distance (d), landslide risk (lsr), liquefaction risk (lfr), and ground motion risk (pga). We obtained rs, m, d, and pga by exposure modeling according to the catalog of historical earthquakes and the seismicity around this region (Fig. 3b). Also, lsr was quantified using the appropriate method(44,45) and relevant environmental factors that included slope, soil type, PGA, and the KDs of rivers and active faults. Due to the lack of specific spatial data for this region, lfr and its relevant factors were not included in the data set.

• Factors related to system resistance include environmental variables and human characteristics.(42) Environmental variables include soil type (st), proximity to faults (kdf), proximity to rivers (kdr), accumulative road cost surface (cost), and slope (sl). We used the KD function described in Section 3.2.1 to quantify kdf and kdr and made a suitable classification of them (Table III). Human presence (hpr) is the primary characteristic included for human beings. Due to the lack of relevant data, we did not include other variables of human characteristics, such as average age and disaster knowledge. But if such factors are available, they should be employed.

• The data set is based on the grid format. Each cell of the grid is 5 m × 5 m, giving an area of 25 m². This affords our data set a fine resolution for simulating the practical situation. The values of each cell's relevant factors are assigned this way: spatial data of vectors such as faults and rivers are converted into the grid surface using the KD method, and all the grids are resampled to the target grid of 5 m × 5 m. Such a high resolution means a more specific and precise location of vulnerability factors, which is beneficial in the decision-making process for early warning, mitigation, and prevention of disasters.

4.3. Hazard Analysis

Probabilistic seismic hazard analysis (PSHA)(42,46,47) was used in conducting the hazard analysis. The goal of PSHA is to quantify the probability of exceeding various ground-motion levels at a site, e.g., a cell in a grid data set, given all possible earthquakes.


Table III. Classification of Variables and Their Descriptions in the BN of Vulnerability of Human Beings to Seismic Risk

Variable | Number of States | States or Intervals (unit) | Source of Prior Probability Distribution
Release scenario (rs) | 2 | 10% in 50 years; 10% in 100 years | Set according to the PGA exceedance probability in T years
Magnitude (m) | 6 | 0–5.0–5.5–6.5–7.0–7.5–∞ | Annual probabilities calculated using the Gutenberg-Richter magnitude recurrence relationship(55)
Distance (d) | 5 | 0–10–20–40–80–∞ (km) | Distance from the seismic sources to the site of interest
Ground motions (PGA) risk (pga) | 11 | 0–30–50–150–250–350–450–550–650–750–850–∞ (gal) | Deterministic relations, calculated as a function of magnitude, distance, and soil type by PSHA(42)
Soil type (st) | 5 | Unknown; hard rock; soft rock; medium soil; soft soil | Amplification factor for PGA and landslide risk: 1.0 (unknown, medium soil), 0.55 (hard rock), 0.7 (soft rock), 1.3 (soft soil)(40,43)
Close to faults? (kdf) | 6 | 0–1–2–3–4–5–∞ | Quantified using the kernel density function (Section 3.2.1); assumes that the closer to active faults, the greater the risk of damage(20)
Close to rivers? (kdr) | 11 | 0–100–200–300–400–500–600–700–800–900–1000–∞ | Quantified using the kernel density function (Section 3.2.1); assumes that the closer to rivers, the greater the risk of damage(20)
Slope (sl) | 9 | 0–5–10–15–20–25–30–35–40–90 | Slopes are assumed to cause mudslides that damage buildings
Landslide risk (lsr) | 3 | Safe or slight risk; moderate risk; high risk | Five factors, i.e., rivers, faults, soil, slope, and PGA, are responsible for landslides; modeled using the fuzzy method(44)
Liquefaction risk (lfr) | 2 | Ground amplification; liquefaction | Modeled using the methods of earthquake engineering(40,43)
Human presence (hpr) | 2 | Yes/no | Assumes the prior probability p = the population within the cell / the area of the cell

In our analysis, we used PGA to simulate the ground motion in PSHA. PGA is used to define lateral forces and shear stresses in the equivalent-static-force procedures of some building codes and in liquefaction analyses. It is a good indicator of ground motions.


Fig. 5. The surfaces of PGA with a 10% chance of exceedance (a) within 50 years and (b) within 100 years.

Our 2010 article(15) described in detail the techniques of PSHA (Fig. 4) in the hazard analysis of earthquakes. Readers can refer to that article and related documents for technical details.

In the study case, we obtained the PGA maps for two levels of exceedance probability, i.e., a 10% chance of PGA exceedance within 50 years (Fig. 5a) and a 10% chance of PGA exceedance within 100 years (Fig. 5b). These translate, effectively, to recurrence periods of 475 and 950 years.
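The correspondence between exceedance probability and recurrence (return) period follows from the standard Poisson assumption of hazard analysis; this derivation is background arithmetic consistent with the figures above, not an equation given in the paper:

$$p = 1 - e^{-t/T} \quad\Longrightarrow\quad T = \frac{-t}{\ln(1 - p)},$$

so that $p = 0.10$ over $t = 50$ years gives $T \approx 475$ years, and $t = 100$ years gives $T \approx 949 \approx 950$ years.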

4.4. Bayesian Modeling

This section mainly describes the specification of the BN, including the construction of the BN topology and the extraction of the CPT parameters.

We constructed the BN according to domain knowledge of earthquake engineering(40,43,48−50) and the generic framework of Fig. 1. Fig. 6 presents the initial network topology.

As shown in Fig. 6, exposure-related factors include rs, m, d, pga, lfr, and lsr.


[Fig. 6. Bayesian network topology of seismic vulnerability. Indicators related to exposure: release scenario (rs), magnitude (m), distance (d), ground motions (PGA) risk (pga), landslide risk (lsr), and liquefaction risk (lfr), the latter with its soil inputs (liquid limit, clay content, soil profile, liquefaction susceptibility) not modeled in the study case. Indicators relevant to system resistance: soil type (st), close to faults? (kdf), close to rivers? (kdr), slope (sl), accumulative road cost surface (cost), and human presence (hpr). Target variable: human damage (hd), linked to the vulnerability utility node.]

Among these factors, m and d are modeling parameters of exposure, pga is the intensity indicator, and pga, lfr, and lsr are the three risk factors responsible for the vulnerability of human beings, hd. Thus, we have three causal links from the three risk factors pga, lfr, and lsr to the target variable, hd. Due to the lack of some soil data, such as soil profile and clay content, modeling of liquefaction risk was discarded. But these factors should be considered if such data are available. Therefore, the soil variables are given in Fig. 6 but are shown in dotted lines to indicate that they were not used in our case. Thus, in the test, by modeling pga and lsr, we simulated the different vulnerabilities of the ROI under two scenarios of different PGA.

In Fig. 6, the factors relevant to system resistance include environmental variables and characteristics of human beings. Environmental variables include st, kdf, kdr, sl, and cost. We used the KD function (Section 3.2.1) to quantify kdf and kdr and made a suitable classification of them (Table III). Characteristics of human beings include human presence (hpr), which was approximated with the map of human density.

The target variable of the Bayesian model is damage to human beings (hd), with a typology of six states, as described in Table III. Its corresponding damage factors are established according to earthquake engineering experience.(43,51)

Table III gives a brief description of the variables (factors) involved in our BN (Fig. 6) and the sources of the distributions of their prior probabilities.

Fig. 7(a) presents the surface of the KD values of the roads in the ROI, and Fig. 7(b) presents the accumulative road cost surface, with selected hospitals and public parks as the shelter services (the plus signs in Fig. 7).

4.5. Evaluation of Models and Vulnerability

Using the models presented in Section 3.4 to predict damage probabilities, we compared the models (Tables IV and V) with the practical situation in the simulation of the return period of 950 years and used Equation (13) to compute the vulnerability index for each cell of the grid for the two simulations in the ROI (Fig. 8).

In total, we used 11 algorithms to predict the ROI's damage under the two scenarios and compared them with the practical situation under a scenario similar to the actual 2008 situation. Among the 11 algorithms there were six BNs, corresponding to the six search algorithms described in Section 3.4.1: K2, HC, TAN, Tabu, SA, and GA. We also used BMA, which integrated the predictive values from the six BN algorithms, and four other algorithms: LR, NB, the RBF network, and MPer. (The four methods have been described in Section 3.4.2.) The predictors used in these models included the exposure-related factors (i.e., rs, m, d, pga, and lsr) and the system-resistance factors (i.e., st, kdf, kdr, sl, cost, and hpr).


[Fig. 7. (a) The graduated categories of the kernel density values of the rivers within the ROI; (b) the graduated categories of the accumulative road cost surface corresponding to shelter services within the ROI (the plus signs mark the shelter services).]


Table IV. Comparison of the Models for Prediction of the 950-Year Scenario, with the Practical Damage Survey as the Training Data, in Pd and Balance

Model | Pd (states 1–6) | Balance (states 1–6)
BN (BMA) | 0.88 0.79 0.84 0.91 0.82 0.91 | 0.88 0.85 0.88 0.93 0.85 0.87
BN (K2) | 0.95 0.69 0.75 0.84 0.73 0.79 | 0.96 0.77 0.82 0.87 0.79 0.84
BN (HC) | 0.82 0.68 0.74 0.83 0.74 0.80 | 0.87 0.77 0.81 0.87 0.81 0.85
BN (TAN) | 0.52 0.87 0.89 0.89 0.83 0.83 | 0.68 0.90 0.92 0.92 0.86 0.87
BN (Tabu) | 0.78 0.67 0.76 0.83 0.73 0.81 | 0.84 0.77 0.83 0.87 0.80 0.85
BN (SA) | 0.21 0.84 0.82 0.91 0.90 0.84 | 0.45 0.88 0.87 0.93 0.91 0.88
BN (GA) | 0.82 0.68 0.73 0.82 0.73 0.80 | 0.87 0.77 0.80 0.87 0.80 0.85
LR | 0.69 0.51 0.30 0.69 0.79 0.77 | 0.78 0.65 0.50 0.78 0.79 0.82
NB | 0.83 0.63 0.65 0.80 0.44 0.68 | 0.87 0.73 0.74 0.83 0.60 0.75
RBF | 0.73 0.67 0.74 0.79 0.68 0.74 | 0.81 0.77 0.82 0.84 0.75 0.78
MPer | 0.13 0.67 0.70 0.82 0.82 0.90 | 0.38 0.76 0.84 0.88 0.88 0.89

Note: States 1–6 denote 1: destroyed; 2: major; 3: moderate; 4: light; 5: slight; 6: none.

Table V. Comparison of the Models for Prediction of the 950 Years with the Practical Damage Survey as the Training Data in Precisionand ROC Area

Precision ROC AreaModelNo. ofState 1 2 3 4 5 6 1 2 3 4 5 6

BN (BMA) 0.62 0.75 0.78 0.76 0.88 0.90 1.00 0.98 0.98 0.98 0.96 0.94BN (K2) 0.31 0.66 0.63 0.61 0.81 0.85 0.99 0.96 0.97 0.97 0.91 0.91BN (HC) 0.26 0.66 0.63 0.60 0.83 0.86 0.99 0.96 0.97 0.97 0.91 0.90BN (TAN) 0.52 0.69 0.77 0.74 0.87 0.90 0.98 0.98 0.99 0.98 0.96 0.94BN (Tabu) 0.33 0.65 0.66 0.60 0.81 0.85 0.99 0.96 0.96 0.97 0.91 0.91BN (SA) 0.62 0.79 0.82 0.81 0.86 0.91 0.99 0.98 0.98 0.98 0.96 0.95BN (GA) 0.27 0.66 0.63 0.60 0.82 0.85 0.99 0.96 0.97 0.97 0.91 0.91LR 0.47 0.70 0.57 0.57 0.69 0.81 0.99 0.92 0.91 0.93 0.90 0.91NB 0.24 0.65 0.27 0.39 0.78 0.74 0.98 0.93 0.91 0.39 0.77 0.74RBF 0.54 0.70 0.68 0.61 0.72 0.75 0.99 0.95 0.94 0.92 0.90 0.90MPer 0.37 0.71 0.74 0.83 0.86 0.90 0.95 0.85 0.93 0.91 0.95 0.90

Note: Damage states: 1 = destroyed; 2 = major; 3 = moderate; 4 = light; 5 = slight; 6 = none.

Table IV shows the specific classification of these factors acquired by the discretization algorithm of Section 3.3 and the relevant methods used to determine their prior probability distributions. Our exploratory data analysis showed that the linear correlations between these factors ranged from 0.0032 to 0.379. Such slight linear correlations indicate that multicollinearity in the prediction models could be reduced or avoided, even when all the factors were used as predictors. The conditional probabilities or relevant parameters of each predictor, given the prior probabilities known according to Table IV, could then be learned from the training data sets by the models according to the principles of maximum likelihood and minimum error. We developed these algorithms based on WEKA(52) and MatLab.(53)
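As a concrete illustration of this parameter-learning step, the following is a minimal sketch of maximum-likelihood estimation of a conditional probability table from discretized samples, with add-one smoothing; the variable names, cardinalities, and random data are illustrative assumptions, not the study's WEKA/MatLab implementation.

```python
import numpy as np

def estimate_cpt(child, parents, data, n_states):
    """Maximum-likelihood estimate of P(child | parents) from discretized
    samples; data maps a variable name to an integer-coded array.
    Add-one (Laplace) smoothing keeps unseen parent configurations nonzero."""
    cpt = np.ones([n_states[child]] + [n_states[p] for p in parents])
    for i in range(len(data[child])):
        idx = tuple([data[child][i]] + [data[p][i] for p in parents])
        cpt[idx] += 1.0
    return cpt / cpt.sum(axis=0, keepdims=True)  # normalize over child states

# Illustrative use: damage state conditioned on discretized pga and kdf.
rng = np.random.default_rng(0)
data = {"damage": rng.integers(0, 6, 500),
        "pga": rng.integers(0, 4, 500),
        "kdf": rng.integers(0, 3, 500)}
cpt = estimate_cpt("damage", ["pga", "kdf"], data,
                   {"damage": 6, "pga": 4, "kdf": 3})
print(cpt[:, 0, 0])  # P(damage | pga = 0, kdf = 0)
```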

In the simulation of the 950-year return period, we compared the result with the ROI's damage situation estimated from aerial photos and practical surveys(1) of the Wenchuan earthquake of May 12, 2008. Our simulation had a total prediction accuracy of 0.846, with a Kappa statistic of 0.735, which is acceptable for practical monitoring and forecasting. Table IV lists pd and balance for each damage level; Table V presents precision and ROC area for each damage level. As shown in Tables IV and V, the BMA prediction of BN achieved a high pd with good balance and a moderately good precision and ROC area for every damage level, unlike the other probabilistic models, which achieved good performance for only some of the damage levels.
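For readers reproducing these figures, the sketch below derives the per-state pd and balance, as well as the Kappa statistic (whose observed-agreement term is the total accuracy), from a confusion matrix. We assume the common definition balance = 1 - sqrt(pf^2 + (1 - pd)^2) / sqrt(2), which may differ in detail from the measure used in the tables.

```python
import numpy as np

def per_state_metrics(cm):
    """Per-state pd (recall) and pf (false-alarm rate) from a confusion
    matrix cm[true, predicted], plus balance, assumed here to be
    1 - sqrt(pf^2 + (1 - pd)^2) / sqrt(2)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - tp - fn - fp
    pd_ = tp / (tp + fn)
    pf = fp / (fp + tn)
    balance = 1.0 - np.sqrt(pf ** 2 + (1.0 - pd_) ** 2) / np.sqrt(2.0)
    return pd_, pf, balance

def kappa(cm):
    """Cohen's Kappa: agreement beyond chance; po is the total accuracy."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n
    pe = cm.sum(axis=1) @ cm.sum(axis=0) / n ** 2
    return (po - pe) / (1.0 - pe)
```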


Fig. 8. Vulnerability index estimated by the Bayesian method: (a) prediction of the vulnerability for the scenario of a 10% chance of exceedance of PGA within 50 years (recurrence period of 475 years); (b) prediction for a 10% chance of exceedance of PGA within 100 years (recurrence period of 950 years).


Table VI. Shannon Mutual Information Between the Vulnerability Index (Vul, the Target Variable) and the Predictive Factors Influencing Vul

Predictive Factor                        Shannon Mutual Information
Close to faults (kdf)                    0.189
PGA risk (pga)                           0.135
Density of people (hpr)                  0.131
Accumulative road cost surface (cost)    0.115
Slope (sl)                               0.102
Close to river? (kdr)                    0.080
Soil type (st)                           0.021

Although the other models can have a slightly better value on some performance measures for certain levels of damage (e.g., BN (SA)'s precision for level 4 in Table V), they perform no better than our BMA prediction for the other levels. In total, the BMA prediction falls in an acceptable range for pd (0.79–0.91), balance (0.85–0.88), precision (0.62–0.90), and ROC area (0.94–1.0) for every level of damage. Thus, our BMA prediction achieved a stable performance compared with the single probabilistic models.
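Mechanically, the BMA prediction reduces to a weighted average of the member models' damage-state probabilities. A minimal sketch, assuming the six BN model weights (posterior model probabilities) have already been computed and normalized elsewhere:

```python
import numpy as np

def bma_predict(prob_stack, weights):
    """Combine per-model predictions by Bayesian model averaging.
    prob_stack: array (n_models, n_cells, n_states) of damage-state
    probabilities; weights: posterior model probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # guard against rounding
    return np.einsum("m,mcs->cs", w, np.asarray(prob_stack))

# Illustrative: six BN predictions for 1,000 cells and six damage states.
rng = np.random.default_rng(1)
stack = rng.dirichlet(np.ones(6), size=(6, 1000))
p_bma = bma_predict(stack, [0.30, 0.20, 0.15, 0.15, 0.10, 0.10])
assert np.allclose(p_bma.sum(axis=1), 1.0)  # rows remain distributions
```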

Given BMA's robust prediction, we used it for the vulnerability assessment. Fig. 8(a) presents the vulnerability index of human beings in ROI under the scenario of a 10% chance of exceedance probability of PGA within 50 years; Fig. 8(b) presents the vulnerability index under the scenario of a 10% chance of exceedance probability of PGA within 100 years. The former corresponds to a recurrence period of 475 years and the latter to a recurrence period of 950 years, similar to the Wenchuan earthquake of May 12, 2008. Comparing the vulnerability indices of the 475- and 950-year recurrence periods, we found that the vulnerability for the 950-year period is considerably larger than that for the 475-year period. The spatial distribution of vulnerability for the 950-year period also differs slightly from that for the 475-year period (compare Fig. 8(a) with Fig. 8(b)). This can be explained by the difference in PGA: the 950-year PGA is stronger than the 475-year PGA.

4.6. Sensitivity Analysis

Table VI presents the results of the sensitivity analysis: the Shannon mutual information between the vulnerability index and seven predictive variables, with Vul as the target variable. From this table, we can see that kdf and pga have the largest values, indicating the greatest influence on the vulnerability index; they are followed by hpr, cost, sl, kdr, and st. Because the four leading factors (kdf, pga, hpr, and cost) are the main sources of epistemic uncertainty, this result suggests that decisionmakers should prioritize local data-collection efforts on these factors rather than on the other factors listed in the BN.
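The ranking in Table VI can be reproduced by a direct computation of the Shannon mutual information between each discretized factor and Vul; the sketch below assumes integer-coded states.

```python
import numpy as np

def mutual_information(x, y):
    """Shannon mutual information (bits) between two integer-coded
    discrete variables, from their empirical joint distribution."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip empty cells to avoid log(0)
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())
```

Applying this function to each factor column against the discretized Vul and sorting the results yields a ranking of the form shown in Table VI.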

5. DISCUSSION

Probabilistic data mining models are a good means of estimating vulnerability because their probability outputs can be multiplied by the cost coefficients of the damage states (Table III) to obtain a vulnerability index. The performance of these models in predicting the probabilities of the damage states is therefore critical for estimating the vulnerability index. In our study case, we compared probabilistic models including BNs, NB, LR, the RBF network, and the multilayer perceptron. As shown in Tables IV and V, these models were effective for predicting certain levels of damage but performed poorly for other levels. By using BMA to average the predictions of the BNs, however, the uncertainty of the models was decreased, and the prediction performed stably and moderately well. This validation illustrates that the averaged prediction of BNs is an acceptable method for modeling and estimating the vulnerability of human beings to catastrophic risk.
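A minimal sketch of this expectation-style index follows; the cost coefficients below are placeholders rather than the actual values of Table III, and the probabilities are randomly generated stand-ins for model outputs.

```python
import numpy as np

# Placeholder cost coefficients for the six damage states
# (destroyed, major, moderate, light, slight, none).
cost = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.0])

def vulnerability_index(probs, cost):
    """Vulnerability index per grid cell: the expected damage cost
    Vul_c = sum_s P(state s | cell c) * cost_s."""
    return np.asarray(probs) @ cost

rng = np.random.default_rng(2)
probs = rng.dirichlet(np.ones(6), size=1000)  # one row per grid cell
vul = vulnerability_index(probs, cost)        # values fall in [0, 1]
```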

As a means of probability inference, BN offers several specific advantages over other probabilistic models. BN provides a good platform for integrating information sources from multiple specialist fields and using them for inference under uncertainty. We propose a generic Bayesian framework for modeling catastrophic risk (Fig. 1). As illustrated in our study of seismic risk, BN integrates, within a consistent modeling system, the output of the probabilistic seismic hazard model designed by seismic experts with the assessments of the environment and of system resistance made by architects and engineers. BN can also update the predictive values of risk given partial evidence, even with missing data. This offers decisionmakers good knowledge of the practical situation, or of potential future scenarios, before taking action or making emergency plans.
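To illustrate updating under partial evidence, the toy network below uses the open-source pgmpy library rather than the WEKA/MatLab stack of this study; its structure and conditional probabilities are invented for illustration only.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Toy network: shaking (pga) and soil type (st) drive the damage state.
model = BayesianNetwork([("pga", "damage"), ("st", "damage")])
model.add_cpds(
    TabularCPD("pga", 2, [[0.7], [0.3]]),   # low / high shaking
    TabularCPD("st", 2, [[0.6], [0.4]]),    # firm / soft soil
    TabularCPD("damage", 3,                 # none / partial / severe
               [[0.80, 0.50, 0.60, 0.20],
                [0.15, 0.30, 0.30, 0.40],
                [0.05, 0.20, 0.10, 0.40]],
               evidence=["pga", "st"], evidence_card=[2, 2]),
)
model.check_model()

# Partial evidence: shaking observed, soil type missing (marginalized out).
posterior = VariableElimination(model).query(["damage"], evidence={"pga": 1})
print(posterior)
```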

Another advantage of our method is its use of spatial analysis and GIS to assess, visualize, and locate vulnerability. We used the spatial analysis techniques of KDA and ARCSM to quantify the influence of geofeatures such as faults, rivers, and roads. It is assumed that the closer the exposed individuals are to active faults or rivers, the higher their vulnerability, and the closer they are to rescue roads and/or shelter services, the lower their vulnerability, because they have more opportunity to escape. This assumption is generally acceptable. With spatial analysis techniques such as KDA and ARCSM, the potential influences of many risk-related, critical geofeatures or indicators can be quantified and used in modeling in combination with other quantitative variables. Further, the use of GIS facilitates locating the geographical variations of vulnerability and their uncertainties. The use of both spatial analysis and GIS in our grid-based approach has a significant implication: the important risk-related factors are quantified for risk prediction, and the risk-prone sites can be spatially located at a fine scale. This enables decisionmakers to plan shelter services more precisely before a disaster, allocate resources more effectively during a disaster, and make more informative risk maps after a disaster for future precaution and preparedness.
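A minimal sketch of the accumulated-cost idea behind ARCSM follows: a Dijkstra-style traversal over a cost raster, growing outward from the shelter cells. The 4-connectivity and the averaged edge cost are simplifying assumptions, not the exact GIS implementation used in the study.

```python
import heapq
import numpy as np

def accumulated_cost_surface(cost, sources):
    """Dijkstra-style accumulated travel cost over a raster, in the spirit
    of ARCSM. cost[r, c] is the per-cell traversal cost (low on roads);
    sources is a list of (row, col) shelter cells. Returns the minimal
    accumulated cost from any shelter to every cell (4-connected)."""
    acc = np.full(cost.shape, np.inf)
    heap = []
    for rc in sources:
        acc[rc] = 0.0
        heap.append((0.0, rc))
    heapq.heapify(heap)
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > acc[r, c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < cost.shape[0] and 0 <= nc < cost.shape[1]:
                nd = d + 0.5 * (cost[r, c] + cost[nr, nc])  # averaged edge cost
                if nd < acc[nr, nc]:
                    acc[nr, nc] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return acc

# Illustrative 100 x 100 raster with one low-cost "road" row.
grid = np.ones((100, 100)) * 5.0
grid[50, :] = 1.0
surface = accumulated_cost_surface(grid, [(50, 0)])  # shelter at road end
```

Cells on low-cost road pixels accumulate cost slowly, so the resulting surface encodes accessibility to shelters, which can then be discretized as the cost predictor.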

Further, our method can combine domain knowledge with learning from spatial data to improve the performance of risk and vulnerability assessment. Given enough representative training samples, we can learn the interdependency relationships between the variables in MP, E, and HC of the framework and use domain knowledge to adaptively refine the relationships revealed.
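One way to encode such domain knowledge during structure learning is to constrain the search with forbidden edges. The sketch below uses pgmpy's hill-climbing search with a BIC score as a stand-in for the search algorithms of Section 3.4.1, over random placeholder data with one injected dependency; none of this reproduces the study's actual implementation.

```python
import numpy as np
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

# Random stand-in data with one injected dependency (pga -> damage).
rng = np.random.default_rng(3)
df = pd.DataFrame({v: rng.integers(0, 3, 500) for v in ["pga", "kdf", "st"]})
df["damage"] = np.clip(df["pga"] + rng.integers(0, 2, 500), 0, 2)

# Domain knowledge: damage cannot be a cause of hazard or environment.
blacklist = [("damage", v) for v in ["pga", "kdf", "st"]]
dag = HillClimbSearch(df).estimate(scoring_method=BicScore(df),
                                   black_list=blacklist)
print(sorted(dag.edges()))
```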

6. CONCLUSION

Given the uncertain nature of natural disaster events, we have developed a generic Bayesian data mining approach to the vulnerability assessment of human beings exposed to catastrophic risk. The method is applicable to most natural disasters. In our method, BN provides an integrative platform that combines a variety of information sources from different specialist fields and facilitates communication between domain specialists (e.g., experts in natural disasters, architects, engineers, and economists). BN also serves as a means of uncertainty analysis that propagates uncertainties from different sources. It is therefore well suited to the vulnerability assessment of natural disasters influenced by complex, multiple factors.

The use of spatial analysis and GIS makes it possible to derive key risk-related quantitative predictive factors, such as the influences of active faults and rivers, and to visualize uncertainties spatially. Our study case of seismic risk illustrates that different ground motions, environmental conditions (e.g., slope and soil), and characteristics of human behavior produce different vulnerability findings (Fig. 8). The estimated vulnerability is thus more objective and spatially explicit.

Our Bayesian mining approach is generic (Section 3), and it can be applied to other types of catastrophic risk, such as floods, typhoons, mudslides, and avalanches, among other catastrophes, by incorporating the relevant domain knowledge and research findings into the specification of the BN, as illustrated in our study case.

ACKNOWLEDGMENTS

This research was partially supported by grant 41171344/D010703 from the Natural Science Foundation of China, grant 2011AA120305-1 from the Hi-tech Research and Development Program (863) of China's Ministry of Science and Technology, and grant 2012CB955503 (Research on Identification of Susceptible Populations and Risk Regionalization for Climate Change and Health) from the National Basic Research Program (973) of China's Ministry of Science and Technology. We also thank the reviewers for their constructive suggestions and the editors for their careful checking and revisions, which further improved this article.

REFERENCES

1. Civil and Structural Groups of Tsinghua University, Xi'an Jiaotong University and Beijing Jiaotong University. Analysis on seismic damage of buildings in the Wenchuan earthquake. Journal of Building Structures (in Chinese), 2008; 29(4):1–9.

2. Paterson E, Re D, Wang Z. The 2008 Wenchuan Earthquake: Risk Management Lessons and Implications. Beijing: Risk Management Solutions, Inc., 2008.

3. Asian Development Bank. An Initial Assessment of the Impact of the Earthquake and Tsunami of December 26, 2004 on South and Southeast Asia. Metro Manila, Philippines, 2005.

4. Tamerius DJ, Wise KE, Uejio KC, McCoy LM, Comrie CA. Climate and human health: Synthesizing environmental complexity and uncertainty. Stochastic Environmental Research and Risk Assessment, 2007; 21(5):601–613.

5. Alexander D. Natural Disasters. New York: Chapman & Hall, 1993.

6. Arnold M, Chen RS, Deichmann U, et al. (eds). Natural Disasters Hotspots: Case Studies. Disaster Risk Management Series. Washington, DC: International Bank for Reconstruction and Development/World Bank, 2006.

7. Jiang H, Eastman JR. Application of fuzzy measures in multi-criteria evaluation in GIS. International Journal of Geographical Information Science, 2000; 14(2):173–184.

8. Shi P. Theory on disaster science and disaster dynamics. Natural Disasters (in Chinese), 2002; 11(3):1–9.

9. William JP, Arthur AA. Natural Hazard Risk Assessment and Public Policy. New York: Springer-Verlag, 1982.

10. Kloog I, Haim A, Portnov AB. Using kernel density functions as an urban analysis tool: Investigating the association between nightlight exposure and the incidence of breast cancer in Haifa, Israel. Computers, Environment and Urban Systems, 2009; 33:55–63.


11. Huang C. Risk Analysis of Natural Disasters. Beijing: Beijing Normal University Press, 2001.

12. Li L, Wang J, Wang C. Typhoon insurance pricing with spatial decision support tools. International Journal of Geographical Information Science, 2005; 19(3):363–384.

13. Anselin L. Spatial data analysis with GIS: An introduction to application in the social sciences. Technical Report 92-10. Systems Research, 1992; 321(8):1605–1609.

14. Goodchild M, Haining R, Wise S. Integrating GIS and spatial data analysis: Problems and possibilities. International Journal of Geographical Information Systems, 1992; (6):407–423.

15. Li L, Wang J, Leung H. Using spatial analysis and Bayesian network to model the vulnerability and make insurance pricing of catastrophic risk. International Journal of Geographical Information Science, 2010; 24(12):1759–1784.

16. Korb KB, Nicholson AE. Bayesian Artificial Intelligence. Boca Raton, FL: Chapman & Hall/CRC, 2004.

17. Amendola A, Ermoliev Y, Gitis VE, Koff G, Linnerooth-Bayer J. A systems approach to modeling catastrophic risk and insurability. Natural Hazards, 2000; 21:381–393.

18. Straub D. Natural hazards risk assessment using Bayesian networks. Pp. 2509–2516 in the 9th International Conference on Structural Safety and Reliability (ICOSSAR 05). Rome, Italy, 2005.

19. Tobler WR. Cellular Geography. Philosophy in Geography. Dordrecht: Reidel, 1979.

20. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag, 2001.

21. Miller HJ, Han J. Geographic Data Mining and Knowledge Discovery. London and New York: Taylor & Francis, 2001.

22. ESRI. ArcGIS Spatial Analyst: Advanced GIS Spatial Analysis Using Raster and Vector Data. New York: Environmental Systems Research Institute, 2001.

23. Torun A, Duzgun S. Using spatial data mining techniques to reveal vulnerability of people and places due to oil transportation and accidents: A case study of Istanbul Strait. Pp. 43–48 in International Archives of Photogrammetry, Remote Sensing, and Spatial Information Sciences (ISPRS), Technical Commission II Symposium. Vienna, 2006.

24. Varnakovida P, Messina PJ. Hospital Site Selection Analysis. Michigan: Michigan State University, 2006.

25. Silverman BW. Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall, 1986.

26. McCoy J, Johnston K. Using ArcGIS Spatial Analyst. Redlands: ESRI, 2001.

27. Fulton T, Kasif S, Salzberg S. Efficient algorithms for finding multi-way splits for decision trees. Pp. 244–251 in Proc. Twelfth International Conference on Machine Learning. San Francisco, CA: Kaufmann, 1995.

28. Fayyad U, Irani K. Multiple-interval discretization of continuous-valued attributes for classification learning. Pp. 1022–1027 in Thirteenth International Joint Conference on Artificial Intelligence. San Mateo, CA: Kaufmann, 1993.

29. Bouckaert RR. Bayesian belief networks: From construction to inference. Dissertation, Universiteit Utrecht, 1995.

30. Cooper FG, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 1992; 9:309–347.

31. Buntine W. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996; 8:196–210.

32. Friedman N, Geiger D. Bayesian network classifiers. Machine Learning, 1997; 29:131–163.

33. Chow CK, Liu CN. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 1968; 14(3):462–467.

34. Larranaga P, Poza M, Yurramendi Y, Murga HR, Kuijpers C. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996; 18(9):912–926.

35. Hoeting AJ, Madigan D, Raftery EA, Volinsky TC. Bayesian model averaging: A tutorial. Statistical Science, 1999; 14(4):382–417.

36. Hosmer D, Lemeshow S. Applied Logistic Regression, 2nd ed. John Wiley and Sons, 2000.

37. Cox AL. Risk Analysis: Foundations, Models and Methods. Norwell, MA: Springer, 2001.

38. Morales-Casique E, Neuman PS, Vesselinov VV. Maximum likelihood Bayesian averaging of airflow models in unsaturated fractured tuff using Occam and variance windows. Stochastic Environmental Research and Risk Assessment, 2010; 24(6):843–880.

39. Neuman PS. Maximum likelihood Bayesian averaging of uncertain model predictions. Stochastic Environmental Research and Risk Assessment, 2003; 17(5):291–305.

40. Bard P. Local effects of strong ground motion: Basic physical phenomena and estimation methods for microzoning studies. Laboratoire Central des Ponts et Chaussées and Observatoire de Grenoble.

41. Bayraktarli YY, Yazgan U, Dazio A, Faber HM. Capabilities of the Bayesian probabilistic networks approach for earthquake risk management. Pp. 1– in First European Conference on Earthquake Engineering and Seismology. Geneva, Switzerland, 2006.

42. Cornell CA. Engineering seismic risk analysis. Bulletin of the Seismological Society of America, 1968; (58):1583–1606.

43. Day RW. Geotechnical Earthquake Engineering Handbook. New York: McGraw-Hill, 2001.

44. Chen X, Qi W, Ye H. Fuzzy comprehensive study on seismic landslide hazard based on GIS. Acta Scientiarum Naturalium Universitatis Pekinensis, 2008; 44(3):434–438.

45. Jakob M. Morphometric and geotechnical controls of debris flow frequency and magnitude in southwestern British Columbia. Ph.D. thesis, University of British Columbia, Vancouver, 1996.

46. Algermissen ST, Perkins DM. A probabilistic estimate of maximum acceleration in rock in the contiguous United States. USGS, 1976.

47. McGuire RK. FRISK: A computer program for seismic risk analysis. Department of the Interior, Geological Survey, 1978.

48. Bayraktarli YY, Ulfkjaer J, Yazgan U, Faber HM. On the application of Bayesian probabilistic networks for earthquake risk management. In the 9th International Conference on Structural Safety and Reliability (ICOSSAR 05). Rome, Italy, 2005.

49. Guo Z, Chen X. Strategies Against Earthquakes for Cities (in Chinese). Beijing: Earthquake Press, 1992.

50. Kramer SL. Geotechnical Earthquake Engineering. New Jersey: Prentice Hall, 1996.

51. Yi Z. Forecast Methods of Seismic Disaster and Loss. Beijing: Geology Publisher, 1995.

52. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.

53. MathWorks. MatLab 2007 User Manual. MathWorks, 2007.

54. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco: Morgan Kaufmann, 1988.

55. Gutenberg B, Richter CF. Frequency of earthquakes in California. Bulletin of the Seismological Society of America, 1944; (34):185–188.