Construction of Microdata from a Set of Differentially Private Low-dimensional Contingency Tables through Solving Linear Equations with Tikhonov Regularization
Evercita C. Eugenio and Fang Liu∗
Department of Applied and Computational Mathematics and Statistics
University of Notre Dame, Notre Dame, IN 46556
August 6, 2019
Abstract
When individual-level data are shared for research and public use, they are often perturbed to provide some level of privacy protection. A simple way to perturb a high-dimensional data set, from which individual-level data can be easily generated with good utility, is to sanitize the full contingency table or full-dimensional histogram. However, it can be costly from the data storage and memory perspective to work with full tables. In addition, most of the observed signals in the high-order interactions among all attributes are likely just sample randomness rather than being of statistical significance and rarely of interest to practitioners. We introduce a new algorithm, CIPHER, which can reproduce individual-level data from a set of meaningful differentially private low-dimensional contingency (LDC) tables constructed from the original high-dimensional data, through solving a set of linear equations with the Tikhonov regularization. CIPHER is conceptually simple and requires no more than decomposing joint probabilities via basic probability rules to construct the equation set and subsequently solving linear equations. Compared to full table sanitization, the set of LDC tables that CIPHER works with has drastically lower requirements on data storage and memory. We run experiments to compare CIPHER with the full table sanitization and the multiplicative weighting exponential mechanism (MWEM), which can also be used to generate individual-level synthetic data given a set of LDC tables. The results demonstrate that CIPHER outperforms MWEM in preserving original information at the same privacy budget and converges to the full-table sanitization in utility as the sample data size or the privacy budget increases.
Keywords: differentially private data synthesis (DIPS), multiplicative weighting, sign and statistical significance (SSS), contingency tables, data storage and memory, Laplace mechanism
1 The research is funded by the National Science Foundation grants #1546373 and #1717417.
arXiv:1812.05671v2 [cs.LG] 5 Aug 2019
1 Introduction
1.1 Background and Motivation
When releasing data sets for research and public use, protection of individual private information while still maintaining good utility of the data is of extreme importance. Even with data anonymization, it is still possible for a data intruder to identify a subject in a released data set. For example, the Netflix Prize data set that contained anonymous movie ratings of 500,000 Netflix subscribers was used in conjunction with the public IMDB database to successfully identify individual Netflix users (Narayanan and Shmatikov, 2006, 2008), uncovering political preferences and other sensitive information of the movie raters in Netflix. Other recent re-identification cases include the Washington state data for health records (Sweeney, 2013), the New York City Taxi and Limousine Commission data (Tockar, 2014), and the Australian de-identified open health dataset (Culnane et al., 2017). These examples, together with other disclosure cases, have intensified the concerns on individual privacy and call for more rigorous and mathematically sound concepts and frameworks to protect individual privacy when releasing data.
Differential privacy (DP) provides a conceptual framework to bring rigorous mathematical guarantee for privacy protection without making strong or ad-hoc assumptions about the intruder’s background knowledge (Dwork et al., 2006; Dwork, 2008). There exist DP mechanisms for general query release such as the Laplace mechanism (Dwork et al., 2006), the exponential mechanism (McSherry and Talwar, 2007; McSherry, 2009), the median mechanism (Roth and Roughgarden, 2010), the Gaussian mechanism (Dwork et al., 2014; Liu, 2019), and the generalized Gaussian mechanism (Liu, 2019). There are also DP mechanisms for releasing specific statistical analyses, such as contingency tables (Barak et al., 2007), data cubes (Ding et al., 2011), empirical risk optimization (Chaudhuri et al., 2011), principal component analysis (Chaudhuri et al., 2012), high-dimensional regression (Kifer et al., 2012), graphs and social networks (Kasiviswanathan et al., 2013; Yan et al., 2016; Li et al., 2017), and deep learning (Shokri and Shmatikov, 2015; Abadi et al., 2016), among others.
One of the applications of DP is to generate differentially private individual-level synthetic data for release. Compared to releasing differentially private queries upon request, which is both burdensome for data curators and practically unsatisfactory for data users as the privacy budget can be quickly consumed with a limited number of queries, releasing differentially private individual-level data is more convenient for data curators and flexible for data users. On the other hand, differentially private data synthesis is not without limitation. First, some assumptions, whether data-dependent or data-independent, whether weak or strong, are often needed to generate synthetic data. Second, when synthetic data are large in size, it can be computationally costly to store them, especially when multiple sets are released as a way to account for the uncertainty introduced through the synthesis process.
In this paper, we propose a new data synthesis approach for multi-dimensional categorical data that does not have to rely on strong data-specific assumptions, nor does it have a high demand on data storage.
1.2 Related Work
A simple way that imposes minimal assumptions on the local data and is still able to perturb multi-dimensional categorical data while maintaining good utility is the sanitization of the full cross-tabulation, from which individual-level synthetic data can be easily generated. The approach is often used as a baseline to benchmark other differentially private methods for answering queries or generating synthetic data in terms of utility. Despite its simplicity and good utility, the full table sanitization does have some drawbacks. First, the full cross-tabulation among all attributes is likely to generate a lot of empty cells, and the highest-order interactions among the attributes and the observed signals in the full table are most likely just white noise and do not represent meaningful population-level signals of statistical significance. Second, it can be costly to store or release the full table when the original data set is of moderate to high dimension. For example, the full table among p = 10 attributes with 5 levels per attribute has 5^10 = 9,765,625 cells.
If the size of the set of the cell frequencies, from which synthetic data are generated, can be reduced without affecting the population-level signals contained in the original data to a meaningful degree, it would be welcomed from a data storage perspective. There exists some work along this line. Barak et al. (2007) use Fourier transforms and linear programming to generate differentially private individual-level synthetic data from the low-order contingency tables. Though consistency, non-negativity, and privacy are ensured, solving the linear program could be a bottleneck for this algorithm, especially when p is large. Hay et al. (2010) introduce the universal histogram approach that benefits the utility of low-order histograms, but at the expense of precision of the higher-order histogram. Chen et al. (2015) use a sampling-based framework to build attribute clusters, and the synthetic data are generated from the differentially private histograms formed by the attribute clusters. The formation of optimal attribute clusters is an NP-hard problem, and the authors introduce an approximation algorithm that does not guarantee optimality, particularly if there are any non-convexity issues. Zhang et al. (2014) introduce PrivBayes to differentially privately construct a Bayesian network, from which samples are taken to release. When p or the degree of the network is large, the construction of the differentially private Bayesian network can be time consuming. Liu (2016a) proposes the model-based approach (modips) to generate differentially private synthetic data in the Bayesian framework. The modips can be computationally intensive in the large p setting. In addition, both the PrivBayes and the modips are subject to mis-specification of the synthesis models, which would lead to a biased synthetic sample. Abowd and Vilhuber (2008) propose to generate differentially private categorical data from the Multinomial/Dirichlet model in the Bayesian framework. McClure and Reiter (2012) propose a slightly different approach to synthesize one-dimensional binary data. Machanavajjhala et al. (2008) demonstrate that the Multinomial-Dirichlet synthesizer leads to poor inferences due to data sparsity when it is applied to release the commuting patterns of the US population data. Bowen and Liu (2016) also show that both approaches have worse performance than the full table sanitization via the Laplace mechanism and the modips approach at the same privacy budget. Hardt et al. (2012) propose the iterative Multiplicative Weights via Exponential Mechanism (MWEM) approach to generate a differentially private empirical distribution given a set of linear queries. Though MWEM was originally proposed for obtaining differentially private queries rather than for data synthesis, synthetic data can be easily sampled from the differentially private empirical distribution. The MWEM algorithm achieves the near optimal bound on the l∞ error for the queries ∈ Q for an optimal number of iterations T. The downside of MWEM is that it is very sensitive to the choice of T, and choosing the optimal T can be challenging.
1.3 Our contributions
We propose a novel procedure, namely, Construction of Individual-level data from a set of differentially Private low-dimensional contingency tables tHrough solving linear Equations with Tikhonov Regularization (CIPHER), to generate differentially private empirical distributions which can be easily converted to the individual-level data or microdata.
Oftentimes the population-level signals in real-life data with categorical attributes are contained in a set of low-dimensional contingency tables. For example, suppose there are p = 6 attributes in the original data. Seldom is the 6-way interaction among all 6 attributes meaningful or of interest. Meaningful signals in the data might well be summarized in a set of low-dimensional contingency tables, for example, (X1, X2) ⊥⊥ (X1, X3) ⊥⊥ (X2, X3) ⊥⊥ (X3, X4, X5) ⊥⊥ X6. In addition, it is often the case that an attribute occurs in more than one of the low-dimensional tables, such as X1, X2 and X3 in this example. If no sanitization is involved, then fitting a log-linear model with these interaction terms (e.g., X1X2 + X1X3 + X2X3 + X3X4X5 + X6) to the data would lead to consistent estimates for the full-table cell probabilities and frequencies. However, due to the injection of the differentially private noise, the marginal counts, say those of X3, would become inconsistent across the three tables that involve X3. One would need a method that can automatically correct for the inconsistency in the marginals, a goal that CIPHER can achieve by solving a set of equations, without having to explicitly incorporate the constraints.
CIPHER is conceptually simple and requires nothing more than decomposing joint probabilities via basic probability rules to construct a linear equation set Ax = b and subsequently solving the linear equations. The computational cost for solving the equation set is expected to be low once the equation sets are constructed. Since A is block-diagonal, taking the inverse of A^T A + λI is relatively cheap even if the linear equation set is large. Compared to the full table sanitization to re-generate individual-level data with privacy, the set of LDC tables that CIPHER works with has drastically lower requirements for computer storage and memory. For example, compared to 9,765,625 cells resultant from the full table among 10 attributes with 5 levels each, there is a 95.4% and 99.99% reduction in the number of cells (down to 62,200 and 8,440, respectively) if the set of 210 four-way contingency tables or the set of 45 two-way contingency tables are used instead.
If the LDC tables are already given and differentially privately sanitized, then data users can apply CIPHER themselves to generate microdata for their analyses. During the whole CIPHER procedure, there is no probing or going back to the original data, thus DP is preserved. For data curators whose goal is to release microdata and who do not have the set of LDC tables yet, there are several options: first, to choose a set via a model selection procedure, which costs privacy budget per se; second, to leverage the domain knowledge to come up with a set without relying on the specific values of the data at hand; third, to be conservative and use high-order contingency tables that are still lower than the full dimension. The latter two approaches do not cost privacy budget, so all of it can be directed toward sanitizing the LDC table set.
The remainder of the paper is organized as follows. Section 2 reviews the basic concepts in DP and some differentially private mechanisms related to this work. Section 3 introduces the CIPHER procedure and proposes the SSS (Sign and Statistical Significance) assessment to evaluate the inferences based on differentially private synthetic data against the original inferences. Section 4 compares CIPHER with several other sanitization methods on the statistical utility of the synthetic data in simulated and real-life data. Section 5 provides some concluding remarks and discusses future research directions.
2 Preliminaries
Consider a data set D. A query/statistic or a set of queries/statistics f asks specific questions about D. DP provides a rigorous and robust mathematical conceptual framework to protect individual privacy information when releasing the query results f.
Definition 1 (ε-differential privacy (Dwork et al., 2006)). A randomized mechanism R satisfies ε-differential privacy if for all data sets $D_1$ and $D_2$ differing on one element and all result subsets S to query f,
$$e^{-\varepsilon} \le \frac{\Pr[R(f(D_1)) \in S]}{\Pr[R(f(D_2)) \in S]} \le e^{\varepsilon}.$$
ε is often referred to as the privacy budget and is pre-specified. The smaller ε is, the more privacy protection is imposed on the individuals in the data, in the sense that the probabilities of getting the same sanitized query results via R for D1 and D2 get more similar. The formulation of privacy via DP is robust and guards against the worst-case scenario, as it does not impose any assumptions about the behavior or the background knowledge of data intruders.
Definition 2 (sequential composition and parallel composition (McSherry, 2009)). Let q = 1, ..., K represent a set of queries on data D and ε be the total privacy budget. Denote by $M_q$ a randomization mechanism of $\varepsilon_q$-DP. The Sequential Composition states that the sequence of $M_q(D)$ provides $\left(\sum_q \varepsilon_q\right)$-DP. The Parallel Composition states that the sequence of $M_q(X \cap D_q)$ provides ε-DP if $\{D_q\}$ are arbitrary disjoint subsets of D.
The sequential composition and parallel composition principles are very useful for tracking and counting the privacy budget and for designing differentially private mechanisms.
There are a variety of mechanisms to provide differentially private results, as alluded to in Section 1. Here we mention two of them, the Laplace mechanism and the Exponential mechanism, which will be used in the experiments in Section 4.
Definition 3 (Laplace mechanism (Dwork et al., 2006)). The ε-differentially private Laplace mechanism generates the sanitized query result as $f^*(D) = f(D) + \mathrm{Lap}(\Delta f/\varepsilon)$, where $\Delta f = \max_{D_1, D_2} \|f(D_1) - f(D_2)\|_1$ is the $l_1$ global sensitivity of query f, for all $D_1, D_2$ differing in one element.
The larger ∆f is, the more noise would be injected to f(D) to satisfy ε-DP. Generalizations of the Laplace mechanism include the Gaussian mechanism and the Generalized Gaussian mechanism that is built upon the lp norm (p ≥ 1) (Dwork et al., 2014; Liu, 2019), among others.
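As a minimal sketch of Definition 3, the snippet below adds Laplace noise with scale ∆f/ε to each component of a numeric query result. The function names, the use of Python's standard `random` module, and the sensitivity value passed by the caller are assumptions made for illustration and are not taken from the paper.

```python
import random

def laplace_noise(scale, rng):
    """One draw from Lap(0, scale), built as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(values, epsilon, sensitivity, rng=None):
    """Sanitize a list of query results with noise of scale sensitivity/epsilon.

    `sensitivity` is the l1 global sensitivity of the whole query vector;
    the caller is assumed to supply the correct value for the query at hand.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

# Hypothetical usage: sanitize three cell counts with epsilon = 0.5.
noisy_counts = laplace_mechanism([12, 30, 7], epsilon=0.5, sensitivity=1.0,
                                 rng=random.Random(2019))
```

A smaller ε or a larger sensitivity increases the scale, which matches the remark that a larger ∆f requires more noise.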
Definition 4 (exponential mechanism (McSherry and Talwar, 2007)). Let u be a utility function that assigns a score to each possible output of a query to data D. The Exponential mechanism that satisfies ε-DP releases query result $f^*(D)$ with probability
$$\frac{\exp\left(u(f^*(D); D)\,\frac{\varepsilon}{2\delta_u}\right)}{\int \exp\left(u(f^*(D); D)\,\frac{\varepsilon}{2\delta_u}\right) d(f^*(D))},$$
where $\delta_u$ is the maximum change in score u with one element change in data D.
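To make Definition 4 concrete, here is a small sketch for the special case of a finite set of candidate outputs, where the integral in the denominator becomes a sum. The function names and the candidate/utility interface are hypothetical; only the weighting exp(u · ε / (2δu)) comes from the definition.

```python
import math
import random

def exp_mech_probabilities(scores, epsilon, delta_u):
    """Selection probabilities proportional to exp(score * epsilon / (2 * delta_u))."""
    top = max(scores)  # subtracting the max avoids overflow and cancels on normalizing
    weights = [math.exp(epsilon * (s - top) / (2.0 * delta_u)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(candidates, utility, epsilon, delta_u, rng=None):
    """Release one candidate, drawn with the probabilities above."""
    rng = rng or random.Random()
    probs = exp_mech_probabilities([utility(c) for c in candidates], epsilon, delta_u)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical usage: prefer outputs close to 2, with utility sensitivity 1.
released = exponential_mechanism([0, 1, 2, 3], lambda c: -abs(c - 2),
                                 epsilon=1.0, delta_u=1.0, rng=random.Random(7))
```

Outputs with higher utility are exponentially more likely to be released, and the preference sharpens as ε grows.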
3 CIPHER
We propose the CIPHER method to generate differentially private full tables and individual-level synthetic data from a set of LDC tables. As mentioned in Section 1, the main motivation for the development of CIPHER is the reduction of the query size to save on data storage, leveraging the common knowledge that high-order interactions in the full cross-tabulation are often meaningless and not worth preserving. Figure 1 shows the drastic reduction in the number of cells that need to be stored if the sets of 1-way, 2-way, 3-way, and 4-way LDC tables are used in place of the full table for varying p (the number of attributes in the original data). The order of the LDC tables used for getting the full table is allowed to grow with p, but again interactions of very high order are rarely of interest in real-life data and are also hard to explain as well as analytically and computationally challenging.
[Figure 1 has three panels, titled 2^p, 3^p, and {2,3,4,5}^p. Each panel plots log2(number of cells) against dimension p (from 5 to 20) for the full table and for the sets of 4-way, 3-way, 2-way, and 1-way tables.]
Figure 1: log(Number of stored cells) for FHD and sets of LDC tables of various dimension vs p
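The kind of comparison plotted in Figure 1 can be sketched in a few lines. The code below assumes that "the set of k-way tables" means all C(p, k) tables over k distinct attributes, each stored in full; that reading, and the function names, are assumptions of this sketch rather than definitions from the paper.

```python
from itertools import combinations
from math import log2, prod

def full_table_cells(levels):
    """Number of cells in the full cross-tabulation of all attributes."""
    return prod(levels)

def kway_set_cells(levels, k):
    """Total cells over every k-way table (assumed: all C(p, k) tables are kept)."""
    return sum(prod(subset) for subset in combinations(levels, k))

# Binary attributes, as in the first panel of Figure 1.
for p in (5, 10, 15, 20):
    levels = [2] * p
    lowdim = [round(log2(kway_set_cells(levels, k)), 1) for k in (1, 2, 3, 4)]
    print(p, round(log2(full_table_cells(levels)), 1), lowdim)
```

Passing a list of unequal level counts, for example `[2, 3, 4, 5]`, covers the mixed-level setting of the third panel under the same assumption.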
3.1 Method and Algorithm
The CIPHER algorithm is presented in Algorithm 1, followed by some remarks about the algorithm. In brief, the CIPHER procedure starts from the lowest-order contingency table(s) in a given set of LDC tables Q and arrives at a solution of the differentially private full table using a stepwise approach, without a need for complex sampling algorithms. The LDC tables in Q, which do not have to be of the same dimension, are expected to capture the important signals and relationships among the attributes in the original data. Two special cases of Q are the single p-way full table and the set of p one-way contingency tables, respectively. Forming Q can be guided by the domain knowledge without having to consume the information, and thus the privacy, of the current data. If the domain knowledge is not available or the data curator prefers to choose a set using the information of the current data, then the total privacy budget will need to be divided between the selection of Q and the CIPHER algorithm itself. In the rest of the discussion, we assume Q is preset before the application of the CIPHER algorithm.
Algorithm 1 CIPHER
1: INPUT: original data D (n × p); query set Q; privacy budget ε; number of synthetic data sets m (Remark 1); Tikhonov regularization constant λ (Remark 2).
2: Denote the lowest dimension of the LDC tables ∈ Q by $p_0$.
3: FOR l = 1, ..., m
4:   Sanitize all queries ∈ Q via a mechanism of ε-DP (e.g., $\tilde{q}_k^{(l)} = q_k + \mathrm{Lap}(0, \varepsilon/(m|Q|))$ for k = 1, ..., |Q| if the Laplace mechanism is used, i.e., with a privacy budget of ε/(m|Q|) per query).
5:   FOR j = $p_0$ + 1, ..., p
6:     List all j-way contingency tables $T_j$.
7:     FOR each query $q_i \notin (T_{j+1} \cap Q)$, run the 5 steps below.
8:       1) Denote the set of variables that form query $q_i$ by $X_i$ and $p_i = |X_i|$.
9:       2) Randomly pick a variable out of $X_i$. WLOG, denote that variable by $X_{i1}$, and the rest of the variables by $X_{i2}, \ldots, X_{i,p_i}$. Denote the number of levels in $X_{ik}$ by $K_{ik}$ for k = 1, ..., $p_i$.
10:      3) For k = 2, ..., ($p_i$ − 1), define $\mathbf{b}_k = \Pr(X_{i1} \neq K_{i1} \mid X_i \setminus (X_{i1}, X_{ik})) = \sum_{X_{ik}} \Pr(X_{i1} \neq K_{i1}, X_{ik} \mid X_i \setminus (X_{i1}, X_{ik})) = \mathbf{A}_k \mathbf{z}_k = \sum_{X_{ik}} \Pr(X_{ik} \mid X_i \setminus (X_{i1}, X_{ik})) \Pr(X_{i1} \neq K_{i1} \mid X_i \setminus (X_{i1}, X_{ik}), X_{ik})$, where $\mathbf{z}_k$ is the conditional probability of ($X_{i1} \neq K_{i1}$) given the rest of the variables in $X_i$, $\mathbf{A}_k$ is either observed or calculated from step j − 1, and ($X_{i1} \neq K_{i1}$) represents the vector ($X_{i1} = 1, \ldots, X_{i1} = K_{i1} - 1$).
11:      4) Let $\mathbf{b} = (\mathbf{b}_1, \ldots, \mathbf{b}_{p_i-1})^T$, $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_{p_i-1})^T$, and $\mathbf{A} = \mathrm{Diag}\{\mathbf{A}_1, \ldots, \mathbf{A}_{p_i-1}\}$; solve for z from Az = b with the Tikhonov regularization; that is, $\mathbf{z} = (\mathbf{A}^T\mathbf{A} + \lambda I)^{-1}\mathbf{A}^T\mathbf{b}$, where I is the identity matrix.
12:      5) Calculate the empirical probability for $q_i$: $\Pr(X_i) = \mathbf{z} \cdot \Pr(X_i \setminus X_{i1})$.
13:    END FOR
14:  END FOR
15:  Correct negativity and normalize the empirical joint probability $(\Pr(X))^{(l)} = (\Pr(X_1, \ldots, X_p))^{(l)}$ (Remark 3).
16:  Generate differentially private data $\tilde{D}^{(l)}$ of size n from $(\Pr(X))^{(l)}$.
17: END FOR
18: OUTPUT: m sets of differentially private data $\tilde{D}^{(1)}, \ldots, \tilde{D}^{(m)}$.
Remark 1 (number of synthetic data sets m). We recommend setting m at a small number > 1 if the released data will be used for statistical inferences. Releasing multiple sets offers a convenient way to account for the uncertainty and randomness introduced by the sanitization and synthesis procedures, coupled with proper inferential combination rules (Liu, 2016a). It is easy to implement in practice and can be viewed as a Monte Carlo approach to account for the sanitization and synthesis uncertainty. Though releasing a single set coupled with explicitly modeling the sanitization mechanism and the synthesis model can also help to accommodate the uncertainty, the modeling can be much more challenging analytically and computationally compared to releasing multiple sets. In addition, as long as m is not so large that the total privacy budget is spread too thin over the multiple sets (each synthetic set receives 1/m of the total privacy budget per the sequential composition theorem), the precision gained by averaging over m sets of synthetic data could outweigh the additional noise introduced by releasing multiple sets rather than a single set.
Remark 2 (Tikhonov regularization). The reason for using the Tikhonov regularization (aka the l2 regularization) to solve for z from Az = b is that the columns of A are linearly dependent and A^T A is not full rank. The Tikhonov regularization is known for solving ill-posed problems like Az = b when the solution z is not unique due to the singularity of A (Tikhonov, 1963; Tikhonov et al., 2013). It works by adding a small positive constant λ to the diagonal elements of A^T A and calculating z = (A^T A + λI)^{-1} A^T b. The constant λ is a tuning parameter. We found from the empirical studies that the solutions from CIPHER are relatively robust to the choice of λ and lead to similar joint distributions, except for some negligible numerical errors, as long as λ is relatively small (on the order of o(1)). Since A is block-diagonal, taking the inverse of A^T A + λI is relatively cheap computationally even if the linear equation set is large.
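The closed form in Remark 2 can be written directly with NumPy. This is a minimal sketch, assuming a dense matrix and using a linear solve in place of an explicit inverse; the function name and the default λ are illustrative choices and not values from the paper.

```python
import numpy as np

def tikhonov_solve(A, b, lam=1e-6):
    """Return z = (A^T A + lam * I)^{-1} A^T b without forming the inverse."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    gram = A.T @ A + lam * np.eye(A.shape[1])
    return np.linalg.solve(gram, A.T @ b)

# A rank-deficient toy system: one equation, two unknowns (z1 + z2 = 2).
z = tikhonov_solve([[1.0, 1.0]], [2.0])
```

In this toy case the plain normal equations are singular, while the regularized system has a unique solution that is close to the minimum-norm one, (1, 1).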
Remark 3 (correction of non-negativity and normalization). The cell probabilities in the differentially private LDC tables in Q can be < 0 or ≥ 1. In addition, the solutions for the conditional probabilities from the linear equations in CIPHER can also be < 0 or ≥ 1. We could correct for the negativity by the truncation or the boundary inflated truncation procedures (Liu, 2016b) and normalize the probabilities every time the sanitized or solved probabilities are outside [0, 1), or we could wait until the last step of generating the full table to make one overall correction. We compared both approaches and found that oftentimes the two led to similar results and the final overall correction in some cases led to better results. Given this and the fact that one correction is easier operationally than making multiple corrections during the CIPHER algorithm, we recommend users make one final correction when obtaining the joint distribution from the full table.
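A single final correction of the kind recommended in Remark 3 might look like the sketch below. It implements plain truncation at zero followed by normalization; the boundary inflated truncation of Liu (2016b) is not shown, and the uniform fallback for an all-zero table is an assumption of this sketch.

```python
def truncate_and_normalize(probs):
    """Set negative cell probabilities to 0, then rescale so the cells sum to 1."""
    clipped = [max(p, 0.0) for p in probs]
    total = sum(clipped)
    if total == 0.0:
        # Assumed fallback: no positive mass left, so return a uniform table.
        return [1.0 / len(clipped)] * len(clipped)
    return [p / total for p in clipped]

# Hypothetical solved cell probabilities, flattened into a list.
corrected = truncate_and_normalize([-0.1, 0.6, 0.7])
```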
If two or more LDC tables in Q share the same variable(s), then after the sanitization, the frequencies in the LDC tables formed by the shared variables would be inconsistent. For example, suppose table T1 in set Q is a 3-way table (V1, V2, V3) and table T2 is a 3-way table (V1, V2, V4). In the original data, the cell frequencies in the 2-way table (V1, V2) calculated from the two 3-way contingency tables would be the same, and so would be the cell frequencies in all the 1-way contingency tables. However, after noise is injected in the differentially private sanitization of T1 and T2, the bin counts in the table (V1, V2) calculated from T1 and T2 are no longer the same. Barak et al. (2007) transform the data into the Fourier domain, where adding noise will not violate consistency. However, this approach has a bottleneck in the linear programming when p is large. The CIPHER procedure does not have this issue with the way it solves for the empirical distributions. The inconsistency among the LDC tables in Q that share variables is automatically averaged out when solving the non-full-rank linear equation set with the Tikhonov regularization.
Claim 1. The CIPHER algorithm satisfies the ε-DP.
The satisfaction of DP in CIPHER is straightforward to establish. The only time at which the original data are probed during the application of CIPHER is when the queries in Q are sanitized, and the data are accessed mK times (K = |Q| is the number of queries) with a privacy budget of ε/(mK) per access. Per the sequential composition, the total privacy budget is maintained at (mK)ε/(mK) = ε.
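The budget accounting behind Claim 1 is simple arithmetic and can be checked as follows. The helper names and the example values of ε, m, and K are hypothetical.

```python
import math

def per_access_budget(epsilon, m, num_queries):
    """Budget spent each time one query is sanitized for one synthetic set."""
    return epsilon / (m * num_queries)

def sequential_total(budgets):
    """Sequential composition: the budgets of successive accesses add up."""
    return sum(budgets)

epsilon, m, K = 1.0, 5, 3
spent = [per_access_budget(epsilon, m, K)] * (m * K)   # m * K accesses in total
print(math.isclose(sequential_total(spent), epsilon))
```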
3.2 Example: Illustration of CIPHER in the 3-variable Case
We illustrate the CIPHER procedure with a simple example. Say the original data contain 3 variables (p = 3). Denote the 3 variables by V1, V2, V3 with K1, K2 and K3 levels, respectively. Let Q = {T(V1, V2), T(V2, V3), T(V1, V3)}, which contains all the 2-way contingency tables. Therefore, p0 = 2 in Algorithm 1. WLOG, suppose V3 is the variable denoted by Xi1 in Algorithm 1. We first find the relationships among the probabilities, which are
$$\begin{cases} \Pr(V_3 \mid V_1) = \sum_{V_2} \Pr(V_3, V_2 \mid V_1) = \sum_{V_2} \Pr(V_3 \mid V_1, V_2) \Pr(V_2 \mid V_1) \\ \Pr(V_3 \mid V_2) = \sum_{V_1} \Pr(V_3, V_1 \mid V_2) = \sum_{V_1} \Pr(V_3 \mid V_1, V_2) \Pr(V_1 \mid V_2) \end{cases}$$
We now convert the above relationships into the equation set b = Az. Specifically, $\mathbf{b} = (\Pr(V_3 \mid V_1) \setminus \Pr(V_3 = K_3 \mid V_1),\ \Pr(V_3 \mid V_2) \setminus \Pr(V_3 = K_3 \mid V_2))^T$ is a known vector of dimension (K1 + K2)(K3 − 1), $\mathbf{z} = \Pr(V_3 \mid V_1, V_2) \setminus \Pr(V_3 = K_3 \mid V_1, V_2)$ is of dimension K1K2(K3 − 1), and A is a known block-diagonal matrix with K3 − 1 identical blocks, where each block is a (K1 + K2) × (K1K2) matrix comprising the coefficients (i.e., Pr(V1|V2), Pr(V2|V1), or 0) associated with z. After z is solved from b = Az, the joint distribution Pr(V1, V2, V3) is calculated by z · Pr(V1, V2). The experiments in Section 4 contain more complicated applications of CIPHER.
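The 3-variable example can be sketched with NumPy as below. This is an illustration under stated assumptions and not the authors' implementation: the input tables are taken to be probability tables with positive margins (already sanitized in practice), the unknowns are ordered row-major over (V1, V2), the last level of V3 is obtained by subtraction, and the function name and default λ are hypothetical. Because all K3 − 1 blocks of A are identical, the sketch solves one block per level of V3 instead of assembling the full block-diagonal matrix.

```python
import numpy as np

def cipher_three_way(p12, p13, p23, lam=1e-8):
    """Solve for Pr(V3 | V1, V2) block by block and return the uncorrected
    joint table Pr(V1, V2, V3) with shape (K1, K2, K3)."""
    k1, k2 = p12.shape
    k3 = p13.shape[1]
    p2_given_1 = p12 / p12.sum(axis=1, keepdims=True)   # Pr(V2 | V1)
    p1_given_2 = p12 / p12.sum(axis=0, keepdims=True)   # Pr(V1 | V2)
    p3_given_1 = p13 / p13.sum(axis=1, keepdims=True)   # Pr(V3 | V1)
    p3_given_2 = p23 / p23.sum(axis=1, keepdims=True)   # Pr(V3 | V2)

    # One (K1 + K2) x (K1 K2) block of A; unknown index is v1 * K2 + v2.
    block = np.zeros((k1 + k2, k1 * k2))
    for a in range(k1):
        for b in range(k2):
            block[a, a * k2 + b] = p2_given_1[a, b]       # row for Pr(V3 | V1 = a)
            block[k1 + b, a * k2 + b] = p1_given_2[a, b]  # row for Pr(V3 | V2 = b)

    gram = block.T @ block + lam * np.eye(k1 * k2)        # A^T A + lam * I
    joint = np.zeros((k1, k2, k3))
    for c in range(k3 - 1):                               # levels 1, ..., K3 - 1
        rhs = np.concatenate([p3_given_1[:, c], p3_given_2[:, c]])
        z = np.linalg.solve(gram, block.T @ rhs).reshape(k1, k2)
        joint[:, :, c] = z * p12                          # z * Pr(V1, V2)
    joint[:, :, k3 - 1] = p12 - joint[:, :, :k3 - 1].sum(axis=2)
    return joint

# Noise-free toy tables derived from a known joint distribution.
base = np.arange(1.0, 13.0).reshape(2, 3, 2)
truth = base / base.sum()
est = cipher_three_way(truth.sum(axis=2), truth.sum(axis=1), truth.sum(axis=0))
```

With noise-free inputs the recovered table reproduces the three 2-way margins up to a small regularization error, although it need not equal the original joint because the equation set is underdetermined. With sanitized inputs, a final negativity correction and normalization as in Remark 3 would follow.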
3.3 Differences between CIPHER and MWEM
Both CIPHER and MWEM can work with a pre-specified set of linear queries to generate an empirical distribution, but they are methodologically and algorithmically different. First, MWEM relies on an iterative multiplicative weighting procedure, whereas CIPHER is not iterative but solves one or more sets of linear equations analytically to reach the differentially private empirical joint distribution among the p variables. Second, the queries in CIPHER are sanitized through a DP mechanism (say the Laplace sanitizer) before being fed into the algorithm, and they only need to be sanitized once. By contrast, each iteration in the MWEM algorithm incurs a privacy cost because it accesses the original data to fetch the query selected by the Exponential mechanism, which is subsequently sanitized by the Laplace mechanism. As a result, the two algorithms spend different amounts of privacy budget on a query for a given total privacy budget. Suppose the total budget is fixed at ε for both the CIPHER and MWEM algorithms, and the number of queries in Q is |Q|. If we use equal allocation of the privacy budget, then each query in Q gets a budget of ε/|Q| in the CIPHER algorithm. The sanitization of each query selected by the Exponential mechanism costs ε/(2T) in the MWEM algorithm. On the other hand, a query can be selected multiple times throughout the T iterations. Let $c_k$ denote the number of times $q_k \in Q$ is selected among the T iterations. Note that $\sum_{k=1}^{|Q|} c_k = T$. Unless $c_k/(2T) > |Q|^{-1}$, or equivalently $c_k / \sum_{k=1}^{|Q|} c_k > 2|Q|^{-1}$, the budget allocated to $q_k$ in the MWEM algorithm does not exceed that in CIPHER. In other words, the selection probability for a query needs to be more than double the average selection probability (1/|Q|) for the query to receive more privacy budget in the MWEM algorithm than in the CIPHER algorithm. Our own experiences from running the MWEM algorithm suggest that choosing the “right” number of iterations T for MWEM can be challenging. A T that is too small is not sufficient to allow the empirical distribution to fully capture the signals summarized in the queries; a T that is too large would lead to a large amount of noise being injected, as the privacy budget has to be distributed across the T iterations and each iteration costs privacy, eventually leading to a useless synthetic data set.
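The per-query budget comparison above reduces to two short formulas, sketched here. The function names and the example values of ε, |Q|, T, and the selection counts are hypothetical.

```python
import math

def cipher_query_budget(epsilon, num_queries):
    """Equal allocation in CIPHER: each query receives epsilon / |Q|."""
    return epsilon / num_queries

def mwem_query_budget(epsilon, times_selected, iterations):
    """MWEM: each Laplace sanitization costs epsilon / (2T); a query selected
    c_k times therefore accumulates c_k * epsilon / (2T)."""
    return times_selected * epsilon / (2 * iterations)

epsilon, num_queries, T = 1.0, 10, 20
for c_k in (2, 4, 5):   # c_k / T equal to 1/|Q|, 2/|Q|, and 2.5/|Q|
    print(c_k, mwem_query_budget(epsilon, c_k, T), cipher_query_budget(epsilon, num_queries))
```

Only the last case, in which the query is selected more than twice as often as the average 1/|Q|, gives the query a larger budget under MWEM than under CIPHER.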
4 Experiments
We run experiments with simulated and real-life data to evaluate CIPHER, and benchmark its performance against MWEM and the full table sanitization. We provide below the justification on the choice of these two methods to compare to CIPHER.
4.1 Methods for Comparison
The full table sanitization can be achieved through injecting independent Laplace noise drawn from $\mathrm{Lap}(0, \varepsilon^{-1})$ into the cell frequencies in the full table across all the attributes in a data set. Though technically there is only one query (a single histogram), the number of cells grows quickly with p (Figure 1), not to mention that a lot of cells in the full table are likely to be empty. From a statistical perspective, constructing the full table is equivalent to fitting a log-linear model with all possible interactions among all p attributes. Hay et al. (2016) (in answering 1D or 2D range queries) and Bowen and Liu (2016) show that the full table sanitization is likely to outperform or be similar to the more complex algorithms (e.g., modips, the Multinomial-Dirichlet synthesizer, DPcube, Privelet) in utility when the size of the query set is large or when n or the privacy budget is high. The flat Laplace sanitizer is therefore a useful baseline to benchmark against for other differentially private methods for generating queries or synthetic data, especially considering its simplicity for practical implementation.
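As a rough sketch of this baseline, the code below adds Lap(0, 1/ε) noise to every cell of the full table and then samples synthetic records from the truncated noisy counts. The record encoding (tuples of integer levels), the function names, and the uniform fallback are assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product

def sanitize_full_table(records, levels, epsilon, rng):
    """Add independent Lap(0, 1/epsilon) noise to each cell count of the full table."""
    counts = Counter(records)
    cells = list(product(*(range(k) for k in levels)))   # includes empty cells
    scale = 1.0 / epsilon
    noisy = [counts[cell] + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
             for cell in cells]
    return cells, noisy

def sample_synthetic(cells, noisy_counts, n, rng):
    """Truncate negative counts at 0, then draw n records in proportion to the counts."""
    weights = [max(c, 0.0) for c in noisy_counts]
    if sum(weights) <= 0.0:
        weights = [1.0] * len(cells)   # assumed fallback: sample cells uniformly
    return rng.choices(cells, weights=weights, k=n)

# Hypothetical data: two binary attributes, 100 records.
rng = random.Random(42)
data = [(0, 1)] * 50 + [(1, 0)] * 50
cells, noisy = sanitize_full_table(data, levels=(2, 2), epsilon=1.0, rng=rng)
synthetic = sample_synthetic(cells, noisy, n=len(data), rng=rng)
```

Every cell of the cross-tabulation is materialized, including the empty ones, which is the storage cost that motivates working with LDC tables instead.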
The MWEM algorithm achieves the near optimal bound on the l∞ error between the original and sanitized linear queries in Q. Though originally proposed for obtaining differentially private linear queries, the MWEM algorithm is ready for generating synthetic data, given that it outputs a differentially private empirical distribution and assuming the queries are representative of the population-level signals in the data. Since both the CIPHER and MWEM algorithms work with a pre-specified linear query set and MWEM achieves the optimal l∞ error on the query set, it makes sense to compare CIPHER to MWEM to see if CIPHER can beat MWEM in the l∞ error as well as on other utility metrics.
Though there exist other methods to generate synthetic data from a set of low dimensional queries in categorical data, the queries are often model-based (e.g., PrivBayes and MODIPS). Selection of these queries can be computationally costly, especially when the dimension of the data is high; and some of the queries used in these procedures are not linear or not straightforward to sanitize (e.g., regression coefficients from logistic regression).
All taken together, in the experiments below, we focus on the comparison between CIPHER and MWEM, using the full table sanitization as the baseline. We aim to show that CIPHER delivers better utility than MWEM, with much lower requirements on data storage than the full table sanitization.
4.2 The SSS assessment
When comparing the utility of synthetic data generated by CIPHER, MWEM, and the full table sanitization, we examine not only descriptive statistics such as the mean and the $l_p$ ($p > 0$) distance between the synthetic and the original data, but also how well information is preserved in statistical inferences on population parameters when hypothesis testing is involved. Toward that end, we propose the SSS assessment. The first S refers to the Sign of the estimated parameter, and the second and third S's refer to the Statistical Significance of the estimated parameter. The consistency in the sign and statistical significance of the parameter estimates based on the original and synthetic data leads to seven possible scenarios, as listed in Table 1.

Table 1: Preservation of Signs and Statistical Significance on the estimated parameters (the SSS assessment)

| parameter estimates | Best | Best | II+ | I+ | Neutral | II- | I- | Worst |
|---|---|---|---|---|---|---|---|---|
| matching Signs between original and synthetic? | Y | Y | Y | Y | N | N | N | N |
| Statistical Significance in original data | Y | N | Y | N | N | Y | N | Y |
| Statistical Significance in synthetic data | Y | N | N | Y | N | N | Y | Y |

The best scenario is when both the sign and the statistical significance of the parameter estimates from the original and synthetic data match; the worst-case scenario is that both estimates are statistically significant but with opposite signs, which entails detrimental consequences in practice. Between the two extremes, there are five other possibilities.
• II+ and I+ indicate an increase in the Type II and Type I error rates, respectively. In both cases the signs match, but for II+ the estimate goes from significant in the original data to non-significant in the synthetic data, resulting in an inflated Type II error rate; for I+ it goes from non-significant in the original data to significant in the synthetic data, resulting in an inflated Type I error rate.
• Neutral indicates that the sign changes between the original data and the synthetic data, but the estimate is not significant in either case.
• II- indicates a sign change together with a change from significant in the original data to non-significant in the synthetic data; I- indicates a sign change together with a change from non-significant in the original data to significant in the synthetic data.
For the synthetic data, we would want the probability of the best scenario to be high, followed by Neutral, II+, II-, I+, and I-, and hope the worst-case scenario has a close-to-0 probability of occurring. We apply the SSS assessment in the experiments to compare the inferences between the original data and the differentially private synthetic data.
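The seven SSS categories in Table 1 amount to a small decision rule, sketched below. The function name and argument layout are hypothetical, and an estimate of exactly 0 is treated as non-positive, which is an assumption made only so that the sign comparison is well defined.

```python
def sss_category(est_orig, sig_orig, est_syn, sig_syn):
    """Classify one parameter into an SSS category from Table 1.

    est_orig, est_syn: point estimates from the original and synthetic data.
    sig_orig, sig_syn: booleans, whether each estimate is statistically significant.
    """
    same_sign = (est_orig > 0) == (est_syn > 0)
    if same_sign:
        if sig_orig == sig_syn:
            return "Best"                       # sign and significance both match
        return "II+" if sig_orig else "I+"      # significance lost vs. gained
    if sig_orig and sig_syn:
        return "Worst"                          # both significant, opposite signs
    if not sig_orig and not sig_syn:
        return "Neutral"                        # sign flips, neither significant
    return "II-" if sig_orig else "I-"          # sign flips, significance lost vs. gained
```

Applying the function to every coefficient in every repetition and tabulating the labels gives the category proportions that the SSS assessment reports.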
4.3 Experiment 1: Simulated Data
In this experiment, we use simulated data to investigate the inferential properties and the utility of the sanitized data sets generated via CIPHER, and compare them with those from the MWEM algorithm and the full table sanitization.
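The simulation study examines a data scenario with 4 categorical variables, where $V_1$ and $V_2$ have 2 categories each and $V_3$ and $V_4$ have 3 categories each. The data were simulated via a sequence of multinomial logistic regression models. Specifically,

$V_1 \sim \text{Bernoulli}(0.5)$;

$V_2 \mid V_1$ was simulated from a logistic model

$$\text{logit}(\Pr(V_2 = 1 \mid V_1)) = \beta_0 + \beta_1 V_1$$

with $\beta_0 = 0.5$ and $\beta_1 = 1$;

$V_3 \mid V_1, V_2$ was simulated from a multinomial logistic model

$$\ln\left(\frac{\Pr(V_3 = 2 \mid V_1, V_2)}{\Pr(V_3 = 1 \mid V_1, V_2)}\right) = \beta_{01} + \beta_{11} V_1 + \beta_{21} V_2$$

$$\ln\left(\frac{\Pr(V_3 = 3 \mid V_1, V_2)}{\Pr(V_3 = 1 \mid V_1, V_2)}\right) = \beta_{02} + \beta_{12} V_1 + \beta_{22} V_2$$

with $\beta_{01} = -1$, $\beta_{11} = 2$, $\beta_{21} = 1$, $\beta_{02} = 0.5$, $\beta_{12} = 1$, $\beta_{22} = -1$;

$V_4 \mid V_1, V_2, V_3$ was simulated from a multinomial logistic model

$$\ln\left(\frac{\Pr(V_4 = 2 \mid V_1, V_2, V_3)}{\Pr(V_4 = 1 \mid V_1, V_2, V_3)}\right) = \beta_{01} + \beta_{11} V_1 + \beta_{21} V_2 + \beta_{31} \mathbf{1}(V_3 = 1) + \beta_{41} \mathbf{1}(V_3 = 2)$$

$$\ln\left(\frac{\Pr(V_4 = 3 \mid V_1, V_2, V_3)}{\Pr(V_4 = 1 \mid V_1, V_2, V_3)}\right) = \beta_{02} + \beta_{12} V_1 + \beta_{22} V_2 + \beta_{32} \mathbf{1}(V_3 = 1) + \beta_{42} \mathbf{1}(V_3 = 2)$$

with $\beta_{01} = 1.5$, $\beta_{11} = -1$, $\beta_{21} = 0.5$, $\beta_{31} = 1$, $\beta_{41} = -2$, and $\beta_{02} = 1$, $\beta_{12} = -1.5$, $\beta_{22} = -0.5$, $\beta_{32} = 0.75$, $\beta_{42} = -1$.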
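The data-generating sequence above can be sketched as follows. The coefficient values are those stated in the text; the coding of $V_1$ and $V_2$ as 0/1 and of $V_3$ and $V_4$ as 1, 2, 3 (with category 1 as the baseline of each multinomial model) is an assumption, as are the function names.

```python
import math
import random


def draw_category(log_odds, rng):
    """Draw a category in 1..K given the log-odds of categories 2..K versus category 1."""
    weights = [1.0] + [math.exp(eta) for eta in log_odds]
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=1)[0]


def simulate_record(rng):
    """Simulate one record (V1, V2, V3, V4) from the sequential models."""
    v1 = 1 if rng.random() < 0.5 else 0

    # Logistic model for V2 | V1.
    p_v2 = 1.0 / (1.0 + math.exp(-(0.5 + 1.0 * v1)))
    v2 = 1 if rng.random() < p_v2 else 0

    # Multinomial logistic model for V3 | V1, V2.
    v3 = draw_category([-1.0 + 2.0 * v1 + 1.0 * v2,
                        0.5 + 1.0 * v1 - 1.0 * v2], rng)

    # Multinomial logistic model for V4 | V1, V2, V3 with indicators of V3.
    i1, i2 = int(v3 == 1), int(v3 == 2)
    v4 = draw_category([1.5 - 1.0 * v1 + 0.5 * v2 + 1.0 * i1 - 2.0 * i2,
                        1.0 - 1.5 * v1 - 0.5 * v2 + 0.75 * i1 - 1.0 * i2], rng)
    return (v1, v2, v3, v4)


def simulate_data(n, rng):
    """Simulate n independent records."""
    return [simulate_record(rng) for _ in range(n)]
```

Calling `simulate_data(200, random.Random(1))` produces one data set of the smaller sample size used in the experiment.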
We examine two sample size scenarios, n = 200 and n = 500, each under five privacy budget scenarios ε = ($e^{-2}$, $e^{-1}$, 1, $e$, $e^{2}$). We run 1,000 repetitions for each combination of n and ε to investigate the stability of each method. For each of CIPHER, MWEM, and the full table sanitization, m = 5 synthetic data sets were generated so that the uncertainty of the synthesis model and the randomness introduced by the differentially private mechanisms can be properly accounted for. Each synthetic data set has the same sample size as the original data set.
For the Laplace sanitizer, the full table across the 4 variables contains 36 cells. Laplace noise was drawn from $\text{Lap}(0, (m\varepsilon)^{-1})$ and added to each of the 36 cell counts in the full table. For the CIPHER and MWEM algorithms, we consider two different query sets Q: (1) Q3 contains all 4 three-way contingency tables among the four variables, which leads to 32 cells (88.9% of the full table); (2) Q2 contains all 6 two-way contingency tables among the four variables, which leads to 20 cells (55.6% of the full table).
For the CIPHER algorithm, we sanitized all the contingency tables in Q2 or Q3 and followed the steps in Algorithm 1 to synthesize the individual-level data. We use CIPHER 3-way and CIPHER 2-way to denote the two cases, according to whether Q3 or Q2 is used. The linear equation sets in both cases are presented in the supplementary materials. For the MWEM algorithm, the starting distribution was set as the mutually independent categorical distribution with equal probability across all categories for each of the four variables. We run both MWEM 3-way (if Q3 is used as the query set) and MWEM 2-way (if Q2 is used as the query set). The number of iterations T can greatly affect the quality of the synthetic data. Since this is a simulation study, we were able to use independent simulated data from the same model to roughly optimize T for different ε and n; specifically, T = {5, 15, 25, 60, 120} at n = 200 and T = {10, 25, 50, 100, 200} at n = 500 for ε = {$e^{-2}$, $e^{-1}$, 1, $e$, $e^{2}$}, respectively.
[Figure 2 panels, for n = 200 and n = 500: TVD for One-Way Tables, TVD for Two-Way Tables, TVD for Three-Way Tables. x-axis: privacy budget ($e^{-2}$, $e^{-1}$, $e^{0}$, $e^{1}$, $e^{2}$); y-axis: Total Variation Distance (TVD); legend: CIPHER 3-way, CIPHER 2-way, MWEM 3-way, MWEM 2-way, full table sanitization (flat Laplace).]

Figure 2: Total Variation Distance (mean ± SD) on 1-way, 2-way and 3-way tables in Experiment 1
[Figure 3 panels: n = 200 and n = 500.]

Figure 3: The SSS (Signs and Statistical Significance) assessment on the estimated regression coefficients for n = 200 and n = 500

We run three types of analyses on the synthetic data. The first two analyses are descriptive and examine the ability of each method to recover the original information; the third is inferential, comparing analysis results between the synthetic and original data and examining the ability of the methods to preserve population-level information for statistical inference. Specifically, in the first analysis, the average total variation distance (TVD) between the original and synthetic data sets was calculated for the cell probabilities in all three-way, two-way and one-way tables, respectively. The TVD for the cell probabilities in a table is defined as $\|p - \bar{p}^*\|_1/2$, that is, half the sum of the absolute cell-wise differences, where $p$ and $\bar{p}^*$ represent the cell probabilities in the original data and those averaged over the m synthetic data sets, respectively; the TVDs were then averaged over all k-way tables for k = 1, 2, 3. In the second analysis, we examine the $l_\infty$ error for Q2 and Q3, respectively. MWEM is claimed to have the optimal $l_\infty$ error for the set of queries fed to the algorithm with an optimal T (Hardt et al., 2012). In the third analysis, we fitted the multinomial logistic model with $V_4$ as the outcome and $V_1$, $V_2$, $V_3$ as covariates. The inferences from the m = 5 synthetic data sets were combined using the combination rule in Liu (2016a). Specifically, the final point estimate for a parameter β is

$$\bar{\beta} = m^{-1} \sum_{j=1}^{m} \hat{\beta}^{(j)},$$

where $\hat{\beta}^{(j)}$ is the MLE of β in synthetic set j, and the variance is estimated by $V = m^{-1}B + W$, where

$$W = m^{-1} \sum_{j=1}^{m} v^2_{(j)}$$

is the average within-set variability, $v^2_{(j)}$ is the variance estimate of $\hat{\beta}^{(j)}$, and

$$B = (m-1)^{-1} \sum_{j=1}^{m} \left(\hat{\beta}^{(j)} - \bar{\beta}\right)^2$$

is the between-set variability. Inferences on β are based on the t-distribution $t_\nu(\bar{\beta}, V)$ with degrees of freedom $\nu = (m-1)(1 + mW/B)^2$. The bias, root mean square error (RMSE), coverage probability (CP) and confidence interval (CI) width of the 95% CI were determined for each of the regression coefficients from the multinomial logistic regression model. We also run the SSS assessment on the regression coefficients to evaluate the consistency between synthetic and original data in the inferences on the parameters.
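The combination rule above is straightforward to compute for a single scalar parameter. The sketch below follows the formulas as stated; the function name is hypothetical, and returning infinite degrees of freedom when B = 0 is an assumption added to avoid a division by zero.

```python
def combine_estimates(estimates, variances):
    """Combine per-set estimates of one parameter from m synthetic data sets.

    estimates: the m point estimates (one per synthetic set).
    variances: the m within-set variance estimates.
    Returns (beta_bar, V, nu): pooled estimate, total variance, degrees of freedom.
    """
    m = len(estimates)
    beta_bar = sum(estimates) / m
    W = sum(variances) / m                                      # average within-set variability
    B = sum((b - beta_bar) ** 2 for b in estimates) / (m - 1)   # between-set variability
    V = B / m + W
    # Assumed convention: infinite degrees of freedom if all estimates coincide.
    nu = (m - 1) * (1.0 + m * W / B) ** 2 if B > 0 else float("inf")
    return beta_bar, V, nu
```

A 95% CI then follows from the quantiles of the t-distribution with ν degrees of freedom, centered at the pooled estimate with scale $\sqrt{V}$.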
The results for the average TVD are presented in Figure 2. Between CIPHER and MWEM, MWEM produces a similar or smaller TVD compared to CIPHER when ε = $e^{-2}$, but is outperformed by CIPHER at ε > 1. There is not much difference between 3-way and 2-way CIPHER, or between 3-way and 2-way MWEM, in this analysis. The full table sanitization is the best performer overall, especially in the 3-way table case for ε ≥ $e^{-1}$. CIPHER and the full table sanitization deliver similar performance for 1-way and 2-way tables when ε ≥ $e$.
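The two descriptive metrics used in this experiment, the TVD between original and averaged synthetic cell probabilities and the $l_\infty$ error over a query set, can be sketched as below. The function names are hypothetical, and each table or query set is assumed to be flattened into a list with cells in the same order on both sides.

```python
def average_cells(tables):
    """Average cell values position-wise over m synthetic tables."""
    m = len(tables)
    return [sum(cells) / m for cells in zip(*tables)]


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def linf_error(original, sanitized):
    """Maximum absolute difference between original and sanitized query answers."""
    return max(abs(a - b) for a, b in zip(original, sanitized))
```

For a k-way analysis, `total_variation_distance` would be applied to each k-way table separately and the results averaged over all tables of that order.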
[Figure 4 panels, for n = 200 and n = 500: Two-Way Tables and Three-Way Tables. x-axis: privacy budget ($e^{-2}$, $e^{-1}$, $e^{0}$, $e^{1}$, $e^{2}$); y-axis: Maximum Absolute Difference; legend: CIPHER 3-way, CIPHER 2-way, MWEM 3-way, MWEM 2-way, full table sanitization.]

Figure 4: $l_\infty$ (mean ± SD) for Q2 and Q3 in Experiment 1
The results for the $l_\infty$ error over the prespecified query set are given in Figure 4. The performance of MWEM does not seem to live up to the claim that it has the optimal $l_\infty$ error for the set of queries fed to the algorithm with an optimal T. For example, for 3-way tables, this claim implies that MWEM 3-way would produce the smallest $l_\infty$ error, which is not the case in the results. This might be because T, which is not an easy hyper-parameter to tune, was not optimized precisely. In summary, the three methods are similar at ε = $e^{-2}$, but the Laplace sanitizer edges ahead as ε increases. CIPHER also outperforms MWEM when ε > $e^{-1}$.
The results for the SSS assessment on the regression coefficients from the logistic regression are provided in Figure 3. A method with the longest red bar (the best-case scenario) and the shortest purple bar (the worst-case scenario) would be preferable. The two inflated Type I error types (I+/yellow bar and I-/blue bar) would preferably be of low probability. The two inflated Type II error, or decreased power, types (II+/orange bar and II-/green bar) and Neutral (gray) are acceptable. Per these criteria, first, it is comforting to see that the undesirable cases (purple and blue bars) are the shortest among all 7 scenarios for each DIPS method; second, as expected, the inferences improve quickly with CIPHER and the full table sanitization, and rather slowly with MWEM, as ε increases; third, the full table sanitization is the best performer in preserving SSS, especially for medium values of ε, followed closely by CIPHER. Finally, even for CIPHER and the full table sanitization, there are always non-ignorable proportions of II+ (and of II- when ε is small), even when ε is as large as $e^{2}$, suggesting that the sanitization decreases the efficiency of the statistical inferences, which is the expected price paid for privacy protection.
[Figure 5 panels: rows show bias, Coverage Probability, rmse, and log(Confidence Interval Width); columns show ε = $e^{-2}$, $e^{-1}$, 1, $e^{1}$, $e^{2}$; x-axis: coefficients β01, β11, β21, β31, β41, β02, β12, β22, β32, β42; legend: CIPHER 3-way, CIPHER 2-way, MWEM 3-way, MWEM 2-way, full table sanitization, Original.]

Figure 5: Bias, root mean square error (rmse), coverage probability, and log(CI width) of 95% CI at n = 200 in Experiment 1
The results on the bias, RMSE, CP, and CI width are presented in Figures 5 and 6 for n = 200 and n = 500, respectively. First, between MWEM and CIPHER, CIPHER always delivers near-nominal CP across all examined ε and both n scenarios, while MWEM suffers severe under-coverage on some parameters. The two methods have similar bias when ε < 1; for ε > 1, the bias of CIPHER shrinks toward 0, especially for the 3-way CIPHER, while MWEM has bias of similar magnitude across all ε values. MWEM does, however, have the smallest RMSE and CI width for ε ≤ 1. The RMSE and CI width for CIPHER decrease quickly and approach the original values with increasing ε, whereas those associated with MWEM remain largely constant. Second, CIPHER delivers similar performance to the full table sanitization for ε < $e^{-1}$, but the latter has smaller bias, RMSE, and CI width for ε > $e^{-1}$. Similar to CIPHER, the full table sanitization always has near-nominal CP. Third, the performance of all the methods improves as n increases from 200 to 500 with respect to the bias, RMSE, CP and CI width.
[Figure 6 panels: rows show bias, Coverage Probability, rmse, and log(Confidence Interval Width); columns show ε = $e^{-2}$, $e^{-1}$, $e^{0}$, $e^{1}$, $e^{2}$; x-axis: coefficients β01, β11, β21, β31, β41, β02, β12, β22, β32, β42; legend: CIPHER 3-way, CIPHER 2-way, MWEM 3-way, MWEM 2-way, full table sanitization, Original.]

Figure 6: Bias, root mean square error (rmse), coverage probability and log(CI width) of 95% CI at n = 500 in Experiment 1.
4.4 Experiment 2: Company Bankruptcy Data
This experiment runs on a real-life qualitative bankruptcy data set. Qualitative bankruptcy data are often used for feature selection in bankruptcy prediction and to discover experts' decision rules on bankruptcy vs. non-bankruptcy given the qualitative attributes (Kim and Han, 2003; Tsai, 2009; Nagaraj and Sridhar, 2015). The data set used was collected to identify the qualitative risk factors associated with bankruptcy and is available for download from the UCI Machine Learning repository (Dheeru and Karra Taniskidou, 2017). The data contain n = 250 businesses and 7 variables (Table 2). Though the data set does not contain any identifiers, sensitive information (such as the bankruptcy evaluation or Credibility) may still be disclosed using the pseudo-identifiers left in the data (such as Industrial Risk or Competitiveness), or the data may be linked to other public data to trigger other types of information disclosure.
Table 2: Variables in the Bankruptcy Data

| Variable | Category (Frequency) |
|---|---|
| industrial risk (IR) | positive (80), average (89), negative (81) |
| management risk (MR) | positive (62), average (119), negative (69) |
| financial flexibility (FF) | positive (57), average (119), negative (74) |
| credibility (CR) | positive (79), average (94), negative (77) |
| competitiveness (CO) | positive (91), average (103), negative (56) |
| operating risk (OR) | positive (79), average (114), negative (57) |
| Class | bankruptcy (107), non-bankruptcy (143) |

When applying CIPHER and MWEM to the bankruptcy data, we first decided on the set of LDC tables Q to be sanitized. We selected Q based on domain knowledge and on the computational and analytical considerations when solving the linear equations. Specifically, we first created a 6-category Class/CR variable from the full cross-tabulation of Class and CR, both of which can be regarded as sensitive information and might be associated, and a 9-category IR/CO variable from the full cross-tabulation of IR and CO; we then applied CIPHER 2-way and MWEM 2-way to the 5 variables with 6 (Class/CR), 9 (IR/CO), 3 (OR), 3 (MR), and 3 (FF) levels, respectively. The size of Q (the number of counts) is thus 149, though technically speaking there are 10 sets of 2D histogram queries. After the synthetic data were generated, we decoupled the two combined variables (Class/CR and IR/CO), so the final synthetic data set still contains all 7 attributes as in the original data set. In terms of the original 7 attributes, the Q employed by the CIPHER and MWEM procedures contains one 4-way contingency table, six 3-way contingency tables, and three 2-way contingency tables. For the MWEM algorithm, we examine two iteration scenarios, T = 5 and T = 20, depending on the value of ε (T = 5 for small ε, from ∼0.14 to ∼0.37, and T = 20 for larger ε, from 1 to ∼2.72).
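The coupling and decoupling of two attributes described above can be illustrated with a small sketch. The level orderings, the integer coding, and the function names are all assumptions chosen for illustration; only the idea of mapping each (Class, CR) pair to one of 6 combined categories and back comes from the text.

```python
# Assumed level orderings; any fixed ordering works as long as it is used consistently.
CLASS_LEVELS = ("bankruptcy", "non-bankruptcy")
CR_LEVELS = ("positive", "average", "negative")


def couple(class_label, cr_label):
    """Map a (Class, CR) pair to a single code in 0..5."""
    return CLASS_LEVELS.index(class_label) * len(CR_LEVELS) + CR_LEVELS.index(cr_label)


def decouple(code):
    """Recover the (Class, CR) pair from a combined code in 0..5."""
    class_idx, cr_idx = divmod(code, len(CR_LEVELS))
    return CLASS_LEVELS[class_idx], CR_LEVELS[cr_idx]
```

The same pattern with two 3-level attributes gives the 9-category IR/CO variable; synthesis runs on the combined codes, and `decouple` restores the original attributes afterwards.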
For the full table sanitization, there are 1,458 cells in the cross-tabulation across the 7 attributes, which is about 10 times the number of cells for CIPHER and MWEM (149). Among the 1,458 cells, 1,355 are empty cells, which should be regarded as sample zeros, meaning that these cells are empty because of the finite sample size and are expected to change or disappear as the sample size increases or in a different sample data set. In other words, these sample-zero cells are part of the data and should be sanitized in the same way as the non-empty cells; otherwise, information about the raw data would be leaked. The same rule applies to CIPHER and MWEM when empty cells are encountered in Q.
We consider 4 privacy budget levels ε = ($e^{-2}$, $e^{-1}$, 1, $e$), and run 24 repetitions for each ε and each method to examine the stability of the methods. In each repetition, 5 synthetic data sets with n = 250 were generated. We ran a logistic regression model with “Class” as the outcome variable (bankruptcy vs. non-bankruptcy) and the other attributes as predictors, and a support vector machine (SVM) analysis to predict “Class” using the other attributes, both benchmarked against the original results. Understanding what predicts the bankruptcy status, and being able to predict the bankruptcy status with high accuracy, is what companies and banks would be interested in. In both analyses, the results from the 5 synthetic data sets were combined using the combination rule outlined in Liu (2016a).
In the logistic regression model, we examined the relationships of the 6 qualitative categorical covariates (IR, MR, FF, CR, CO, and OR) with the outcome variable Class to determine the odds of bankruptcy (Kim and Han, 2003). Each of the categorical covariates has three categories, and the “average” level of risk was used as the reference for each. Specifically, the model is

$$\log\left(\frac{P(\text{bankruptcy})}{1 - P(\text{bankruptcy})}\right) = \beta_0 + \beta_1 \text{IR}_N + \beta_2 \text{IR}_P + \beta_3 \text{MR}_N + \beta_4 \text{MR}_P + \beta_5 \text{FF}_N + \beta_6 \text{FF}_P + \beta_7 \text{CR}_N + \beta_8 \text{CR}_P + \beta_9 \text{CO}_N + \beta_{10} \text{CO}_P + \beta_{11} \text{OR}_N + \beta_{12} \text{OR}_P,$$

where the subscripts N and P denote the negative and positive levels of each covariate. The regression coefficients β and their variance estimates were obtained using the R package logistf, which implements Firth's bias-reduced penalized-likelihood logistic regression (Heinze and Ploner, 2016). We applied the SSS assessment to the estimated parameters. The results are presented in Figure 7. The figure suggests that all three DIPS methods performed well in the sense that the probability of producing a “bad” estimate (the worst, II-, and I- categories) was close to 0, and the estimates were most likely to land in the “best” or the “neutral” categories. The full table Laplace sanitizer had the largest chance of producing estimates in the “best” category for ε ≥ $e^{-1}$. MWEM, regardless of ε, had around a 50% probability of landing in the “best” category or in “neutral”. Overall, the three algorithms seemed to perform similarly per the SSS assessment.
Figure 7: The SSS assessment on the logistic regression coefficients in the bankruptcy data
Table 3: Accuracy (%) of Support Vector Machines (SVM) for Predicting “Class” in the bankruptcy data

| ε | CIPHER | MWEM | full table sanitization |
|---|---|---|---|
| $e^{-2}$ | 67.8 | 50.0 | 41.1 |
| $e^{-1}$ | 64.7 | 51.3 | 55.5 |
| 1 | 68.5 | 51.0 | 63.8 |
| $e^{1}$ | 77.8 | 47.2 | 85.7 |
The prediction accuracy with the original training data is 100%.
In the SVM analysis to classify Class and determine the bankruptcy status given the six qualitative risk attributes, we randomly split the original data into a training data set of 200 samples (80% of n = 250) and a testing data set of 50 samples (20% of n = 250). We then applied CIPHER, MWEM, and the full table Laplace sanitization to the training set only to generate synthetic data, on which the SVM was trained. The SVM trained with the synthetic data from each method was applied to make predictions on the same testing set. We ran 24 repetitions and, in each, generated 5 sets of synthetic data with 1/5 of the total privacy budget per set. The prediction accuracy rates averaged over the 5 sets and 24 repetitions are presented in Table 3. CIPHER is the obvious winner for ε ≤ 1, with substantially better prediction accuracy than the other two. When ε = $e$, the full table sanitization is the best with ∼86% accuracy, followed by CIPHER with ∼78% accuracy. Regardless of ε, MWEM has difficulty in classifying Class, with accuracy between 45% and 55% at all the examined ε.
5 Discussion
We proposed the CIPHER algorithm to release differentially private synthetic data sets given a set of LDC tables. We also proposed the SSS assessment to evaluate the utility of the synthetic data in hypothesis testing. We compared our algorithm with the full table sanitization and the MWEM algorithm in a simulation study and on a real-life qualitative bankruptcy data set. CIPHER delivers statistical inferences on population-level parameters similar to those from the full table sanitization when ε is relatively small or large, and is somewhat inferior to the latter for medium-sized ε (in the neighborhood of 1), while working with a significantly smaller set of sanitized statistics than the full table sanitization. Though MWEM, like CIPHER, can work with a small set of statistics, the utility of its synthetic data is in general not as good as that of CIPHER.
The asymptotic version of both CIPHER and MWEM is the full table sanitization, obtained when the LDC table set contains only one query, the full table. If Q comprises a set of LDC tables instead of the full table, both CIPHER and MWEM have additional sources of noise compared to the full table sanitization, beyond the noise introduced by the differentially private sanitizer, which move the synthetic data further away from the original data. For CIPHER, it is the shrinkage brought by the $l_2$ regularization; for MWEM, it is the numerical errors introduced through the iterative procedure with a hard-to-choose T.
We demonstrated the implementation of CIPHER for categorical data, but the algorithm can also be used on data with numerical attributes. Rather than taking a set of LDC tables as input, the input would become a set of low-dimensional histograms. This implies that the numerical attributes need to be cut into bins before CIPHER is applied. High-dimensional histograms with good statistical properties are difficult to construct (Scott, 2015), which poses additional challenges for the full table sanitization on top of the data storage issue. Low-dimensional histograms would be more desirable from a statistical perspective, in addition to the huge saving in data storage. CIPHER can be directly applied to the set of low-dimensional histograms, following the steps in Algorithm 1, to generate the empirical joint distribution among all the attributes. For any numerical attributes involved in the synthesized histograms, one can sample uniformly from the sanitized bins to “transform” the discretized values back to numerical values for these attributes.
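The final step mentioned above, mapping synthesized bin labels back to numerical values by uniform sampling within each bin, can be sketched as follows. The function name, the 0-based bin indexing, and the representation of bins by a sorted list of edges are assumptions made for illustration.

```python
import random


def bins_to_values(bin_indices, edges, rng):
    """Replace each synthetic bin index with a uniform draw from that bin.

    bin_indices: 0-based bin labels from the synthetic data (assumed encoding).
    edges:       sorted bin edges; bin b covers the interval [edges[b], edges[b + 1]].
    """
    return [rng.uniform(edges[b], edges[b + 1]) for b in bin_indices]
```

For example, with `edges = [0.0, 10.0, 20.0, 50.0]`, a synthetic record in bin 2 receives a value drawn uniformly between 20 and 50.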
For future work, we plan to investigate the theoretical properties of CIPHER in terms of accuracy under certain utility criteria. In addition, we plan to apply CIPHER to more data of higher dimensions, in terms of both the number of attributes and the number of levels per attribute, to see how CIPHER scales up in those cases.
References
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM.
Abowd, J. M. and Vilhuber, L. (2008). How protective are synthetic data? In Privacy in Statistical Databases, pages 239–246. Springer.
Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., and Talwar, K. (2007). Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 273–282. ACM.
Bowen, C. M. and Liu, F. (2016). Comparative study of differentially private data synthesis methods. arXiv preprint arXiv:1602.01063.
Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109.
Chaudhuri, K., Sarwate, A., and Sinha, K. (2012). Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems, pages 989–997.
Chen, R., Xiao, Q., Zhang, Y., and Xu, J. (2015). Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138. ACM.
Culnane, C., Rubinstein, B. I. P., and Teague, V. (2017). Health data in an open world. arXiv preprint arXiv:1712.05627v1.
Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
Ding, B., Winslett, M., Han, J., and Li, Z. (2011). Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 217–228. ACM.
Dwork, C. (2008). Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.
Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407.
Hardt, M., Ligett, K., and McSherry, F. (2012). A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pages 2339–2347.
Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., and Zhang, D. (2016). Principled evaluation of differentially private algorithms using dpbench. In Proceedings of the 2016 International Conference on Management of Data, pages 139–154. ACM.
Hay, M., Rastogi, V., Miklau, G., and Suciu, D. (2010). Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021–1032.
Heinze, G. and Ploner, M. (2016). logistf: Firth's Bias-Reduced Logistic Regression. R package version 1.22.
Kasiviswanathan, S. P., Nissim, K., Raskhodnikova, S., and Smith, A. (2013). Analyzing graphs with node differential privacy. In Theory of Cryptography, pages 457–476. Springer.
Kifer, D., Smith, A., and Thakurta, A. (2012). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1.
Kim, M.-J. and Han, I. (2003). The discovery of experts' decision rules from qualitative bankruptcy data using genetic algorithms. Expert Systems with Applications, 25(4):637–646.
Li, X., Yang, J., Sun, Z., and Zhang, J. (2017). Differential privacy for edge weights in social networks. Security and Communication Networks, 2017.
Liu, F. (2016a). Model-based differential private data synthesis. arXiv preprint arXiv:1606.08052.
Liu, F. (2016b). Statistical properties of sanitized results from differentially private laplace mechanisms with noninformative bounding. arXiv preprint arXiv:1607.08554.
Liu, F. (2019). Generalized gaussian mechanism for differential privacy. IEEE Transactions on Knowledge and Data Engineering, 31(4):747–756.
Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., and Vilhuber, L. (2008). Privacy: Theory meets practice on the map. IEEE ICDE IEEE 24th International Conference, pages 277–286.
McClure, D. and Reiter, J. P. (2012). Differential privacy and statistical disclosure risk measures: An investigation with binary synthetic data. Transactions on Data Privacy, 5(3):535–552.
McSherry, F. and Talwar, K. (2007). Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS'07. 48th Annual IEEE Symposium on, pages 94–103. IEEE.
McSherry, F. D. (2009). Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30. ACM.
Nagaraj, K. and Sridhar, A. (2015). A predictive system for detection of bankruptcy using machine learning techniques. arXiv preprint arXiv:1502.03601.
Narayanan, A. and Shmatikov, V. (2006). How to break anonymity of the netflix prize dataset. CoRR, abs/cs/0610105.
Narayanan, A. and Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE.
Roth, A. and Roughgarden, T. (2010). Interactive privacy via the median mechanism. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 765–774. ACM.
Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.
Shokri, R. and Shmatikov, V. (2015). Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310–1321. ACM.
Sweeney, L. (2013). Matching known patients to health records in washington state data. CoRR, abs/1307.1370.
Tikhonov, A. N. (1963). On the solution of ill-posed problems and the method of regularization. Doklady Akademii Nauk, 151(3):501–504.
Tikhonov, A. N., Goncharsky, A., Stepanov, V., and Yagola, A. G. (2013). Numerical methods for the solution of ill-posed problems, volume 328. Springer Science & Business Media.
Tockar, A. (2014). Riding with the stars: Passenger privacy in the nyc taxicab dataset. https://research.neustar.biz/author/atockar/.
Tsai, C.-F. (2009). Feature selection in bankruptcy prediction. Knowledge-Based Systems, 22(2):120–127.
Yan, S., Pan, S., Zhao, Y., and Zhu, W.-T. (2016). Towards privacy-preserving data mining in online social networks: Distance-grained and item-grained differential privacy. In Australasian Conference on Information Security and Privacy, pages 141–157. Springer.
Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D., and Xiao, X. (2014). Privbayes: Private data release via bayesian networks. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 1423–1434, New York, NY, USA. ACM.
Supplementary Materials for “Construction of Microdata from a Set of Differentially Private Low-dimensional Contingency Tables through Solving Linear Equations with Tikhonov Regularization” by Evercita C. Eugenio and Fang Liu

The supplementary materials contain additional simulation results and the derivation of the linear equation sets Ax = b for the three-variable and four-variable cases. Specifically, Tables 1 to 4 present the numerical values of the bias, RMSE, coverage probability, and confidence interval width for the results presented in Figure 2 for n = 200; Tables 5 to 8 give the corresponding numerical values for n = 500 in the simulation study; and Tables ?? and ?? present the ill-conditioned synthetic data sets in the simulation study. Section 2 includes the detailed derivation of Ax = b using several examples when p = 3 and p = 4, respectively. The four-variable case p = 4 is also what was used in the CIPHER algorithm for the simulation study.
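The four summary measures reported in Tables 1 to 8 can be sketched as follows. This is only an illustration under assumed textbook definitions (bias as the mean error, RMSE as the root of the mean squared error, coverage as the share of nominal 95% Wald intervals that contain the true value, and width as the mean interval length); the function name `summarize`, the normal critical value 1.96, and the toy inputs are hypothetical and are not taken from the paper, whose inferential procedure may differ.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z_crit=1.96):
    """Bias, RMSE, coverage (in %), and mean CI width over repeated runs.

    Assumes Wald-type intervals estimate +/- z_crit * standard error;
    this is an illustrative convention, not the paper's stated procedure.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    lower, upper = est - z_crit * se, est + z_crit * se
    return {
        "bias": float(np.mean(est - truth)),
        "rmse": float(np.sqrt(np.mean((est - truth) ** 2))),
        "cp": float(100.0 * np.mean((lower <= truth) & (truth <= upper))),
        "ci_width": float(np.mean(upper - lower)),
    }

# Toy inputs (made up for illustration): three repetitions, true value 1.0.
summary = summarize([0.9, 1.1, 1.3], [0.1, 0.1, 0.1], truth=1.0)
print(summary)
```

In this toy run two of the three intervals contain the true value, so the coverage comes out at two thirds.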
1 Additional Simulation Results

1.1 n = 200

Table 1: Simulation Results: Bias for n = 200

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way          -1.388   0.042   0.870   1.486  -0.013   0.315  -0.562  -0.894   1.593   0.946
e^-2  CIPHER Two-Way            -1.543   1.808   1.027   1.635  -0.295   0.257  -0.674  -0.856   1.918   0.843
e^-2  MWEM Three-Way (T=5)      -1.506   0.983  -0.503  -0.976   2.023  -1.005   1.476   0.476  -0.730   1.042
e^-2  MWEM Two-Way (T=5)        -1.465   1.035  -0.479  -1.030   1.968  -0.956   1.543   0.479  -0.793   0.944
e^-2  FDH Laplace Sanitizer     -1.206  -0.971   0.970   1.168   0.272   0.366  -0.190  -0.610   1.016   0.519
e^-1  CIPHER Three-Way          -0.980   0.084   0.906   1.214   0.101   0.095  -0.370  -0.828   0.904   0.380
e^-1  CIPHER Two-Way            -1.125   1.574   0.927   1.405   0.070   0.002  -0.413  -1.080   1.375   0.515
e^-1  MWEM Three-Way (T=15)     -1.521   1.035  -0.444  -0.995   1.932  -0.984   1.488   0.476  -0.738   0.962
e^-1  MWEM Two-Way (T=15)       -1.452   0.909  -0.321  -1.058   2.009  -0.989   1.400   0.534  -0.808   0.995
e^-1  FDH Laplace Sanitizer     -0.704  -0.529   0.680   0.847   0.401   0.256  -0.223  -0.644   0.332   0.276
e^0   CIPHER Three-Way          -0.728  -0.275   0.632   1.050   0.307   0.108  -0.307  -0.852   0.539   0.221
e^0   CIPHER Two-Way            -0.795   1.009   0.830   1.171   0.363  -0.086  -0.580  -1.544   0.694   0.045
e^0   MWEM Three-Way (T=25)     -1.497   0.962  -0.447  -0.965   1.986  -0.977   1.469   0.487  -0.743   1.009
e^0   MWEM Two-Way (T=25)       -1.372   0.985  -0.292  -0.915   1.920  -1.000   1.460   0.470  -0.720   0.955
e^0   FDH Laplace Sanitizer     -0.285  -0.191   0.403   0.630   0.316   0.140  -0.332  -0.623   0.028  -0.031
e^1   CIPHER Three-Way          -0.295  -0.213   0.464   0.688   0.224   0.099  -0.326  -0.593   0.095  -0.047
e^1   CIPHER Two-Way            -0.218   0.554   0.496   1.124   0.229  -0.172  -0.629  -1.400  -0.009  -0.469
e^1   MWEM Three-Way (T=60)     -1.431   0.958  -0.329  -0.841   1.845  -1.007   1.444   0.464  -0.665   1.022
e^1   MWEM Two-Way (T=60)       -1.030   0.971  -0.186  -0.832   1.695  -0.987   1.318   0.447  -0.775   0.713
e^1   FDH Laplace Sanitizer     -0.077  -0.023   0.131   0.246   0.204   0.049  -0.226  -0.346  -0.099  -0.067
e^2   CIPHER Three-Way           0.060   0.028   0.136   0.338   0.147   0.055  -0.208  -0.390  -0.196  -0.152
e^2   CIPHER Two-Way             0.206   0.455   0.385   1.053  -0.056  -0.148  -0.812  -1.216  -0.326  -0.605
e^2   MWEM Three-Way (T=120)    -1.388   0.982  -0.194  -0.672   1.839  -0.910   1.354   0.433  -0.693   0.922
e^2   MWEM Two-Way (T=120)      -0.734   1.013   0.107  -0.767   1.448  -0.977   1.148   0.403  -0.901   0.818
e^2   FDH Laplace Sanitizer      0.072   0.064   0.012   0.056   0.062   0.004  -0.094  -0.138  -0.130  -0.106
Table 2: Simulation Results: Root Mean Square Error (RMSE) for n = 200

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way           3.335   2.917   3.187   3.282   2.934   2.821   3.395   3.239   3.817   3.400
e^-2  CIPHER Two-Way             3.303   3.186   2.689   2.777   2.513   2.275   3.090   2.767   3.821   2.903
e^-2  MWEM Three-Way (T=5)       1.560   1.043   0.629   1.064   2.068   1.086   1.523   0.637   0.837   1.133
e^-2  MWEM Two-Way (T=5)         1.744   1.271   0.899   1.290   2.103   1.338   1.726   0.908   1.121   1.280
e^-2  FDH Laplace Sanitizer      2.829   3.138   2.424   2.772   2.067   2.558   2.750   3.296   3.172   3.371
e^-1  CIPHER Three-Way           2.695   2.494   2.454   2.570   2.363   2.244   2.796   3.085   3.074   2.934
e^-1  CIPHER Two-Way             2.824   2.895   2.314   2.514   2.248   2.076   2.512   2.576   2.974   2.345
e^-1  MWEM Three-Way (T=15)      1.904   1.496   1.011   1.441   2.238   1.443   1.723   1.118   1.345   1.510
e^-1  MWEM Two-Way (T=15)        2.606   2.090   1.933   2.201   2.811   2.239   2.226   1.842   2.066   2.084
e^-1  FDH Laplace Sanitizer      1.972   2.201   1.445   1.796   1.375   1.458   2.000   2.583   2.110   2.323
e^0   CIPHER Three-Way           1.837   1.841   1.516   1.858   1.504   1.553   1.890   2.284   1.999   1.909
e^0   CIPHER Two-Way             2.122   2.165   1.610   1.869   1.767   1.630   1.788   2.479   2.004   1.911
e^0   MWEM Three-Way (T=25)      1.639   1.130   0.772   1.222   2.095   1.172   1.575   0.773   1.017   1.212
e^0   MWEM Two-Way (T=25)        2.037   1.505   1.188   1.767   2.348   1.959   1.949   1.333   1.626   1.742
e^0   FDH Laplace Sanitizer      1.185   1.351   0.742   1.048   0.727   0.806   1.321   1.745   1.216   1.403
e^1   CIPHER Three-Way           1.157   1.313   0.817   1.148   0.797   0.883   1.291   1.661   1.171   1.359
e^1   CIPHER Two-Way             1.308   1.485   0.883   1.458   1.031   1.104   1.384   2.025   1.252   1.430
e^1   MWEM Three-Way (T=60)      1.825   1.371   1.068   1.437   2.193   1.534   1.745   1.061   1.477   1.583
e^1   MWEM Two-Way (T=60)        2.488   2.103   1.785   2.218   2.614   2.717   2.423   2.069   2.397   2.312
e^1   FDH Laplace Sanitizer      0.754   0.870   0.495   0.640   0.544   0.595   0.875   1.106   0.788   0.883
e^2   CIPHER Three-Way           0.945   1.020   0.557   0.708   0.566   0.605   1.089   1.273   0.987   1.019
e^2   CIPHER Two-Way             1.014   1.133   0.580   1.182   0.604   0.692   1.270   1.609   1.025   1.184
e^2   MWEM Three-Way (T=120)     1.826   1.361   0.978   1.403   2.214   1.534   1.698   1.121   1.445   1.594
e^2   MWEM Two-Way (T=120)       2.282   1.911   1.800   2.084   2.425   2.855   2.255   2.271   2.614   2.358
e^2   FDH Laplace Sanitizer      0.716   0.786   0.480   0.574   0.500   0.571   0.793   0.930   0.738   0.791
Table 3: Simulation Results: Coverage Probability (CP, in %) for n = 200

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way            97.1    99.4    98.6    97.1    99.8    99.6    99.1    99.4    97.2    98.3
e^-2  CIPHER Two-Way              97.1    99.9    98.7    94.0    99.8    99.6    99.0    98.8    96.2    98.2
e^-2  MWEM Three-Way (T=5)        51.0    91.6    99.7    95.7    18.4    92.9    43.3    99.8    99.1    93.9
e^-2  MWEM Two-Way (T=5)          70.5    94.1    99.7    95.8    32.0    95.3    56.1    99.9    99.4    96.6
e^-2  FDH Laplace Sanitizer       90.6    97.7    93.5    90.0    99.3    98.9    98.2    98.5    93.2   100.0
e^-1  CIPHER Three-Way            95.2    99.2    97.2    94.2    99.7    99.5    99.2    99.6    96.7    99.3
e^-1  CIPHER Two-Way             100.0   100.0    97.4    93.5    99.7    99.7   100.0   100.0   100.0   100.0
e^-1  MWEM Three-Way (T=15)       80.2    92.3    99.2    96.1    62.1    95.1    76.8    99.2    97.7    95.5
e^-1  MWEM Two-Way (T=15)         93.5    96.8    99.6    98.0    81.1    97.1    86.6    99.6    99.4    98.5
e^-1  FDH Laplace Sanitizer      100.0    99.1   100.0    92.7   100.0    99.4   100.0    99.0   100.0    99.0
e^0   CIPHER Three-Way            92.7    98.4    94.4    92.4    99.5    99.2    98.9    97.6   100.0    99.1
e^0   CIPHER Two-Way              95.4    99.7    95.2    90.7    99.8    99.2    99.1    94.6    95.8    99.2
e^0   MWEM Three-Way (T=25)       74.6    92.5    99.1    93.8    48.7    93.9    70.3    99.5    98.0    94.4
e^0   MWEM Two-Way (T=25)         88.0    95.6    99.7    97.7    72.0    96.3    82.5    99.0    99.1    97.5
e^0   FDH Laplace Sanitizer       97.2    99.4    97.6    95.2    99.9    99.8    99.1    99.1    98.5    99.8
e^1   CIPHER Three-Way            97.7    99.0    98.4    95.2    99.4    99.7    99.4    99.0    98.5    99.3
e^1   CIPHER Two-Way              98.4    99.0    97.8    86.7    99.4    99.1    99.7    94.4    99.1    99.3
e^1   MWEM Three-Way (T=60)       85.8    92.3    98.1    96.8    75.8    95.3    81.8    99.0    98.3    95.2
e^1   MWEM Two-Way (T=60)         94.7    96.3    99.6    99.0    88.6    98.3    92.0    99.4    99.2    98.7
e^1   FDH Laplace Sanitizer       99.6    99.8    99.9    99.0    99.7    99.9    99.7    99.3    99.8    99.7
e^2   CIPHER Three-Way            99.3    99.5    99.4    98.8    99.7    99.6    99.5    99.0    99.4    99.4
e^2   CIPHER Two-Way              99.7    99.8    99.8    87.1    99.6    99.9    99.1    94.7    99.5    99.7
e^2   MWEM Three-Way (T=120)      86.3    92.7    99.3    96.5    75.7    95.1    83.3    98.9    98.1    95.4
e^2   MWEM Two-Way (T=120)        94.9    95.4    99.2    98.8    92.2    99.0    95.3    99.4    99.1    99.0
e^2   FDH Laplace Sanitizer       99.0    99.1    99.7    99.1    99.5    99.0    98.9    98.1    99.0    99.1
Table 4: Simulation Results: Confidence Interval Widths for n = 200

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way          24.176  21.987  23.682  22.008  23.293  21.607  28.030  25.689  28.136  25.645
e^-2  CIPHER Two-Way            24.620  19.035  21.406  17.112  22.118  17.504  26.016  19.960  27.290  20.544
e^-2  MWEM Three-Way (T=5)       3.332   3.010   3.061   3.458   3.517   3.265   3.036   3.072   3.409   3.468
e^-2  MWEM Two-Way (T=5)         6.025   4.688   4.831   5.509   5.395   5.818   4.746   4.469   5.434   5.764
e^-2  FDH Laplace Sanitizer     16.386  19.636  11.470  14.321  11.276  13.858  18.846  24.209  20.402  25.741
e^-1  CIPHER Three-Way          19.170  18.475  16.696  17.197  18.062  17.335  22.348  22.543  23.617  22.110
e^-1  CIPHER Two-Way            24.273  21.928  15.827  14.133  16.697  14.463  23.847  22.702  25.324  22.085
e^-1  MWEM Three-Way (T=15)      6.650   5.467   5.487   6.988   7.333   6.661   5.664   5.752   7.171   7.359
e^-1  MWEM Two-Way (T=15)       17.089  13.685  13.742  16.892  16.988  16.942  13.556  13.631  17.113  17.442
e^-1  FDH Laplace Sanitizer     27.574  13.144  26.466   8.505  26.516   8.388  28.537  17.785  28.720  15.008
e^0   CIPHER Three-Way          10.794  12.329   8.030   9.579   8.939   9.990  12.403  15.362  19.794  13.984
e^0   CIPHER Two-Way            13.625  12.408   8.204   9.055  10.573   9.994  11.596  12.791  13.135  12.253
e^0   MWEM Three-Way (T=25)      4.411   3.852   3.970   4.631   4.843   4.200   3.850   3.863   4.358   4.604
e^0   MWEM Two-Way (T=25)       10.489   8.184   7.596  10.348  10.311  11.967   8.854   8.608  11.163  11.726
e^0   FDH Laplace Sanitizer      6.101   7.461   3.563   4.556   3.909   4.631   7.417  10.107   6.429   8.187
e^1   CIPHER Three-Way           6.515   7.474   3.900   5.086   4.288   4.964   7.353   9.360   6.703   7.896
e^1   CIPHER Two-Way             7.269   7.574   4.331   4.872   5.406   5.603   7.397   8.727   7.142   7.630
e^1   MWEM Three-Way (T=60)      7.238   6.064   5.642   8.171   8.166   7.556   5.993   5.744   9.029   8.838
e^1   MWEM Two-Way (T=60)       19.363  14.160  13.907  18.644  19.092  21.151  16.152  15.595  21.593  21.131
e^1   FDH Laplace Sanitizer      4.246   4.686   2.920   3.361   2.992   3.257   4.629   5.461   4.332   4.741
e^2   CIPHER Three-Way           4.745   5.157   3.172   3.581   3.249   3.455   5.365   6.175   4.758   5.170
e^2   CIPHER Two-Way             5.257   5.531   3.038   3.404   3.421   3.731   5.321   5.788   5.174   5.533
e^2   MWEM Three-Way (T=120)     6.830   5.545   5.517   7.349   7.661   7.256   5.842   6.047   8.315   8.730
e^2   MWEM Two-Way (T=120)      17.265  11.555  12.328  16.417  18.305  21.648  15.815  15.868  22.829  21.965
e^2   FDH Laplace Sanitizer      3.322   3.530   2.768   3.043   2.798   2.977   3.515   3.893   3.306   3.566
1.2 n = 500

Table 5: Simulation Results: Bias for n = 500

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way          -1.109  -0.136   0.915   1.329   0.138   0.354  -0.606  -1.008   1.206   0.490
e^-2  CIPHER Two-Way            -1.272   1.016   0.922   1.449  -0.015   0.174  -0.582  -1.084   1.609   0.687
e^-2  MWEM Three-Way (T=10)     -1.464   0.990  -0.486  -1.015   1.984  -0.968   1.492   0.470  -0.771   0.999
e^-2  MWEM Two-Way (T=10)       -1.450   0.974  -0.449  -0.970   1.985  -0.979   1.484   0.484  -0.727   0.958
e^-2  FDH Laplace Sanitizer     -0.860  -0.584   0.736   0.931   0.421   0.278  -0.054  -0.762   0.400   0.170
e^-1  CIPHER Three-Way          -0.839  -0.253   0.793   1.096   0.223   0.129  -0.444  -0.883   0.709   0.258
e^-1  CIPHER Two-Way            -0.952   0.735   0.833   1.244   0.189  -0.115  -0.497  -1.337   0.946   0.208
e^-1  MWEM Three-Way (T=25)     -1.498   1.017  -0.463  -0.932   1.960  -0.985   1.476   0.470  -0.771   0.978
e^-1  MWEM Two-Way (T=25)       -1.334   1.015  -0.363  -0.976   1.846  -1.041   1.510   0.442  -0.742   0.934
e^-1  FDH Laplace Sanitizer     -0.427  -0.276   0.426   0.644   0.382   0.137  -0.306  -0.550   0.096   0.004
e^0   CIPHER Three-Way          -0.537  -0.366   0.538   0.815   0.235   0.097  -0.303  -0.557   0.335   0.137
e^0   CIPHER Two-Way            -0.490   0.321   0.593   1.135   0.142  -0.149  -0.470  -1.218   0.366  -0.198
e^0   MWEM Three-Way (T=50)     -1.490   1.010  -0.353  -0.911   1.922  -0.970   1.469   0.482  -0.775   0.991
e^0   MWEM Two-Way (T=50)       -1.317   0.994  -0.207  -0.782   1.859  -0.972   1.407   0.487  -0.738   0.940
e^0   FDH Laplace Sanitizer     -0.122  -0.070   0.121   0.255   0.192   0.072  -0.180  -0.300  -0.042  -0.031
e^1   CIPHER Three-Way          -0.161  -0.153   0.176   0.404   0.149   0.093  -0.138  -0.339   0.028   0.010
e^1   CIPHER Two-Way            -0.043   0.254   0.413   1.068  -0.077  -0.153  -0.632  -1.039  -0.053  -0.406
e^1   MWEM Three-Way (T=100)    -1.443   1.029  -0.258  -0.785   1.813  -0.950   1.439   0.464  -0.745   0.907
e^1   MWEM Two-Way (T=100)      -1.141   0.996   0.020  -0.716   1.718  -0.968   1.275   0.492  -0.734   0.811
e^1   FDH Laplace Sanitizer      0.006   0.015   0.027   0.045   0.057   0.008  -0.077  -0.100  -0.070  -0.061
e^2   CIPHER Three-Way           0.061   0.063   0.030   0.186   0.044   0.032  -0.088  -0.230  -0.120  -0.121
e^2   CIPHER Two-Way             0.035   0.229   0.392   0.956  -0.160  -0.092  -0.647  -0.909  -0.086  -0.310
e^2   MWEM Three-Way (T=200)    -1.405   1.051  -0.042  -0.606   1.655  -0.918   1.375   0.448  -0.729   0.890
e^2   MWEM Two-Way (T=200)      -0.917   0.987   0.266  -0.578   1.473  -0.926   1.249   0.449  -0.723   0.782
e^2   FDH Laplace Sanitizer      0.030   0.015  -0.005  -0.019   0.015  -0.004  -0.004   0.001  -0.052  -0.034
Table 6: Simulation Results: Root Mean Square Error (RMSE) for n = 500

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way           2.376   2.124   2.112   2.287   1.900   1.875   2.351   2.490   2.730   2.364
e^-2  CIPHER Two-Way             2.305   1.978   1.855   2.060   1.599   1.449   1.880   2.017   2.478   1.871
e^-2  MWEM Three-Way (T=10)      1.521   1.073   0.619   1.096   2.027   1.057   1.546   0.593   0.919   1.079
e^-2  MWEM Two-Way (T=10)        1.626   1.221   0.763   1.147   2.078   1.243   1.641   0.811   0.977   1.182
e^-2  FDH Laplace Sanitizer      2.129   2.159   1.537   1.895   1.439   1.630   2.155   2.833   2.303   2.352
e^-1  CIPHER Three-Way           1.585   1.474   1.357   1.595   1.183   1.191   1.567   1.931   1.737   1.665
e^-1  CIPHER Two-Way             1.677   1.572   1.337   1.667   1.235   1.209   1.340   1.912   1.596   1.318
e^-1  MWEM Three-Way (T=25)      1.629   1.171   0.727   1.140   2.067   1.161   1.582   0.757   1.034   1.167
e^-1  MWEM Two-Way (T=25)        1.947   1.503   1.131   1.476   2.206   1.790   1.902   1.240   1.390   1.619
e^-1  FDH Laplace Sanitizer      1.113   1.336   0.731   0.998   0.777   0.881   1.257   1.704   1.156   1.418
e^0   CIPHER Three-Way           0.962   0.998   0.754   1.038   0.626   0.641   0.915   1.205   0.926   0.944
e^0   CIPHER Two-Way             0.946   0.915   0.828   1.340   0.727   0.838   0.970   1.589   0.887   0.897
e^0   MWEM Three-Way (T=50)      1.626   1.153   0.684   1.110   2.039   1.173   1.577   0.771   0.998   1.177
e^0   MWEM Two-Way (T=50)        1.966   1.439   1.094   1.418   2.243   1.656   1.717   1.186   1.379   1.502
e^0   FDH Laplace Sanitizer      0.577   0.678   0.401   0.534   0.440   0.446   0.729   0.909   0.581   0.688
e^1   CIPHER Three-Way           0.504   0.540   0.414   0.583   0.435   0.430   0.538   0.707   0.493   0.540
e^1   CIPHER Two-Way             0.517   0.613   0.524   1.146   0.451   0.520   0.795   1.186   0.496   0.691
e^1   MWEM Three-Way (T=100)     1.577   1.176   0.627   1.013   1.919   1.184   1.549   0.773   0.969   1.096
e^1   MWEM Two-Way (T=100)       1.633   1.331   0.912   1.239   2.003   1.626   1.666   1.103   1.359   1.430
e^1   FDH Laplace Sanitizer      0.377   0.428   0.313   0.370   0.316   0.353   0.430   0.528   0.386   0.449
e^2   CIPHER Three-Way           0.432   0.457   0.338   0.407   0.350   0.372   0.487   0.592   0.432   0.468
e^2   CIPHER Two-Way             0.443   0.515   0.458   0.996   0.360   0.362   0.758   1.008   0.435   0.543
e^2   MWEM Three-Way (T=200)     1.560   1.196   0.563   0.868   1.775   1.116   1.494   0.717   0.966   1.096
e^2   MWEM Two-Way (T=200)       1.380   1.258   0.814   1.102   1.749   1.511   1.510   0.976   1.302   1.393
e^2   FDH Laplace Sanitizer      0.363   0.410   0.293   0.355   0.309   0.344   0.412   0.495   0.366   0.425
Table 7: Simulation Results: Coverage Probability (CP, in %) for n = 500

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way            90.5    97.9    95.1    91.0    98.9    98.3    97.6    97.5    92.5    97.5
e^-2  CIPHER Two-Way              92.4    98.6    94.0    85.4    98.9    98.7    96.0    94.5    86.4    95.3
e^-2  MWEM Three-Way (T=10)       38.1    74.8    98.2    82.1    13.2    82.4    29.9    98.6    94.4    83.0
e^-2  MWEM Two-Way (T=10)         61.5    83.0    98.8    90.2    25.7    86.4    47.5    98.9    97.5    89.0
e^-2  FDH Laplace Sanitizer       86.9    97.5    88.9    84.4    97.8    96.8    97.8    96.8    90.5    97.3
e^-1  CIPHER Three-Way            87.7    97.3    92.6    86.3    98.8    98.5    97.8    96.3    90.6    97.0
e^-1  CIPHER Two-Way              92.6    98.4    90.3    82.6    99.3    98.1    97.8    90.3    90.0    97.6
e^-1  MWEM Three-Way (T=25)       63.2    80.6    97.4    90.8    40.1    88.3    58.2    97.6    93.5    90.5
e^-1  MWEM Two-Way (T=25)         81.4    89.4    98.6    94.0    60.9    92.8    73.9    97.6    97.9    93.9
e^-1  FDH Laplace Sanitizer       94.0    98.3    96.2    92.0    96.9    98.5    98.8    96.7    97.2    98.6
e^0   CIPHER Three-Way            93.3    96.2    95.1    89.8    99.1    98.9    99.0    96.2    96.1    99.1
e^0   CIPHER Two-Way              97.6    99.5    95.8    79.7    99.4    98.5    98.2    89.1    97.1    99.6
e^0   MWEM Three-Way (T=50)       64.3    83.2    97.8    90.5    43.7    86.8    56.8    97.0    93.9    86.3
e^0   MWEM Two-Way (T=50)         86.8    90.8    98.6    95.3    66.3    93.2    78.4    98.6    97.8    94.5
e^0   FDH Laplace Sanitizer       99.7    99.6    99.7    99.2    99.5    99.8    99.5    98.3    99.8    99.8
e^1   CIPHER Three-Way            99.4    99.7    99.7    99.3    99.6   100.0    99.6    98.7    99.3    99.8
e^1   CIPHER Two-Way              99.2    99.4    99.4    71.4   100.0    99.7    97.5    88.1    99.5    99.3
e^1   MWEM Three-Way (T=100)      66.6    83.4    98.6    93.0    49.5    88.4    64.0    97.1    95.0    90.5
e^1   MWEM Two-Way (T=100)        86.2    87.5    98.8    95.2    70.5    92.7    83.0    97.9    96.2    95.6
e^1   FDH Laplace Sanitizer      100.0    99.9   100.0    99.9   100.0    99.9    99.8    99.3    99.9    99.8
e^2   CIPHER Three-Way            99.7    99.8    99.9    99.8    99.9    99.9    99.5    99.0    99.6    99.9
e^2   CIPHER Two-Way             100.0    99.5    99.7    76.8   100.0    99.9    96.8    89.7    99.9    99.4
e^2   MWEM Three-Way (T=200)      68.5    81.0    99.4    95.3    59.2    88.3    64.8    97.4    95.2    90.5
e^2   MWEM Two-Way (T=200)        89.0    87.9    98.5    96.9    76.9    94.0    83.2    97.7    97.1    96.2
e^2   FDH Laplace Sanitizer      100.0    99.8    99.9    99.9   100.0    99.9    99.7    99.5    99.8    99.6
Table 8: Simulation Results: Confidence Interval Widths for n = 500

ε     Algorithm                    β01     β11     β21     β31     β41     β02     β12     β22     β32     β42
e^-2  CIPHER Three-Way          13.450  13.356  11.739  11.298  12.138  11.799  14.900  15.262  16.651  15.661
e^-2  CIPHER Two-Way            12.521  10.205   9.495   8.301  10.245   8.720  11.275   9.979  12.000   9.936
e^-2  MWEM Three-Way (T=10)      2.850   2.643   2.596   2.927   2.973   2.838   2.619   2.602   2.928   2.935
e^-2  MWEM Two-Way (T=10)        4.403   3.793   3.483   4.001   4.028   4.360   3.861   3.557   4.015   4.253
e^-2  FDH Laplace Sanitizer     10.130  11.947   5.750   7.551   6.739   7.909  12.781  17.765  12.944  14.411
e^-1  CIPHER Three-Way           8.407   7.701   5.548   5.971   6.887   6.384   7.313   7.733   7.524   7.289
e^-1  CIPHER Two-Way             8.071   8.612   5.940   6.638   6.519   6.719   9.079  10.677   9.682   9.747
e^-1  MWEM Three-Way (T=25)      3.825   3.396   3.363   3.859   3.813   3.739   3.431   3.332   3.811   4.028
e^-1  MWEM Two-Way (T=25)        8.440   6.290   6.432   7.438   7.824   8.935   7.416   6.535   7.890   8.564
e^-1  FDH Laplace Sanitizer      5.383   6.994   3.239   4.052   3.449   4.068   6.728   9.528   6.077   7.793
e^0   CIPHER Three-Way           4.366   4.905   3.100   3.620   3.333   3.622   4.746   5.693   4.535   5.052
e^0   CIPHER Two-Way             4.530   4.794   3.386   3.911   3.958   4.411   4.717   5.508   4.450   4.883
e^0   MWEM Three-Way (T=50)      3.711   3.368   3.346   3.783   3.814   3.687   3.325   3.415   3.729   3.759
e^0   MWEM Two-Way (T=50)        8.378   5.718   6.010   7.989   7.967   8.327   6.140   6.364   8.340   8.010
e^0   FDH Laplace Sanitizer      3.141   3.570   2.460   2.796   2.450   2.673   3.650   4.371   3.164   3.669
e^1   CIPHER Three-Way           2.888   3.105   2.487   2.757   2.548   2.718   3.173   3.592   2.902   3.150
e^1   CIPHER Two-Way             2.976   3.217   2.370   2.696   2.673   2.969   3.046   3.478   3.000   3.303
e^1   MWEM Three-Way (T=100)     3.650   3.385   3.365   3.799   3.739   3.731   3.396   3.471   3.806   3.805
e^1   MWEM Two-Way (T=100)       6.986   4.763   5.184   6.366   6.695   7.998   5.843   5.902   7.795   8.276
e^1   FDH Laplace Sanitizer      2.450   2.611   2.195   2.409   2.209   2.347   2.580   2.883   2.444   2.646
e^2   CIPHER Three-Way           2.585   2.744   2.290   2.497   2.315   2.469   2.717   3.031   2.559   2.775
e^2   CIPHER Two-Way             2.592   2.715   2.160   2.343   2.292   2.417   2.623   2.888   2.581   2.744
e^2   MWEM Three-Way (T=200)     3.726   3.297   3.316   3.764   3.790   3.702   3.387   3.394   3.853   3.780
e^2   MWEM Two-Way (T=200)       6.179   4.408   4.368   5.918   6.086   7.931   4.902   5.185   7.948   8.003
e^2   FDH Laplace Sanitizer      2.345   2.492   2.144   2.347   2.164   2.299   2.469   2.748   2.343   2.538
2 Derivation of Linear Equation Sets Ax = b

In this section, we illustrate the derivation of the linear equation set given a pre-specified query set Q in the following three scenarios: (1) the 3-variable 2 × 2 × 2 case with Q = all 2-way histograms; (2) the 3-variable 2 × 3 × 3 case with Q = all 2-way histograms; (3) the 4-variable 2 × 2 × 3 × 3 case with Q = all 2-way histograms.
2.1 Three variable case (2 × 2 × 2)

In the 3-variable case 2 × 2 × 2, we first obtain

\begin{align*}
P(V_3=0 \mid V_1) &= P(V_3=0, V_2=0 \mid V_1) + P(V_3=0, V_2=1 \mid V_1) \\
&= P(V_3=0 \mid V_2=0, V_1)\,P(V_2=0 \mid V_1) + P(V_3=0 \mid V_2=1, V_1)\,P(V_2=1 \mid V_1) \\
P(V_3=0 \mid V_2) &= P(V_3=0, V_1=0 \mid V_2) + P(V_3=0, V_1=1 \mid V_2) \\
&= P(V_3=0 \mid V_1=0, V_2)\,P(V_1=0 \mid V_2) + P(V_3=0 \mid V_1=1, V_2)\,P(V_1=1 \mid V_2).
\end{align*}

Examining each scenario of V1 and V2, the two equations above can be expanded into four equations.

\begin{equation}
\begin{aligned}
P(V_3=0 \mid V_1=0) &= P(V_3=0 \mid V_2=0, V_1=0)\,P(V_2=0 \mid V_1=0) + P(V_3=0 \mid V_2=1, V_1=0)\,P(V_2=1 \mid V_1=0) \\
P(V_3=0 \mid V_1=1) &= P(V_3=0 \mid V_2=0, V_1=1)\,P(V_2=0 \mid V_1=1) + P(V_3=0 \mid V_2=1, V_1=1)\,P(V_2=1 \mid V_1=1) \\
P(V_3=0 \mid V_2=0) &= P(V_3=0 \mid V_1=0, V_2=0)\,P(V_1=0 \mid V_2=0) + P(V_3=0 \mid V_1=1, V_2=0)\,P(V_1=1 \mid V_2=0) \\
P(V_3=0 \mid V_2=1) &= P(V_3=0 \mid V_1=0, V_2=1)\,P(V_1=0 \mid V_2=1) + P(V_3=0 \mid V_1=1, V_2=1)\,P(V_1=1 \mid V_2=1)
\end{aligned}
\tag{1}
\end{equation}

Using the sanitized values from the 2-way tables, the left-hand sides of the four equations above,
$b = (P(V_3=0 \mid V_1=0),\ P(V_3=0 \mid V_1=1),\ P(V_3=0 \mid V_2=0),\ P(V_3=0 \mid V_2=1))$,
are known. Additionally, on the right-hand side, the elements $P(V_2=0 \mid V_1=0)$, $P(V_2=0 \mid V_1=1)$, $P(V_1=0 \mid V_2=0)$, $P(V_1=0 \mid V_2=1)$, $P(V_2=1 \mid V_1=0)$, $P(V_2=1 \mid V_1=1)$, $P(V_1=1 \mid V_2=0)$, and $P(V_1=1 \mid V_2=1)$ can be calculated from the sanitized 2-way tables. Therefore, Eqn (1) can be written as b = Az, where
$z = (P(V_3=0 \mid V_1=0, V_2=0),\ P(V_3=0 \mid V_1=1, V_2=0),\ P(V_3=0 \mid V_1=0, V_2=1),\ P(V_3=0 \mid V_1=1, V_2=1))$
and A contains the known coefficients associated with z. Note that though there are four equations in Eqn (1), they are actually linearly dependent. Therefore, we apply the Tikhonov regularization to solve for the four unknowns in z. Once we get z and $P(V_3=1 \mid V_1, V_2) = 1 - P(V_3=0 \mid V_1, V_2)$, we can subsequently calculate the joint probability among (V1, V2, V3) as $P(V_1, V_2, V_3) = P(V_3 \mid V_1, V_2)\,P(V_1, V_2)$, from which we can sample the synthetic data.
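The 2 × 2 × 2 derivation can be sketched numerically as follows. This is an illustration only, under several assumptions that are not from the paper: the joint table `joint` is a made-up, noise-free stand-in for the sanitized 2-way tables, the Tikhonov step is written in its common ridge form (minimize ||Az − b||² + λ||z||²), the value of `lam` is arbitrary, and clipping z to [0, 1] before sampling is a hypothetical safeguard. The unknowns are ordered as in z above.

```python
import numpy as np

# Hypothetical joint distribution P(V1, V2, V3), indexed joint[v1, v2, v3];
# it stands in for the information carried by the sanitized 2-way tables.
joint = np.array([[[0.10, 0.05], [0.15, 0.10]],
                  [[0.05, 0.20], [0.20, 0.15]]])

p12 = joint.sum(axis=2)   # P(V1, V2)
p13 = joint.sum(axis=1)   # P(V1, V3)
p23 = joint.sum(axis=0)   # P(V2, V3)

# Left-hand sides of Eqn (1): P(V3=0|V1=0), P(V3=0|V1=1), P(V3=0|V2=0), P(V3=0|V2=1).
b = np.array([
    p13[0, 0] / p13[0].sum(),
    p13[1, 0] / p13[1].sum(),
    p23[0, 0] / p23[0].sum(),
    p23[1, 0] / p23[1].sum(),
])

p2_given_1 = p12 / p12.sum(axis=1, keepdims=True)   # [v1, v2] -> P(V2=v2 | V1=v1)
p1_given_2 = p12 / p12.sum(axis=0, keepdims=True)   # [v1, v2] -> P(V1=v1 | V2=v2)

# Columns follow z = (z00, z10, z01, z11) with z_{v1 v2} = P(V3=0 | V1=v1, V2=v2).
A = np.array([
    [p2_given_1[0, 0], 0.0, p2_given_1[0, 1], 0.0],
    [0.0, p2_given_1[1, 0], 0.0, p2_given_1[1, 1]],
    [p1_given_2[0, 0], p1_given_2[1, 0], 0.0, 0.0],
    [0.0, 0.0, p1_given_2[0, 1], p1_given_2[1, 1]],
])

def tikhonov_solve(A, b, lam):
    """Minimize ||A z - b||^2 + lam * ||z||^2 via the regularized normal equations."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

lam = 1e-8  # illustrative value, not taken from the paper
z = tikhonov_solve(A, b, lam)

# Rebuild P(V1, V2, V3) = P(V3 | V1, V2) P(V1, V2) and sample synthetic records.
Z = np.clip(z.reshape(2, 2).T, 0.0, 1.0)            # Z[v1, v2] = P(V3=0 | V1=v1, V2=v2)
joint_hat = np.stack([Z * p12, (1.0 - Z) * p12], axis=2)
rng = np.random.default_rng(0)
cells = rng.choice(joint_hat.size, size=200, p=joint_hat.ravel())
synthetic = np.column_stack(np.unravel_index(cells, joint_hat.shape))
print(np.linalg.matrix_rank(A), synthetic.shape)
```

Because both sets of coefficients in A come from the same (V1, V2) table, weighting the first two rows by P(V1=0) and P(V1=1) gives the same vector as weighting the last two rows by P(V2=0) and P(V2=1), which is why A is rank deficient and a regularized solve is used instead of a plain inverse. The reconstructed table reproduces the 2-way margins it was built from, but it need not equal the original 3-way table.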
2.2 Three variable case for (2 × 3 × 3)

In the 3-variable case (2 × 3 × 3), the initial equations are

\begin{align*}
P(V_3=0 \mid V_1) &= P(V_3=0, V_2=0 \mid V_1) + P(V_3=0, V_2=1 \mid V_1) + P(V_3=0, V_2=2 \mid V_1) \\
&= P(V_3=0 \mid V_2=0, V_1)\,P(V_2=0 \mid V_1) + P(V_3=0 \mid V_2=1, V_1)\,P(V_2=1 \mid V_1) + P(V_3=0 \mid V_2=2, V_1)\,P(V_2=2 \mid V_1) \\
P(V_3=0 \mid V_2) &= P(V_3=0, V_1=0 \mid V_2) + P(V_3=0, V_1=1 \mid V_2) \\
&= P(V_3=0 \mid V_1=0, V_2)\,P(V_1=0 \mid V_2) + P(V_3=0 \mid V_1=1, V_2)\,P(V_1=1 \mid V_2) \\
P(V_3=1 \mid V_1) &= P(V_3=1, V_2=0 \mid V_1) + P(V_3=1, V_2=1 \mid V_1) + P(V_3=1, V_2=2 \mid V_1) \\
&= P(V_3=1 \mid V_2=0, V_1)\,P(V_2=0 \mid V_1) + P(V_3=1 \mid V_2=1, V_1)\,P(V_2=1 \mid V_1) + P(V_3=1 \mid V_2=2, V_1)\,P(V_2=2 \mid V_1) \\
P(V_3=1 \mid V_2) &= P(V_3=1, V_1=0 \mid V_2) + P(V_3=1, V_1=1 \mid V_2) \\
&= P(V_3=1 \mid V_1=0, V_2)\,P(V_1=0 \mid V_2) + P(V_3=1 \mid V_1=1, V_2)\,P(V_1=1 \mid V_2)
\end{align*}

Examining each scenario of V1 and V2, we can expand the above 4 equations into 10 equations with 12 unknowns. The left sides of the 10 equations compose b, z comprises the 12 unknowns, which are the conditional probabilities of V3 = 0 | V1, V2 and V3 = 1 | V1, V2, and A contains the corresponding coefficients. We apply the Tikhonov regularization to solve for z from b = Az. The 10 equations are

\begin{align*}
P(V_3=0 \mid V_1=0) &= P(V_3=0 \mid V_2=0, V_1=0)\,P(V_2=0 \mid V_1=0) + P(V_3=0 \mid V_2=1, V_1=0)\,P(V_2=1 \mid V_1=0) \\
&\quad + P(V_3=0 \mid V_2=2, V_1=0)\,P(V_2=2 \mid V_1=0) \\
P(V_3=0 \mid V_1=1) &= P(V_3=0 \mid V_2=0, V_1=1)\,P(V_2=0 \mid V_1=1) + P(V_3=0 \mid V_2=1, V_1=1)\,P(V_2=1 \mid V_1=1) \\
&\quad + P(V_3=0 \mid V_2=2, V_1=1)\,P(V_2=2 \mid V_1=1) \\
P(V_3=0 \mid V_2=0) &= P(V_3=0 \mid V_1=0, V_2=0)\,P(V_1=0 \mid V_2=0) + P(V_3=0 \mid V_1=1, V_2=0)\,P(V_1=1 \mid V_2=0) \\
P(V_3=0 \mid V_2=1) &= P(V_3=0 \mid V_1=0, V_2=1)\,P(V_1=0 \mid V_2=1) + P(V_3=0 \mid V_1=1, V_2=1)\,P(V_1=1 \mid V_2=1) \\
P(V_3=0 \mid V_2=2) &= P(V_3=0 \mid V_1=0, V_2=2)\,P(V_1=0 \mid V_2=2) + P(V_3=0 \mid V_1=1, V_2=2)\,P(V_1=1 \mid V_2=2) \\
P(V_3=1 \mid V_1=0) &= P(V_3=1 \mid V_2=0, V_1=0)\,P(V_2=0 \mid V_1=0) + P(V_3=1 \mid V_2=1, V_1=0)\,P(V_2=1 \mid V_1=0) \\
&\quad + P(V_3=1 \mid V_2=2, V_1=0)\,P(V_2=2 \mid V_1=0) \\
P(V_3=1 \mid V_1=1) &= P(V_3=1 \mid V_2=0, V_1=1)\,P(V_2=0 \mid V_1=1) + P(V_3=1 \mid V_2=1, V_1=1)\,P(V_2=1 \mid V_1=1) \\
&\quad + P(V_3=1 \mid V_2=2, V_1=1)\,P(V_2=2 \mid V_1=1) \\
P(V_3=1 \mid V_2=0) &= P(V_3=1 \mid V_1=0, V_2=0)\,P(V_1=0 \mid V_2=0) + P(V_3=1 \mid V_1=1, V_2=0)\,P(V_1=1 \mid V_2=0) \\
P(V_3=1 \mid V_2=1) &= P(V_3=1 \mid V_1=0, V_2=1)\,P(V_1=0 \mid V_2=1) + P(V_3=1 \mid V_1=1, V_2=1)\,P(V_1=1 \mid V_2=1) \\
P(V_3=1 \mid V_2=2) &= P(V_3=1 \mid V_1=0, V_2=2)\,P(V_1=0 \mid V_2=2) + P(V_3=1 \mid V_1=1, V_2=2)\,P(V_1=1 \mid V_2=2)
\end{align*}
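The bookkeeping for the 2 × 3 × 3 case (10 equations, 12 unknowns) can be checked with a small generic builder. This is a sketch under assumptions that are not from the paper: `build_system` is a hypothetical helper name, the 2-way tables are computed exactly from a made-up joint table rather than from sanitized counts, and the Tikhonov step uses the common ridge form with an arbitrary `lam`.

```python
import numpy as np

def build_system(p12, p13, p23):
    """Assemble b = A z for a 3-variable case with any number of categories.

    Unknowns: z[v3, v1, v2] = P(V3=v3 | V1=v1, V2=v2) for all but the last
    V3 level, flattened in that index order. Hypothetical helper.
    """
    k1, k2 = p12.shape
    free = p13.shape[1] - 1                 # last V3 level follows by complement
    idx = lambda v3, v1, v2: (v3 * k1 + v1) * k2 + v2
    rows, rhs = [], []
    for v3 in range(free):
        for v1 in range(k1):                # equations conditioning on V1
            row = np.zeros(free * k1 * k2)
            for v2 in range(k2):
                row[idx(v3, v1, v2)] = p12[v1, v2] / p12[v1].sum()
            rows.append(row)
            rhs.append(p13[v1, v3] / p13[v1].sum())
        for v2 in range(k2):                # equations conditioning on V2
            row = np.zeros(free * k1 * k2)
            for v1 in range(k1):
                row[idx(v3, v1, v2)] = p12[v1, v2] / p12[:, v2].sum()
            rows.append(row)
            rhs.append(p23[v2, v3] / p23[v2].sum())
    return np.array(rows), np.array(rhs)

# Made-up positive joint table P(V1, V2, V3) with 2 x 3 x 3 categories.
joint = np.arange(1.0, 19.0).reshape(2, 3, 3)
joint /= joint.sum()
p12, p13, p23 = joint.sum(axis=2), joint.sum(axis=1), joint.sum(axis=0)

A, b = build_system(p12, p13, p23)
lam = 1e-8                                   # illustrative value
z = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

# True conditionals in the same ordering, used only to sanity-check the system.
z_true = (joint / p12[..., None]).transpose(2, 0, 1)[:2].ravel()
print(A.shape, np.linalg.matrix_rank(A))
```

The rows are generated in the same order as the 10 equations above. The system is underdetermined, so the regularized solution satisfies the equations without necessarily matching `z_true`.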
2.3 Four variable case (2 × 2 × 3 × 3)

In this example, the variables V1 and V2 have two categories, and V3 and V4 have three categories. We assume the query set Q consists of all 2D histograms among the variables V1, V2, V3, and V4. (The procedures are similar if Q consists of other types of histograms, such as all 3D histograms, or a mixture of 2D and 3D histograms.) CIPHER first solves for the probability distributions of all 3D histograms given the 2D histograms; the procedures are similar to the 3-variable examples in Sections 2.1 and 2.2 of the supplementary materials. Once the 3D histograms are available, the probability distribution of the four variables can be calculated given the 3D histograms. The initial equations are

\begin{align*}
P(V_4=0 \mid V_1, V_2) &= P(V_4=0, V_3=0 \mid V_1, V_2) + P(V_4=0, V_3=1 \mid V_1, V_2) + P(V_4=0, V_3=2 \mid V_1, V_2) \\
&= P(V_4=0 \mid V_3=0, V_1, V_2)\,P(V_3=0 \mid V_1, V_2) + P(V_4=0 \mid V_3=1, V_1, V_2)\,P(V_3=1 \mid V_1, V_2) \\
&\quad + P(V_4=0 \mid V_3=2, V_1, V_2)\,P(V_3=2 \mid V_1, V_2) \\
P(V_4=0 \mid V_1, V_3) &= P(V_4=0, V_2=0 \mid V_1, V_3) + P(V_4=0, V_2=1 \mid V_1, V_3) \\
&= P(V_4=0 \mid V_2=0, V_1, V_3)\,P(V_2=0 \mid V_1, V_3) + P(V_4=0 \mid V_2=1, V_1, V_3)\,P(V_2=1 \mid V_1, V_3) \\
P(V_4=0 \mid V_2, V_3) &= P(V_4=0, V_1=0 \mid V_2, V_3) + P(V_4=0, V_1=1 \mid V_2, V_3) \\
&= P(V_4=0 \mid V_1=0, V_2, V_3)\,P(V_1=0 \mid V_2, V_3) + P(V_4=0 \mid V_1=1, V_2, V_3)\,P(V_1=1 \mid V_2, V_3) \\
P(V_4=1 \mid V_1, V_2) &= P(V_4=1, V_3=0 \mid V_1, V_2) + P(V_4=1, V_3=1 \mid V_1, V_2) + P(V_4=1, V_3=2 \mid V_1, V_2) \\
&= P(V_4=1 \mid V_3=0, V_1, V_2)\,P(V_3=0 \mid V_1, V_2) + P(V_4=1 \mid V_3=1, V_1, V_2)\,P(V_3=1 \mid V_1, V_2) \\
&\quad + P(V_4=1 \mid V_3=2, V_1, V_2)\,P(V_3=2 \mid V_1, V_2) \\
P(V_4=1 \mid V_1, V_3) &= P(V_4=1, V_2=0 \mid V_1, V_3) + P(V_4=1, V_2=1 \mid V_1, V_3) \\
&= P(V_4=1 \mid V_2=0, V_1, V_3)\,P(V_2=0 \mid V_1, V_3) + P(V_4=1 \mid V_2=1, V_1, V_3)\,P(V_2=1 \mid V_1, V_3) \\
P(V_4=1 \mid V_2, V_3) &= P(V_4=1, V_1=0 \mid V_2, V_3) + P(V_4=1, V_1=1 \mid V_2, V_3) \\
&= P(V_4=1 \mid V_1=0, V_2, V_3)\,P(V_1=0 \mid V_2, V_3) + P(V_4=1 \mid V_1=1, V_2, V_3)\,P(V_1=1 \mid V_2, V_3)
\end{align*}

Examining each scenario of V1, V2, and V3, we can expand the above 6 equations into 32 equations with 24 unknowns (the conditional probabilities). Again, the 32 equations are linearly dependent, and the rank of the system is < 24. Therefore, we apply the Tikhonov regularization to solve for z from b = Az, where b contains the left sides of the 32 equations, z refers to the conditional probabilities of V4 = 0 | V1, V2, V3 and V4 = 1 | V1, V2, V3, and A contains the corresponding coefficients.
\begin{align*}
P(V_4=0 \mid V_1=0, V_2=0) &= P(V_4=0 \mid V_1=0, V_2=0, V_3=0)\,P(V_3=0 \mid V_1=0, V_2=0) \\
&\quad + P(V_4=0 \mid V_1=0, V_2=0, V_3=1)\,P(V_3=1 \mid V_1=0, V_2=0) \\
&\quad + P(V_4=0 \mid V_1=0, V_2=0, V_3=2)\,P(V_3=2 \mid V_1=0, V_2=0) \\
P(V_4=0 \mid V_1=1, V_2=0) &= P(V_4=0 \mid V_1=1, V_2=0, V_3=0)\,P(V_3=0 \mid V_1=1, V_2=0) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=0, V_3=1)\,P(V_3=1 \mid V_1=1, V_2=0) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=0, V_3=2)\,P(V_3=2 \mid V_1=1, V_2=0) \\
P(V_4=0 \mid V_1=0, V_2=1) &= P(V_4=0 \mid V_1=0, V_2=1, V_3=0)\,P(V_3=0 \mid V_1=0, V_2=1) \\
&\quad + P(V_4=0 \mid V_1=0, V_2=1, V_3=1)\,P(V_3=1 \mid V_1=0, V_2=1) \\
&\quad + P(V_4=0 \mid V_1=0, V_2=1, V_3=2)\,P(V_3=2 \mid V_1=0, V_2=1) \\
P(V_4=0 \mid V_1=1, V_2=1) &= P(V_4=0 \mid V_1=1, V_2=1, V_3=0)\,P(V_3=0 \mid V_1=1, V_2=1) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=1)\,P(V_3=1 \mid V_1=1, V_2=1) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=2)\,P(V_3=2 \mid V_1=1, V_2=1) \\
P(V_4=0 \mid V_1=0, V_3=0) &= P(V_4=0 \mid V_1=0, V_2=0, V_3=0)\,P(V_2=0 \mid V_1=0, V_3=0) \\
&\quad + P(V_4=0 \mid V_1=0, V_2=1, V_3=0)\,P(V_2=1 \mid V_1=0, V_3=0) \\
P(V_4=0 \mid V_1=1, V_3=0) &= P(V_4=0 \mid V_1=1, V_2=0, V_3=0)\,P(V_2=0 \mid V_1=1, V_3=0) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=0)\,P(V_2=1 \mid V_1=1, V_3=0) \\
P(V_4=0 \mid V_1=0, V_3=1) &= P(V_4=0 \mid V_1=0, V_2=0, V_3=1)\,P(V_2=0 \mid V_1=0, V_3=1) \\
&\quad + P(V_4=0 \mid V_1=0, V_2=1, V_3=1)\,P(V_2=1 \mid V_1=0, V_3=1) \\
P(V_4=0 \mid V_1=1, V_3=1) &= P(V_4=0 \mid V_1=1, V_2=0, V_3=1)\,P(V_2=0 \mid V_1=1, V_3=1) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=1)\,P(V_2=1 \mid V_1=1, V_3=1) \\
P(V_4=0 \mid V_1=0, V_3=2) &= P(V_4=0 \mid V_1=0, V_2=0, V_3=2)\,P(V_2=0 \mid V_1=0, V_3=2) \\
&\quad + P(V_4=0 \mid V_1=0, V_2=1, V_3=2)\,P(V_2=1 \mid V_1=0, V_3=2) \\
P(V_4=0 \mid V_1=1, V_3=2) &= P(V_4=0 \mid V_1=1, V_2=0, V_3=2)\,P(V_2=0 \mid V_1=1, V_3=2) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=2)\,P(V_2=1 \mid V_1=1, V_3=2) \\
P(V_4=0 \mid V_2=0, V_3=0) &= P(V_4=0 \mid V_1=0, V_2=0, V_3=0)\,P(V_1=0 \mid V_2=0, V_3=0) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=0, V_3=0)\,P(V_1=1 \mid V_2=0, V_3=0) \\
P(V_4=0 \mid V_2=1, V_3=0) &= P(V_4=0 \mid V_1=0, V_2=1, V_3=0)\,P(V_1=0 \mid V_2=1, V_3=0) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=0)\,P(V_1=1 \mid V_2=1, V_3=0) \\
P(V_4=0 \mid V_2=0, V_3=1) &= P(V_4=0 \mid V_1=0, V_2=0, V_3=1)\,P(V_1=0 \mid V_2=0, V_3=1) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=0, V_3=1)\,P(V_1=1 \mid V_2=0, V_3=1) \\
P(V_4=0 \mid V_2=1, V_3=1) &= P(V_4=0 \mid V_1=0, V_2=1, V_3=1)\,P(V_1=0 \mid V_2=1, V_3=1) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=1)\,P(V_1=1 \mid V_2=1, V_3=1) \\
P(V_4=0 \mid V_2=0, V_3=2) &= P(V_4=0 \mid V_1=0, V_2=0, V_3=2)\,P(V_1=0 \mid V_2=0, V_3=2) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=0, V_3=2)\,P(V_1=1 \mid V_2=0, V_3=2) \\
P(V_4=0 \mid V_2=1, V_3=2) &= P(V_4=0 \mid V_1=0, V_2=1, V_3=2)\,P(V_1=0 \mid V_2=1, V_3=2) \\
&\quad + P(V_4=0 \mid V_1=1, V_2=1, V_3=2)\,P(V_1=1 \mid V_2=1, V_3=2) \\
P(V_4=1 \mid V_1=0, V_2=0) &= P(V_4=1 \mid V_1=0, V_2=0, V_3=0)\,P(V_3=0 \mid V_1=0, V_2=0) \\
&\quad + P(V_4=1 \mid V_1=0, V_2=0, V_3=1)\,P(V_3=1 \mid V_1=0, V_2=0) \\
&\quad + P(V_4=1 \mid V_1=0, V_2=0, V_3=2)\,P(V_3=2 \mid V_1=0, V_2=0) \\
P(V_4=1 \mid V_1=1, V_2=0) &= P(V_4=1 \mid V_1=1, V_2=0, V_3=0)\,P(V_3=0 \mid V_1=1, V_2=0) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=0, V_3=1)\,P(V_3=1 \mid V_1=1, V_2=0) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=0, V_3=2)\,P(V_3=2 \mid V_1=1, V_2=0) \\
P(V_4=1 \mid V_1=0, V_2=1) &= P(V_4=1 \mid V_1=0, V_2=1, V_3=0)\,P(V_3=0 \mid V_1=0, V_2=1) \\
&\quad + P(V_4=1 \mid V_1=0, V_2=1, V_3=1)\,P(V_3=1 \mid V_1=0, V_2=1) \\
&\quad + P(V_4=1 \mid V_1=0, V_2=1, V_3=2)\,P(V_3=2 \mid V_1=0, V_2=1) \\
P(V_4=1 \mid V_1=1, V_2=1) &= P(V_4=1 \mid V_1=1, V_2=1, V_3=0)\,P(V_3=0 \mid V_1=1, V_2=1) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=1)\,P(V_3=1 \mid V_1=1, V_2=1) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=2)\,P(V_3=2 \mid V_1=1, V_2=1) \\
P(V_4=1 \mid V_1=0, V_3=0) &= P(V_4=1 \mid V_1=0, V_2=0, V_3=0)\,P(V_2=0 \mid V_1=0, V_3=0) \\
&\quad + P(V_4=1 \mid V_1=0, V_2=1, V_3=0)\,P(V_2=1 \mid V_1=0, V_3=0) \\
P(V_4=1 \mid V_1=1, V_3=0) &= P(V_4=1 \mid V_1=1, V_2=0, V_3=0)\,P(V_2=0 \mid V_1=1, V_3=0) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=0)\,P(V_2=1 \mid V_1=1, V_3=0) \\
P(V_4=1 \mid V_1=0, V_3=1) &= P(V_4=1 \mid V_1=0, V_2=0, V_3=1)\,P(V_2=0 \mid V_1=0, V_3=1) \\
&\quad + P(V_4=1 \mid V_1=0, V_2=1, V_3=1)\,P(V_2=1 \mid V_1=0, V_3=1) \\
P(V_4=1 \mid V_1=1, V_3=1) &= P(V_4=1 \mid V_1=1, V_2=0, V_3=1)\,P(V_2=0 \mid V_1=1, V_3=1) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=1)\,P(V_2=1 \mid V_1=1, V_3=1) \\
P(V_4=1 \mid V_1=0, V_3=2) &= P(V_4=1 \mid V_1=0, V_2=0, V_3=2)\,P(V_2=0 \mid V_1=0, V_3=2) \\
&\quad + P(V_4=1 \mid V_1=0, V_2=1, V_3=2)\,P(V_2=1 \mid V_1=0, V_3=2) \\
P(V_4=1 \mid V_1=1, V_3=2) &= P(V_4=1 \mid V_1=1, V_2=0, V_3=2)\,P(V_2=0 \mid V_1=1, V_3=2) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=2)\,P(V_2=1 \mid V_1=1, V_3=2) \\
P(V_4=1 \mid V_2=0, V_3=0) &= P(V_4=1 \mid V_1=0, V_2=0, V_3=0)\,P(V_1=0 \mid V_2=0, V_3=0) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=0, V_3=0)\,P(V_1=1 \mid V_2=0, V_3=0) \\
P(V_4=1 \mid V_2=1, V_3=0) &= P(V_4=1 \mid V_1=0, V_2=1, V_3=0)\,P(V_1=0 \mid V_2=1, V_3=0) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=0)\,P(V_1=1 \mid V_2=1, V_3=0) \\
P(V_4=1 \mid V_2=0, V_3=1) &= P(V_4=1 \mid V_1=0, V_2=0, V_3=1)\,P(V_1=0 \mid V_2=0, V_3=1) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=0, V_3=1)\,P(V_1=1 \mid V_2=0, V_3=1) \\
P(V_4=1 \mid V_2=1, V_3=1) &= P(V_4=1 \mid V_1=0, V_2=1, V_3=1)\,P(V_1=0 \mid V_2=1, V_3=1) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=1)\,P(V_1=1 \mid V_2=1, V_3=1) \\
P(V_4=1 \mid V_2=0, V_3=2) &= P(V_4=1 \mid V_1=0, V_2=0, V_3=2)\,P(V_1=0 \mid V_2=0, V_3=2) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=0, V_3=2)\,P(V_1=1 \mid V_2=0, V_3=2) \\
P(V_4=1 \mid V_2=1, V_3=2) &= P(V_4=1 \mid V_1=0, V_2=1, V_3=2)\,P(V_1=0 \mid V_2=1, V_3=2) \\
&\quad + P(V_4=1 \mid V_1=1, V_2=1, V_3=2)\,P(V_1=1 \mid V_2=1, V_3=2)
\end{align*}
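The 32-equation system can be assembled programmatically rather than written out by hand. The sketch below rests on assumptions that are not from the paper: the joint table `joint` is made up, the 3D histograms are computed exactly from it (in CIPHER they would come from the preceding step on sanitized 2D histograms), the Tikhonov step is the common ridge form with an arbitrary `lam`, and the rows are generated in a slightly different order from the list above.

```python
import numpy as np
from itertools import product

levels = (2, 2, 3, 3)                          # categories of V1, V2, V3, V4
joint = np.arange(1.0, 37.0).reshape(levels)   # hypothetical P(V1, V2, V3, V4)
joint /= joint.sum()
p123 = joint.sum(axis=3)                       # 3D histogram P(V1, V2, V3)

cells = list(product(range(2), range(2), range(3)))   # all (v1, v2, v3)
n_cells = len(cells)                                   # 12 unknowns per V4 level

rows, rhs = [], []
for v4 in range(2):                            # V4 = 2 follows by complement
    for drop in (2, 1, 0):                     # condition on (V1,V2), (V1,V3), (V2,V3)
        keep = [axis for axis in range(3) if axis != drop]
        pair_marg = p123.sum(axis=drop)                # P(conditioning pair)
        pair_v4 = joint[..., v4].sum(axis=drop)        # P(conditioning pair, V4 = v4)
        for pair in product(*(range(levels[axis]) for axis in keep)):
            row = np.zeros(2 * n_cells)
            for i, cell in enumerate(cells):
                if tuple(cell[axis] for axis in keep) == pair:
                    # coefficient: P(dropped variable | conditioning pair)
                    row[v4 * n_cells + i] = p123[cell] / pair_marg[pair]
            rows.append(row)
            rhs.append(pair_v4[pair] / pair_marg[pair])
A, b = np.array(rows), np.array(rhs)

lam = 1e-8                                     # illustrative value
z = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

# True conditionals P(V4=v4 | V1, V2, V3), used only to sanity-check the system.
z_true = np.concatenate([(joint[..., v4] / p123).ravel() for v4 in range(2)])
print(A.shape, np.linalg.matrix_rank(A))
```

With strictly positive probabilities, each V4 level contributes 16 equations in 12 unknowns, and the numerical rank of A comes out below 24, in line with the statement above.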