Binary segmentation methods based on Gini Index - unisi.it May/PAPER_Costa... · Binary...


Binary segmentation methods based on Gini Index: A new approach to the measurement of poverty

Michele Costa, Giuliano Galimberti, Angela Montanari
Università di Bologna, Dipartimento di Scienze Statistiche

1. Introduction

The Gini inequality index (Gini, 1939) is one of the most widely used inequality measures, both in methodological studies and in applied research. In the case of a population divided into k subgroups, a long-standing tradition holds that the decomposition of inequality indexes is obtained in the framework of the analysis of variance, and therefore inequality between subgroups is calculated as a function of the k subgroup means only. Since the contribution by Bhattacharya and Mahalanobis (1967), the Gini index too has usually been decomposed following this approach. Typically, however, subgroup income or wealth distributions differ strongly not only in the mean, but also in variance and skewness. A decomposition of the Gini index able to provide a more complete and correct measurement of inequality between subgroups has been proposed by Dagum (1997), who considers not only the differences between the means of the population subgroups, but all the differences between subgroup distributions.

A further issue which has been deeply analyzed in the literature (see for instance Cheli et al., 1994; Cannari and D’Alessio, 2003) concerns the role of individual covariates in explaining income differences and poverty structure.

In this paper we propose to study the effect of socio-demographic and geographical characteristics on subgroup differences by developing a nonparametric regression model for income inequalities, based on recursive partitioning methods. Within the framework of Classification and Regression Trees (Breiman et al., 1984) we suggest replacing the usual splitting criterion, based on the well-known decomposition into between and within group deviance components, with a new criterion based on the Gini index, which optimizes inequality between subgroups. This solution makes it possible to better detect the covariates which mainly influence income inequality, by taking into account all the distributional aspects of income, and it points out specific income profiles.


2. Dagum’s decomposition of Gini index in the k group case

In a population of n units, with income (or wealth, …) vector $\{y_1, y_2, \ldots, y_n\}$ and average income $\bar{y}$, subdivided in k subgroups of $n_j$ units each, with $j = 1, \ldots, k$, $\sum_{j=1}^{k} n_j = n$, and average income $\bar{y}_j$, the Gini index is:

$$G = \frac{1}{2 n^2 \bar{y}} \sum_{j=1}^{k} \sum_{h=1}^{k} \sum_{i=1}^{n_j} \sum_{r=1}^{n_h} |y_{ji} - y_{hr}| \qquad (1)$$

By setting the subgroup j population share $p_j = n_j / n$, and the subgroup j income share $s_j = p_j \bar{y}_j / \bar{y}$, the Gini index between subgroup j and subgroup h can be expressed as

$$G_{jh} = \frac{1}{n_j n_h (\bar{y}_j + \bar{y}_h)} \sum_{i=1}^{n_j} \sum_{r=1}^{n_h} |y_{ji} - y_{hr}| \qquad (2)$$

and the Gini index can be obtained as a weighted sum of the $G_{jh}$ with weights $p_j s_h$:

$$G = \sum_{j=1}^{k} \sum_{h=1}^{k} G_{jh}\, p_j s_h \qquad (3)$$

In this framework the inequality within the k subgroups can be easily expressed as

$$G_w = \sum_{j=1}^{k} G_{jj}\, p_j s_j \qquad (4)$$

i.e. the sum of the k Gini indexes in the k subgroups weighted by $p_j s_j$. This measure of the inequality within is quite generally accepted in the literature, where the inequality between subgroups is usually obtained as a variant of Bhattacharya and Mahalanobis (1967)

$$G_b = \sum_{j=1}^{k} \sum_{\substack{h=1 \\ h \neq j}}^{k} p_j p_h\, |\bar{y}_j - \bar{y}_h|$$

An alternative proposal (Dagum, 1997a; Mehran, 1975) aimed at measuring the contribution to total inequality due to the differences between the k subgroups is

$$G_b = \sum_{j=1}^{k} \sum_{\substack{h=1 \\ h \neq j}}^{k} G_{jh}\, p_j s_h \qquad (5)$$

which takes into account not only the differences between the income means of the k

population subgroups, but also all other possible differences. Dagum suggests a further

decomposition of the inequality between subgroups distinguishing the net Gini inequality

between and the contribution of the intensity of transvariation; a more detailed discussion is

provided in the original references.

A simple expression for the net inequality between, in the two group case, is derived in Costa

(2004).
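As an illustration of equations (1) to (5), the following minimal Python sketch computes the total Gini index and its within and between components for a small invented sample. The function names (`gini`, `gini_between`, `dagum_decomposition`) and the income values are hypothetical and are not taken from the paper; incomes are assumed to be positive so that the denominators do not vanish.

```python
def mean(values):
    return sum(values) / len(values)


def gini(y):
    """Gini index of a pooled income vector, as in equation (1)."""
    n = len(y)
    return sum(abs(a - b) for a in y for b in y) / (2 * n * n * mean(y))


def gini_between(y_j, y_h):
    """Gini index between subgroups j and h, as in equation (2).

    When both arguments are the same subgroup this is the usual
    Gini index within that subgroup (G_jj).
    """
    total_abs_diff = sum(abs(a - b) for a in y_j for b in y_h)
    return total_abs_diff / (len(y_j) * len(y_h) * (mean(y_j) + mean(y_h)))


def dagum_decomposition(groups):
    """Return (G, G_w, G_b) following equations (3), (4) and (5)."""
    n = sum(len(g) for g in groups)
    total_income = sum(sum(g) for g in groups)
    p = [len(g) / n for g in groups]             # population shares p_j
    s = [sum(g) / total_income for g in groups]  # income shares s_j = p_j * mean_j / mean
    g_within = 0.0
    g_between = 0.0
    for j, group_j in enumerate(groups):
        for h, group_h in enumerate(groups):
            term = gini_between(group_j, group_h) * p[j] * s[h]
            if j == h:
                g_within += term    # terms of equation (4)
            else:
                g_between += term   # terms of equation (5)
    return g_within + g_between, g_within, g_between


# Invented incomes for two subgroups (illustrative values only).
groups = [[5.0, 8.0, 12.0], [10.0, 20.0, 45.0, 60.0]]
g_total, g_within, g_between = dagum_decomposition(groups)
pooled = [y for g in groups for y in g]
print(g_total, gini(pooled), g_within, g_between)
```

The weighted sum (3) and the direct formula (1) give the same total, so `g_total` and `gini(pooled)` agree up to rounding error.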

3. Regression trees

The goal of regression trees, as introduced by Breiman, Friedman, Olshen and Stone (1984), is to study the relationship between a scalar response variable Y and a set of covariates X, either real valued or categorical. Regression trees have proved to be a flexible modeling instrument for very complex data sets (Galimberti and Montanari, 2002).

Assuming $y_i = f(x_{i1}, x_{i2}, \ldots, x_{ip}) + e_i$, $i = 1, \ldots, n$, a regression tree approximates the unknown f by a step function defined on the covariate space.

The construction of a regression tree goes through the following main steps:

1. Definition of a set of questions regarding the covariates, also called splits, in order to partition the covariate space. Recursive application of these questions leads to a tree, which is binary if the questions are binary (yes/no). Units for which the answer is yes are assigned to a given daughter node (the left in our results), those for which the answer is no are assigned to the complementary one.


2. Definition of a so-called split function that, at each step, can be evaluated for any split s of any node g. The preferred split is the one which guarantees the purest daughter nodes.

Ordinary least squares (OLS) regression trees produce a partition of the covariate space such that, within each element of the partition, the regression function may be approximated by the mean value of the response variable Y over the units whose covariate values belong to that element.

This is obtained by choosing the total within-node sum of squares as the split function:

$$\frac{1}{n} \sum_{t \in \tilde{T}} \sum_{x_i \in t} \left( y_i - \bar{y}(t) \right)^2$$

(where $\tilde{T}$ is the set of terminal nodes, $y_i$ is the response value measured on the i-th statistical unit belonging to node t and $\bar{y}(t)$ is the node average response value) and by iteratively splitting nodes so as to maximise its decrease.

3. After a large regression tree has been built as described in steps 1 and 2 (the largest one is in principle the one containing only one unit in each terminal node, but in general some constraint on the node size is imposed), overfitting effects can be removed by identifying a sequence of nested trees (with increasing number of terminal nodes) through the evaluation of a cost complexity function and by choosing among these trees the optimal one according to an unbiased estimate of the prediction error. The estimate is usually computed on a so-called test set, that is a new set of units belonging to the same population as those which were used for tree construction, but which did not contribute to node splitting. In OLS trees the cost complexity function is a function of both the tree size and the sum of squared prediction errors on the training set (see Breiman et al., 1984 for further details).
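The OLS split function of step 2 can be sketched for a single numeric covariate as follows. This is a minimal illustration and not the CART implementation: the helper names and the toy data are assumptions, candidate thresholds are taken as midpoints between consecutive distinct covariate values, and the constant factor 1/n is dropped because it does not change which split is chosen.

```python
def within_sum_of_squares(values):
    """Sum of squared deviations of the responses in one node from the node mean."""
    if not values:
        return 0.0
    node_mean = sum(values) / len(values)
    return sum((v - node_mean) ** 2 for v in values)


def best_ols_split(x, y):
    """Return (threshold, decrease) for the binary split 'x <= threshold'
    that maximises the decrease in the within-node sum of squares."""
    parent_ss = within_sum_of_squares(y)
    distinct = sorted(set(x))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]   # answer "yes"
        right = [yi for xi, yi in zip(x, y) if xi > threshold]   # answer "no"
        decrease = parent_ss - within_sum_of_squares(left) - within_sum_of_squares(right)
        if best is None or decrease > best[1]:
            best = (threshold, decrease)
    return best


# Toy data (invented): the response jumps between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
ols_threshold, ols_decrease = best_ols_split(x, y)
print(ols_threshold, ols_decrease)
```

A full tree would apply the same search recursively to every covariate and every node; the pruning of step 3 is not shown.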

An interesting feature of regression trees is that they make it possible to explore the effect of a large number of covariates and at the same time obtain a parsimonious model, since variable selection is performed while the tree is being grown. The covariates which enter the final tree can therefore be considered the most important predictors.


4. Regression trees for the study of income inequalities

The theoretical debate on the measurement of poverty has recently offered substantial

improvements, gradually moving from the traditional unidimensional view of poverty to the

new multidimensional concept of social exclusion (Hagenaars, 1986; Dagum, 1989; Sen,

1992).

As frequently happens after a major theoretical development, a methodological adjustment is needed, but it is neither immediate nor automatic. The adoption of a more general and multidimensional definition of poverty requires adapting the methodological tools for the measurement of poverty, which at present is still generally based on income only.

In this framework classification and regression trees can be an effective tool for an efficient multidimensional analysis of poverty.

When income is the variable Y which has to be studied as a function of a set of covariates, OLS regression trees are however not the most suitable device, as it is well recognised that deviance is not a good measure of overall income inequality. Furthermore, according to Dagum, between-group inequality cannot be properly evaluated either by a measure which takes the income means of the subpopulations as their representative values: “the income distributions of the subpopulations often differ in variance and asymmetry” too.

Following Dagum (1997), in this paper we propose to address the issue of studying the socio

demographic determinants of income inequalities by modifying steps 2 and 3 in the tree

construction. The split function we consider is the inequality within subgroups as given by

equation (4). At each step of the tree construction we choose the split which maximizes the

decrease in the within group concentration.

Suppose we have already obtained a partition of the set of units into k subgroups and that the k-th one is now going to be split again. Denote by $G_w$ the inequality within in the k-group partition and by $G'_w$ the inequality within after splitting node k into its left (L) and right (R) daughter nodes, so that

$$G'_w = \sum_{i=1}^{k-1} G_{ii}\, p_i s_i + G_{k_L k_L}\, p_{k_L} s_{k_L} + G_{k_R k_R}\, p_{k_R} s_{k_R},$$

then our tree building procedure chooses the split which maximizes $G_w - G'_w$. Since the total index G in (3) does not depend on the partition and is the sum of (4) and (5), this also amounts to choosing the split which produces the largest inequality between as given by (5).
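The Gini-based splitting rule can be sketched in the same way for a single numeric covariate. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy data are hypothetical, incomes are assumed positive, and candidate thresholds are midpoints between consecutive distinct covariate values. Each node contributes $G_{tt}\, p_t s_t$ to the inequality within, with the shares $p_t$ and $s_t$ computed over the whole sample.

```python
def within_contribution(node_y, n_total, income_total):
    """Contribution G_tt * p_t * s_t of one node to the inequality within (4)."""
    n_t = len(node_y)
    node_income = sum(node_y)
    if n_t == 0 or node_income == 0:
        return 0.0
    abs_diffs = sum(abs(a - b) for a in node_y for b in node_y)
    g_tt = abs_diffs / (2 * n_t * node_income)   # equals abs_diffs / (2 * n_t^2 * node mean)
    p_t = n_t / n_total                          # population share of the node
    s_t = node_income / income_total             # income share of the node
    return g_tt * p_t * s_t


def best_gini_split(x, y, n_total, income_total):
    """Return (threshold, decrease) for the split of one node that maximises
    the decrease G_w - G'_w in the inequality within."""
    before = within_contribution(y, n_total, income_total)
    distinct = sorted(set(x))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        after = (within_contribution(left, n_total, income_total)
                 + within_contribution(right, n_total, income_total))
        if best is None or before - after > best[1]:
            best = (threshold, before - after)
    return best


# Toy data (invented); the node being split is the root, so the totals are its own.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
gini_threshold, gini_decrease = best_gini_split(x, y, len(y), sum(y))
print(gini_threshold, gini_decrease)
```

Only the node being split enters the comparison, because the contributions of the other k-1 nodes are the same in $G_w$ and $G'_w$.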


The pruning step is modified accordingly, still using the inequality within, evaluated on a test

set, in order to identify the best tree in terms of parsimony and interpretative power.

5. Some results on a real data set

The data used in the study have been supplied by the Bank of Italy, which has conducted a survey of households' income and wealth since 1966 (Banca d’Italia, 2004; Brandolini, 1999; Brandolini and Cannari, 1995).

The 2002 survey covers 8011 households composed of 22184 individuals and 13536 income

earners. The sample unit is the household, which is defined as a group of individuals living

together, linked by ties of blood, marriage or affection and sharing their incomes.

The interviewed households are chosen by a two-stage sampling procedure. In the first stage the location of residence is selected by taking all the cities with a population of more than 40,000 and a random sample of the smaller cities. The second stage concerns the choice of the household, a problem solved by resorting to a semi-panel framework: in 2002 about 45% of the interviewed households were also present in the previous survey, while the remaining ones were randomly chosen.

The main focus of the survey is on net income (that is, income net of taxes and social contributions) and wealth, but it also includes relevant information about demographic characteristics, housing, health, education and labour market. The information provided by the survey allows the construction of an exhaustive set of indicators on the basis of both household and individual data.

In order to ensure comparability among incomes of households of different sizes, the OECD equivalence scale is adopted. Denoting by $n_e$ the number of adult equivalents, by n the household size and by $n_a$ the number of adults (14 years or more), the OECD equivalence scale states that:

$$n_e = 1 + 0.7\,(n_a - 1) + 0.5\,(n - n_a).$$

Table 1: Household size n, number of adults na and number of adult equivalents ne according to OECD equivalence scale

n    1    2    2    3    3    3    4    4    4    5    5    5
na   1    2    1    3    2    1    4    3    2    4    3    2
ne   1    1.7  1.5  2.4  2.2  2    3.1  2.9  2.7  3.6  3.4  3.2


Table 1 shows the relation between household size and number of adult equivalents for some values of n and $n_a$. With respect to the simple per capita household income $y_i / n_i$, the transformation $y_{ei} = y_i / n_{ei}$ gives the household income per adult equivalent, which corrects for the effect of different household sizes and thus allows household incomes to be compared correctly.
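The equivalence scale and the resulting equivalent income can be written as a short Python sketch. The function names and the income figure used in the example are hypothetical; the (n, na, ne) triples are the ones listed in Table 1.

```python
def adult_equivalents(n, n_adults):
    """OECD scale as stated in the text: 1 for the first adult,
    0.7 for each further adult, 0.5 for each member younger than 14."""
    return 1 + 0.7 * (n_adults - 1) + 0.5 * (n - n_adults)


def equivalent_income(household_income, n, n_adults):
    """Household income divided by the number of adult equivalents."""
    return household_income / adult_equivalents(n, n_adults)


# (n, na, ne) triples from Table 1.
TABLE_1 = [
    (1, 1, 1.0), (2, 2, 1.7), (2, 1, 1.5), (3, 3, 2.4), (3, 2, 2.2), (3, 1, 2.0),
    (4, 4, 3.1), (4, 3, 2.9), (4, 2, 2.7), (5, 4, 3.6), (5, 3, 3.4), (5, 2, 3.2),
]
table_1_matches = all(
    abs(adult_equivalents(n, n_adults) - n_e) < 1e-9 for n, n_adults, n_e in TABLE_1
)

# Invented example: a household of two adults and two children earning 27000 euros.
example = equivalent_income(27000.0, 4, 2)
print(table_1_matches, example)
```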

The covariates which have been included in the analysis in order to explain income

inequalities are: household size (numeric), number of household members younger than 14

years old (numeric), number of earners (numeric), gender (binary), educational level (ordered

categorical), age (categorized), marital status (categorical), work status (categorical), branch

of activity (categorical) of the head of the household, a dummy variable indicating if he/she is

an income earner, a further dummy indicating if he/she is the major earner, geographical area

(categorical), town size (categorized), location of the dwelling (categorical), dwelling location

rating (ordered categorical), dwelling rating (ordered categorical), dwelling surface area

(categorized), year of dwelling construction (categorized), number of bathrooms

(categorized), heating system (dummy), dwelling value (categorized), other dwelling

possessed (dummy), dwelling tenure (categorical), net wealth (categorized), amount of bank

current account deposits (categorized), other bank deposits (dummy), PO deposits (dummy),

Italian government securities (dummy), bonds (dummy), mutual funds (dummy), Italian

shares (dummy), managed savings (dummy), foreign securities (dummy), loans to

cooperatives (dummy).

We have analyzed equivalent income as a function of the above mentioned covariates

(including and excluding net wealth and the variables describing the forms of saving) by the

regression trees which make use of Gini’s ratio as the splitting criterion.

In OLS trees, the size of the final tree is the one which minimizes the residual sum of squares evaluated on a test set, which is U-shaped as a function of tree size. In Gini trees, however, owing to the particular characteristics of the index, the test set $G_w$ is monotonically decreasing and reaches its minimum for a highly complex tree which has as many terminal nodes as there are distinct observed income values. In order to obtain a parsimonious model while preserving its capability of measuring inequality, we have decided to construct an analogue of the scree diagram, plotting Gini’s inequality between against the tree complexity (Fig. 1), and to choose the tree whose size corresponds to the elbow of the scree. According to this criterion, trees with 16 and 14 terminal nodes have been chosen for the analysis of income as a function, respectively, of all the covariates and of the socio-demographic and dwelling covariates only (excluding net wealth and the variables describing the forms of saving).


We have decided to build one regression tree on the whole set of variables and another on a subset which excludes the financial ones. In the second case we wanted to evaluate the effect on income of only those variables which could be obtained, for instance, from census data, since the financial ones are not available for the whole population. As socio-demographic and dwelling variables turn out to explain a large part of the between-group concentration, they suggest the possibility of further research aimed at obtaining income estimates, for instance on a regional basis. This represents an interesting challenge, as the problem has so far not been completely solved.

Figure 1: Scree diagram of Gw (evaluated on the test set) for trees obtained using Gini’s ratio as splitting criterion with all the covariates

In order to compare the performances of OLS and Gini trees we have also run OLS trees for

both data sets, constraining them to have the same number of final nodes as the Gini ones.

The results are reported in Figures 2-5 and summarized in Tables 2-9, which give the results on both the training and the test set. The covariates reported in the tree graphs have shown a relevant role in explaining either income variability or income concentration.

From a methodological point of view, an interesting feature emerges from the comparison of OLS and Gini trees, for instance on the whole set of variables. The former concentrate on the wealthy households (which represent only 12.9% of the training set) and employ 5 terminal nodes out of 16 to explain the differences in their equivalent income, while the latter use only two nodes. On the contrary, the Gini tree concentrates on the poorer households (income less than 11,000 euros) and explains their differences by 5 nodes against the 2 used by the OLS tree. This means that OLS and Gini trees detect different aspects of income inequalities and closely reproduce the distinguishing features of the measures they use as splitting criteria.
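The node counts quoted for the poorer households can be checked against the summary tables. The sketch below assumes that a node is counted as "poor" when its mean equivalent income in the training set is below 11,000 euros; this reading is an assumption, not something stated in the text. The node means are copied from Table 2 (Gini tree) and Table 4 (OLS tree).

```python
# Training-set node means (euros), keyed by terminal node label.
GINI_NODE_MEANS = {  # Table 2
    11: 6261.07, 12: 8538.14, 15: 32954.82, 16: 25904.63, 18: 22056.24,
    19: 10516.06, 20: 14666.62, 22: 18388.09, 23: 10093.06, 24: 11997.83,
    25: 10172.34, 26: 15377.99, 27: 13572.21, 28: 18077.18, 29: 12867.15,
    30: 17698.53,
}
OLS_NODE_MEANS = {  # Table 4
    3: 42662.71, 7: 17505.90, 11: 7506.61, 14: 20799.49, 15: 14957.47,
    16: 10728.58, 20: 11258.84, 21: 25255.91, 22: 32901.81, 24: 24312.88,
    25: 11230.25, 26: 15893.79, 27: 11439.64, 28: 16570.90, 29: 20623.20,
    30: 14019.16,
}

POOR_THRESHOLD = 11000.0  # assumed reading of "income less than 11,000 euros"


def poor_nodes(node_means, threshold=POOR_THRESHOLD):
    """Labels of the terminal nodes whose mean income is below the threshold."""
    return sorted(j for j, m in node_means.items() if m < threshold)


gini_poor = poor_nodes(GINI_NODE_MEANS)
ols_poor = poor_nodes(OLS_NODE_MEANS)
print(len(gini_poor), gini_poor)
print(len(ols_poor), ols_poor)
```

Under this reading the counts are 5 nodes for the Gini tree and 2 for the OLS tree, in line with the figures given in the text.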

From an empirical perspective the analysis of regression trees, and particularly of the ones based on the Gini index, offers powerful insights into the poverty structure of Italian households. The main determinant of poverty is net wealth or, equivalently, dwelling value. Furthermore, among the analysed indicators, the other main factors of poverty are the educational level, the geographical area, the branch of activity and the household size.

6. Concluding remarks

By identifying the poverty structure, regression trees can be extremely useful in correctly classifying Italian households into poor and non-poor units. Since the key point in poverty analyses is not to establish how many households are poor, but which ones they are, only the complete and exhaustive information provided by a multidimensional analysis makes it possible to correctly identify the set of the poor and to formulate actions able to reduce poverty.

A further challenging issue is represented by the possibility of estimating income at a regional

or subregional level on the basis of census data by using regression trees built on the regional

samples of the Bank of Italy surveys of Households’ Income and Wealth. This poses new

methodological problems as far as the use of regression trees in stratified sample surveys is

concerned.


Figure 2: Binary tree obtained using Gini’s ratio as splitting criterion with all the covariates


Figure 3: OLS tree obtained with all the covariates (number of terminal nodes set to 16)


Figure 4: binary tree obtained using Gini’s ratio as splitting criterion without covariates on net wealth and forms of saving


Figure 5: OLS tree obtained without covariates on net wealth and forms of saving (number of terminal nodes set to 14)


Table 2: binary tree obtained using Gini’s ratio as splitting criterion with all the covariates: summary statistics (training set)

j     Gj      pj      sj      mean       median
11    0.301   0.105   0.047    6261.07    5740.74
12    0.258   0.102   0.063    8538.14    7911.77
15    0.289   0.038   0.089   32954.82   29057.21
16    0.310   0.037   0.069   25904.63   20894.54
18    0.213   0.051   0.081   22056.24   21481.25
19    0.275   0.071   0.054   10516.06    9420.35
20    0.241   0.067   0.070   14666.62   12677.76
22    0.214   0.074   0.098   18388.09   16731.00
23    0.250   0.077   0.056   10093.06    9264.16
24    0.183   0.073   0.063   11997.83   11691.29
25    0.226   0.072   0.053   10172.34    9630.86
26    0.204   0.067   0.074   15377.99   14631.50
27    0.217   0.043   0.042   13572.21   12347.11
28    0.213   0.035   0.046   18077.18   17489.09
29    0.206   0.042   0.039   12867.15   11846.52
30    0.242   0.045   0.057   17698.53   15755.83

Gtot  0.328             overall mean 13924.40   overall median 11882.64
Gw    0.015 (4.58%)
Gb    0.313 (95.42%)

Table 3: binary tree obtained using Gini’s ratio as splitting criterion with all the covariates: summary statistics (test set)

j     Gj      pj      sj      mean       median
11    0.290   0.107   0.053    6799.08    6043.83
12    0.235   0.096   0.060    8680.22    7908.12
15    0.378   0.032   0.081   35393.66   26163.69
16    0.288   0.028   0.050   24908.29   22000.60
18    0.226   0.047   0.075   22218.41   21322.35
19    0.263   0.091   0.072   10904.60   10186.94
20    0.201   0.054   0.056   14309.76   13271.29
22    0.192   0.075   0.094   17450.59   16314.60
23    0.288   0.064   0.051   10933.50    9750.00
24    0.171   0.073   0.063   11848.78   11161.20
25    0.223   0.065   0.049   10424.05    9553.90
26    0.281   0.067   0.078   15987.11   14392.75
27    0.235   0.059   0.056   13092.18   12641.53
28    0.207   0.046   0.059   17786.82   16305.73
29    0.216   0.049   0.044   12546.86   12123.50
30    0.179   0.051   0.059   16070.48   14885.63

Gtot  0.314             overall mean 13778.43   overall median 12116.63
Gw    0.015 (4.79%)
Gb    0.299 (95.21%)
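As a consistency check of equation (4) against the reported figures, the sketch below recomputes the inequality within from the Gj, pj and sj columns of Table 2 (training set). The variable names are hypothetical; because the table entries are rounded to three decimals, the result can only be expected to match the reported Gw up to rounding.

```python
# (node j, G_j, p_j, s_j) copied from Table 2 (training set).
TABLE_2_ROWS = [
    (11, 0.301, 0.105, 0.047), (12, 0.258, 0.102, 0.063),
    (15, 0.289, 0.038, 0.089), (16, 0.310, 0.037, 0.069),
    (18, 0.213, 0.051, 0.081), (19, 0.275, 0.071, 0.054),
    (20, 0.241, 0.067, 0.070), (22, 0.214, 0.074, 0.098),
    (23, 0.250, 0.077, 0.056), (24, 0.183, 0.073, 0.063),
    (25, 0.226, 0.072, 0.053), (26, 0.204, 0.067, 0.074),
    (27, 0.217, 0.043, 0.042), (28, 0.213, 0.035, 0.046),
    (29, 0.206, 0.042, 0.039), (30, 0.242, 0.045, 0.057),
]
G_TOTAL = 0.328  # Gtot reported under Table 2

# Equation (4): sum over terminal nodes of G_jj * p_j * s_j.
g_within = sum(g * p * s for _, g, p, s in TABLE_2_ROWS)
g_between = G_TOTAL - g_within          # the remainder of the total index
within_share = g_within / G_TOTAL       # share of the total index due to G_w

print(round(g_within, 3), round(g_between, 3), round(100 * within_share, 2))
```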


Table 4: OLS tree obtained with all the covariates (number of terminal nodes set to 16): summary statistics (training set)

j     Gj      pj      sj      mean       median
3     0.360   0.015   0.042   42662.71   26968.52
7     0.262   0.037   0.047   17505.90   16543.03
11    0.291   0.208   0.110    7506.61    6906.94
14    0.255   0.029   0.043   20799.49   19197.29
15    0.183   0.032   0.034   14957.47   14230.52
16    0.241   0.083   0.064   10728.58   10107.55
20    0.221   0.212   0.168   11258.84   10760.00
21    0.269   0.021   0.040   25255.91   24521.25
22    0.247   0.027   0.066   32901.81   31116.99
24    0.201   0.021   0.036   24312.88   22891.30
25    0.226   0.041   0.031   11230.25    9600.00
26    0.200   0.131   0.151   15893.79   14805.33
27    0.199   0.034   0.027   11439.64   10912.41
28    0.255   0.043   0.051   16570.90   15073.00
29    0.182   0.051   0.077   20623.20   20224.33
30    0.211   0.015   0.014   14019.16   11664.24

Gtot  0.328             overall mean 13924.40   overall median 11882.64
Gw    0.024 (7.18%)
Gb    0.304 (92.82%)

Table 5: OLS tree obtained with all the covariates (number of terminal nodes set to 16): summary statistics (test set)

j     Gj      pj      sj      mean       median
3     0.431   0.013   0.044   47202.22   31107.32
7     0.241   0.037   0.046   17452.32   15506.20
11    0.268   0.202   0.113    7689.63    7050.00
14    0.215   0.028   0.040   19814.43   16300.71
15    0.194   0.037   0.040   14781.58   13907.42
16    0.247   0.101   0.075   10321.54    9947.56
20    0.206   0.188   0.152   11129.91   10838.09
21    0.265   0.016   0.029   24741.00   22000.60
22    0.236   0.019   0.044   31557.61   27894.56
24    0.218   0.018   0.029   22715.50   21376.00
25    0.210   0.049   0.040   11242.94   10583.33
26    0.173   0.137   0.153   15459.97   14550.71
27    0.271   0.031   0.026   11500.51   10213.11
28    0.320   0.050   0.063   17320.34   15073.00
29    0.206   0.056   0.084   20756.39   18987.73
30    0.243   0.019   0.021   15233.65   14161.36

Gtot  0.314             overall mean 13778.43   overall median 12116.63
Gw    0.022 (6.95%)
Gb    0.292 (93.05%)


Table 6: binary tree obtained using Gini’s ratio as splitting criterion without covariates on net wealth and forms of saving: summary statistics (training set)

j     Gj      pj      sj      mean       median
11    0.281   0.113   0.076    9375.57    8277.18
12    0.291   0.122   0.059    6690.49    6677.90
15    0.311   0.030   0.071   33412.68   28283.53
16    0.264   0.054   0.088   22564.81   19983.68
17    0.266   0.069   0.077   15647.50   14341.41
18    0.271   0.052   0.085   22739.45   20635.42
19    0.257   0.083   0.067   11371.33   10566.15
20    0.238   0.065   0.069   14745.98   13578.00
21    0.238   0.061   0.079   17880.43   16392.98
22    0.230   0.060   0.062   14345.95   13323.00
23    0.278   0.074   0.064   12008.46   10735.29
24    0.268   0.076   0.066   12205.91   11184.21
25    0.226   0.072   0.078   15285.56   14021.18
26    0.242   0.070   0.058   11654.57   10899.10

Gtot  0.328             overall mean 13924.40   overall median 11882.64
Gw    0.018 (5.62%)
Gb    0.310 (94.38%)

Table 7: binary tree obtained using Gini’s ratio as splitting criterion without covariates on net wealth and forms of saving: summary statistics (test set)

j     Gj      pj      sj      mean       median
11    0.284   0.115   0.079    9456.00    8370.59
12    0.260   0.114   0.060    7256.41    6818.32
15    0.390   0.024   0.061   34999.19   27964.60
16    0.311   0.057   0.095   23183.59   19446.12
17    0.262   0.093   0.101   14844.12   13098.80
18    0.252   0.054   0.087   22149.94   21000.00
19    0.250   0.094   0.075   10933.09    9904.61
20    0.204   0.073   0.075   14290.84   13709.26
21    0.214   0.055   0.067   16754.95   15600.10
22    0.222   0.064   0.070   15029.04   13903.82
23    0.194   0.071   0.061   11725.91   11112.33
24    0.251   0.067   0.059   12221.61   10393.59
25    0.203   0.063   0.064   14143.81   13446.00
26    0.269   0.058   0.048   11375.10   10873.00

Gtot  0.314             overall mean 13778.43   overall median 12116.63
Gw    0.018 (5.81%)
Gb    0.296 (94.19%)


Table 8: OLS tree obtained without covariates on net wealth and forms of saving (number of terminal nodes set to 14): summary statistics (training set)

j     Gj      pj      sj      mean       median
7     0.246   0.028   0.048   24648.09   22730.25
8     0.384   0.012   0.035   39558.46   26149.88
12    0.282   0.225   0.123    7803.62    7323.53
14    0.234   0.125   0.097   10970.55   10460.59
15    0.259   0.013   0.012   14639.94   11851.85
16    0.259   0.064   0.109   24075.55   20593.06
17    0.252   0.125   0.137   15326.55   13658.00
18    0.276   0.037   0.058   20012.18   19740.00
19    0.309   0.041   0.031   10845.44    9023.85
20    0.271   0.015   0.018   17778.06   16989.20
21    0.267   0.015   0.026   23617.99   21150.92
23    0.223   0.071   0.056   11430.16   10338.52
25    0.234   0.207   0.217   14540.45   13613.75
26    0.198   0.022   0.032   19943.37   18864.21

Gtot  0.328             overall mean 13924.40   overall median 11882.64
Gw    0.030 (9.14%)
Gb    0.298 (90.86%)

Table 9: OLS tree obtained without covariates on net wealth and forms of saving (number of terminal nodes set to 14): summary statistics (test set)

j     Gj      pj      sj      mean       median
7     0.339   0.032   0.056   24638.94   19560.83
8     0.269   0.016   0.038   33615.16   29902.52
12    0.263   0.231   0.134    8015.52    7490.94
14    0.235   0.121   0.099   11262.89   10631.66
15    0.235   0.015   0.016   14298.88   12707.76
16    0.346   0.057   0.099   23726.10   19674.12
17    0.230   0.122   0.137   15503.69   14247.05
18    0.275   0.029   0.036   17169.25   15042.84
19    0.320   0.052   0.041   10980.85    9847.91
20    0.237   0.016   0.020   17161.83   17575.67
21    0.206   0.021   0.028   18478.00   16007.04
23    0.204   0.071   0.058   11280.70   10455.35
25    0.220   0.194   0.206   14655.61   13640.67
26    0.254   0.022   0.029   18137.23   16300.71

Gtot  0.314             overall mean 13778.43   overall median 12116.63
Gw    0.029 (9.13%)
Gb    0.285 (90.87%)


References

Banca d’Italia (2004), I bilanci delle famiglie italiane nell’anno 2002, Supplementi al

Bollettino Statistico, Note metodologiche e informazioni statistiche, XIV, n. 12, 2004.

Bhattacharya N., Mahalanobis B. (1967) Regional disparities in household consumption in

India, Journal of the American Statistical Association, 62, 143-161

Brandolini, A. (1999). The Distribution of Personal Income in Post-War Italy: Source

Description, Data Quality and the Time Pattern of Income Inequality, Servizio Studi Banca

d'Italia, Tema di discussione n. 350.

Brandolini, A. and Cannari, L. (1995). The Bank of Italy Survey on Household Income and

Wealth. Ando, A., Guiso, L. and Visco, I. editors. Saving and the Accumulation of Wealth.

Cambridge University Press, New York.

Breiman L., Friedman J., Olshen R., Stone C. (1984) Classification and regression trees,

Wadsworth, Belmont.

Cannari L., D’Alessio G. (2003) La distribuzione del reddito e della ricchezza nelle regioni

italiane, Servizio Studi Banca d’Italia, Tema di discussione n. 482

Cheli B., Ghellini G., Lemmi A., Pannuzi N. (1994) Measuring poverty in the Countries in transition via TFR method: the case of Poland in 1990-91, Statistics in transition, 1, 585-636.

Costa M. (2004) Notes on the Gini index decomposition, Atti della ??? Riunione Scientifica

della Società Italiana di Statistica, Bari, Giugno 2004.

Dagum C. (1989) Poverty as Perceived by the Leyden Evaluation Project. A Survey of

Hagenaars’ Contribution on the Perception of Poverty, Economic Notes, 1, 99-110.

Dagum C. (1997a) A new decomposition of the Gini income inequality ratio, Empirical

Economics, 22, 515-531.

Dagum C. (1997b) Decomposition and interpretation of Gini and Theil entropy inequality

measures, Proceedings of the Business and Economic Statistics Section, American

Statistical Association, 200-205.

Galimberti G., Montanari A. (2002) Regression trees for longitudinal data with time

dependent covariates, in Classification, clustering and data analysis, K.Jajuga, A.

Sokolowski, H.H. Bock (eds), Springer Verlag, Heidelberg, 391-398.

Gini C. (1939) Variabilità e concentrazione, Giuffrè, Milano.

Gini C. (1959) Transvariazione, Libreria goliardica, Roma.


Giorgi G. M. (1999) Income inequality measurement: the statistical approach, in: Handbook

on Income Inequality Measurement, Silber, J. (Ed.), Kluwer, 245-267.

Hagenaars A.J.M. (1986), The Perception of Poverty, North Holland, Amsterdam.

Mehran F. (1975) A statistical analysis of income inequality based on a decomposition of the

Gini index, Proceedings of the 40th ISI Session, Warsaw, 580-585.

Sen A.K. (1992), Inequality Reexamined, Harvard University Press, Cambridge (MA).

Shorrocks A. F. (1980) The class of additively decomposable inequality measures,

Econometrica, 48, 613-625.