Data Exploration


A. Data as the Starting Point

In statistical theory we assume that the statistical model is correctly specified and subsequently derive the properties of the estimators and of the hypothesis tests based upon them.

Applied researchers, however, are more often than not mainly concerned with finding an appropriate model for the problem at hand.

A major theme in applied research is not just to test ideas against data but also to get ideas from data.

Data help to confirm (or falsify) ideas one holds, but they often also provide clues and hints which point towards other, perhaps more powerful ideas.

Indeed applied researchers are often as much concerned with model creation as with model estimation.

Model specification and model selection are two important steps that precede the model estimation stage.

Exploratory Data Analysis (EDA) becomes an important tool of empirical research with this changed orientation.

B. Data Presentation: Raw Data

Variables can be either quantitative (measurement variables: “how much”) or qualitative (categorical variables: “what kind”) in nature;

The latter may also be either ordered (more-less type: level of education) or unordered (different type: colour of eyes, religion, etc.);

The information on some variables may sometimes be missing, or, even when available, it may contain measurement error;

Before starting data analysis one has to take care of these problems;

The raw data are presented in tabular or graphical form (frequency distribution, grouped frequency distribution, histogram, frequency polygon, etc.);

The STEM & LEAF display is a very useful technique of data presentation that captures the characteristics of the entire data set before one enters into the analysis of summary measures;

Suppose we have the following data set available on the 50 states of the US:


Stem and Leaf Graph

1 | 2 6 7
2 | 6
3 | 3 3 3 4 5 6 9 9
4 | 0 1 4 5 7 7 7 7 9
5 | 1 2 3 4 5 6 6 6 7 7 9 9
6 | 2 2 4 9 9 9
7 | 0 2 2 2 2 4 9 9
8 | 2 6
9 | 6

Environmental Voting Percentage by state:

Sl. No.  State          %     Sl. No.  State           %
 1       Idaho          12    26       S. Dakota       55
 2       Utah           16    27       Illinois        56
 3       Alaska         17    28       Montana         56
 4       Wyoming        26    29       Missouri        56
 5       Alabama        33    30       Ohio            57
 6       Mississippi    33    31       Washington      57
 7       Virginia       33    32       California      59
 8       Nebraska       34    33       N. Dakota       59
 9       Arizona        35    34       Maryland        62
10       Arkansas       36    35       Pennsylvania    62
11       Texas          39    36       Hawaii          64
12       Kansas         39    37       Delaware        69
13       Louisiana      40    38       Michigan        69
14       Kentucky       41    39       W. Virginia     69
15       N. Carolina    44    40       Minnesota       70
16       Tennessee      45    41       New York        72
17       New Mexico     47    42       Wisconsin       72
18       Nevada         47    43       New Hampshire   72
19       S. Carolina    47    44       New Jersey      72
20       Colorado       47    45       Iowa            74
21       Georgia        49    46       Maine           79
22       Florida        51    47       Connecticut     79
23       Oklahoma       52    48       Massachusetts   82
24       Oregon         53    49       Rhode Island    86
25       Indiana        54    50       Vermont         96


Stem-and-leaf displays are compact versions of the ordered data: the initial digits are broken off to form the stems, shown to the left of the vertical line, and the digits that follow are the leaves, shown to the right of that line.

Because the raw data are arranged in order, this display reveals the shape of the distribution.

The voting percentage ranges from 12 to 96;

The centre is around 55;

Most of the observations lie between 30 and 70 per cent;
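The display above can be reproduced programmatically; the following short Python sketch (not part of the original notes) builds a stem-and-leaf display, shown here on the first few voting percentages from the table:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Tens digit as stem, units digit as leaf, one line per stem."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    return "\n".join(f"{stem} | " + " ".join(str(d) for d in leaves[stem])
                     for stem in sorted(leaves))

# The first eight voting percentages from the table above
print(stem_and_leaf([12, 16, 17, 26, 33, 33, 33, 34]))
```

Running it on all 50 values reproduces the full display shown above.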

C. Summarizing the dataset:

To explore the nature of the data we have to put them in order;

In fact, data exploration places more emphasis on median-based analysis than on mean-based analysis, since the former is less sensitive to extreme values (i.e., outliers) and is therefore more robust;

So, data exploration starts from order-based statistics such as the median, the inter-quartile range, Bowley’s measure of skewness, etc.

For any data set, five values are most important in studying its nature:

a. Minimum Value;

b. Maximum Value;

c. Median Value (Q2);

d. First Quartile (Q1) and

e. Third Quartile (Q3);

The distance (Q3 – Q1) is called the Inter-Quartile Range (IQR);

If the distribution is symmetric then (Q2 – Q1) = (Q3 – Q2);

So, by comparing these two distances, an idea can be formed about the nature of asymmetry, i.e., skewness;

The distances between Q1 and the minimum value, and between Q3 and the maximum value, help to study the thickness of the tails (kurtosis), which is very important in analyzing outliers.
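The five values and the two half-distances can be computed as in the following sketch (quartile conventions differ across statistical packages, so the exact quartile values may vary slightly):

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max); Q1-Q3 via the 'inclusive' convention."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

# The first twelve voting percentages from the table above
data = [12, 16, 17, 26, 33, 33, 33, 34, 35, 36, 39, 40]
mn, q1, q2, q3, mx = five_number_summary(data)
print("IQR:", q3 - q1)
# Equal half-distances suggest symmetry; unequal ones suggest skewness
print("lower half:", q2 - q1, "upper half:", q3 - q2)
```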

If a distribution is NORMAL then it is symmetric and thin-tailed (skewness = 0, mesokurtic);


Then [IQR/1.35] ≈ the standard deviation (σ);

For any distribution [IQR/1.35] is called the pseudo standard deviation (PSD) and is compared with the actual SD to detect the presence of outliers;

If the distribution is asymmetric and/or the PSD is distinctly different from the SD, then the arithmetic mean is no longer a proper representation of the average behavior of the data in a probabilistic sense;

Now the centre of gravity (mean) differs from the centre of probability (median);

The box-plot is a very useful technique here;

Checking for symmetry (for a uni-modal distribution):

Positive Skewness: mean > median;

Symmetry: mean ≈ median;

Negative Skewness: mean < median;

Checking for the thickness of the tails:

For any distribution, comparison of the SD and the PSD helps to determine the nature of the tails:

PSD < SD implies heavier than normal tails;

PSD ≈ SD implies approximately normal tails;

PSD > SD implies thinner than normal tails.
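The checks above can be sketched as a small diagnostic routine (an illustration only; the 10% tolerance band and the sample values are arbitrary choices, not from the notes):

```python
import statistics

def shape_diagnostics(data, tol=0.1):
    """Classify skewness (mean vs median) and tails (PSD = IQR/1.35 vs SD)."""
    mean, median = statistics.fmean(data), statistics.median(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    psd = (q3 - q1) / 1.35
    skew = ("positive" if mean > median else
            "negative" if mean < median else "symmetric")
    ratio = psd / sd
    tails = ("heavier than normal" if ratio < 1 - tol else
             "thinner than normal" if ratio > 1 + tol else
             "approximately normal")
    return skew, tails

# A few large values pull the mean above the median and fatten the right tail
print(shape_diagnostics([1, 2, 2, 3, 3, 3, 4, 4, 20, 50]))
```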

Ladder analysis is applied to find out the most appropriate transformation to ensure correspondence between mean and median;
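A minimal sketch of ladder analysis, assuming strictly positive data and using the smallest |skewness| as the selection criterion (the notes instead use a χ² goodness-of-fit comparison, and the income figures below are invented purely for illustration):

```python
import math
import statistics

# Tukey's ladder of powers; negative reciprocals keep the ordering of the data
LADDER = {
    "cubic":    lambda x: x ** 3,
    "square":   lambda x: x ** 2,
    "identity": lambda x: x,
    "sqrt":     math.sqrt,
    "log":      math.log,
    "1/sqrt":   lambda x: -1 / math.sqrt(x),
    "inverse":  lambda x: -1 / x,
}

def skewness(data):
    """Moment-based population skewness."""
    m = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum((v - m) ** 3 for v in data) / (len(data) * sd ** 3)

def best_rung(data):
    """Rung of the ladder whose transform leaves the smallest |skewness|."""
    return min(LADDER,
               key=lambda name: abs(skewness([LADDER[name](v) for v in data])))

# Invented right-skewed, strictly positive income-like values
incomes = [100, 120, 150, 200, 260, 350, 500, 800, 1500, 4000]
print(best_rung(incomes))
```

For right-skewed data such as this, the chosen rung sits on the dampening side of the ladder (sqrt, log or a reciprocal), never on the amplifying side.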


Consider the distribution of per capita household income: to get a symmetric distribution, a log-transformation is applied.


[Figure: distribution of the fourth root of GNP per capita]

D. Regression Models:

The theory of statistics primarily deals with univariate distributions;

For regression, using the proposed causal relation we derive an estimated value of the study variable Y and then try to minimize the difference between the observed value of Y and the estimated value of Y;

This difference is called the residual term and is defined as e_i = (Y_i − Ŷ_i);

Regression models are entirely concerned with the stochastic properties of this single variate e_i;

If it is symmetric and thin-tailed, then the mean-centric analysis of gravity is equivalent to the median-centric analysis of probability, and the proposed causal connection between Y and the Xs is statistically confirmed;

Once that is ensured through the testing of “Goodness of Fit” the data exploration part is complete;

One may apply standard Econometric analysis on the transformed model to come up with statistically reliable confirmations;
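As a toy illustration of the residual defined above, here is a hand-rolled simple OLS fit on made-up numbers (a sketch only, not the notes' data or a substitute for the full econometric analysis):

```python
import statistics

def ols_fit(x, y):
    """Least-squares intercept and slope for a simple regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

def residuals(x, y):
    """e_i = Y_i - Yhat_i: the single variate whose behaviour the model assumes."""
    a, b = ols_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Made-up numbers, purely for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(ols_fit(x, y))
print(residuals(x, y))
```

With an intercept in the model, the residuals sum to zero by construction; what data exploration then asks is whether their distribution is symmetric and thin-tailed.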


E. Illustration I:

This section uses an example (Mukherjee, White & Wuyts 1998) to show that seemingly good results in regression analysis may turn out to be quite questionable if we care to scrutinize the data in greater depth.

In the proposed model, cross-country data are used to explain the crude birth rate (Y) in terms of per capita GNP (X1) and the infant mortality rate (X2), where the former is expected to have a negative influence on the study variable and the latter a positive influence.

The estimated regression is:

    Ŷi = 18.8 − 0.00039 X1 + 0.22 X2
        (11.65)   (2.23)      (14.04)

with R2 = 0.79 and n = 109;

At first sight, these results look good. The coefficient of determination (R2) tells us that the regression explains 79% of the total variation of CBR, and both slope coefficients have the expected sign and are significant. Many researchers may be inclined to stop the data analysis at this point. That would definitely be unwise.

Before accepting the model one has to carry out diagnostic checking. The normality assumption is checked by the Jarque-Bera (J-B) test, which is weakly passed at the 5% level of significance. The Goldfeld-Quandt (G-Q) test has been carried out to rule out heteroscedastic errors, and it too accepts the null hypothesis of homoscedasticity at the 5% level. So far, we have enough reason to be satisfied with our fitted model.

However, a few interesting observations may be made at this juncture: (a) we never actually looked at the data but concentrated on the final results alone; (b) the only purpose of the data set was to verify whether it supports the hypothesis at hand or not; (c) no attempt was made to explore other equally interesting possibilities that may be suggested by the same data set.

To explore the data, at the first step the histograms of the three variables are plotted, and it can be seen that they have very different patterns of distribution. Since the regression model tries to explain the variation in Y in terms of the variation in the Xs, the pattern of the X-variations should correspond closely with that of the Y-variation. Moreover, for the normality assumption to hold, all these distributions should be more or less bell-shaped.


At the next step pair-wise scatter plots and correlation coefficients should be studied.


Pair-wise patterns of the scatter plots:

        X1                      X2                      Y
X1      -                       Negative exponential    Negative exponential, with a number of outliers
X2      Negative exponential    -                       Positive exponential
Y       Negative exponential, with a number of outliers    Positive concave    -

Since the regression model is linear, the variables should be transformed so as to arrive at linear scatters among all pairs. Skewness in the data is a major problem when modeling an average.

A skewed distribution has no clear centre: the mean, its centre of gravity, will differ from the median, its centre of probability. Moreover, the sample mean is no longer an attractive estimator of the population mean in the presence of skewness.

To handle this problem one has to design a suitable non-linear transformation for each relevant variable: a power, a square root, a logarithm, and so on. The log-transformation is highly popular in applied data analysis.
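A small illustration of why the log-transformation is so popular (on hypothetical right-skewed values, not the book's data): after logging, the gap between mean and median shrinks sharply relative to the centre of the data.

```python
import math
import statistics

# Hypothetical right-skewed, income-like values (not the book's data)
x = [200, 300, 400, 500, 700, 1000, 1500, 2500, 6000, 20000]
logs = [math.log(v) for v in x]

# Logging pulls in the long right tail, so mean and median come together
gap_raw = statistics.fmean(x) - statistics.median(x)
gap_log = statistics.fmean(logs) - statistics.median(logs)
print(gap_raw / statistics.median(x), gap_log / statistics.median(logs))
```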

When X1 is replaced by log(X1) and X2 by (X2)^½, the pair-wise scatter plots suggest regular linear shapes.


The re-estimated regression is:

    Ŷi = 2.59 + 0.63 log(X1) + 4.06 (X2)^½
        (0.38)  (0.925)         (13.78)

where R2 = 0.85 and n = 109;

The first thing to note about this regression is that the value of R2 has gone up from 0.79 to 0.85. The income variable log(X1) has lost its importance altogether and should be dropped from the model.

When dropped, the estimated regression equation becomes

    Ŷi = 3.61 + 3.83 (X2)^½
        (2.75)    (24.17)

with R2 = 0.85 and n = 109.

This simple regression confirms that dropping the income variable from the equation hardly affects the coefficient of determination. This regression, therefore, yields a better result than the multiple regression proposed earlier.


F. Illustration II (Dutta & Banerjee, 2013):

Here an example is given using unit-level NSSO data on morbidity, collected in the 60th Round during 2004-05;

Observations have been taken for urban West Bengal, and the type of fuel used has been taken as a proxy for indoor pollution;

The causal link between indoor pollution and morbidity has been studied by controlling for a number of socio-economic variables such as monthly per capita expenditure, living condition, and the level of education of the head of the household.

Hypothesis and Variables:

Morbidity = f (Income, fuel-use, education, living condition)

I. Morbidity index (M2): percentage of morbid members in each household; dependent variable.

II. Monthly per capita expenditure of the household (MPCE): same as in the data source; expected sign: negative.

III. Fuel use (FUEL): dirty = 1 / clean = 0; expected sign: positive.

IV. Education (EDU): education of the head of the household, indexed as EDU = (Actual/Maximum)*100; expected sign: negative.

V. Living condition (LCI): constructed on the basis of information on house structure, latrine, drainage and water source, using PCA; expected sign: negative.


Descriptive Statistics (NSS 60th Round)

Statistic               M2       MPCE      LCI      FUEL     EDU
Mean                    17.99    1062.45   79.28    0.51     58.02
Median                  0.00     875.00    88.27    1.00     66.67
Standard deviation      25.58    753.27    18.78    0.50     35.24
IQR                     25.00    716.00    18.12    1.00     66.67
Pseudo SD               18.53    531.30    12.07    0.74     49.42
Skewness                1.77     2.78      -1.37    -0.04    -0.39
Kurtosis (normalized)   2.80     13.42     1.33     -2.00    -1.10
Sample size             1878     1878      1878     1878     1878

For FUEL, 922 observations take the value 0 and 956 take the value 1.
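The summary measures in the table can be reproduced directly; for instance, the FUEL column (922 zeros and 956 ones, as reported above) gives a pseudo-SD of IQR/1.35 = 0.74. A sketch (quartile conventions may differ slightly across packages):

```python
import statistics

def eda_summary(data):
    """Mean, median, SD, IQR and pseudo-SD (IQR/1.35), as in the table above."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),
        "iqr": q3 - q1,
        "psd": (q3 - q1) / 1.35,
    }

# The FUEL dummy: 922 zeros and 956 ones, as reported in the table
fuel = [0] * 922 + [1] * 956
s = eda_summary(fuel)
print(round(s["mean"], 2), s["median"], round(s["psd"], 2))  # 0.51 1.0 0.74
```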

Box-Plot: MPCE
[Box-plot of MPCE; vertical axis from 0 to 8,000]


Box-Plot: M2, EAI, LCI (all index values)
[Box-plots of M2, EAI and LCI_indx; vertical axis from 0 to 100]

For MPCE:

Mean > Median → positively skewed, with a thick right tail since PSD < SD;
→ outliers at the upper tail;
→ high values need to be dampened;
→ to decide on the nature of the appropriate transformation, consider the ladder graph;
→ given the nature of transformation needed, log (ln) or square root (sqrt) seems suitable;


The Ladder-Graph for MPCE

[Histograms of MPCE under each rung of the ladder of powers: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic]

Between ln and sqrt, the former appears more suitable, and this can be confirmed by the χ2 test of goodness of fit through application of ladder analysis;

For EAI no adjustment is needed;

For M2, the upper tail needs to be dampened; due to the presence of a number of 0-values, the log transformation cannot be applied; hence the square root appears to be the most suitable one;

For LCI the outlier is on the left-hand side; the lower values need to be raised and pushed towards the centre; this suggests squaring LCI as the required transformation;


Transformation of the variables

Sl. No.  Variable  Property                       Transformation
I.       M2        Skewness > 0, fat right tail   Square root (many 0s, so log not appropriate)
II.      MPCE      Skewness > 0, fat right tail   Logarithmic
IV.      EDU       Kurtosis < 0, thin tails       No transformation
V.       LCI       Skewness < 0, fat left tail    Square
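The chosen transformations can be applied in one pass over the data, as in this sketch (the record values below are hypothetical, not taken from the NSS data):

```python
import math

def transform(record):
    """Apply the table's transformations: sqrt(M2) (many zeros, so log is out),
    log(MPCE), EDU unchanged, LCI squared."""
    m2, mpce, edu, lci = record
    return math.sqrt(m2), math.log(mpce), edu, lci ** 2

# A hypothetical household record: (M2 %, MPCE in Rs, EDU index, LCI index)
print(transform((25.0, 875.0, 66.67, 88.27)))
```

Note that math.sqrt(0) is well defined while math.log(0) is not, which is exactly why the square root was preferred for M2.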

So, data exploration advises the researcher to pay attention to the entire data set and not merely to the central tendency alone. Whenever the distribution is skewed (i.e., the mean is different from the median) or fat-tailed (indicating denial of mean-convergence and the presence of outliers), appropriate transformations should be defined to come up with a satisfactory model specification.