Group Project 6 IES
7/28/2019 Group Project 6 IES
http://slidepdf.com/reader/full/group-project-6-ies 1/11
Internal Consistency Check for Data Collected in a Survey on Common Man’s Expectation about Inflation Rate
Sayar Karmakar, Tamal Kumar De, Deborshee Sen,
Arka Bhattacharjee, Riddhiman Bhattacharya, Abhirup Mondal
Introduction
In this project we carry out internal consistency checks on survey data. Because the dataset is large, it is natural to ask how reliable the responses are. To address this question we run several tests on the dataset and draw some conclusions.
Description of the dataset
The dataset is obtained from a survey conducted by the Reserve Bank of India. A total of 17 questions were
asked. Three of them were direct questions about the expected inflation rate: currently, three
months later and one year later. Along with these direct questions, respondents were asked whether the rate of
change of commodity prices would increase, decrease or stay the same over the coming 3 months and 1 year.
These two questions were asked for goods in general and then separately for each kind of goods. The data
also contains the occupation, age and sex of the respondents, along with the name of the investigator. The
data was collected at four different times over a period of one year in 12 major cities.
A special feature of the dataset is that all the responses are categorical. There were ten categories for the direct
questions (<1%, 1-2%, …, >10%, etc.). For the indirect questions the categories were: price increase at
more than the current rate, at the current rate, at less than the current rate, no increase in price, and
decline in price. These categories were denoted in the dataset by the letters A, B, C, D, E, …, J or A, B, C, D, E.
The data for the first 8000 responses in the current inflation rate column was missing. It seems that the
question was not asked in the first two phases of the survey.
What is meant by consistency
Suppose a respondent says that the current inflation rate is 1-2% and that next
year it will be more than 10%, but, when asked about price increases, says that the general rate of
price change will decrease from its current level. This is an inconsistency
between the answers to a direct question and an indirect question.
There might also be inconsistency between the response about the general price increase and the commodity-wise
price increases. Suppose a respondent says that commodity prices in general will increase faster than the
current rate in the next 3 months, but, when asked about commodity-wise price changes, replies that the price
will decline for every commodity. We can test for this kind of inconsistency in both the 3
month data and the 1 year data.
These are inconsistencies between the responses of a particular respondent. We might also be
interested in whether the responses of the people surveyed by a particular investigator in a particular city
and time are close or not. If in some cases we see too much variation or too much agreement, then we
might suspect that something has gone wrong.
We can also assign a consistency percentage to every investigator depending on the proportion of
inconsistent responses. Then we can look at how this proportion varies over investigators in the same
city and time. We might also look for a pattern between the consistency percentage and the workload.
As the data collection times were spaced 3 months apart, we could compare the data collected on the expected
inflation rate after 3 months with the data on the current inflation rate obtained three months later. This
was not done because neither the persons interviewed nor the investigators remain the same over the time period.
Also, inflation rate changes are highly unpredictable, and in a consistency check we are not
interested in how well the respondents predicted future inflation rates. So this comparison would not
give us anything useful.
Consistency between the direct and indirect question
Here the direct questions ask what the current inflation rate is and what the inflation rate will be in
the next three months and 1 year. We compare these responses with the response on the rate of price change
of commodities in general.
We look at the responses about the current inflation rate and the rate in the next three months. The difference
between these two responses should be consistent with what the respondent said about the price increase for
commodities in general.
So first of all we set some rules to decide which situations we treat as inconsistent. We have to be
very careful about the choice of the rules. We lean towards a conservative scenario and form the rules as
below.
Rules
We first replace the responses A, B, C, D and E for the indirect questions (good-wise) with 10, 30, 50, 70 and 90
respectively. This is our response (3). Then we replace the responses A, B, …, J of the direct questions (inflation rate)
by 5, 15, …, 95 respectively. The 3 month or 1 year inflation rate is response (2) and the current
inflation rate is response (1). We call a response consistent if
i. Response(2) - Response(1) > 35 and Response(3) = 10, or
ii. 15 < Response(2) - Response(1) < 35 and Response(3) <= 30, or
iii. -15 < Response(2) - Response(1) < 15 and Response(3) <= 50, or
iv. -15 < Response(2) - Response(1) < 15 and 30 <= Response(3) <= 70, or
v. Response(2) - Response(1) < -35 and Response(3) >= 50
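The rules above can be sketched in Python as follows. This is a minimal illustration, assuming the letter codes map to numbers exactly as described; the names `DIRECT`, `INDIRECT` and `direct_indirect_consistent` are hypothetical and not taken from the original analysis. The rules are applied as printed. As printed, no rule covers a difference between -35 and -15, so such responses are reported as inconsistent by this sketch.

```python
# Numeric codes as described in the text (assumed mapping).
DIRECT = {c: 5 + 10 * i for i, c in enumerate("ABCDEFGHIJ")}  # A..J -> 5, 15, ..., 95
INDIRECT = {c: 10 + 20 * i for i, c in enumerate("ABCDE")}    # A..E -> 10, 30, ..., 90


def direct_indirect_consistent(current, future, general):
    """Return True if rules i-v, as printed, accept the response.

    current: letter for the current inflation rate, response (1)
    future:  letter for the 3 month or 1 year inflation rate, response (2)
    general: letter for the general price-change question, response (3)
    """
    diff = DIRECT[future] - DIRECT[current]
    r3 = INDIRECT[general]
    return (
        (diff > 35 and r3 == 10)                     # rule i
        or (15 < diff < 35 and r3 <= 30)             # rule ii
        or (-15 < diff < 15 and r3 <= 50)            # rule iii
        or (-15 < diff < 15 and 30 <= r3 <= 70)      # rule iv
        or (diff < -35 and r3 >= 50)                 # rule v
    )
```

For instance, the respondent from the earlier example (current rate 1-2%, more than 10% next year, general price-change rate declining) would be coded as `("B", "J", "E")` under this assumed mapping and is rejected by every rule.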
These rules were chosen carefully. To illustrate, consider the first rule. If response (3) is 10, the respondent
expects prices to increase faster than the current rate, so the difference between response (2) and response (1) should
be high. The other rules can be verified similarly.
We carry out the same tests for both three months and 1 year. These are two types of consistency checks.
Consistency of the general price increase rate and commodity-wise price increase rates
Here, we want to see whether the responses on the price change in general are consistent with the
responses for separate goods. For example, if someone responds that the price will decrease for every
commodity but says that prices will increase in general, this is an inconsistency. We therefore want to
compare a single categorical response with a vector of categorical responses.
We do not know how much weight each commodity carries in the general price level.
So we cannot compare the general response with a weighted mean of the responses
in the categories. Instead, we look at the range of the good-wise responses.
If the general response lies outside this range then we say that it is inconsistent. This may seem too
conservative, but it is the best we can do. Any weighted mean always lies within the
range of the responses, so, as we do not know the actual weights, our measure of inconsistency remains
invariant under any weighting scheme.
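The range check described above can be sketched in a few lines of Python. This is an illustration only: it assumes the five indirect categories are ordered A < B < C < D < E, and the function name `goodwise_consistent` is hypothetical.

```python
ORDER = "ABCDE"  # assumed ordering of the indirect-question categories


def goodwise_consistent(general, goods):
    """Return True if the general response lies within the range
    spanned by the good-wise responses (both ends inclusive)."""
    ranks = [ORDER.index(g) for g in goods]
    return min(ranks) <= ORDER.index(general) <= max(ranks)
```

Because any weighted mean of the good-wise ranks lies between their minimum and maximum, a response rejected here would be rejected under every choice of weights.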
We do this good-wise consistency check for both 3 months and 1 year, over the various cities and over time.
Once we define these 4 types of inconsistency, we say a response is consistent if it is consistent in all 4
checks. If the direct-indirect consistency cannot be checked because the data is unavailable, we say the
response is consistent if it passes the good-wise checks.
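One way to express this combination rule in Python is sketched below. It assumes each of the four checks has already been evaluated to `True`, `False`, or `None`, where `None` marks a check that could not be performed; the function name and argument names are hypothetical.

```python
def overall_consistent(direct_3m, direct_1y, goods_3m, goods_1y):
    """Combine the four checks. Each argument is True, False, or None
    (None means the check could not be carried out for this response)."""
    checks = [direct_3m, direct_1y, goods_3m, goods_1y]
    # Only the checks that could actually be performed are required to pass.
    return all(c for c in checks if c is not None)
```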
Results for these four types of inconsistency
We first plot the consistency proportions for the different time points. For the March and June data only the last
two tests were used (the first two tests cannot be performed because data on the current inflation rate is missing),
whereas for September and December all four tests were used and we called a response consistent if
it is consistent in all four tests.
Then we plot the consistency proportion for the four months using only the last two tests. We see a
decreasing pattern when all four types of consistency are combined, whereas no specific pattern
is found when only the good-wise consistency is checked over time. This is to be expected, as 2 tests
for consistency were not performed for March and June, and these tests would have reduced the proportion of
consistent responses in those two months.
Then we concentrate on the individual time points and look at the consistency percentage for different cities.
In the following plots the cities are plotted in the following order: Ahmedabad, Bangalore, Bhopal, Chennai,
Guwahati, Hyderabad, Jaipur, Kolkata, Lucknow, Mumbai, New Delhi, Patna.
[Figure: city-wise consistency percentage; panels: March, June, September, December]
[Figure: good-wise consistency for March and June, all four tests for September and December]
[Figure: good-wise consistency for all the months]
We can easily see a sharp decline in the proportion of consistent responses in Ahmedabad
in September, and again in Patna in December. Hyderabad in September and Bangalore in December
also have markedly low consistency percentages. The percentage of consistent responses is
low in quite a few cases, leading us to conclude that the overall consistency of the responses is not very
high.
Next we plot the consistency proportions for the cities for the 4 months taken together. In the following
plot most cities have about the average percentage of consistent responses. The only exception
is Patna, for which the overall percentage of consistent responses falls below 70%.
Consistency percentage of an investigator and effect of workload on it
We have discussed mainly two types of inconsistency between the responses of a single respondent. We
can calculate each of them for both 3 months and 1 year. We say a particular respondent is
inconsistent if he has an inconsistency of any of these four types. Then we look at the proportion of
inconsistent respondents among those interviewed by a particular investigator. This might give us an idea
of the reliability of that investigator. We plot this percentage against the workload for the different
investigators in a city and look for a pattern, if any.
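The per-investigator figures described above amount to a simple group-by. The following Python sketch assumes the data is available as (investigator, is_consistent) pairs and takes workload to mean the number of respondents interviewed; both the input layout and the name `investigator_summary` are assumptions made for illustration.

```python
from collections import defaultdict


def investigator_summary(records):
    """records: iterable of (investigator_id, is_consistent) pairs.

    Returns {investigator_id: (workload, consistency_percentage)},
    where workload is the number of respondents interviewed.
    """
    counts = defaultdict(lambda: [0, 0])  # [respondents, consistent respondents]
    for investigator, ok in records:
        counts[investigator][0] += 1
        counts[investigator][1] += bool(ok)
    return {inv: (n, 100.0 * k / n) for inv, (n, k) in counts.items()}
```

The resulting (workload, percentage) pairs are what would be plotted for each city.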
We plot workload v/s consistency proportion for all twelve cities and notice that there is no specific
pattern.
[Figure: overall consistency percentage over 4 months in all the 12 cities]
[Figure: workload v/s consistency proportion by city; panels: Ahmedabad, Bangalore, Chennai, Hyderabad, Jaipur, Lucknow, Kolkata, Mumbai, Guwahati, Bhopal]
We also plotted the workload and consistency proportions for all the investigators together.
Again we see no specific pattern. We fitted a linear regression of consistency percentage on workload and
obtained an intercept α of 81.59 and a slope β of -0.0022. The p-value for the test H0: β = 0 v/s H1: β ≠ 0,
using the t-statistic for β, was found to be 0.5567, so H0 cannot be rejected. In other words, there is no
significant relationship between workload and consistency percentage.
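For reference, the regression and the t-statistic for the slope can be computed as in the sketch below. It uses only the Python standard library and ordinary least squares formulas; the function name `slope_t_test` is hypothetical and this is not the authors' code. The p-value additionally needs the Student t distribution with n - 2 degrees of freedom, which the standard library does not provide, so the sketch stops at the t-statistic.

```python
import math


def slope_t_test(x, y):
    """Ordinary least squares fit y = alpha + beta * x.

    Returns (alpha, beta, t), where t is the t-statistic for H0: beta = 0,
    to be compared with a t distribution on len(x) - 2 degrees of freedom.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    # Residual sum of squares and the standard error of the slope.
    rss = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    return alpha, beta, beta / se_beta
```

Here `x` would hold the workloads and `y` the consistency percentages of the investigators. A t-statistic close to zero corresponds to a large p-value such as the 0.5567 reported above.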
Dependence on age, occupation and sex of the respondents
We examine how our measure of consistency varies with the age, occupation and sex of the respondents.
We first plot the percentage consistency for males and females.
Then we plot it for the different occupation categories.
Next we divide the age of the respondents into categories and plot the respective consistency
percentages.
[Figure: workload v/s consistency proportion by city, continued; panels: New Delhi, Patna]
[Figure: workload v/s consistency percentage over all cities]
So we see that there is no specific pattern in the consistency proportions. They are more or less
comparable for every such categorization.
Variation of responses in a particular question for an investigator
Here we wish to see how much spread is shown by the responses collected by a particular investigator. If
this dispersion measure is too small or too large then we might suspect that something is wrong with the
reliability of the corresponding investigator. To do this, we need a measure of dispersion for
ordinal data.
We pick two respondents at random and look at their individual responses. Then we take the distance
between these categories. We repeat this 10000 times and take the mean or median of the distances.
This sample mean or median of the distances gives us a good idea of the actual
dispersion of the distribution from which the ordinal data comes.
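The sampling procedure above can be sketched as follows. The text does not say how the distance between two categories is defined, so this sketch assumes it is the absolute difference of their positions in the alphabet (A = 0, B = 1, …); the function name, the fixed seed and that distance are all illustrative assumptions.

```python
import random
import statistics


def pairwise_dispersion(responses, n_draws=10_000, use_median=False, seed=0):
    """Estimate the dispersion of ordinal responses coded as letters.

    Repeatedly draws two distinct respondents at random and records the
    distance between their categories, then returns the mean (or median)
    of those distances.
    """
    rng = random.Random(seed)
    ranks = [ord(r) - ord("A") for r in responses]  # assumed: A=0, B=1, ...
    distances = []
    for _ in range(n_draws):
        a, b = rng.sample(ranks, 2)  # two different respondents
        distances.append(abs(a - b))
    return statistics.median(distances) if use_median else statistics.fmean(distances)
```

If every respondent gives the same answer the measure is 0, and it grows as the answers spread over more distant categories.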
We calculate this measure for each of the 15 questions and over the various cities, and compare the results obtained.
Once we have the spread measure for all the questions, we wish to see whether some cases have too
high or too low a spread. For a particular investigator we call a particular question inconsistent if its measure is
outside the interval (mean - 1.5 sd, mean + 1.5 sd) for that question.
[Figure: sex-wise consistency proportion]
[Figure: occupation-wise consistency proportion]
[Figure: age-wise consistency proportion]
If a particular investigator has inconsistent measures for at least 2 questions then we say that he is
suspicious or unreliable. If an investigator has inconsistent measures for 4 or more questions then we identify
that investigator as highly unreliable.
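The flagging rule can be illustrated with the Python sketch below. It assumes the mean and standard deviation for each question are taken across all investigators passed in, and that the sample standard deviation is used; the text does not specify either choice. The function name and the labels are hypothetical.

```python
import statistics


def flag_investigators(spread, k=1.5):
    """spread: {investigator: [dispersion for question 1, ..., question q]}.

    A question counts against an investigator when its dispersion lies
    outside (mean - k*sd, mean + k*sd) for that question.
    """
    n_questions = len(next(iter(spread.values())))
    outside = {inv: 0 for inv in spread}
    for q in range(n_questions):
        column = [values[q] for values in spread.values()]
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)  # assumption: sample standard deviation
        for inv, values in spread.items():
            if abs(values[q] - mean) > k * sd:
                outside[inv] += 1

    def label(count):
        if count >= 4:
            return "highly unreliable"
        if count >= 2:
            return "suspicious"
        return "ok"

    return {inv: label(count) for inv, count in outside.items()}
```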
Results
We first removed the investigators who surveyed fewer than 20 persons, as a measure of
spread based on paired samples from a very small population would not be very meaningful. Out
of the remaining 180 investigators, 32 were found to be unreliable or suspicious and 10 were highly
suspicious. Among the cities, the highest numbers of unreliable investigators are from Chennai and New
Delhi. From Chennai there were 4 investigators who were unreliable or suspicious and 4 who were highly
suspicious, among a total of 20 investigators. From New Delhi 2 were highly suspicious and 3 more
were suspicious or unreliable. Ahmedabad scored a clean sheet, as no investigator from this city was
identified as unreliable.
Variation of response in a particular question for different cities
Next we plotted the variation of response in the 15 questions for different cities. The results obtained are
tabulated below (one row per city, one column per question) and also described by a multiple line-plot.
Current  Gen 3m   Gd1 3m   Gd2 3m   Gd3 3m   Gd4 3m   Gd5 3m   3 month  Gen 1y   Gd1 1y   Gd2 1y   Gd3 1y   Gd4 1y   Gd5 1y   1 year
1.8969   1.2643   1.1748   1.1157   1.2293   1.4026   1.5141   2.4689   1.3985   1.0916   1.1508   1.3903   1.4402   1.4749   2.5868
2.5312   1.0216   0.9477   1.0901   1.5196   0.7098   1.287    2.678    0.9949   0.9044   1.0226   1.537    0.7008   1.2523   3.0453
2.5604   1.0778   0.954    1.3348   1.374    1.2085   1.4871   1.5005   0.6472   0.4473   0.9186   1.4839   0.8031   1.3764   1.6822
1.6939   0.5706   0.7508   0.7986   1.0296   1.1472   1.2188   1.9913   0.5359   0.6843   0.8153   1.0068   1.102    1.2705   2.0419
1.8941   0.9745   1.0415   1.0311   0.607    1.1342   0.6031   1.6376   0.7354   0.6045   0.7202   0.6264   0.632    1.0769   1.871
1.612    0.987    0.9388   0.9875   1.1604   0.9631   1.081    2.3912   1.0548   0.8698   0.9564   1.1095   0.9474   1.0552   2.2248
1.9406   0.7225   0.6066   0.7022   0.8271   0.7312   0.9817   1.9529   0.7099   0.5656   0.6507   0.7667   0.6866   0.9299   2.2396
2.0337   0.9551   1.0476   1.1994   1.2221   1.3004   1.3837   1.7314   0.8166   0.9316   1.1927   1.3016   1.2808   1.4328   1.6536
1.7552   0.8138   0.6304   0.8502   1.1511   1.04     0.9786   2.9187   0.7793   0.6336   0.7709   1.0746   1.0098   0.9355   2.9867
2.611    1.1311   1.1135   1.2      1.2351   1.2259   1.3192   2.0025   1.1679   1.1872   1.193    1.3106   1.2446   1.2627   1.9501
2.1442   1.083    0.7804   0.9736   1.3465   0.8369   1.2075   2.9519   1.0194   0.6413   0.8684   1.4179   0.7109   1.2148   2.7633
2.6309   1.0107   0.8427   1.0728   1.7467   1.121    1.4231   2.6996   1.3381   1.3881   1.4831   1.6905   1.5211   1.5721   2.8522
Summary
1. First of all we decide what we mean by consistency. We mainly address respondent-wise
consistency here.
2. For a particular respondent, we look at the responses on current inflation and on inflation 3 months later. If
their difference contradicts the response in the general category we call it an inconsistency. The same
was done for 1 year.
3. We checked whether the response on general price change is consistent with the responses on price changes
for the different commodities. This was done for both the 3 month and the 1 year responses.
4. Once we defined these inconsistencies, we called a respondent inconsistent if he has an inconsistency of any one
of these 4 types. Then we looked at the consistency percentage of the different investigators and whether
workload has any effect on this percentage.
5. We also checked whether the sex and occupation of the respondent have any particular effect on the proportion of
inconsistency.
6. After dealing with this respondent-wise inconsistency, we checked whether the responses within a city are
close or not. We use a particular measure of dispersion for this purpose. We look at how this variation
differs from city to city and also between investigators.
Future works
1. The measure we used to judge whether a particular respondent is consistent in his direct and indirect questions
depends on our own intuition. We can look for a measure that would be invariant under
subjective judgments.
2. In the literature there are measures for checking the consistency of real-valued responses, such as Cronbach’s
alpha, McDonald’s omega, etc. These measures can be used if we can replace the categories by suitable
representative values. Likert scaling is one of the methods we can use.
3. To replace the categorical data by suitable real numbers, we were considering using the cumulative
frequency table. If the histogram suggests that the responses follow a normal pattern,
then we could replace the i-th category by the expectation of the normal distribution truncated between
phi-inv(P_{i-1}) and phi-inv(P_i), where P_i is the cumulative proportion up to category i. If the pattern is not
normal, we would try to guess the distribution and use the corresponding quantiles.
4. We have been saying that a respondent is inconsistent if there is an inconsistency of any of the 4 types
we discussed. Instead we can look for some combination of these 4 inconsistencies to get a consistency
measure for each respondent. We would then call a respondent inconsistent if he or she has a consistency
measure less than a particular cut-off.
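The truncated-normal scoring proposed in item 3 can be sketched in Python as below. For a standard normal variable Z, the expectation of Z given a < Z < b is (φ(a) - φ(b)) / (Φ(b) - Φ(a)), where φ and Φ are the density and distribution function. The sketch assumes a standard normal latent scale and takes the cumulative proportions from the observed category counts; the function name `normal_scores` is hypothetical.

```python
from statistics import NormalDist


def normal_scores(counts):
    """counts: observed frequencies of the ordered categories.

    Returns one score per category: the mean of a standard normal
    truncated between phi-inv(P_{i-1}) and phi-inv(P_i), where P_i is
    the cumulative proportion up to category i. Empty categories get None.
    """
    z = NormalDist()  # standard normal (an assumption of this sketch)
    total = sum(counts)
    scores, cumulative = [], 0
    for c in counts:
        p_lo = cumulative / total
        p_hi = (cumulative + c) / total
        cumulative += c
        if c == 0:
            scores.append(None)
            continue
        # The density vanishes at -infinity and +infinity.
        pdf_lo = z.pdf(z.inv_cdf(p_lo)) if p_lo > 0 else 0.0
        pdf_hi = z.pdf(z.inv_cdf(p_hi)) if p_hi < 1 else 0.0
        scores.append((pdf_lo - pdf_hi) / (p_hi - p_lo))
    return scores
```

The frequency-weighted mean of the scores is zero by construction, so the scores are centred on the latent scale. Scores of this kind could then be fed into measures such as Cronbach’s alpha mentioned in item 2.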
Conclusion
This is mainly a comparative study. We could not carry out the first two types of consistency check on all
responses because the data was unavailable. This type of systematic error should be avoided in future surveys.
Among the remaining 7999 responses, 6273 (78.422%) were consistent for the first two types (direct-indirect).
Among all the 15999 responses, 14099 (88.124%) were consistent for the second two types of consistency (good-wise).
Only 12755 (79.724%) of the 15999 responses are consistent considering all four types. So the consistency rate
falls when we consider all four measures together, and the consistency between direct and indirect questions
affects the consistency measure a lot. There are two possible reasons for this. One is that the inflation rate is a
complicated concept that not many common men are aware of, so that type of inconsistency is more
prominent. Another reason might be that we were a bit conservative in deciding the rules for checking
good-wise consistency. Overall, the general understanding of inflation seems to be quite poor
among the people, hence an overall consistency of only 79.7%.
Among the cities, Ahmedabad showed very high consistency in March but very low consistency in
September. Patna also has quite a low consistency percentage in December. The other cities
did not show any such pattern: they varied between time points, but without drastic
changes like those in these two cities. So we might suspect that something went wrong in these two cities at
those specific time points. We suspect mainly a bad choice of sample or some failure on the part of the
investigators who carried out the survey.
Among the months, the overall consistency decreased from March to December, but there is no drastic change.
So we could not conclude anything very specific about consistency in these months.
We tried to see whether the workload affected a particular investigator. We found no specific pattern
for that. The linear regression showed that the slope coefficient is insignificant. This suggests that whatever
inconsistency these investigators showed was not due to their workload; the perception of the
common man may have played a big role in these consistency measures.
Then we wished to see whether categorizing these common men shows some specific pattern. We see
no specific dependence of consistency on the age, sex and occupation of the respondents. Although
consistency increases slightly over the age groups, the difference is not significant. It may indicate
that older and middle-aged people have a slightly better idea about inflation.
Next, we looked at the variation of each response. We used the sample dispersion as an estimate of the
variation present in the population distribution. We used this measure of variation to check the reliability of
investigators and found that, among the 180 investigators who surveyed 20 persons or more, 32 were
suspicious or unreliable and 10 were highly suspicious. As the consistency
percentage of an investigator did not depend on workload, and since many of the investigators were identified as
unreliable, we would recommend not using these investigators in future surveys.
When we divided the responses city-wise, although the difference is not large, we see that Guwahati and Jaipur
have markedly less variation and Patna has more variation than the other cities. This may be due to
different degrees of understanding of the concept of inflation in different cities, or due to different actual
values of inflation in different cities.