Choice of an appropriate statistical technique: a complex issue, and somewhat arbitrary
  Real-life data often contain mixtures of different types of data
  Two statisticians may select different methods depending upon what assumptions they are willing to take into account
  Extraneous factors:
    availability of software and its limitations
    availability of time and financial resources
General Principles of Data Analysis
Warnings
  Figures allow us to calculate them
  Applying different techniques and obtaining different results does not mean that something is wrong
  Looking for an answer to the same question by using several methods may lead to a better understanding
  Obtaining a negative result may be as informative as obtaining a positive one
  Obtaining no answer by using one technique does not mean that there is no answer at all
  Etc.
The choice of a statistical technique depends essentially upon:
  Characteristics of the analysis question
  Characteristics of the data
  Characteristics of the sampling design
Characteristics of the Analysis Question
  Whether or not there is a distinction between independent and dependent variables
  Whether the nature of the research problem requires:
    Description, exploration, estimation, or
    Testing of a hypothesis or model
  Whether the focus of research is on 'variables' or 'objects'
Characteristics of the Data
  Types of data sets
    Individuals - variables data sets
    Proximities data sets
      Variable - Variable Proximities
      Individual - Individual Proximities
  Types of Variables
    Continuous or Quantitative Variables
    Discrete or Qualitative Variables
  Variable types by measurement level
    Nominal-scale variables
    Ordinal-scale variables
    Interval-scale variables
    Ratio-scale variables
Techniques for problems without distinction between independent and dependent variables

No. of Variables   Measurement Level                  Analysis Method
One                Non-metric: nominal                Frequencies, Proportions
One                Non-metric: ordinal                Median, Mode
One                Non-metric: preferences            Rank Consensus among evaluators
One                Metric: interval or ratio scale    Mean, Median, Mode, Variance, Skewness, Kurtosis
Two                Non-metric: dichotomous            Cross-tabulation, Chi-square
Two                Non-metric: nominal                Cross-tabulation, Chi-square, Correspondence Analysis
Two                Non-metric: ordinal                Kendall's Tau, Spearman's Rho, Gamma
Two                Metric: interval-scale             Scatter plot, Pearson's Correlation Coefficient
More than two      Metric: interval-scale             Principal Components Analysis, Factor Analysis, Cluster Analysis, Multidimensional Scaling
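As a small illustration of the one-variable rows, the sketch below picks summary statistics according to the measurement level. The function name `describe` and the three level labels are assumptions made for this example; skewness and kurtosis are left out to keep it short.

```python
from collections import Counter
from statistics import mean, median, mode, variance

def describe(values, level):
    """Return the summaries that suit one variable at the given level.

    `level` is a hypothetical label: "nominal", "ordinal" or "metric".
    """
    if level == "nominal":
        counts = Counter(values)
        n = len(values)
        return {
            "frequencies": dict(counts),
            "proportions": {key: count / n for key, count in counts.items()},
        }
    if level == "ordinal":
        return {"median": median(values), "mode": mode(values)}
    if level == "metric":
        return {
            "mean": mean(values),
            "median": median(values),
            "mode": mode(values),
            "variance": variance(values),  # sample variance (n - 1 denominator)
        }
    raise ValueError(f"unknown measurement level: {level}")

print(describe(["yes", "no", "yes"], "nominal"))
print(describe([1, 2, 2, 3, 5], "ordinal"))
print(describe([2, 4, 4, 6], "metric"))
```

The point of the design is that the measurement level, not the data type in memory, decides which summaries are meaningful: a mean of nominal codes can be calculated but says nothing.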
Techniques for problems with distinction between independent and dependent variables

No. of Variables          Measurement Level
Dependent  Independent    Dependent               Independent              Analysis Method
One        One            Nominal                 Nominal                  Non-parametric tests, Chi-square
One        One            Nominal (dichotomous)   Nominal                  Multiple Classification Analysis
One        One            Nominal                 Nominal (dichotomous)    Wilcoxon's two sample test, Chi-square, Kolmogorov-Smirnov Test
One        One            Interval-scale          Nominal (dichotomous)    t-test, Analysis of Variance
One        One            Interval-scale          Interval-scale           Regression Analysis
One        One            Interval-scale          Nominal                  Analysis of Variance
One        More           Nominal                 Interval-scale           Discriminant Analysis
One        More           Interval-scale          Nominal                  Analysis of Variance, Multiple Regression Analysis, Multiple Classification Analysis
One        More           Interval-scale          Dummy                    Analysis of Variance, Multiple Regression Analysis, Multiple Classification Analysis
One        More           Interval-scale          Interval-scale           Multiple Regression Analysis
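The table can also be read as a lookup from the characteristics of the variables to candidate techniques. The sketch below encodes it that way; the key layout and the name `suggest_methods` are assumptions for illustration, and the entries simply repeat the rows of the table.

```python
ANOVA_MRA_MCA = [
    "Analysis of Variance",
    "Multiple Regression Analysis",
    "Multiple Classification Analysis",
]

# Key: (number of independent variables, dependent level, independent level).
# There is always one dependent variable in this table.
METHODS = {
    ("one", "nominal", "nominal"): ["Non-parametric tests", "Chi-square"],
    ("one", "nominal (dichotomous)", "nominal"): ["Multiple Classification Analysis"],
    ("one", "nominal", "nominal (dichotomous)"): [
        "Wilcoxon's two sample test", "Chi-square", "Kolmogorov-Smirnov Test",
    ],
    ("one", "interval", "nominal (dichotomous)"): ["t-test", "Analysis of Variance"],
    ("one", "interval", "interval"): ["Regression Analysis"],
    ("one", "interval", "nominal"): ["Analysis of Variance"],
    ("more", "nominal", "interval"): ["Discriminant Analysis"],
    ("more", "interval", "nominal"): ANOVA_MRA_MCA,
    ("more", "interval", "dummy"): ANOVA_MRA_MCA,
    ("more", "interval", "interval"): ["Multiple Regression Analysis"],
}

def suggest_methods(n_independent, dependent_level, independent_level):
    """Return the techniques listed for this combination, or [] if none is listed."""
    key = (n_independent.lower(), dependent_level.lower(), independent_level.lower())
    return list(METHODS.get(key, []))

print(suggest_methods("one", "interval", "interval"))
print(suggest_methods("more", "nominal", "interval"))
```

An empty result only means that the table has no row for the combination, not that no technique exists.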
Usual way of statistical problem solving
  Formulate the question using the terms and logic of the specific field of the problem (science management, pedagogy, economics, etc.)
  Reformulate the question using statistical terms and logic
  Find appropriate statistical model(s) and technique(s)
  Use the selected model(s) and technique(s)
  Give a statistical interpretation of the results obtained
  Reformulate the interpretation in the terms of the original field of application
Question in research management
Research groups have multiple outputs, comprising publications, patents, experimental materials, etc. What are the differences, if any, in the performance of the research groups of selected countries?
Statistical question
Can we construct a reasonable productivity index, using the following measures of scientific output?
  Articles in country
  Articles abroad
  Original research reports
  Patents
  Algorithms and designs
  Experimental material
Can we find a significant difference by country in the productivity index?
Scientific products by country
Statistical model and technique
  Partial order scoring for constructing the index of research output
  Analysis of variance for testing the hypothesis concerning the significance of the difference
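To give a feel for partial order scoring, here is a minimal sketch of one plausible scoring rule. It is an assumption for illustration and not necessarily the formula used by the IDAMS POSCOR program: a case dominates another if it is at least as high on every output and higher on at least one, and its score is the percentage of comparable cases that it dominates (50 when no case is comparable).

```python
def dominates(a, b):
    """True if case a is at least as high as b on every output and higher on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def partial_order_scores(cases):
    """Score each case from 0 to 100 by its position in the partial order.

    Illustrative rule (an assumption): 100 * dominated / (dominated + dominating),
    with 50.0 for a case that is not comparable with any other case.
    """
    scores = []
    for i, a in enumerate(cases):
        others = [b for j, b in enumerate(cases) if j != i]
        below = sum(dominates(a, b) for b in others)
        above = sum(dominates(b, a) for b in others)
        comparable = below + above
        scores.append(50.0 if comparable == 0 else 100.0 * below / comparable)
    return scores

# Hypothetical research units described by two output counts each.
units = [(1, 1), (2, 2), (3, 1), (0, 5)]
print(partial_order_scores(units))
```

The appeal of this kind of index is that it needs no weights: outputs such as articles and patents are never added together, only compared one by one.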
Use of the selected model and technique
$RUN POSCOR
$FILES
PRINT = POSCOR.LST
DICTIN = R2R3RU.DIC
DATAIN = R2RU.DAT
DICTOUT = POSCOR.DIC
DATAOUT = POSCOR.DAT
$SETUP
POSCOR SCORES OF RU OUTPUTS
BADDATA=MD1 -
IDVAR=V2 -
TRANSVARS=(V1)
POSCOR ORDER=DESR -
ANAME='RU OUTPUT' -
VARS=(V116,V118,V122,V126,V128,V130)

$RUN ONEWAY
$FILES
PRINT = ONEWAY1.LST
DICTIN = POSCOR.DIC
DATAIN = POSCOR.DAT
$SETUP
ANALYSIS OF VARIANCE OF RU OUTPUT
BADDATA=MD1 -
PRINT=CDICT
DEPVARS=(V8) CONVARS=(R1)
$RECODE
R1=RECODE V15 (40)=1, (360)=2, (410)=3, (638)=4, (844)=5, (868)=6
Use of the selected model and technique (results)

Code  Label  N    Weight-sum  %     Mean    S.D.(estim.)  Sum of X  %     Sum of X-square
1            334  334         22.9  37.731  35.794        1.26E+04  16.8  9.02E+05
2            239  239         16.4  45.213  35.778        1.08E+04  14.4  7.93E+05
3            200  200         13.7  77.585  27.336        1.55E+04  20.7  1.35E+06
4            225  225         15.4  52.547  35.43         1.18E+04  15.7  9.02E+05
5            233  233         16    36.7    33.266        8.55E+03  11.4  5.71E+05
6            229  229         15.7  69.074  36.255        1.58E+04  21.1  1.39E+06

Total sum of squares            2048467
For 6 groups, Eta               0.4018943
For 6 groups, Etasq             0.161519
For 6 groups, Eta(adj)          0.3982909
For 6 groups, Etasq(adj)        0.1586357
Between means sum of squares    330866.5
Within groups sum of squares    1717601
F(5,1454)                       56.018
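The F and Eta values follow from the reported sums of squares and group sizes. The short check below recomputes them; the adjusted Eta-square formula used here is the usual degrees-of-freedom correction, which is an assumption about how ONEWAY computes it, although it reproduces the printed value.

```python
# Sums of squares and group sizes as reported in the ONEWAY output.
group_sizes = [334, 239, 200, 225, 233, 229]
ss_between = 330866.5
ss_within = 1717601.0
ss_total = 2048467.0

n = sum(group_sizes)          # total number of research units
k = len(group_sizes)          # number of countries (groups)
df_between = k - 1
df_within = n - k

# F ratio: between-groups mean square over within-groups mean square.
f_value = (ss_between / df_between) / (ss_within / df_within)

# Eta-square: share of the total variation explained by the grouping.
eta_sq = ss_between / ss_total

# Adjusted Eta-square (assumed formula): corrects for degrees of freedom.
eta_sq_adj = 1.0 - (ss_within / df_within) / (ss_total / (n - 1))
eta_adj = eta_sq_adj ** 0.5

print(f"F({df_between},{df_within}) = {f_value:.3f}")
print(f"Etasq = {eta_sq:.6f}, Etasq(adj) = {eta_sq_adj:.7f}, Eta(adj) = {eta_adj:.7f}")
```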
Statistical interpretation
  The value F(5,1454) = 56.018 shows that there is a highly significant difference by country in the constructed performance index.
  We also see a medium-strength differentiation between the countries: Eta(adj) = 0.398.
  The Mean values show the level of each country.
Interpretation for research management
  There are two countries with a low, two with a medium and two with a high productivity index.
Source
  P.S. Nagpaul: Guide to Advanced Data Analysis using IDAMS Software
Question in psychology - pedagogy
Intellectual performance, motivation and creativity of school children can be measured by using several indicators. Some of them are produced by the children themselves (e.g. IQ tests); others are based on the evaluation given by their teachers (e.g. average grade). What are the perceivable dimensions, if any, behind these indicators?
Statistical question
In the set of the listed indicators, are there any groups within which statistical inter-correlation and between which statistical independence can be detected?
  C IQ
  C Creativity test
  C Creative attitude
  C Achievement motivation
  T Average grade
  T Creative behaviour
  T Motivated behaviour
  T Motivation index
  (C = indicator produced by the children, T = evaluation given by the teachers)
Performance, motivation and creativity of school children
Statistical model and technique
  Pearsonian correlation between the measured indicators
  Multidimensional scaling, cluster analysis
Use of the selected model and technique
Executing PEARSON, MDSCAL, CLUSFIND in IDAMS
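Outside IDAMS, the same idea can be sketched with NumPy: compute Pearson correlations between indicators, turn them into distances, and group the indicators. The data below are synthetic, and the average-linkage procedure is a simple stand-in chosen for illustration; it is not claimed to be the algorithm that CLUSFIND runs.

```python
import numpy as np

def correlation_distance(data):
    """Cases-by-indicators array -> indicator-by-indicator distances (1 - r)."""
    return 1.0 - np.corrcoef(data, rowvar=False)

def average_linkage(dist, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Synthetic example (assumption): indicators 0-3 share one latent factor
# ("children"), indicators 4-7 share another ("teachers").
rng = np.random.default_rng(1)
n_cases = 200
child_factor = rng.normal(size=n_cases)
teacher_factor = rng.normal(size=n_cases)
columns = [child_factor + 0.5 * rng.normal(size=n_cases) for _ in range(4)]
columns += [teacher_factor + 0.5 * rng.normal(size=n_cases) for _ in range(4)]
data = np.column_stack(columns)

groups = average_linkage(correlation_distance(data), n_clusters=2)
print(groups)
```

Using 1 - r as the distance means that highly inter-correlated indicators end up close together, which is exactly the grouping the statistical question asks about.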
MDSCAL result
[Figure: MDSCAL configuration of the indicators; labels on the plot: Teachers, Children]
Use of the selected model and technique
CLUSFIND result
[Figure: CLUSFIND clustering of the indicators. Items in plotted order: C IQ, C Creativity test, C Creative attitude, C Achievement motivation, T Average grade, T Creative behaviour, T Motivated behaviour, T Motivation index. Values shown on the plot: 0.75, 0.71, 0.40, 0.45, 0.27, 0.13, 0.02]
Statistical interpretation
  Multidimensional scaling shows a clear separation of the indicators produced by children and by teachers
  Cluster analysis supports the finding of the separation of variables coming from teachers and children
Pedagogical/psychological interpretation
  Just one aspect: ratings given by teachers to children are nearly the same, independently of the evaluated ability, attitude or behaviour dimension
Source
  M. Hunya: Multidimensional statistical techniques in pedagogical studies
Data
  A. Deak, B. Kozeki: Study into the effect of motivation and creativity factors on the performance of school children
Question in hydrology
  We have water level data on four rivers in North Africa (more than 40 years). Can the water flow level be predicted on the basis of data from the past? If so, with what precision?
  What if the average flow level is considered instead of the individual values?
Statistical question
  Can the river flow values be predicted by using a set of values from the preceding period?
  How does the prediction change if the 6-month average flow is used?
Prediction of river flow values
Statistical model and technique
  Autoregression model (with a lag of 12 to 36) applied to the river flow time series
  Transformation of the original data into a time series of moving averages (interval length = 6)
Use of the selected model and technique
Time Series Analysis option from the IDAMS interactive facilities
Lag          Original series   Moving average series
12 months    R**2 = 0.32       R**2 = 0.92
24 months    R**2 = 0.35       R**2 = 0.93
36 months    R**2 = 0.36
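The two steps, autoregression on the raw series and on its 6-month moving average, can be sketched with NumPy as follows. The series is synthetic (a seasonal cycle plus noise, an assumption for illustration), and the R**2 computed here is the plain in-sample value, not the unbiased R**2 reported by IDAMS.

```python
import numpy as np

def moving_average(x, window=6):
    """Moving average; the result has len(x) - window + 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def fit_autoregression(x, lag):
    """Least-squares fit of x[t] on its `lag` preceding values.

    Returns (coefficients including intercept, in-sample R**2).
    """
    x = np.asarray(x, dtype=float)
    # Row t holds the `lag` values that precede the target x[t + lag].
    rows = np.array([x[t:t + lag] for t in range(len(x) - lag)])
    design = np.column_stack([np.ones(len(rows)), rows])
    target = x[lag:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    r2 = 1.0 - residual.var() / target.var()
    return coef, r2

# Synthetic monthly flow: 40 years of a 12-month cycle plus irregular noise.
rng = np.random.default_rng(0)
months = np.arange(480)
flow = 50 + 20 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 15, size=months.size)

_, r2_raw = fit_autoregression(flow, lag=12)
_, r2_smooth = fit_autoregression(moving_average(flow, 6), lag=12)
print(f"original series: R**2 = {r2_raw:.2f}")
print(f"moving averages: R**2 = {r2_smooth:.2f}")
```

Averaging removes most of the irregular month-to-month variation, so the smoothed series is far easier to predict from its own past, which is the pattern seen in the table.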
Use of the selected model and technique
[Figure: Original series]
[Figure: Moving average series]
Statistical interpretation
  Autoregression shows that individual values can be predicted with moderate precision (unbiased R**2 = 0.32 - 0.36 for 12 to 36 months); high peak values are very poorly reproduced.
  In the case of a 6-month moving average, the prediction is nearly perfect (unbiased R**2 = 0.92 for 12 months).
Hydrological interpretation
  Although the pattern of changes can be reproduced fairly well, even three years of past data are not nearly enough to predict the height of peak flows.
  But if we consider 6-month averages, they can be predicted with almost full precision.
Data
  UNESCO, Water Science Division
Question concerning company management
  What are the factors that influence the economic performance of a company? Economic performance is measured by the return on capital employed.
Statistical question
  Can the return on capital be predicted by using a set of economic and production indicators from those characterizing the company?
  How does the prediction change if we are looking for a subset of best predictors?
Statistical model and technique
  Multiple linear regression
  Stepwise regression
Business
Use of the selected model and technique
Running REGRESSN
Results
  The full regression model explains 70% of the adjusted variance of the dependent variable. Its standard error is about one half of the mean. The value of the determinant of the correlation matrix is .79478E-05. There are 8 variables (out of 12) with high covariance ratio values.
  The stepwise regression model selects 3 variables, explaining 80% of the variance. No multicollinearity (0.77647). The standard error of the estimate of the dependent variable is 0.06135, which is quite low: high reliability of estimation.
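The contrast between a full model with collinear predictors and a small selected subset can be sketched with NumPy. The data are synthetic, and the selection rule (greedy forward selection of a fixed number of predictors by gain in R**2) is a simplification assumed for illustration; it is not the exact stepwise procedure of REGRESSN.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R**2 of an ordinary least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def forward_select(X, y, n_keep):
    """Repeatedly add the predictor that raises R**2 the most."""
    chosen = []
    while len(chosen) < n_keep:
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(candidates, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen

def correlation_determinant(X):
    """Determinant of the predictor correlation matrix (close to 0 = multicollinear)."""
    return float(np.linalg.det(np.corrcoef(X, rowvar=False)))

# Synthetic indicators (assumption): columns 3-5 are near-copies of columns 0-2.
rng = np.random.default_rng(42)
n = 300
base = rng.normal(size=(n, 3))
copies = base + 0.05 * rng.normal(size=(n, 3))
X = np.column_stack([base, copies])
y = 2.0 * base[:, 0] - 1.0 * base[:, 1] + 0.5 * base[:, 2] + 0.5 * rng.normal(size=n)

chosen = forward_select(X, y, n_keep=3)
det_full = correlation_determinant(X)
det_subset = correlation_determinant(X[:, chosen])
print("selected predictors:", chosen)
print(f"determinant, full set: {det_full:.2e}; selected subset: {det_subset:.3f}")
```

A determinant near zero signals that some predictors are almost linear combinations of others, which makes the individual coefficients unstable even when the overall fit looks good.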
Statistical interpretation
  Full regression model: the reliability of prediction is poor. Strong multicollinearity is shown. The variables that contribute to multicollinearity can be identified.
  Stepwise regression model: 3 variables explain 80% of the variance. No multicollinearity. High reliability of estimation.
Interpretation for management
  Although the full indicator set can give a good prediction, it cannot be recommended for real use because of its poor prediction reliability.
  But if we consider 3 carefully selected indicators, we can get a fair prediction.
Source
  P.S. Nagpaul, India
Question concerning measurement of knowledge level
Tests are very often used in education to check the level of knowledge in a subject. Long tests with many questions can meet the reliability requirement relatively easily. The question is whether we can make a short, interactive, adaptive test from a long test while preserving, at least approximately, the original reliability.
Statistical question
Can we give a good estimate of the original test value by using a tree structure based prediction?
Statistical model and technique
  Regression tree
Education
Use of the selected model and technique
Running SEARCH
Results
Starting from a standardized test (for checking a specific verbal aptitude) containing 20 questions, a regression tree with 3-4 questions was obtained. The regression tree contains 10 final subgroups (leaves) with estimates for the original test value ranging from 6.4 to 59.2. The explained variance is 90.4%.
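The building block of such a tree is a single split: choose the question whose right/wrong answer separates the total scores best. The sketch below shows that one step on made-up data; the function names and the data are assumptions for illustration, and the IDAMS SEARCH program applies further rules (such as stopping criteria) that are not reproduced here.

```python
from statistics import fmean

def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(answers, scores):
    """Find the question that explains most of the variance of the total score.

    answers: one tuple of 0/1 item results per pupil
    scores:  the pupils' total test scores (must not all be equal)
    Returns (question index, fraction of variance explained by that split).
    """
    total = sum_of_squares(scores)
    best_question, best_explained = None, 0.0
    for q in range(len(answers[0])):
        right = [s for a, s in zip(answers, scores) if a[q] == 1]
        wrong = [s for a, s in zip(answers, scores) if a[q] == 0]
        if not right or not wrong:
            continue  # this question does not split the group
        explained = 1.0 - (sum_of_squares(right) + sum_of_squares(wrong)) / total
        if explained > best_explained:
            best_question, best_explained = q, explained
    return best_question, best_explained

# Hypothetical pupils: three questions, total score dominated by question 0.
answers = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
scores = [13, 10, 4, 1, 14, 0]
question, explained = best_split(answers, scores)
print(f"ask question {question} first; it explains {explained:.1%} of the variance")
```

Applying the same step again inside each subgroup grows the tree, and each path from the root to a leaf becomes one short adaptive sequence of questions.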
Statistical interpretation
  A very good estimate can be given for the original test value by using the obtained regression tree.
Interpretation for test designers
  Using the tree structure, a computer-assisted test can be constructed which is much shorter, without losing the power of the original test.
Source
  M. Hunya: Finding optimal interactive test structures (1982)