Training Module B
Module B Glossary
ANOVA:
ANOVA stands for analysis of variance. It is
a collection of statistical models, and their
associated procedures, in which the
observed variance is partitioned into
components due to different explanatory
variables. In its simplest form ANOVA
provides a statistical test of whether or not
the means of several groups are likely to be
equal.
Chi-square tests:
A statistical hypothesis test in which the
sampling distribution of the test statistic is a
chi-square distribution when the null
hypothesis is true, or any in which this is
asymptotically true, meaning that the
sampling distribution (if the null hypothesis
is true) can be made to approximate a chi-
square distribution as closely as desired by
making the sample size large enough.
Cleaning database:
A process to increase the accuracy of the
data and streamline the database by
removing or correcting duplicate and
erroneous records.
Codebook:
A document used for implementing a code.
It reports dictionary information such as
variable names, variable labels, value labels,
and missing values.
Coefficient of variation (CV):
A normalized measure of dispersion of a
probability distribution, defined as the ratio
of the standard deviation to the mean.
Cohort Survival Rate (CSR):
The percentage of enrollees in the
beginning grade of a given school
year who reach the final grade.
Correlation:
A single number that describes the degree
of relationship between two variables.
Correlations are useful because they can
indicate a predictive relationship, or
possible causal or mechanistic relationships.
Coverage:
The extent or degree to which the entire
study area is observed, analyzed, and
reported by the survey.
Cross tabulation (Crosstab):
This displays the joint distribution of two or
more variables. They are usually presented
as a contingency table in a matrix format.
Whereas a frequency distribution provides
the distribution of one variable, a
contingency table describes the distribution
of two or more variables simultaneously.
Data collection:
A process of preparing and collecting data
to keep on record, to make decisions about
important issues, and to pass information
on to others.
Data preparation:
A process of checking, cleaning and
transforming collected data so that it is
ready for analysis.
Data Validation:
A process of ensuring that a program
operates on clean, correct and useful data.
Descriptive statistics:
To describe the basic features of the data in
a study. They provide simple summaries
about the sample and the measures.
Together with simple graphics analysis, they
form the basis of virtually every
quantitative analysis of data.
Disaggregation:
A process of breaking down and analyzing
an indicator by detailed sub-categories. It
also helps in understanding the degree of
accuracy of the survey and its limitations.
Educational attainment:
A term commonly used by statisticians to
refer to the highest degree of education an
individual has completed.
Estimation:
Any of numerous procedures used to
calculate the value of some property of a
population from observations of a sample
drawn from the population.
Factor analysis:
A statistical method used to describe
variability among observed variables in
terms of a potentially lower number of
unobserved variables called factors.
Frequency:
The number of occurrences of a repeating
event per unit time. In statistics, it commonly
refers to the number of times a value occurs
in a data set; frequency tables and graphical
displays are useful for describing different
types of variables.
Gender Parity Index (GPI):
A socioeconomic index usually designed to
measure the relative access to education of
males and females. In its simplest form, it is
calculated as the quotient of the number of
females by the number of males enrolled in
a given stage of education.
Household:
A basic residential unit in which economic
production, consumption, inheritance, child
rearing, and shelter are organized and
carried out. Household is broader than
family, which is a group of people related by
blood or marriage such as parents and their
children only.
Household survey:
A process of data collection and analysis for
understanding general situation and
exploring specific characteristics of
households or household population.
Imputation:
To substitute some value for a missing data
point or a missing component of a data
point.
Kurtosis:
A measure of the "peakedness" of the
probability distribution of a real-valued
random variable. Higher kurtosis means
more of the variance is the result of
infrequent extreme deviations, as opposed
to frequent modestly sized deviations.
Linear regression:
An approach to modeling the relationship
between a variable denoted y and one or
more explanatory variables denoted X, such
that the model depends linearly on the
unknown parameters to be estimated from
the data.
Mean:
The expected value of a random variable.
For a data set, the mean is the sum of the
observations divided by the number of
observations.
Missing value:
This occurs when no data value is stored for
the variable in the current observation.
Missing values are a common occurrence,
and statistical methods have been
developed to deal with this problem.
Nonparametric test:
A statistic (a function on a sample) whose
interpretation does not depend on the
population fitting any parameterized
distributions. Statistics based on the ranks
of observations are one example of such
statistics and these play a central role in
many non-parametric approaches.
OLAP cube:
A multidimensional database that calculates
summary statistics for summary variables
within categories of one or more grouping
variables. The cube allows different views of
the data to be quickly displayed.
Outlier identification:
To identify an observation that is
numerically distant from the rest of the
data.
Pivot table:
A data summarization tool to create output
table formats. Pivot-table tools can
automatically sort, count, and total the data
stored in one table or spreadsheet and
create a second table.
Population census:
A procedure of systematically acquiring and
recording information about the members
of a given population. It includes
information on household members, which
are useful for policy making, planning,
monitoring and evaluation.
Sample design:
To determine what kind of people and how
many people you need to interview to
collect data. A decision about sample size
can be made, based on factors such as: time
available, budget and necessary degree of
precision.
Sampling:
A part of statistical practice concerned with
the selection of an unbiased or random
subset of individual observations within a
population of individuals intended to yield
some knowledge about the population of
concern, especially for the purposes of
making predictions based on statistical
inference. A design of any information-
gathering exercises where variation is
present.
Skewness:
A measure of the asymmetry of the
probability distribution of a real-valued
random variable.
Standard deviation:
A statistic that tells how tightly all the
various examples are clustered around the
mean in a set of data. In other words, it is
a measure of variability.
Structured Query Language (SQL):
A standard programming language used for
accessing and maintaining a database. The
key feature of the SQL is an interactive
approach for getting information from and
updating a database.
Syntax:
A set of rules that define the combinations
of symbols that are considered to be
correctly structured programs in the
programming language.
T-test:
A statistical hypothesis test in which the
test statistic follows a Student's
t-distribution under the null hypothesis. It
is commonly used to assess whether the
means of two groups are significantly
different from each other.
Validation rule:
A criterion used in the process of data
validation, carried out after the data has
been encoded onto an input medium and
involves a data vet or validation program.
Variable:
A symbol that stands for a value that may
vary. For instance, a variable can be used to
designate a value occurring in a hypothesis
of the discussion.
Visual Binning:
To perform automatic creation of new
variables based on grouping contiguous
values of existing variables into a limited
number of distinct categories. This can
create categorical variables from continuous
scale variables.
Wealth index:
A composite measure of the cumulative
living standard of a household, usually
calculated from data on a household's
ownership of assets and dwelling
characteristics.
Weighting:
A process, which involves emphasizing
some aspects of a phenomenon, or of a set
of data.
Module B1:
Exploring Household Surveys for EFA Monitoring
Contents
1. Understanding Household Surveys
   1.1 Introduction to Household Surveys
   1.2 Education Related Questions (or Modules) in Household Surveys
   1.3 Inputs from Household Surveys for Aligning Education Policies
2. Brief Information on Common Household Surveys
   2.1 Background and Objectives of Selected Surveys
   2.2 Structure and Contents of the “Survey Questionnaire”
   2.3 Consideration on Sample Design
   2.4 Understanding Survey Data Files and Availability of Education Related Data
3. Gathering Survey Data and Getting Ready for Analysis
   3.1 Data Sources and Contact Points for Obtaining Census and Survey Data
   3.2 Common Obstacles and Approaches in Gathering Population Census and Household Survey Data
   3.3 Quality Issues, Challenges and Recommendations in Using Survey Data
   3.4 Use of Survey Data along with EMIS Data/Indicators for Policy Analysis
4. Exercises and Further Studies
   4.1 Self-evaluation
   4.2 Exercises
   4.3 Further Studies
5. Annexes
Annex 1: Population and Housing Census
Annex 2: Education Related Questionnaires from Selected Household Survey
Annex 3: Education Related Variables in the Selected Datasets
Annex 4: List of Key EFA Indicators
Purposes and learning outcomes
To gain better understanding of common household surveys
To understand the reasons for the limited use of household survey data in education planning and EFA monitoring
To explore the value added and benefits of data from household surveys for education policies
To recognize the questions in common household surveys which are directly or indirectly useful in exploring access, quality and management of education, and their determinants
To know the key points to be aware of when analyzing data from household surveys
1. UNDERSTANDING HOUSEHOLD SURVEYS
1.1 Introduction to Household Surveys
“Household” is defined to be a basic residential unit in which economic production, consumption,
inheritance, child rearing, and shelter are organized and carried out. Household is broader than
family, since family refers only to a group of people related by blood or marriage such as parents
and their children only.
“Household survey” is a process of data collection and analysis for understanding the general situation
and exploring specific characteristics of households or the household population. The fieldwork of a
household survey investigates and records the facts, observations and experience of sample
households, which represent all households in the study area. Tools for data collection include a
series of questions, observation checklists and records for discussions.
Nowadays, household surveys are conducted in almost every country and territory, either ad hoc or
periodically (annually, biennially, every three years, every five years, etc.). There are
different types of surveys (Ref. Section 2).
Most education indicators, especially school-based ones, can be derived from the annual school census or EMIS data collection system. However, EFA monitoring requires more indicators to measure "reaching the unreached" which generally cannot be provided by school data. Some essential EFA indicators which are based on ethnic minority, disabled or illiterate population and out-of-school children can be derived only from the household surveys.
1.2 Education Related Questions (or Modules) in Household Surveys
Two main components of household survey
Household surveys generally use two different questionnaires: a household roster and at least one
detailed or individual questionnaire.
Household roster: this includes a listing of all household members and their characteristics, such as
age, sex and relationship to the head of household for every member; education and literacy status for
persons aged 5 and above; schooling status for those aged 5-24 (or 6-14, 6-19, etc.); and marital
status for all adults aged 15 and above.
Detailed or individual questionnaire: this explores the main theme of the study and is sometimes
aimed only at specific respondents, such as the head of household, married couples, mothers of children
under 5, ever-married women, out-of-school children, disadvantaged children, etc.
The fieldwork (data collection) of a household survey is followed by coding, checking and editing,
data entry, data verification, data analysis and drafting of the report. The majority of household surveys
use SPSS (renamed PASW) for data analysis and also for creating tables, graphs and charts.
As such, although a survey may enter data using different programs such as dBase, MS Access,
MS Excel, CSPro, IMPS, …, the final data files analyzed are available in SPSS data format.
Household survey and population census
The datasets created from household surveys and population censuses1
normally include
information on household members, which are useful for policy making, planning, monitoring and
evaluation in education, such as:
(i) population by age and sex (and urban/rural residence in larger surveys), and with special
characteristics such as ethnic minority, disability, …;
(ii) literacy status of respondents (self-reporting) and other family members (proxy reporting);
(iii) highest educational attainment of the respondent, and population under study; and
(iv) schooling status (currently attending, dropout or never attended) of children at the
school-going ages.
Apart from the above-mentioned information, several household surveys can provide the migration
status of household members, and socio-economic characteristics of the household, such as:
(v) birth place and/or place of residence five or ten years ago;
(vi) number of income earners in the household;
(vii) household income and expenditure (in some cases, separate health and education
expenditures);
(viii) possession of household amenities or durables; and
(ix) food security; and so on.
As such, data from household survey and population census can complement the school-based data2
by providing information on aspects of children's background that may influence household
schooling decisions and school participation of children (such as enrollment and/or school
attendance).
Household surveys provide a broader variety of information, while the population census provides more
accurate data on the age and sex structure, and the education and literacy attainment, of the entire population.
1 Population census is a type of household survey with broader coverage. By international agreement, a census consists
of an enumeration of the entire population in the specified area regularly at a marked time interval.
2 Ministry of Education, through EMIS (Education Management Information System), regularly collects school-based data
and normally processes and provides limited information on the individual characteristics of pupils, such as age, sex,
grade and performance (flow rates), and little information on the characteristics of their households.
1.3 Inputs from Household Surveys for Aligning Education Policies
Household surveys and population censuses could also provide data on adult educational attainment
and reported literacy skills (that is, as reported by the respondent) by household characteristics, such as
whether the household is rich or poor, resides in an urban, rural or remote area, is far from or near to a school, etc.
Key education indicators possible to derive from surveys
The following common education indicators, which are essential in formulating and aligning
education policies, and in preparing, monitoring and evaluating education development programmes
and projects, could be derived from common household surveys and population censuses.3
1) Adult Literacy Rate (for population aged 15 and above);
2) Youth Literacy Rate (for population aged 15-24);
3) Illiteracy rates for different population groups, especially for the vulnerable groups such as
females, ethnic minorities, disabled persons, and those from poor families and remote areas;
4) Educational attainment, measured by the number of years attended school or highest level
of schooling, or the proportion of the adult population who complete primary or secondary school
(adult primary and secondary school completion rates);
5) Gross and net intake rates for primary Grade 1;
6) Gross and net enrolment rates by education level or by age;
7) Transition rates (from primary to lower secondary, and lower to upper secondary level);
8) Student flow rates (promotion, repetition and dropout rates); and
9) Out of School Children.
Moreover, some other measures such as gender parity index, cohort survival rate and measure of
internal efficiency could be derived from the above indicators.
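As an illustration, the gender parity index mentioned above can be computed directly from enrolment counts tabulated from a survey. The sketch below uses invented figures (not from any actual survey):

```python
# Hypothetical enrolment counts by sex, as might be tabulated from a
# household survey roster (illustrative numbers, not real data).
enrolled = {"female": 4250, "male": 4730}

def gender_parity_index(female, male):
    """GPI = female enrolment / male enrolment; 1.0 indicates parity."""
    return female / male

gpi = gender_parity_index(enrolled["female"], enrolled["male"])
print(round(gpi, 3))
```

A GPI below 1.0 indicates a disparity in favour of males; above 1.0, in favour of females.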
One important benefit of constructing education indicators from household surveys is the
“ability to compare the indicators among different population groups”, such as:
a. males versus females;
b. ethnic minorities vs. other ethnic groups;
c. disabled persons vs. the general population;
d. those living in remote areas vs. urban/rural areas;
e. families with different wealth levels (measured by quintiles of
household expenditure per capita or ownership of household amenities).
Such information cannot be obtained from regular school-based data collection, and it is
important in measuring the achievement of education policies and in aligning education policies
for the future.
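Grouping families by wealth level (point e above) is often done by ranking households on expenditure per capita and splitting them into quintiles. A minimal sketch, using illustrative figures only:

```python
# Hypothetical household expenditure per capita (illustrative values).
expenditure_pc = [120, 45, 300, 80, 60, 150, 90, 210, 30, 75]

def quintile_ranks(values):
    """Return the quintile (1 = poorest .. 5 = richest) of each value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    ranks = [0] * n
    for pos, i in enumerate(order):
        # each fifth of the ranked households forms one quintile
        ranks[i] = min(pos * 5 // n + 1, 5)
    return ranks

print(quintile_ranks(expenditure_pc))
```

Indicators such as illiteracy or school attendance can then be cross-tabulated against these quintiles to compare poor and rich households.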
Utilization of household survey data in education
All this information is very valuable for education policy makers and planners; however, it is
not fully utilized, for several reasons:
- lack of awareness of the existence and accessibility of survey data, even within the same ministry, due
to bureaucratic procedures, cost, and not knowing where to find or how to request such data;
- little information on education and literacy is presented in the main report – only a few
paragraphs or just a section on education in the general household survey reports;
3 See “Guide to the Analysis and Use of Household Survey and Census Education Data (UIS, 2004, pp 13-21)” for detailed
framework for analysis and further discussion.
- additional analyses of education and literacy status are very rare; and
- lack of knowledge and skills on how to capitalize on education and literacy data from surveys,
particularly to facilitate evidence-based policy formulation, implementation and
monitoring.
As a result, only a handful of researchers and consultants from international agencies use
the education and literacy data from surveys to undertake a few additional studies. However,
most such studies are academically oriented or aimed at serving specific project purposes set by
the international organization. They seldom provide the information needed for policy
recommendations.
It is crucial to build the capacity of staff from the Ministry of Education and line ministries
to analyze survey data, so as to reflect and incorporate the findings from surveys into
policy formulation, program implementation, monitoring and evaluation, including those for
achieving the EFA goals4.
4 See Annex 4 for the list of key EFA indicators.
2. BRIEF INFORMATION ON COMMON HOUSEHOLD SURVEYS
2.1 Background and Objectives of Selected Surveys
Multiple Indicator Cluster Survey (MICS)
The Multiple Indicator Cluster Survey is a household survey developed by UNICEF to assist
countries in filling data gaps for monitoring the situation of children and women. It is capable of
producing statistically sound, internationally comparable estimates of these indicators. MICS was
originally developed in response to the World Summit for Children to measure progress towards an
internationally agreed set of mid-decade goals. The first round of MICS was conducted around
1995 in more than 60 countries, and the second round was conducted in 2000 (around 65 surveys).
The third round of MICS was carried out from 2005 onwards (more than 50 countries). It focused
on providing a monitoring tool for the World Fit for Children, the Millennium Development Goals
(MDGs), as well as for other major international commitments, such as the United Nations General
Assembly Special Session (UNGASS) on HIV/AIDS and the Abuja targets for malaria. At least 21
MDG indicators can be collected in the current round of MICS, offering the largest single source of
data for MDG monitoring.
Results from the surveys, including national reports, standard sets of tabulations and micro level
datasets are available at UNICEF's web site www.childinfo.org.
Demographic and Health Survey (MEASURE DHS)
Since 1984, the Demographic and Health Survey (DHS) Project has provided technical assistance
for more than 200 demographic and health surveys in 75 countries, advancing global understanding
of health and population trends in developing countries. In 1997, DHS became one of four
components of the “Monitoring and Evaluation to Assess and Use Results” (MEASURE)
Program5.
The MEASURE DHS Project has gained a worldwide reputation for collecting and disseminating
accurate, nationally representative data on health and population in developing countries. The
project is implemented by Macro International, Inc. and is funded by the United States Agency for
International Development (USAID) with contributions from other donors such as UNICEF,
UNFPA, WHO, UNAIDS.
Since October 2003 Macro International has been partnering with four internationally experienced
organizations to expand access to and use of the DHS data: The Johns Hopkins Bloomberg School
of Public Health/Center for Communication Programs; Program for Appropriate Technology in
Health (PATH); Blue Raster; The Futures Institute.
5 MEASURE Program - Together, the four MEASURE partners (MEASURE DHS, MEASURE Evaluation,
MEASURE U.S. Census Bureau- Survey and Census Information, Leadership, and Self Sufficiency (SCILS), and
MEASURE Centers for Disease Control and Prevention - Division of Reproductive Health (CDC/DRH)) provide a
full range of related services, which include promoting the demand for quality data; providing technical assistance,
training, systems development, data collection and analysis, and capacity-building services; and disseminating
information and facilitating its use in decision-making. (See http://www.measureprogram.org/)
Every year, different types of household surveys are conducted for different purposes in almost every country. The three most common household surveys in this region, namely the Multiple Indicator Cluster Survey (MICS), Demographic and Health Survey (MEASURE DHS), and Living Standards Measurement Study (LSMS), together with the population census, are discussed in this section.
The DHS surveys collect information on fertility, reproductive health, maternal health, child health,
immunization and survival, HIV/AIDS, maternal mortality, child mortality, malaria, and nutrition
(including stunting) among women and children. The strategic objective of MEASURE DHS is to improve and
institutionalize the collection and use of data by host countries for program monitoring and
evaluation and for policy development decisions.
LSMS – Living Standard Measurement Survey
LSMS was established by the Development Economics Research Group (DECRG) of the World
Bank to explore ways of improving the type and quality of household data collected by statistical
offices in developing countries. LSMS is a research project that was initiated in 1980 and has been
carried out in several rounds in more than 30 countries. The program is designed to assist policy makers in
their efforts to identify how policies could be designed and improved to positively affect outcomes
in health, education, economic activities, housing and utilities, etc.
Objectives of LSMS include:
- to improve the quality of household survey data;
- to increase the capacity of statistical institutes to perform household surveys;
- to improve the ability of statistical institutes to analyze household survey data for policy
needs; and
- to provide policy makers with data that can be used to understand the determinants of
observed social and economic outcomes.
LSMS provides users with actual household survey data for analysis, as well as links to reports
and research done using LSMS data.
Population Census
The oldest type of household survey, with the broadest coverage, is the “population census”. By
international agreement, a census consists of an enumeration of the entire population in a specified area,
regularly, at a marked time interval. Questions may be asked concerning certain characteristics of
each person, such as age, sex, marital status, education, employment status, and more while
enumerating population. Therefore, census basically provides the data on number and composition
of the entire population at a given time, and selected socio-economic and educational
characteristics of household population in the country.
Since it is based on the complete enumeration of all households in the country, a census can
provide valuable information for policies and the planning of socio-economic development from
the national to the lowest administrative levels. Moreover, census is the source for constructing
sampling frames for selecting households and population for other surveys.
Population censuses are carried out once every 10 years in most countries, or once every
5 years in some economically advanced countries. As such, the census is the most comprehensive
source of demographic and socio-economic data for many countries.
Although the main objective of a census is to obtain reliable population data, the latest United Nations
guidelines6 for preparing population censuses emphasize collecting data on literacy, school
attendance, educational attainment, field of study and educational qualifications.
6 “Principles and Recommendations for Population and Housing Censuses”, United Nations Statistical Office, 1998.
2.2 Structure and Contents of the “Survey Questionnaire”
2.2.1 Questionnaire Used in Multiple Indicators Cluster Survey (MICS)
MICS uses three main questionnaires in every survey:
(i) household questionnaire,
(ii) questionnaire for women aged 15-49, and
(iii) questionnaire for children under the age of 5.
The Household Questionnaire comprises household characteristics, household listing, education,
child labor, water and sanitation, salt iodization, insecticide-treated mosquito nets (ITNs), and
support to children orphaned and made vulnerable by HIV/AIDS, with optional modules for
disability, child discipline, security of tenure and durability of housing, source and cost of supplies
for ITNs, and maternal mortality.
A. Household Identification
B. Household Listing Form
C. Education Module
2.2.2 Questionnaire used in MEASURE DHS
Although DHS surveys aim to collect data to understand fertility; reproductive, maternal and child
health; immunization, survival and nutrition; maternal and child mortality; HIV/AIDS; and malaria,
the key household questionnaire covers several questions on education and its differentials.
The following are extracts from the DHS Model Household Questionnaire.
A. Household Identification
B. Listing of all Household Members - 1
C. Listing of all Household Members - 2
2.2.3 Questionnaire used in Living Standards Measurement Survey (LSMS)
LSMS is a comprehensive survey. Its questionnaire set contains (i) household, (ii) community
and (iii) price questionnaires. The household questionnaire extends over 100 pages, covering 15
sections including education.
The education section of the LSMS questionnaires has three sections on four pages as follows:
Ref: LSMS Working Paper 130 "Model Living Standards Measurement Study Survey Questionnaire
for the Countries of the Former Soviet Union" by Raylynn Oliver.
2.2.4 Population and Housing Censuses
As mentioned above, a census covers each and every person in the country, and is the most reliable
source of population data. Household roster used in censuses contains basic information on all
household members such as age, sex, marital status, education and literacy status together with
household characteristics such as location and type of residence, and availability of services.
The Viet Nam 2009 Population and Housing Census questionnaire includes the following questions on the
education and literacy status of the entire population. Combined with the age, sex, residence, migration
and disability status recorded in other questions, literacy, educational attainment, and participation
in and access to education can be analyzed for different population groups.
For further case study, please refer to Annex 1.
2.3 Consideration on Sample Design
A census is based on all households in the study area (a region, a territory or a country). Therefore,
the entire household population is included in data collection. During census taking process, there
might be some non-response households, but comparatively very few and generally negligible.
Since it is complete enumeration, census does not require a sample design and the data and
indicators derived from the census are the actual values, not the estimates.
On the other hand, a household survey collects data from the selected households in the area, and
provides the estimates (of the characteristics or indicators) for entire household population in the
area based on the experience of the sample households. That is, not all the households in the study
area are selected in a survey. The quality (accuracy of the estimates) and the usefulness of a
household survey depend on the following points.
i) Sampling method (how the sample households are selected);
Common sampling methods include SRS (Simple Random Sampling), PPS (Probability
Proportional to Size), cluster sampling, multi-stage sampling, and purposive sampling.
ii) Coverage (whether the entire study area is covered by the survey);
To represent the entire area, sample households must be selected from all households in the
area (country or region) using a random sampling method. Some household surveys select
from the households with specific characteristics (e.g., poultry farmers) or from pre-
assigned parts of the areas only (e.g. households beyond 3 mile radius from a school).
iii) Sample size (how many households are selected) and allocation of samples (how the
sample households were allocated to different parts of the area); and
iv) Data analysis - how to get estimates (values) of the key indicators, perceived standard
errors of estimates, and pre-determined level of disaggregation (e.g. by age, sex, grade,
region, socio-economic status, etc.).
Sample design of the household survey includes the above mentioned information and it is
generally part of the survey report.
For data users (secondary analysts), it is important to know the sampling method and sample
size of the study before making any analysis. The accuracy will be lower if the estimates are not
calculated in line with the sampling method of the survey. Similarly, the survey method and how
the sample households were allocated are essential in deciding whether and which weights should
be applied in data analysis. Moreover, the actual coverage of the survey, sample size and set level
of disaggregation will help data users to understand the limitations of the survey, including whether
the desired disaggregation is appropriate at the required degree of accuracy.
The data analyst should first check the sample design in the accompanying documents, such as the
survey report or service contract, and/or with contact persons at the survey organization.
Example:
In a survey designed to give reliable estimates up to the provincial level by sex, if
estimates of the adult illiteracy rate were computed for adults living in remote areas
with the lowest socio-economic status (lowest quintile), by district and by sex, the derived estimates
would not be reliable. On the other hand, some surveys are designed to capture specific and rare
events. In such a survey, the sample size is large and thus sufficient to estimate common education
indicators at lower levels with acceptable accuracy.
2.4 Understanding Survey Data Files and Availability of Education Related Data
This section highlights the education related variables in the main datasets of three common
household surveys and sample outputs on selected variables.
Education Related Variables in MICS Sample Dataset
In the MICS sample dataset, four SPSS data files are generated: (i) household, (ii) individual
household members, (iii) women aged 15-49, and (iv) children under 5. MICS datasets are shared
with a wide range of users. The second data file, for all individual household members (the
household listing – hl.sav), contains the education and literacy status of the population, including
school-age children. The sample "hl.sav" data file contains 183 variables for 29,560 cases
(persons); the following 21 variables are useful for analyzing education and literacy.
HH1 Cluster number
HH2 Household number
HL1 Line number
HL3 Relationship to the head
HL4 Sex
HL5 Age
HL6 Area (urban / rural)
ED2 Ever attended school
ED3A Highest level of sch. attended
ED3B Highest grade at level
ED4 Currently attending school (2004-05)
ED5 Days attended school in last week
ED6A Level of education attended
ED6B Grade of education attended
ED7 Attended school last year (2003-04)
ED8A Level of education attended last year
ED8B Grade of education attended last year
melevel Mother's education
helevel Education of HH head
hhweight Household sample weight
wlthind5 Wealth index quintiles
The following tables, which are useful in analyzing the schooling status of children aged 5-14, are
derived from the sample data file “hl.sav”.
Please see Annex 1 for more case studies.
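To give a flavour of how such tables are derived, the sketch below builds a weighted cross-tabulation with Python and pandas on a tiny invented dataset mimicking a few "hl.sav" variables (the values are made up for illustration; in practice the SPSS file would be read with an SPSS-capable reader and the full variable set used):

```python
import pandas as pd

# Invented miniature stand-in for the MICS household-listing file "hl.sav".
# HL4 = sex, HL5 = age, ED4 = currently attending school, hhweight = weight.
df = pd.DataFrame({
    "HL4": ["male", "female", "male", "female", "male"],
    "HL5": [6, 9, 13, 7, 30],
    "ED4": ["yes", "yes", "no", "no", "no"],
    "hhweight": [1.2, 0.8, 1.0, 1.0, 1.1],
})

# Restrict to children aged 5-14, then cross-tabulate weighted counts
# of current attendance by sex.
children = df[df["HL5"].between(5, 14)]
table = pd.pivot_table(children, values="hhweight", index="HL4",
                       columns="ED4", aggfunc="sum", fill_value=0)
print(table)
```

The household weights, rather than raw counts, are summed so that the table reflects the population the sample represents.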
3. GATHERING SURVEY DATA AND GETTING READY FOR ANALYSIS
3.1 Data Sources and Contact Points for Obtaining Census and Survey Data
Population Census: Censuses are conducted regularly, every five or ten years, and cover the
entire country. Complete census databases are confidential and are not shared with the public or
third-party users. However, subsets of those databases can be requested by government education
departments once the census reports are fully published. Census databases are normally
maintained by the Census Bureau, Census Department, or Central (or General) Statistical Office
of the country. Alternatively, if the Ministry of Education identifies the required population and
education-related data in tabular form and requests them through higher-level authorities
(ministerial level), the census authorities will generate and provide the requested tables.
A major drawback of using census data is the long lag time. A population census takes over a
year to produce clean databases, and census reports are published two to three years after the
census. As such, the Ministry of Education can obtain education-related datasets at the earliest
two years after the census. There may also be a long delay in providing requested database
subsets or tables. Therefore, few education ministries use census databases; most request only
population data, especially projections of the different school-age populations.
Household surveys: These are available more frequently than population censuses. Moreover,
the conducting agencies are usually willing to share their datasets upon a simple formal request.
With a smaller workload, conducting agencies can create survey databases faster, and most
reports are available within twelve months after completion of fieldwork (data collection).
Access to datasets varies by survey and from country to country. All major household surveys
conducted or sponsored by international organizations have their own websites.
Please refer to “Further studies” for more information.
Although population census and household survey datasets are rich in information, they can be
difficult to obtain and sometimes hard to understand. This section has discussed the contact
points and some tips on how to get quality data from different sources.
3.2 Common Obstacles and Approaches in Gathering Population Census and Household Survey Data
As mentioned above, population censuses and household surveys contain useful data for EFA
monitoring. However, there are limitations.
- Common obstacles in gathering population census database
i) It is difficult to locate the person (or department) with the authority to provide census
datasets to third-party users.
ii) Lack of coordination with other ministries and departments, including the education
ministry, when developing the census questionnaire, so that the questionnaire items may not
be directly useful for constructing education indicators.
iii) A census is normally conducted once every 10 years, and the census data may be obtained
only 2 to 3 years after its completion. Thus, census data are more useful for reviewing
historical trends than for revealing the current situation and status.
iv) Censuses are often conducted during school holidays, and the census date rarely coincides
with the beginning of the school year, which is the reference date for calculating common
education indicators. As such, there may be minor discrepancies between indicators
calculated from the census and from regularly collected service statistics.
In many countries, very few household survey questionnaires were developed with education-
related ministries and agencies. The survey questionnaires were set by the conducting agency
and merely distributed to the education ministry for comment or for information. Compared to
population census data, however, household survey data are easier for education ministries to obtain.
- Main barriers in using household survey data for EFA monitoring7
i) Variation in measures of educational participation
Survey questions on educational attainment and current school attendance are phrased quite
differently from survey to survey. In many cases, assumptions must be made when calculating
common education indicators.
For example, a survey may ask (1) the highest grade completed by household members, and
(2) whether the person is currently attending school. To calculate the net enrollment rate (NER)
or gross enrollment rate (GER) from these questions, an assumption is required about the
level/grade currently attended by the household member: if a child has completed Grade 4
and currently attends school, it is assumed that the child is currently attending Grade 5.
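That assumption can be written down as a small rule. The sketch below is illustrative only (the function and variable names are invented, not taken from any particular survey):

```python
def infer_current_grade(highest_grade_completed, currently_attending):
    """Infer the grade currently attended from the highest grade completed.

    Encodes the assumption described in the text: a child who completed
    grade g and currently attends school is assumed to attend grade g + 1.
    Returns None for a child not currently attending.
    """
    if not currently_attending:
        return None
    return highest_grade_completed + 1

# A child who completed Grade 4 and attends school is assumed to be in Grade 5.
print(infer_current_grade(4, True))   # 5
```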
ii) Timing and duration of survey fieldwork
7 This portion is extracted from: “Guide to the Analysis and Use of Household Survey and Census Education Data (UIS, 2004)”.
Tips:
How to get census data faster and smoother for analysis?
i) When seeking census data, it is better to make contact at the ministerial level.
Approaching the census department/agency as a lower-level education planner may result
in waiting day after day without ever receiving a proper response from the census
department.
ii) Limit the number of variables in the requested dataset. By requesting data just to meet the
minimum requirements, education planners may get a faster response and can conduct
analyses more easily. Census datasets are very large, and subsetting and analysis take
more time if several unused variables are included.
When considering education data from household surveys, note the timing (when the survey
started, or the date to which it refers) and the duration (how long the survey took to complete
data collection). If a survey started just before the end of the school year and took over a
month, the grade completed or attended may differ from household to household depending on
when the interview was conducted – in the early or later days of the survey. This is not a
problem for surveys that set the reference date clearly, as population censuses do.
iii) Sample size and sampling method
A household survey is designed to provide facts on, or characteristics of, the population at
a certain period through a representative sample of households. The representativeness of the
sample depends on the survey design, which is influenced by three factors: the sampling
method used; the level of accuracy sought in the estimates for various indicators; and the
level of data disaggregation.
Some surveys, especially rapid assessments and case-control studies, do not use probability
sampling techniques, and thus their findings may not represent the entire population under
study. Surveys aiming to estimate common characteristics with moderate accuracy require a
smaller sample size, while estimating a rare characteristic (or event) with higher accuracy
requires a larger one. Similarly, estimation at the national (and provincial) level only requires
a smaller sample size, while finer sub-stratification (such as district or lower level) needs a
larger one.
Therefore, it is important to check which sampling method was used in the survey under study, and
whether the sample size is sufficient for the particular education indicators at the desired level
of disaggregation.
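The trade-off between accuracy, rarity, and sample size can be illustrated with the textbook formula for a simple random sample, n = z²·p(1-p)/e². This is only a rough SRS approximation (the design effect of a complex survey multiplies the figure), offered as a sketch rather than a planning tool:

```python
import math

def srs_sample_size(p, margin, z=1.96):
    """Minimum simple-random-sample size to estimate a proportion p
    within +/- margin at roughly 95% confidence:
        n = z^2 * p * (1 - p) / margin^2
    A textbook approximation only; the design effect of a complex
    survey multiplies this figure."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# A common characteristic (p ~ 0.5) with moderate accuracy (+/- 5 points):
print(srs_sample_size(0.50, 0.05))   # 385

# A rarer event (p ~ 0.05) measured more precisely (+/- 1 point):
print(srs_sample_size(0.05, 0.01))   # 1825
```

The second case needs a sample several times larger, matching the point above that rare events estimated with higher accuracy demand larger samples.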
EFA monitoring indicators generally aim to explore differences among population groups, such
as between the general population and disadvantaged groups. The sample size of a particular
household survey may or may not be sufficient to compute indicators for a disadvantaged group
living in a certain area, depending on the definition of "disadvantaged population" and the level
of disaggregation.
If the sample size is not sufficient for the required disaggregation, it is recommended either to
reduce the level of disaggregation, or to compute the required indicators at the desired
disaggregation level and present the results with sufficient caveats.
3.3 Quality Issues, Challenges and Recommendations in Using Survey Data
Generally speaking, data files made available for analysis should be “cleaned”. These files will
have been checked for structural and range errors and edited for internal consistency. Provisions
that compensate for non-response should also be incorporated into the files and fully explained in
the accompanying documentation.
The first step after acquiring a dataset is to become familiar with its structure, the nature of its
variables, the circumstances of data collection, and any limitations on the use of the dataset. The
documentation for a census or household survey, such as reports and a codebook, provides
important background information, such as the sample size and data quality indicators.
Data manipulation and analysis can be demanding and complex. The following discussions do not
provide a comprehensive set of guidelines for the use of datasets; instead, they review some key
issues to consider in analyzing survey data.
(1) Become familiar with the structure of the dataset and explore appropriate ways to analyze it
First, find out whether records within the data files are at the household or individual level, and
second, whether household or individual weights should be used in estimation procedures.
Since sample surveys do not cover the entire population (all households or all individuals) in an
area, weighting factors are required to reconstitute the characteristics of the entire population
from the samples. For example, suppose a survey selects 5 households from each of two
enumeration areas (EAs) of 50 and 60 households respectively; then the household weight for
each of the 5 sample households from the first EA is 10, and from the second EA it is 12. The
weights are calculated while planning the survey and are provided in the dataset.8
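The arithmetic of this example can be sketched as follows (a minimal illustration of a single-stage design weight; real surveys typically add further stages and adjustments):

```python
def household_weight(ea_households, sampled_households):
    """Single-stage design weight: each sampled household represents
    ea_households / sampled_households households in its EA."""
    return ea_households / sampled_households

# Two EAs of 50 and 60 households, 5 sampled from each (as in the text):
print(household_weight(50, 5))   # 10.0
print(household_weight(60, 5))   # 12.0

# Sanity check: the weighted sample reconstitutes the EA totals.
assert 5 * household_weight(50, 5) == 50
assert 5 * household_weight(60, 5) == 60
```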
(2) Study the variables in the datasets before analysis
It is important to refer to the original questionnaires to better understand the variables and how
to analyze the data. For example, to analyze the literacy status of the population, one should
know the nature of the variable, such as: its codes (for example, '1=literate', '2=illiterate');
restrictions (whether the question was asked of all ages, or only those aged 5+ or 15+); its
relationship to other questions/variables (whether it was asked of everybody, or only of persons
who answered 'no education' or 'incomplete primary' on the question on "highest education
level"); and missing values (code '9') and non-response (code '8' for the variable "literacy
status"). Only then can the data analyst determine which variables to select and how to handle
them to produce the required indicator estimates efficiently.
(3) Replicate published results before proceeding with additional calculations
If there are reports of results from the data collection activity, try to replicate these results
before calculating any new indicators. Sorting out the difficulties with calculations already done
will bolster confidence in producing new results.
(4) Consider the issue of missing values
Non-response in a survey or census can happen in one of two ways. First, the entire record
representing an individual or household may be missing because the individual or household
refused to answer, was not available, could not be contacted, etc.; this is called "total
non-response". The second type arises when variables within a record are missing, and is
termed "item non-response". Item non-response is common for variables representing a question
that was not asked of, or not known for, all household members, such as whether a child
attends school during the current school year.
8 For detailed explanations of weighting, see C.-E. Särndal et al. (1992), Model Assisted Survey Sampling, Springer-Verlag;
and W.G. Cochran (1977), Sampling Techniques, John Wiley & Sons.
A technique called "imputation" is often used to compensate for missing values in the case of
item non-response. Imputation replaces missing values with the most suitable ones based on
other cases in the same dataset. The resulting complete (or "square") file allows better
estimates when constructing new indicators. The data analyst must therefore know how item
missing values were treated in the dataset.
In the case of total non-response, a weight-adjustment method is often used: non-response
records are omitted from the dataset and the weights are recalculated. In this case, the
dataset contains two sets of weights, the "sample weight" and the "adjusted/final weight", and
users must employ the final weight in calculating indicators.
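As a toy illustration of these two remedies (not the method any particular agency uses; real imputation and adjustment procedures are far more sophisticated), item non-response can be filled with a modal value, and total non-response handled by inflating the weights of respondents:

```python
from collections import Counter

def impute_mode(values, missing=None):
    """Item non-response: replace missing entries with the most common
    observed value (a deliberately crude form of imputation)."""
    observed = [v for v in values if v != missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

def adjust_weights(weights, responded):
    """Total non-response: drop non-respondents and inflate the weights
    of respondents so the weighted total is preserved."""
    factor = sum(weights) / sum(w for w, r in zip(weights, responded) if r)
    return [w * factor for w, r in zip(weights, responded) if r]

print(impute_mode(["yes", "yes", None, "no"]))   # ['yes', 'yes', 'yes', 'no']
print(adjust_weights([10, 10, 12, 12], [True, True, True, False]))
# [13.75, 13.75, 16.5] -- the weighted total of 44 is preserved
```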
(5) Calculate measures of accuracy (coefficient of variation) of the basic estimates to
gauge the reliability of the estimated indicators
Depending on the overall sample size of the survey, some tabulations may yield cells with very
small numbers of cases, and indicators estimated from those tables may not be reliable. For
this reason, it is paramount to calculate some measure of accuracy and disseminate it
alongside the basic estimate, enabling users to gauge the reliability of all estimates produced.
A good rule of thumb in this regard is to use the coefficient of variation (CV).
The coefficient of variation (CV) is defined as the square root of the variance of an estimate
divided by the estimate itself, multiplied by 100 – that is, expressed as a percentage.
National statistical offices often advocate basic quality guidelines whereby estimates with CVs
greater than 35% should not be used to draw statistical inferences and should not be released to
the public. Be sure to account properly for complex survey designs in analysis, particularly
when calculating variances.
In general, national population censuses collect data on all households and individuals in the
population, so sample design and weighting are not at issue. The only exception is when a
different questionnaire, with more detailed questions, is presented to a sampled fraction of the
population. Even then, no explicit issues of complex survey design arise, since simple,
self-weighting designs (such as stratified simple random sampling or systematic sampling) are
generally used.
In the case of complex survey designs, forming the estimate itself (for example, the primary school
net enrolment rate (NER)) is not an issue, since it is easy to take the design into account by simply
applying the survey weights in the estimator. However, there may be critical issues in variance
estimation, and thus in CV estimation9.
9 See “Guide to the Analysis and Use of Household Survey and Census Education Data (UIS, 2004, pp 36-37)” for further
discussion on issues concerning weighting and calculation of CV in complex sample designs.
3.4 Use of Survey Data along with EMIS Data/Indicators for Policy Analysis
Administrative and household survey data sources measure educational participation in different
ways. Administrative data are based on school reporting at the beginning of the school year and,
in some cases, can include reporting at the middle or end of the school year. Enrolment rates are
based on the numbers of children enrolled in school and on the school-age population estimated
from national censuses and/or vital statistics.
Ideally, household surveys collect data on enrolment and/or school attendance from a
representative sample of children. Questions concerning children's school participation are
typically asked of the head of household. The timing of the survey varies from one survey to
another and is unrelated to the school year; some surveys may even span two different school
years.
Limitation of data
Estimates of educational participation from these two sources may differ for a number of reasons.
One major factor is that the question asked in household surveys about children's school
attendance differs from that answered in school censuses: attending school may differ slightly
from being enrolled in school. Children may appear in school enrolment records without actually
attending school. Thus, enrolment rates from censuses and surveys may be slightly lower than
those from administrative data.
The different rates of participation can also be attributed to the timing of data collection relative to
the school year. A school census conducted at the beginning of the school year and a household
survey collecting data at the end of the school year will likely find different rates of participation
since some children will have enrolled in school without ever actually attending, and other children
will have dropped out of school during the school year.
In addition, the accuracy of the population estimate and the completeness of school-level data can
affect the calculation of participation rates from administrative data. Similarly, the completeness of
the census enumeration and the sample design for the household survey may also affect the
accuracy of estimates produced by censuses and surveys.
In short, many factors may contribute to variations in the estimates of school participation rates
from administrative data and household surveys. Further research is needed to explore the reasons
for similarities or differences between the measures of participation from these two sources.
However, when school-age population estimates are inaccurate and annual school censuses do
not cover several aspects essential for planning and monitoring, only the population census and
household surveys can provide reasonable indicators for planning and EFA monitoring. For
example, school administrative data cannot provide enrolment rates by the socio-economic status
of the household or for disadvantaged groups, nor can they provide reasons for non-participation
(not enrolling) or dropping out.
As such, it is important to use both school administrative data and secondary data from censuses
and surveys for policy analysis, especially for EFA monitoring aimed at reaching the unreached.
4. EXERCISES AND FURTHER STUDIES
4.1 Self-evaluation
How well do you understand why household survey data are essential in EFA monitoring and evaluation? Very well / Somewhat well / Not so much / Almost None
Do you know which common household surveys are conducted in your country? Very well / Somewhat well / Not so much / Almost None
Do you agree that the selected questions in three common household surveys are directly or indirectly useful in exploring access, quality and management of education, and their determinants? Strongly agree / Agree / Not so much / Disagree
Are you able to explain the factors to be aware of when analyzing household survey data to someone who wants to analyze survey data? Very well / Somewhat well / Not so much / Almost None
Are you confident that you could explore a household survey questionnaire and extract key questions which are useful to supplement the regular data collection system for EFA monitoring and evaluation? Confident / Somewhat confident / Not so much / Not at all
4.2 Exercises
i) When was the last population census conducted in your country?
a. Get the census report or tables which may be useful for EFA monitoring.
b. Provide pros and cons for using data from census report(s) for EFA monitoring.
c. Get the census questionnaire and extract the items on education and related to
education.
d. Is it possible to get raw data on education and related fields from the Census
Department? Why or why not?
ii) What is the most recent household survey conducted in your region (or country)?
Briefly describe the following:
a. When was it conducted?
b. Which sampling method was applied?
c. What was the sample size?
d. Explain briefly about the survey findings on education and literacy provided in
the report.
e. Is a data file (dataset) from that household survey available to you?
iii) Connect to the internet and find the MICS website for your country, then:
a. Collect the questionnaire set for the most recent MICS survey in your country (or
in a neighboring country).
b. Download datasets in SPSS format from the most recent MICS survey for your
country (or for a neighboring country).
c. Study the variables, and compile a list of variables which you think are useful to
construct education indicators, especially for EFA monitoring.
iv) From the DHS website, find a recent report (if possible, for your country) and
prepare an abstract useful for education planners.
v) If you had the chance to discuss it, what would you add to or delete from the LSMS
survey questionnaire, and why?
4.3 Further Studies
- International Household Survey Network (See http://www.internationalsurveynetwork.org )
- Luxembourg Income Study (See http://www.lisproject.org/)
- MEASURE DHS (Demographic and Health Surveys): Quality information to plan and improve
population, health, and nutrition programs (See http://www.measuredhs.com/)
- Rand Family Life Survey ( See http://www.rand.org/labor/FLS/ )
- UNESCO Institute for Statistics (UIS). 2004. Guide to the Analysis and Use of Household
Survey and Census Education Data (Can be downloaded at
http://www.uis.unesco.org/template/pdf/educgeneral/HHSGuideEN.pdf )
- UNICEF. Childinfo: Monitoring the Situation of Children and Women (Multiple Indicator
Cluster Survey) ( See http://www.childinfo.org/)
- United Nations Department of Economic and Social Affairs. 2008. Principles and
Recommendations for Population and Housing Censuses, Revision 2. (See
http://unstats.un.org/unsd/publication/SeriesM/Seriesm_67rev2e.pdf )
- United Nations Population Fund. Collecting and using data: population and housing data (See
http://www.unfpa.org/data/census.cfm )
- United Nations Statistics Division (See http://unstats.un.org/unsd/default.htm )
- USAID's DHS EdData Activity website (See http://www.dhseddata.com/ )
- World Bank. Living Standards Measurement Study (LSMS) ( See
http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/EXTLSMS/0,,m
enuPK:3359053~pagePK:64168427~piPK:64168435~theSitePK:3358997,00.html )
- Other organizations with links to education data sources
The William Davidson Institute http://www.wdi.bus.umich.edu/
The Development Gateway http://www.ids.ac.uk/eldis/health/health.htm
University of California http://biko.sscnet.ucla.edu/dev_data/
Country case studies
- NEPAL LIVING STANDARDS SURVEY 2002/03 ( See http://siteresources.worldbank.org/
INTLSMS/Resources/3358986-1181743055198/3877319-1181925143929/nlss2_urban.pdf)
- General Population Census of Cambodia 2008
(See http://www.nis.gov.kh/nis/uploadFile/pdf/EnumeratorManual.pdf)
(Household questionnaire refer to p65)
- Vietnam 2009 Population and Housing Census (See http://www.gso.gov.vn )
- 2005 Population and Housing Census of Korea (See http://kostat.go.kr )
- Tanzania poverty monitoring ( See http://www.povertymonitoring.go.tz/index.asp )
5. ANNEXES
Annex1: Population and Housing Census
A1.1 2005 Population and Housing Census of Korea:
This includes just two education items in one question. Even from such limited data, the education
and literacy status of the population and the schooling status of children can be studied by age,
sex, residence, etc.
A1.2 General Population Census of Cambodia 2008:
This contains the following questions on literacy, education, and disability status in the main
questionnaire. It is thus apparent that all population censuses include anywhere from a limited
number to several questions on the education and literacy status of the entire population.
Annex 2: Education Related Questionnaires from Selected Household Survey
A2.1 Household questionnaire of the Nepal Living Standard Survey 2002/03 10:
This contains a section on education covering (i) literacy, (ii) past enrolment, and (iii) current
enrolment, as follows:
10 NLSS is an alternative name for the LSMS.
Annex 3: Education Related Variables in the Selected Datasets
A3.1 Nepal’s 2006 DHS Dataset
The dataset from the 2006 Nepal DHS contains seven SPSS data files: (i) Births Recode, (ii)
Couples' Recode, (iii) Household Recode, (iv) Individual Recode, (v) Children's Recode, (vi)
Male Recode, and (vii) Household Member Recode. The last data file, NPPR51FL.SAV (for
individual household members; 44,057 persons x 258 variables), contains all the necessary
information except for one important differential of access to and attainment of education, the
"wealth index" (households grouped into five quintiles based on wealth). The wealth index can
be obtained from the third data file, for households. The selected variables from NPPR51FL.SAV are:
HV001 Cluster number
HV002 Household number
HV003 Respondent's line number
HV005 Sample weight
HV024 Region
HV025 Type of place of residence
HV026 Place of residence
HV104 Sex of household member
HV105 Age of household members
HV106 Highest educational level
HV107 Highest year of education
HV108 Education in single years
HV109 Educational attainment
HV121 Member attended school during current school-year
HV122 Educational level during current school-year
HV123 Grade of education during current school-year
HV124 Education in single years - current school-year
HV125 Member attended school during previous school-year
HV126 Educational level during previous school-year
HV127 Grade of education during previous school-year
HV128 Education in single years- previous school-year
HV129 School attendance status
From the above variables, the following frequency tables can be constructed for children aged
5-14.
A3.2 Albania’s 2005 LSMS Dataset
The 2005 Albania LSMS covered 3,638 households comprising 17,302 persons. The survey
datasets are available on the LSMS website. Since the LSMS questionnaire covers several topics
and items, the datasets were split into several files. The datasets directly concerned with
education are educationa_cl.sav (preschool education), educationb_cl.sav (general education
and literacy), and household_rostera_cl.sav (age and sex).
The selected variables from those datasets are:
hhid household identifier
m2b_q00 ID code
m1a_q02 Sex
m1a_q5y Age - Years
m2b_q01 Can read newspaper
m2b_q02 Can write personal letter
m2b_q04 Highest level
m2b_q05 Highest Grade
m2b_q07 Years of preschool
m2b_q09 Currently attending school
m2b_q10 Reason for not attending
m2b_q14 Intends to return to school
m2b_q16 Current level
m2b_q17 Current Grade
m2b_q18 Public - Private
m2b_q20 Distance from dwelling
m2b_q22 Hours to travel
m2b_q23 Minutes to travel
m2b_q24 Transport to school
m2b_q49 Absent from school
m2b_q50 Days missed
m2b_q51 Reason missed school
From the above variables, the literacy (reading and writing) and schooling status of children
aged 7-14 can be analyzed, as seen in the following tables:
Annex 4: List of Key EFA Indicators
Goal 1: ECCE   (H) (H) (S) S (S) (S) (S) (S) (S)
1. Gross Enrolment Ratio (GER) in ECCE programmes
2. Percentage of new entrants to primary Grade 1 who have attended
some form of organized ECCE programme
3. Enrolment in private ECCE centres as a percentage of total enrolment
in ECCE programmes
4. Percentage of trained teachers in ECCE programmes
5. Public expenditure on ECCE programmes as a percentage of total
public expenditure on education
6. Net Enrolment Ratio (NER) in ECCE programmes including pre-
primary education
7. Pupil/Teacher Ratio (PTR) (child-caregiver ratio)
Goal 2:
UPE H
H
H
H
(H)
(H)
(H)
(H)
(H)
(H)
(H)
(H)
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
8. Gross Intake Rate (GIR)
9. Net Intake Rate (NIR)
10. Gross Enrolment Ratio (GER)
11. Net Enrolment Ratio (NER)
12. Percentage of repeaters
13. Repetition Rate (RR) by grade
14. Promotion Rate (PR) by grade
15. Dropout Rate (DR) by grade
16. (Cohort) Survival Rate to Grade 5
17. Primary Cohort Completion Rate
18. Transition Rate (TR) from primary to secondary education
19. Percentage of trained teachers in primary education
20. Pupil/Teacher Ratio (PTR) in primary education
21. Public expenditure on primary education as a percentage of total public
expenditure on education
22. Percentage of schools offering complete primary education
23. Percentage of primary schools offering instruction in the mother tongue
24. Percentage distribution of primary school students by duration of travel
between home and school
Goal 3: Lifelong learning   H H (H) S (S) (S) (S)
25. Number and percentage distribution of the adult population by
educational attainment
26. Number and percentage distribution of young people aged 15-24 years
by educational attainment
27. Gross Enrolment Ratio (GER) for technical and vocational education
and training
28. Number and percentage distribution of lifelong learning/ continuing
education centres and programmes for young people and adults
29. Number and percentage distribution of young people and adults
enrolled in lifelong learning/continuing education programmes
30. Number and percentage distribution of teachers/facilitators in lifelong
learning/continuing education programmes for young people and adults
Note:
H: Household surveys S: School records and school censuses
(H): If collected by Household surveys (S): If collected from ECCE centers and NFE centers
Goal 4: Adult literacy   (H) (H) (S) (S) (S) (S) (S) (S) (S)
31. Adult literacy rate (15 years old and above)
32. Youth literacy rate (15-24 years old)
33. Public expenditure on adult literacy and continuing education as a
percentage of total public expenditure on education
34. Number and percentage distribution of adult literacy and basic
continuing education programmes
35. Number and percentage distribution of facilitators of adult literacy and
basic continuing education programmes
36. Number and percentage distribution of learners participating in adult
literacy and basic continuing education programmes
37. Completion rate in adult literacy and basic continuing education
programmes
38. Number and percentage of persons who passed the basic literacy test
39. Ratio of private (non-governmental) to public expenditure on adult
literacy and basic continuing education programmes
Goal 5: Gender equality   H (H) (H) (H) H H H H H H H H S S S (S) S S S S S S S S S S
40. Female enrolled as percentage of total enrolment
41. Female teachers as percentage of total number of teachers
42. Percentage of female school managers/district education officers
43. Gender Parity Index for:
a. Adult literacy rate (15 years old and above)
b. Youth literacy rate (15-24 years old)
c. GER in ECCE
d. GIR in primary education
e. NIR in primary education
f. GER in primary education
g. NER in primary education
h. Survival rate to Grade 5
i. Transition Rate from primary to secondary education
j. GER in secondary education
k. NER in secondary education
l. Percentage of teachers with pre-service teacher training
m. Percentage of teachers with in-service teacher training
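The Gender Parity Index listed above is conventionally computed as the female value of an indicator divided by the male value, with values close to 1 indicating parity. A short hedged sketch (the rates are invented for illustration):

```python
# Illustrative sketch: GPI = female value / male value of the same indicator.
# A GPI of 1 indicates parity; below 1 favours males, above 1 favours females.
def gender_parity_index(female_value, male_value):
    if male_value == 0:
        raise ValueError("male value must be non-zero")
    return female_value / male_value

# Invented example: female NER 88.2%, male NER 92.5%
print(round(gender_parity_index(88.2, 92.5), 3))  # prints 0.954
```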
Goal 6: Quality of Education
Sources: S (all 11 indicators)
44. Percentage of primary school teachers having the required academic
qualifications
45. Percentage of school teachers who are certified to teach according to
national standards
46. Pupil/Teacher Ratio (PTR)
47. Pupil/Class Ratio (PCR)
48. Textbook/Pupil Ratio (TPR)
49. Public expenditure on education as a percentage of total government
expenditure
50. Percentage of schools with improved water sources
51. Percentage of schools with improved sanitation facilities
52. Percentage of pupils who have mastered nationally defined basic
learning competencies
53. School life expectancy
54. Instructional hours
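Indicators 46-48 are straightforward quotients of school-record totals. The following hedged sketch (with invented school figures) shows the arithmetic:

```python
# Illustrative sketch with invented figures: the three ratio indicators are
# simple quotients computed from school records.
def pupil_teacher_ratio(pupils, teachers):
    return pupils / teachers        # indicator 46 (PTR)

def pupil_class_ratio(pupils, classes):
    return pupils / classes         # indicator 47 (PCR)

def textbook_pupil_ratio(textbooks, pupils):
    return textbooks / pupils       # indicator 48 (TPR)

# A hypothetical school: 960 pupils, 32 teachers, 24 classes, 2880 textbooks
print(pupil_teacher_ratio(960, 32))     # prints 30.0
print(pupil_class_ratio(960, 24))       # prints 40.0
print(textbook_pupil_ratio(2880, 960))  # prints 3.0
```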
Module B2:
Introduction to PASW Statistics (SPSS for Windows)
Contents:
1. Selecting Example Software for Analyzing Household Survey Data to Assist EFA Monitoring
   1.1 CSPro (Census and Survey Processing System)
   1.2 EPI Info
   1.3 Microsoft EXCEL (with VBA Programming)
   1.4 PSPP
   1.5 SAS (Statistical Analysis System)
   1.6 Stata
   1.7 SPSS (Statistical Package for Social Sciences)
2. Introduction to PASW Statistics
   2.1 What is SPSS/PASW Statistics?
   2.2 Step-by-Step Procedure for PASW Statistics Installation
   2.3 Running PASW and Its User Interface
3. Basic Components of PASW Statistics
   3.1 Output Viewers
   3.2 Pivot Tables
   3.3 Charts
   3.4 Saving/Exporting Outputs
   3.5 Online Help
4. Using Data from Other Sources
   4.1 Importing Data from Microsoft Excel
   4.2 Importing Data from Delimited ASCII Text Files
   4.3 Importing Data from Fixed Width Text Files
   4.4 Importing Data from Microsoft Access Databases
5. Tips and Exercises
   5.1 Tips: Do and Don’t
   5.2 Self-evaluation
   5.3 Questions and Hands-on Exercises
Purpose and Learning Outcomes:
To present the background of popular statistical analysis software packages
To understand why SPSS/PASW was chosen as the statistical software to assist EFA monitoring
To practice the installation of PASW
To explore the basic features and components of PASW
To understand how to import data into PASW from other sources
1. SELECTING EXAMPLE SOFTWARE FOR ANALYZING HOUSEHOLD SURVEY DATA TO ASSIST EFA MONITORING
1.1 CSPro (Census and Survey Processing System)
CSPro is a public domain statistical package which can be used for entering, editing, tabulating,
and mapping of census and survey data. It is widely used by statistical agencies in developing
countries, especially for data entry (fixed-width text file format).
It was designed and implemented through a joint effort among the developers of the Integrated
Microcomputer Processing System (IMPS) and the Integrated System for Survey Analysis (ISSA):
the United States Census Bureau, Macro International, and Serpro S.A. CSPro was designed to
replace both IMPS and ISSA.
The current version of CSPro is 4.0.003, released on 20 October 2009. There are four key
applications (together with several useful utilities) in the CSPro application package:
1) A Data Entry Application contains a set of forms (screens) and logic that a data entry
operator uses to key data into a file; it can be used to add new data or to modify existing
data. Users can create an unlimited number of forms (screens), normally as part of the data
entry application.
2) A Batch Edit Application can be used to gather information about a data file, with several
run-time features: writing editing rules that check validity (values within a variable) and
consistency (between variables/cases) and modify data values; making imputations and
generating imputation statistics; generating edit reports automatically or creating customized
reports; and creating additional variables.
3) A Tabulation Application contains a set of table specifications (structure) and a data
dictionary (existing or newly defined) describing the data file to be tabulated. This
application can cross-tabulate variables and map results by geographical area (if applicable)
using both existing variables and new variables created "on the fly". Output tables can
contain selected statistics, from simple counts and percentages to means, medians, modes,
standard deviations, variances, n-tiles, proportions, minimums, and maximums. Tabulations
can be made on the values as they appear in the data file or by applying weights.
4) A Data Dictionary describes the overall organization of a data file, i.e. how data are
stored in it. The data dictionary is at the heart of every CSPro application: one must be
created for each file being used.
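The validity and consistency rules that a Batch Edit application applies can be illustrated with a short sketch. CSPro uses its own logic language, so Python stands in here purely for illustration, and the variable names are invented:

```python
# Hypothetical batch-edit rules: a validity check (value within an accepted
# range) and a consistency check (agreement between variables of one case).
def check_case(case):
    """Return a list of edit messages for one survey case (a dict)."""
    errors = []
    # Validity: AGE must lie within an accepted range
    if not (0 <= case.get("AGE", -1) <= 98):
        errors.append("AGE out of range")
    # Consistency: a person under 15 should not be recorded as married
    if case.get("AGE", 0) < 15 and case.get("MARITAL") == "married":
        errors.append("AGE inconsistent with MARITAL")
    return errors

print(check_case({"AGE": 12, "MARITAL": "married"}))  # one consistency error
print(check_case({"AGE": 34, "MARITAL": "married"}))  # prints []
```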
One of the excellent features of CSPro is that it requires very modest hardware resources to run.
The minimum configuration is (i) a 33MHz 486 processor, (ii) 16MB of RAM, (iii) a VGA
monitor, and (iv) Microsoft Windows 98SE (the program runs only on the Microsoft Windows
family of operating systems). It is public domain software and can be downloaded at no cost.
All in all, CSPro is well suited to conducting data entry and initial analyses for general surveys
and population censuses, and it is widely used in current DHS surveys. However, every data
file must have a data dictionary, even for simple analyses such as constructing frequency tables
for selected variables. It is therefore not suitable for analyzing datasets created in other
software (or datasets without a predefined data dictionary).
More than 100 statistical software packages can be found on the web. Some can be run only online; some are free or public domain while the rest are proprietary; some serve a specialized field while others are general purpose.
It is impossible to review all of these packages, and difficult to select example software for this module. Therefore, this section reviews seven of the most widely used packages.
1.2 EPI Info
“Epi Info” is public domain statistical software for epidemiology, developed by the Centers for
Disease Control and Prevention (CDC) in Atlanta, Georgia (USA) since 1985. It is designed for
the global community of public health practitioners and researchers.
The first version, Epi Info 1, was an MS-DOS batch file on 5.25" floppy disks released in 1985.
Epi Info was developed for the MS-DOS platform until Epi Info 2000, the first Windows-based
version. Starting with Epi Info 2000, data were stored in the Microsoft Access database format
rather than the text file format used in the MS-DOS versions. More recently, Windows Vista
support arrived with version 3.5.1, released on August 13, 2008, and an open-source version,
Epi Info 7, whose source code can be downloaded, was released on November 13, 2008.
The current versions provide easy form and database construction, data entry, and analysis with
epidemiologic statistics, maps, and graphs. The primary applications within Epi Info are:
MakeView to create forms and questionnaires which automatically creates a database;
Enter to enter data into database through forms and questionnaires created in MakeView;
Analysis to produce statistical analyses of data, report output and graphs;
EpiMap to develop GIS maps with overlaying survey data; and
Epi Report to combine analysis output, enter data and any data contained in Access or SQL
server and present it in a professional format. The generated reports can be saved
as HTML files for easy distribution or web publishing.
Although “Epi Info” is a CDC trademark, the programs, documentation, and teaching materials are
in the public domain and may be freely copied, distributed, and translated. The 2003 analysis
documented 1,000,000 downloads from 180+ countries and its manual and/or programs have been
translated from English into 13 additional languages.
One of Epi Info's most attractive features is that it supports every step from questionnaire
development to data analysis and tailor-made reporting. First, users develop a questionnaire
with Epi Info's "MakeView". Based on that questionnaire, one can customize the data entry
process, enter data into the database (created when the questionnaire was developed), and
finally analyze the data. For epidemiological uses, such as outbreak investigations, the ability
to rapidly create an electronic data entry screen and then immediately analyze the collected
data can save considerable time compared with paper surveys.
As such, it is one of the best packages for survey developers and researchers, especially in
epidemiological research and surveys. However, it is not easy to use it to analyze a dataset
created in other software, which is the main theme of this module.
1.3 Microsoft EXCEL (with VBA Programming)
Microsoft Excel (full name Microsoft Office Excel), a component of Microsoft Office, is
Microsoft's spreadsheet application for both Windows and Mac OS X operating systems. Excel
first appeared in 1985 on Mac OS, and the first Windows version followed in November 1987.
Microsoft Excel has been the most widely used spreadsheet application since the release of
Version 5 in 1993. The most recent commercial versions are Microsoft Office Excel 2007 for
Windows and 2008 for Mac.
Key features of Microsoft Excel include calculation, graphing tools, pivot tables (or OLAP
cubes) and a macro programming language, Visual Basic for Applications (VBA). It can also
carry out several database management functions, including support for SQL (Structured Query
Language) and Network DDE (Dynamic Data Exchange), which allows spreadsheets on
different computers to exchange data.
Since the 1993 version, Microsoft Excel has supported programming through Microsoft's
Visual Basic for Applications (VBA). VBA is based on Visual Basic and adds the ability to
automate tasks in Excel and to provide user-defined functions (UDFs) for use in worksheets.
Moreover, programming with VBA allows spreadsheet manipulation that is impossible with
standard spreadsheet techniques. Programmers may write VBA code directly using the Visual
Basic Editor (VBE); alternatively, users can record VBA code that replicates their actions on
the spreadsheet, allowing simple automation of regular tasks.
Through VBA, a programmer can access a database (or dataset) placed on a spreadsheet or
held in separate files (created in non-Excel formats). Visual Basic modules can then be written
to construct frequency and crosstab tables, calculate various statistics, and carry out
transformation, sorting, selection, and formatting. The results, intermediate or final, can be
written back to a spreadsheet or saved in a separate file.
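The frequency and crosstab tables described above can be sketched in a few lines. Python and its standard library stand in for VBA here, and the records are invented for illustration:

```python
from collections import Counter

# Invented household-survey records: (sex, school_attendance)
records = [
    ("F", "yes"), ("M", "yes"), ("F", "no"),
    ("M", "yes"), ("F", "yes"), ("M", "no"),
]

# Frequency table of a single variable (distribution of sex)
freq = Counter(sex for sex, _ in records)

# Crosstab: the joint distribution of two variables
crosstab = Counter(records)

print(freq["F"], freq["M"])    # prints 3 3
print(crosstab[("F", "yes")])  # prints 2
```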
The most appealing feature of Microsoft Excel is its wide availability as a component of
Microsoft Office. It is one of the most frequently used applications, since almost anyone with
basic computer literacy can use it easily.
On the other hand, only a few users are familiar with VBA, pivot tables and the database
functions that are essential for analyzing household survey data for EFA monitoring. Microsoft
Excel is, however, the most suitable software for putting final touches on statistical output
tables produced by other software, such as modifying a table format and adding graphs and
charts.
1.4 PSPP
PSPP is a free, open-source alternative to the proprietary statistics package SPSS. It is an
application for the analysis of sampled data and has both a graphical user interface and a
conventional command-line interface. It is written in C, uses the GNU Scientific Library for its
mathematical routines, and "plotutils" for generating graphs. PSPP has been distributed since
1998, and the most recent release (version 0.6.2) appeared on 11 October 2009.
PSPP provides basic, but very useful, statistical analyses: constructing frequency and crosstab
tables; non-parametric tests, significance tests and reliability tests; fitting various linear
regression models; factor analysis; and computing basic statistics. It also provides database
management features such as sorting and selecting cases, computing new variables, recoding
into existing and new variables, and more.
Users can select outputs (tables and graphics) in ASCII, PDF, PostScript or HTML formats.
Some graphs, such as histograms, pie charts and np-charts, can also be generated. PSPP can
open SPSS data files and can import data from Gnumeric, OpenDocument and Microsoft Excel
spreadsheets, databases, comma-separated text files and ASCII text files. It can save data files
in the SPSS 'portable' file format (*.por), the SPSS 'system' file format (*.sav) and ASCII text
file format. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl
provides an interface to the PSPP libraries.
The program file and manual can be downloaded from "http://www.gnu.org/software/pspp/".
The program can be installed freely and used without limitations. However, its documentation
and help system are of limited use to beginners.
1.5 SAS (Statistical Analysis System)
SAS is an integrated system of software products from SAS Institute Inc. It enables
programmers (users) to perform many kinds of analysis, data management and output
generation, such as:
data entry, retrieval, management, and mining
report writing and graphics
statistical analysis
business planning, forecasting, and decision support
operations research and project management
quality improvement
applications development
data warehousing (extract, transform, load)
platform independent and remote computing
In addition, SAS has many business solutions that enable large scale software solutions for areas
such as IT management, human resource management, financial management, business intelligence,
customer relationship management and more.
SAS is driven by SAS programs that define a sequence of operations to be performed on data stored
as tables. SAS Library Engines and Remote Library Services allow access to data stored in external
data structures and on remote computer platforms.
SAS functions via application programming interfaces, in the form of statements and procedures. A
SAS program is composed of three major parts namely, (a) the DATA step, (b) procedure steps,
and (c) a macro language.
The DATA step handles file structure, the reading and writing of records, and the closing of
files. All other tasks are accomplished by procedures in the procedure steps. Procedures are
not restricted to built-in ones; they allow extensive customization, controlled by mini-languages
defined within the procedures. SAS also has an extensive SQL procedure, allowing SQL
programmers to use the system with little additional knowledge.
The macro programming extensions allow the use of "open code" macros or the interactive
matrix language of the SAS/IML component. Macro code in a SAS program undergoes
preprocessing. At runtime, DATA steps are compiled and procedures are interpreted and run in
the sequence they appear in the SAS program. A SAS program requires the SAS software to
run. SAS consists of a number of components, which require separate licenses and
installations.
SAS runs on IBM mainframes, Unix machines, OpenVMS Alpha, and Microsoft Windows,
and code moves almost transparently between these environments. SAS requires extensive
programming knowledge, and it is the most expensive and comprehensive statistical analysis
software reviewed here.
1.6 Stata
The name "Stata" takes letters from the words "statistics" and "data". It is a general-purpose
statistical software package with a full range of capabilities, including data management,
statistical analysis, graphics, simulations, and custom programming. It is used by many
businesses and academic institutions around the world. Most of its users work in research,
especially in the fields of economics, sociology, political science, and epidemiology.
Stata was first released commercially in 1985 by StataCorp, which in recent years has issued a
new major release roughly every two years. The most recent version is Stata 11, distributed on
27 July 2009. There are four major builds of each version of Stata:
Stata/MP for multiprocessor computers (including dual-core and multi-core processors)
Stata/SE for large databases
Stata/IC the standard version
Small Stata, a smaller student version available for educational purchase only
Stata emphasizes a command-line interface to facilitate replicable analyses, although a
graphical user interface (menus and dialog boxes that give access to built-in commands) has
been available since Stata 8.
Stata allows one dataset at a time to be opened for review and editing in spreadsheet format,
but the dataset must be closed before other commands are executed. Stata holds the entire
dataset in memory, which limits its use with extremely large datasets. The dataset is always
rectangular in format; that is, all variables hold the same number of observations (although
some entries may be missing values).
Stata's proprietary file formats are platform independent, so users of different operating systems can
easily exchange datasets and programs. Stata's data format has changed over time, although not
every Stata release includes a new dataset format. Every version of Stata can read all older dataset
formats, and can write both the current and most recent previous dataset format. Thus, the current
Stata release can always open datasets that were created with older versions, but older versions
cannot read newer format datasets.
Stata can read and write SAS XPORT format datasets natively, and it can import data from
ASCII formats (CSV or fixed-width) and spreadsheet formats (including various Microsoft
Excel formats). Only a few other econometric applications can directly import data in Stata file
formats.
One advantage of Stata is that both datasets and programs are independent of the operating
system. Another is that user-written commands can operate alongside built-in commands;
several useful commands are available for download from the internet (these command files
are called ado-files). Stata's version control system is designed to give a very high degree of
backward compatibility, ensuring that code written for previous releases continues to work in
newer versions.
One difficulty in using Stata is that it requires a thorough understanding of its command-line
interface and basic commands. It seems that only those with extensive programming
experience could learn Stata on their own; that is, tailor-made training may be required before
beginners can work effectively with Stata.
1.7 SPSS (Statistical Package for Social Sciences)
SPSS is one of the most popular data analysis packages, offering a wide range of statistical
methods and procedures. SPSS was first developed in 1968 at Stanford University for internal
use only (see the brief history of SPSS/PASW Statistics in Section 2.1 of this module). In
March 2009, the name SPSS was changed to PASW Statistics (Predictive Analytics
SoftWare)1.
Recent versions of SPSS/PASW Statistics can handle multiple datasets with an almost
unlimited number of variables and cases. They allow importing and exporting of data and
outputs in different formats, including Microsoft Excel and various text formats. Both a menu-
driven (dialog box) graphical interface and a command-line (syntax) interface are available.
It is the most user-friendly statistical software for beginners doing basic analysis. It offers
excellent online help, complete user manuals and self-learning tutorials. The package covers
almost all statistical methods required, from basic to advanced analysis, along with good data
management and data documentation.
It has also been found that the vast majority of household surveys were analyzed with SPSS
and/or that final survey datasets are available in SPSS (*.sav) format.
For these reasons, PASW Statistics is chosen as the example software to demonstrate
household survey data analysis for EFA monitoring purposes in this module. At the same time,
given its wide availability and users' familiarity with it, Microsoft Excel is also selected as a
second example software, especially for finalizing outputs and for presentation purposes.
1 Recently, PASW Statistics has been changed to IBM SPSS Statistics after becoming part of IBM in late 2009.
Disclaimer
UNESCO does not recommend any particular software. PASW Statistics and Microsoft
Excel are used only as "example" software in this module. Software is just a tool to assist
in exploring EFA monitoring indicators from household survey datasets, and users can
choose any statistical software.
The review and selection of the statistical software are based solely on the limited
experience of the author of this module and do not reflect UNESCO's views.
Several facts are taken from the user manuals of the underlying software and from
Wikipedia, the web-based free encyclopedia.
2. INTRODUCTION TO PASW STATISTICS
2.1 What is SPSS/PASW Statistics?
Brief History
In 1968 at Stanford University, Norman H. Nie, a social scientist and doctoral candidate; C.
Hadlai (Tex) Hull, who had just completed a master of business administration; and Dale H.
Bent, a doctoral candidate in operations research, developed a software system based on the
idea of using statistics to turn raw data into information essential to decision-making. This
statistical software system was called SPSS, the Statistical Package for the Social Sciences,
which is the root of present-day PASW, the Predictive Analytics SoftWare.
Nie, Hull and Bent developed SPSS out of the need to quickly analyze volumes of social
science data gathered through various methods of research. Nie represented the target audience
and set the requirements; Bent had the analysis expertise and designed the SPSS system file
structure; and Hull programmed. The initial work on SPSS was done at Stanford University
with the intention of making it available only for local consumption. With the launch of the
SPSS user's manual in 1970, demand for the software took off. Indeed, the original SPSS
user's manual has been described as "Sociology's most influential book2". With growing
demand and popularity, a commercial entity, SPSS Inc., was formed in 1975. Until the
mid-1980s, SPSS was available only on mainframe computers.
With the advance of personal computers in the early 1980s, SPSS/PC was introduced in 1984
as the first statistical package to appear on a PC, running on the MS-DOS platform. Similarly,
the first statistical product for the Microsoft Windows (version 3.1) operating system was again
SPSS, released in 1992.
Versions of SPSS in Recent Years
SPSS is updated regularly to fit in with, and exploit the advanced features of, new operating
systems, and to fulfill the growing needs of users.
SPSS 16.0.2 - April 2008
SPSS Statistics 17.0.1 - December 2008
PASW Statistics 17.0.2 - March 2009 (PASW = Predictive Analytics SoftWare)
PASW Statistics 18.0.1 (or) IBM SPSS Statistics 18.0.1 - August 2009
PASW is simply an enhancement and renaming of SPSS; not even the version numbering was restarted.
SPSS Users
At the beginning, SPSS users were mainly academic researchers, mostly at large universities
with mainframe computers. Because of its relatively high price, tough security systems and
limited user-friendliness, SPSS/PC+ did not attract many users in its early days. Use of SPSS
increased rapidly after the release of SPSS for Windows, which is user-friendly and readily
available (a fully functional evaluation version with a specified trial period can be downloaded
easily).
2 Wellman, B., "Doing It Ourselves", pp. 71-78 in Required Reading: Sociology's Most Influential Books, edited by Dan
Clawson, University of Massachusetts Press, 1998, ISBN 9781558491533
The Statistical Package for the Social Sciences (SPSS) was the first comprehensive data analysis software available on personal computers. Its original user's manual is widely regarded as "Sociology's most influential book".
Moreover, the cost of an SPSS/PASW license is minimal for students and within a reasonable
range for members of corporations and organizations, though PASW Statistics is still expensive
for general users. Nowadays, its users include market researchers, health researchers, survey
companies, governments, education researchers and marketing organizations.
Strengths of SPSS/PASW Statistics
In addition to superb statistical analysis, PASW offers good data management (case selection,
file reshaping, creation of derived data) and data documentation (a metadata dictionary is
stored with the data). PASW data files are portable (smaller than those of other database
systems) and its program (PASW syntax) files are quite small.
Organization of PASW Statistics (SPSS) Software Package
PASW is organized as a base system plus optional components or modules. Most of the
optional components are added on to the base system; however, some, such as Data Entry,
work independently.
The base system, the main component for running PASW, provides the following functions:
Data handling and manipulation: importing from and exporting to other data file
formats, such as Excel, dBase, SQL and Access, and sampling, sorting, ranking,
subsetting, merging, and aggregating data sets;
Basic statistics and summarization: Codebook, Frequencies, Descriptive statistics,
Explore, Crosstabs, Ratio statistics, Tables, etc.;
Significance testing: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), and
Nonparametric tests; and
Inferential statistics: Linear and non-linear regression; Factor, Cluster and Discriminant
analysis.
Some of the optional components (add-on modules) available in version 17.0 are:
Data Preparation provides a quick visual snapshot of your data. It provides the ability to
apply validation rules that identify invalid data values. You can create rules that flag out-of-
range values, missing values, or blank values. You can also save variables that record
individual rule violations and the total number of rule violations per case. A limited set of
predefined rules that you can copy or modify is provided.
Missing Values describes patterns of missing data, estimates means and other statistics, and
imputes values for missing observations.
Complex Samples allows survey, market, health, and public opinion researchers, as well as
social scientists who use sample survey methodology, to incorporate their complex sample
designs into data analysis.
Regression provides techniques for analyzing data that do not fit traditional linear statistical
models. It includes procedures for probit analysis, logistic regression, weight estimation,
two-stage least-squares regression, and general nonlinear regression.
Advanced Statistics focuses on techniques often used in sophisticated experimental and
biomedical research. It includes procedures for general linear models (GLM), linear mixed
models, variance components analysis, loglinear analysis, ordinal regression, actuarial life
tables, Kaplan-Meier survival analysis, and basic and extended Cox regression.
Custom Tables creates a variety of presentation-quality tabular reports, including complex
stub-and-banner tables and displays of multiple response data.
Forecasting performs comprehensive forecasting and time series analyses with multiple
curve-fitting models, smoothing models, and methods for estimating autoregressive
functions.
Categories performs optimal scaling procedures, including correspondence analysis.
Conjoint provides a realistic way to measure how individual product attributes affect
consumer and citizen preferences. With Conjoint, you can easily measure the trade-off
effect of each product attribute in the context of a set of product attributes - as consumers do
when making purchasing decisions.
Exact Tests calculates exact p values for statistical tests when small or very unevenly
distributed samples could make the usual tests inaccurate. Available only on Windows OS.
Decision Trees creates a tree-based classification model. It classifies cases into groups or
predicts values of a dependent (target) variable based on values of independent (predictor)
variables. The procedure provides validation tools for exploratory and confirmatory
classification analysis.
Neural Networks can be used to make business decisions by forecasting demand for a
product as a function of price and other variables, or by categorizing customers based on
buying habits and demographic characteristics. Neural networks are non-linear data
modeling tools. They can be used to model complex relationships between inputs and
outputs or to find patterns in data.
EZ RFM performs RFM (recency, frequency, monetary) analysis on transaction data files
and customer data files.
Amos™ (analysis of moment structures) uses structural equation modeling to confirm and
explain conceptual models that involve attitudes, perceptions, and other factors that drive
behavior.
Another version of PASW, PASW Server, is also available; it is built on a client/server
architecture and has some features, such as scoring functions, that are not available in the
normal version.
2.2 Step-by-Step Procedure for PASW Statistics Installation
First, the user must have the PASW Statistics software package with an official license, or
install an evaluation version with a 21-day trial period. In this manual, the evaluation version
of PASW Statistics 17.0 for Windows is used for demonstration.
Follow these steps to install the evaluation version of PASW Statistics 17.0:
Step 1: Check Installed SPSS Versions
Make sure no older version is already installed. If a previous version exists, uninstall it
before starting the installation process.
Step 2: Insert Installation CD and Run “PASW_Statistics_1702_win_en.exe”
Insert the Installation CD and open “PASW 17.0 for Windows” folder.
Double-click the file named “PASW_Statistics_1702_win_en.exe”; the PASW InstallShield
Wizard will begin extracting the contents automatically.
The system requirements for installing PASW Statistics 17.0 are:
Operating System: Microsoft Windows 7, Vista, XP or 2000
System Requirements: Intel Pentium-compatible processor, 256MB RAM, 700MB free
disc space, VGA monitor, and Internet Explorer 6.0 or above
Step 3: Follow the “InstallShield Wizard” until the Installation Is Successfully Completed
When asked to choose a license type, select “Single user license” and click Next to
continue to the license agreement.
Select I accept the terms in the license agreement and click Next to continue.
Immediately, a dialog window with additional information for the users will appear. Read
the information and click Next to continue.
Fill in “User Name” and “Organization” accordingly and click Next to continue.
A window will pop up requesting the location (folder) in which to save the program files. It is
strongly recommended to accept the default location and simply click “Next” to proceed.
Note: leave the serial number blank to install the evaluation version.
The PASW InstallShield Wizard will ask for confirmation before beginning the installation.
Click Install to start the installation, or Back to review and change the installation settings.
As soon as the “Install” button is clicked, installation begins; it takes just a few minutes.
During installation, do not press any keys or click the mouse, as this may interrupt the process.
When installation is complete, the Wizard will request to register PASW.
1. Click OK to begin registering process.
2. Select “Enable a temporary trial period” and Click Next.
3. Click browse button.
4. Select the trial license file “trial.txt” and click Open to get the trial license file.
Do NOT press or
click here
5. Click Next to continue; the next window will confirm that the trial period has been enabled.
6. Click Finish to complete the installation of PASW Statistics 17.0 with its 21-day trial period.
At this point, the installation of PASW Statistics 17.0 is successfully completed.
2.3 Running PASW and Its User Interface
After successful installation, a program group called “PASW Statistics 17” will be placed under
“SPSS Inc.” in the “Start Menu”. There will be at least two items in the menu:
1) PASW Statistics 17, and
2) PASW Statistics 17 License Authorization Wizard.
More items may be displayed in the menu, depending on which optional components (add-on
modules) have been installed.
2.3.1 Starting and Ending a PASW Session
To start PASW, just click the “PASW Statistics 17” menu item as shown below.
Alternatively, double-click any PASW (or SPSS) data or syntax file to start PASW Statistics. In this case, the
double-clicked file will also be opened in an appropriate window.
To start just click
“PASW Statistics 17”
To browse and open
data file not in the list
When running PASW for the first time, a superimposed dialog window is displayed on top of
the Data Editor window. This window helps users start an initial task, such as opening a data,
syntax or output file; running the tutorial for beginners; entering new data; or activating an
existing query or creating a new query to import data from another database file. Opening an
existing data file, from the list or by browsing, is the most common initial task in PASW Statistics.
By default, up to the nine most recently used files will be listed in both “Open an existing data source”
and “Open another type of file”. Both lists will be empty when running PASW for the first
time. An unlisted data file can be opened by double-clicking the “More Files…” item and following
the steps of a regular “open file” dialog box. To open one of the most recently used files, double-click
its name in the list, or select it and click the OK button.
By checking the box , only the Data Editor will appear when starting
PASW Statistics in future sessions. It is recommended simply to click the “Cancel” button to close the
dialog window, so that the superimposed dialog window keeps appearing in future sessions. In
this case, a blank Data Editor window will appear.
With the “evaluation” version, the following message will appear every time the
program is run. It will show 21 days remaining when using PASW for the first time after installation,
20 days the following day, and so on. After the trial period ends, the
PASW processor will no longer work; that is, commands will not produce any results.
2.3.2 Data Editor and Data Views
In PASW (and earlier versions of SPSS), data files are displayed in the “Data Editor”. In the
Data Editor, if the mouse cursor hovers over a variable name (a column heading), a more descriptive
label for that variable is displayed, for every variable that has been defined with a label.
Data editor has two views: “Data View” and “Variable View”.
Data View: the actual data values are displayed in the cells by default. The “case numbers” are
displayed as row captions (like the “row numbers” in Microsoft Excel), and the variable names as the
column captions. For the cells, users can choose to display descriptive value labels (for example, to
display “Male” and “Female” instead of the codes 1 and 2) by choosing View from the menus and then
clicking Value Labels, as follows:
Tips:
Save the syntax and output files frequently!
The active PASW session will end and exit automatically if the user closes the last active
dataset (or data file). On exit, PASW will ask whether to save all unsaved windows, including
data, output and syntax windows. PASW has no automatic recovery feature and there is no
“undo” for data transformations. Thus, it is important to save the syntax and output files
frequently. Data files should be saved under a different name after applying any transformations or
deleting any variables, so as not to lose the original data files.
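The tip above can also be carried out in command syntax. As a minimal sketch (the file path is hypothetical), saving the active dataset under a new name keeps the original file intact:

```
* Save the active dataset under a new name (hypothetical path).
SAVE OUTFILE='C:\SPSS Training\Sample\BDHR50FL_modified.sav'.
```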
or, simply, click the Labels button . Value labels make the responses in the
household survey easier to interpret.
The following is the dataset for individual household member of Bangladesh Demographic and
Health Survey 2007 in the Data View with Value Labels.
Relationship
to HH Head Age Sex
The Data View shows the cases (or observations) in rows, and each column represents a variable (a
characteristic that is being measured). In the above example, each individual “member of selected
households” is a case, and each “item in the questionnaire” is a variable. For example, “relationship
to head of household”, “age” or “highest education level” is a variable. Each cell contains a single
data value of a variable for a case; the cell is where the case and the variable intersect. For
example, if the case is the “head of household” (row 13) and the variable is “sex” (HV104), the
cell is the “sex of the head of household”. When displaying the actual data values, the cell shows
“2”; it becomes “Female” if value labels are displayed. PASW data files are stored in
flat-file format, and data cells cannot store formulas.
Variable View: This displays the metadata dictionary, where each row represents a variable and
shows the attributes (or characteristics or properties) of that variable in 10 columns:
1) variable name;
2) type: numeric, comma, dot, scientific notation, date, dollar, custom currency, and string;
3) variable width, i.e. number of digits or characters;
4) number of decimal places;
5) variable label;
6) value labels;
7) codes for user-defined missing values;
8) column width in data view;
9) cell alignment, i.e., left, right or center when displaying in data view; and
10) type of measurement (scale, ordinal or nominal).
All attributes are saved with data values in the file.
The number of rows and columns (the size or dimension) of the data file is determined by the number of
cases and variables used in that file. Data can be entered in any cell, even one outside
the boundaries of the defined dataset. In this case the dimension of the data view is extended to
include all the rows and columns needed to cover the newly entered cell. Variable names for the undefined
columns will automatically be assigned as “VAR00001”, then “VAR00002”, and so on.
Cells without data in the newly expanded data range (in both rows and columns) will
be filled with “.” (the system-missing value) for numeric variables, and “ ” (a blank is a valid
string value in PASW) for string variables.
In this case, the type of the new variables is automatically defined as “numeric” and the default attributes
for numeric variables are set by PASW. Users can change all attributes, including variable
name and type, in the Variable View.
Apart from entering them directly in the Variable View, the following two methods can be used to define
variable properties:
Copy Data Properties Wizard provides the ability to use an external data file or another
dataset that is available in the current session as a template for defining file and variable
properties in the active dataset. Similarly, variables in the active dataset can be used as
templates for other variables in the same dataset. “Copy Data Properties” is available on the
“Data” menu in the main SPSS window.
Define Variable Properties, which is also available on the “Data” menu, scans the data and
lists all unique data values for any selected variables, identifies unlabeled values, and
provides an auto-label feature. This method is particularly useful for categorical variables
that use numeric codes to represent categories, for example, 0 = Male, 1 = Female.
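Variable properties can also be assigned in command syntax. The sketch below assumes a hypothetical variable named sex coded 1/2, with 9 as a user-missing code; the actual names and codes depend on the dataset:

```
* Assign a variable label, value labels, a missing-value code and a
* measurement level to the hypothetical variable "sex".
VARIABLE LABELS sex 'Sex of household member'.
VALUE LABELS sex 1 'Male' 2 'Female'.
MISSING VALUES sex (9).
VARIABLE LEVEL sex (NOMINAL).
```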
2 variables just
created automatically
Value just typed-in
New properties
typed-in / changed
3. BASIC COMPONENTS OF PASW STATISTICS
3.1 Output Viewers
The outputs created by the program are displayed in the “PASW Statistics Viewer”. By default, all
outputs, including the command syntax used during the analysis, output tables, charts, notes and the
activity logs of the session, are recorded in the Viewer. Users can determine which
output items to display or hide in the Viewer. This can be set through the “Viewer” tab of the
“Options” sub-menu in the “Edit” menu.
If PASW is started by opening a data file, a Viewer (with the name Output1 [Document1]) will
automatically activate and record the command syntax used to open the data file under the “Log”
item. If the user decides not to show command syntax in future sessions, for example, the
“Log” item can be hidden initially as shown in the above exhibit. Otherwise, the following log will be displayed
when opening the data file “BDHR50FL.SAV”.
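The log recorded when opening a data file typically contains syntax along the following lines (the path shown is illustrative):

```
GET
  FILE='C:\SPSS Training\Sample\BDHR50FL.SAV'.
DATASET NAME DataSet1 WINDOW=FRONT.
```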
Both “Data Editor” and “PASW Statistics Viewer” will be automatically opened when starting a PASW Statistics session. A user-friendly Help system is available and ready to serve whenever requested by pressing F1 key: the opening page “Getting Help” of the “Base System Help” will be displayed if working on data editor or output viewer; or context sensitive “PASW Command Syntax Guide” for the specific command when working on the syntax.
Options for:
Log
Warnings
Notes
Title
Page title
Pivot table
Chart
Text output
Tree model
Model viewer
A typical PASW Viewer, after running the cross-tabulation (crosstab) of “Highest education level”
by “sex”, can be seen in the following illustration. Six types of output are recorded in the Viewer:
(i) Command Log; (ii) Title; (iii) Notes; (iv) Active Dataset; (v) Case Processing Summary; and
(vi) the output table (Highest educational level * Sex of household member Cross-tabulation).
The PASW Statistics Viewer is useful for:
browsing the results, as in Windows Explorer;
showing or hiding selected output items (notes, tables and charts);
deleting selected output items;
changing the display order of results; and
moving items between the Viewer and other applications.
In the Viewer, double-click the appropriate icon in the left pane to unhide a hidden item;
doing the same to a visible item will hide it. For example, notes are hidden by default in outputs, and
double-clicking the Notes icon will display them.
Drag-and-drop can be applied on icons in the left pane to change the location of any item (order in
the output pane). Click the icon to activate the associated item, and press “delete” key to eliminate
that item (and its icon) from the output.
Click to select
Double-click toggles
hide / unhide
Drag-and-drop to
change location
(order in output)
Notes are hidden!
Double-click here
to unhide
Tips:
To use particular output items in other applications such as MS Excel or
Word, a simple copy-and-paste technique can be used. Moreover, almost any object, such as a paragraph
or a chart, can be pasted onto the output view, just as in other popular application programs.
3.2 Pivot Tables
A pivot table is a data summarization tool used to create output table formats. Pivot-table tools can
automatically sort, count, and total the data stored in one table or spreadsheet and create a second
table. For example, the user can change the variables displayed in rows to columns and vice versa. This
ability to "rotate" is known as pivoting, and a table with this ability is called a “pivot table”. One
of the significant features of the PASW Statistics Viewer is its ability to handle pivot tables.
Most of the output tables in the PASW Viewer can be pivoted interactively. Users can
set up and change the table structure by dragging and dropping the variables, or, by selecting
specific items of the layer variables, choose whether the results represent the entire dataset or just a
subset of the data.
Options for manipulating a pivot table include:
transposing rows and columns;
moving rows and columns;
creating multidimensional layers;
grouping and ungrouping rows and columns;
showing and hiding rows, columns, and other information;
rotating row and column labels; and
finding definitions of terms.
The following illustrates how pivoting can be used in data analysis and presentation.
First, run a cross-tabulation of “Educational Attainment” by “sex” by “type of place of residence”
(click Analyze on the Main Menu and select Crosstabs under Descriptive Statistics; then select each
variable and click the appropriate arrowhead to move the variable name to the row, column or layer box;
and finally click OK – see the next module for a detailed illustration).
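The same cross-tabulation can also be pasted or typed as command syntax. A sketch, assuming the BDHS variable names hv106 (educational attainment), hv104 (sex) and hv025 (type of place of residence):

```
* Educational attainment by sex, layered by type of place of residence.
CROSSTABS
  /TABLES=hv106 BY hv104 BY hv025
  /CELLS=COUNT.
```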
The main results obtained by the above cross-tabulation command are shown below.
Then, go through the following steps for pivoting an output table:
1) Double-click the output table located in right result pane to go into table editing mode;
2) The main menu will contain a new item “Pivot”;
3) Select “Pivot” menu and click “Pivoting Trays”; and
4) In the pivot tray, arrange the row, column and layer variables (including statistics) as
necessary by dragging and dropping the variable names.
The following illustrates the use of the pivot-table method on the crosstab table.
Double-Click
any place
on this Table
Drag and
drop
Click to get “Pivoting Trays”
New Item
3.3 Charts
(a) Creating Charts while Analyzing Data
PASW provides high-resolution charts with a single click from several procedures on the “Analyze” menu.
For example, in the bottom-left area of the “Crosstabs” dialog, there is a check-box “Display
clustered bar charts” which helps create useful graphs for the selected variables.
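The equivalent in command syntax is the /BARCHART subcommand of CROSSTABS; the variable names below are assumed from the BDHS dataset:

```
* Crosstab with a clustered bar chart of the same variables.
CROSSTABS
  /TABLES=hv106 BY hv104
  /CELLS=COUNT
  /BARCHART.
```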
(b) Creating Chart through Builder
Different types of charts and plots can be produced through the “Chart Builder” item
under the “Graphs” menu. The Chart Builder helps build charts from predefined gallery charts
(templates/samples) or from individual parts (axes and bars). A chart can be built by dragging
and dropping the gallery charts or basic elements onto the canvas, which is the large area to the
right of the Variables list in the Chart Builder dialog box. While a chart is being built, the canvas
displays a preview of the chart using the defined variable labels and measurement levels. The preview
does not reflect the actual data: it uses randomly generated data to provide a rough sketch of
how the chart will look.
Using the gallery is the preferred method for new users. It is also possible to build a chart from
basic elements, which is more complex since the chart options must be defined explicitly by the user.
Construct a chart by using gallery
First, click the “Chart Builder” item under “Graphs” menu, and the following Chart Builder
window with superimposed warning will appear. Click OK since users can define temporary
variable types while building charts.
Then, follow the steps for building a chart from the gallery as:
1) Click the Gallery tab if it is not already displayed.
2) In the Choose From list, select a category of charts. Each category offers several types.
3) Select the suitable chart type by dragging its picture onto the canvas, or by
double-clicking it. If the canvas already displays a chart, the gallery chart
replaces the axis set and graphic elements on the chart.
4) Drag variables from the Variables list and drop them into the axis drop zones and, if
available, into the grouping drop zone. If an axis drop zone already displays a statistic and
it is the statistic desired, do not drag a variable into the drop zone. Add a variable to a
zone only when the text in the zone is blue. If the text is black, the zone already contains a
variable or statistic. Refer to Statistics and Parameters for information about the available
statistics.
In building the charts, measurement level of variables is important. The Chart Builder sets defaults
based on the measurement level while building the chart. Furthermore, the resulting chart may also
look different for different measurement levels. The user can temporarily change a variable's
measurement level by right-clicking the variable and choosing an option.
5) If the user needs to change statistics or modify attributes of the axes or legends (such as the
scale range), click Element Properties. In the “Edit Properties Of” list, select the item that needs
to be changed, make the changes, and click Apply.
6) Click OK to create and display the chart in the Viewer.
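Chart Builder specifications can also be pasted as GGRAPH/GPL command syntax. A rough sketch of a simple bar chart of counts by sex follows (the variable name hv104 is assumed, and the syntax pasted from Chart Builder will differ in detail):

```
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=hv104 COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s = userSource(id("graphdataset"))
  DATA: hv104 = col(source(s), name("hv104"), unit.category())
  DATA: COUNT = col(source(s), name("COUNT"))
  GUIDE: axis(dim(1), label("Sex of household member"))
  GUIDE: axis(dim(2), label("Count"))
  ELEMENT: interval(position(hv104*COUNT))
END GPL.
```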
Notes: (a) If it is necessary to add more variables to the chart (for example, for clustering or paneling), click the
Groups/Point ID tab in the Chart Builder dialog box and select one or more options. Then drag categorical
variables to the new drop zones that appear on the canvas.
(b) To transpose the chart (for example, to make the bars horizontal), click the Basic Elements tab and then
click Transpose.
(c) If many default settings for a specific chart are changed often, the current settings can be saved as a
favourite and reused later. Please refer to the PASW manuals for detailed instructions.
(d) The canvas is the area of the Chart Builder dialog box where the chart is built.
(e) An axis set defines one or more axes in a particular coordinate space (like 2-D rectangular or 1-D polar).
Adding a gallery item to the canvas automatically creates an axis set. Each axis includes an axis drop zone
for dragging and dropping variables. Blue text indicates that the zone still requires a variable. Every chart
requires adding a variable to the x-axis drop zone.
(f) The graphic elements are the items in the chart that represent data. These are the bars, points, lines, and
so on. In the illustration, the graphic element is a bar.
(g) The variable list displays the available variables. If a variable selected in this list is categorical, the
category list shows the defined categories for the variable. A variable's measurement level can be
changed temporarily by right-clicking its name and choosing desired measurement level.
(h) Drop zones are the areas on the canvas onto which a variable is dragged and dropped from the Variables
list. The basic drop zone is the axis drop zone. Certain gallery charts (like clustered or stacked bar charts)
include grouping drop zones. The illustration shows a grouping zone that contains Sex as the grouping variable.
After clicking on the OK button, the following chart will be placed in the Viewer.
Canvas
Variable List
Statistics in axis
drop zone
Variable in
grouping zone
1
3 2
4
5
Category List
6
To generate a bar chart of the “percentage of male and female head of household in each district”,
first, click Element Properties button on the Chart Builder window and follow the steps below:
1) In the “Element Properties” window, change the desired statistics to “Percentage()”;
2) Click Set Parameters button;
3) Select “Total for Each X-Axis Category” as the denominator for computing percentage in
the set parameters drop-down list;
4) Click Continue; and
5) Click Apply to activate changes
And, finally, click OK button on the Chart Builder window to get the following graph.
1
3
4
5
2
(c) Using Graphboard Visualization to Create Customized Graphs
Creating a graph from the “Graphboard Template Chooser”
This is a new feature in PASW Statistics 17. Through this command (located in the “Graphs” menu),
graphs can be created from ready-made templates called “Graphboard Visualizations”, which
cover graphs, charts, and plots. PASW Statistics ships with built-in visualization templates
covering 23 different types of graphs, which are sufficient for general users. A separate product,
PASW Viz Designer, is available for creating custom visualization templates.
To use the built-in templates, select “Graphboard Template Chooser” in the “Graphs” menu and follow
these steps:
1) In the “Graphboard Template Chooser” window, click basic tab to start selecting
appropriate variable(s);
2) Click the variable name(s) to be used in the graph (holding the Ctrl key from the second
variable onward). Here, PASW lists only the variable names, not the labels. As soon as a variable is
selected, all graph types which are suitable for the selected variable will be
displayed in the right pane of the window. Similarly, if two variables are selected, the possible
types for those two variables will be displayed;
3) Double-click the icon of the preferred graph type from the displayed samples;
4) Optionally, click:
(a) Detailed tab to change chart type, variables, etc.;
(b) Titles tab to set chart title, sub-title and footnote; and
(c) Options tab to set output label and other options.
5) Click OK to start creating the preferred graph.
It should be noted that creating graphs through the “Graphboard Template Chooser” requires more
resources, such as processing time, a faster processor, and more memory. Moreover, graphs
created through this option are difficult to edit.
(d) Graphs through Legacy Dialogs
Graphs can also be created from the "legacy dialogs". Almost all graph types are available, and the view
(title, sub-title and so on) can be customized while creating the graph through this option.
The following exhibits show the types of graphs available under “Legacy Dialogs” and the
population pyramid of sample household population created through the legacy dialogs.
1
2
Different types of
charts available in
“Legacy Dialogs”
The following dialog shows how to generate a population pyramid from the sample household
population by age and sex.
The pyramid produced by the above settings is shown below:
3
4
5
6
7
8
Drawing a Population Pyramid:
1) Select “Legacy Dialogs” in
“Graphs” menu;
2) Click “Population Pyramid”;
3) Drag “Age of household
members” and drop in “Show
Distribution over” box;
4) Drag “Sex of household
members” and drop in “Split
by” box;
5) Click “Titles…” button;
6) Type in “Population Pyramid of
Sample Households” in Title
Line 1;
7) Click “Continue”; and
8) Click OK on the “Define
Population Pyramid” dialog.
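Legacy charts can likewise be produced with the GRAPH command. A minimal sketch of a simple bar chart (the variable name is assumed from the BDHS dataset):

```
* Simple bar chart of counts by sex through the legacy GRAPH command.
GRAPH
  /BAR(SIMPLE)=COUNT BY hv104.
```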
3.4 Saving and Exporting Outputs
Starting from PASW Statistics 16, outputs are saved only in Viewer format (*.spv). The PASW
Viewer no longer supports output files of earlier versions in the proprietary file format (*.spo).
From the PASW Viewer, outputs can be selected, copied and pasted into any spreadsheet software, word
processor or graphical presentation software.
Outputs in the Viewer can also be exported to different formats such as: Excel (*.xls); HTML
(*.htm); Portable Document Format (*.pdf); Power Point (*.ppt); different text formats (*.txt) such
as plain text, UTF8 and UTF16; and Word/RTF (*.doc). Moreover, graphical outputs can be saved
into such formats as: Bitmap (*.bmp); Enhanced Meta File (*.emf); Encapsulated Postscript
(*.eps); JPEG file (*.jpg); Portable Network Graphic (*.png); and Tagged Image File (*.tif).
In exporting outputs, one can select:
i) all items, including hidden and non-selected items;
ii) visible (non-hidden) items only; or
iii) just selected items.
For exporting multiple items, one can select different items by clicking the item while pressing
control key, and follow the steps as described in the following example.
For exporting PASW outputs to MS Excel,
1) Select the item(s) to export on the left pane of the PASW Statistics Viewer;
Selected
Output
Tables
2) Click Export in File Menu and an “Export Output” window will appear;
In the “Export Output” window:
3) Check “Selected” option button to export only selected output items (tables, notes,
summaries, …);
4) Select “Excel file (*.xls)” from the “File Type:” dropdown;
5) Click Browse button and select the location of the export file and file name or type in the
file name with full path, e.g., “C:\Documents and Settings\User\My Documents\SPSS
Training\Sample\Test-exporting.xls”;
6) Click OK to begin the export process;
3
4
5
6
At the end of exporting process, the exported file can be seen in the designated folder.
To export only the graphics, without any notes, tables, etc., select “None (Graphics only)” while
choosing the Document Type in Step 4 (the last item in the drop-down list). The Graphics
section of the “Export Output” window will then become active and the Document section inactive (that
is, the user can no longer set any options other than the document type). In this case, users can
select the graphic format (together with graphic options) and the root file name under which to save
the graphics. If the root file name is “test.png” and there are 3 charts in the active Viewer, three
graphic files will be created with the names “test1.png”, “test2.png”, and “test3.png”.
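From PASW Statistics 17, exporting can also be scripted with the OUTPUT EXPORT command. A sketch, with an illustrative file path:

```
* Export the visible Viewer items to an Excel file.
OUTPUT EXPORT
  /CONTENTS EXPORT=VISIBLE
  /XLS DOCUMENTFILE='C:\SPSS Training\Sample\Test-exporting.xls'.
```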
3.5 Online Help
PASW Statistics provides a comprehensive help system, together with tutorials, for every key aspect.
Context-sensitive help topics in dialog boxes can guide the user through each specific task. A help window
will pop up whenever the help key “F1” is pressed: it shows the base system help while working
with the data editor or output viewer, or the command syntax guide of the nearest command while in the
syntax editor. Similarly, various types of PASW help can be accessed through the “Help” menu.
The first item and the most important for the beginners under the Help menu is the item “Topics”.
“Topics” provides access to the basic PASW Help system with Contents, Index, and Search tabs,
from which users can find the explanation of specific topic or command procedure.
The second item, “Tutorial” illustrates step-by-step instructions on how to use many of the basic
features. Users can choose the topics required to grasp, skip around and view topics in any order.
The index or table of contents can be used to find specific topics. “Case studies”, the third item,
provides hands-on examples of how to create various types of statistical analyses and how to
interpret the results. The sample data files used in the examples are provided in the PASW package.
Table of contents of the tutorial can be observed in the following illustration.
The “Statistics Coach”, using a wizard-like approach, helps find the commands or procedures
needed. After a series of selections, the Statistics Coach opens the dialog box for the
statistical, reporting, or charting procedure that meets the selected criteria. It provides access to most
statistical and reporting procedures, and several charting procedures, in the Base system.
The above-mentioned help items are useful for all users, from beginners to advanced developers.
Apart from those, more help topics such as the “Command Syntax Reference” and “Statistical
Algorithms” are available interactively for advanced users, and the “Developer Central” and
“Technical Support Website” for on-line users.
Like in other modern software, PASW provides “Context-sensitive Help” in several places in the
user interface as:
1) Most dialog boxes have a Help button that takes you directly to a Help topic for that dialog box.
The Help topic provides general information and links to related topics.
2) Right-click terms in an activated Pivot Table in the Viewer and choose “What's This?” from
the context menu to display definitions of the terms.
3) In a command syntax window, position the cursor anywhere within a syntax block of a
command and press F1 on the keyboard. A complete command syntax chart for that
command will be displayed. Complete command syntax documentation is available from
the links in the list of related topics and from the Help Contents tab.
Select any place in the Command Line and Click <F1>
4. USING DATA FROM OTHER SOURCES
Generally, PASW Statistics can read data files created in:
all versions of PASW Statistics (*.sav) and SPSS/PC+ (*.sys) formats;
spreadsheets (EXCEL, Lotus and SYLK);
database tables (dBase, MS Access, FoxPro, Oracle, SQL Server, etc.);
statistical software (SAS, SYSTAT, and Stata); and
different text formats (fixed width, comma delimited/ CSV, tab or space delimited, etc.).
Data files created by spreadsheets and other statistical software can be opened directly as PASW data
files. Similarly, PASW can open dBase files, text data files and other files without converting the
files to an intermediate format or entering data definition information. On the other hand, complex
database files, such as MS Access, FoxPro and SQL databases, can be accessed through the
database wizard or SQL queries.
Opening a data file makes it the active dataset. The active dataset is the one from which PASW
reads and to which it writes during the session, unless a specific command switches to another dataset. If
one or more other data files (or datasets) are open, they remain open and available for subsequent
use in the session. Clicking anywhere in the Data Editor window of an open data file makes it
the active dataset.
A PASW data file can be saved (or exported) to other file types. However, some file types can
save only data values, whereas PASW keeps both values and the data dictionary (or attributes). The data
dictionary attributes, such as variable labels, value labels, missing values, etc., will be lost if the file is
saved in other formats, including Microsoft Excel format.
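Saving a PASW data file to another format can be done with the SAVE TRANSLATE command; a sketch with a hypothetical path (note that only data values and variable names survive in the Excel file):

```
* Export data values to Excel; labels and missing-value codes are lost.
SAVE TRANSLATE OUTFILE='C:\SPSS Training\Sample\household.xls'
  /TYPE=XLS
  /VERSION=8
  /FIELDNAMES
  /REPLACE.
```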
4.1 Importing Data from Microsoft Excel
Importing data from Microsoft Excel is the easiest among these data sources. First, arrange the
spreadsheet in tabular format, fulfilling the following six recommendations:
i) Names of the variables are on the first row of the data range;
ii) Variable names comply with PASW Statistics naming rules3;
iii) For all numeric variables, there are no blanks in the second row of the data range;
iv) The data range is continuous – no blank rows or columns;
v) The worksheet is clear of any graphs, labels, and extra text or data; and
vi) Unnecessary worksheets (those not being imported) are deleted.
3 Starting from Version 12.0, the following rules apply to variable names:
1. must be unique; duplication is not allowed, and names cannot contain spaces;
2. up to 64 characters in English;
3. start with a letter or @, #, or $, followed by letters, numbers, periods (.), and non-punctuation characters;
4. a name starting with “#” denotes a scratch variable, which can be created only with command syntax;
5. a name starting with a $ sign denotes a system variable, and is not allowed for a user-defined variable;
6. the period, underscore, and the characters $, #, and @ can be used within variable names, e.g. “A._$@#1”;
7. shall not end with a period or an underscore;
8. reserved keywords are not allowed: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, and WITH;
9. a mixture of uppercase and lowercase characters is allowed, and case is preserved for display purposes; and
10. long names wrap in output, breaking at underscores, periods, and where the case changes from lower to upper.
In general, PASW Statistics can read datasets created by almost all popular statistical software and databases. A PASW dataset is also possible to save in several popular formats. Therefore, PASW data format (*.sav) is the common format in sharing/distributing survey datasets.
If the data in the Excel file are spread over several worksheets, it is better to create a new Excel file
with just one worksheet containing all the necessary data, including variable names.
Then, follow the steps:
On the main menu click:-
1. File;
2. Open;
3. Data;
An “Open Data” pop-up window will then appear. In this window:-
4. Change Files of type to “Excel (*.xls, *.xlsx, *.xlsm)”;
5. Select the folder containing Excel data file from Look-in box;
6. Select the correct Excel data file (in 97-2003 or 2007 format); and
7. Click Open;
The “Opening Excel Data Source” pop-up window will appear, and on that window:-
8. Clear the check box next to “Read variable names from the first row”, if and
only if the first row of the Excel data sheet does not have variable names;
9. Select the worksheet containing data, if the file has more than one worksheet;
10. Type in the range of data to be imported (for example, A1:V100 for the first 99
cases, i.e. 100 rows including the row of variable names); and
11. Click OK.
If the Excel data file was prepared following the six recommendations mentioned above, steps 8 to 10
can be skipped, since there is only one sheet in the Excel file, the data range is continuous, and
there are no extra cells or objects in the sheet other than the data to be analyzed.
At the end of this process, the data from the Excel file have been transferred into a PASW dataset. At this
point, it is important to save the current PASW dataset with an appropriate name in a designated place.
Data files in Excel or text format, and databases, do not carry a data dictionary, that is, no information
on data attributes such as variable labels, value labels, missing values, etc. Therefore, it is important
to define such attributes for all variables, and then save the data file again.
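The same Excel import can also be performed with PASW command syntax instead of the menus. The following is a minimal sketch; the file path, sheet name and cell range are examples only and must be adjusted to the actual file:

```
* Import an Excel worksheet; file path, sheet name and cell range are examples only.
GET DATA
  /TYPE=XLSX
  /FILE='C:\PASW Training\Sample\Data.xlsx'
  /SHEET=NAME 'Sheet1'
  /CELLRANGE=RANGE 'A1:V100'
  /READNAMES=ON.    /* The first row contains variable names. */
EXECUTE.
* Save the imported data as a PASW dataset.
SAVE OUTFILE='C:\PASW Training\Sample\Data.sav'.
```

Running the syntax from a Syntax window produces the same result as steps 1 to 11 above.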
4.2 Importing Data from Delimited ASCII Text Files
When requesting data from other agencies and departments, data are sometimes provided in text or
ASCII file format. Normally, data in an ASCII file are arranged either in fixed-width format, that is,
a given variable is placed in the same location for every case, or separated by a specific character such
as a tab, space, comma, semicolon or any other character that is unique throughout the file and
does not appear in the data values.
To import data from a delimited text file, first review the file in a text editor such as Notepad or
Word and check which character is used as the delimiter (normally a tab, space, comma or semicolon).
Then, follow the steps:
On main menu click:-
1. File;
2. Read Text Data;
Then, the “Open Data” pop-up window will appear with the text (*.txt) file type.
In Open Data window:-
3. Select the folder containing text data file from Look-in box;
4. Change “File of type” to “All Files (*.*)”;
5. Select the correct data file (*.txt, *.dat, *.csv, *.prn, etc.); and
6. Click Open;
A “Text Import Wizard” will then begin automatically and guide you through the importing process.
Note: Sometimes, text data files have file extensions other than “.txt” and “.dat”, such as “.prn” or “.csv”.
If the “Read Text Data” menu item is chosen, PASW will display only files with the extensions “.txt” and
“.dat”. To find a text data file with another extension, choose “All Files (*.*)” in the “Files of type”
field to display all files.
The wizard contains the following 6 steps:
Step 1/6: Click Next to proceed to Step 2 of 6;
In Step 1 of the Wizard, one can apply a predefined format (previously saved from
the Text Wizard) or follow the steps.
Step 2/6: (i) The Wizard will detect whether the data are arranged as “Delimited” or
“Fixed width”, but check that it has identified the structure correctly (in the
“Data.csv” file, the variables are separated by a comma “,”, so the file
structure is delimited); and
(ii) indicate whether the variable names are included at the top (first line) of the data
file (in this example, “Yes”), and click Next to proceed to Step 3 of 6.
Step 3/6: (i) Since the data file begins with variable names, the first case of data begins on
line 2. Otherwise, the user should identify the line number on which the data begin.
(ii) If one line represents one case (one person, for example), just click Next;
otherwise, select the second option under “How are your cases represented?” and
specify the number of variables per case before clicking Next.
Step 4/6: The Wizard will automatically identify the delimiter(s) between variables. However,
it is important to check and specify them correctly. Some software exports text in
quotes, i.e. expressed as “text” or ‘text’; in that case, the text qualifier (or quotation
mark) must be specified with the radio buttons of the second question. Then click Next.
The first line contains the variable names!
Step 5/6: In this step, variable names and data formats can be specified or changed from the
default settings. Then click “Next” to continue or “Finish” to start importing the data.
Step 6/6: In this step, just click “Finish” to start importing; the task will complete in a few
minutes.
Sometimes the Wizard may identify the
wrong delimiters. Users must check and
set the correct delimiter(s).
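These delimited-text import steps can also be expressed in PASW command syntax. The sketch below assumes a hypothetical comma-delimited file with quoted text values and variable names in the first line; only a few illustrative variables are listed:

```
* Import a comma-delimited text file; file path and variable list are examples only.
GET DATA
  /TYPE=TXT
  /FILE='C:\PASW Training\Sample\Data.csv'
  /ARRANGEMENT=DELIMITED
  /DELIMITERS=","
  /QUALIFIER='"'      /* Text values are enclosed in double quotes. */
  /FIRSTCASE=2        /* Line 1 holds variable names, so data start at line 2. */
  /VARIABLES=
    HV104 F1.0
    HV105 F2.0
    HV106 F1.0.
EXECUTE.
```

The /DELIMITERS and /QUALIFIER values must match what was observed when reviewing the file in a text editor.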
4.3 Importing Data from Fixed Width Text Files
In some text data files, variables are aligned in fixed-width columns. That is, a variable occupies the
same column(s) throughout the data file. For example, the sex of the household member is located in
column 33 of every line in the “Data(Fix).txt” file, which was extracted from the Bangladesh DHS 2007.
To import data from a text file with a fixed-width data structure, it is important to have the data
dictionary of the variables, that is, which variable is located in which column(s). After that:
On main menu click:-
1. File;
2. Read Text Data;
Then, the “Open Data” pop-up window will appear with the text (*.txt) file type, and
3. Select the folder containing text data file from Look-in box;
4. Select the correct data file (*.txt or *.dat); and
5. Click Open;
Then, the “Text Import Wizard” will begin automatically and guide through the importing process.
The wizard contains the following 6 steps:
Step 1/6: Simply click Next to proceed to Step 2 of 6;
Step 2/6: (i) The Wizard will detect whether the data are arranged as “Delimited” or
“Fixed width”, but check that it has identified the structure correctly
(“Data(Fix).txt” contains no separation character, so the file structure is
fixed width); and
(ii) indicate whether the variable names are included at the top (first line) of the data
file (in this example, “No”), and click Next to proceed to Step 3.
There is no need to change “Files of type”,
since the file name extension is “.txt”.
Step 3/6: Since there is no variable-name line, the first case of data begins on line
number 1. Sometimes a case spans more than one line; users then have to identify
the number of lines per case. Otherwise, just click “Next” to continue.
Step 4/6: This is the most crucial step in importing a fixed-width data file. Use the data dictionary
to identify and split each case into variables accordingly. In this example, one line of
data represents one case, and the locations of the variables are as follows:
Variable number Column Variable Name Variable Label
1 1-8 HV005 Sample weight
2 9-10 HV009 Number of household members
3 11-12 HV024 Division
4 13 HV025 Type of place of residence
5 14 HV026 Place of residence
6 15-16 HV218 Line number of head of household
7 17 HV219 Sex of head of household
8 18-19 HV220 Age of head of household
9 20 HV270 Wealth index
10 21-28 HV271 Wealth index factor score (5 decimals)
11 29-30 HV101 Relationship to head
12 31 HV102 Usual resident
13 32 HV103 Slept last night
14 33 HV104 Sex of household member
15 34-35 HV105 Age of household members
16 36 HV106 Highest educational level
17 37-38 HV107 Highest year of education
18 39-40 HV108 Education in single years
19 41 HV109 Educational attainment
20 42 HV110 Member still in school
21 43 SH08 Marital status
22 44 SH15 Employment status
First line does NOT contain variable names!
The Wizard will insert separation lines, or break lines, wherever the structure is explicit
(for example, if a column consistently contains blanks across the lines, the Wizard will
insert a break line there). A break line can be inserted or deleted with the “Column
number” input box below the data view. For example, to insert a break line at column 13,
type 13 in the “Column number” input box and press the “Insert Break” button. Similarly,
to delete a break located at column 28, just type 28 and click the “Delete Break”
button. In this step, the user has to check and identify all break lines to obtain a correct
data import.
After defining the locations, click Next to proceed to Step 5.
Step 5/6: In this step, one can click “Finish” to start importing the data with default variable
names (V1, V2, …, Vn) and data formats (columns containing only numbers will be
numeric; the remainder will be string). Alternatively, the user can enter variable
names and formats individually.
Step 6/6: Simply click “Finish” to start the text data importing task.
In the Text Import Wizard, the user can save the format (including break lines and
variable names) for future use.
It takes just a few minutes to import the text data into the PASW Statistics Data Editor. It is
strongly recommended to check and edit (or create) variable attributes such as variable labels, value
labels, missing values, etc. It is important to define such attributes for all variables, and then save
the data file again.
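A fixed-width file like this can also be read with the classic DATA LIST command, using the column locations from the data dictionary above. The sketch below assumes a hypothetical file path and shows only a few of the variables from the table:

```
* Read a fixed-width text file using the column locations from the data dictionary.
* The file path is an example; only the first few variables are shown.
DATA LIST FILE='C:\PASW Training\Sample\Data(Fix).txt' FIXED RECORDS=1
  /HV005 1-8   HV009 9-10   HV024 11-12  HV025 13  HV026 14
   HV104 33    HV105 34-35  HV106 36.
EXECUTE.
```

Each variable name is followed by its start and end columns; a single number means the variable occupies one column.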
4.4 Importing Data from Microsoft Access Databases
Data from databases that use Open Database Connectivity (ODBC) drivers can be read
directly by PASW Statistics if the respective drivers are installed on the computer. Commonly
used ODBC drivers are provided with the PASW installation package. Microsoft Access is among
the most widely used database systems, and a step-by-step guide to extracting data from MS
Access is presented in this section. The same steps, with minor variations, can be followed
to import data from databases created on other platforms.
Before importing data from an MS Access database, check whether the database contains a table in
flat-file format (like a worksheet) with all the variables to be imported. If the data to be imported
are spread over several database tables, that is, if the variables are located in different tables, it
is better to first create a single table containing all the variables in MS Access before importing.
To begin, click the following on the main menu:-
1. File;
2. Open Database; and
3. New Query;
A “Database Wizard” window will appear for identifying the ODBC data source. All available ODBC
data sources will be listed in the right pane; click the one that matches the database to be
imported. If there is no appropriate source, a new driver for that particular source must be
installed or added before importing data from that database.
Normally, “MS Access Database” is in the list, and:
4. Select MS Access Database from ODBC Data Sources; and
5. Click Next to continue.
The first time, an “ODBC Driver Login” window will appear. If it is not the first time
this import procedure has been run, the Wizard may skip this step.
6. Click Browse to browse the folders and files, and select the correct database file to
open; and
7. Click OK to open that database file.
At this point, the user can also set up a new link.
Then, the “Database Wizard” window will come up with two panes: “Available Tables” on
the left and “Retrieve Fields in This Order” on the right.
8. Click a table name to expand it and double-click the field name(s) to select them, or
double-click the table name to select all variables in that table; and
9. Click “Finish” to start importing all cases (53,413 cases) from the database.
It is important to save the data file after the import process.
10. Alternatively, one can click “Next” to go to another step where the cases to import
can be selected based on some criteria (filtering). The following example shows
how to import the cases where the age of the household member is between 6 and 15 years.
Here, only 12,621 cases will be imported instead of the 53,413 cases in the entire
database. It is important to save the data file at the end of the import process.
To import selected fields (variables), click here to expand and double-click the desired field names.
To import all variables, just double-click the table name.
11. Again, press “Next” to redefine variable names and to auto-recode string
variables before pressing “Finish” to start importing.
Although all variables are imported, PASW Statistics assigns F8.2 (floating-point format; a total of
8 digits including 2 decimal places) to all numeric variables, and A255 (alphanumeric format; up
to 255 characters) to string variables. Therefore, it is important to realign the formats of all variables,
and also to set column widths for appropriate display. Moreover, it is recommended to recode string
variables for easier analysis. The following section explains how to refine imported datasets.
If there are several tables in the source database file, one can link them through identification fields
and import variables from different tables (see the online tutorial on PASW Data Manipulation).
However, it is more convenient to link the tables and create a single table with all the required
variables in MS Access (or in the original database software) before importing into PASW Statistics.
HV105 is “Age of household member”, and
the criterion is “Age > 5 and Age < 15”, i.e.
“5 < Age < 15”.
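The same filtered import can be expressed in PASW command syntax with an ODBC connection string and an SQL query. The DSN, file path, table name and filter below are examples only:

```
* Import from an MS Access database via ODBC.
* The DSN, file path, table name and filter are examples only.
GET DATA
  /TYPE=ODBC
  /CONNECT='DSN=MS Access Database;DBQ=C:\PASW Training\Sample\Survey.mdb;'
  /SQL='SELECT * FROM Members WHERE HV105 > 5 AND HV105 < 15'.
EXECUTE.
SAVE OUTFILE='C:\PASW Training\Sample\Members.sav'.
```

The /SQL subcommand carries out the filtering on the database side, so only the selected cases are transferred.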
5. TIPS AND EXERCISES
5.1 Tips: Do and Don’t
i) Do… check whether any previous version of PASW Statistics, SPSS for
Windows or SPSS/PC+ has already been installed on the computer.
Don’t… install any version of PASW Statistics without checking for an existing
working installation of PASW Statistics.
ii) Do… check whether any installed PASW Statistics is a licensed version.
Don’t… uninstall any licensed version of PASW Statistics before ensuring that
the license can be transferred to the new PASW software.
iii) Do… uninstall existing PASW Statistics, SPSS for Windows or SPSS/PC+ if
the new software has a valid license, or if you have decided to use the
evaluation version, which is allowed for 14 or 21 days.
Don’t… install a new version of PASW Statistics before completing the
un-installation process.
iv) Do… study and familiarize yourself with the PASW Statistics components and
the survey files, including the data, questionnaire and codebook, before
conducting any analysis.
Don’t… change anything in the dataset! Also, do not start analyzing a new
dataset before understanding the questionnaire and codebook of the
survey.
v) Do… familiarize yourself with the data file, especially if it is in a format other
than PASW. For text data files: review the file in a text editor such as
Word, Notepad, etc.; check whether the first line comprises variable
names; and check which separation character (blank, comma, tab, etc.)
has been used.
Don’t… save the original data file after reviewing it in MS Word or any text
viewer, to avoid altering its format and characters.
5.2 Self-evaluation
Are you able to explain to your colleagues the background of some popular statistical analysis software packages? Very well / Somewhat well / Not so much / Almost none
Do you understand why SPSS / PASW was chosen as the statistical software for assisting EFA monitoring? Very well / Somewhat well / Not so much / Almost none
Can you install the evaluation version of PASW Statistics without any assistance? Certainly / Somewhat certain / Not so much / Not at all
Can you explain to your friends the following basic components of PASW:
o Output Viewer: Very well / Somewhat well / Not so much / Almost none
o Pivot Tables: Very well / Somewhat well / Not so much / Almost none
o Charts: Very well / Somewhat well / Not so much / Almost none
o Export Outputs: Very well / Somewhat well / Not so much / Almost none
o Online Help: Very well / Somewhat well / Not so much / Almost none
Are you confident that you can import data from the following sources to PASW:
o Microsoft Excel: Confident / Somewhat confident / Not so much / Not at all
o Delimited text files: Confident / Somewhat confident / Not so much / Not at all
o Fixed-width text files: Confident / Somewhat confident / Not so much / Not at all
o Access databases: Confident / Somewhat confident / Not so much / Not at all
5.3 Questions and Hands-on Exercises
i) Provide three reasons why PASW Statistics is appropriate for analyzing
census and household survey data for assisting EFA monitoring.
ii) What are the key components of PASW Statistics?
iii) Open the “B2_a.txt” file in any text editor and record on a blank sheet (a) how many
variables are in this file, and (b) which separation character has been used.
iv) Import the “B2_a.txt” file into the PASW Data Editor and review the characteristics of the new dataset.
v) Connect to the internet and
(a) find available household survey data files for your country;
(b) download the most recent survey data file;
(c) find and review the questionnaire and codebook for that survey;
(d) note down the variables which are useful to calculate education indicators,
especially for EFA monitoring, and
(e) prepare for importing data, if it is needed.
Module B3:
Checking, Editing and Preparing Household Survey Data for Analysis
Contents:
1. Metadata Preparation
1.1 Defining Data: Setting Variable Properties
1.2 Setting and Editing Metadata through Wizard
1.3 Copying File and Variable Properties
2. Data Manipulation
2.1 Changing, Inserting and Deleting Data, Cases and Variables
2.2 Computing New Variables
2.3 Recoding
3. Data Preparation
3.1 Selecting Cases
3.2 Sorting Cases
3.3 Rearranging Variables
4. Data Validation
4.1 Validation with Single-Variable Rules
4.2 Cross-Variable Rules
4.3 Multi-Case Rules
5. Tips and Exercises
5.1 Tips: Do and Don’t
5.2 Self-evaluation
5.3 Hands-on Exercises
Purpose and learning outcomes:
To gain knowledge on defining data and checking data quality with PASW
To understand basic techniques of data validation
To understand how to prepare datasets for conducting effective data analyses
1. METADATA PREPARATION
One of the most famous computer and ICT terms is GIGO, “garbage in, garbage out”. It simply
indicates that if the dataset under analysis is prone to errors, the outputs generated from that dataset
are unreliable or unusable. Therefore, after loading a dataset, keep in mind that it is not yet ready
for producing analytical outputs. The PASW Data Editor can display the contents of the data, but
cannot guarantee their quality.
To conduct meaningful analyses, it is also important to understand the data collection procedure,
the questionnaire and coding rules, and how the dataset was prepared and distributed. Moreover,
only if the data in the set are defined properly can the data analyst understand them correctly and
conduct meaningful data analyses.
Therefore, logical steps after loading dataset include:
Metadata preparation:
Defining data
This step is required when data were imported from other formats such as Excel, text or
databases. While importing data from those formats, only the data values with variable names,
and at most the defined missing values, will be in the new PASW dataset. In this case, data
management should begin with defining the data: providing appropriate variable labels and
value labels, and setting missing values and the measurement level for each and every variable.
Editing data definition
Work on any PASW dataset should begin with reviewing the variables in the dataset and determining
their valid values, labels, and measurement levels. Identify combinations of variable values that
are impossible but commonly miscoded. Define validation rules based on this information.
This is a time-consuming task, but worthwhile to ensure the quality of the data.
Data preparation:
Even if the active dataset is reliable (clean, or of good quality), it may not fit perfectly with
the type of analyses to be performed. The active dataset may require manipulations such as
sorting, aggregation, creation of new variables, conditional selection of cases, and
sometimes merging of datasets.
Data validation:
Run basic checks, and checks against defined validation rules, to identify invalid cases,
variables, and data values. When invalid data are found, investigate and correct the cause. If
it is impossible to correct them, determine whether to omit the entire case, or to include the case
but set the invalid values as missing or as a special category.
Once the dataset is clean and well prepared, it is ready to be analyzed with the PASW modules. The
following sections highlight the tools provided in the PASW base system for metadata preparation,
data preparation, and data validation.
This section emphasizes metadata preparation, while data manipulation, preparation and
validation are discussed in Sections 2, 3 and 4 respectively.
1.1 Defining Data: Setting Variable Properties
When obtaining data from other sources, such as Excel, text, or an Access database, only the variable
name, format (numeric or string, width and decimal places) and data values are imported. A few
more properties, such as missing values, may be assigned while importing from databases;
however, there will be no description of the variable (variable label) or of the meaning of the data
values (value labels), especially when codes, instead of texts or words, were imported from the
source. Examples follow.
In the above dataset, the variable “HV104 (Sex of household member)” has values 1 or 2 only.
However, users cannot know what 1 and 2 stand for, since 1 could stand for "Male" or "Female"
depending on the coding scheme.
Therefore, it is impossible to answer a simple question, “how many household members are
female?”, from the above frequency table created by PASW Statistics.
In PASW, the metadata or data dictionary is part of the dataset. It covers such properties as the variable label, value labels, formats, and measurement level: scale, ordinal or nominal.
Similarly, from the above frequency table of HV106, no one could know:
“What is HV106?”
“What do the valid values 0, 1, 2, 3, 8 and 9 stand for?” and
“Why do the codes jump to 8 after 3, and where are 4, 5, 6, and 7?”
To answer such questions, the next step after importing data or opening an existing data file is to
specify, or check and edit, the variable label, value labels, missing values and measurement level for
each and every variable in the dataset. For entering variable labels, value labels and missing values,
the codebook, or the survey questionnaire if the codes are printed on it, is essential.
To define a variable label, just click the appropriate cell and type it in directly, as follows.
Again, to define the value labels, select “Variable View” in the PASW Statistics Data Editor.
Then, follow the steps below:
1. Click the cell under “Values” and “Value Labels” window will pop-up;
2. Type the code in “Value” box;
3. Type the appropriate label in “Label” box;
4. Press “Add” button and the value and its label will appear in the space below;
5. Repeat Steps 2, 3 and 4 until all value labels have been defined, and press “OK” after
entering the last valid code, to complete the definition of the value labels.
Note: Starting from version 17.0, PASW Statistics allows spell-checking of value labels (click the
“Spelling” tab). Similarly, users can define “missing values” by clicking the cell under “Missing”
and following a procedure similar to that for defining value labels.
Running the same analysis (frequencies) on the variable “HV106” after defining the variable label,
value labels and missing values will produce the following output, which is easier to understand and
ready to be placed in a report or presentation.
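The same dictionary information can be defined with command syntax instead of the Variable View. The sketch below uses the standard DHS coding for HV106 as an illustration; the exact labels must be taken from the survey codebook:

```
* Define the dictionary information for HV106.
* The labels below are illustrative; follow the survey codebook.
VARIABLE LABELS HV106 'Highest educational level'.
VALUE LABELS HV106
  0 'No education'
  1 'Primary'
  2 'Secondary'
  3 'Higher'
  8 "Don't know"
  9 'Missing'.
MISSING VALUES HV106 (8, 9).
VARIABLE LEVEL HV106 (ORDINAL).
```

Syntax of this kind can be saved and re-run whenever the data are re-imported, which avoids repeating the manual definition.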
Within the Variable View, all properties (or definitions) of the variables (name, type, measurement
level, etc.) can be added, changed or removed as required. By default, PASW automatically assigns
the measurement level of imported variables as “scale” for numeric variables and “nominal” for
string variables. This is insufficient for some advanced analyses, and thus the measurement level of
the variables must be checked and changed. For example, the measurement level of the variable
“HV106” can be changed from nominal to ordinal, which is more suitable for that variable.
Click here to define value labels!
Repeat until all value labels have been added.
1.2 Setting and Editing Metadata through Wizard
The metadata wizard can also be applied to imported data files, instead of setting the properties manually as
described in the previous section.
Steps in this procedure are:
1. Click “Data” on main menu bar; and
2. Select “Define Variable Properties…”.
The “Define Variable Properties” window will pop up and let you choose the variables to be
defined. For demonstration purposes, select just two variables, HV219 “Sex of head of
household” and HV104 “Sex of household member”, in the following example.
3. Click the variable name(s) to select the variable(s) to be defined;
4. Double-click, or click to move, the variable name to the right “Variables to Scan”
pane; repeat Steps 3 and 4 until all required variables have been placed in the right pane;
5. After selecting all variables, click “Continue” to start scanning the variables.
A new “Define Variable Properties” window will appear and show the scanned results by
variable. In this window, one can set:
(i) Variable label (type into blank spaces provided),
(ii) Data type (select from the dropdown), width and decimal places (type-in), and
(iii) Measurement level (select from the dropdown).
After completing for the variable HV219, select HV104 and follow the same procedure
described in steps (i), (ii) and (iii). Then,
6. Complete “Setting variable properties” by clicking “OK”.
PASW provides a wizard-like method of setting variable properties for new variables, and also for checking and editing the variable properties of existing variables in a dataset.
Alternatively, after setting the properties for HV219, they can be copied to HV104, since both variables
have the same nature and use the same codes: 1=Male and 2=Female (i.e. the same value labels).
To copy variable properties, except variable label, from HV219 to HV104:
(a) Press “To Other Variables...” button.
Then, in the “Apply Labels and Level to” window:
(b) Select the variable HV104; and
(c) Click “Copy” to copy the variable properties.
All properties of the variable HV219, except the variable label, are copied to HV104. Thus,
(d) Type in the variable label for HV104, and click “OK” to complete the process.
It should be noted that copying variable properties can be applied only among the
variables scanned during the same session.
Type in the variable label and value labels.
Click and select the measurement level and variable type.
Type in the variable label for HV104.
The dataset will then appear in the Variable View as follows:
Variable properties should be set for all variables in the dataset for easier
understanding and effective analyses.
Tip:
Sometimes the source data file contains data in text format for some variables, such as “male” or
“female” instead of 1 and 0. In this case, it is essential to code such variables for easier analysis.
PASW Statistics provides automatic coding through the AUTORECODE command. For detailed
information on the AUTORECODE command, please refer to the “Base User Guide” for PASW
Statistics 17.0.
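A minimal AUTORECODE sketch follows; the variable names are examples only:

```
* Convert a string variable (e.g. 'male'/'female') into numeric codes.
* The variable names sex_text and sex_code are examples only.
AUTORECODE VARIABLES=sex_text
  /INTO sex_code
  /PRINT.
```

AUTORECODE assigns numeric codes in alphabetical order of the string values and copies the original string values as value labels of the new numeric variable; /PRINT lists the resulting coding scheme in the output.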
1.3 Copying File and Variable Properties
“Copy Data Properties” in the “Data” menu provides the ability to use an external PASW
Statistics data file as a template for defining file and variable properties in the active dataset.
Similarly, the properties of variables in the active dataset can also be copied to other variables in the
same dataset.
The “Copy Data Properties” wizard allows:
• Copy selected file properties from an external data file or open dataset to the active dataset. File
properties include: documents, file labels, multiple response sets, variable sets, and weighting.
• Copy selected variable properties from an external data file or open dataset to matching
variables in the active dataset. Variable properties include: value labels, missing values, level of
measurement, variable labels, print and write formats, alignment, and column width used in the
Data Editor.
• Copy selected variable properties from one variable in (i) an external data file, (ii) open dataset,
or (iii) the active dataset to many variables in the active dataset.
• Create new variables in the active dataset based on selected variables in an external data file or
open dataset.
When copying data properties, the following general rules apply:
• If an external data file is used as the source, it must be in PASW Statistics format;
• Undefined (empty) properties in the source dataset do not overwrite defined properties in the
target dataset; and
• Variable properties are copied from the source variable only to target variables of a matching
type: string (alphanumeric) or numeric (including numeric, date, and currency).
Variable properties can be copied from the source file to matching variables in the active dataset.
Variables "match" if both the variable name and type (string or numeric) are the same. For string
variables, the defined length must also be the same.
Moreover, variables that are not in the active dataset can be created using the properties of
selected variables in the source file. To do this, the source list must be updated to display all
variables in the source data file. If you select source variables that do not exist in the active dataset
(based on variable name), new variables will be created in the active dataset with the variable
names and properties from the source data file.
If the active dataset contains no variables (a blank, new dataset), all variables in the source data file
are displayed, and new variables based on the selected source variables are automatically created in
the active dataset. This is the easiest way to create a new dataset (like an Excel worksheet) for direct
data entry; the dataset can also be shared, without data, as an electronic codebook.
To copy data file properties and variable properties, which may be required after importing from
other file formats, first select the “Variable View” of the Data Editor and follow the steps below:
1. Click “Data” on main menu bar; and
2. Select “Copy Data Properties…” and “Copy Data Properties” wizard will appear;
3. Click the “Browse” button in the bottom right area and select the PASW data file
to be used as the source of the properties;
OR type in the file name with its full path, for example,
“C:\PASW Training\Sample\Data1.sav”
Copying variable and file properties from a well-defined data file to another data file is an easy task in PASW Statistics.
4. Then, click “Next” to proceed to the Step 2 of the Wizard;
The Wizard will scan both the source and target datasets, and display the “matching” variables
from the source file in the left pane and from the active dataset in the right pane. The number of
selected variables is displayed at the bottom of each list.
5. Click “Finish” to copy with the default settings, or “Next” to change the settings;
The following settings can be changed in Steps 3 and 4 of the Wizard.
If the Wizard is followed step by step, a summary of what will be copied is displayed in
Step 5. After the “Finish” button is pressed, whether at the end of Step 2, 3, 4 or 5, the active dataset
will have the selected properties from the source PASW data file.
Alternatively, properties can be copied from an open dataset, if more than one dataset is open.
Just select “An open dataset” as the “Source of the properties” in Step 1, and follow the same steps.
Here, new variables from the source dataset will be added to the active dataset if “Create matching
variables in the active dataset if they do not already exist” is ticked. All variables (press <Ctrl>+A)
or only some variables (click variable names while holding the <Ctrl> key) can be selected from
the source list. In this case, the bottom of the active dataset’s list will display both
(i) the matching variables, i.e. 12 in this example, and (ii) the variables to be created, 10 in this example.
Newly inserted variables: no valid data here!
In the above example, 10 new variables will be added to the active dataset, with the same variable
names and properties, by copying the properties of all variables from the source dataset. It should be
noted that the data values are not copied to the active dataset.
PASW Statistics also allows copying variable properties from one variable to another in the same
dataset. For example, in the sample dataset, two variables, sex of head of household (HV219) and
sex of household member (HV104), share the same codes “1=Male” and “2=Female”, with 9 as
the missing value. If the codes have been entered and the missing value has been identified for the
head of household (HV219), those properties can be copied to the household member variable (HV104).
To do this, select the third option under “Choose the source of the properties”, which is “The active
dataset”, in Step 1 of the Wizard. Then click a source variable, and then click the target
variable(s). As usual, the user must hold the <Ctrl> key while clicking additional variable names. After
selecting all target variables, just click “Finish” to begin the copying process.
With this option, the user must type in appropriate variable labels for the target variables.
2. DATA MANIPULATION
Preparing for data analysis
The following two steps are essential, after setting variable properties, for an appropriate and
productive data analysis:
(1) list the prospective outputs and lay out suitable analytical methods for each; and
(2) check which outputs can be generated directly from the existing datasets, and which outputs
may require further manipulations such as sorting; calculation/creation of new variables
(temporary or permanent); transformation (coding, grouping, etc.); and creation of new
datasets (aggregation, subsetting and merging the existing datasets).
PASW allows data transformations ranging from as simple as collapsing categories for analysis, to
more advanced tasks, such as creating new variables based on complex equations and conditional
statements. In this chapter some important techniques of data manipulation and transformation will
be discussed.
Surveys can provide very rich information. However, most survey datasets are not yet ready for analysis or for producing the output tables needed to construct EFA monitoring indicators.
Example:
The working dataset contains data extracted from a household survey with personal
records of all household members with the variables: age, sex, schooling status, and the
class/grade currently attending. The requirement is to produce the "age-specific
enrolment rate (ASER) for the children aged 6 to 14 by sex" from this dataset. It is impossible to
compute ASER directly from the working dataset since:
(a) total number of children aged 6 to 14 by single year of age by sex (which is
denominator); and
(b) number of children aged 6 to 14 who are currently attending school by single year
of age by sex (which is numerator), are not available in the current dataset.
For this task, it requires the following Steps:
(a) Extracting the cases for aged 6-14 only;
(b) Counting of all children, irrespective of schooling or not, by age and sex, for
denominator;
(c) Counting of children who are currently attending school by age and sex, for
numerator; and
(d) Calculation of ASER by age and sex.
Step (a) can be carried out with the "case selection" command, the "aggregate" command is
suitable for Steps (b) and (c), and the "compute" command creates the new variable, ASER, in
Step (d).
2.1 Changing, Inserting and Deleting Data, Cases and Variables
Changing the identification (or properties) of a variable:
To change the properties of a variable, for example its name, select the cell containing the variable
name in "Variable View" and type in a new, appropriate name. All variable properties can be
changed in this way in "Variable View". Caution is needed when changing variable types: if a
string variable is changed to numeric, all alphanumeric data values become missing values (".");
and only blanks (zero-length strings) are obtained if the variable is changed back to string type
later. Similar losses can occur with some other type changes as well.
If data values were to change, select “Data view”, locate the cell and type in the new value, one cell
after another, as in a spreadsheet program.
Adding variables or cases to an existing dataset:
For example, a variable, education level “EdLevel”, should be added to have better understanding
of educational attainment of all household members. To add a new variable, select “Variable View”
and right-click the row number where the new variable should be inserted, then choose "Insert
Variable". PASW Statistics inserts the new variable before the existing variable on that row, with a
default name such as "Var00001", "Var00002", "Var00003", and so on. A newly created variable
is numeric with the F8.2 format (8 digits, 2 decimal places) and has no variable label or value
labels. The user can input or import the variable attributes, as presented in the section above,
including variable name, type, width and decimal places, variable label and measurement level. As
and where applicable, value labels should also be defined.
In the PASW Statistics Data Editor, it is simple to change the value of a specific cell, or the properties of a variable, such as its name, type, label, value labels and measurement scale.
A new variable can also be inserted in "Data View": click the name of the existing variable before
which the new variable should be inserted, right-click and choose "Insert Variable", then go to
"Variable View" to change its properties. Conversely, clicking a variable name (in Data View) or a
row number (in Variable View) and pressing the "Delete" key deletes that variable.
Inserting cases can be carried out only on “Data View”. Select the row (or several rows
continuously) where to insert new case(s), right-click and select “Insert Cases”. Similarly, select
case(s) and press “Delete” key will delete the selected cases. Alternatively, you can use the Clear
command in the Edit menu.
2.2 Computing New Variables
Creation of new variables from existing variables is a common and essential task in data analysis.
Example:
In many annual school censuses, the total service of primary school teachers is recorded in months
for better accuracy. However, summarizing it, or relating it to other variables, requires years.
A new variable, "service in years", must therefore be computed as "service in months" divided by 12.
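In PASW syntax this computation is a one-liner; the variable names below are illustrative, not taken from the sample dataset:

```
COMPUTE ServiceYears = ServiceMonths / 12 .
EXECUTE .
```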
Case study:
The sample dataset, extracted from the "Bangladesh Demographic and Health Survey 2007", contains the
highest education level (HV106) and highest year of education (HV107) for all household
members. However, educational attainment is not recorded in the usual "Grade" or "Grade-level" form, that is,
"Primary 2" or "Secondary 4", etc. To study the highest grade-level attended by adult household
members (aged 15 and above), a new variable "Grade" must be calculated from the two existing
variables as:
Grade = HV106 * 10 + HV107, for HV106 = 0, 1, 2, 3 and HV107 is not 98; and
Grade = Missing, if HV106 = 8 (Don't know) or HV107 = 98 (Don't know).
To calculate the new variable “Grade”, the “Compute Variable” command is available under
“Transform” menu in the Data Editor. To create a new variable:
1. Click “Transform” on main menu bar; and
2. Click “Compute Variable” item and “Compute Variable” window will appear.
3. Fill-in “Target Variable” name, and optionally, the type and label of new variable
can also be set by clicking the button under target variable name;
Use Compute to get values for a variable, an existing one or a newly created one, based on numeric transformations of other variables.
(In this example, the computation is applied only to cases that are not "unknown" on either
education variable and that meet the age condition, aged 15 and above.)
4. Set the numeric expression using the existing variables together with numbers, PASW
Statistics built-in functions, and operators such as +, -, >, <, etc.;
5. To include only the cases that meet certain criteria, press the "If…" button
located at the lower left corner of the window and fill in the conditions; and
6. Click “OK” to complete the task.
A new variable, "Grade", is added to the current dataset at the end of the variable list. Although
a new variable name was provided here, the result variable of the "Compute" command can also take
an existing variable name. After creating a new variable, it is important to define it thoroughly by
setting its labels, missing values and measurement level.
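The dialog steps above can be pasted as syntax roughly like the following sketch. The conditions mirror those stated in the case study; cases that fail the IF condition are left system-missing, which covers the "Don't know" rule:

```
* Compute Grade only for valid education codes and members aged 15+.
IF (HV105 >= 15 AND HV106 <= 3 AND HV107 <> 98) Grade = HV106 * 10 + HV107 .
EXECUTE .
VARIABLE LABELS Grade 'Highest grade-level attended' .
```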
2.3 Recoding
Recoding is a common task in data preparation. Sometimes the values (or categories or codes) of a
nominal or ordinal variable require regrouping for further analysis. For example, grouping the
single-year population into school-going age groups is essential for calculating education indicators.
Sometimes data entered in text format, for example area names, should be changed into numeric
values for ease of analysis. These tasks can be carried out by the following PASW commands:
1. Automatic Recode;
2. Recode into Same Variables; and
3. Recode into Different Variables.
Automatic Recode
Automatic recoding is useful for string variables with a limited number of distinct values, for example male or
female; urban, suburban, rural or remote. When the existing categorization of a variable is no
longer needed after recoding, the "Recode into Same Variables" option can be selected; otherwise select
"Recode into Different Variables" to keep the original variable.
To perform automatic recoding:
1. Click “Transform” on main menu bar; and
2. Click “Automatic Recode”, and a new window will appear;
3. Select one variable and send to the area under “Variable New Name”;
4. Type appropriate name for the recoded variable in “New Name” box;
5. Click “Add New Name” button; Repeat Steps 3, 4, and 5 for all variables to recode.
6. Select whether to recode starting from the “Lowest value” or “Highest value”;
7. Select whether to “use the same recoding scheme for all (selected) variables”, and
whether to “treat string values as user-missing” or not; and
8. Click “OK” to complete the task.
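Pasted as syntax, the settings above correspond to an AUTORECODE command along the following lines. The source variable names here are assumed for illustration; ticking "Use the same recoding scheme for all (selected) variables" adds /GROUP, and recoding from the highest value adds /DESCENDING:

```
AUTORECODE VARIABLES=DivName SesText
  /INTO Division SES
  /BLANK=MISSING
  /PRINT .
```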
RECODE changes, rearranges, or consolidates the values of an existing variable. RECODE can be executed on a value-by-value basis or for a range of values.
Then, two new variables “Division” and “SES” will be added to the current dataset with the
following coding schemes (codes and value labels).
In some cases more than one variable shares the same values; for example, "Sex of head
of household (HV219)" and "Sex of household member (HV104)" must have only two valid values,
"Male" and "Female". Similarly, several variables may take just "Yes", "No" and a non-response or
missing value; "Usual resident (HV102)", "Slept last night (HV103)" and "Member
still in school (HV110)" are such variables in the sample dataset. To recode such a group of variables,
just tick the checkbox "Use the same recoding scheme for all (selected) variables" in Step 7.
The following exhibit shows the automatic recoding of two variables, HV103 and HV102.
"Automatic Recode" is simple, and useful for exploring a newly imported file or for beginners.
Recode into Different Variables
"Recode into Different Variables" is the most useful recoding procedure for general users.
In this procedure users can select all the recode options, and both the old and new variables are
kept in the dataset. Before manual recoding, it is important to inspect the frequency distribution
of the variable under study. The variable "Highest education level (HV106)" will be used as an
example in this section. The frequency table for HV106 is as follows:
Here, 6 different items: "9", "DK", "Higher", "No education, preschool", "Primary" and "Secondary"
are listed as valid values of the variable. According to the codebook of the DHS Survey, "9"
represents the missing value and "DK" represents "Do not know". Since the variable under study
is educational attainment, it is valid only for those aged 6 and above. Thus, it is logical to code
as follows for the population (household members) aged 6 and above:
0 = No education, preschool
1 = Primary
2 = Secondary
3 = Higher
8 = DK, and
9 = (system) missing value.
To do this,
1. Click “Transform” on main menu bar; and
2. Click “Recode into Different Variables” and a new window will appear;
3. Select the variable "Highest education level (HV106)" and send it to the area
"Input Variable -> Output Variable:";
4. Input a new “Name” and appropriate variable “Label” for the output variable, and
click “Change” button to set new variable name and label;
5. Click “Old and New Values” button and a new window will appear for setting;
In “Old and New Values” window:
(i) Type in the old value (or a range), e.g. “Primary”;
(ii) Type in new value, e.g. “1”; and
(iii) Press “Add” button to add transformation rule into the process;
(iv) Repeat above steps for all pairs of values and click “Continue” to complete
selection and return to main recode window;
6. Click “If…” button and a new window will appear for case selection setting;
In “If Cases” window:
(a) Select “Include if case satisfies condition:” button;
(b) Construct (or type in) the condition, e.g. “HV105 > 5”; and
(c) Click “Continue” to return to main recode window; and
7. Click "OK" on the "Recode into Different Variables" window to complete the task.
After creating a new variable with recode command, all necessary properties must be set to the new
variable, such as variable format (type, width and decimal places), value labels, missing values, etc.
The new variable can be observed as follows:
Similar steps are carried out for "Recode into Same Variables".
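Pasted as syntax, the recode above looks roughly like the sketch below. HV106 is treated here as a string variable, as listed in the frequency table, and the string values must match the data exactly; declaring 8 and 9 as missing follows the coding scheme given earlier:

```
DO IF (HV105 > 5) .
RECODE HV106
  ('No education, preschool'=0) ('Primary'=1) ('Secondary'=2)
  ('Higher'=3) ('DK'=8) ('9'=9) INTO EdLevel .
END IF .
EXECUTE .
MISSING VALUES EdLevel (8, 9) .
```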
Note that the new variable initially has no value labels; set its width and decimal places, and define
appropriate labels. Cases with age (HV105) below 6 remain system-missing on EdLevel, while cases
aged 6 and above receive the recoded codes.
Visual Binning
PASW Statistics also provides “Visual Binning” under “Transform” menu to perform automatic
creation of new variables based on grouping contiguous values of existing variables into a limited
number of distinct categories. Visual Binning can assist to:
• Create categorical variables from continuous scale variables. For example, a scale variable
“age” to create a new categorical variable that contains 5-year age groups.
• Collapse a large number of ordinal categories into a smaller set of categories. For example,
collapse the twenty 5-year age groups into 5 groups: 0-19, 20-39, 40-59, 60-79, and 80+.
To conduct visual binning, first select a scale variable (HV105 Age of household members) and
follow the steps below:
1. Click “Transform” on main menu bar; and
2. Click “Visual Binning” and a new window will appear;
3. In the “Visual Binning” window:
(i) select the scale variable(s) to bin and move those variables into “Variables
to Bin” pane; and
(ii) click the "Continue" button when the selection is complete;
PASW Statistics will analyze the selected variables, and present a graphical distribution of
the variable after binning in the new “Visual Binning” window. Here,
4. Input an appropriate “name” for the binned variable;
5. Input variable “label” for the binned variable; and
6. Click on the “Make Cutpoints…” button to define cutting points for the binning;
and “Make Cutpoints” window will appear to set cutpoints;
Cut points can be constructed based on three options: (i) equal width intervals;
(ii) equal percentiles based on scanned cases; and (iii) cutpoints at mean and selected
standard deviations (1 or 2 or 3 SD) based on scanned cases.
Generally, making cutpoints with equal width intervals is more common and
suitable in analyzing household surveys on education.
In the “Make Cutpoints” window:
7. Input “4” as first cutpoint location since the first age group of common 5-year
interval is 0-4;
8. Input “5” as the Width (or class interval), and the “number of cutpoints” will be
filled automatically, 19 in this example;
9. Click “Apply” and Visual Binning window will appear with set intervals.
Then, in the main “Visual Binning” window:
10. Click “Make Labels” button to generate value labels automatically and the user can
change labels as appropriate; and
11. Finally, click “OK” to create a new binned variable called “Age”.
As usual, properties of the new binned variable must be checked and changed as necessary.
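Behind the scenes, Visual Binning generates RECODE syntax. As a sketch, an equivalent 5-year grouping can also be produced arithmetically; this is an alternative to the wizard, not the syntax it pastes, and the remaining value labels follow the same pattern:

```
* Each 5-year band 0-4, 5-9, 10-14, ... maps to codes 1, 2, 3, ... .
COMPUTE Age = TRUNC(HV105 / 5) + 1 .
EXECUTE .
VALUE LABELS Age 1 '0-4' 2 '5-9' 3 '10-14' 4 '15-19' .
```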
The frequency table of the variable "Age" is as follows:
3. DATA PREPARATION
After checking and editing of dataset, setting the variable properties, and recoding as necessary, the
dataset is ready to start preparation for data analyses.
Before making any analysis:
(1) list the prospective outputs and lay out suitable analytical methods for each; and
(2) check which outputs could be generated directly from the existing datasets, and which may
require further manipulations such as sorting; calculation/creation of new variables (temporary
and/or permanent); transformation (coding, grouping, etc.); and creation of new datasets
(aggregation, subsetting and merging the existing data sets).
PASW Statistics allows data transformations ranging from as simple as collapsing categories for
analysis, to more advanced tasks, such as creating new variables based on complex equations and
conditional statements. In this chapter some important techniques of data manipulation and
transformation will be discussed.
For effective data analysis, users must prepare the dataset efficiently. The most frequently used data preparation techniques include sorting and selecting cases.
Example:
The working dataset contains data extracted from a household survey, with personal records of all
household members. The variables include age, sex, schooling status, and the class/grade
currently attending; and the requirement is to produce the "age-specific enrolment rate (ASER) for
the children aged 6 to 14 by sex". In this situation it is impossible to compute ASER directly
from the working dataset, since the analyst needs a dataset with:
(a) total number of children aged 6 to 14 by single year of age by sex [which is denominator];
(b) number of children aged 6 to 14 who are currently attending school by single year of age by
sex [which is numerator], before computing age-specific enrolment rate, ASER.
In this situation, it requires:
(a) selection of cases (extracts cases of aged 6-14);
(b) aggregation of personal data to obtain grouped data by age and sex, that is, counting all
children, irrespective of whether they are in school or not, and the children who are currently
attending school, by age and sex; and
(c) calculation of ASER by age and sex.
[Note: The calculation is much easier and simpler if “Custom Tables” option is installed.]
3.1 Selecting Cases
Selection of cases is essential whenever a specific subset of the data must be analyzed based on set
criteria, for example, to study the percentage of "out-of-school girls aged 6-14". To do this:
1. Click “Data” on main menu bar; and
2. Click “Select Cases”, which is the second last item on the list. Then,
3. “Select Cases” window will appear and select “If condition is satisfied” and;
4. Click “If” button and a new window “Select Cases: If” will appear.
5. Construct selection statement using variables, operators and functions;
then, click “Continue”;
6. Select output option:
i. Filter out unselected cases;
ii. Copy selected cases to a new dataset (to provide the new dataset name); and
iii. Delete unselected cases;
7. Click “OK” button and a new Data Editor window will appear with selected cases.
Select Cases provides several methods for selecting a subgroup of cases based on criteria that include variables and complex expressions. Users can also select a random sample of cases.
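For the "out-of-school girls aged 6-14" example, the "Filter out unselected cases" option pastes filter syntax along these lines (a sketch; the out-of-school share is then read from a crosstab of age by the schooling variable):

```
USE ALL .
COMPUTE filter_$ = (HV104 = 2 AND HV105 >= 6 AND HV105 <= 14) .
FILTER BY filter_$ .
EXECUTE .
```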
There are three output options:
"Filter out unselected cases" - cross-signs (X) are put on the unselected cases, as the following
picture shows. The unselected cases will not be used in subsequent analyses; run Select Cases with
the "Select All Cases" option to restore the original dataset.
"Copy selected cases to a new dataset" - this creates a new dataset and leaves the current dataset intact.
Users can switch between the original dataset and the newly created dataset, or use both datasets
together through PASW syntax.
"Delete unselected cases" - this deletes all unselected cases from the current dataset. With this
option the original dataset cannot be recovered; it is therefore important to save the original dataset
beforehand, and the sub-dataset containing only the selected cases should also be saved under an
appropriate name as soon as the selection process is complete.
The following cross-tabulation provides the percentage of out-of-school girls aged 6-14 in single
year.
3.2 Sorting Cases
Cases can be sorted in ascending or descending order based on one or more variables in the dataset. In
the sample dataset, households can be sorted by wealth index to observe the characteristics of
households in similar wealth status. Moreover, some PASW Statistics commands require pre-sorted
dataset, for example “aggregate” command requires sorted dataset by the breaking variable(s).
Sorting can be carried out through “Sort Cases” command under “Data” menu as following:
1. Click “Data” on main menu bar; and
2. Click “Sort Cases”. Then, “Sort Cases” window will appear;
3. Select the first key variable and send to “Sort by” pane and set “Sort Order”;
Repeat Step 3 for all key variables in order of importance;
4. Click “OK” button to start sorting.
The following example sorts the current dataset by two variables: "Education in single years (HV108)"
in ascending order and "Age of head of household (HV220)" in descending order.
Sorted data:
SORT CASES reorders the sequence of cases in the active dataset based on the values of one or more variables. Cases can be sorted in ascending or descending order, or in combinations of ascending and descending order for different variables.
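The sort described above pastes as a single command:

```
SORT CASES BY HV108 (A) HV220 (D) .
```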
3.3 Rearranging Variables
Sometimes the original dataset does not present the variables in a convenient order; for example,
education-related variables may be spread across several locations. On other occasions, linked
variables are so far apart that their linkage cannot be observed visually. In such cases, the associated
or linked variables, or the variables under investigation, can be grouped into a new dataset or moved
to the top of the variable list.
Relocating Variables
To move a variable from its current position to a new one, just click the selected variable and
drag-and-drop it at the desired position in "Variable View" or "Data View". For example, to place
"Line number of head of household (HV218)" in the second position in the list:
1. Select the variable by clicking its row number (HV218 at row 6) in Variable View; and
2. Drag and drop it at the desired location (in this example, after the first variable in the list).
A thin red line shows the position where the dragged variable will land if dropped at that moment.
Relocating variables has no impact on the results of data analyses. However, it makes it easier to decide which variables to use for producing the required outputs.
Variable Sets
When a dataset contains many variables, it is recommended to define and use "Variable Sets".
Define Variable Sets under Utilities menu creates subsets of variables to display in the Data Editor
and variable lists in dialog boxes. Defined variable sets are saved with PASW format data files.
A variable set can be defined with any combination of numeric and string variables, and a variable
can belong to multiple sets. The order of variables in the set has no effect on the display order of
the variables in the Data Editor or variable lists in dialog boxes.
Two variable sets “Education” and “HH_Head” are defined in the following example with nine
variables in “Education” variable set and eight in the other with four common variables.
To create a variable set:
1. Click “Utilities” on main menu bar; and
2. Click “Define Variable Sets”, and a new window will appear;
3. In the "Define Variable Sets" window, first type in the set name following the PASW naming
convention (up to 64 bytes long; any characters, including blanks, are valid);
4. Select and put variables into the “Variables in Set” pane;
5. Click “Add Set” button to create the variable set;
Define as many sets as needed by repeating steps 3-5.
6. Click “Close” button to complete creation of variable sets.
It is strongly recommended to save the dataset with the new name after defining the variable sets.
In this example, the dataset is saved as “BDPR50FL2.sav”.
To use a variable set:
1. Click “Utilities” on main menu bar; and
2. Click “Use Variable Sets”, and a new window with the list of variable sets will appear;
The list of available variable sets includes all variable sets defined, plus two built-in sets:
(i) ALLVARIABLES: contains all variables in the data file, including new variables
created during a session;
(ii) NEWVARIABLES: contains only new variables created during the current session;
(iii) Education: the first user-defined variable set containing 9 variables; and
(iv) HH_Head: the second user-defined variable set, containing 8 variables.
3. In “Use Variable Sets” window, first, check the desired variable set(s) and uncheck all
others under “Select variable sets to apply”;
At least one variable set must be selected. If ALLVARIABLES is selected, any other
selected sets will not have any effect, since this set contains all variables. In this example,
“Education” variable set is selected.
4. Click “OK” to complete selection and the following new Data View will appear.
5. To get all variables back, click “Show All Variables” under Utilities menu.
4. DATA VALIDATION
Why is data validation required?
With rapidly expanding computing power and increasing storage capacity at reasonable cost, many
surveys in recent years have been designed to collect more items (resulting in more variables)
with better coverage (i.e., larger sample sizes and thus more cases in PASW). This creates a greater
workload for the data handlers: coding staff, entry clerks, and data editors. With time pressure to
complete the task on one hand, and inefficiencies in the training and recruitment of staff on the
other, the quality of the data passed from data manager to analyst is often in question. In some cases,
surveys were planned without a step to check the coding, let alone to verify the data entered.
Education data analysts typically obtain survey data concerning education from various sources,
and thus have no way to recheck the coding or data entry. It is therefore important to use validation
rules to check the validity and consistency of the data before using the dataset.
Validation rules
Generally, there are three types of rules in validating a dataset:
1. Single-variable rules
2. Cross-variable rules, and
3. Multi-case rules.
In PASW Statistics 17.0 these rules are not available in the base system; they are part of the
optional "Data Preparation" add-on module. However, the same tasks can be carried out with
common PASW Statistics commands, and this is easier if the user understands the PASW syntax
(programming) language.
The first two types, single-variable rules and cross-variable rules, require understanding “case
selection” which was discussed in the previous section. The third type of rules is more complicated
and it may need several steps of data manipulations such as creating temporary variables, matching,
aggregation and selection of cases.
PASW Statistics provides the procedure "Identify Duplicate Cases" in the "Data" menu to identify
duplicate cases in a data file, which is the most important part of the third type, multi-case rules.
This section introduces the simplest data validation procedures, which are nevertheless powerful in
pointing out improper or invalid cases and values.
Validate Data helps identify suspicious and invalid cases, variables, and data values in the active dataset.
4.1 Validation with Single-Variable Rules
These rules consist of a set of checks applied to a single variable. Typically, checks for out-of-range
or invalid values and for missing values fall into this category. For example, a value of 5 is entered for
"highest education level (HV106)" where the only valid codes are 0, 1, 2, 3 and 8; or values other
than 1 and 2 ("Male" and "Female") are entered in the variable "sex of household members (HV104)", etc.
Checking validation consists of three stages, followed by editing of the invalid cases. The first stage
in validating a variable is obtaining the valid values or ranges from the codebook; for example, the
valid values for HV104 (sex) are 1 and 2 only, so any other value is invalid.
The second stage is constructing a frequency table. If no invalid values appear in the frequency
table, the variable under observation is valid under the single-variable rule. If irrelevant values are
observed in the frequency table, for example "3" in a variable representing sex, it is necessary to
identify where these erroneous cases are. Thus the third stage of checking validation is using
"Select Cases" to split out and inspect the irrelevant cases.
To check the validity of “sex of household members (HV104)”, follow the steps:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”;
Those validation rules which check internal inconsistencies such as invalid values and cases within a variable are known as Single-Variable Rules.
3. Then click "Frequencies"; and
4. On “Frequencies” window, select the variable to study (HV104) and click “OK” to
construct frequency table.
In the above frequency table, 5 cases with the values 3, 4, and 5 are invalid. Therefore, it is
necessary to check which cases contain such invalid values by conducting the third stage: "case
selection" of the invalid cases.
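Stage two can also be run directly from syntax; the frequency table for HV104 is produced by:

```
FREQUENCIES VARIABLES=HV104
  /ORDER=ANALYSIS .
```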
To select invalid cases:
1. Click “Data” on main menu bar;
2. Click “Select cases”;
3. On “Select cases” window, check the option button “If condition is satisfied” and
click “If” button;
4. On “Select cases: If” window, type in criteria: “not (HV104=1 or HV104=2)” or
“~(HV104=1 | HV104=2)” and click “Continue”;
5. Check “Copy selected cases to new dataset” option button and provide the new
dataset name, e.g. “Invalid_Cases”; and
6. Click “OK” to execute the case selection command.
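The case selection above pastes roughly as the following syntax; the new dataset receives a copy of the data before the selection is applied:

```
DATASET COPY Invalid_Cases .
DATASET ACTIVATE Invalid_Cases .
SELECT IF NOT (HV104 = 1 OR HV104 = 2) .
EXECUTE .
```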
The resulting new dataset contains only the 5 invalid cases (shown after moving variable HV104 to
the second position for a better view):
In this case the user must decide, and act, whether to erase the entire case from the dataset, change
the invalid values to missing values, or check other sources that hold different values for these cases
and correct the invalid values in the current dataset.
4.2 Cross-Variable Rules
For cross-variable rules, users use cross-tabulations instead of frequency tables to determine whether
invalid cases exist, and apply a slightly different rule for the conditional selection of invalid cases.
In the sample dataset, "highest educational level (HV106)" shows no invalid cases when checked alone
using the frequency tables command. However, when cross-checked against "age of the household
members (HV105)", there are a few suspicious entries, as follows:
From the above cross tabulation of age and highest education level, one can easily judge that the 2 cases
of "age 4 in primary education" and the 1 case of "age 12 in higher education" are invalid. Moreover,
there are a few more cases which are not reliable (on the margin) at all education levels. There are a
few options for developing cross-variable validation rules:
Option 1 – to sift out all suspicious cases (both invalid and marginal ones):
i) with primary education at aged 5 or below (the official entrance age is 6),
ii) with secondary education at aged 10 or below (the official starting age is 6+5=11), and
iii) with higher education at aged 15 or below (the official starting age is 6+5+5=16).
Rules that check for inconsistencies in a variable through the values of other variables in the same case are called Cross-Variable Rules.
Option 2 – to review only the certainly invalid cases, one can use the following cross-variable rules with a grace
period (early entrance) of one year:
i) with primary education aged 4 or below (the official entrance age is 6, but 5 can be allowed),
ii) with secondary education aged 9 or below (the official starting age is 6+5=11), and
iii) with higher education aged 14 or below (the official starting age is 6+5+5=16).
Then, the “If” statements to be used in case selection are:
Option 1: (HV105 <= 5 and HV106 = 1) or (HV105 <= 10 and HV106 = 2) or (HV105 <= 15 and HV106 = 3)
Option 2: (HV105 < 5 and HV106 = 1) or (HV105 < 10 and HV106 = 2) or (HV105 < 15 and HV106 = 3)
The following outputs will be obtained after running the appropriate case selection procedures, as
presented in the previous section.
Option 1: Both invalid and marginal cases
Option 2: Only certainly invalid cases
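The two “If” rules can also be expressed outside PASW. A minimal Python sketch, assuming HV105 is age in years and HV106 is coded 1 = primary, 2 = secondary, 3 = higher (the sample values are hypothetical):

```python
# Cross-variable rules for age (HV105) vs. education level (HV106).
def option1(age, level):
    """Flag invalid and marginal cases (no grace period)."""
    return ((level == 1 and age <= 5) or
            (level == 2 and age <= 10) or
            (level == 3 and age <= 15))

def option2(age, level):
    """Flag only certainly invalid cases (one-year grace period)."""
    return ((level == 1 and age < 5) or
            (level == 2 and age < 10) or
            (level == 3 and age < 15))

members = [(4, 1), (5, 1), (12, 3), (16, 3), (11, 2)]
print([m for m in members if option1(*m)])  # invalid + marginal
print([m for m in members if option2(*m)])  # certainly invalid only
```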
4.3 Multi-Case Rules
The multi-case rules are defined by a procedure (a sequence of logical expressions) that flags invalid
cases. The most common and useful application of multi-case rules is checking whether there are
duplicates in the dataset: a household member entered twice or more, two heads in a single
household, two persons in the same household with the same personal ID, and so on.
PASW Statistics allows checking for duplicate cases and inspecting unusual cases. Follow the steps
below to check for duplicate cases:
1. Click “Data” on main menu bar; and
2. Select “Identify Duplicate Cases”. Then, a new window will appear;
3. Select the variables used to identify duplicate cases (or press Ctrl+A to select all and then
release the unnecessary variables) and send them to the space below “Define matching cases by:”;
4. Set the options:
(a) “Sort within matching group” - select the variable(s) from the remaining
ones in the list, as the key for sorting within the matching groups;
(b) “Sort” - if a key variable for sorting is selected, define the sort order;
(c) “Variables to create” – tick the check box if the user wants a frequency
table showing how many duplicates were detected, or wants the duplicate
cases pointed out; one can then also specify:
i. which is the primary case – the first or the last case among the duplicates?
ii. whether to count all duplicate cases sequentially or to count only non-
primary cases (the primary case is not considered a duplicate);
A user-defined rule that can be applied to a single variable or a combination of variables in a group of cases is a Multi-Case Rule.
(d) Tick “Move matching cases to the top” to review duplicates more easily; and
(e) Tick “Display frequencies for created variables” if required;
5. Click “OK” to proceed.
With the above set options, the result of checking duplicate cases is displayed in the following
frequency table:
The above frequency table shows that there are 6 duplicates among the 1,889 cases. All of them
may be identical (just one primary case, with a group of 7 cases that are the same in all variables), or
there may be 6 pairs of duplicates (6 primary cases with one duplicate each). The dataset must be
reviewed to understand the nature of the duplicates and how to deal with them.
The following exhibit shows the groups of duplicates displayed on top of the dataset.
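The duplicate check itself is simple to express in code. A Python sketch, assuming each case is reduced to a tuple of its matching variables and that, as in the PASW default, the last case in each matching group is treated as the primary case (sample values are hypothetical):

```python
from collections import defaultdict

def find_duplicates(cases):
    """Group row indices by identical matching-variable tuples;
    keep only groups that occur more than once."""
    groups = defaultdict(list)
    for i, case in enumerate(cases):
        groups[case].append(i)
    return {key: idxs for key, idxs in groups.items() if len(idxs) > 1}

def non_primary_indices(cases):
    """Indices of duplicates, excluding the last (primary) case per group."""
    return sorted(i for idxs in find_duplicates(cases).values()
                  for i in idxs[:-1])

# Hypothetical cases: (household ID, line number, sex)
cases = [("HH1", 1, "M"), ("HH1", 1, "M"), ("HH2", 1, "F"), ("HH1", 1, "M")]
print(find_duplicates(cases))
print(non_primary_indices(cases))  # all but the primary in each group
```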
After validation checks, the dataset should be edited as and where necessary. After data validation
and preparation, the next step is analyzing “clean data” using appropriate PASW procedures under
“Analyze” menu.
(Exhibit: six primary/duplicate pairs displayed at the top of the dataset; within each pair, the values of all variables are the same.)
5. TIPS AND EXERCISES
5.1 Tips: Do and Don’t
i) Do… request documents such as project proposals, questionnaire sets,
codebooks, fieldwork documents, and survey reports when approaching
agencies/departments for survey data;
Don’t… judge their usefulness on the spot, and do not leave behind any survey
documents and datasets that are available in the survey agencies/departments.
ii) Do… understand, check, and edit the metadata (a set of data that describes and
gives information about other data) before using a secondary dataset;
Don’t… leave any variable without proper definitions: variable label, value labels,
missing values, and measurement level (scale, ordinal, or nominal).
iii) Do… save the dataset under an appropriate filename whenever changes have been
made, and record properly what changes were made from the earlier version;
Don’t… save the current dataset under the original filename after making changes;
never replace the original data file with an edited one.
iv) Do… copy variable properties whenever available;
Don’t… leave them as they are after copying variable properties (check and edit as
necessary).
v) Do… define and use variable sets for ease of analysis, and subset new datasets
by selecting variables as well as cases;
Don’t… change variable type or measurement level without a sound understanding.
vi) Do… recode string variables into numeric codes using “automatic recode”, and use
“visual binning” for continuous variables (or numeric variables with many
different values) to reduce the number of items;
Don’t… recode into the same variable, since it is irreversible (the original
variable can easily be deleted once it is no longer needed).
vii) Do… validate data through single-variable and multiple-variable rules, and check
for duplicate cases before conducting any analysis;
Don’t… change values in the dataset based on imagination or self-imposed
assumptions. Always contact the primary data source for corrections, or
omit those cases if they are not many.
5.2 Self-evaluation
Do you understand how to set variable properties in PASW Statistics?
Very well / Somewhat well / Not so much / Almost none
Are you confident that you can do the following in an active dataset?
o Compute a new variable: Confident / Somewhat confident / Not so much / Not at all
o Recode into a different variable: Confident / Somewhat confident / Not so much / Not at all
o Select cases with girls under 15: Confident / Somewhat confident / Not so much / Not at all
o Sort cases by wealth index factor score and highest education attained: Confident / Somewhat confident / Not so much / Not at all
o Check erroneous values in a variable (validate with single/cross-variable rules): Confident / Somewhat confident / Not so much / Not at all
o Check for duplicate cases in the dataset: Confident / Somewhat confident / Not so much / Not at all
Do you understand visual binning? Very well / Somewhat well / Not so much / Almost none
5.3 Hands-on Exercises
1) Import the attached “data1(tab).dat” and define all variables appropriately.
2) From the dataset obtained from Exercise 1 above, recode all string variables.
3) Create single-variable rules to check the validity of three education related variables.
4) Create two multi-variable rules to check the validity of (i) current schooling status of
household members, and (ii) education in single year of household members.
5) Find duplicate cases from the current dataset and propose how to handle those cases.
Module B4:
Basic Data Analysis Techniques in PASW Statistics
Contents:
1. Reports
   1.1 Codebook
   1.2 Case Summaries: Listing Selected Cases
   1.3 OLAP Cubes (Online Analytical Processing Cubes)
2. Descriptive Statistics
   2.1 Frequencies
   2.2 Descriptives
   2.3 Explore
   2.4 Crosstabs
   2.5 Ratio Statistics
3. Tips and Exercises
   3.1 Tips: Do and Don’t
   3.2 Self-evaluation
   3.3 Hands-on Exercises
4. Annexe: Web Links for Further Study on SPSS/PASW Statistics
Purpose and learning outcomes:
To introduce basic data analysis techniques in PASW
To understand how to operate PASW to get the required outputs (tables and charts)
To know how to interpret PASW output
1. REPORTS
The first command under the ANALYZE menu is REPORT. The REPORT procedures can provide
all univariate statistics available in DESCRIPTIVES and the subpopulation means
available in MEANS. In addition, some statistics available in the report procedures, such as
computations involving aggregated statistics, are not directly accessible in any other command
procedures.
By default, REPORT provides a complete report format, but a variety of table elements can be
customized, including column widths, titles, footnotes, and spacing. Because it is flexible and the
output has so many components, it is often efficient to preview the report output using a small number
of cases until the format that best suits the needs is found, especially when listing individual cases.
The group of REPORT commands comprises Codebook, OLAP Cubes, and Summarize – the latter
containing “Case Summaries”, “Report Summaries in Rows” and “Report Summaries in Columns”.
Codebook
This procedure reports the dictionary information and summary statistics for all or specified
variables and multiple response sets in the active dataset.
Summarize procedure (or Case Summaries)
“Case summaries” produces subgroup statistics for variables within categories of one or
more grouping variables. All levels of the grouping variables are cross-tabulated, and summary
statistics for each variable across all categories are also displayed. The order in which the
statistics are displayed can be chosen, and the data values in each category can be listed
or suppressed. With large datasets, either all cases or only the first n cases can be listed.
Report Summaries in Rows
It produces reports in which different summary statistics are laid out in rows. Case listings
are also available, with or without summary statistics; and
Report Summaries in Columns
Produces summary reports in which different summary statistics appear in separate
columns.
OLAP Cubes (Online Analytical Processing Cubes)
It calculates totals, means, and other univariate statistics for continuous summary variables
within categories of one or more categorical grouping variables. A separate layer in the
table is created for each category of each grouping variable.
Procedures in the REPORT command group can provide all univariate statistics available in other procedures. In addition, computations involving aggregated statistics are directly accessible only in the REPORT procedures.
Among these, Codebook and OLAP Cubes are among the most essential procedures for education data analysts.
1.1 Codebook
The summary statistics produced by Codebook for nominal and ordinal variables and for multiple
response sets include counts and percents. For scale variables, the summary statistics include the mean,
standard deviation, and quartiles. As such, Codebook is very useful for preliminary analysis.
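The split Codebook makes between measurement levels can be sketched in Python: counts and percents for nominal/ordinal variables, and mean, standard deviation, and quartiles for scale variables. All values below are hypothetical, and Python's default quantile method may differ slightly from PASW's:

```python
import statistics

def nominal_summary(values):
    """Counts and percents per category, as Codebook reports for
    nominal and ordinal variables."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return {v: (c, round(100 * c / n, 1)) for v, c in counts.items()}

def scale_summary(values):
    """Mean, standard deviation, and quartiles, as Codebook reports
    for scale variables."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"mean": statistics.mean(values),
            "sd": statistics.stdev(values),
            "q1": q1, "median": median, "q3": q3}

print(nominal_summary(["urban", "rural", "rural", "rural"]))
print(scale_summary([2, 4, 4, 4, 5, 5, 7, 9]))
```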
To obtain a codebook of the current dataset:
1. Click “Analyze” on main menu bar;
2. Click “Reports”; and
3. Click again “Codebook”, and a new window will appear with complete variable list.
4. Select and send the variables to “Codebook Variables” pane;
Here, just three variables with different measurement scales are chosen. Then,
5. Click “OK” to proceed with the default settings for Output and Statistics.
Codebook reports such dictionary information as variable names, variable labels, value labels, and missing values. It also provides summary statistics for all or specified variables and multiple response sets in the active dataset.
The output table obtained by the above procedure for the first variable is:
Since the first selected variable, “HV009 – Number of household members”, is a scale variable, the
statistics produced for it are the mean, standard deviation, and three quartile values.
However, since “HV024 – Division” is nominal and “HV025 – Type of place of residence” is
ordinal, only the count and percentage of each valid value (category) are provided as
statistics for those two variables.
In the Codebook procedure, the measurement level of a variable can be changed temporarily by
right-clicking on the variable. The following exhibit shows changing the
measurement level of “HV270 – Wealth index” from “ordinal” to “scale”. Keep in mind that
changing from “ordinal” to “scale” is temporary and only applies within the Codebook procedure.
The following are the options available in the Codebook command at its default settings.
The following output table is the codebook for “HV009 – Number of household members” after:
(i) changing the measurement level to “Ordinal”; (ii) selecting “Measurement level” and “Weight status”;
and (iii) setting the statistics option to display only “Percent”.
1.2 Case Summaries: Listing Selected Cases
Case Summaries under Report is useful to filter and list the cases with specified characteristics.
For example, to list “20 out-of-school children aged 6-14 from the lowest socio-economic status
from the sample households” with their age, sex, highest education level, etc…
It should be noted that the dataset must be (A) limited only to the household members aged 6-14
who are out of school (use “Select Cases”), and (B) sorted in ascending order by “Wealth index
factor score” (use “Sort Cases”)before exercising the case summaries.
The following exhibits briefly explain the preparatory steps before executing “Case Summaries”.
After these preparatory tasks, the “PASW Statistics Data Editor” shows the
“Out_of_School_6_to_14” dataset with the selected cases sorted in ascending order of “HV271 –
Wealth index factor score”. The original sample dataset contains 53,413 cases in total, while the
filtered dataset contains only 974 cases: out-of-school children aged 6-14 only.
Occasionally, a listing of selected cases with a limited number of variables is required for validity (error) checking, reporting, printing, and presentation purposes.
Case Summaries can help with such tasks.
After completing data preparation work, follow the steps to execute “Case Summaries” command:
1. Click “Analyze” on main menu bar;
2. Click “Reports”; and
3. Click again on “Case Summaries”. Then, a new window will appear with the
complete list of variables in the current dataset.
4. In “Case Summaries” window, select the variables in desired sequence;
5. Set the number of cases to display in “Limit cases to first”, for example, 20;
6. Click “OK” button to create a case summary report.
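The whole select, sort, and limit sequence can be mirrored in a short Python sketch; the record fields and values below are hypothetical:

```python
# Hypothetical household-member records.
records = [
    {"case": 1, "age": 10, "in_school": False, "wealth": -97000},
    {"case": 2, "age": 20, "in_school": False, "wealth": -99000},
    {"case": 3, "age": 8,  "in_school": True,  "wealth": -98000},
    {"case": 4, "age": 12, "in_school": False, "wealth": -99500},
]

# (A) Select cases: out-of-school members aged 6-14 only.
selected = [r for r in records if 6 <= r["age"] <= 14 and not r["in_school"]]
# (B) Sort cases: ascending wealth index factor score.
selected.sort(key=lambda r: r["wealth"])
# Limit cases to first n (20 in the example above).
listing = selected[:20]
print([r["case"] for r in listing])
```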
The output table of the above procedure is as follows:
The following table was copied from the PASW Statistics Viewer and pasted directly into MS Word.
A few minor touches to the output layout were then applied in MS Word.
Listing of Out-of-school Children from Poorest Households (a)

Sex      #  Case    Division    Age  Relationship    Educational           Wealth index factor
            Number                   to head         attainment            score (5 decimals)
Male     1    1     Sylhet      14   Son/daughter    Incomplete primary    -102597
         2    4     Dhaka       12   Son/daughter    Incomplete primary     -97182
         3    8     Sylhet      12   Son/daughter    Incomplete primary     -95883
         4    9     Sylhet      12   Other relative  Complete primary       -95883
         5   10     Rajshahi    11   Son/daughter    Incomplete primary     -95868
         6   11     Dhaka       12   Son/daughter    Incomplete primary     -95747
         7   12     Barisal      9   Son/daughter    Incomplete primary     -95184
         8   14     Rajshahi    10   Son/daughter    Incomplete primary     -94601
         9   15     Barisal     12   Son/daughter    Incomplete primary     -94539
        10   17     Rajshahi    12   Son/daughter    Incomplete secondary   -94185
        11   19     Rajshahi    13   Son/daughter    Incomplete secondary   -93875
        12   20     Rajshahi    14   Son/daughter    Incomplete primary     -93649
Male Mean                       11.92                                     -95766.08
Female   1    2     Dhaka       10   Son/daughter    Incomplete primary     -97793
         2    3     Barisal      8   Son/daughter    Incomplete primary     -97330
         3    5     Dhaka       10   Son/daughter    Incomplete primary     -97182
         4    6     Barisal     13   Son/daughter    Incomplete primary     -96696
         5    7     Chittagong  14   Grandchild      Incomplete primary     -96592
         6   13     Dhaka       11   Son/daughter    Complete primary       -95028
         7   16     Dhaka       12   Son/daughter    Incomplete primary     -94331
         8   18     Dhaka       10   Son/daughter    Incomplete primary     -93976
Female Mean                     11.00                                     -96116.00
Total Mean                      11.55                                     -95906.05
a. Limited to first 20 cases.
The next table shows the same list of 20 out-of-school children, but by “Division”.
“Report Summaries in Rows” produces reports in which different summary statistics are laid out
in rows. Case listings are also available, with or without summary statistics. Similarly, “Report
Summaries in Columns” provides summary reports in which different summary statistics
appear in separate columns.
The outputs of both commands are in text format and cannot use pivot table techniques. Moreover,
all such outputs could also be created with the “Case Summaries” command described above.
The following table shows the summary statistics obtained from the “Case Summaries” command
without displaying individual cases. The variable selected for summary statistics is the number of years
effectively studied by a household member (“HV108 – Education in single years”). The report
provides the following statistics:
(i) number of cases;
(ii) mean years of study (average of HV108);
(iii) standard error of the mean; and
(iv) median years of study by:
a. sex,
b. residence, and
c. division, without listing individual cases.
Case Summaries: Education in single years

                             Male                          Female                        Total
Residence  Division      N   Mean  Std.Err.  Median    N   Mean  Std.Err.  Median    N   Mean  Std.Err.  Median
Urban Barisal 22 3.36 0.387 3.00 17 3.71 0.605 4.00 39 3.51 0.338 3.00
Chittagong 35 3.11 0.366 3.00 35 3.83 0.372 4.00 70 3.47 0.263 3.00
Dhaka 54 3.22 0.314 4.00 60 3.23 0.340 3.00 114 3.23 0.231 3.00
Khulna 22 2.95 0.419 3.50 16 4.19 0.467 5.00 38 3.47 0.324 4.00
Rajshahi 27 3.26 0.448 4.00 20 2.75 0.497 2.50 47 3.04 0.332 3.00
Sylhet 41 2.95 0.356 3.00 19 3.42 0.509 3.00 60 3.10 0.291 3.00
Total 201 3.14 0.153 3.00 167 3.46 0.184 3.00 368 3.29 0.118 3.00
Rural Barisal 54 2.98 0.296 3.00 30 3.67 0.344 4.00 84 3.23 0.228 4.00
Chittagong 63 2.90 0.230 3.00 45 4.02 0.352 4.00 108 3.37 0.205 3.50
Dhaka 71 2.73 0.262 2.00 58 3.72 0.268 4.00 129 3.18 0.192 3.00
Khulna 33 3.48 0.289 4.00 15 4.00 0.569 3.00 48 3.65 0.265 3.50
Rajshahi 54 3.09 0.280 3.50 32 3.41 0.401 3.00 86 3.21 0.230 3.00
Sylhet 99 3.24 0.208 3.00 52 3.62 0.350 4.00 151 3.37 0.182 3.00
Total 374 3.05 0.105 3.00 232 3.72 0.146 4.00 606 3.31 0.087 3.00
Total (Urban+ Rural)
Barisal 76 3.09 0.238 3.00 47 3.68 0.306 4.00 123 3.32 0.189 4.00
Chittagong 98 2.98 0.197 3.00 80 3.94 0.255 4.00 178 3.41 0.161 3.00
Dhaka 125 2.94 0.201 3.00 118 3.47 0.218 3.50 243 3.20 0.149 3.00
Khulna 55 3.27 0.241 4.00 31 4.10 0.360 4.00 86 3.57 0.205 4.00
Rajshahi 81 3.15 0.238 4.00 52 3.15 0.312 3.00 133 3.15 0.189 3.00
Sylhet 140 3.16 0.180 3.00 71 3.56 0.288 3.00 211 3.29 0.154 3.00
Total 575 3.08 0.087 3.00 399 3.61 0.115 4.00 974 3.30 0.070 3.00
Std.Err. Mean = Standard error of mean.
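Each cell group in the table (N, mean, standard error of the mean, median) follows from elementary formulas; a Python sketch with hypothetical HV108 values for one subgroup:

```python
import math
import statistics

def subgroup_stats(values):
    """N, mean, standard error of the mean (sd / sqrt(N)), and median."""
    n = len(values)
    return {"N": n,
            "mean": round(statistics.mean(values), 2),
            "se_mean": round(statistics.stdev(values) / math.sqrt(n), 3),
            "median": statistics.median(values)}

years = [0, 2, 3, 3, 4, 6]  # hypothetical years of study for one subgroup
print(subgroup_stats(years))
```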
1.3 OLAP Cubes (Online Analytical Processing Cubes)
OLAP Cubes creates a separate layer in the table for each category of every grouping variable. The summary
variables are quantitative (continuous variables measured on an interval or ratio scale), and the grouping
variables are categorical. The values of categorical variables can be numeric or string.
OLAP Cubes provides a wide variety of summary statistics such as: sum, number of cases, mean,
median, grouped median, standard error of the mean, minimum, maximum, range, variable value of the
first category of the grouping variable, variable value of the last category of the grouping variable,
standard deviation, variance, kurtosis, standard error of kurtosis, skewness, standard error of skewness,
percentage of total cases, percentage of total sum, percentage of total cases within grouping variables,
percentage of total sum within grouping variables, geometric mean, and harmonic mean.
Some of the optional subgroup statistics, such as the mean and standard deviation, are based on normal
theory and are appropriate for quantitative variables with symmetric distributions. OLAP Cubes uses
pivot table techniques, but with specific statistics and output options that cannot be obtained
from other procedures such as cross-tabulation.
Example: OLAP Cubes
Among the variables in the sample dataset, only “HV108 – Education in single years” is an
education-related continuous (interval or ratio scale) variable. Since a continuous variable must
be selected as the “Summary” variable, HV108 is used in this example. The following exhibits
demonstrate how “OLAP Cubes” helps in exploring the “average number of study years of the
adult household members” by four grouping variables: Sex, Age Group, Residence, and Division.
Before using the “OLAP Cubes” procedure, only the adult household members (aged 15 and above)
must be selected using “Select Cases”.
The OLAP Cubes procedure can produce a variety of summary statistics for summary variables within categories of one or more grouping variables.
(Exhibit: preparing the dataset for analyzing adults only, using variables HV105, GAge, HV025, HV024, and HV108.)
After selecting only adults:
1. Click “Analyze” on main menu bar;
2. Click “Reports”; and
3. Click “OLAP Cubes”, and a new window will appear with complete variable list.
4. In “OLAP Cubes” window, select Summary and Grouping variables as planned;
5. Click “Statistics” to set the desired summary statistics:
a. By default, six summary statistics are selected (these can be left as they are);
b. Users can double-click any unselected statistic to select it, and vice versa;
c. Click “Continue” when the selection of summary statistics is complete;
6. Click “Differences” button to compute absolute or percentage differences for all
measures selected in the Statistics dialog box. This step is optional.
The “Differences” dialog box allows calculating percentage and absolute differences:
“Differences between Variables” calculates differences between pairs of
variables. At least two summary variables must be selected before specifying
differences between variables.
“Differences between Groups of Cases” calculates differences between pairs of
groups defined by a grouping variable. One or more grouping variables must be
selected in the main dialog box before specifying differences between groups.
The differences are calculated between summary statistic values by
subtracting the value of the “minus” variable/category from the value of the
first in the pair. Percentage differences use the value of the summary statistic of
the second (the “minus”) as the denominator.
7. Click the “Title” button to create custom table titles. This step is optional.
A title for the output table or a caption (added below the table) can be entered in this step.
If the title or caption extends over one line, insert \n for wrapping (a line break in the text).
Enter an appropriate title and caption, and click “Continue” when completed.
8. Click the “OK” button in the “OLAP Cubes” window to start creating the cube with the set options.
When the OLAP Cube is complete, the following output will be placed in the Output Viewer.
The default output provides three summary statistics – number of cases (N), mean, and standard
error of the mean – for “HV108 – Education in single years” for the entire sample (valid cases).
Although this table seems simple and unattractive, one can select each and every category of the
“Grouping variables”, as in pivot tables. To do this, double-click the table in the Output Viewer,
then click the dropdown icon and select any category in the list. The following exhibit
shows the statistics for “males aged 15-29 who live in urban areas”.
Again, one can pivot the output table to make it more attractive, as follows:
2. DESCRIPTIVE STATISTICS
The most frequently used procedures in PASW Statistics are the Descriptive Statistics. From
initial analysis and validity checking to extracting education data and constructing indicators from a
household survey, “Descriptive Statistics” are essential. Although “Reports” can provide similar
statistics, “Descriptive Statistics” are more user-friendly and provide a greater variety of charts.
2.1 Frequencies
The “Frequencies” procedure can produce statistics such as frequencies (counts), percentages,
cumulative percentages, mean, median, mode, sum, standard deviation, variance, range, minimum
and maximum values, standard error of the mean, skewness and kurtosis (both with standard
errors), quartiles, and percentiles. It can also produce bar charts, pie charts, and histograms.
For better display in the output tables and charts, distinct values can be arranged in ascending or
descending order of category labels or of their counts. The frequency report can be suppressed
when a variable has many distinct values. Charts produced by this command can be labeled with
frequencies (the default) or percentages. To produce a simple frequency table:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”; and
3. Click again “Frequencies”, and a new window will appear with complete variable list.
4. Select (categorical) variables for frequency tables (each variable will have its own table);
5. Click the “Format” button, and set the output formats:
a. how to order the categories in the frequency table – ascending or descending order of
values or counts;
b. how to organize the outputs if more than one variable is selected; and
c. whether to display or suppress tables with many categories (set the maximum);
6. Click “OK” button to start creating the frequency tables with selected charts and format.
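The core of the Frequencies output – counts, percents, and cumulative percents, here ordered by descending count as in the Format options – can be sketched in Python (sample values hypothetical):

```python
def frequency_table(values):
    """Rows of (value, count, percent, cumulative percent),
    ordered by descending count."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    rows, cum = [], 0.0
    for value, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        pct = 100 * count / n
        cum += pct
        rows.append((value, count, round(pct, 1), round(cum, 1)))
    return rows

print(frequency_table(["F", "M", "F", "F", "M"]))
```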
Frequencies is the procedure to start analyzing a dataset. It provides statistics and graphical displays that are useful for describing all different types of variables.
The following outputs will be obtained from the steps presented in the above exhibit.
It should be noted that only two frequency tables are generated although three variables were
selected. This is because PASW Statistics suppressed the frequency table of “HV105 – Age of
household member”: its number of categories (roughly 100) exceeds the set maximum of 15.
Generally, there are two key purposes in using Frequencies: (a) to get frequency tables of categorical
variables with a limited number of distinct items, for example, sex, educational attainment, age
group, etc.; and (b) to get summary statistics of continuous variables without a frequency table
(i.e., for variables on interval or ratio scales whose values are widely spread).
Moreover, bar charts, pie charts, and histograms can be created automatically for categorical
variables with a limited number of distinct items by clicking the “Charts” button, then selecting
the chart type and options after the current Step 5. In the above example, a pie chart is appropriate
for reviewing the gender composition (HV104) of the sample population, while a bar chart should be
used for the education levels (HV109). Since one chart type applies to all selected variables, those
two variables cannot be processed together in the same run.
Similarly, one can choose the types of statistics to be displayed by clicking the “Statistics” button
after selecting the charts. The following exhibits show the outputs for the variable “HV105 – Age of
household member” without a frequency table by age.
2.2 Descriptives
Because Descriptives does not sort values into a frequency table, it is an efficient means of computing
summary statistics for continuous variables. Almost all statistics provided in DESCRIPTIVES can
also be obtained from other procedures such as FREQUENCIES, MEANS, and EXAMINE.
Although “Frequencies” can also provide univariate statistics, “Descriptives” displays summary
statistics for several variables in a single table. It can also calculate and save standardized
values (Z-scores). Variables can be ordered by the size of their means (ascending or descending),
alphabetically, or in the order in which the user selects them (the default).
When Z-scores are saved, they are added to the current dataset and are available for analyses and
listings. When variables are recorded in different units (e.g., “household members” and “education
in single years”), the Z-score transformation places the variables on a common scale for easier visual
comparison. Moreover, “Descriptives” is efficient for large files with tens of thousands of cases.
To use Descriptives:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”; and
3. Click again “Descriptives”, and a new window will appear with complete variable list;
4. Select continuous (interval or ratio scale) or dichotomous (just 0 and 1) variables;
5. Click “Options”, (i) select the preferred statistics from the lists, (ii) define the order of the
variables to be displayed in the output table, and (iii) click “Continue”;
6. Optionally, tick “Save standardized values as variable” to save the Z-score (or standardized
values) of the selected variable(s) in the current dataset; and
7. Click “OK” button to start calculating summary descriptive statistics.
Descriptives computes univariate statistics, such as the mean, standard deviation, minimum, and maximum for numeric variables, and displays them in a single table for better comparison.
Note: Two scale variables, “HV105 – Age of household members” and “HV108 – Education in
single years”, and one dichotomous nominal variable, “HV110 – Member still in school”, are
used in this example.
In calculating descriptive statistics (and in most statistical analyses), it is important to
check and edit the variables under study so that only valid values enter the analysis. For example,
in the variable “HV108 – Education in single years”, code 97 is used for “inconsistent values”,
code 98 represents “DK (do not know)”, and code 99 is “missing”. Since 97, 98, and 99 are not valid
years of study, they should not enter the analyses; therefore, define all those codes as
“missing values” so that they are excluded from the computed statistics (see Module B2 on editing missing values).
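The effect of defining user-missing codes can be sketched in Python, assuming 97/98/99 are the codes to exclude (sample values hypothetical):

```python
import statistics

MISSING_CODES = {97, 98, 99}  # inconsistent / DK / missing, as above

def valid_only(values):
    """Drop user-missing codes before computing statistics."""
    return [v for v in values if v not in MISSING_CODES]

hv108 = [0, 5, 12, 98, 3, 99, 7, 97]
clean = valid_only(hv108)
print(len(clean), statistics.mean(clean))  # N and mean over valid cases only
```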
Two similar “Descriptive Statistics” tables are presented in the above example: (i) constructed with
the default missing values, that is, treating the codes 97 and 98 as valid; and (ii) constructed after
setting 97 and 98 as missing. Since the number of cases is large, the differences in the summary
statistics are minimal. However, if the same calculation were conducted for a subset with a limited
number of cases, the differences could be significant. In the above output table, the mean value 0.61
of the variable “member still in school” can be interpreted as “61% of the 20,540 persons are still in school”.
The following example presents all available statistics (set in the options) in “Descriptives”.
It should be noted that the descriptive statistics calculated for the variable “HV024 – Division” are
useless in any analysis. “HV024” is just a nominal variable with codes 1 to 6, representing the 6
divisions of Bangladesh, and its mean value of 3.48 does not indicate anything.
One of the significant features of “Descriptives” is its ability to save the standardized values (Z-scores)
of selected variables for use in further analyses. To add the Z-scores of a variable to the current
dataset, just tick the checkbox next to “Save standardized values as variables”. PASW will then
add new variables, prefixing Z to the original variable name; for example, the new variable for the
Z-score of “HV009” is simply “ZHV009”.
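The Z-score transformation itself is straightforward; a Python sketch using the sample standard deviation (N-1 denominator), with hypothetical HV009 values:

```python
import statistics

def z_scores(values):
    """Standardize values: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample sd, N-1 denominator
    return [(v - mean) / sd for v in values]

hv009 = [2, 4, 4, 6]  # hypothetical household sizes
print([round(z, 2) for z in z_scores(hv009)])
```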
2.3 Explore
Data screening aims to detect unusual values, extreme values, data gaps, or other
peculiarities. By exploring the data, users can determine whether the statistical techniques under
consideration for further analyses are appropriate. It may help in deciding whether to transform
the data (in case a technique requires a normal distribution) or to use nonparametric tests instead.
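One common screening device Explore supports is the boxplot, which flags values beyond 1.5 × IQR from the quartiles as outliers. A Python sketch (values hypothetical, and Python's quantile method may differ slightly from PASW's):

```python
import statistics

def boxplot_outliers(values):
    """Values more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

years = [0, 1, 2, 2, 3, 3, 4, 5, 21]  # hypothetical years of study
print(boxplot_outliers(years))
```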
Dependent variables, or variables to be explored [List (a) in the following chart], must be quantitative
(interval- or ratio-level measurements). Factor variables [List (b)], with short string or numeric
values, break the dependent variables into groups of cases. The factor variables should have a
reasonable number of distinct values, generally not more than 10 categories. The case label
variable [List (c): only one variable allowed], used to label outliers in boxplots, can be a short string,
a long string (only the first 15 bytes are used), or numeric. To analyze with Explore:
1. Click “Analyze” on the main menu bar;
2. Click “Descriptive Statistics”; and
3. Click “Explore”. A new “Explore” window will appear with the complete variable list;
4. Select continuous (interval or ratio scale) variables to produce univariate statistics;
Given the voluminous output produced by “Explore”, just one variable, “HV108 – Education in
single years”, with simple (mostly default) settings, has been used in the following example.
Explore produces summary statistics and graphical display, either for all cases or separately for groups of cases. It is particularly useful in data screening, outlier identification, description, assumption checking, and characterizing differences among subpopulations (groups of cases).
[Screenshots: the “Explore”, “Statistics”, “Plots” and “Options” dialog boxes, with numbered callouts corresponding to the steps and their default/additional settings; HV108 is the selected variable.]
5. Click “Statistics”, set the preferred statistics from the lists, and click “Continue”;
6. Click “Plots”, set the preferred types of plots from the lists, and click “Continue”;
7. Click “Options”, set how to handle the missing values, and click “Continue”;
8. Select “Display” option (only statistics or plots, or both) on “Explore” window; and
9. Click “OK” button to start “Explore”, and the following outputs will be displayed.
By selecting all statistics and available charts, exploring “HV108 – Education in single years”
factored by “HV024 – Division” produces altogether 33 tables and charts, as in the following output
(starting from “Case Processing Summary” and ending with “Spread-versus-Level Plot”):
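The numeric core of what “Explore” reports (basic descriptives plus the boxplot outlier rule) can be approximated outside PASW. The following illustrative pandas sketch uses Tukey's 1.5 × IQR fence, the usual rule behind boxplot outlier flags; the data are hypothetical:

```python
import pandas as pd

def explore(series):
    """Basic descriptives plus boxplot outliers via Tukey's 1.5*IQR rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": series.mean(),
        "median": series.median(),
        "iqr": iqr,
        "outliers": series[(series < lower) | (series > upper)].tolist(),
    }

# Hypothetical "education in single years" values
years = pd.Series([0, 2, 5, 5, 6, 7, 8, 9, 10, 25])
result = explore(years)
```

Here the single extreme value (25 years) falls outside the upper fence and is flagged, which is exactly the kind of case “Explore” labels in its boxplot.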
2.4 Crosstabs
“Frequencies” and “Explore” are efficient for univariate statistics, but those procedures
cannot provide information on the relationship between categorical variables. For example,
frequencies can provide “number of household heads by education level”, “number of
household heads by sex”, or “number of households by economic status (wealth index)”, but cannot
provide “number of female-headed households in the poorest category” or even answer a question as
simple as “percentage of female-headed households”.
In Crosstabs, the values of a numeric or short string variable define the categories of each variable.
For example, codes “1 and 2”, “male and female”, or “M and F” are valid for the variable “sex”.
Ordinal variables can use either numeric codes that represent categories, for example, numeric codes
“1 to 5” for the variable “Wealth Index” representing “1 = poorest, 2 = poorer, 3 = middle,
4 = richer, and 5 = richest”, or string values “a to e” as “a = richest, b = richer, c = middle, d =
poorer, and e = poorest”.
In PASW Statistics, the alphabetic order of string values is assumed to reflect the true order of the
categories. Therefore, if a string variable with codes “L, M, H” representing “low, medium and
high” is used, the order of the categories in the output will be “H, L, M” and the results might be
misinterpreted. In general, it is more reliable to use numeric codes and provide appropriate value
labels to represent ordinal data.
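The alphabetic-ordering pitfall described above is easy to demonstrate, and the remedy, declaring the category order explicitly, works the same way in other tools. A pandas sketch with hypothetical codes:

```python
import pandas as pd

# Hypothetical ordinal codes: low / medium / high
levels = pd.Series(["L", "M", "H", "L", "H"])

# Default alphabetic ordering puts the categories in the order H, L, M
wrong_order = levels.value_counts().sort_index().index.tolist()

# Declaring an ordered categorical restores the intended L < M < H order
ordered = pd.Categorical(levels, categories=["L", "M", "H"], ordered=True)
right_order = ordered.categories.tolist()
```

This mirrors the manual's advice: unless the order is declared (here, via an ordered categorical; in PASW, via numeric codes with value labels), the software sorts the labels alphabetically.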
Selection of Variables
For cross-tabulation, at least one variable must be selected for the rows and one for the columns of the
output table. Other variables can then be added as layers, known as “factor” variables. The
variables used in the Crosstabs procedure must be categorical (measured on a nominal or ordinal scale)
with a limited number of distinct values (generally fewer than 10). Discrete scale variables can
also be used to obtain statistics, provided the range of values is not too large
and the table output is suppressed. The factor variables must be categorical.
Statistics Option
In Crosstabs, statistics and measures of association are computed for two-way tables only. If a table
is formed in multiple dimensions, with “row, column, and layer (control) variables”, the Crosstabs procedure
forms one panel of associated statistics and measures for each value of the layer (or each combination
of values of two or more control variables). For example, if “sex” is a layer factor for a table of
“educational attainment” against “wealth index”, the results of the two-way table are computed
separately for males and for females.
Crosstabs is one of the procedures producing a variety of statistics, such as:
Chi-square tests of independence/association: for 2 x 2 tables one can select the
Pearson chi-square, the likelihood-ratio chi-square, Fisher's exact test, and Yates' corrected chi-
square (continuity correction). For tables with any number of rows and columns, select Chi-
square to calculate the Pearson chi-square and the likelihood-ratio chi-square.
Spearman's rank correlation coefficient (rho) is calculated when both rows and columns contain
ordinal variables (numeric data only). When both row and column variables are quantitative,
Pearson's correlation coefficient (r), a measure of linear association, is calculated.
For more explanations on statistics please see "PASW Statistics 17 Base User Guide".
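As an illustration of the two correlation measures, the following scipy sketch computes both on a small hypothetical sample (when all ranks are distinct, Spearman's rho is simply Pearson's r applied to the ranks):

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r, _ = pearsonr(x, y)      # linear association (quantitative variables)
rho, _ = spearmanr(x, y)   # rank correlation (ordinal variables)
```

For this sample both coefficients equal 0.8; with tied ranks or nonlinear but monotonic relationships the two would diverge, which is why the choice depends on the measurement level of the variables.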
Cells Display Option
By default, Crosstabs displays the “count” or the number of cases actually observed in each cell.
Optionally, number of “expected” cases could be selected to display. Similarly, row, column and
total percentages can be displayed in the cells together with the observed number of cases (count).
Crosstabs is useful for investigating the relationship between two or more categorical variables by providing information about the intersection of variables.
To uncover the patterns in the data contributing to a chi-square test, three types of residuals (deviates)
that measure the difference between observed and expected frequencies can be displayed.
Unstandardized: the difference between an observed value and the expected value.
Standardized: the residual divided by an estimate of its standard deviation. Standardized
residuals, also known as Pearson residuals, have a mean of 0 and a standard deviation of 1.
Adjusted standardized: the residual for a cell (observed minus expected value) divided by an
estimate of its standard error.
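The relationship between observed counts, expected counts, the chi-square statistic, and the standardized (Pearson) residuals can be made concrete with a small scipy sketch; the 2 x 2 counts below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 2 table: rows = sex of household head, cols = residence
observed = np.array([[30, 70],
                     [45, 55]])

# correction=False gives the plain Pearson chi-square (no Yates correction)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
residuals = (observed - expected) / np.sqrt(expected)
```

Cells with large absolute residuals are the ones driving a significant chi-square, which is exactly why the Cells dialog offers to display them.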
Non-integer weights Option
Cell counts are normally integer values. But if the dataset is weighted by a variable with fractional
values (e.g. 1.25), cell counts can be fractional. The counts can then be truncated or rounded,
either before or after calculating the cell counts, or the fractional cell counts can be used as-is
for both table display and statistical calculations.
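A weighted cell count is simply the sum of the case weights within each category, so the options above amount to when (and whether) to round that sum. An illustrative pandas sketch with hypothetical weights:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "weight": [1.25, 0.80, 1.10, 0.85],  # hypothetical fractional weights
})

# Fractional cell counts: the sum of case weights within each category
counts = df.groupby("sex")["weight"].sum()

# "Round cell counts" after aggregation vs. truncating them
rounded = counts.round()
truncated = np.floor(counts)
```

Note how truncating and rounding the same fractional count (1.90 for females here) give different integers, which is why the choice of option can slightly change the displayed table.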
Using Crosstabs
Follow the steps:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”;
3. Click “Crosstabs” and a new “Crosstabs” window will appear with complete variable list;
4. Select categorical variables (or scale variables with a limited number of distinct values) and
send them to rows, columns and layers (click “Next” to add another layer). Layer variables can be
organized either all on the same layer (one set of tables per layer variable) or on different
layers (just one set of tables with cross-layer cells).
5. Select appropriate statistics to be calculated;
In this example, no statistics are selected, although both the row and column variables are ordinal,
and thus chi-square, correlations, Gamma and Kendall’s tau would be appropriate to calculate.
6. Select the contents of the cells in the cross-tabulation;
7. Set the row order: ascending or descending;
[Screenshots: the “Crosstabs” dialog boxes with numbered callouts for the steps; the variables used are HV026, HV270 and HV219.]
8. Set whether to get the clustered bar charts;
9. Set whether to suppress tables (or display the main crosstab table); and
10. Click “OK” to start constructing tables and charts as selected.
In this example, no optional settings are used, and just two tables, (1) Case Processing Summary, and
(2) the basic cross-tabulation table with simple counts in the cells, are produced. In cross-tabulation,
missing values are handled list-wise (across variables), and thus it is important to observe the
“number of valid cases” in the “Case Processing Summary” statistics.
If different cell display options, such as observed and expected counts; row, column and
total percentages; and residuals, are selected in Step 6, the following crosstab table is created
after using the pivoting capabilities offered in PASW Statistics and a few minor touches.
[Screenshots: Steps 5 through 9, with all settings left “as in the default” in this example.]
Note: The original output table is huge and difficult to read since all statistics are placed together.
It has been edited: (1) a long value label was shortened; (2) the variable label of HV026 was hidden;
and (3) “Statistics” was moved to “LAYER” in the “Pivoting Trays”.
The following tables present percentage distribution of households within “Place of residence” and
within “Wealth index” by “Sex of household head”, which are extracted from the above pivot table.
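The same kind of “percentage within row” table can be sketched in pandas; the variable names follow the survey recode (HV026, HV219) but the data below are hypothetical:

```python
import pandas as pd

# Hypothetical households; names follow the DHS recode (HV026, HV219)
df = pd.DataFrame({
    "HV026": ["urban"] * 4 + ["rural"] * 6,
    "HV219": ["M", "M", "F", "M", "M", "F", "M", "M", "M", "F"],
})

counts = pd.crosstab(df["HV026"], df["HV219"])
# normalize="index" -> percentages within each row (place of residence)
row_pct = pd.crosstab(df["HV026"], df["HV219"], normalize="index") * 100
```

Using `normalize="columns"` or `normalize="all"` instead would reproduce the column and total percentages offered in the Cells dialog.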
[Screenshot: Step 6 with the newly selected cell display options.]
By selecting both the “Display clustered bar charts” and “Suppress tables” options, the following
charts will be produced without any output tables except the “Case Processing Summary”:
2.5 Ratio Statistics
In Ratio Statistics, the outputs can be sorted by the values of a grouping variable, in ascending or
descending order. Grouping variables must be nominal or ordinal level measurements, and it is
better to use numeric codes or short strings. The ratio statistics report can be suppressed in the
output, and the results can be saved to an external file.
It provides statistics on: central tendency (median, mean, weighted mean); confidence intervals
for the mean and median; measures of dispersion (AAD – average absolute deviation, COD –
coefficient of dispersion, PRD – price-related differential or index of regressivity, median-centered
coefficient of variation, mean-centered coefficient of variation, standard deviation, range, minimum
and maximum values); and the concentration index (the proportion of ratios that fall within a
user-specified range or within a specified percentage of the median ratio).
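Some of these measures follow directly from their definitions. The sketch below computes a few of them on hypothetical ratios, using common textbook formulas (AAD about the median; COD as AAD over the median); treat this as an illustration, not necessarily PASW's exact algorithm:

```python
import numpy as np

ratios = np.array([0.8, 0.9, 1.0, 1.0, 1.1, 1.4])  # hypothetical ratios

median = np.median(ratios)
mean = ratios.mean()
aad = np.abs(ratios - median).mean()      # average absolute deviation (AAD)
cod = 100 * aad / median                  # coefficient of dispersion (COD), %
cov = 100 * ratios.std(ddof=1) / mean     # mean-centered coeff. of variation, %
```

A COD of about 13% here says the ratios typically deviate from their median by 13% of the median, which is the kind of dispersion summary the procedure reports per group.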
Practical Example:
In analyzing household survey data for participation in general education, using the total number of
children aged 6-15 (var1) and those currently attending primary school (var2), with sex (or
urban/rural residence, division, etc.) as the grouping variable, the age-specific enrolment ratios for
children aged 6-15 by sex can be calculated. Moreover, the variation in the distribution of ratios
between males and females can also be observed.
However, there is no variable which gives the “number of children at age x” after summing
within the grouping variable. Therefore, a variable must be created, say “pop”, with value 1 for
each and every child aged 6-15. Use the “Compute” command as follows:
And, define the variable label (“Population aged 6-15”) and format (Display: 5 and Decimal: 0).
Ratio Statistics provides a comprehensive list of summary statistics for describing the ratio between two scale variables with positive values.
After creating the new variables, use Ratio Statistics as follows:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”; and
3. Click “Ratio”. A new “Ratio Statistics” window will appear with the complete variable list;
4. Select two scale variables for “Numerator” and “Denominator”, and a categorical
(nominal or ordinal) variable for “Group” variable;
5. Set whether to sort group variable in ascending or descending order;
6. Set whether to display results or not (just to save in a new file);
7. Set whether to save results to a new data file for further analyses;
8. Click “Statistics” button and select required statistics in “Statistics” window; and
9. Click “OK” button to start constructing statistics as selected.
[Screenshots: computing the first time without an IF condition, then computing a second time with the IF condition.]
Warning:
Caution must be taken in using DHS survey data for the “current schooling status”, since DHS
asks the question “Is xx still in school?” only of those who have ever been to school;
thus, those who have never been to school are omitted or treated as “missing”.
To obtain the correct “current schooling status” of every person, another variable must be
created, say “schooling”, from “HV110 – Member still in school”, by setting “schooling = 1”
where “HV110 = 1” and “schooling = 0” for all other cases. The new variable
“schooling” can be created by using the “Compute” command twice: first, compute “schooling =
0” for all cases; then compute “schooling = 1” for those who are currently attending school,
that is, HV110 = 1. Finally, set appropriate properties for the new variable.
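The two-pass “Compute” recode described in the warning can be expressed in pandas as follows (hypothetical HV110 values, with NaN standing for the skipped never-attended cases):

```python
import numpy as np
import pandas as pd

# HV110: 1 = still in school, 0 = no longer in school,
# NaN = question skipped (never attended school)
df = pd.DataFrame({"HV110": [1, 0, np.nan, 1, np.nan]})

# First compute: schooling = 0 for all cases
df["schooling"] = 0
# Second compute with the IF condition: schooling = 1 where HV110 = 1
df.loc[df["HV110"] == 1, "schooling"] = 1
```

The order matters: initializing everyone to 0 first is what converts the skipped (missing) cases into valid “not in school” cases instead of leaving them missing.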
The following exhibit shows both the “statistics options” selected and the “results” obtained.
Normally, the group variable is displayed on the rows and the statistics on the columns. If several
statistics are chosen, the output table may be difficult to read or print. In that case, double-click the
table to open the Pivot Table editor, then apply “Transpose Rows and Columns” under the “Pivot”
menu to place the statistics on the rows and the groups on the columns, making the table easier to read.
3. TIPS AND EXERCISES
3.1 Tips: Do and Don’t
i) Do… first, use the “codebook” procedure to become acquainted with the household
survey dataset if complete documentation is unavailable;
Don’t… waste time searching for or requesting the actual coding scheme, or running
frequency tables for all variables.
ii) Do… study the survey questionnaire and “codebook” to select the variables of
interest, and make new datasets or variable sets for further analyses;
Don’t… try selecting variables on a “trial and error” basis, without studying the proper
survey documentation or codebook, when analyzing a newly available dataset.
iii) Do… become acquainted with the OLAP Cubes procedure; run several frequency
and crosstab tables and practice using the OLAP Cubes;
Don’t… display several variables in multiple layers of a table, since it is difficult
to grasp the essence of the statistics displayed, and the table may be unusable
or easily misinterpreted.
iv) Do… master the data preparation and management techniques, such as computing
new variables, selecting cases, creating new variable sets, data validation,
etc.;
Don’t… waste time editing/correcting a secondary household survey dataset (obtained
from other sources: departments, agencies, organizations, …).
v) Do… start the analysis by running “frequencies” on every variable except the
continuous (scale) variables with many distinct values. For the
continuous (scale) variables, use the “Descriptives” procedure to explore their
basic structure;
Don’t… go into in-depth analyses or calculation of ratio statistics before
understanding the variables well.
vi) Do… crosstab variables with intrinsic linkages and export the outputs to
spreadsheet software for better presentation, and create and present
graphs and charts as appropriate in PASW or Excel;
Don’t… create oversized crosstab tables with multiple layers (use the “pivot” technique
to simplify the crosstab tables).
vii) Do… run the crosstab tables (or frequency tables) to get the baseline data correctly,
and make further calculations and analyses in spreadsheet software;
Don’t… run (and use the outputs of) the “ratio statistics” procedure unless you are
sure that the process is perfectly correct.
3.2 Self-evaluation
Do you know when to use the codebook procedure in PASW Statistics?
Very well / Somewhat well / Not so much / Almost none
Are you confident that you can run the following procedures on an active dataset?
o Codebook: Confident / Somewhat confident / Not so much / Not at all
o OLAP Cubes: Confident / Somewhat confident / Not so much / Not at all
o Frequencies: Confident / Somewhat confident / Not so much / Not at all
o Crosstabs: Confident / Somewhat confident / Not so much / Not at all
o Ratio Statistics: Confident / Somewhat confident / Not so much / Not at all
Do you think you can demonstrate to your colleagues how to run:
o Simple frequency tables: Definitely / Could be / Not so sure / Not at all
o Frequency tables with appropriate charts: Definitely / Could be / Not so sure / Not at all
o Simple crosstab tables: Definitely / Could be / Not so sure / Not at all
o Crosstab tables with layers: Definitely / Could be / Not so sure / Not at all
o Simple OLAP Cubes: Definitely / Could be / Not so sure / Not at all
o Pivoting crosstab tables: Definitely / Could be / Not so sure / Not at all
3.3 Hands-on Exercises
1) Import the attached “data1(tab).dat” and define all variables appropriately, and run the
codebook procedure to check whether you have defined the dataset effectively.
2) From the dataset obtained from Exercise 1 above, recode all string variables, and run
the codebook procedure to check whether you have recoded and defined the dataset
effectively.
3) Begin data analysis with selected procedures of your choice to get education
indicators which are useful for EFA monitoring.
4) Get a recent household survey dataset from your country, then note down the step-by-
step procedure on how to make use of it in education planning, especially for EFA
monitoring.
5) Follow the steps defined in the previous question and get the “data, information and
indicators” which you have defined.
4. ANNEX: WEB LINKS FOR FURTHER STUDY ON SPSS/PASW STATISTICS
1. Central Michigan University. SPSS (PASW) On-Line Training Workshop
(See http://calcnet.mth.cmich.edu/org/spss/index.htm )
2. College of Humanities and Social Sciences. Topics in Multivariate Analysis.
(See http://faculty.chass.ncsu.edu/garson/PA765/index.htm)
3. Creative Research Systems: Survey Research Aids
(See http://www.surveysystem.com/resource.htm )
4. East Carolina University. PASW/SPSS Lessons: Univariate Analysis.
(See http://core.ecu.edu/psyc/wuenschk/SPSS/spss-lessons.htm )
5. Newcastle University. Statistics Support.
(See http://www.ncl.ac.uk/iss/statistics/docs/ )
6. Research Method Knowledge Base.
(See http://www.socialresearchmethods.net/kb/index.php )
7. SPSS Web-Based Training.
(See http://www.spss.com/training/wbt/ )
8. Statistical Exercises Using PASW Statistics.
(See http://www.brad.ac.uk/lss/documentation/pasw-statistics-v17-exercise/statistical-
exercises-using-PASW%20Statistics-v17.pdf )
9. UCLA Academic Technology Services. Resources to help you learn and use SPSS.
(See http://www.ats.ucla.edu/stat/spss/default.htm )
10. University of Toronto. SPSS Tutorial.
(See http://www.psych.utoronto.ca/courses/c1/spss/toc.htm )
11. Visual Statistics Studio.
(See http://www.visualstatistics.net/ )
Module B5:
Using Microsoft Excel to Elaborate PASW Outputs for Better Presentation
Contents:
1. MS Excel 2007: Basics
1.1 Result-Oriented User Interface
1.2 New File Formats in Microsoft Office Excel 2007
1.3 Data Handling Capacity of Microsoft Office Excel 2007
1.4 Selected Statistical Functions in Microsoft Office Excel
2. Further Analyses and Presenting Outputs in MS Excel
2.1 Importing PASW Databases into Microsoft Office Excel
2.2 Creating Frequency and Crosstab Tables
2.3 PivotTables (OLAP Cubes)
2.4 Drawing Pivot Charts
2.5 Elaborating PASW Outputs for Better Presentation
3. Tips and Exercises
3.1 Tips: Do and Don’t
3.2 Self-evaluation
3.3 Hands-on Exercises
Purpose and learning outcomes:
To know how to import PASW outputs into MS Excel 2007
To introduce data handling and data analysis using MS Excel 2007
To explore some advanced features of data presentation in MS Excel 2007
1. MICROSOFT OFFICE EXCEL 2007: BASICS
1.1 Result-Oriented User Interface
The layout of the main menu and the contents of the first menu tab, “Home”, are as follows:
Many dialog boxes are replaced with drop-down galleries that display the available options, and
descriptive tooltips or sample previews are provided to help in choosing the right option. For example,
clicking on “Paste” displays a drop-down gallery whose active options depend on
which items are available in the clipboard:
(1) No items in clipboard (2) After copying an Excel range
(3) After copying a picture / image (4) After copying text from MS Word
Nowadays, Microsoft Excel is the most widely used spreadsheet software in the world. The new results-oriented user interface is intended to make working in Excel 2007 easy. Commands and features are organized on task-oriented tabs that contain logical groups of commands and features. Since the user interface has changed completely, even regular users need to familiarize themselves with its new features and looks.
The Office clipboard can store up to 24 items. If the mouse hovers over the “ ” icon located at the bottom
right corner of the “Paste” menu, “instant help” on the “Clipboard” will be displayed, and if the mouse hovers over
“Paste”, the tooltip will be displayed as follows:
And if the Clipboard area located at the bottom of the “Paste” menu is clicked, a clipboard pane with all
available items kept in the clipboard will be displayed.
Moreover, online help for the clipboard is available.
For every activity being performed in the new user interface – whether it's formatting or analyzing
data – Excel presents the tools, tips and help that are most useful to successfully complete that task.
As such, the user interface of Office Excel 2007 is helping to obtain the desired results efficiently.
[Screenshots: the empty clipboard; sample items copied from Word and from Excel; a thumbnail of a copied picture/image; and the number of items kept in the clipboard.]
1.2 New File Formats in Microsoft Office Excel 2007
The previous versions of Excel files (from Excel 2.1 to Excel 2003) use “.xls” for Excel (data) files,
“.xla” for add-ins, and “.xlt” for templates. Excel files with extension “.xls” could hold data sheets,
chart sheets and macro sheets. In Excel 2003, “.xml” is used for XML-based spreadsheet or data
files (XML = Extensible Markup Language). In Office Excel 2007, the following formats and file
extensions are used to distinguish different file types and for better securities:
Excel Workbook .xlsx The default Office Excel 2007 XML-based file format. It cannot store
VBA macro code or Microsoft Office Excel 4.0 macro sheets (.xlm).
Excel Workbook (code) .xlsm The Office Excel 2007 XML-based and macro-enabled file format.
It stores VBA macro code or Excel 4.0 macro sheets (.xlm).
Excel Binary Workbook .xlsb The Office Excel 2007 Binary file format (BIFF12).
Template .xltx The default Office Excel 2007 file format for an Excel template.
It cannot store VBA macro code or Excel 4.0 macro sheets (.xlm).
Template (code) .xltm The Office Excel 2007 macro-enabled file format for an Excel template.
It stores VBA macro code or Excel 4.0 macro sheets (.xlm).
Excel Add-In .xlam The Office Excel 2007 XML-based and macro-enabled Add-In, a
supplemental program that is designed to run additional code.
It supports the use of VBA projects and Excel 4.0 macro sheets (.xlm).
Moreover, the following file types (or filename extensions) of previous versions of Excel are still
valid Excel files in Office Excel 2007 and can be opened or saved without converting into the 2007 format:
Excel 97-2003 Workbook .xls The Excel 97 - Excel 2003 Binary file format (BIFF8).
Excel 97-2003 Template .xlt The Excel 97 - Excel 2003 Binary file format (BIFF8) for an Excel template.
Excel 5.0/95 Workbook .xls The Excel 5.0/95 Binary file format (BIFF5).
XML Spreadsheet 2003 .xml XML Spreadsheet 2003 file format (XMLSS).
XML Data .xml XML Data format.
It should be noted that Excel files created in any version can be opened and saved back in
their original file type; however, Office Excel 2007 files cannot be opened in earlier versions of
MS Excel unless the optional Office updates for file format conversion are installed.
1.3 Data Handling Capacity of Microsoft Office Excel 2007
Enabling users to explore massive amounts of data in worksheets, Office Excel 2007 supports 1,048,576
rows by 16,384 columns per worksheet (2^34, i.e., about 17 billion cells). This is a size that no
household survey dataset can surpass: up to one million cases across sixteen thousand
variables are allowed. Therefore, any household survey dataset can be exported to Excel, and further analyses
can be conducted in Excel 2007, which is much more familiar to most education planners and
administrators.
As seen in the above exhibit, an Office Excel 2007 worksheet is “1 K” (1,024) times larger than an Excel
2003 worksheet. Although Excel 2007 files can be opened in Excel 2003, the contents of Excel
2007 worksheets located outside the Excel 2003 boundaries (65,536 rows x 256
columns) cannot be retrieved in Excel 2003.
Other improvements in Office Excel 2007 compared to Excel 2003 include the following:
(a) the number of formatting types allowed in the same workbook has increased from 4 thousand
in Excel 2003 to an unlimited number in Excel 2007;
(b) the number of cell references per cell has increased from 8 thousand to being limited only by
available memory;
(c) memory management has been increased from 1 GB to 2 GB;
(d) support for up to 16 million colors; and
(e) support for dual processors and multithreaded chipsets.
With such improvements, the overall performance of Excel has moved forward. Moreover, on
computers with advanced chipsets, calculations in large, formula-intensive worksheets are much
faster.
1.4 Selected Statistical Functions in Microsoft Office Excel 2007
There are altogether 346 built-in functions under 12 different categories in Excel 2007. A summary of
the Excel functions by category, in descending order of the number of functions per category, is
as follows:
Sr. Category Number Per cent
1 Statistical functions 82 23.7%
2 Math and trigonometry functions 60 17.3%
3 Financial functions 53 15.3%
4 Engineering functions 39 11.3%
5 Text functions 27 7.8%
6 Date and time functions 20 5.8%
7 Lookup and reference functions 18 5.2%
8 Information functions 17 4.9%
9 Database functions 12 3.5%
10 Cube functions 7 2.0%
11 Logical functions 6 1.7%
12 Add-in and Automation functions 5 1.4%
Total 346 100.0%
It is difficult to say which Excel functions are required in analyzing household survey data and
which are not, since this depends on the experience of the user and the types of output to be
generated. The following are the functions directly concerned with analyzing a database or refining
the PASW Statistics output tables.
DAVERAGE Returns the average of selected database entries
DCOUNT Counts the cells that contain numbers in a database
DCOUNTA Counts nonblank cells in a database
DGET Extracts from a database a single record that matches the specified criteria
DMAX Returns the maximum value from selected database entries
DMIN Returns the minimum value from selected database entries
DSTDEV Estimates the standard deviation based on a sample of selected database entries
DSUM Adds the numbers in the field column of records in the database that match the criteria
DVAR Estimates variance based on a sample from selected database entries
AND Returns TRUE if all of its arguments are TRUE
FALSE Returns the logical value FALSE
IF Specifies a logical test to perform
NOT Reverses the logic of its argument
OR Returns TRUE if any argument is TRUE
TRUE Returns the logical value TRUE
ROUND Rounds a number to a specified number of digits
ROUNDDOWN Rounds a number down, toward zero
ROUNDUP Rounds a number up, away from zero
SQRT Returns a positive square root
SUBTOTAL Returns a subtotal in a list or database
SUM Adds its arguments
SUMIF Adds the cells specified by a given criteria
SUMIFS Adds the cells in a range that meet multiple criteria
SUMPRODUCT Returns the sum of the products of corresponding array components
AVERAGE Returns the average of its arguments
AVERAGEA Returns the average of its arguments, including numbers, text, and logical values
AVERAGEIF Returns the average (arithmetic mean) of all the cells in a range that meet a given criteria
AVERAGEIFS Returns the average (arithmetic mean) of all cells that meet multiple criteria.
COUNT Counts how many numbers are in the list of arguments
COUNTA Counts how many values are in the list of arguments
COUNTBLANK Counts the number of blank cells within a range
COUNTIF Counts the number of nonblank cells within a range that meet the given criteria
FREQUENCY Returns a frequency distribution as a vertical array
GEOMEAN Returns the geometric mean
GROWTH Returns values along an exponential trend
HARMEAN Returns the harmonic mean
MAX Returns the maximum value in a list of arguments
MAXA Returns the maximum value in a list of arguments: numbers, text, and logical values
MEDIAN Returns the median of the given numbers
MIN Returns the minimum value in a list of arguments
MINA Returns the smallest value in a list of arguments: numbers, text, and logical values
MODE Returns the most common value in a data set
PERCENTILE Returns the k-th percentile of values in a range
QUARTILE Returns the quartile of a data set
RANK Returns the rank of a number in a list of numbers
STDEV Estimates standard deviation based on a sample
STDEVA Estimates standard deviation based on a sample, including numbers, text, and logical values
TREND Returns values along a linear trend
TRIMMEAN Returns the mean of the interior of a dataset
The detailed descriptions of these functions and examples can be seen in online help of Microsoft
Excel 2007, and thus, will not be elaborated in this module.
2. FURTHER ANALYSES AND PRESENTING OUTPUTS IN MS EXCEL
2.1 Importing PASW Database into Microsoft Office Excel
To read PASW Statistics (*.sav) data files directly in applications that support Open Database
Connectivity (ODBC) or Java Database Connectivity (JDBC), the PASW Statistics data file driver
is required. PASW Statistics itself supports ODBC in the Database Wizard, providing the ability to
leverage the Structured Query Language (SQL) when reading SAV data files in PASW Statistics.
The PASW Statistics data file driver is packaged together with other drivers, which may be required for
accessing different types of databases, in a “Data Access Pack (DAP)”, which can be downloaded
from the PASW Statistics website. A version of DAP for Windows, “DAPWin32_5.3_SP2.exe”
(file size: 36,624 KB), is provided on the training CD.
After installing DAP, there will be an “SPSS Inc OEM Connect and ConnectXE for ODBC 5.3”
program group in the “Start Menu Programs”. Click “ODBC Administrator”, and follow the steps
to get access to PASW Statistics data files (*.sav) from applications with ODBC capabilities:
1. Click “File DSN” tab; and
2. Click “Add” button to add a new data source.
With extended data handling capacities, it is possible to analyze any dataset from household surveys for assisting EFA Monitoring with Microsoft Excel. However, it is much easier to use other popular data analysis software such as PASW Statistics, then export the outputs to Excel, and elaborate and present with MS Excel.
The “Create New Data Source” dialog box will appear. There, all available drivers on
the computer will be listed, and
3. Select “SPSS Inc. 32-Bit Data Driver (*.sav)”; and
4. Click “Next”.
There, it will request a new Data Source Name (DSN), and
5. Type-in an appropriate DSN name (“SPSS-Training” in this example); and
6. Click “Next”.
7. The “Create New Data Source” dialogue will provide summary information on the
current settings. If correct, click “Finish” to complete the creation of a ‘file DSN’.
At this point the program will request the location; identify it by filling in the correct folder name
with the complete “path” of the PASW Statistics data files.
8. In this example, type in: “c:\....\My Documents\SPSS Training\Sample”, where all
the sample datasets are stored, and click “OK”;
9. Click “OK” again to complete and exit from “ODBC Data Source Administrator”.
After creation of the new ODBC data source, the newly defined file DSN name, “SPSS-Training”,
will be listed in Windows applications with ODBC capabilities. Any PASW data files (*.sav)
located in the specified folder can then be accessed from other applications: a full dataset can be retrieved
through “existing ODBC connections”, or a subset through “Microsoft Query”.
When clicking “Existing Connections” under “Data” menu in Office Excel 2007, “SPSS-Training”
will be displayed as one of the existing external data sources for Excel (see “A” in the following
exhibit). By selecting this connection, one can retrieve any dataset (whole dataset) from the list.
Similarly, when clicking “From Other Sources” and selecting “From Microsoft Query”, one can see
“SPSS-Training” as a data source (see “B”), and by following the Wizard, users can retrieve
part of a dataset: only the cases that satisfy set conditions, and only the selected variables.
In short, follow the steps below to import a complete PASW Statistics dataset into Excel 2007:
1. Click “Data” tab;
2. Again, click “Existing Connections” button to get the “Existing Connections”
dialog box;
3. Select “SPSS-Training” from the list of available connections; and
4. Click “Open” and a complete list of PASW Statistics datasets in the specified folder
(set while creating the file DSN “SPSS-Training”) will be displayed as “Tables”.
5. Select the dataset (by clicking on the name) and click “OK” button;
6. In the import data window, select where to place the imported data: in the “Existing
worksheet” (active worksheet) or in a “New worksheet”. If the “Existing
worksheet” is selected, one can define the cell at which to place the data (default is $A$1).
7. Click “OK” to start the importing process, which will take a few minutes.
At the end of the importing process, the PASW dataset will be placed on the specified Excel
worksheet with a name like “Table_SPSS_Training” and treated as an Excel “Database Table”.
In this example, the file is saved with the name “Excel2.xlsx”. When opening the Excel file with
imported database, Office Excel 2007 will issue a “Security Warning” with the message “Data
connections have been disabled” together with an “Option” tab. If the imported data requires
updating from the source PASW dataset, or requires importing another dataset, the user must
enable the data connection. Otherwise, the user can choose to disable the data connection.
Warning:
Importing data into Excel (as well as into other databases) retrieves only data values, not metadata (labels,
missing values, etc.). Therefore, the user must have the codebook of the
dataset (and the survey questionnaire) before doing any analysis. As usual, after successfully
importing PASW datasets, first save the Excel file containing the imported database with an
appropriate name.
In the Excel worksheet, the variable names are placed on the first row, with “AutoFilter” enabled for
all variables. The “AutoFilter” feature can assist in checking for invalid entries and selecting cases
that fulfil specified rules. If “AutoFilter” is not required, it can be turned off by clicking
the filter button; clicking it again turns AutoFilter back on.
Example:
To select the cases of children aged 6, one can click the down-arrow next to
“HV105”, clear the tick next to “(Select All)” (to unselect all), tick the box next to 6, and click
“OK”. In the following exhibit, it can be seen in the “status bar” (located at the bottom left
corner) that there are altogether 53,413 records (or cases) in the database, and only the 1,302 records
for children aged 6 are selected.
If a second variable, sex (HV104), is then filtered to show only “1 (Male)”, the following output will
be obtained, with only 656 records (boys aged 6).
Even if the entire worksheet is selected, copied, and then pasted onto a new sheet while filtering, only
the filtered records (the unhidden rows) will be pasted into the new worksheet. Unwanted variables
can then be selected and deleted column by column to clean up the Excel database. The final result is
exactly the same as importing through “Microsoft Query”, which is more complicated for those
who are not acquainted with manipulating databases (see the steps in the following exhibits).
[Exhibit: importing through “Microsoft Query” — select the dataset and send the entire set, or variable by
variable, to the right pane; set Condition 1 to import only the cases of children aged 6; set Condition 2 to
import only the cases of “boys”; the database can be sorted while importing with the selected variables; in
the options, set the location of the imported database (the query can be saved for future use). The output
contains 656 cases (plus one row of variable names) in the imported database for the “aged 6 boys”.]
2.2 Creating Frequency and Crosstab Tables
The Excel function “FREQUENCY” is useful for creating an entire frequency table from a range of cells
or from a variable in a database table. Alternatively, “COUNTIFS” can be used to get the
appropriate value for each cell of a frequency table or crosstab table.
Using FREQUENCY Function
“FREQUENCY” is a worksheet function under “Statistical functions” category. It counts how often
values occur within a range of values, and then returns a vertical array of numbers. For example,
use FREQUENCY to count the number of males and females among the household members.
Because FREQUENCY returns an array, it must be entered as an array formula.
The following steps construct a table presenting the sex distribution of
household members, in both absolute numbers and percentage distribution, using the FREQUENCY
function. The variable to be used is “HV104” with the codes “1=Male” and “2=Female” in the
imported database “SPSS_Training”.
1. Prepare the table structure, formulas and “bin” array as in the following exhibit;
2. Select cell “B3” and type in “=FREQUENCY(SPSS_Training[HV104],$G$3:$G$4)”;
3. Select the range “B3:B4”;
4. Press “F2” to enter formula-editing mode, and press “<Ctrl><Shift><Enter>” to re-enter the
formula as an array formula; and
5. Set the display formats of the number cells and the table, as necessary.
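The bin-based counting that FREQUENCY performs can be sketched outside Excel. The following Python snippet is a minimal sketch of the function's semantics (count values up to and including each bin bound); the HV104 values are illustrative, not taken from the actual dataset:

```python
def frequency(data, bins):
    """Mimic Excel's FREQUENCY: count values in (previous bin, bin]."""
    counts = []
    lower = float("-inf")
    for b in sorted(bins):
        counts.append(sum(1 for v in data if lower < v <= b))
        lower = b
    return counts

# Illustrative HV104 values (1 = Male, 2 = Female)
hv104 = [1, 2, 2, 1, 2, 1, 1, 2, 2, 2]
print(frequency(hv104, [1, 2]))  # [4, 6] -> 4 males, 6 females
```

As in Excel, the number of counts returned matches the number of bins supplied (Excel adds one extra entry for values above the last bin, omitted here for simplicity).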
Using COUNTIF or COUNTIFS Function
A frequency table can also be constructed using the COUNT functions. The above frequency table
can be constructed as follows:
1. Prepare the table structure, formulas and “codes” as in the previous example;
2. Select cell “B3”, and type in “=COUNTIF(SPSS_Training[HV104],G3)”;
3. Copy “B3” and paste at “B4”; and
4. Apply the display formats of the number cells and the table, as necessary.
Note: In the formula, “=COUNTIFS(SPSS_Training[HV104],G3)” can also be used in this example.
COUNTIF allows only one condition while COUNTIFS can be used with multiple conditions.
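The difference between COUNTIF (one condition) and COUNTIFS (several conditions combined with AND) can be sketched in Python; the sample columns below are illustrative, not drawn from the dataset:

```python
def countif(values, criterion):
    # One condition, like Excel COUNTIF with an equality criterion
    return sum(1 for v in values if v == criterion)

def countifs(*pairs):
    # Several (values, criterion) pairs, combined with AND like COUNTIFS
    columns = [p[0] for p in pairs]
    criteria = [p[1] for p in pairs]
    return sum(
        1
        for row in zip(*columns)
        if all(v == c for v, c in zip(row, criteria))
    )

hv104 = [1, 2, 2, 1, 2, 1]   # sex: 1 = Male, 2 = Female
hv105 = [6, 6, 7, 6, 6, 8]   # age
print(countif(hv104, 1))                  # 3 males
print(countifs((hv104, 1), (hv105, 6)))   # 2 males aged 6
```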
Using COUNTIFS Function to construct a crosstab table
Although the “FREQUENCY” function cannot be used to construct a crosstab table, the “COUNTIFS”
function can be used to get the value of each and every cell of the table. The following
example elaborates how to construct a more complicated crosstab table of educational attainment
(HV109) by sex of household members (HV104) for the population aged 15-24 (Age: HV105):
1. Prepare the table structure, formulas and “codes” for both variables;
2. Select cell “B5”, and type in: =COUNTIFS(SPSS_Training[HV109],$I5,SPSS_Training[HV104],B$14,SPSS_Training[HV105],">14") -
COUNTIFS(SPSS_Training[HV109],$I5,SPSS_Training[HV104],B$14,SPSS_Training[HV105],">24");
Here, the first COUNTIFS counts the population “aged 15 and above” (HV105 > 14) of the
specific education level and sex, and the second COUNTIFS counts the population
“aged 25 and above” (HV105 > 24) with the same characteristics. Therefore, the difference
represents the population “aged 15-24”.
3. Copy “B5” and paste to the range “B4:C11”; and
4. Complete the formulas, apply the display formats, etc., as necessary to obtain the
following output table.
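The “greater than 14” minus “greater than 24” trick used in the formula can be verified with a small sketch; the records below are hypothetical (education HV109, sex HV104, age HV105):

```python
# Hypothetical (education HV109, sex HV104, age HV105) records
records = [(1, 1, 16), (1, 2, 20), (2, 1, 30), (1, 1, 23), (2, 2, 14)]

def count_above(edu, sex, min_age_exclusive):
    # Like COUNTIFS(..., HV105, ">n"): equality on edu and sex, ">" on age
    return sum(
        1 for e, s, a in records
        if e == edu and s == sex and a > min_age_exclusive
    )

# Males (sex 1) with education level 1, aged 15-24:
cell = count_above(1, 1, 14) - count_above(1, 1, 24)
print(cell)  # 2 (ages 16 and 23 qualify; nobody over 24 is subtracted)
```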
As described above, frequency and crosstab tables can be constructed in Microsoft Office Excel.
However, construction of such tables is much more complicated if the sampling procedure
requires “weighting”. In this case, construct the tables with “weight on” in PASW Statistics and
export the outputs to Microsoft Office Excel for further elaboration and presentation.
2.3 PivotTables (OLAP Cubes)
Unweighted frequency and crosstab tables with multi-layers, which are useful in analyzing
household survey data, can be constructed in Microsoft Office Excel with the PivotTable technique. A
PivotTable is an interactive way to quickly summarize large amounts of data, to conduct in-depth
analysis and to answer unanticipated questions about the data. It is especially designed for:
Querying large amounts of data in many user-friendly ways;
Subtotaling and aggregating numeric data; summarizing by categories and subcategories,
and creating custom calculations and formulas;
Expanding and collapsing levels of data to focus the results, and drilling down to details
from the summary data for areas of interest;
Moving rows to column or columns to rows to see different summaries of the source data;
Filtering, sorting, grouping, and conditionally formatting the most useful and interesting
subset of data to enable focus on the required information; and
Presenting concise, attractive, and annotated online or printed reports.
In a PivotTable, each column in the source data (or database) becomes a PivotTable field (a “field”
in Excel is a “variable” in PASW Statistics) that summarizes multiple rows of information. A value
field provides the values to be summarized. By default, data (of the variables) in the “Values” area
summarize the underlying source data in the PivotTable using: the SUM function for the numeric
variables, and the COUNT function for the text (string) variables.
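The default COUNT aggregation that a PivotTable applies to categorical data can be sketched as a nested tally; the (sex, education) rows below are illustrative only:

```python
from collections import defaultdict

# Illustrative (sex, education) rows from a source table
rows = [(1, 0), (1, 1), (2, 1), (2, 1), (2, 2)]

# Education on the rows, sex on the columns, COUNT in the values area
pivot = defaultdict(lambda: defaultdict(int))
for sex, edu in rows:
    pivot[edu][sex] += 1

for edu in sorted(pivot):
    print(edu, dict(pivot[edu]))
```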
To create a PivotTable, first define its source data, specify a location in the workbook or the
database table, and lay out the fields as follows:
1. Select the sheet with imported database and click “Insert” tab in the main menu;
2. Click “PivotTable” button to get the “Create PivotTable” dialog box;
3. Since the active worksheet contains the imported “SPSS_Training” database table, it
will appear automatically in the “Table/Range” selection box. However, users can
change the data source to another table or to a specific range (e.g., A1:C2000);
4. Select where to place the PivotTable: “New Worksheet” or “Existing Worksheet”,
and if “Existing Worksheet” is selected, user should provide the first cell address;
In this example, just leave it as default “New Worksheet”; and
5. Click “OK” to create a new worksheet with “PivotTable creation tools”.
Then a new worksheet, equipped with tools to assist in creating a PivotTable, will be created,
and the following tools become available for creating, elaborating and editing the PivotTable.
6. From the “PivotTable Field List”, select variables (or fields) and drag and drop them to:
(a) Values: the variables used for the actual summarization (count, sum, etc.);
(b) Row Labels: the variables to be displayed on the rows (can be nested);
(c) Column Labels: the variables to be displayed on the columns (can be nested);
(d) Report Filter: the variables to be used for filtering/subsetting the database;
As soon as a variable is dragged and dropped into a box, the PivotTable placeholder on the
worksheet will be replaced with an actual PivotTable with default settings.
Construction of a “PivotTable” will be demonstrated by creating a crosstab table of “Educational
attainment by Sex for Population Aged 15-24”.
To do this, first define which variables (or fields) are to be put into which box (“value, row, column
or filter”) to get the required table. In this example, educational attainment (HV109) is the key
variable to be explored, and it will also display the education levels in the rows.
Therefore, drag HV109 from the PivotTable field list and drop it into both:
(a) value (to count how many persons in each category), and
(b) row (to display education levels in rows).
The following PivotTable showing the “frequency of HV109” will then be created:
The items displayed in rows can also be selected. For example, eight items: 0, 1, 2, 3, 4, 5,
8, and (blank), are displayed in cells A4 through A11. Since the code “8” represents “unknown”
and “(blank)” is simply a “missing value”, these two items should not be displayed in the frequency
table, or at least not the item “(blank)”. To remove them, just click on the dropdown next to “Row Labels”,
uncheck the box next to “(blank)”, and click “OK”. However, this refinement will be carried out only
when finalizing the PivotTable in this example.
The next step is to place “Sex (HV104)” into the column box to obtain the following crosstab table:
Here, “value labels” can be directly typed into a PivotTable, and the new labels will replace the
defaults. For example, the column labels “1” can be replaced with “Male” and “2” with “Female”.
These fine-tunings will be carried out only when finalizing the PivotTable.
The current PivotTable represents the entire household population irrespective of age, but the
requirement is only for the “population aged 15-24”. To fulfill this requirement, the cases must be
filtered by “age”. Therefore, send the variable “age (HV105)” to the “filter” box. It should be noted
that, although the filtering variable is set, the table will be unchanged since no filtering is in place.
Therefore, click on the “dropdown” icon next to “(All)” in cell B2, then, tick “Select Multiple
Items” checkbox, and leave the ticks only for the ages between 15 and 24 inclusively.
The above exhibit presents the PivotTable after tuning up the captions (value labels) and column widths.
A PivotTable can be copied, in whole or in part, for use in other purposes.
A PivotTable is even more useful if multiple tables with the same structure are required for different groups
(e.g. for different ages), or presenting the same table with selected rows and/or columns only. For
example, the same table for adults (aged 15+) can be created by clicking dropdown icon next to
“(Multiple Items)” in Cell B2, first, tick “(All)”, and clear off ticks next to “0”, “1”, “2”, …, “14”
(see A). Similarly, to create a table for all adults but with “up to complete primary” education only,
click the dropdown icon next to “Row Labels” and select only the first three categories (see B).
As seen in these examples, the PivotTable method is user-friendly, powerful and efficient for analyzing
household survey data, especially for surveys applying “self-weighting” sampling designs.
2.4 Drawing Pivot Charts
PivotChart provides a graphical representation of the data in a PivotTable. The layout and data that
are displayed in a PivotChart can be changed just as in a PivotTable. A PivotChart always has an
associated PivotTable that uses a corresponding layout. Both of them have fields that correspond to
each other, that is, when changing the position of a field in the PivotTable, the corresponding field
in the other report also moves.
In addition to the series, categories, data markers, and axes of standard charts, PivotChart reports
have some specialized elements that correspond to the PivotTable, as follows:
Filter field: A field to filter data by specific items. In the example, the “age” field
displays data for all ages. To display data for a single age or selected ages, click the
drop-down arrow next to (All) and then select a number or numbers.
Values field: A field from the underlying source data that provides values to compare or
measure. Depending on the source data of the report, the summary function can be
changed to Average, Count, Product, or another calculation.
Series field: A field that is assigned to a series orientation in a PivotChart. The items in
the field provide the individual data series. In a chart, series are represented in the legend.
Item: Items represent the unique entries in a column or row field, and appear in the drop-
down lists for report filter fields, category fields, and series fields. Items in a category
field appear as the labels on the category axis of the chart. Items in a series field are listed
in the legend and provide the names of the individual data series.
Category field: A field from the source data assigned to a category orientation in a
PivotChart report. It provides the individual categories for which data points are charted.
In a chart, categories usually appear on the x-axis, or horizontal axis, of the chart.
Customizing the chart: The chart type and other options (such as, the titles, the legend
placement, the data labels, the chart location, and so on) can be changed.
A PivotChart can be created automatically when creating a PivotTable or from an existing
PivotTable. To create a PivotChart from an existing PivotTable, follow the steps:
1. Select any place (cell) on the existing PivotTable, two new menu items “Options”
and “Design” will be added (under “PivotTable Tools” group) in the main menu;
2. Click “PivotChart” under “Options” tab to get the “Insert Chart” dialog box;
3. Choose “Chart Type” from the “Insert Chart” dialog box; and
4. Click “OK” to create a basic PivotChart together with a “PivotChart Filter Pane”.
A PivotChart created automatically is a “draft”; in particular, it has no chart title. Therefore, it must
be edited using the following tools, which become available when clicking on an active PivotChart:
For example, to add “Education Level by Sex (Aged 15+)” as the chart title above the drawing
(chart), click “Chart Title” under the “Layout” command, and select the third option, “Above
Chart”. After a few other touch-ups, such as moving the legend, changing the chart design, and
putting in border lines for the plot area, the following PivotChart is successfully created and
ready to use.
Another useful adjustment in both PivotTables and PivotCharts is to display the “values” not as
absolute numbers but as percentages. To review the percentage distribution of education level by
sex for adults:
1. Click on the “dropdown” of the variable in the “value” area;
2. Select “Value Field Settings…”;
3. Select “Show values as” tab in the “Value Field Settings” dialog box;
4. Select “% of column” in “Show values as” dropdown list; and
5. Click “OK”.
Then, the following table and chart will be obtained after adjusting the display formats, especially
the number of decimal places in the percentages.
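The “% of column” setting simply divides each cell by its column total. A minimal sketch with made-up counts:

```python
# Made-up counts: rows are education levels, columns are sexes
table = {"Male": [10, 30, 60], "Female": [20, 20, 60]}

# "% of column": each cell divided by its column total, times 100
percent = {
    sex: [v / sum(values) * 100 for v in values]
    for sex, values in table.items()
}
print(percent["Female"])  # [20.0, 20.0, 60.0]
```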
2.5 Elaborating PASW Outputs for Better Presentation
Although PivotTable and PivotChart are a user-friendly and efficient way to present household
survey data in both tabular and graphical form, PASW Statistics provides broader
methods and options, and is capable of using “weights” for complex sampling techniques. On the
other hand, Excel is more familiar to users, and it is easier there to make further analyses of the
output tables from different analyses. Therefore, the best blend is to analyse the dataset in PASW Statistics
and to finalize the outputs in Office Excel.
In this section, the use of “weights” in the calculation of school-age population and the number of children
currently attending school in PASW Statistics, and the calculation and presentation of age-specific
enrolment rates in Microsoft Office Excel 2007, are demonstrated step by step.
Basics of Weighting
In a town with 2 Wards, there are 100 children aged 6-10 in Ward-1 and 50 in Ward-2. Of those
children, a survey on “schooling status” was conducted by selecting 25 children from Ward-1 and
20 children from Ward-2. It was found out that 5 children (out of 25) from Ward-1 and 6 children
(out of 20) from Ward-2 were not currently in school. Therefore, the percentage of out-of-school
children (say, POS) can be estimated as:
POS (Ward-1) = 5 / 25 x 100 = 20.0%
POS (Ward-2) = 6 / 20 x 100 = 30.0%, and
Percentage of out-of-school children in the town can be estimated as:
POS (Ward 1+2) = 11 / 45 x 100 = 24.4%. …………. (1)
POS (Ward 1+2) = (20.0%+30.0%) / 2 = 25.0%. …………. (2)
Although the percentages of out-of-school children by Ward represent their respective Wards, the
above percentages calculated for the entire town do not represent it correctly. The main reason is that the
samples are not “self-weighting”, being unbalanced between the two Wards: the sampling fraction for Ward-1
is 25 / 100 or 25.0%, while that for Ward-2 is 20 / 50 or 40%. In other words, a child in the sample
from Ward-1 represents 4 children, while a sample child from Ward-2 represents just 2.5 children.
To have a correct estimate for the town, it should be calculated as follows:
Since the POS (Ward-1) is 20.0%, it is expected to have 20 out-of-school children (20.0% x 100)
in Ward-1 and it is expected to have another 15 children (30.0% x 50) in Ward-2.
Therefore, there could be 35 out-of-school children out of 150 children aged 6-10, and the POS
for the Town is (35 / 150 x 100 = 23.3%).
On the other hand, the appropriate number of out-of-school children in Ward-1 and Ward-2 can be
estimated as 5 x 4.0 = 20 (since one in the sample represents 4 children in Ward-1) and 6 x 2.5 = 15.
These numbers 4.0 and 2.5 are known as “sample weight”, and normally provided in the datasets.
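The worked example above can be reproduced in a few lines; the figures come straight from the text:

```python
# Ward figures from the worked example
wards = [
    {"pop": 100, "sample": 25, "oos": 5},   # Ward-1: weight 100/25 = 4.0
    {"pop": 50,  "sample": 20, "oos": 6},   # Ward-2: weight 50/20 = 2.5
]

total_pop = sum(w["pop"] for w in wards)                              # 150
weighted_oos = sum(w["oos"] * w["pop"] / w["sample"] for w in wards)  # 20 + 15 = 35
pos_town = weighted_oos / total_pop * 100
print(round(pos_town, 1))  # 23.3
```

Compare this with the naive pooled estimate of 24.4% (11 / 45) or the unweighted average of 25.0%: only the weighted figure matches the town's true rate.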
In PASW Statistics, it is easy to apply weights if it is provided in the dataset:
1. Click “Data” on the main menu;
2. Click “Weight Cases…” and the “Weight Cases” dialog box will appear;
3. In “Weight Cases” dialog box, set “Weight cases by”;
4. Select the variable representing the “weight” (it is HV005 – Sample weight in the
DHS dataset); and
5. Click “OK” to complete weighting process.
When weighting is no longer necessary, select “Do not weight cases” in Step 3 above and click
“OK” to stop weighting.
The following tables represent population aged 6-10 by sex with and without weighting.
The differences due to weighting can be observed in the percentage distribution of population by
age and sex. Similarly, the following tables present weighted and unweighted number of children
currently attending school (HV110 – Member still in school) by age and sex.
From these two sets of tables, one can calculate proportion of children currently attending school
by age and sex or percentage of out-of-school children by age and sex, in Excel.
Since it is easier to export all outputs from the PASW Statistics Viewer, first clear unnecessary
outputs, such as logs, notes, and case processing summaries; then export to Excel:
1. Click “File” on the main menu;
2. Click “Export…” and the “Export Output” dialog box will appear;
3. In “Export Output” dialog box, set:
a) “All” in “Objects to Export”;
b) “Excel (*.xls)” in “Document Type”;
c) Provide “File Name” with folder path; and
d) Click “OK” to begin exporting outputs to Excel.
At the end of this process, an Excel file, “POS.xls”, will be placed in the specified folder with the four
cross-tabulation tables exported from PASW. The tables can be seen in the following exhibit.
By adding three more columns to the last two tables and making a simple calculation, dividing the number of
children in school by the total number of children of the respective age and sex, the required percentage of
children in school can be obtained easily. That is not so simple in PASW. The percentages of children in school
by age and sex are presented in the following tables, with the titles and captions rephrased.
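The extra columns amount to one division per age and sex. A sketch with hypothetical weighted counts (not the survey's actual figures):

```python
# Hypothetical weighted counts per age: children in school vs. all children
in_school = {6: 80, 7: 85, 8: 90, 9: 88, 10: 84}
total     = {6: 100, 7: 100, 8: 100, 9: 100, 10: 100}

# Age-specific enrolment rate = children in school / all children * 100
asr = {age: in_school[age] / total[age] * 100 for age in total}
print(asr[6])  # 80.0
```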
To demonstrate the visualization of data through charts, the percentage of children in school by age
(or Age-Specific Enrolment Rate) will be presented in “3-D Clustered Column” and “Line” charts
which are appropriate for the data. To create a 3-D Clustered Column chart, follow these steps:
1. On the table, (a) select Cell “A36”, unmerge and type “Age” into Cell “B37”;
similarly, (b) select Cell “A43”, unmerge and type “6-10” into Cell “B43”;
2. Select the “Data Source” to create chart: age (B37:B43 - X-axis), percentage of
children in school for male (F37:F43 - Series 1) and for female (G37:G43 - Series 2);
3. Click “Insert” on the main menu;
4. Click “Column” to get the list of available Column Charts;
When the user hovers over “Column”, concise but useful information, “Column
charts are used to compare values across categories”, will pop up. Similar
information pops up when pointing at other chart types as well.
5. Click “3-D Clustered Column” icon, the first one under “3-D Column” group;
Then the following draft chart, based on the provided data, will be displayed instantly.
6. The next step is to finalize the chart in Excel:
a. Click on the chart, and click again on “Layout” under “Chart Tools”;
b. Click “Chart Title”, select “Above Chart” to insert a space for chart title, and
type “Age-Specific Enrolment Rate by Sex, Aged 6-10” into that space; and
c. Click “Axis Titles”, set “Primary Horizontal Axis Title” to appear as “Title Below
Axis”, and type “Age” into the space that appears.
At this stage, the chart is usable. However, more polishing could be carried out such as:
d. To change the location of legend (just select, drag and drop at new location);
e. To change the gap width between items (select one series, right-click to get pop-
up menu, click “Format data series”, and set “Gap width/depth”);
f. To change the series colour (select one series, right-click to get pop-up menu,
click “Format data series”, and set colour in “Fill”);
g. To format any … (select that item, right-click to get pop-up menu and set); and
h. To move or resize the chart, chart title, legend, etc...
The following chart will be obtained after putting in a few final touches:
The same procedure should be carried out to create a line graph, except that the selected data range covers
ages 6 to 10 but not the total (aged 6-10). Line charts are normally used to display trends over
time or age; therefore, putting the total (aged 6-10) in the series would misinform viewers.
On the other hand, it is useful to include both sexes in the line graph, and the differences can be
observed more clearly if the rates begin at 60% instead of 0%. A few such adjustments to the above line
chart will yield the following final one.
3. TIPS AND EXERCISES
3.1 Tips: Do and Don’t
i) Do… export to Excel from PASW Statistics with data in “labels” as much as
possible, rather than exporting only numeric data values;
Don’t… import PASW Statistics datasets to Excel without having codebook (the
coding scheme used in creating dataset) or questionnaire with codes.
ii) Do… practice importing PASW Statistics dataset to Office Excel 2007 and
check the correctness of database table in Excel by constructing frequency
tables;
Don’t… edit the imported database before saving it, or leave the computer with unsaved
files.
iii) Do… use autofilter on one or more fields (variables) when extracting data with certain
criteria or reviewing invalid cases (data validation);
Don’t… forget to release the autofilter from the fields no longer in use; otherwise
the cases may be wrongly filtered.
iv) Do… practice using, and use PivotTable and PivotChart as and where
appropriate;
Don’t… edit a PivotTable or PivotChart too heavily or undo several times; it
may hamper the computer’s performance or hang it completely.
v) Do… use PivotTable technique to create frequency and crosstab tables, and
check the outputs thoroughly;
Don’t… blindly trust computer outputs. Don’t use those tables and charts in presentations
or dissemination before completing a thorough check.
3.2 Self-evaluation
Are you able to work with Microsoft Excel 2007 to:
a. import an SPSS dataset? Very well / Somewhat well / Not so much / Almost none
b. select some rows (cases) using auto-filter? Very well / Somewhat well / Not so much / Almost none
c. create a frequency table? Very well / Somewhat well / Not so much / Almost none
d. construct a two-way (crosstab) table? Very well / Somewhat well / Not so much / Almost none
e. develop a PivotTable? Very well / Somewhat well / Not so much / Almost none
f. create a PivotChart? Very well / Somewhat well / Not so much / Almost none
Are you confident that you can export selected output tables from PASW Statistics to Microsoft Office Excel 2007? Confident / Somewhat confident / Not so much / Not at all
Are you confident that you can elaborate PASW Statistics output tables in Microsoft Office Excel 2007? Confident / Somewhat confident / Not so much / Not at all
3.3 Hands-on Exercises
1) Import the attached “BDPR50FL(Validate).sav” into Excel.
2) From the dataset obtained in Exercise 1 above, validate the database table in Excel
for various errors, and recommend, with reasons, whether the imported database is valid
to use.
3) Import the attached “BDPR50FL1.sav” into Excel and extract cases with “out-of-school
children aged 6-10”.
4) Create PivotTable and PivotCharts to present “percentage of out-of-school children
aged 6-15” by Division.