Module B

Page 1: Module B

Training Module B

i

Module B Glossary

ANOVA:

ANOVA stands for analysis-of-variance. It is

a collection of statistical models, and their

associated procedures, in which the

observed variance is partitioned into

components due to different explanatory

variables. In its simplest form ANOVA

provides a statistical test of whether or not

the means of several groups are likely to be

equal.

Chi-square tests:

A statistical hypothesis test in which the

sampling distribution of the test statistic is a

chi-square distribution when the null

hypothesis is true, or any in which this is

asymptotically true, meaning that the

sampling distribution (if the null hypothesis

is true) can be made to approximate a chi-

square distribution as closely as desired by

making the sample size large enough.

Cleaning database:

A process to increase the accuracy of the

data and streamline the database, by

removing/correcting duplicate and wrong

data in the database.

Codebook:

A document used for implementing a code.

It reports dictionary information such as

variable names, variable labels, value labels,

and missing values.

Coefficient of variation (CV):

A normalized measure of dispersion of a probability distribution, calculated as the ratio of the standard deviation to the mean.
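As a rough illustration of the definition above, the coefficient of variation can be computed as the sample standard deviation divided by the mean. The function name and the household-size figures below are hypothetical and are not taken from any survey.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean

# Hypothetical household sizes from a small sample (illustrative only)
household_sizes = [2, 4, 4, 4, 5, 5, 7, 9]
cv = coefficient_of_variation(household_sizes)
print(f"CV = {cv:.3f}")
```

Because the CV is a ratio, it has no unit, which is why it can be used to compare the spread of variables measured on different scales.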

Cohort Survival Rate (CSR):

The percentage of enrollees at the

beginning grade or year in a given school

year who reached the final grade.
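A minimal sketch of the calculation implied by this definition is given below. The function name and the enrolment figures are assumptions made for illustration; real calculations usually reconstruct the cohort from promotion, repetition and dropout rates.

```python
def cohort_survival_rate(entrants_first_grade, reached_final_grade):
    """Percentage of a first-grade cohort that reached the final grade."""
    if entrants_first_grade <= 0:
        raise ValueError("the cohort must contain at least one entrant")
    return 100.0 * reached_final_grade / entrants_first_grade

# Hypothetical cohort: 1,200 pupils entered Grade 1 and 900 reached the final grade
csr = cohort_survival_rate(entrants_first_grade=1200, reached_final_grade=900)
print(f"CSR = {csr:.1f}%")
```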

Correlation:

A single number that describes the degree

of relationship between two variables.

Correlations are useful because they can indicate a predictive relationship, and possibly a causal or mechanistic one.

Coverage:

The extent or degree to which the entire

study area is observed, analyzed, and

reported by the survey.

Cross tabulation (Crosstab):

This displays the joint distribution of two or more variables. It is usually presented as a contingency table in a matrix format.

Whereas a frequency distribution provides

the distribution of one variable, a

contingency table describes the distribution

of two or more variables simultaneously.
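The idea of a contingency table can be sketched with a few lines of code. The example below counts records by two variables at once; the field names ("sex", "status") and the records themselves are hypothetical.

```python
from collections import Counter

def crosstab(records, row_key, col_key):
    """Count records for every combination of two categorical variables."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # A Counter returns 0 for combinations that never occur
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}

# Hypothetical roster records for school-age children
children = [
    {"sex": "female", "status": "attending"},
    {"sex": "female", "status": "attending"},
    {"sex": "female", "status": "dropout"},
    {"sex": "male", "status": "attending"},
    {"sex": "male", "status": "never attended"},
]
table = crosstab(children, "sex", "status")
for sex, row in table.items():
    print(sex, row)
```

Each row of the result is the frequency distribution of schooling status for one sex, so the table shows both variables simultaneously.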

Data collection:

A process of preparing and collecting data

to keep on record, to make decisions about

important issues, and to pass information

on to others.

Page 2: Module B

Data preparation:

A process of checking, coding, cleaning and organizing the collected data so that they are ready for analysis.

Data Validation:

A process of ensuring that a program

operates on clean, correct and useful data.

Descriptive statistics:

Statistics used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures.

Together with simple graphics analysis, they

form the basis of virtually every

quantitative analysis of data.

Disaggregation:

A process of breaking down and analyzing an indicator by detailed sub-categories. It also helps in understanding the degree of accuracy of the survey and its limitations.

Educational attainment:

A term commonly used by statisticians to

refer to the highest degree of education an

individual has completed.

Estimation:

Any of numerous procedures used to

calculate the value of some property of a

population from observations of a sample

drawn from the population.

Factor analysis:

A statistical method used to describe

variability among observed variables in

terms of a potentially lower number of

unobserved variables called factors.

Frequency:

The number of times a value or event occurs in a set of data. Frequency procedures provide statistics and graphical displays that are useful for describing different types of variables.

Gender Parity Index (GPI):

A socioeconomic index usually designed to

measure the relative access to education of

males and females. In its simplest form, it is

calculated as the quotient of the number of

females by the number of males enrolled in

a given stage of education.
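In its simplest form, the calculation described above is a single division. The sketch below assumes hypothetical enrolment counts; the same function could be applied to female and male enrolment rates instead of counts.

```python
def gender_parity_index(female_value, male_value):
    """Quotient of the female value by the male value of the same indicator."""
    if male_value == 0:
        raise ValueError("GPI is undefined when the male value is zero")
    return female_value / male_value

# Hypothetical enrolment in a given stage of education
gpi = gender_parity_index(female_value=450, male_value=500)
print(f"GPI = {gpi:.2f}")
```

A value below 1 indicates that fewer females than males are enrolled, and a value above 1 indicates the opposite.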

Household:

A basic residential unit in which economic

production, consumption, inheritance, child

rearing, and shelter are organized and

carried out. Household is broader than

family, which is a group of people related by

blood or marriage such as parents and their

children only.

Household survey:

A process of data collection and analysis for understanding the general situation and exploring specific characteristics of households or the household population.

Imputation:

To substitute some value for a missing data

point or a missing component of a data

point.
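One simple way to impute, shown here only as a sketch, is to replace each missing value with the mean of the observed values. Mean imputation is just one of many methods and is used here as an assumption for illustration; the ages are hypothetical and missing values are represented by None.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical reported ages with two missing entries
ages = [10, None, 12, 14, None]
imputed = impute_mean(ages)
print(imputed)
```

Mean imputation keeps the overall mean unchanged but reduces the variability of the data, so it should be used with care.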

Page 3: Module B

Kurtosis:

A measure of the "peakedness" of the

probability distribution of a real-valued

random variable. Higher kurtosis means

more of the variance is the result of

infrequent extreme deviations, as opposed

to frequent modestly sized deviations.

Linear regression:

An approach to modeling the relationship

between one or more variables denoted y

and one or more variables denoted X, such

that the model depends linearly on the

unknown parameters to be estimated from

the data.

Mean:

The expected value of a random variable.

For a data set, the mean is the sum of the

observations divided by the number of

observations.

Missing value:

This occurs when no data value is stored for

the variable in the current observation.

Missing values are a common occurrence,

and statistical methods have been

developed to deal with this problem.

Nonparametric test:

A statistic (a function on a sample) whose

interpretation does not depend on the

population fitting any parameterized

distributions. Statistics based on the ranks

of observations are one example of such

statistics and these play a central role in

many non-parametric approaches.

OLAP cube:

A multidimensional database that calculates

summary statistics for summary variables

within categories of one or more grouping

variables. The cube allows different views of

the data to be quickly displayed.

Outlier identification:

To identify an observation that is

numerically distant from the rest of the

data.
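A common rule of thumb, assumed here for illustration, flags values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The household sizes below are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical household sizes; the value 40 may be a data-entry error
household_sizes = [3, 4, 4, 5, 5, 5, 6, 6, 7, 40]
outliers = iqr_outliers(household_sizes)
print(outliers)
```

A flagged value is only a candidate for checking against the questionnaire; it is not necessarily wrong.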

Pivot table:

A data summarization tool to create output

table formats. Pivot-table tools can

automatically sort, count, and total the data

stored in one table or spreadsheet and

create a second table.

Population census:

A procedure of systematically acquiring and recording information about the members of a given population. It includes information on household members, which is useful for policy making, planning, monitoring and evaluation.

Sample design:

To determine what kind of people and how

many people you need to interview to

collect data. A decision about sample size

can be made, based on factors such as: time

available, budget and necessary degree of

precision.

Page 4: Module B

Sampling:

A part of statistical practice concerned with

the selection of an unbiased or random

subset of individual observations within a

population of individuals intended to yield

some knowledge about the population of

concern, especially for the purposes of

making predictions based on statistical

inference. It also covers the design of any information-gathering exercise where variation is present.

Skewness:

A measure of the asymmetry of the

probability distribution of a real-valued

random variable.

Standard deviation:

A statistic that tells how tightly the values in a set of data are clustered around the mean. In other words, it is a measure of variability.

Structured Query Language (SQL):

A standard programming language used for

accessing and maintaining a database. The

key feature of the SQL is an interactive

approach for getting information from and

updating a database.

Syntax:

A set of rules that define the combinations

of symbols that are considered to be

correctly structured programs in the

programming language.

T-test:

A statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is true. It is most commonly used to test whether the means of two groups differ significantly.

Validation rule:

A criterion used in the process of data validation, which is carried out after the data have been encoded onto an input medium and involves a data vetting or validation program.

Variable:

A symbol that stands for a value that may

vary. For instance, a variable can be used to

designate a value occurring in a hypothesis

of the discussion.

Visual Binning:

To perform automatic creation of new variables based on grouping contiguous values of existing variables into a limited number of distinct categories. This can create a categorical variable from continuous scale variables.
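The grouping step behind binning can be sketched as follows. The cut points and labels are hypothetical age groups chosen for illustration; they are not the categories used by any particular survey or software package.

```python
from bisect import bisect_right

def bin_value(value, cut_points, labels):
    """Return the label of the bin containing value.

    cut_points holds the lower bound of every bin except the first,
    so there must be exactly one more label than cut points.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]

# Hypothetical age groups: under 6, 6-11, 12-14, 15-17, 18 and above
cut_points = [6, 12, 15, 18]
labels = ["under 6", "6-11", "12-14", "15-17", "18+"]

ages = [5, 6, 11, 12, 17, 18]
age_groups = [bin_value(age, cut_points, labels) for age in ages]
print(age_groups)
```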

Wealth index:

A composite measure of a household's living standard, typically constructed from data on the ownership of household amenities or durables and on housing characteristics.

Weighting:

A process that involves emphasizing some aspects of a phenomenon, or of a set of data, over others.
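In survey analysis, weighting usually means that each sampled record counts for the number of population members it represents. The sketch below compares an unweighted and a weighted proportion; the field names and weights are hypothetical.

```python
def weighted_proportion(records, flag_key, weight_key="weight"):
    """Share of the total weight carried by records where flag_key is true."""
    total = sum(r[weight_key] for r in records)
    if total == 0:
        raise ValueError("the weights must not sum to zero")
    hits = sum(r[weight_key] for r in records if r[flag_key])
    return hits / total

# Hypothetical respondents; each weight is the number of people represented
respondents = [
    {"literate": True, "weight": 100},
    {"literate": False, "weight": 300},
    {"literate": True, "weight": 100},
]
unweighted = sum(r["literate"] for r in respondents) / len(respondents)
weighted = weighted_proportion(respondents, "literate")
print(f"unweighted = {unweighted:.2f}, weighted = {weighted:.2f}")
```

In this made-up example two of three respondents are literate, but the illiterate respondent represents more people, so the weighted estimate is lower than the unweighted one.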

Page 5: Module B

Module B1:

Exploring Household Surveys for EFA Monitoring

Contents

1. Understanding Household Surveys
1.1 Introduction to Household Surveys
1.2 Education Related Questions (or Modules) in Household Surveys
1.3 Inputs from Household Surveys for Aligning Education Policies

2. Brief Information on Common Household Surveys
2.1 Background and Objectives of Selected Surveys
2.2 Structure and Contents of the “Survey Questionnaire”
2.3 Consideration on Sample Design
2.4 Understanding Survey Data Files and Availability of Education Related Data

3. Gathering Survey Data and Getting Ready for Analysis
3.1 Data Sources and Contact Points for Obtaining Census and Survey Data
3.2 Common Obstacles and Approaches in Gathering Population Census and Household Survey Data
3.3 Quality Issues, Challenges and Recommendations in Using Survey Data
3.4 Use of Survey Data along with EMIS Data/Indicators for Policy Analysis

4. Exercises and Further Studies
4.1 Self-evaluation
4.2 Exercises
4.3 Further Studies

5. Annexes

Annex 1: Population and Housing Census

Annex 2: Education Related Questionnaires from Selected Household Survey

Annex 3: Education Related Variables in the Selected Datasets

Annex 4: List of Key EFA Indicators

Purposes and learning outcomes

To gain better understanding of common household surveys

To understand the reasons for the limited use of household survey data in education planning and EFA monitoring

To explore the values added and benefits of data from household surveys for education policies

To recognize the questions in common household surveys, which are directly or indirectly useful in exploring access, quality and management of education, and their determinants

To know the key points to be aware of when analyzing data from household surveys

Page 6: Module B

1. UNDERSTANDING HOUSEHOLD SURVEYS

1.1 Introduction to Household Surveys

A “household” is defined as a basic residential unit in which economic production, consumption, inheritance, child rearing, and shelter are organized and carried out. A household is broader than a family, since a family refers only to a group of people related by blood or marriage, such as parents and their children.

A “household survey” is a process of data collection and analysis for understanding the general situation and exploring specific characteristics of households or the household population. The fieldwork of a household survey investigates and records the facts, observations and experience of sample households, which represent all households in the study area. Tools for data collection include a series of questions, observation checklists and records of discussions.

Nowadays household surveys are conducted in almost every country and territory, either ad hoc or periodically (annually, biennially, or once every three or five years, etc.). There are different types of surveys (see Section 2).

Most education indicators, especially school-based ones, can be derived from the annual school census or the EMIS data collection system. However, EFA monitoring requires more indicators to measure "reaching the unreached", which generally cannot be provided by school data. Some essential EFA indicators, such as those on ethnic minorities, disabled or illiterate populations, and out-of-school children, can be derived only from household surveys.

Page 7: Module B

1.2 Education Related Questions (or Modules) in Household Surveys

Two main components of household survey

A household survey generally uses two different questionnaires: a household roster and at least one detailed or individual questionnaire.

Household roster: this lists all household members and their characteristics, such as age, sex and relationship to the head of household for every member; education and literacy status for persons aged 5 and above; schooling status for those aged 5-24 (or 6-14, 6-19, etc.); and marital status for all adults aged 15 and above.

Detailed or individual questionnaire: this explores the main theme of the study and is sometimes addressed only to specific respondents such as the head of household, married couples, mothers of children under 5, ever-married women, out-of-school children, disadvantaged children, etc.

The fieldwork (data collection) of a household survey is followed by coding, checking and editing, data entry, data verification, data analysis and drafting of the report. The majority of household surveys use SPSS (renamed as PASW) for data analysis and also for the creation of tables, graphs and charts. As such, although a survey may enter data using different programs such as dBase, MS Access, MS Excel, CSPro or IMPS, the final data files analyzed are available in SPSS data format.

Household survey and population census

The datasets created from household surveys and population censuses [1] normally include information on household members, which is useful for policy making, planning, monitoring and evaluation in education, such as:

(i) population by age and sex (and urban/rural residence in larger surveys), and with special characteristics such as ethnic minority, disability, etc.;

(ii) literacy status of respondents (self-reporting) and other family members (proxy reporting);

(iii) highest educational attainment of the respondent, and population under study; and

(iv) schooling status (currently attending, dropout or never attended) of children at the school-going ages.

Apart from the above mentioned information, several household surveys could provide migration

status of household members, and socio-economic characteristics of household such as:

(v) birth place and/or place of residence five or ten years ago;

(vi) number of income earners in the household;

(vii) household income and expenditure (in some cases, separate health and education

expenditures);

(viii) possession of household amenities or durables; and

(ix) food security; and so on.

As such, data from household surveys and population censuses can complement the school-based data [2] by providing information on aspects of children's background that may influence household schooling decisions and the school participation of children (such as enrollment and/or school attendance).

Household surveys provide a broader variety of information, while a population census provides more accurate data on the age and sex structure, and on the education and literacy attainment, of the entire population.

[1] A population census is a type of household survey with broader coverage. By international agreement, a census consists of an enumeration of the entire population in the specified area regularly at a marked time interval.

[2] The Ministry of Education, through EMIS (Education Management Information System), regularly collects school-based data and normally processes and provides limited information on the individual characteristics of pupils, such as age, sex, grade and performance (flow rates), and little information on the characteristics of their households.

Page 8: Module B

1.3 Inputs from Household Surveys for Aligning Education Policies

Household surveys and population censuses could also provide data on adult educational attainment and reported literacy skill (that is, reported by the respondent) by household characteristics such as whether the household is rich or poor, resides in an urban, rural or remote area, or is far from or near to the school.

Key education indicators possible to derive from surveys

The following common education indicators, which are essential in formulating and aligning education policies, and in preparing, monitoring and evaluating education development programmes and projects, could be derived from common household surveys and population censuses. [3]

1) Adult Literacy Rate (for population aged 15 and above);

2) Youth Literacy Rate (for population aged 15-24);

3) Illiteracy rates for different population groups, especially for the vulnerable groups such as

females, ethnic minorities, disabled persons, and those from poor families and remote areas;

4) Educational attainment, measured by the number of years attended school or highest level

of schooling or proportion of adult population who completes primary or secondary school

(adult primary and secondary school completion rates);

5) Gross and net intake rates for primary Grade 1;

6) Gross and net enrolment rates by education level or by age;

7) Transition rates (from primary to lower secondary, and lower to upper secondary level);

8) Student flow rates (promotion, repetition and dropout rates); and

9) Out of School Children.

Moreover, some other measures such as gender parity index, cohort survival rate and measure of

internal efficiency could be derived from the above indicators.
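As a sketch of how some of these indicators could be derived from a household roster, the example below computes adult and youth literacy rates and a primary net enrolment rate. The roster records, the field names and the official primary age range of 6-11 are all assumptions made for illustration; the estimates are unweighted, whereas a real analysis would normally apply the survey's sampling weights.

```python
def literacy_rate(roster, min_age, max_age=None):
    """Percentage of persons in the age range who are reported literate."""
    group = [p for p in roster
             if p["age"] >= min_age and (max_age is None or p["age"] <= max_age)]
    if not group:
        return None
    return 100.0 * sum(p["literate"] for p in group) / len(group)

def net_enrolment_rate(roster, official_ages, level):
    """Percentage of official-age children attending the given level."""
    youngest, oldest = official_ages
    eligible = [p for p in roster if youngest <= p["age"] <= oldest]
    if not eligible:
        return None
    enrolled = [p for p in eligible if p["attending_level"] == level]
    return 100.0 * len(enrolled) / len(eligible)

# Hypothetical roster records (one dictionary per household member)
roster = [
    {"age": 7, "literate": False, "attending_level": "primary"},
    {"age": 9, "literate": True, "attending_level": "primary"},
    {"age": 10, "literate": True, "attending_level": None},
    {"age": 11, "literate": True, "attending_level": "primary"},
    {"age": 16, "literate": True, "attending_level": "secondary"},
    {"age": 20, "literate": True, "attending_level": None},
    {"age": 35, "literate": True, "attending_level": None},
    {"age": 60, "literate": False, "attending_level": None},
]

adult_rate = literacy_rate(roster, min_age=15)               # aged 15 and above
youth_rate = literacy_rate(roster, min_age=15, max_age=24)   # aged 15-24
primary_ner = net_enrolment_rate(roster, (6, 11), "primary")
print(adult_rate, youth_rate, primary_ner)
```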

One important benefit of constructing education indicators from household surveys is the “ability to compare the indicators among different population groups”, such as:

a. males vs. females;

b. ethnic minorities vs. other ethnic groups;

c. disabled persons vs. general population;

d. those living in remote areas vs. urban/rural areas;

e. families with different wealth levels (measured by quintiles of household expenditure per capita or ownership of household amenities).

Such information cannot be obtained from regular school-based data collection, and it is important in measuring the achievement of education policies and in aligning education policies for the future.
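The comparison among wealth levels can be sketched as follows: households are ranked by expenditure per capita, split into five equal groups (quintiles), and an indicator is then computed for each group. The field names and figures are hypothetical, and the sketch ignores sampling weights and household size for simplicity.

```python
from collections import defaultdict

def assign_quintiles(households, key="exp_per_capita"):
    """Rank households by key and label them 1 (poorest) to 5 (richest)."""
    ranked = sorted(households, key=lambda h: h[key])
    n = len(ranked)
    return [dict(h, quintile=(i * 5) // n + 1) for i, h in enumerate(ranked)]

def rate_by_group(records, group_key, flag_key):
    """Percentage of records with a true flag, for each group."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        hits[r[group_key]] += bool(r[flag_key])
    return {g: 100.0 * hits[g] / totals[g] for g in sorted(totals)}

# Hypothetical households: expenditure per capita and whether the child attends school
expenditures = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
attending = [False, False, False, True, True, False, True, True, True, True]
households = [{"exp_per_capita": e, "child_attending": a}
              for e, a in zip(expenditures, attending)]

attendance_by_quintile = rate_by_group(
    assign_quintiles(households), "quintile", "child_attending")
print(attendance_by_quintile)
```

The same `rate_by_group` helper could be applied to any other grouping variable in the roster, such as sex or area of residence.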

Utilization of household survey data in education

All this information is very valuable for education policy makers and planners; however, it is not fully utilized for several reasons:

lack of awareness of the existence and accessibility of survey data, even within the same ministry, due to bureaucratic procedures, cost, and not knowing where to find or how to request such data;

little information on education and literacy is presented in the main report, with only a few paragraphs or just a section on education in general household survey reports;

[3] See “Guide to the Analysis and Use of Household Survey and Census Education Data” (UIS, 2004, pp. 13-21) for a detailed framework for analysis and further discussion.

Page 9: Module B

additional analyses of education and literacy status are very rare; and

lack of knowledge and skill on how to capitalize on education and literacy data from surveys, particularly to facilitate evidence-based policy formulation, implementation and monitoring.

As a result, only a few researchers and consultants from international agencies use the education and literacy data from surveys to undertake a few additional studies. However, most such studies are academically oriented or aimed at serving the specific project purposes set by the international organization. They seldom provide the information needed for policy recommendations.

It is crucial to build the capacity of staff from the Ministry of Education and line ministries to analyze survey data, so that the findings from surveys are reflected and incorporated into policy formulation, program implementation, monitoring and evaluation, including those for achieving EFA goals. [4]

[4] See Annex 4 for the list of key EFA indicators.

Page 10: Module B

2. BRIEF INFORMATION ON COMMON HOUSEHOLD SURVEYS

2.1 Background and Objectives of Selected Surveys

Multiple Indicator Cluster Survey (MICS)

The Multiple Indicator Cluster Survey is a household survey developed by UNICEF to assist

countries in filling data gaps for monitoring the situation of children and women. It is capable of

producing statistically sound, internationally comparable estimates of these indicators. MICS was

originally developed in response to the World Summit for Children to measure progress towards an

internationally agreed set of mid-decade goals. The first round of MICS was conducted around

1995 in more than 60 countries, and the second round was conducted in 2000 (around 65 surveys).

The third round of MICS was carried out from 2005 onwards (more than 50 countries). It was focused

on providing a monitoring tool for the World Fit for Children, the Millennium Development Goals

(MDGs), as well as for other major international commitments, such as the United Nations General

Assembly Special Session (UNGASS) on HIV/AIDS and the Abuja targets for malaria. At least 21

MDG indicators can be collected in the current round of MICS, offering the largest single source of

data for MDG monitoring.

Results from the surveys, including national reports, standard sets of tabulations and micro level

datasets are available at UNICEF's web site www.childinfo.org.

Demographic and Health Survey (MEASURE DHS)

Since 1984, the Demographic and Health Survey (DHS) Project has provided technical assistance to more than 200 demographic and health surveys in 75 countries, advancing global understanding of health and population trends in developing countries. In 1997, DHS became one of four components of the “Monitoring and Evaluation to Assess and Use Results” (MEASURE) Program. [5]

The MEASURE DHS Project has gained a worldwide reputation for collecting and disseminating accurate, nationally representative data on health and population in developing countries. The project is implemented by Macro International, Inc. and is funded by the United States Agency for International Development (USAID) with contributions from other donors such as UNICEF, UNFPA, WHO and UNAIDS.

Since October 2003 Macro International has been partnering with four internationally experienced

organizations to expand access to and use of the DHS data: The Johns Hopkins Bloomberg School

of Public Health/Center for Communication Programs; Program for Appropriate Technology in

Health (PATH); Blue Raster; The Futures Institute.

[5] MEASURE Program - Together, the four MEASURE partners (MEASURE DHS, MEASURE Evaluation, MEASURE U.S. Census Bureau - Survey and Census Information, Leadership, and Self Sufficiency (SCILS), and MEASURE Centers for Disease Control and Prevention - Division of Reproductive Health (CDC/DRH)) provide a full range of related services, which include promoting the demand for quality data; providing technical assistance, training, systems development, data collection and analysis, and capacity-building services; and disseminating information and facilitating its use in decision-making. (See http://www.measureprogram.org/)

Every year, different types of household surveys are conducted for different purposes in almost every country. The three most common household surveys in this region, namely the Multiple Indicator Cluster Survey (MICS), the Demographic and Health Survey (MEASURE DHS), and the Living Standards Measurement Study (LSMS), are discussed in this section together with the population census.

Page 11: Module B

The DHS surveys collect information on fertility, reproductive health, maternal health, child health, immunization and survival, HIV/AIDS, maternal mortality, child mortality, malaria, and nutrition among women and children. The strategic objective of MEASURE DHS is to improve and institutionalize the collection and use of data by host countries for program monitoring and evaluation and for policy development decisions.

LSMS – Living Standards Measurement Study

LSMS was established by the Development Economics Research Group (DECRG) of the World Bank to explore ways of improving the type and quality of household data collected by statistical offices in developing countries. LSMS is a research project that was initiated in 1980 and has carried out several rounds of surveys in more than 30 countries. The program is designed to assist policy makers in their efforts to identify how policies could be designed and improved to positively affect outcomes in health, education, economic activities, housing and utilities, etc.

Objectives of LSMS include:

to improve the quality of household survey data;

to increase the capacity of statistical institutes to perform household surveys;

to improve the ability of statistical institutes to analyze household survey data for policy

needs; and

to provide policy makers with data that can be used to understand the determinants of

observed social and economic outcomes.

LSMS provides users with actual household survey data for analysis, and also with links to reports and research done using LSMS data.

Population Census

The oldest type of household survey with broader coverage is the “population census”. By international agreement, a census consists of an enumeration of the entire population in the specified area regularly at a marked time interval. While enumerating the population, questions may be asked concerning certain characteristics of each person, such as age, sex, marital status, education, employment status, and more. Therefore, a census basically provides data on the number and composition of the entire population at a given time, and on selected socio-economic and educational characteristics of the household population in the country.

Since it is based on the complete enumeration of all households in the country, a census can

provide valuable information for policies and the planning of socio-economic development from

the national to the lowest administrative levels. Moreover, census is the source for constructing

sampling frames for selecting households and population for other surveys.

Population censuses are carried out once every 10 years in most countries, or once every 5 years in some economically advanced countries. As such, the census is the most comprehensive source of demographic and socio-economic data for several countries.

Although the main objective of a census is to get reliable population data, the latest United Nations guidelines [6] for preparing population censuses emphasize collecting data on literacy, school attendance, educational attainment, field of study and educational qualifications.

[6] “Principles and Recommendations for Population and Housing Censuses”, United Nations Statistical Office, 1998.

Page 12: Module B

2.2 Structure and Contents of the “Survey Questionnaire”

2.2.1 Questionnaire Used in Multiple Indicator Cluster Survey (MICS)

MICS uses three main questionnaires in every survey:

(i) household questionnaire,

(ii) questionnaire for women aged 15-49, and

(iii) questionnaire for children under the age of 5.

The Household Questionnaire comprises household characteristics, household listing, education, child labor, water and sanitation, salt iodization, insecticide-treated mosquito nets (ITNs), and support to children orphaned and made vulnerable by HIV/AIDS, with optional modules for disability, child discipline, security of tenure and durability of housing, source and cost of supplies for ITNs, and maternal mortality.

A. Household Identification

B. Household Listing Form

Page 13: Module B

C. Education Module

2.2.2 Questionnaire used in MEASURE DHS

Although DHS surveys aim to collect data to understand fertility; reproductive, maternal and child health; immunization, survival and nutrition; maternal and child mortality; HIV/AIDS; and malaria, the key household questionnaire covers several questions on education and its differentials. The following are extracts from the DHS Model Household Questionnaire.

A. Household Identification

Page 14: Module B

B. Listing of all Household Members - 1

C. Listing of all Household Members - 2

Page 15: Module B

2.2.3 Questionnaire used in Living Standards Measurement Survey (LSMS)

LSMS is a comprehensive survey. Its questionnaire set contains (i) household, (ii) community and (iii) price questionnaires. The household questionnaire extends over 100 pages, covering 15 sections including education.

The education section of the LSMS questionnaires has three sections on four pages as follows:

Page 16: Module B

Ref: LSMS Working Paper 130 "Model Living Standards Measurement Study Survey Questionnaire

for the Countries of the Former Soviet Union" by Raylynn Oliver.

2.2.4 Population and Housing Censuses

As mentioned above, a census covers each and every person in the country, and is the most reliable source of population data. The household roster used in censuses contains basic information on all household members, such as age, sex, marital status, education and literacy status, together with household characteristics such as location and type of residence, and availability of services.

The Viet Nam 2009 Population and Housing Census questionnaire includes the following questions on the education and literacy status of the entire population. Combined with the age, sex, residence, migration and disability status recorded in other questions, literacy, educational attainment, and participation in and access to education could be analyzed for different population groups.

For a further case study, please refer to Annex 1.


2.3 Consideration on Sample Design

A census is based on all households in the study area (a region, a territory or a country), so the entire household population is included in data collection. Some households may not respond during census taking, but they are comparatively few and generally negligible. Because a census is a complete enumeration, it does not require a sample design, and the data and indicators derived from it are actual values, not estimates.

A household survey, on the other hand, collects data from selected households only, and provides estimates (of characteristics or indicators) for the entire household population in the area based on the experience of the sample households. The quality (accuracy of the estimates) and the usefulness of a household survey depend on the following points.

i) Sampling method (how the sample households are selected);

Common sampling methods include SRS (Simple Random Sampling), PPS (Probability Proportional to Size), cluster sampling, multi-stage sampling, and purposive sampling.

ii) Coverage (whether the entire study area is covered by the survey);

To represent the entire area, sample households must be selected from all households in the area (country or region) using a random sampling method. Some household surveys select only from households with specific characteristics (e.g. poultry farmers) or from pre-assigned parts of the area (e.g. households beyond a 3-mile radius from a school); their results do not represent the entire area.

iii) Sample size (how many households are selected) and allocation of the sample (how the sample households are allocated to different parts of the area); and

iv) Data analysis (how estimates of the key indicators are obtained, the expected standard errors of the estimates, and the pre-determined level of disaggregation, e.g. by age, sex, grade, region, socio-economic status, etc.).

The sample design of a household survey comprises the above information and is generally documented in the survey report.

For data users (secondary analysts), it is important to know the sampling method and sample size of the survey before undertaking any analysis. Accuracy will be lower if the estimates are not calculated in line with the sampling method of the survey. Similarly, the survey method and the allocation of the sample households are essential in deciding whether, and which, weights should be applied in data analysis. Moreover, the actual coverage of the survey, the sample size and the planned level of disaggregation help data users understand the limitations of the survey, including whether the desired disaggregation is feasible at the required degree of accuracy.

The data analyst should first check the sample design through the accompanying documents, such as the survey report or service contract, and/or through contact persons at the survey organization.

Example:

Suppose a survey was designed to give reliable estimates down to the provincial level by sex. If the adult illiteracy rate is then computed by district and by sex for adults living in remote areas with the lowest socio-economic status (lowest quintile), the derived estimates will not be reliable. On the other hand, some surveys are designed to capture specific and rare events. In such surveys the sample size is large, and thus sufficient to estimate common education indicators at lower levels with acceptable accuracy.


2.4 Understanding Survey Data Files and Availability of Education Related Data

This section highlights the education-related variables in the main datasets of three common household surveys, together with sample outputs for selected variables.

Education Related Variables in MICS Sample Dataset

In the MICS sample dataset, four SPSS data files are generated for: (i) households, (ii) individual household members, (iii) women aged 15-49, and (iv) children under 5. MICS datasets are shared with a wide range of users. The second data file, which covers all individual household members (the household listing, hl.sav), contains the education and literacy status of the population, including school-age children. The sample “hl.sav” data file contains 183 variables for 29,560 cases (persons), and the following 21 variables are useful for analyzing education and literacy.

HH1 Cluster number

HH2 Household number

HL1 Line number

HL3 Relationship to the head

HL4 Sex

HL5 Age

HL6 Area (urban / rural)

ED2 Ever attended school

ED3A Highest level of sch. attended

ED3B Highest grade at level

ED4 Currently attending school (2004-05)

ED5 Days attended school in last week

ED6A Level of education attended

ED6B Grade of education attended

ED7 Attended school last year (2003-04)

ED8A Level of education attended last year

ED8B Grade of education attended last year

melevel Mother's education

helevel Education of HH head

hhweight Household sample weight

wlthind5 Wealth index quintiles

The following tables, which are useful in analyzing the schooling status of children aged 5-14, are

derived from the sample data file “hl.sav”.

Please see Annex 1 for more case studies.
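As a rough illustration of how such tables might be derived, the sketch below builds a weighted cross-tabulation of current school attendance (ED4) by sex (HL4) for children aged 5-14, using the variable names listed above. The records are made up for illustration, the codes (1 = male, 2 = female for HL4; 1 = yes, 2 = no for ED4) are assumptions that should be checked against the codebook, and the helper names `weighted_crosstab` and `attendance_rate` are hypothetical. In practice the records would be read from hl.sav rather than typed in.

```python
from collections import defaultdict

def weighted_crosstab(rows, row_var, col_var, weight_var="hhweight"):
    """Sum the sample weights in each (row_var, col_var) cell."""
    table = defaultdict(float)
    for r in rows:
        table[(r[row_var], r[col_var])] += r[weight_var]
    return dict(table)

def attendance_rate(table, sex):
    """Weighted percentage of children of the given sex who currently attend school."""
    attending = table.get((sex, 1), 0.0)
    total = sum(w for (s, _), w in table.items() if s == sex)
    return 100 * attending / total

# Hypothetical person records (not taken from the MICS sample dataset).
rows = [
    {"HL4": 1, "HL5": 7,  "ED4": 1, "hhweight": 1.5},
    {"HL4": 1, "HL5": 12, "ED4": 2, "hhweight": 0.5},
    {"HL4": 2, "HL5": 9,  "ED4": 1, "hhweight": 1.0},
    {"HL4": 2, "HL5": 14, "ED4": 1, "hhweight": 2.0},
    {"HL4": 2, "HL5": 30, "ED4": 2, "hhweight": 1.0},  # adult, excluded below
]

# Restrict to children aged 5-14 before tabulating.
children = [r for r in rows if 5 <= r["HL5"] <= 14]
table = weighted_crosstab(children, "HL4", "ED4")

for sex, label in ((1, "Male"), (2, "Female")):
    print(label, round(attendance_rate(table, sex), 1))
```

Applying the weight in every cell, rather than counting cases, is what makes the resulting percentages estimates for the population instead of descriptions of the sample.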


3. GATHERING SURVEY DATA AND GETTING READY FOR ANALYSIS

3.1 Data Sources and Contact Points for Obtaining Census and Survey Data

Population Census: Censuses are conducted regularly, every five or ten years, and cover the entire country. Complete census databases are confidential and are not shared with the public or third-party users. However, government education departments can request subsets of those databases once the census reports have been fully published. Census databases are normally maintained by the Census Bureau, Census Department or Central (or General) Statistical Office of the country. Alternatively, if the Ministry of Education specifies the population and education-related data it requires in tabular form and submits the request through higher-level (ministerial) authorities, the census authorities will generate and provide the requested tables.

The major drawback of census data is the long time lag. It takes over a year to produce clean census databases, and the census reports are published two to three years after the census. As such, the Ministry of Education can obtain education-related datasets only two years or more after the census. There may also be a long delay in providing the requested database subsets or tables. Therefore, few education ministries use census databases; most request only population data, especially projections of the school-age populations.

Household surveys: Household survey data are available more frequently than population census data. Moreover, the conducting agencies are generally willing to share their datasets upon a simple formal request. With a smaller workload, conducting agencies can create survey databases faster, and most reports are available within twelve months of the completion of fieldwork (data collection).

Access to datasets varies by survey and from country to country. All major household surveys conducted or sponsored by international organizations have their own websites.

Please refer to “Further studies” for more information.

Although population census and household survey datasets are rich in information, they are difficult to obtain and sometimes hard to understand. This section discusses the contact points and gives some tips on how to get quality data from different sources.


3.2 Common Obstacles and Approaches in Gathering Population Census and Household Survey Data

As mentioned above, population censuses and household surveys contain useful data for EFA monitoring. However, there are limitations.

- Common obstacles in gathering population census databases

i) It is difficult to locate the person (or department) who has the authority to provide census datasets to third-party users.

ii) The census questionnaire is often developed without coordination with other ministries and departments, including the education ministry, so the questionnaire items may not be directly useful for constructing education indicators.

iii) A census is normally conducted once every 10 years, and the data can be obtained only 2 to 3 years or more after its completion. Thus, census data are more useful for reviewing historical trends than for revealing the current situation.

iv) Census data are collected during the school holidays. The census date rarely coincides with the beginning of the school year, which is the reference date for calculating common education indicators. As such, there may be minor discrepancies between indicators calculated from the census and those from regularly collected service statistics.

In many countries, very few household survey questionnaires are developed by education-related ministries and agencies. The questionnaires are set by the conducting agency and merely distributed to the education ministry for comment or for information. Compared with population census data, however, household survey data are easier for education ministries to obtain.

- Main barriers in using household survey data for EFA monitoring (see footnote 7)

i) Variation in measures of educational participation

Survey questions on educational attainment and current school attendance are phrased quite differently from survey to survey. In many cases, assumptions have to be made in calculating common education indicators.

For example, a survey asks about (1) the highest grade completed by household members, and (2) whether the person is currently attending school. To calculate the net enrollment rate (NER) or gross enrollment rate (GER) from these questions, an assumption is required about the level/grade currently attended by the household member: if a child has completed Grade 4 and currently attends school, the child is assumed to be attending Grade 5.
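The assumption described above can be sketched in a few lines of code. This is only an illustration: the primary cycle (Grades 1-5, official ages 6-10), the record layout and the function names are hypothetical, and the person records are made up.

```python
# Hypothetical settings: primary education covers Grades 1-5 for official ages 6-10.
PRIMARY_GRADES = range(1, 6)
PRIMARY_AGES = range(6, 11)

def assumed_current_grade(completed, attending):
    """An attending child is assumed to be one grade above the highest grade completed."""
    return completed + 1 if attending else None

def in_primary(record):
    """True if the person is assumed to be attending a primary grade."""
    grade = assumed_current_grade(record["completed"], record["attending"])
    return grade is not None and grade in PRIMARY_GRADES

def enrolment_rates(records):
    """Return (NER, GER) in percent, using the sample weights."""
    school_age_pop = sum(r["weight"] for r in records if r["age"] in PRIMARY_AGES)
    ger_num = sum(r["weight"] for r in records if in_primary(r))
    ner_num = sum(r["weight"] for r in records
                  if in_primary(r) and r["age"] in PRIMARY_AGES)
    return 100 * ner_num / school_age_pop, 100 * ger_num / school_age_pop

# Made-up records: age, highest grade completed, currently attending, weight.
records = [
    {"age": 7,  "completed": 1, "attending": True,  "weight": 10},
    {"age": 10, "completed": 4, "attending": True,  "weight": 10},  # assumed Grade 5
    {"age": 9,  "completed": 2, "attending": False, "weight": 10},
    {"age": 12, "completed": 4, "attending": True,  "weight": 12},  # over-age pupil
    {"age": 11, "completed": 5, "attending": True,  "weight": 12},  # assumed Grade 6
]

ner, ger = enrolment_rates(records)
print(f"NER = {ner:.1f}%, GER = {ger:.1f}%")
```

The over-age pupil counts towards the GER but not the NER, which is why the GER can exceed 100%. A child who is repeating a grade would be misclassified by this assumption, which is one reason the resulting indicators may differ from those based on administrative data.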

7 This portion is extracted from: “Guide to the Analysis and Use of Household Survey and Census Education Data (UIS, 2004)”.

Tips:

How to get census data faster and more smoothly for analysis?

i) When seeking census data, it is better to make contact at the ministerial level. If a lower-level education planner approaches the census department/agency directly, the request may wait for days on end and never receive a proper response.

ii) Limit the number of variables in the requested dataset. By requesting only the data needed to meet the minimum requirements, education planners may get a faster response and can conduct analyses more easily. Census datasets are very large, and subsetting or analyzing them takes time if many unused variables are included.

ii) Timing and duration of survey fieldwork


When considering education data from household surveys, take into account the timing (when the survey started, or the date to which it refers) and the duration (how long data collection took). If a survey started just before the end of the school year and took over a month, the grade completed or attended may differ from household to household depending on whether the interview was conducted in the early or the later days of the survey. This may not be a problem for surveys that set a clear reference date, as population censuses do.

iii) Sample size and sampling method

A household survey is designed to provide facts on, or characteristics of, the population at a certain period through a representative sample of households. The representativeness of the sample depends on the survey design, which is influenced by three factors: the sampling method used, the level of accuracy sought in the estimates for various indicators, and the level of data disaggregation.

Some surveys, especially rapid assessments and case-control studies, do not use probability sampling techniques, and thus the findings may not represent the entire population under study. Surveys that aim to estimate common characteristics with moderate accuracy require a smaller sample size, while estimating a rare characteristic (or event) with higher accuracy requires a larger sample size. Similarly, estimation at the national (and provincial) level only requires a smaller sample size, while finer sub-stratification (such as district level or lower) needs a larger one.

Therefore, it is important to check which sampling method was used in the survey under study, and whether the sample size is sufficient for the particular education indicators at the desired level of disaggregation.

EFA monitoring indicators generally aim to explore differences among population groups, such as between the general population and disadvantaged groups. The sample size of a particular household survey may or may not be sufficient to compute indicators for a disadvantaged group living in a certain area, depending on the definition of “disadvantaged population” and the level of disaggregation.

If the sample size is not sufficient for the required disaggregation, it is recommended either to reduce the level of disaggregation, or to compute the required indicators at the desired level and present the results with a clear caveat about their reliability.


3.3 Quality Issues, Challenges and Recommendations in Using Survey Data

Generally speaking, data files made available for analysis should be “cleaned”: checked for structural and range errors and edited for internal consistency. Provisions that compensate for non-response should also be incorporated into the files and fully explained in the accompanying documentation.

The first step after acquiring a dataset is to become familiar with its structure and the nature of its variables, the circumstances of data collection, and any limitations on the use of the dataset. The documentation for a census or household survey, such as reports and a codebook, provides important background information, such as sample size and data quality indicators.

Data manipulation and analysis can be demanding and complex. The following discussion does not provide a comprehensive set of guidelines for the use of datasets; instead, it reviews some key issues to be considered in analyzing survey data.

(1) Become familiar with the structure of the dataset and explore appropriate ways to analyze it

First, find out whether records within the data files are at the household or individual level, and second, whether household or individual weights should be used in estimation procedures. Since sample surveys do not cover the entire population (all households or all individuals) in an area, weighting factors are required to reconstitute the characteristics of the entire population from the sample. For example, if a survey selects 5 households from each of two enumeration areas (EAs) of 50 and 60 households respectively, then the household weight for each of the 5 sample households is 10 (50/5) in the first EA and 12 (60/5) in the second. The weights are calculated while planning the survey and are provided in the dataset (see footnote 8).
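The weighting arithmetic in the example above can be written out as a short sketch. It assumes equal-probability selection of households within each EA; the function name `design_weight` and the data layout are hypothetical, and real survey weights usually also reflect the selection probability of the EA itself.

```python
def design_weight(households_in_ea, households_sampled):
    """Number of households that each sample household represents within its EA."""
    if households_sampled <= 0:
        raise ValueError("at least one household must be sampled")
    return households_in_ea / households_sampled

# The two enumeration areas from the example: 50 and 60 households, 5 sampled in each.
eas = {"EA1": (50, 5), "EA2": (60, 5)}
weights = {ea: design_weight(total, sampled) for ea, (total, sampled) in eas.items()}

# Summing the weights over all sample households recovers the total number of households.
estimated_total = sum(weights[ea] * sampled for ea, (_, sampled) in eas.items())

print(weights)
print(estimated_total)
```

The check at the end is a useful habit with any weighted dataset: the sum of the weights should come close to the known size of the population the sample is meant to represent.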

(2) Study the variables in the datasets before analysis

It is important to refer to the original questionnaires to understand the variables and how to analyze the data. For example, to analyze the literacy status of the population, one should know the nature of the variable, such as: its codes (for example, ‘1=literate’, ‘2=illiterate’); restrictions (whether the question was asked of all ages, or of those aged 5+ or 15+); relationship to other questions/variables (whether it was asked of everybody, or only of those persons who answered ‘no education’ or ‘incomplete primary’ to the question on “highest education level”); and missing values (code ‘9’) and non-response (code ‘8’ for the variable “literacy status”). Only then can the data analyst determine which variables to select and how to handle them to produce the required indicator estimates efficiently.
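To make the point about codes and restrictions concrete, the following sketch computes a weighted literacy rate using the codes mentioned above (1 = literate, 2 = illiterate, 8 = non-response, 9 = missing). The age restriction of 15+, the record layout and the figures are assumptions for illustration only; the actual codes and restrictions must be taken from the questionnaire and codebook of the survey in hand.

```python
# Codes as described in the text for the variable "literacy status".
LITERATE, ILLITERATE, NON_RESPONSE, MISSING = 1, 2, 8, 9

def literacy_rate(records, min_age=15):
    """Weighted literacy rate (%) among persons aged min_age or over.

    Records coded 8 (non-response) or 9 (missing) are left out of both
    the numerator and the denominator.
    """
    valid = [r for r in records
             if r["age"] >= min_age and r["literacy"] in (LITERATE, ILLITERATE)]
    total = sum(r["weight"] for r in valid)
    literate = sum(r["weight"] for r in valid if r["literacy"] == LITERATE)
    return 100 * literate / total

# Made-up person records.
records = [
    {"age": 20, "literacy": LITERATE,     "weight": 10.0},
    {"age": 34, "literacy": ILLITERATE,   "weight": 10.0},
    {"age": 51, "literacy": LITERATE,     "weight": 12.0},
    {"age": 40, "literacy": MISSING,      "weight": 12.0},  # dropped: missing value
    {"age": 12, "literacy": LITERATE,     "weight": 10.0},  # dropped: under 15
    {"age": 28, "literacy": NON_RESPONSE, "weight": 10.0},  # dropped: non-response
]

rate = literacy_rate(records)
print(f"Adult literacy rate: {rate:.2f}%")
```

Treating codes 8 and 9 as if they were substantive answers would distort the rate, which is why the codes need to be understood before any tabulation.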

(3) Replicate published results before proceeding with additional calculations

If there are reports of results from the data collection activity, try to replicate these results

before calculating any new indicators. Sorting out the difficulties with calculations already done

will bolster confidence in producing new results.

(4) Consider the issue of missing values

Non-response in a survey or census can happen in one of two ways. First, the entire record representing an individual or household may be missing because the individual or household refused to answer, was not available, could not be contacted, etc.; this is called “total non-response”. The second type arises when variables within a record are missing, and is termed “item non-response”. Item non-response is common for variables representing a question that was not asked of, or not known for, all household members, such as whether a child attends school during the current school year.

8 For detailed explanations of weighting, see C-E Särndal et al. (1992) “Model Assisted Survey Sampling”, Springer-Verlag; and W.G. Cochran (1977) “Sampling Techniques”, John Wiley & Sons.


A technique called “imputation” is often used to compensate for missing values in the case of item non-response. Imputation replaces missing values with the most suitable ones based on other cases in the same dataset. The resulting file, complete or “square”, allows better estimates when constructing new indicators. Therefore, the data analyst must know how missing item values were treated in the dataset.

In the case of total non-response, a weight adjustment method is often used. That is, non-response records are omitted from the dataset and the weights are recalculated. In this case, the dataset contains two sets of weights, the “sample weight” and the “adjusted/final weight”, and users must employ the final weight in calculating indicators.
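The module does not give a formula for the weight adjustment. One common approach, assumed here purely for illustration, is to inflate the weights of responding households within each adjustment class (taken to be the enumeration area) so that they also carry the weight of the non-respondents. The function name, record layout and figures below are hypothetical.

```python
from collections import defaultdict

def adjust_for_nonresponse(households):
    """Return responding households with a final weight that absorbs non-respondents.

    Each household is a dict with keys 'ea', 'sample_weight' and 'responded'.
    Within each EA: final weight = sample weight * (weight of all selected
    households / weight of responding households).
    """
    selected = defaultdict(float)
    responded = defaultdict(float)
    for h in households:
        selected[h["ea"]] += h["sample_weight"]
        if h["responded"]:
            responded[h["ea"]] += h["sample_weight"]

    adjusted = []
    for h in households:
        if h["responded"]:  # non-response records are omitted from the output
            factor = selected[h["ea"]] / responded[h["ea"]]
            adjusted.append({**h, "final_weight": h["sample_weight"] * factor})
    return adjusted

# Made-up sample: 5 households per EA, with one refusal in EA1.
sample = [{"ea": "EA1", "sample_weight": 10.0, "responded": i != 0} for i in range(5)]
sample += [{"ea": "EA2", "sample_weight": 12.0, "responded": True} for _ in range(5)]

final = adjust_for_nonresponse(sample)
print(len(final), sum(h["final_weight"] for h in final))
```

The four responding households in EA1 end up with a final weight of 12.5 instead of 10, so the weighted total still adds up to all 110 households. This is why the final weight, not the original sample weight, should be used when calculating indicators.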

(5) Calculate a measure of accuracy (the coefficient of variation) for the basic estimates to gauge the reliability of the estimated indicators

Depending on the overall sample size of the survey, some tabulations may yield cells with very small numbers of cases. Indicators estimated from those tables may not be reliable. For this reason, it is essential to calculate a measure of accuracy and to disseminate it alongside each basic estimate, so that users can gauge the reliability of all estimates produced. A commonly used measure is the coefficient of variation (CV).

The coefficient of variation (CV) is defined as the square root of the variance of the estimate, divided by the estimate itself and multiplied by 100, that is, expressed as a percentage.

National statistical offices often advocate basic quality guidelines under which estimates with CVs greater than 35% should not be used to draw statistical inferences and should not be released to the public. Be sure to account properly for complex survey designs in analysis, particularly when calculating variances.
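The CV definition and the 35% guideline can be expressed as a small sketch. The estimates and variances below are made-up figures, the function names are hypothetical, and the variance is taken as given: under a complex design it has to be estimated with a method that accounts for that design.

```python
import math

CV_RELEASE_THRESHOLD = 35.0  # percent, the guideline quoted in the text

def coefficient_of_variation(estimate, variance):
    """CV in percent: square root of the variance, divided by the estimate, times 100."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return math.sqrt(variance) / estimate * 100

def is_releasable(cv):
    """True if the CV does not exceed the 35% guideline."""
    return cv <= CV_RELEASE_THRESHOLD

# Hypothetical illiteracy rate of 20% estimated at two levels of disaggregation.
cv_national = coefficient_of_variation(20.0, 4.0)   # standard error of 2 points
cv_district = coefficient_of_variation(20.0, 64.0)  # standard error of 8 points

for label, cv in (("national", cv_national), ("district", cv_district)):
    print(f"{label}: CV = {cv:.1f}%, releasable = {is_releasable(cv)}")
```

The same point estimate is acceptable at one level and not at the other, which illustrates why the CV should accompany each published figure.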

In general, national population censuses collect data on all households and individuals in the population, and thus sample design and weighting are not at issue. The only exception is when a different questionnaire with more detailed questions is presented to a sampled fraction of the population. Even then, complex survey design is not an issue, since simple and self-weighting designs (such as Stratified Simple Random Sampling or Systematic Sampling) are generally used.

In the case of complex survey designs, forming the estimate itself (for example, the primary school net enrolment rate (NER)) is not an issue, since the design is easily taken into account by applying the survey weights in the estimator. However, there may be critical issues in variance estimation and thus in CV estimation (see footnote 9).

9 See “Guide to the Analysis and Use of Household Survey and Census Education Data (UIS, 2004, pp 36-37)” for further

discussion on issues concerning weighting and calculation of CV in complex sample designs.


3.4 Use of Survey Data along with EMIS Data/Indicators for Policy Analysis

Administrative and household survey data sources measure educational participation in different ways. Administrative data are based on school reporting at the beginning of the school year and, in some cases, also at the middle or end of the school year. Enrolment rates are based on the number of children enrolled in school and the school-age population estimated from national censuses and/or vital statistics.

Ideally, household surveys collect data on enrolment and/or school attendance based on a representative sample of children. Questions concerning children's school participation are typically asked of the head of household. The timing varies from one survey to another and is unrelated to the school year. Some surveys may even span two different school years.

Limitation of data

Estimates of educational participation from these two sources may differ for a number of reasons. One major factor is that the question asked in household surveys about children's school attendance differs from that answered by school censuses: attending school may differ slightly from being enrolled in school. Children may be recorded in school enrolment records without actually attending school. Thus, the enrolment rates from censuses and surveys may be slightly lower than those from administrative data.

The different rates of participation can also be attributed to the timing of data collection relative to

the school year. A school census conducted at the beginning of the school year and a household

survey collecting data at the end of the school year will likely find different rates of participation

since some children will have enrolled in school without ever actually attending, and other children

will have dropped out of school during the school year.

In addition, the accuracy of the population estimate and the completeness of school-level data can

affect the calculation of participation rates from administrative data. Similarly, the completeness of

the census enumeration and the sample design for the household survey may also affect the

accuracy of estimates produced by censuses and surveys.

In short, many factors may contribute to variations in the estimates of school participation rates

from administrative data and household surveys. Further research is needed to explore the reasons

for similarities or differences between the measures of participation from these two sources.

However, when the school-age population estimates are not accurate and the annual school censuses do not cover several aspects essential for planning and monitoring, only the population census and household surveys can provide reasonable indicators for planning and EFA monitoring. For example, school administrative data cannot provide enrolment rates by socio-economic status of the household or for disadvantaged groups, nor can they provide reasons for non-participation (not enrolling) or dropping out.

As such, it is important to use both school administrative data and secondary data from censuses and surveys for policy analysis, especially for EFA monitoring that aims at reaching the unreached.


4. EXERCISES AND FURTHER STUDIES

4.1 Self-evaluation

How much do you understand why household survey data are essential in EFA monitoring and evaluation? Very well / Somewhat well / Not so much / Almost None

Do you know which common household surveys are conducted in your country? Very well / Somewhat well / Not so much / Almost None

Do you agree that the selected questions in three common household surveys are directly or indirectly useful in exploring access, quality and management of education, and their determinants? Strongly agree / Agree / Not so much / Disagree

Are you able to explain to someone who wants to analyze household survey data the factors to be aware of in the analysis? Very well / Somewhat well / Not so much / Almost None

Are you confident that you could explore a household survey questionnaire and extract key questions which are useful to supplement the regular data collection system for EFA monitoring and evaluation? Confident / Somewhat confident / Not so much / Not at all

4.2 Exercises

i) When was the last population census conducted in your country?

a. Get the census report or tables which may be useful for EFA monitoring.

b. Provide pros and cons for using data from census report(s) for EFA monitoring.

c. Get the census questionnaire and extract the items on education and related to

education.

d. Is it possible to get raw data on education and related fields from the Census Department, and why?

ii) What is the most recent household survey conducted in your region (or country)? Describe the following briefly:

a. When was it conducted?

b. Which sampling method was applied?

c. What was the sample size?

d. Explain briefly about the survey findings on education and literacy provided in

the report.

e. Is the data file (dataset) from that household survey available to you?

iii) Connect to the internet and find the MICS website for your country; then,

a. Collect the questionnaire set for the most recent MICS survey in your country (or

in a neighboring country).

b. Download datasets in SPSS format from the most recent MICS survey for your

country (or for a neighboring country).

c. Study the variables, and compile a list of variables which you think are useful for constructing education indicators, especially for EFA monitoring.

iv) From the DHS website, find a recent report (if possible, for your country) and prepare an abstract which is useful for education planners.

v) If you had a chance to discuss it, what would you want to add to or delete from the LSMS survey questionnaire, and why?


4.3 Further Studies

- International Household Survey Network (See http://www.internationalsurveynetwork.org )

- Luxembourg Income Study (See http://www.lisproject.org/)

- MEASURE DHS (Demographic and Health Surveys):Quality information to plan and improve

population, health, and nutrition program ( See http://www.measuredhs.com/)

- Rand Family Life Survey ( See http://www.rand.org/labor/FLS/ )

- UNESCO Institute for Statistics (UIS). 2004. Guide to the Analysis and Use of Household

Survey and Census Education Data (Can be downloaded at

http://www.uis.unesco.org/template/pdf/educgeneral/HHSGuideEN.pdf )

- UNICEF. Childinfo: Monitoring the Situation of Children and Women (Multiple Indicator

Cluster Survey) ( See http://www.childinfo.org/)

- United Nations Department of Economic and Social Affairs. 2008. Principles and Recommendations for Population and Housing Censuses, Revision 2. (See http://unstats.un.org/unsd/publication/SeriesM/Seriesm_67rev2e.pdf )

- United Nations Population Fund. Collection and using data: population and housing data (See http://www.unfpa.org/data/census.cfm )

- United Nations Statistics Division (See http://unstats.un.org/unsd/default.htm )

- USAID's DHS EdData Activity website ( See http://www.dhseddata.com/ )

- World Bank. Living Standards Measurement Study (LSMS) ( See http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/EXTLSMS/0,,menuPK:3359053~pagePK:64168427~piPK:64168435~theSitePK:3358997,00.html )

- Other organizations with links to education data sources

The William Davidson Institute http://www.wdi.bus.umich.edu/

The Development Gateway http://www.ids.ac.uk/eldis/health/health.htm

University of California http://biko.sscnet.ucla.edu/dev_data/

Country case studies

- NEPAL LIVING STANDARDS SURVEY 2002/03 ( See http://siteresources.worldbank.org/INTLSMS/Resources/3358986-1181743055198/3877319-1181925143929/nlss2_urban.pdf)

- General Population Census of Cambodia 2008

(See http://www.nis.gov.kh/nis/uploadFile/pdf/EnumeratorManual.pdf)

(Household questionnaire refer to p65)

- Vietnam 2009 Population and Housing Census (See http://www.gso.gov.vn )

- 2005 Population and Housing Census of Korea (See http://kostat.go.kr )

- Tanzania poverty monitoring ( See http://www.povertymonitoring.go.tz/index.asp )


5. ANNEXES

Annex1: Population and Housing Census

A1.1 2005 Population and Housing Census of Korea:

This includes just two education items in one question. Even from such limited data, the education and literacy status of the population and the schooling status of children can be studied by age, sex, residence, etc.


A1.2 General Population Census of Cambodia 2008:

This contains the following questions on literacy, education and disability status in the main questionnaire.

Therefore, it is apparent that all population censuses include questions, ranging from a few to several, on the education and literacy status of the entire population.


Annex 2: Education Related Questionnaires from Selected Household Survey

A2.1 Household questionnaire of the Nepal Living Standards Survey (NLSS) 2002/03 (see footnote 10):

This contains a section on education covering (i) literacy, (ii) past enrolment and (iii) current enrolment, as follows:

10 NLSS is an alternative name for the LSMS.


Annex 3: Education Related Variables in the Selected Datasets

A3.1 Nepal’s 2006 DHS Dataset

The dataset from the 2006 Nepal DHS contains seven SPSS data files: (i) Births Recode, (ii) Couples' Recode, (iii) Household Recode, (iv) Individual Recode, (v) Children's Recode, (vi) Male Recode, and (vii) Household Member Recode. The last data file, NPPR51FL.SAV (for individual household members; 44,057 persons x 258 variables), contains all necessary information except for one important differential of access to and attainment of education, the “wealth index” (households grouped into five quintiles based on wealth). The wealth index can be obtained from the third data file, for households. The selected variables from NPPR51FL.SAV are:

HV001 Cluster number

HV002 Household number

HV003 Respondent's line number

HV005 Sample weight

HV024 Region

HV025 Type of place of residence

HV026 Place of residence

HV104 Sex of household member

HV105 Age of household members

HV106 Highest educational level

HV107 Highest year of education

HV108 Education in single years

HV109 Educational attainment

HV121 Member attended school during current school-year

HV122 Educational level during current school-year

HV123 Grade of education during current school-year

HV124 Education in single years - current school-year

HV125 Member attended school during previous school-year

HV126 Educational level during previous school-year

HV127 Grade of education during previous school-year

HV128 Education in single years- previous school-year

HV129 School attendance status

From the above variables, the following frequency tables could be constructed for the children aged

5-14.


A3.2 Albania’s 2005 LSMS Dataset

The 2005 Albania LSMS covered 3,638 households comprising 17,302 persons. The survey datasets are available on the LSMS website. Since the LSMS questionnaire covers several topics and items, the datasets are split into several files. The data files directly concerned with education are educationa_cl.sav (for preschool education), educationb_cl.sav (for general education and literacy), and household_rostera_cl.sav (for age and sex).


The selected variables from those datasets are:

hhid household identifier

m2b_q00 ID code

m1a_q02 Sex

m1a_q5y Age - Years

m2b_q01 Can read newspaper

m2b_q02 Can write personal letter

m2b_q04 Highest level

m2b_q05 Highest Grade

m2b_q07 Years of preschool

m2b_q09 Currently attending school

m2b_q10 Reason for not attending

m2b_q14 Intends to return to school

m2b_q16 Current level

m2b_q17 Current Grade

m2b_q18 Public - Private

m2b_q20 Distance from dwelling

m2b_q22 Hours to travel

m2b_q23 Minutes to travel

m2b_q24 Transport to school

m2b_q49 Absent from school

m2b_q50 Days missed

m2b_q51 Reason missed school

From the above variables, literacy (read and write) and schooling status for the children aged 7-14

could be analyzed as seen in the following tables:
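Because age and sex sit in the roster file while literacy and attendance sit in the education file, the two files have to be matched person by person before any table can be produced. The sketch below shows one possible way to do this in Python with invented records. The roster key name `person_id` is hypothetical (the roster's actual identifier is not listed above), and the codes 1 = yes, 2 = no are assumptions to be checked against the codebook.

```python
# Invented extracts; real data would come from household_rostera_cl.sav and
# educationb_cl.sav. "person_id" is a made-up name for the roster's person
# identifier, and the codes 1 = yes, 2 = no are assumptions.
roster = [
    {"hhid": 101, "person_id": 1, "m1a_q02": 1, "m1a_q5y": 8},
    {"hhid": 101, "person_id": 2, "m1a_q02": 2, "m1a_q5y": 13},
    {"hhid": 102, "person_id": 1, "m1a_q02": 2, "m1a_q5y": 10},
    {"hhid": 102, "person_id": 2, "m1a_q02": 1, "m1a_q5y": 35},
]
education = [
    {"hhid": 101, "m2b_q00": 1, "m2b_q01": 1, "m2b_q02": 1, "m2b_q09": 1},
    {"hhid": 101, "m2b_q00": 2, "m2b_q01": 1, "m2b_q02": 2, "m2b_q09": 1},
    {"hhid": 102, "m2b_q00": 1, "m2b_q01": 2, "m2b_q02": 2, "m2b_q09": 2},
    {"hhid": 102, "m2b_q00": 2, "m2b_q01": 1, "m2b_q02": 1, "m2b_q09": 2},
]

YES = 1


def merge_persons(roster_rows, education_rows):
    """Attach education answers to roster rows sharing household and person IDs."""
    education_by_key = {(row["hhid"], row["m2b_q00"]): row for row in education_rows}
    merged = []
    for person in roster_rows:
        key = (person["hhid"], person["person_id"])
        if key in education_by_key:
            merged.append({**person, **education_by_key[key]})
    return merged


def literacy_summary(persons, min_age=7, max_age=14):
    """Percentage of children who can both read and write, and who attend school."""
    children = [p for p in persons if min_age <= p["m1a_q5y"] <= max_age]
    if not children:
        return {"children": 0, "literate_pct": None, "attending_pct": None}
    literate = sum(1 for p in children if p["m2b_q01"] == YES and p["m2b_q02"] == YES)
    attending = sum(1 for p in children if p["m2b_q09"] == YES)
    return {
        "children": len(children),
        "literate_pct": round(100 * literate / len(children), 1),
        "attending_pct": round(100 * attending / len(children), 1),
    }


summary = literacy_summary(merge_persons(roster, education))
print(summary)
```

Here a child counts as literate only when both the reading and the writing question are answered yes; a different definition would change only the condition inside `literacy_summary`.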

Annex 4: List of Key EFA Indicators

Goal 1: ECCE

Data source markers (in printed order): (H) (H) (S) S (S) (S) (S) (S) (S)

1. Gross Enrolment Ratio (GER) in ECCE programmes

2. Percentage of new entrants to primary Grade 1 who have attended some form of organized ECCE programme

3. Enrolment in private ECCE centres as a percentage of total enrolment in ECCE programmes

4. Percentage of trained teachers in ECCE programmes

5. Public expenditure on ECCE programmes as a percentage of total public expenditure on education

6. Net Enrolment Ratio (NER) in ECCE programmes including pre-primary education

7. Pupil/Teacher Ratio (PTR) (child-caregiver ratio)

Goal 2: UPE

Data source markers (in printed order): H H H H (H) (H) (H) (H) (H) (H) (H) (H) S S S S S S S S S S S S S S S S S

8. Gross Intake Rate (GIR)

9. Net Intake Rate (NIR)

10. Gross Enrolment Ratio (GER)

11. Net Enrolment Ratio (NER)

12. Percentage of repeaters

13. Repetition Rate (RR) by grade

14. Promotion Rate (PR) by grade

15. Dropout Rate (DR) by grade

16. (Cohort) Survival Rate to Grade 5

17. Primary Cohort Completion Rate

18. Transition Rate (TR) from primary to secondary education

19. Percentage of trained teachers in primary education

20. Pupil/Teacher Ratio (PTR) in primary education

21. Public expenditure on primary education as a percentage of total public expenditure on education

22. Percentage of schools offering complete primary education

23. Percentage of primary schools offering instruction in the mother tongue

24. Percentage distribution of primary school students by duration of travel between home and school

Goal 3: Lifelong learning

Data source markers (in printed order): H H (H) S (S) (S) (S)

25. Number and percentage distribution of the adult population by educational attainment

26. Number and percentage distribution of young people aged 15-24 years by educational attainment

27. Gross Enrolment Ratio (GER) for technical and vocational education and training

28. Number and percentage distribution of lifelong learning/continuing education centres and programmes for young people and adults

29. Number and percentage distribution of young people and adults enrolled in lifelong learning/continuing education programmes

30. Number and percentage distribution of teachers/facilitators in lifelong learning/continuing education programmes for young people and adults

Note:

H: Household surveys

S: School records and school censuses

(H): If collected by household surveys

(S): If collected from ECCE centers and NFE centers

Goal 4: Adult literacy

Data source markers (in printed order): (H) (H) (S) (S) (S) (S) (S) (S) (S)

31. Adult literacy rate (15 years old and above)

32. Youth literacy rate (15-24 years old)

33. Public expenditure on adult literacy and continuing education as a percentage of total public expenditure on education

34. Number and percentage distribution of adult literacy and basic continuing education programmes

35. Number and percentage distribution of facilitators of adult literacy and basic continuing education programmes

36. Number and percentage distribution of learners participating in adult literacy and basic continuing education programmes

37. Completion rate in adult literacy and basic continuing education programmes

38. Number and percentage of persons who passed the basic literacy test

39. Ratio of private (non-governmental) to public expenditure on adult literacy and basic continuing education programmes

Goal 5: Gender equality

Data source markers (in printed order): H (H) (H) (H) H H H H H H H H S S S (S) S S S S S S S S S S

40. Female enrolled as percentage of total enrolment

41. Female teachers as percentage of total number of teachers

42. Percentage of female school managers/district education officers

43. Gender Parity Index for:

   a. Adult literacy rate (15 years old and above)

   b. Youth literacy rate (15-24 years old)

   c. GER in ECCE

   d. GIR in primary education

   e. NIR in primary education

   f. GER in primary education

   g. NER in primary education

   h. Survival rate to Grade 5

   i. Transition Rate from primary to secondary education

   j. GER in secondary education

   k. NER in secondary education

   l. Percentage of teachers with pre-service teacher training

   m. Percentage of teachers with in-service teacher training

Goal 6: Quality of Education

Data source markers (in printed order): S S S S S S S S S S S

44. Percentage of primary school teachers having the required academic qualifications

45. Percentage of school teachers who are certified to teach according to national standards

46. Pupil/Teacher Ratio (PTR)

47. Pupil/Class Ratio (PCR)

48. Textbook/Pupil Ratio (TPR)

49. Public expenditure on education as a percentage of total government expenditure

50. Percentage of schools with improved water sources

51. Percentage of schools with improved sanitation facilities

52. Percentage of pupils who have mastered nationally defined basic learning competencies

53. School life expectancy

54. Instructional hours
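To make a few of the listed indicators concrete, the sketch below computes the Gross Enrolment Ratio (indicator 10), the Net Enrolment Ratio (indicator 11) and the Gender Parity Index (indicator 43) in Python. It follows the commonly used definitions: GER divides enrolment of all ages by the official school-age population, NER divides enrolment of official-age children by the same population, and GPI is the female value divided by the male value. All figures and function names are invented for illustration.

```python
def gross_enrolment_ratio(enrolment_all_ages, official_age_population):
    """GER: enrolment regardless of age, as a percentage of the official-age population."""
    return 100 * enrolment_all_ages / official_age_population


def net_enrolment_ratio(official_age_enrolment, official_age_population):
    """NER: enrolment of official-age children, as a percentage of that population."""
    return 100 * official_age_enrolment / official_age_population


def gender_parity_index(female_value, male_value):
    """GPI: female value of an indicator divided by the male value."""
    return female_value / male_value


# Invented primary-level figures for one hypothetical district.
figures = {
    "female": {"enrolled_all_ages": 540, "enrolled_official_age": 450, "population": 500},
    "male": {"enrolled_all_ages": 600, "enrolled_official_age": 480, "population": 500},
}

rates = {}
for sex, f in figures.items():
    rates[sex] = {
        "GER": gross_enrolment_ratio(f["enrolled_all_ages"], f["population"]),
        "NER": net_enrolment_ratio(f["enrolled_official_age"], f["population"]),
    }

gpi = {
    name: gender_parity_index(rates["female"][name], rates["male"][name])
    for name in ("GER", "NER")
}

print(rates)
print(gpi)
```

A GER above 100, as in this example, signals over-age or under-age enrolment, which is why GER and NER are usually read together.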

Module B2:

Introduction to PASW Statistics (SPSS for Windows)

Contents:

1. Selecting Example Software for Analyzing Household Survey Data to Assist EFA Monitoring
1.1 CSPro (Census and Survey Processing System)
1.2 EPI Info
1.3 Microsoft EXCEL (with VBA Programming)
1.4 PSPP
1.5 SAS (Statistical Analysis System)
1.6 Stata
1.7 SPSS (Statistical Package for Social Sciences)

2. Introduction to PASW Statistics
2.1 What is SPSS/PASW Statistics?
2.2 Step-by-Step Procedure for PASW Statistics Installation
2.3 Running PASW and Its User Interface

3. Basic Components of PASW Statistics
3.1 Output Viewers
3.2 Pivot Tables
3.3 Charts
3.4 Saving/Exporting Outputs
3.5 Online Help

4. Using Data from Other Sources
4.1 Importing Data from Microsoft Excel
4.2 Importing Data from Delimited ASCII Text Files
4.3 Importing Data from Fixed Width Text Files
4.4 Importing Data from Microsoft Access Databases

5. Tips and Exercises
5.1 Tips: Do and Don’t
5.2 Self-evaluation
5.3 Questions and Hands-on Exercises

Purpose and Learning Outcomes:

To provide background on popular statistical analysis software packages

To understand why SPSS / PASW is chosen as the statistical software for assisting EFA monitoring

To practice installing PASW

To explore the basic features and components of PASW

To understand how to import data from other sources into PASW

1. SELECTING EXAMPLE SOFTWARE FOR ANALYZING HOUSEHOLD SURVEY DATA TO ASSIST EFA MONITORING

1.1 CSPro (Census and Survey Processing System)

CSPro is a public domain statistical package which can be used for entering, editing, tabulating,

and mapping of census and survey data. It is widely used by statistical agencies in developing

countries, especially for data entry (fixed-width text file format).

It was designed and implemented through a joint effort among the developers of the Integrated

Microcomputer Processing System (IMPS) and the Integrated System for Survey Analysis (ISSA):

the United States Census Bureau, Macro International, and Serpro S.A. CSPro was designed to

replace both IMPS and ISSA.

The current version of CSPro is 4.0.003, released on 20 October 2009. There are four key applications (together with several useful utilities) in the CSPro 4.0 application package:

1) A Data Entry Application contains a set of forms (screens) and logic that a data entry operator uses to key data into a file; it can be used to add new data or to modify existing data. Users can create an unlimited number of forms (screens) for data entry, normally as a part of the data entry application.

2) A Batch Edit Application can be used to gather information about a data file, together with several run-time features, including: writing editing rules for checking validity (values in a variable) and consistency (between variables/cases) and for modifying data values; making imputations and generating imputation statistics; generating edit reports automatically or creating a customized report; and creating additional variables.

3) A Tabulation Application contains a set of table specifications (structure) and a data dictionary (an existing or newly defined one) describing the data file to be tabulated. This application can cross-tabulate variables and produce map results by geographical area (if applicable), using both existing variables and new variables created "on the fly". Output tables can contain selected statistics, from simple counts and percents to mean, median, mode, standard deviation, variance, n-tiles, proportions, minimum, and maximum. Tabulations can be made on the values as they are in the data file or by applying weights.

4) A Data Dictionary describes the overall organization of a data file, that is, it provides a description of how data are stored in the file. The data dictionary is the lifeblood of CSPro applications. It must be created for each file being used.
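The role of a data dictionary for a fixed-width text file, and the difference between unweighted and weighted tabulation, can be sketched in a few lines of Python. This is not CSPro's dictionary format or its API; the layout, item names and records below are all invented, and the weight is assumed to be stored with two implied decimals.

```python
from collections import Counter

# Invented dictionary: item name -> (start column, length), columns 1-based.
DICTIONARY = {
    "cluster": (1, 3),
    "household": (4, 2),
    "sex": (6, 1),
    "age": (7, 2),
    "weight": (9, 4),  # assumed to carry two implied decimals
}

# Invented fixed-width records, one case per line.
RAW_RECORDS = [
    "001011080150",
    "001012120150",
    "002051090075",
    "002052450100",
]


def parse_record(line, dictionary):
    """Slice one fixed-width line into named integer items."""
    return {
        name: int(line[start - 1:start - 1 + length])
        for name, (start, length) in dictionary.items()
    }


def tabulate(records, item, weight_item=None):
    """Count cases by item value, optionally applying a weight item."""
    counts = Counter()
    for rec in records:
        counts[rec[item]] += rec[weight_item] / 100 if weight_item else 1
    return dict(counts)


records = [parse_record(line, DICTIONARY) for line in RAW_RECORDS]
unweighted = tabulate(records, "sex")
weighted = tabulate(records, "sex", weight_item="weight")

print(records[0])
print(unweighted)
print(weighted)
```

Without the dictionary the raw lines are just strings of digits, which illustrates why every data file needs a description of its layout before even a simple frequency table can be produced.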

One excellent feature of CSPro is that it requires only very simple and minimal hardware resources to run. The minimum configuration includes (i) a 33MHz 486 processor, (ii) 16MB of RAM, (iii) a VGA monitor, and (iv) Microsoft Windows 98SE (the program runs only on the Microsoft Windows family of operating systems). It is public domain software and can be downloaded at no cost.

All in all, CSPro is among the most suitable software for conducting data entry and initial analyses for general surveys and population censuses. It is widely used in current DHS surveys. However, every data file must have a data dictionary, even for simple data analysis such as constructing frequency tables for selected variables. Therefore, it is not suitable for analyzing a dataset created in other software (or a dataset without a predefined data dictionary).

More than 100 statistical software packages can be found on the web. Some can be run only online; some are free or in the public domain while the rest are proprietary; some focus on a specialized field while others are general purpose.

It is impossible to review all packages, and difficult to select example software for this module. Therefore, this section reviews seven of the most widely used packages.

1.2 EPI Info

“Epi Info” is public domain statistical software for epidemiology, developed since 1985 by the Centers for Disease Control and Prevention (CDC) in Atlanta, Georgia (USA). It is designed for the global community of public health practitioners and researchers.

The first version, Epi Info 1, was an MS-DOS batch file on 5.25" floppy disks, released in 1985. It was developed on the MS-DOS platform until Epi Info 2000, the first Windows-based version. Starting with Epi Info 2000, data were stored in the Microsoft Access database format rather than the text file format used in the MS-DOS versions. In recent years, support for Windows Vista was added in version 3.5.1, released on August 13, 2008, and an open source version, Epi Info 7, whose source code can be downloaded, was released on November 13, 2008.

The current versions provide easy form and database construction, data entry, and analysis with epidemiologic statistics, maps, and graphs. The primary applications within Epi Info are:

MakeView, to create forms and questionnaires, which automatically creates a database;

Enter, to enter data into the database through the forms and questionnaires created in MakeView;

Analysis, to produce statistical analyses of data, report output, and graphs;

EpiMap, to develop GIS maps overlaid with survey data; and

Epi Report, to combine Analysis output, Enter data, and any data contained in Access or SQL Server, and present it in a professional format. The generated reports can be saved as HTML files for easy distribution or web publishing.

Although “Epi Info” is a CDC trademark, the programs, documentation, and teaching materials are in the public domain and may be freely copied, distributed, and translated. A 2003 analysis documented 1,000,000 downloads from more than 180 countries, and the manual and/or programs have been translated from English into 13 additional languages.

One of the most attractive features of Epi Info is that it supports all steps from developing a questionnaire to analyzing data and creating a tailor-made report. First, users must develop a questionnaire with Epi Info's "MakeView". Based on that questionnaire, one can customize the data entry process, enter data into the database (which was created when developing the questionnaire), and, finally, analyze the data. For epidemiological uses, such as outbreak investigations, being able to rapidly create an electronic data entry screen and then immediately analyze the collected data can save considerable time compared with using paper surveys.

As such, it is one of the best software packages for survey developers and researchers, especially for epidemiological research and surveys. However, it is not easy to use it to analyze a dataset created in other software, which is the main theme of this Module.

1.3 Microsoft EXCEL (with VBA Programming)

Microsoft Excel (full name Microsoft Office Excel), a component of Microsoft Office, is Microsoft's spreadsheet application for both Windows and Mac OS X operating systems. Excel was first released in 1985 for Mac OS, and the first Windows version followed in November 1987. Microsoft Excel has become the most widely used spreadsheet application since the release of Version 5 in 1993. The most recent commercial versions are Microsoft Office Excel 2007 for Windows and 2008 for Mac.

Key features of Microsoft Excel include calculation, graphing tools, pivot tables (or OLAP cubes), and a macro programming language, Visual Basic for Applications (VBA). It can also carry out several database management functions, including support for SQL (Structured Query Language) and Network DDE (Dynamic Data Exchange), which allows spreadsheets on different computers to exchange data.

Since the 1993 version, Microsoft Excel has supported programming through Microsoft's Visual Basic for Applications (VBA). VBA is based on Visual Basic and adds the ability to automate tasks in Excel and to provide user-defined functions (UDFs) for use in worksheets. Moreover, programming with VBA allows spreadsheet manipulation that is impossible with standard spreadsheet techniques. Programmers may write VBA code directly using the Visual Basic Editor (VBE). Alternatively, users can record VBA code that replicates their actions on the spreadsheet, allowing simple automation of regular tasks.

Through VBA, a programmer can access a database (or dataset) that is placed on a spreadsheet or stored in different files (created in non-Excel formats). Visual Basic modules can then be written for constructing frequency and crosstab tables, calculating different statistics, and carrying out transformation, sorting, selection, and formatting. The results, intermediate or final, can be concurrently written back to a spreadsheet or saved in a separate file.

The most attractive feature of Microsoft Excel is its wide accessibility as a component of Microsoft Office. Microsoft Excel is one of the most frequently used software packages, since almost all computer-literate people can use it easily.

On the other hand, only a few users are familiar with VBA, pivot tables, and database functions, which are essential for analyzing household survey data for EFA monitoring. Nevertheless, Microsoft Excel is the most suitable software for putting the final touches on statistical output tables produced by other software, such as modifying a table format and adding graphs and charts.

1.4 PSPP

PSPP is a free, open-source alternative to the proprietary statistics package SPSS. It is an application for the analysis of sampled data, with both a graphical user interface and a conventional command line interface. It is written in C, and uses the GNU Scientific Library for its mathematical routines and "plotutils" for generating graphs. PSPP has been distributed since 1998, and the most recent version (0.6.2) was released on 11 October 2009.

PSPP provides basic but very useful statistical analyses, such as constructing frequency and crosstab tables; running non-parametric tests, significance tests, and reliability tests; fitting different linear regression models; factor analysis; and computing basic statistics. It also provides some database management features such as sorting and selecting cases, computing new variables, recoding into existing and new variables, and more.

Users can select outputs (tables and graphics) in ASCII, PDF, PostScript, or HTML formats. Some graphs, such as histograms, pie charts, and np-charts, can also be generated. PSPP can open SPSS data files and can import data from Gnumeric, OpenDocument, and Microsoft Excel spreadsheets, databases, comma-separated text files, and ASCII text files. It can save data files in the SPSS 'portable' file format (*.por), the SPSS 'system' file format (*.sav), and ASCII text file format. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the PSPP libraries.

The program file and manual can be downloaded from "http://www.gnu.org/software/pspp/". The program can be installed freely and used without limitations. However, its documentation and help system are not very useful for beginners.

1.5 SAS (Statistical Analysis System)

SAS is an integrated system of software products from SAS Institute Inc. SAS enables programmers (users) to perform many different kinds of analysis, data management, and output-generating functions, such as:

data entry, retrieval, management, and mining

report writing and graphics

statistical analysis

business planning, forecasting, and decision support

operations research and project management

quality improvement

applications development

data warehousing (extract, transform, load)

platform independent and remote computing

In addition, SAS has many business solutions that enable large scale software solutions for areas

such as IT management, human resource management, financial management, business intelligence,

customer relationship management and more.

SAS is driven by SAS programs that define a sequence of operations to be performed on data stored

as tables. SAS Library Engines and Remote Library Services allow access to data stored in external

data structures and on remote computer platforms.

SAS functions via application programming interfaces, in the form of statements and procedures. A

SAS program is composed of three major parts namely, (a) the DATA step, (b) procedure steps,

and (c) a macro language.

The DATA step identifies the file structure, reads and writes records, and closes the file. All other tasks are accomplished by procedures in the procedure steps. Procedures are not restricted to built-in behavior but allow extensive customization, controlled by mini-languages defined within the procedures. SAS also has an extensive SQL procedure, allowing SQL programmers to use the system with little additional knowledge.

The macro programming extensions allow the use of "open code" macros or the interactive matrix language SAS/IML component. Macro code in a SAS program undergoes preprocessing. At runtime, DATA steps are compiled and procedures are interpreted and run in the sequence in which they appear in the SAS program. A SAS program requires the SAS software to run. SAS consists of a number of components, which require separate licenses and installations.

SAS runs on IBM mainframes, Unix machines, OpenVMS Alpha, and Microsoft Windows; and

code is almost transparently moved between these environments. SAS requires extensive

programming knowledge and it is the most expensive and comprehensive statistical analysis

software.

1.6 Stata

The name "Stata" is formed from letters of the words "statistics" and "data". It is a general-purpose statistical software package with a full range of capabilities, including data management, statistical analysis, graphics, simulations, and custom programming. It is used by many businesses and academic institutions around the world. Most of its users work in research, especially in the fields of economics, sociology, political science, and epidemiology.

Stata was first commercialized in 1985 by StataCorp, which in recent years has issued a new major release roughly every two years. The most recent version is Stata 11, distributed on 27 July 2009.

There are four major builds on each version of Stata:

Stata/MP for multiprocessor computers (including dual-core and multi-core processors)

Stata/SE for large databases

Stata/IC the standard version

Small Stata a smaller student version, for educational purchase only

Stata emphasizes a command-line interface to facilitate replicable analyses, although a graphical user interface (that is, menus and dialog boxes that facilitate access to built-in commands) has been available since Stata 8.

It allows one dataset at a time to be opened for review and editing in spreadsheet format, but the dataset must be closed before other commands are executed. Stata holds the entire dataset in memory, which limits its use with extremely large datasets. The dataset is always rectangular in format, that is, all variables hold the same number of observations (although some entries may be missing values).

Stata's proprietary file formats are platform independent, so users of different operating systems can

easily exchange datasets and programs. Stata's data format has changed over time, although not

every Stata release includes a new dataset format. Every version of Stata can read all older dataset

formats, and can write both the current and most recent previous dataset format. Thus, the current

Stata release can always open datasets that were created with older versions, but older versions

cannot read newer format datasets.

Stata can read and write SAS XPORT format datasets natively, and it can import data from ASCII formats (CSV or fixed-width) and spreadsheet formats (including various Microsoft Excel formats). Only a few other econometric applications can directly import data in Stata file formats.

One advantage of using Stata is that both datasets and programs are independent of the operating system. Another is that user-written commands can be run together with built-in commands. Several useful commands are available for download from the internet (these command files are called ado-files). Stata's version control system is designed to give a very high degree of backward compatibility, ensuring that code written for previous releases continues to work in newer versions.

Among the difficulties in using Stata is that it requires a thorough understanding of its command line interface and basic commands. It seems that only those with extensive programming experience can learn Stata on their own. That is, tailor-made training may be required for beginners before they can work effectively with Stata.

1.7 SPSS (Statistical Package for Social Sciences)

SPSS is one of the most popular data analysis software packages, supporting various statistical methods and procedures. SPSS was first developed in 1968 at Stanford University for internal use only (see the brief history of SPSS/PASW Statistics in Section 2.1 of this module). In March 2009, the name SPSS was changed to PASW Statistics (Predictive Analytics SoftWare)1.

Recent versions of SPSS/PASW Statistics can handle multiple datasets with an almost unlimited number of variables and cases. Data and outputs can be imported from and exported to different formats, including Microsoft Excel and various text formats. Both a menu-driven (with dialog boxes) graphical interface and a command line (syntax) interface are available to users.

It is the most user-friendly statistical software for beginners doing basic analysis. It offers excellent on-line help, complete users' manuals, and self-learning tutorials. The package covers almost all statistical methods required from basic to advanced analysis, along with good data management and data documentation.

It was also found that the vast majority of household surveys were analyzed with SPSS and/or have their final survey datasets available in SPSS (*.sav) format.

For these reasons, PASW Statistics is chosen as the example software to demonstrate household survey data analysis for EFA monitoring purposes in this module. At the same time, because it is widely available and familiar to the intended users of this module, Microsoft Excel is also selected as another example software, especially for finalizing outputs and for presentation purposes.

1 Recently, PASW Statistics has been renamed IBM SPSS Statistics after becoming part of IBM in late 2009.

Disclaimer

UNESCO does not recommend using any particular software. PASW Statistics and Microsoft Excel are used only as “example” software in this module. Software is just a tool to assist in exploring EFA monitoring indicators from household survey datasets, and users can choose any statistical software.

The review and selection of the statistical software are based solely on the limited experience of the author of this module. They do not reflect UNESCO's view or perspective.

Several facts were obtained from the user manuals of the software concerned and from Wikipedia, the web-based free encyclopedia.

2. INTRODUCTION TO PASW STATISTICS

2.1 What is SPSS/PASW Statistics?

Brief History

In 1968 at Stanford University, Norman H. Nie, a social scientist and doctoral candidate; C. Hadlai (Tex) Hull, who had just completed a master of business administration; and Dale H. Bent, a doctoral candidate in operations research, developed a software system based on the idea of using statistics to turn raw data into information essential to decision-making. This statistical software system was called SPSS, the Statistical Package for the Social Sciences, which is the root of present-day PASW, the Predictive Analytics Software.

Nie, Hull and Bent developed SPSS out of the need to quickly analyze volumes of social science data gathered through various methods of research. Nie represented the target audience and set the requirements; Bent had the analysis expertise and designed the SPSS system file structure; and Hull programmed. The initial work on SPSS was done at Stanford University with the intention of making it available only for local consumption. With the launch of the SPSS user's manual in 1970, demand for SPSS software took off. Moreover, the original SPSS user's manual has been described as “Sociology's most influential book2”. With growing demand and popularity since 1970, a commercial entity, SPSS Inc., was formed in 1975. Up to the mid-1980s, SPSS was available only on mainframe computers.

With the advance of personal computers in the early 1980s, SPSS/PC was introduced in 1984 as the first statistical package to appear on a PC running the MS-DOS platform. Similarly, the first statistical product on the Microsoft Windows (version 3.1) operating system was again SPSS, released in 1992.

Versions of SPSS in Recent Years

SPSS is updated regularly to fit in with, and exploit the advanced features of, new operating systems, and to meet the growing needs of users.

SPSS 16.0.2 - April 2008

SPSS Statistics 17.0.1 - December 2008

PASW Statistics 17.0.2 - March 2009 (PASW = Predictive Analytics SoftWare)

PASW Statistics 18.0.1 (or) IBM SPSS Statistics 18.0.1 - August 2009

PASW is just an enhancement and renaming of SPSS; even the version numbering was not restarted.

SPSS Users

At the beginning, SPSS users were limited to academic researchers, mostly around large universities with mainframe computers. Because of its relatively high price, tough security systems, and limited user-friendliness, SPSS had few users in the early days of SPSS/PC+. Use of SPSS increased rapidly after the release of SPSS for Windows, which is user-friendly and more widely available (a fully functional evaluation version with a specified trial period can be downloaded easily).

2 Wellman, B.; Doing it ourselves, Pp 71-78 in Required Reading: Sociology's Most Influential Books. Edited by Dan Clawson,

University of Massachusetts Press, 1998, ISBN 9781558491533

Statistical Package for the Social Sciences (SPSS) was the first comprehensive data analysis software available on personal computers. Its original user's manual has been described as “Sociology's most influential book”.

Moreover, the cost of an SPSS/PASW license is minimal for students and within a reasonable range for members of corporations and organizations. Yet PASW Statistics is still expensive for general users. Nowadays, its users include market researchers, health researchers, survey companies, government, education researchers, and marketing organizations.

Strengths of SPSS/PASW Statistics

In addition to superb statistical analysis, PASW offers good data management (case selection, file

reshaping, creating derived data) and data documentation (a metadata dictionary is stored with the

data). PASW data files are portable (smaller in size compared to other database systems) and its

program (PASW syntax) files are quite small.

Organization of PASW Statistics (SPSS) Software Package

PASW is organized as a base system plus optional components or modules. Most of the optional components are add-ons to the base system. However, some optional components, such as Data Entry, work independently.

The base system, the main component for running PASW, has the following functions:

Data handling and manipulation: importing from and exporting to other data file formats, such as Excel, dBase, SQL and Access, and sampling, sorting, ranking, subsetting, merging, and aggregating data sets;

Basic statistics and summarization: Codebook, Frequencies, Descriptive statistics, Explore, Crosstabs, Ratio statistics, Tables, etc.;

Significance testing: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), and Nonparametric tests; and

Inferential statistics: Linear and non-linear regression; Factor, Cluster and Discriminant analysis.
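To show what the Frequencies and Crosstabs summaries in the list above compute, here is a minimal Python sketch on invented data. It is not PASW syntax and does not reproduce PASW output; it only illustrates a one-variable frequency table and a two-variable table with row percentages.

```python
from collections import Counter

# Invented cases: (type of residence, school attendance status).
cases = [
    ("Urban", "Attending"), ("Urban", "Attending"), ("Urban", "Not attending"),
    ("Rural", "Attending"), ("Rural", "Not attending"), ("Rural", "Not attending"),
    ("Rural", "Not attending"), ("Urban", "Attending"),
]


def frequencies(values):
    """Map each value to a (count, percent of all cases) pair."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: (n, round(100 * n / total, 1)) for value, n in counts.items()}


def crosstab(pairs):
    """Row percentages for the joint distribution of two variables."""
    cell_counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    table = {}
    for row in rows:
        row_total = sum(cell_counts[(row, col)] for col in cols)
        table[row] = {
            col: round(100 * cell_counts[(row, col)] / row_total, 1) for col in cols
        }
    return table


residence_freq = frequencies([residence for residence, _ in cases])
attendance_by_residence = crosstab(cases)

print(residence_freq)
print(attendance_by_residence)
```

The frequency table describes one variable at a time, while the crosstab shows how attendance is distributed within each type of residence.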

Some of the optional components (add-on modules) available in version 17.0 are:

Data Preparation provides a quick visual snapshot of your data. It provides the ability to

apply validation rules that identify invalid data values. You can create rules that flag out-of-

range values, missing values, or blank values. You can also save variables that record

individual rule violations and the total number of rule violations per case. A limited set of

predefined rules that you can copy or modify is provided.

Missing Values describes patterns of missing data, estimates means and other statistics, and

imputes values for missing observations.

Complex Samples allows survey, market, health, and public opinion researchers, as well as

social scientists who use sample survey methodology, to incorporate their complex sample

designs into data analysis.

Regression provides techniques for analyzing data that do not fit traditional linear statistical

models. It includes procedures for probit analysis, logistic regression, weight estimation,

two-stage least-squares regression, and general nonlinear regression.

Advanced Statistics focuses on techniques often used in sophisticated experimental and

biomedical research. It includes procedures for general linear models (GLM), linear mixed

models, variance components analysis, loglinear analysis, ordinal regression, actuarial life

tables, Kaplan-Meier survival analysis, and basic and extended Cox regression.

Custom Tables creates a variety of presentation-quality tabular reports, including complex

stub-and-banner tables and displays of multiple response data.

Forecasting performs comprehensive forecasting and time series analyses with multiple

curve-fitting models, smoothing models, and methods for estimating autoregressive

functions.


Categories performs optimal scaling procedures, including correspondence analysis.

Conjoint provides a realistic way to measure how individual product attributes affect

consumer and citizen preferences. With Conjoint, you can easily measure the trade-off

effect of each product attribute in the context of a set of product attributes - as consumers do

when making purchasing decisions.

Exact Tests calculates exact p values for statistical tests when small or very unevenly

distributed samples could make the usual tests inaccurate. Available only on Windows OS.

Decision Trees creates a tree-based classification model. It classifies cases into groups or

predicts values of a dependent (target) variable based on values of independent (predictor)

variables. The procedure provides validation tools for exploratory and confirmatory

classification analysis.

Neural Networks can be used to make business decisions by forecasting demand for a

product as a function of price and other variables, or by categorizing customers based on

buying habits and demographic characteristics. Neural networks are non-linear data

modeling tools. They can be used to model complex relationships between inputs and

outputs or to find patterns in data.

EZ RFM performs RFM (recency, frequency, monetary) analysis on transaction data files

and customer data files.
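As a rough illustration of what an RFM summary contains, the sketch below computes the three raw quantities per customer from a transaction list. It is written in Python with made-up customer IDs, dates and amounts, and an assumed reference date; it does not reproduce the scoring or binning that the EZ RFM module itself performs.

```python
from datetime import date

# Hypothetical transaction records: (customer_id, purchase_date, amount)
transactions = [
    ("C1", date(2009, 1, 5), 120.0),
    ("C1", date(2009, 3, 1), 80.0),
    ("C2", date(2008, 11, 20), 300.0),
]
as_of = date(2009, 3, 31)  # assumed reference date for measuring recency

def rfm_summary(transactions, as_of):
    """Return {customer: recency in days, purchase frequency, monetary total}."""
    summary = {}
    for customer, when, amount in transactions:
        rec = summary.setdefault(
            customer, {"last": when, "frequency": 0, "monetary": 0.0})
        rec["last"] = max(rec["last"], when)   # most recent purchase so far
        rec["frequency"] += 1                  # number of transactions
        rec["monetary"] += amount              # total amount spent
    for rec in summary.values():
        rec["recency_days"] = (as_of - rec.pop("last")).days
    return summary

rfm = rfm_summary(transactions, as_of)
```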

Amos™ (analysis of moment structures) uses structural equation modeling to confirm and

explain conceptual models that involve attitudes, perceptions, and other factors that drive

behavior.

Another version, PASW Server, is also available. It is built on a client/server architecture and offers some features not available in the normal version, such as scoring functions.


2.2 Step-by-Step Procedure for PASW Statistics Installation

First, the user must have either the PASW Statistics software package with an official license or an evaluation version, which can be installed for a 21-day trial period. In this manual, the evaluation version of PASW Statistics 17.0 for Windows is used for demonstration.

Follow these steps to install the evaluation version of PASW Statistics 17.0:

Step 1: Check Installed SPSS Versions

Make sure no older version is already installed. If a previous version exists, please uninstall

it before starting the installation process.

Step 2: Insert Installation CD and Run “PASW_Statistics_1702_win_en.exe”

Insert the Installation CD and open “PASW 17.0 for Windows” folder.

Double-click the file named “PASW_Statistics_1702_win_en.exe”; the PASW InstallShield Wizard will begin extracting the contents automatically.

The system requirements to install PASW Statistics 17.0:

Operating System: Microsoft Windows 7, Vista, XP or 2000

System Requirements: Intel Pentium-compatible processor, 256MB RAM, 700MB free

disc space, VGA monitor, and Internet Explorer 6.0 or above


Step 3: Follow the “InstallShield Wizard” until the Installation Completes Successfully

When asked to choose the license type, select “Single user license” and click Next to continue to the license agreement.

Select I accept the terms in the license agreement and click Next to continue.

Immediately, a dialog window with additional information for the users will appear. Read

the information and click Next to continue.

Fill in “User Name” and “Organization” accordingly and click Next to continue.

A window will pop up asking for the folder in which to save the program files. It is strongly recommended to accept the default location and click “Next” to proceed.

Note: Leave the serial number blank to install the evaluation version.


PASW InstallShield Wizard will again confirm to begin the installation.

Click Install to start installation or Back to review and change the installation settings.

As soon as the “Install” button is clicked, the PASW installation begins. It takes just a few minutes.

During installation, do not press any key or click any mouse button, since this may interrupt the process.

When installation is complete, the Wizard will ask to register PASW.

1. Click OK to begin registering process.

2. Select “Enable a temporary trial period” and click Next.

3. Click the Browse button.

4. Select the trial license file “trial.txt” and click Open to get the trial license file.


5. Click Next to continue; the next window will confirm that the trial period has been enabled.

6. Click Finish to complete the installation of PASW Statistics 17.0 with a 21-day trial period.

At this point, the installation of PASW Statistics 17 has been completed successfully.


2.3 Running PASW and Its User Interface

After successful installation, a program group called “PASW Statistics 17” will be placed under

“SPSS Inc.” in the “Start Menu”. There will be at least two items in the menu:

1) PASW Statistics 17, and

2) PASW Statistics 17 License Authorization Wizard.

More items may be displayed in the menu, depending on which optional components (add-on

modules) have been installed.

2.3.1 Starting and Ending a PASW Session

To start PASW, click the “PASW Statistics 17” menu item.

Alternatively, double-click any PASW (or SPSS) data or syntax file to start PASW Statistics. In this case, the file that was double-clicked is also opened in the appropriate window.


When PASW is run for the first time, a dialog window is superimposed on the Data Editor window. This window helps the user begin an initial task, such as opening a data, syntax or output file, running the tutorial for beginners, entering new data, running an existing query, or creating a new query to import data from another database file. Of these, opening an existing data file, from the list or by browsing, is the most common first task in PASW Statistics.

By default, up to nine of the most recently used files are listed in both “Open an existing data source” and “Open another type of file”. Both lists are empty when PASW is run for the first time. An unlisted data file can be opened by double-clicking the “More Files…” item and following the steps of a regular “open file” dialog box. To open one of the most recently used files, double-click its name in the list, or select it and click the OK button.

If the check box in this dialog is checked, only the Data Editor will appear when starting PASW Statistics in future sessions. To keep the superimposed dialog window appearing in coming sessions, it is recommended simply to click the “Cancel” button to close it. In this case, a blank Data Editor window appears.

When using the “evaluation” version, a message showing the remaining trial days appears every time the program is run. It shows 21 days when PASW is used for the first time after installation, 20 days on the following day, and so on. After the trial period ends, the PASW processor will no longer work; that is, commands will not produce any result.

2.3.2 Data Editor and Data Views

In PASW (and also in earlier versions of SPSS), data files are displayed in the “Data Editor”. In the Data Editor, if the mouse cursor rests on a variable name (a column heading), a more descriptive label for that variable is displayed, provided a label has been defined for it.

The Data Editor has two views: “Data View” and “Variable View”.

Data View: the actual data values are displayed in the cells by default. The ‘case numbers’ are displayed as row captions (like the ‘row numbers’ in Microsoft Excel), and the variable names as the column captions. Users can choose to display descriptive value labels in the cells instead (for example, “Male” and “Female” instead of the codes 1 and 2) by choosing View from the menus and then clicking Value Labels or, simply, by clicking the Labels button. Value labels make the responses in a household survey easier to interpret.

Tips:

Save the syntax and output files frequently!

The active PASW session ends and exits automatically if the user closes the last active dataset (or data file). On exit, PASW asks whether to save all unsaved windows, including data, output and syntax windows. It has no automatic recovery feature and there is no “undo” for data transformations, so it is important to save the syntax and output files frequently. After applying any transformation or erasing any variable, save the data file under a different name so that the original data file is not lost.

The following is the dataset of individual household members from the Bangladesh Demographic and Health Survey 2007, shown in the Data View with value labels.

[Figure: Data View with value labels, showing the columns “Relationship to HH Head”, “Age” and “Sex”]

The Data View shows the cases (or observations) in rows, and each column represents a variable (a characteristic that is being measured). In the above example, each individual ‘member of selected households’ is a case, and each ‘item in the questionnaire’ is a variable. For example, ‘relationship to head of household’, ‘age’ and ‘highest education level’ are variables. Each cell contains a single data value of a variable for a case. The cell is where the case and the variable intersect; for example, if the case represents the ‘head of household’ (row 13) and the variable is ‘sex’ (HV104), the cell is the ‘sex of the head of household’. When the actual data values are displayed, the cell shows “2”; it shows “Female” when value labels are displayed. PASW data files are stored in a flat-file format, and data cells cannot store formulas.
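The relationship between cases, variables, cells and value labels can be illustrated with a small Python sketch. It is not how PASW stores data internally; the two cases and the variable name AGE are invented for the example, while HV104 and the coding 1 = Male, 2 = Female follow the text above.

```python
# Hypothetical miniature dataset: each dict is a case, each key a variable.
cases = [
    {"HV104": 1, "AGE": 45},
    {"HV104": 2, "AGE": 38},
]

# Value labels exist only for coded variables (1 = Male, 2 = Female).
value_labels = {"HV104": {1: "Male", 2: "Female"}}

def display(case, variable, show_labels=False):
    """Return what a cell would show: the stored value, or its label."""
    value = case[variable]
    if show_labels:
        # Fall back to the raw value when no label is defined.
        return value_labels.get(variable, {}).get(value, value)
    return value
```

Here `display(cases[1], "HV104")` gives the stored code, and the same call with `show_labels=True` gives the label; the stored value itself never changes.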

Variable View: this displays the metadata dictionary, where each row represents a variable and shows the attributes (or characteristics or properties) of the variable in 10 columns:

1) variable name;

2) type: numeric, comma, dot, scientific notation, date, dollar, custom currency, and string;

3) variable width, i.e. number of digits or characters;

4) number of decimal places;

5) variable label;

6) value labels;

7) codes for user-defined missing values;

8) column width in data view;

9) cell alignment, i.e., left, right or center when displaying in data view; and

10) type of measurement (scale, ordinal or nominal).

All attributes are saved with data values in the file.
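One way to picture a row of the Variable View is as a record with ten fields. The Python sketch below is purely illustrative: the field names, the default values and the missing-value code 9 are assumptions made for the example, not definitions taken from PASW.

```python
from dataclasses import dataclass, field

@dataclass
class VariableInfo:
    """One row of a variable dictionary, with the ten attributes listed above."""
    name: str                                         # 1) variable name
    type: str = "numeric"                             # 2) type
    width: int = 8                                    # 3) digits or characters
    decimals: int = 2                                 # 4) decimal places
    label: str = ""                                   # 5) variable label
    value_labels: dict = field(default_factory=dict)  # 6) value labels
    missing: tuple = ()                               # 7) user-defined missing codes
    columns: int = 8                                  # 8) column width in data view
    align: str = "right"                              # 9) cell alignment
    measure: str = "scale"                            # 10) scale, ordinal or nominal

# Illustrative entry for the 'sex' variable; code 9 as missing is an assumption.
sex = VariableInfo(name="HV104", decimals=0, label="Sex of household member",
                   value_labels={1: "Male", 2: "Female"},
                   missing=(9,), measure="nominal")
```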

The number of rows and columns (the size or dimension) of the data file is determined by the number of cases and variables in that file. Data can be entered in any cell, even one outside the boundaries of the defined dataset. In this case the dimension of the data view is extended to include all the rows and columns needed to cover the newly entered cell. Variable names for the undefined columns are assigned automatically as “VAR00001”, then “VAR00002”, and so on.


Cells in the newly expanded data range (in both rows and columns) in which no data have been entered are filled with “.” (the system-missing value) for numeric variables and with “ ” (a blank, which is a valid string value in PASW) for string variables.

In this case, the type of the new variables is automatically defined as ‘numeric’, and PASW sets the default attributes for numeric variables. Users can change all attributes, including the variable name and type, in the Variable View.
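The automatic extension described above can be mimicked with a small Python sketch. It is a simplified model rather than PASW's actual behaviour: it assumes that every new column is numeric, that each new variable receives the first unused name of the form VARnnnnn, and that row and column positions are counted from 0.

```python
SYSMIS = "."  # displayed form of a system-missing numeric value

def next_default_name(names):
    """Return the first unused default name: VAR00001, VAR00002, ..."""
    n = 1
    while f"VAR{n:05d}" in names:
        n += 1
    return f"VAR{n:05d}"

def enter_value(grid, names, row, col, value):
    """Store value at (row, col), extending the grid to cover that cell."""
    while len(names) <= col:                # create the undefined columns
        names.append(next_default_name(names))
        for existing_row in grid:
            existing_row.append(SYSMIS)     # fill new cells as system-missing
    while len(grid) <= row:                 # create the missing rows
        grid.append([SYSMIS] * len(names))
    grid[row][col] = value

# One defined variable with two cases; then type 7 well outside that range.
names = ["HV104"]
grid = [[1], [2]]
enter_value(grid, names, row=3, col=2, value=7)
```

After the call, two new variables exist and every cell that was not typed in holds the system-missing marker.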

Apart from entering them directly in the Variable View, variable properties can be defined by the following two methods:

Copy Data Properties Wizard uses an external data file, or another dataset available in the current session, as a template for defining file and variable properties in the active dataset. Similarly, variables in the active dataset can be used as templates for other variables in the same dataset. ‘Copy Data Properties’ is available on the ‘Data’ menu in the main SPSS window.

Define Variable Properties, which is also available on the ‘Data’ menu, scans the data and lists all unique data values for any selected variables, identifies unlabeled values, and provides an auto-label feature. This method is particularly useful for categorical variables that use numeric codes to represent categories, for example, 0 = Male, 1 = Female.
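The scanning step of Define Variable Properties amounts to comparing the unique values found in the data with the values that already have labels. A minimal Python sketch of that comparison follows; the data values, including the stray code 9, are hypothetical.

```python
def unlabeled_values(values, value_labels):
    """Return the sorted unique values that do not yet have a value label."""
    return sorted(v for v in set(values)
                  if v is not None and v not in value_labels)

sex_codes = [0, 1, 1, 0, 9, 1]      # hypothetical data; 9 has no label
labels = {0: "Male", 1: "Female"}   # coding taken from the example in the text

missing_labels = unlabeled_values(sex_codes, labels)
```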

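[Figure: Data View and Variable View after a value is typed outside the defined data range; callouts mark the value just typed in, the 2 variables created automatically, and the new properties typed in or changed]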

3. BASIC COMPONENTS OF PASW STATISTICS

3.1 Output Viewers

The outputs created by the program are displayed in the “PASW Statistics Viewer”. By default, all outputs, including the command syntax used during the analysis, output tables, charts, notes and the activity logs of the session, are recorded in the Viewer. Users can choose which output items are displayed or hidden in the Viewer through the “Viewer” tab of the “Options” sub-menu in the “Edit” menu.

If PASW is started by opening a data file, a Viewer (named Output1 [Document1]) is activated automatically and records the command syntax used to open the data file under the “Log” tag. If the command syntax should not be shown in future, the user can set “Log” to be hidden initially in the same “Viewer” tab. Otherwise, a log is displayed when a data file such as “BDHR50FL.SAV” is opened.

Both the “Data Editor” and the “PASW Statistics Viewer” are opened automatically when a PASW Statistics session starts. A user-friendly Help system is available whenever the F1 key is pressed: the opening page “Getting Help” of the “Base System Help” is displayed when working in the Data Editor or the output Viewer, and the context-sensitive “PASW Command Syntax Guide” for the specific command is displayed when working in the syntax window.

Options can be set for: Log, Warnings, Notes, Title, Page title, Pivot table, Chart, Text output, Tree model, and Model viewer.


A typical PASW Viewer, after running the cross-tabulation (crosstab) of “Highest education level” by “sex”, can be seen in the following illustration. Six types of output are recorded in the Viewer: (i) Command Log; (ii) Title; (iii) Notes; (iv) Active Dataset; (v) Case Processing Summary; and (vi) the output table (Highest educational level * Sex of household member Cross-tabulation).
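What the cross-tabulation table itself contains can be shown with a few lines of Python. This is only a sketch of the counting, using four invented cases; it does not reproduce the PASW Crosstabs procedure or its other output items.

```python
from collections import Counter

def crosstab(cases, row_var, col_var):
    """Count the cases falling in every (row value, column value) cell."""
    return Counter((case[row_var], case[col_var]) for case in cases)

# Hypothetical cases with two categorical variables.
cases = [
    {"education": "Primary", "sex": "Male"},
    {"education": "Primary", "sex": "Female"},
    {"education": "Secondary", "sex": "Female"},
    {"education": "Primary", "sex": "Female"},
]

table = crosstab(cases, "education", "sex")
```

Each key of `table` is one cell of the contingency table, and the cell counts add up to the number of cases.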

PASW Statistics Viewer is useful in:

browsing the results like in the Windows explorer;

showing or hiding selected output items (notes, tables and charts);

deleting selected output items;

changing the display order of results; and

moving items between the Viewer and other applications.

In the Viewer, double-clicking an item's icon in the left pane unhides a hidden item or hides a visible one. For example, notes are hidden by default in outputs, and double-clicking the notes icon displays them.

Icons in the left pane can be dragged and dropped to change the position of any item (its order in the output pane). Click an icon to select the associated item, and press the “Delete” key to remove that item (and its icon) from the output.


Tips:

To use particular output items in other applications such as MS Excel or Word, a simple copy and paste is sufficient. Conversely, almost any object, such as a paragraph or a chart, can be pasted into the output view, as is usual in popular application programs.


3.2 Pivot Tables

A pivot table is a data summarization tool for arranging output tables. Pivot-table tools can automatically sort, count, and total the data stored in one table or spreadsheet and create a second table. For example, the user can move the variables displayed in rows to columns and vice versa. This ability to “rotate” a table is known as pivoting, and a table with this ability is called a “pivot table”. One of the significant features of the PASW Statistics Viewer is its ability to handle pivot tables.

Most of the output tables in the PASW Viewer can be pivoted interactively. The user can set up and change the table structure by dragging and dropping the variables, and can select specific categories of the layer variables so that the results represent either the entire dataset or just a subset of the data.

Options for manipulating a pivot table include:

transposing rows and columns;

moving rows and columns;

creating multidimensional layers;

grouping and ungrouping rows and columns;

showing and hiding rows, columns, and other information;

rotating row and column labels; and

finding definitions of terms.
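The simplest of these operations, transposing rows and columns, can be sketched in Python on a table held as nested dictionaries. The counts are invented and the representation is an assumption made for illustration; PASW performs the operation interactively on its own table objects.

```python
def transpose(table):
    """Swap the row and column dimensions of a nested-dict table."""
    flipped = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            flipped.setdefault(col_key, {})[row_key] = value
    return flipped

# Hypothetical counts: rows are education levels, columns are sexes.
by_education = {
    "Primary": {"Male": 10, "Female": 12},
    "Secondary": {"Male": 7, "Female": 5},
}

by_sex = transpose(by_education)
```

Pivoting changes only the arrangement: transposing `by_sex` again returns the original table, and no cell value is recomputed.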

The following illustrates how pivoting can be used in data analysis and presentation.

First, run a cross-tabulation of “Educational Attainment” by “sex” by “type of place of residence” (click Analyze on the main menu and select Crosstabs under Descriptive Statistics; then select each variable, click the appropriate arrowhead to move the variable name to the row, column or layer box, and finally click OK; see the next module for a detailed illustration).


The following are the main results obtained by the above cross-tabulation command.

Then, go through the following steps to pivot an output table:

1) Double-click the output table located in the right result pane to enter table editing mode;

2) The main menu will now contain a new item, “Pivot”;

3) Select the “Pivot” menu and click “Pivoting Trays”; and

4) In the pivoting tray, arrange the row, column and layer variables (including statistics) as necessary by dragging and dropping the variable names.

The following illustrates the use of the pivot table method on the crosstab table.


3.3 Charts

(a) Creating Charts while Analyzing Data

Several procedures on the “Analyze” menu can produce high-resolution charts with a single click. For example, in the bottom-left area of the “Crosstabs” dialog there is a check-box, “Display clustered bar charts”, which creates useful graphs for the selected variables.

(b) Creating Charts through the Chart Builder

Different types of charts and plots can be produced with the “Chart Builder” item under the “Graphs” menu. The Chart Builder helps build charts from predefined gallery charts (templates/samples) or from individual parts (axes and bars). A chart is built by dragging and dropping gallery charts or basic elements onto the canvas, which is the large area to the right of the Variables list in the Chart Builder dialog box. While a chart is being built, the canvas displays a preview of the chart with the defined variable labels and measurement levels. The preview does not reflect the actual data, since it uses randomly generated data to provide a rough sketch of how the chart will look.

Using the gallery is the preferred method for new users. It is also possible to build a chart from basic elements, which is more complex since the user must define the chart options explicitly.

Construct a chart by using gallery

First, click the “Chart Builder” item under the “Graphs” menu. The Chart Builder window will appear with a superimposed warning. Click OK, since users can temporarily define variable types (measurement levels) while building charts.

Then, follow the steps for building a chart from the gallery as:

1) Click the Gallery tab if it is not already displayed.

2) In the Choose From list, select a category of charts. Each category offers several types.

3) Select the desired chart type by dragging its picture onto the canvas or by double-clicking it. If the canvas already displays a chart, the gallery chart replaces the axis set and graphic elements on the chart.

4) Drag variables from the Variables list and drop them into the axis drop zones and, if

available also to the grouping drop zone. If an axis drop zone already displays a statistic and

if it is the statistic desired, do not drag a variable into the drop zone. Add a variable to a

zone only when the text in the zone is blue. If the text is black, the zone already contains a

variable or statistic. Refer to Statistics and Parameters for information about the available

statistics.

The measurement level of the variables is important in building charts. The Chart Builder sets defaults based on the measurement level while the chart is being built, and the resulting chart may look different for different measurement levels. The user can temporarily change a variable's measurement level by right-clicking the variable and choosing an option.

5) To change statistics or modify attributes of the axes or legends (such as the scale range), click Element Properties. In the “Edit Properties Of” list, select the item to be changed, make the changes, and then click Apply.

6) Click OK to create and display the chart in the Viewer.

Notes: (a) If it is necessary to add more variables to the chart (for example, for clustering or paneling), click the

Groups/Point ID tab in the Chart Builder dialog box and select one or more options. Then drag categorical

variables to the new drop zones that appear on the canvas.

(b) To transpose the chart (for example, to make the bars horizontal), click the Basic Elements tab and then

click Transpose.

(c) If many default settings for a specific chart are changed often, the current settings can be saved as a favourite and used later. Please refer to the PASW manuals for detailed instructions.

(d) The canvas is the area of the Chart Builder dialog box where the chart is built.

(e) An axis set defines one or more axes in a particular coordinate space (like 2-D rectangular or 1-D polar).

Adding a gallery item to the canvas automatically creates an axis set. Each axis includes an axis drop zone

for dragging and dropping variables. Blue text indicates that the zone still requires a variable. Every chart

requires adding a variable to the x-axis drop zone.

(f) The graphic elements are the items in the chart that represent data. These are the bars, points, lines, and

so on. In the illustration, the graphic element is a bar.

(g) The variable list displays the available variables. If a variable selected in this list is categorical, the

category list shows the defined categories for the variable. A variable's measurement level can be

changed temporarily by right-clicking its name and choosing desired measurement level.

(h) Drop zones are the areas on the canvas onto which a variable is dragged and dropped from the Variables list. The

basic drop zone is the axis drop zone. Certain gallery charts (like clustered or stacked bar charts) include

grouping drop zones. The illustration shows a grouping zone that contains Sex as the grouping variable.

After clicking on the OK button, the following chart will be placed in the Viewer.

[Figure: Chart Builder dialog with callouts for the canvas, the variable list, the category list, the statistics in the axis drop zone, and the variable in the grouping zone]

To generate a bar chart of the “percentage of male and female heads of household in each district”, first click the Element Properties button on the Chart Builder window and follow the steps below:

1) In the “Element Properties” window, change the statistic to “Percentage()”;

2) Click the Set Parameters button;

3) In the set parameters drop-down list, select “Total for Each X-Axis Category” as the denominator for computing the percentage;

4) Click Continue; and

5) Click Apply to activate the changes.

Finally, click the OK button on the Chart Builder window to produce the graph.
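The effect of choosing “Total for Each X-Axis Category” as the denominator can be checked with a short calculation: each bar shows a group's count as a percentage of the total within its own x-axis category, so the bars of one category sum to 100. The Python sketch below illustrates the arithmetic with invented districts and counts; the function name is hypothetical.

```python
from collections import Counter

def percent_within_x(pairs):
    """pairs: (x_category, group) tuples.
    Returns {(x_category, group): percentage of that x category's total}."""
    cell = Counter(pairs)
    x_total = Counter()
    for (x, _group), n in cell.items():
        x_total[x] += n                      # denominator: total per x category
    return {key: 100.0 * n / x_total[key[0]] for key, n in cell.items()}

# Hypothetical heads of household: (district, sex)
heads = [
    ("District A", "Male"), ("District A", "Male"),
    ("District A", "Male"), ("District A", "Female"),
    ("District B", "Female"), ("District B", "Male"),
]

pct = percent_within_x(heads)
```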


(c) Using Graphboard Visualization to Create Customized Graphs

Creating a graph from the “Graphboard Template Chooser”

This is a new feature in PASW Statistics 17. Through this command (located in the “Graphs” menu), graphs can be created from ready-made templates called “Graphboard Visualizations”, which include graphs, charts, and plots. PASW Statistics ships with built-in visualization templates covering 23 different types of graphs, which are sufficient for general users. Another product, PASW Viz Designer, is available for creating custom visualization templates.

To use the built-in templates, select “Graphboard Template Chooser” in the “Graphs” menu and follow these steps:

1) In the “Graphboard Template Chooser” window, click the Basic tab to start selecting the appropriate variable(s);

2) Click the name(s) of the variable(s) to be graphed, holding down the Control key from the second variable onwards. Here, PASW lists only the variable names, not the labels. As soon as a variable is selected, all graph types suitable for it are displayed in the right pane of the window. Similarly, if two variables are selected, the graph types possible for those two variables are displayed;

3) Double-click the icon of the preferred graph type among the displayed samples;

4) Optionally, click:

(a) the Detailed tab to change the chart type, variables, etc.;

(b) the Titles tab to set the chart title, sub-title and footnote; and

(c) the Options tab to set the output label and other options.

5) Click OK to create the graph.

Note that creating graphs through the “Graphboard Template Chooser” requires more resources, such as processing time, a faster processor, and more memory. Moreover, graphs created this way are difficult to edit.

(d) Graphs through Legacy Dialogs

Graphs can also be created from the “Legacy Dialogs”. Almost all graph types are available, and features such as the title and sub-title can be customized while creating a graph through this option.

The following exhibits show the types of graphs available under “Legacy Dialogs” and the population pyramid of the sample household population created through the legacy dialogs.

[Figure: the different types of charts available in “Legacy Dialogs”]

The following dialog shows how to generate a population pyramid of the sample household population by age and sex.

The pyramid produced by the above settings is as follows:


Drawing a Population Pyramid:

1) Select “Legacy Dialogs” in the “Graphs” menu;

2) Click “Population Pyramid”;

3) Drag “Age of household members” and drop it in the “Show Distribution over” box;

4) Drag “Sex of household members” and drop it in the “Split by” box;

5) Click the “Titles…” button;

6) Type “Population Pyramid of Sample Households” in Title Line 1;

7) Click “Continue”; and

8) Click OK on the “Define Population Pyramid” dialog.
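The data behind a population pyramid are simply counts of people by age band, split by sex. The steps above can be mirrored in a small Python sketch; the five-year band width, the sample ages and the text rendering are assumptions made for illustration and are unrelated to how PASW draws the chart.

```python
from collections import Counter

WIDTH = 5  # assumed width of each age band, in years

def pyramid_counts(members, width=WIDTH):
    """members: (age, sex) tuples. Returns {(band_start, sex): count}."""
    return Counter(((age // width) * width, sex) for age, sex in members)

# Hypothetical household members: (age, sex)
members = [(3, "Male"), (4, "Female"), (7, "Female"), (31, "Male"), (34, "Male")]
counts = pyramid_counts(members)

# Text rendering: oldest band on top, males to the left, females to the right.
for start in sorted({band for band, _sex in counts}, reverse=True):
    left = "#" * counts[(start, "Male")]
    right = "#" * counts[(start, "Female")]
    print(f"{left:>5} {start:>2}-{start + WIDTH - 1:<2} {right}")
```

In this sketch, age corresponds to the variable placed in “Show Distribution over” and sex to the variable placed in “Split by”.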


3.4 Saving and Exporting Outputs

Starting from version 16, outputs are saved only in the Viewer format (*.spv). The PASW Viewer no longer supports output files of earlier versions in the proprietary file format (*.spo).

From the PASW Viewer, outputs can be selected, copied and pasted into any spreadsheet software, word processor or graphical presentation software.

Outputs in the Viewer can also be exported to different formats such as: Excel (*.xls); HTML

(*.htm); Portable Document Format (*.pdf); Power Point (*.ppt); different text formats (*.txt) such

as plain text, UTF8 and UTF16; and Word/RTF (*.doc). Moreover, graphical outputs can be saved

into such formats as: Bitmap (*.bmp); Enhanced Meta File (*.emf); Encapsulated Postscript

(*.eps); JPEG file (*.jpg); Portable Network Graphic (*.png); and Tagged Image File (*.tif).

When exporting outputs, one can choose to export:

i) all items, including hidden items, whether selected or not;

ii) visible (non-hidden) items only; or

iii) just the selected items.

To export multiple items, select them by clicking each item while holding down the Control key, and follow the steps described in the following example.

For exporting PASW outputs to MS Excel,

1) Select the item(s) to export on the left pane of the PASW Statistics Viewer;


2) Click Export in the File menu; an “Export Output” window will appear;

In the “Export Output” window:

3) Select the “Selected” option button to export only the selected output items (tables, notes, summaries, …);

4) Select “Excel file (*.xls)” from the “File Type:” dropdown;

5) Click the Browse button and select the location and name of the export file, or type in the file name with its full path, e.g., “C:\Documents and Settings\User\My Documents\SPSS Training\Sample\Test-exporting.xls”; and

6) Click OK to begin the export process.


At the end of exporting process, the exported file can be seen in the designated folder.

To export only the graphics, without any notes, tables, etc., select “None (Graphics only)”, the last item in the drop-down list, when choosing the document type in Step 4. The Graphics section of the “Export Output” window then becomes active and the Document section inactive (that is, the user can no longer set any document option other than the document type). In this case, users can select the graphic format (together with graphic options) and the root file name under which to save the graphics. If the root file name is “test.png” and there are 3 charts in the active Viewer, three graphic files are created, named “test1.png”, “test2.png”, and “test3.png”.
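The naming pattern in this example (a running number appended to the root name, before the extension) can be expressed in a few lines of Python. The function below is a hypothetical helper that reproduces only the pattern described in the text; it is not part of PASW.

```python
from pathlib import Path

def numbered_names(root_name, n_charts):
    """Append 1..n_charts to the stem of root_name, keeping the extension."""
    root = Path(root_name)
    return [f"{root.stem}{i}{root.suffix}" for i in range(1, n_charts + 1)]

names = numbered_names("test.png", 3)
```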


3.5 Online Help

PASW Statistics provides a comprehensive help system, together with a tutorial on every key aspect. Context-sensitive help topics in dialog boxes give guidance on each specific task. A help window pops up whenever the help key “F1” is pressed. It shows the base system help when working in the Data Editor or output Viewer, or the command syntax guide for the closest command when in the syntax editor. Various types of PASW help can also be accessed through the “Help” menu.

The first item under the Help menu, and the most important for beginners, is “Topics”. “Topics” provides access to the basic PASW Help system with Contents, Index, and Search tabs, from which users can find the explanation of a specific topic or command procedure.


The second item, “Tutorial”, gives step-by-step instructions on how to use many of the basic features. Users can choose the topics they need, skip around, and view topics in any order.

The index or table of contents can be used to find specific topics. “Case studies”, the third item,

provides hands-on examples of how to create various types of statistical analyses and how to

interpret the results. The sample data files used in the examples are provided in the PASW package.

Table of contents of the tutorial can be observed in the following illustration.

The “Statistics Coach”, using a wizard-like approach, helps find the commands or procedures needed. After making a series of selections, the Statistics Coach opens the dialog box for the

statistical, reporting, or charting procedure that meets selected criteria. It provides access to most

statistical and reporting procedures and several charting procedures in the Base system.


The help items mentioned above are useful for all users, from beginners to advanced developers. Apart from those, more help topics such as “Command Syntax Reference” and “Statistical Algorithms” are available interactively for advanced users, and the “Developer Central” and “Technical Support Website” are available for online users.

As in other modern software, PASW provides “Context-sensitive Help” in several places in the user interface, as follows:

1) Most dialog boxes have a Help button that leads directly to a Help topic for that dialog box. The Help topic provides general information and links to related topics.

2) Right-click terms in an activated Pivot Table in the Viewer and choose “What's This?” from

the context menu to display definitions of the terms.

3) In a command syntax window, position the cursor anywhere within a syntax block of a

command and press F1 on the keyboard. A complete command syntax chart for that

command will be displayed. Complete command syntax documentation is available from

the links in the list of related topics and from the Help Contents tab.


Select any place in the Command Line and Click <F1>


4. USING DATA FROM OTHER SOURCES

Generally, PASW Statistics can read data files created in:

• all versions of PASW Statistics (*.sav) and SPSS/PC+ (*.sys) formats;

• spreadsheets (Excel, Lotus and SYLK);

• database tables (dBase, MS Access, FoxPro, Oracle, SQL Server, etc.);

• statistical software (SAS, SYSTAT, and Stata); and

• different text formats (fixed width, comma delimited/CSV, tab or space delimited, etc.).

Data files created by spreadsheets and other statistical software can be opened directly as PASW data files. Similarly, PASW can open dBase files, text data files and other files without converting the files to an intermediate format or entering data definition information. On the other hand, complex database files such as MS Access, FoxPro and SQL databases can be accessed through the database wizard or SQL queries.

Opening a data file makes it the active dataset. The active dataset is the one that PASW reads from and writes to during the session, unless a specific command switches to another dataset. If one or more data files (or datasets) are already open, they remain open and available for subsequent use in the session. Clicking anywhere in the Data Editor window of an open data file makes it the active dataset.

A PASW data file can be saved (or exported) to other file types. However, some file types can store only data values, while PASW keeps both the values and the data dictionary (or attributes). The data dictionary or attributes, such as variable labels, value labels, missing values, etc., will be lost if the file is saved to other formats, including Microsoft Excel format.

4.1 Importing Data from Microsoft Excel

Importing data from Microsoft Excel is the easiest among the data sources. First, arrange the spreadsheet in tabular format, fulfilling the following six recommendations:

i) The names of the variables are on the first row of the data range;

ii) Variable names comply with the PASW Statistics naming rules (see footnote 3);

iii) For all numeric variables, there should be no blanks in the second row of the data range;

iv) The data range should be continuous, with no blank rows or columns;

v) The worksheet is clear of any graphs, labels, and extra text or data; and

vi) Unnecessary worksheets (those which are not going to be imported) are deleted.

Footnote 3: Starting from Version 12.0, the following rules apply to variable names:

1. names must be unique; duplication is not allowed, and names cannot contain spaces;

2. names can be up to 64 characters long in English;

3. names start with a letter or @, #, or $, followed by letters, numbers, periods (.), and non-punctuation characters;

4. a name starting with “#” is a scratch variable, which can be created only with command syntax;

5. a name starting with a $ sign is a system variable, and is not allowed for a user-defined variable;

6. the period, underscore, and the characters $, #, and @ can be used within variable names, e.g. “A._$@#1”;

7. names shall not end with a period or an underscore;

8. reserved keywords cannot be used: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, and WITH;

9. a mixture of uppercase and lowercase characters is allowed, and “case” is preserved for display purposes; and

10. long names are wrapped in output, breaking at underscores, periods, and changes from lowercase to uppercase.
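The naming rules in the footnote can be checked before import with a small script. The following is a minimal sketch in Python, not part of PASW: the function name `check_variable_name` is hypothetical, only ASCII letters are considered, and names are compared case-insensitively when looking for duplicates (both are assumptions of this sketch).

```python
import re

# Reserved keywords listed in rule 8 of the footnote.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# First character: a letter, @, # or $; later characters: letters, digits,
# period, underscore, $, # or @ (ASCII only, an assumption of this sketch).
NAME_PATTERN = re.compile(r"[A-Za-z@#$][A-Za-z0-9._$#@]*")


def check_variable_name(name, existing=()):
    """Return a list of rule violations for a proposed user-defined name."""
    problems = []
    if len(name) > 64:
        problems.append("longer than 64 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("invalid first character or illegal character")
    if name.startswith("$"):
        problems.append("names starting with $ are system variables")
    if name.endswith((".", "_")):
        problems.append("ends with a period or an underscore")
    if name.upper() in RESERVED:
        problems.append("reserved keyword")
    # Duplicates are compared case-insensitively (assumption of this sketch).
    if name.upper() in {other.upper() for other in existing}:
        problems.append("duplicate name")
    return problems
```

For example, `check_variable_name("HV104")` returns an empty list, while `check_variable_name("hh size")` reports an illegal character because of the space.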

In general, PASW Statistics can read datasets created by almost all popular statistical software and databases. A PASW dataset can also be saved in several popular formats. Therefore, the PASW data format (*.sav) is a common format for sharing and distributing survey datasets.


If the data in the Excel file are spread over several worksheets, it is better to create a new Excel file with just one worksheet containing all the necessary data, including variable names.

Then, follow the steps:

On the main menu, click:

1. File;

2. Open;

3. Data;

An “Open Data” pop-up window will appear. In this window:

4. Change “Files of type” to “Excel (*.xls, *.xlsx, *.xlsm)”;

5. Select the folder containing the Excel data file from the “Look in” box;

6. Select the correct Excel data file (in 97-2003 or 2007 format); and

7. Click Open;

The “Opening Excel Data Source” pop-up window will appear. In that window:

8. Clear the check box next to “Read variable names from the first row” if, and only if, the first row of the Excel data sheet does not contain variable names;

9. Select the worksheet containing the data, if the file has more than one worksheet;

10. Type in the range of data to be imported (for example, A1:V100 for the first 99 cases, that is, 100 rows including the row for the variable names); and

11. Click OK.


If the data file in Excel was prepared following the six recommendations mentioned above, steps 8 to 10 can be skipped, since there is only one sheet in the Excel file, the data range is continuous, and there are no extra cells or objects in the sheet other than the data to be analyzed.

At the end of this process, the data from the Excel file have been transferred into a PASW dataset. At this point, it is important to save the current PASW dataset with an appropriate name in a designated place.

Data files in Excel or text format, or from databases, do not have a data dictionary, that is, they carry no information on data attributes such as variable labels, value labels, missing values, etc. Therefore, it is important to define such attributes for all variables, and save the data file again.


4.2 Importing Data from Delimited ASCII Text Files

When data are requested from other agencies and departments, they are sometimes provided in text or ASCII file format. Normally, data in an ASCII file are arranged either in fixed width format, in which a variable is placed in the same location for every case, or in delimited format, in which variables are separated by a specific character such as a tab, space, comma, semicolon or any other character that is used consistently throughout the file and does not occur in the data values.

To import data from a delimited text file, first review the file in a text editor such as Notepad or Word and check the character used as the delimiter (normally tab, space, comma or semicolon).

Then, follow the steps:

On the main menu, click:

1. File;

2. Read Text Data;

Then, the “Open Data” pop-up window will appear with the text (*.txt) file type.

In the Open Data window:

3. Select the folder containing the text data file from the “Look in” box;

4. Change “Files of type” to “All Files (*.*)”;

5. Select the correct data file (*.txt, *.dat, *.csv, *.prn, etc.); and

6. Click Open;

The “Text Import Wizard” will then begin automatically and guide the user through the importing process.

Note: Sometimes, text data files have a file extension other than “.txt” and “.dat”, such as “.prn” or “.csv”. If the “Read Text Data” menu item is chosen, PASW will display only the files with the extensions “.txt” and “.dat”. To search for a text data file with another extension, choose “All Files (*.*)” in the “Files of type” field to display all files.


The wizard contains the following 6 steps:

Step 1/6: Click Next to go forward to Step 2 of 6. In Step 1 of the Wizard, one can either apply a predefined format (previously saved from the Text Wizard) or follow the steps.

Step 2/6: (i) The Wizard will detect and select whether the data are arranged as “Delimited” or “Fixed width”, but check that the choice is correct (in the “Data.csv” file, the variables are separated by a comma “,”, and thus the file structure is delimited); and

(ii) identify whether the variable names are included at the top (first line) of the data file or not (in this example, “Yes”), and click Next to go forward to Step 3 of 6.

Step 3/6: (i) Since the data file begins with variable names, the first case of data begins on line 2. Otherwise, the user should specify the line number on which the data begin.

(ii) If each line represents a case (one person, for example), just click Next; otherwise, select the second option under “How are your cases represented?” and specify the number of variables per case before clicking Next.

Step 4/6: The Wizard will automatically identify the delimiter(s) between variables. However, it is important to check them and specify them correctly. Some software exports text in quotes, i.e. expressed as "text" or 'text'; in that case the text qualifier (quotation mark) must be specified with the radio buttons of the second question before clicking Next.

Note: In this example, the first line of the file contains the variable names.

Step 5/6: In this step, variable names and data formats can be specified or changed from the default settings. Then, click “Next” to continue or “Finish” to start importing data.

Step 6/6: In this step, just click “Finish” to start importing; the task will be completed in a few minutes.

Note: Sometimes, the Wizard may identify the wrong delimiters. Users must check and specify the correct delimiter(s).
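The choices made in Steps 2 to 4 of the wizard (delimiter, text qualifier, and whether the first line holds variable names) can be illustrated outside PASW. The following is a minimal sketch using Python's standard `csv` module. The sample text and the `REGION` variable are hypothetical, and the default names V1, V2, … mirror the defaults the wizard uses when no names are supplied.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a comma-delimited file.
# The quoted value contains a comma, which is why the text qualifier matters.
SAMPLE = 'HV104,HV105,REGION\n1,34,"Dhaka, urban"\n2,29,"Khulna"\n'


def read_delimited(text, delimiter=",", quotechar='"', names_in_first_line=True):
    """Split delimited text into cases, one dictionary per line of data."""
    rows = list(csv.reader(io.StringIO(text),
                           delimiter=delimiter, quotechar=quotechar))
    if names_in_first_line:
        names, data = rows[0], rows[1:]
    else:
        # No names in the file: fall back to default names V1, V2, ..., Vn.
        names = [f"V{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    return [dict(zip(names, row)) for row in data]


cases = read_delimited(SAMPLE)
```

If the wrong delimiter or text qualifier is specified, the value "Dhaka, urban" would be split into two variables, which is the kind of error the note above warns about.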


4.3 Importing Data from Fixed Width Text Files

In some text data files, variables are aligned in fixed width columns. That is, a variable is in the same column(s) throughout the data file. For example, the sex of the household member is in column 33 of every line in the “Data(Fix).txt” file, which is extracted from the Bangladesh DHS 2007.

To import data from a text file with a fixed width data structure, it is important to have the data dictionary of the variables, that is, a list showing which variable is located in which column(s). After that:

On the main menu, click:

1. File;

2. Read Text Data;

Then, the “Open Data” pop-up window will appear with the text (*.txt) file type, and

3. Select the folder containing the text data file from the “Look in” box;

4. Select the correct data file (*.txt or *.dat); and

5. Click Open;

Then, the “Text Import Wizard” will begin automatically and guide the user through the importing process.

The wizard contains the following 6 steps:

Step 1/6: Simply click Next to go forward to Step 2 of 6;

Step 2/6: (i) The Wizard will detect and select whether the data are arranged as “Delimited” or “Fixed width”, but check that the choice is correct (“Data(Fix).txt” contains no separation character, and thus the file structure is fixed width); and

(ii) identify whether the variable names are included at the top (first line) of the data file or not (in this example, “No”), and click Next to go forward to Step 3.

Note: It is not necessary to change “Files of type”, since the extension of the file name is “.txt”.


Step 3/6: Since there are no variable names in the first line, the first case of data begins on line number 1. If a case spans more than one line, the user has to specify the number of lines per case. Otherwise, just click “Next” to continue.

Step 4/6: This is the most crucial step in importing a fixed width data file. Use the data dictionary to identify the variables and split each case into variables accordingly. In this example, one line of data represents a case, and the locations of the variables are as follows:

Variable number   Column   Variable name   Variable label
 1                1-8      HV005           Sample weight
 2                9-10     HV009           Number of household members
 3                11-12    HV024           Division
 4                13       HV025           Type of place of residence
 5                14       HV026           Place of residence
 6                15-16    HV218           Line number of head of household
 7                17       HV219           Sex of head of household
 8                18-19    HV220           Age of head of household
 9                20       HV270           Wealth index
10                21-28    HV271           Wealth index factor score (5 decimals)
11                29-30    HV101           Relationship to head
12                31       HV102           Usual resident
13                32       HV103           Slept last night
14                33       HV104           Sex of household member
15                34-35    HV105           Age of household members
16                36       HV106           Highest educational level
17                37-38    HV107           Highest year of education
18                39-40    HV108           Education in single years
19                41       HV109           Educational attainment
20                42       HV110           Member still in school
21                43       SH08            Marital status
22                44       SH15            Employment status

Note: In this example, the first line does NOT contain variable names.

The Wizard will put in separation lines, or break lines, wherever a break is evident (for example, if a column consistently contains blanks across the lines, the Wizard will insert a break line). A break line can be inserted or deleted with the “Column number” input box below the data view. For example, to insert a break line at column 13, enter 13 in the “Column number” input box and press the “Insert Break” button. Similarly, to delete a break located at column 28, just type in 28 and click the “Delete Break” button. In this step, the user has to check and correct all break lines to ensure that the data are imported correctly.

After defining the locations, click Next to proceed to Step 5.


Step 5/6: In this step, one can select “Finish” to start importing data with the default variable names (V1, V2, …, Vn) and data formats (all numbers will be numeric and the remaining variables will be string). Alternatively, the user can enter variable names and formats individually.

Step 6/6: Simply click “Finish” to start the text data importing task. In the Text Import Wizard, the user can save the format (including break lines and variable names) for future use.

It will take just a few minutes to import the text data into the PASW Statistics Data Editor. It is strongly recommended to check and edit (or create) variable attributes such as variable labels, value labels, missing values, etc. It is important to define such attributes for all variables, and save the data file again.
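The way the column positions from Step 4 split each line into variables can be illustrated as follows. This is a minimal Python sketch, not the wizard's own procedure: the column layout is copied from the data dictionary table in Step 4, while the function name `parse_fixed_width` and the sample line are hypothetical (the values are invented and are not real survey data).

```python
# (variable name, first column, last column); columns are 1-based and
# inclusive, copied from the data dictionary table in Step 4.
LAYOUT = [
    ("HV005", 1, 8), ("HV009", 9, 10), ("HV024", 11, 12), ("HV025", 13, 13),
    ("HV026", 14, 14), ("HV218", 15, 16), ("HV219", 17, 17), ("HV220", 18, 19),
    ("HV270", 20, 20), ("HV271", 21, 28), ("HV101", 29, 30), ("HV102", 31, 31),
    ("HV103", 32, 32), ("HV104", 33, 33), ("HV105", 34, 35), ("HV106", 36, 36),
    ("HV107", 37, 38), ("HV108", 39, 40), ("HV109", 41, 41), ("HV110", 42, 42),
    ("SH08", 43, 43), ("SH15", 44, 44),
]

# Hypothetical 44-character line, written piece by piece in variable order.
SAMPLE_LINE = (
    "00918273" "05" "03" "2" "3" "01" "1" "45" "2" "-0045678"
    "03" "1" "1" "2" "12" "1" "04" "04" "1" "1" "0" "0"
)


def parse_fixed_width(line, layout=LAYOUT):
    """Cut one line into variables using 1-based inclusive column positions."""
    return {name: line[first - 1:last] for name, first, last in layout}


case = parse_fixed_width(SAMPLE_LINE)
print(case["HV104"], case["HV105"])  # prints: 2 12
```

A single misplaced break shifts every later variable, which is why the break lines in Step 4 need to be checked against the data dictionary.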


4.4 Importing Data from Microsoft Access Databases

Data from databases that use Open Database Connectivity (ODBC) drivers can be read directly by PASW Statistics if the respective drivers are installed on the computer. Commonly used ODBC drivers are provided with the PASW installation package. Microsoft Access is the most widely used of these database systems, and a step-by-step guide to importing data from MS Access is presented in this section. The same steps, with minor variations, can be followed to import data from databases created on other platforms.

Before importing data from an MS Access database, check whether the database contains a table in flat file format (like a worksheet) with all the variables to be imported. If the data to be imported are spread over several database tables, that is, the variables are located in different database tables, it is better first to create a simple table containing all the variables in MS Access before importing.

To begin, click the following on the main menu:

1. File;

2. Open Database; and

3. New Query;

A “Database Wizard” window will appear for identifying the ODBC data source. All available ODBC data sources are listed in the right pane; click the one that matches the database to be imported. If there is no appropriate source, a new driver for that particular source must be installed or added before importing data from that database.

Normally, there is an “MS Access Database” entry in the list, and:

4. Select MS Access Database from ODBC Data Sources; and

5. Click Next to continue.


The first time, the “ODBC Driver Login” window will appear. If it is not the first time that this import procedure is run, the Wizard may skip this step.

6. Click Browse to browse the folders and files, and select the correct database file to open; and

7. Click OK to open that database file.

At this point, the user can also set up a new link.

Then, the “Database Wizard” window will come up with two panes: “Available Tables” on the left and “Retrieve Fields in This Order” on the right.

8. Click a table name to expand it and double-click the field name(s) to select them, or double-click the table name to select all variables in that table; and

9. Click “Finish” to start importing all cases (53,413 cases) from the database. It is important to save the data file after the import process.

10. Alternatively, one can click “Next” to go to another step, where users can select the cases to import based on some criteria (filtering). The following example shows how to import the cases where the age of the household member is between 6 and 15 years. Here, only 12,621 cases will be imported instead of the 53,413 cases in the entire database. It is important to save the data file at the end of the importing process.

Note: To import selected fields (variables), click the table name to expand it and double-click the desired field names. To import all variables, just double-click the table name.

11. Press “Next” again to redefine variable names and to auto-recode string variables, before pressing “Finish” to start importing.

Although all variables have been imported, PASW Statistics assigns F8.2 (a numeric format with a total width of 8 characters, including the decimal point and 2 decimal places) to all numeric variables, and A255 (alpha-numeric format; up to 255 characters) to string variables. Therefore, it is important to adjust the formats of all variables, and also to set column widths so that they display appropriately. Moreover, it is recommended to recode string variables for easier analysis. The following section will explain how to refine imported datasets.

If there are several tables in the source database file, one can link them through identification fields and import variables from different tables (please see the online tutorial on PASW Data Manipulation). However, it is more convenient to link the tables and create a special table with all required variables in MS Access (or in the original database software) before importing into PASW Statistics.
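The case selection in step 10 amounts to a database query with a condition on HV105. The following is a minimal sketch of such a query, with several assumptions: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for an MS Access file reached through ODBC, the table name `members` and the rows are hypothetical, and the condition is the one shown in the wizard illustration (Age > 5 and Age < 15).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Access database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (HV104 INTEGER, HV105 INTEGER)")

# Hypothetical rows: (sex of household member, age of household member).
conn.executemany(
    "INSERT INTO members VALUES (?, ?)",
    [(1, 4), (2, 6), (1, 10), (2, 14), (1, 15), (2, 40)],
)

# Retrieve only the selected fields and only the cases meeting the criterion.
query = (
    "SELECT HV104, HV105 FROM members "
    "WHERE HV105 > 5 AND HV105 < 15 "
    "ORDER BY HV105"
)
selected = conn.execute(query).fetchall()
conn.close()

print(selected)  # prints: [(2, 6), (1, 10), (2, 14)]
```

Filtering at import time keeps the PASW dataset small, but the excluded cases are not available later unless the import is repeated.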

Note: HV105 is “Age of household member”, and the criterion is “Age > 5 and Age < 15”, that is, “5 < Age < 15”.

5. TIPS AND EXERCISES

5.1 Tips: Do and Don’t

i) Do… check whether any previous version of PASW Statistics, SPSS for Windows, or SPSS/PC+ has already been installed on the computer.

Don’t… install any version of PASW Statistics without checking for an existing working installation of PASW Statistics.

ii) Do… check whether any installed PASW Statistics is a licensed version.

Don’t… uninstall any licensed version of PASW Statistics before ensuring that the license can be transferred to the new PASW software.

iii) Do… uninstall the existing PASW Statistics, SPSS for Windows, or SPSS/PC+ if the new software has a valid license or is to be used for evaluation, which is allowed for 14 or 21 days.

Don’t… install a new version of PASW Statistics before completing the uninstallation process.

iv) Do… study and become thoroughly familiar with the PASW Statistics components and the survey files, including data, questionnaire and codebook, before conducting any analysis.

Don’t… change anything in the dataset! Also, do not start analysis with a new dataset before understanding the questionnaire and codebook of the survey.

v) Do… familiarize yourself with the data file, especially if it is in a format other than PASW. For text data files, review the file in a text editor such as Word or Notepad, and check whether the first line consists of variable names and which separation character (blank, comma, tab, etc.) has been used.

Don’t… save the original data file after reviewing it in MS Word or any text viewer, to avoid altering the format or the characters.


5.2 Self-evaluation

Are you able to explain to your colleagues the background of some popular statistical analysis software packages?
Very well / Somewhat well / Not so much / Almost None

Do you understand why SPSS / PASW is chosen as the statistical software for assisting EFA monitoring?
Very well / Somewhat well / Not so much / Almost None

Can you install the evaluation version of PASW Statistics without any assistance?
Certainly / Somewhat certain / Not so much / Not at all

Can you explain to your friends the following basic components of PASW:
o Output Viewers: Very well / Somewhat well / Not so much / Almost None
o Pivot Tables: Very well / Somewhat well / Not so much / Almost None
o Charts: Very well / Somewhat well / Not so much / Almost None
o Export Outputs: Very well / Somewhat well / Not so much / Almost None
o Online Help: Very well / Somewhat well / Not so much / Almost None

Are you confident that you can import data from the following sources to PASW:
o Microsoft Excel: Confident / Somewhat confident / Not so much / Not at all
o Delimited text files: Confident / Somewhat confident / Not so much / Not at all
o Fixed width text files: Confident / Somewhat confident / Not so much / Not at all
o Access databases: Confident / Somewhat confident / Not so much / Not at all

5.3 Questions and Hands-on Exercises

i) Provide three reasons why PASW Statistics is appropriate for analyzing census and household survey data to assist EFA monitoring.

ii) What are the key components of PASW Statistics?

iii) Open the “B2_a.txt” file in any text editor and record on a blank sheet (a) how many variables are in this file, and (b) which separation character has been used.

iv) Import the “B2_a.txt” file into the PASW Data Editor and review the characteristics of the new dataset.

v) Connect to the Internet and

(a) find available household survey data files for your country;

(b) download the most recent survey data file;

(c) find and review the questionnaire and codebook for that survey;

(d) note down the variables which are useful for calculating education indicators, especially for EFA monitoring; and

(e) prepare for importing the data, if needed.


Module B3:

Checking, Editing and Preparing Household Survey Data for Analysis

Contents:

1. Metadata Preparation
   1.1 Defining Data: Setting Variable Properties
   1.2 Setting and Editing Metadata through Wizard
   1.3 Copying File and Variable Properties

2. Data Manipulation
   2.1 Changing, Inserting and Deleting Data, Cases and Variables
   2.2 Computing New Variables
   2.3 Recoding

3. Data Preparation
   3.1 Selecting Cases
   3.2 Sorting Cases
   3.3 Rearranging Variables

4. Data Validation
   4.1 Validation with Single-Variable Rules
   4.2 Cross-Variable Rules
   4.3 Multi-Case Rules

5. Tips and Exercises
   5.1 Tips: Do and Don’t
   5.2 Self-evaluation
   5.3 Hands-on Exercises

Purpose and learning outcomes:

To gain knowledge on defining data and checking data quality with PASW

To understand basic techniques of data validation

To understand how to prepare datasets for conducting effective data analyses


1. METADATA PREPARATION

One of the most famous computer and ICT terms is GIGO, “Garbage in, Garbage out”. It simply indicates that if the dataset under analysis is prone to errors, the outputs generated from that dataset are unreliable or unusable. Therefore, after loading a dataset, keep in mind that it is not yet ready for producing analytical outputs. The PASW Data Editor can only display the contents; it cannot ensure the quality of the data.

To conduct meaningful analyses, it is also important to understand the data collection procedure, the questionnaire and coding rules, and how the dataset was prepared and distributed. Moreover, only if the data in the set are defined properly can the data analyst understand them correctly and conduct meaningful data analyses.

Therefore, the logical steps after loading a dataset include:

Metadata preparation:

Defining data

This step is required when data were imported from other formats such as Excel, text or databases. When data are imported from those formats, only the data values with variable names and, at most, the defined missing values will be in the new PASW dataset. In this case, data management should begin with defining the data: providing appropriate variable and value labels, and setting missing values and the measurement level for each and every variable.

Editing data definition

Work on any PASW dataset should begin with reviewing the variables in the dataset and determining their valid values, labels, and measurement levels. Identify combinations of variable values that are impossible but commonly miscoded. Define validation rules based on this information. This is a time-consuming task, but worthwhile to ensure the quality of the data.

Data preparation:

Even if the active dataset is reliable (clean, or of good quality), it may not perfectly fit the type of analyses to be performed. The active dataset may require manipulations such as sorting, aggregation, creation of new variables, conditional selection of cases, and sometimes merging of datasets.

Data validation:

Run basic checks and checks against the defined validation rules to identify invalid cases, variables, and data values. When invalid data are found, investigate and correct the cause. If correction is impossible, decide whether to omit the entire case or to include the case while setting the invalid values as missing or as a special category.

Once the dataset is clean and well prepared, it is ready to be analyzed with the PASW modules. The following sections highlight the tools provided in the PASW base system for metadata preparation, data preparation, and data validation.

This section focuses on metadata preparation, while data manipulation, preparation and validation will be discussed in Sections 2, 3 and 4 respectively.


1.1 Defining Data: Setting Variable Properties

When data are obtained from other sources such as Excel, text, or an Access database, only the variable name, format (numeric or string, width and decimal places) and data values are imported. A few more properties, such as missing values, can be assigned while importing from databases. However, there will be no description of the variable (variable label) and no meaning attached to the data values (value labels), especially when codes, instead of text or words, were imported from the source. Examples follow:

In the above dataset, the variable “HV104 (Sex of household member)” has the values 1 or 2 only. However, users cannot know what 1 and 2 stand for, since 1 could stand for "Male" or "Female" depending on the coding scheme.

Therefore, it is impossible to answer a simple question, “how many household members are female?”, from the above frequency table created by PASW Statistics.

In PASW, metadata or data dictionary is part of the dataset. It covers such properties as variable label, value labels, formats, and measurement level: scale, ordinal or nominal.


Similarly, from the above frequency table of HV106, no one could know:

“What is HV106?”

“What do the valid values 0, 1, 2, 3, 8 and 9 stand for?” and

“Why do the codes jump to 8 after 3, and where are 4, 5, 6, and 7?”

To answer such questions, the next step after importing data or opening an existing data file is to specify, or check and edit, the variable label, value labels, missing values and measurement level for each and every variable in the dataset. For entering variable labels, value labels and missing values, the codebook (or the survey questionnaire, if the codes are printed on it) is essential.

To define a variable label, just click the appropriate cell and type the label in directly, as follows.

To define the value labels, select “Variable View” in the PASW Statistics Data Editor. Then, follow the steps below:

1. Click the cell under “Values”, and the “Value Labels” window will pop up;

2. Type the code in the “Value” box;

3. Type the appropriate label in the “Label” box;

4. Press the “Add” button, and the value and its label will appear in the space below;

5. Repeat Steps 2, 3 and 4 until all value labels have been defined, and press “OK” after entering the last valid code to complete defining the value labels.

Note: Starting from version 17.0, PASW Statistics allows checking the spelling of value labels (click the “Spelling” tab). Similarly, users can define “missing values” by clicking the cell under “Missing” and following a procedure similar to that for defining value labels.
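The effect of value labels and missing values on a frequency table can be illustrated with a short script. The following is a minimal Python sketch, not PASW output: it assumes the codebook gives 1 = Male and 2 = Female for HV104 (the coding used for this variable later in this module), treats code 9 as a user-defined missing value (an assumption), and uses invented data values.

```python
from collections import Counter

# Value labels as they would be taken from the codebook (assumed coding).
VALUE_LABELS = {1: "Male", 2: "Female"}
# Code treated as a user-defined missing value (assumption of this sketch).
MISSING = {9}

# Hypothetical data values for HV104, not real survey data.
hv104 = [1, 2, 2, 1, 2, 9, 2]


def frequency_table(values, labels, missing=()):
    """Return (code, label, count, status) rows sorted by code."""
    counts = Counter(values)
    table = []
    for code in sorted(counts):
        label = labels.get(code, str(code))  # unlabeled codes show the code
        status = "missing" if code in missing else "valid"
        table.append((code, label, counts[code], status))
    return table


for row in frequency_table(hv104, VALUE_LABELS, MISSING):
    print(row)
```

With the labels in place, the question “how many household members are female?” can be read directly from the row labeled "Female".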


Running the same analysis (frequencies) on the variable “HV106” after defining the variable label, value labels and missing values provides the following output, which is easier to understand and ready to place in a report or presentation.

Within Variable View, all properties (or definitions) of the variables, such as name, type and measurement level, can be added, changed or removed as required. By default, PASW automatically assigns the measurement level of imported variables as “scale” for numeric variables and “nominal” for string variables. This is insufficient for some advanced analyses, and thus the measurement level of the variables must be checked and changed where necessary. For example, the measurement level of the variable “HV106” can be changed from nominal to ordinal, which is more suitable for the variable.

Note: Click the cell under “Values” to define value labels, and repeat until all value labels have been added.

1.2 Setting and Editing Metadata through Wizard

The "Metadata Wizard" can also be applied to the imported data files instead of setting manually as

described in the previous section.

The steps in this procedure are:

1. Click “Data” on the main menu bar; and

2. Select “Define Variable Properties…”.

The “Define Variable Properties” window will pop up so that the variables to be defined can be chosen. For demonstration purposes, just two variables, HV219 “Sex of head of household” and HV104 “Sex of household member”, are selected in the following example.

3. Click the variable name(s) to select the variable(s) to be defined;

4. Double-click, or click the arrow button, to move the variable name to the “Variables to scan” pane on the right; repeat Steps 3 and 4 until all required variables have been placed in the right pane;

5. After selecting all variables, click “Continue” to start scanning the variables.

A new “Define Variable Properties” window will appear and show the scanned results by variable. In this window, one can set:

(i) Variable label (type into the blank space provided),

(ii) Data type (select from the dropdown), width and decimal places (type in), and

(iii) Measurement level (select from the dropdown).

After completing the variable HV219, select HV104 and follow the same procedure described in (i), (ii) and (iii). Then,

6. Complete “Setting variable properties” by clicking “OK”.

PASW provides a wizard-like method of setting variable properties for the new variables, and also for checking and editing variable properties for existing variables in a dataset.


Alternatively, after the properties are set for HV219, they can be copied to HV104, since both variables are of the same nature and use the same codes: 1=Male and 2=Female (i.e. the same value labels).

To copy variable properties, except the variable label, from HV219 to HV104:

(a) Press the “To Other Variables...” button.

Then, in the “Apply Labels and Level to” window:

(b) Select the variable HV104; and

(c) Click “Copy” to copy the variable properties.

All properties of the variable HV219, except the variable label, are copied to HV104. Thus,

(d) Type in the variable label for HV104, and click “OK” to complete the process.

It should be noted that variable properties can be copied only among the variables scanned during the same session.

[Figure: “Define Variable Properties” window. Callouts: type in the variable label and value labels; click and select the measurement level and variable type; buttons (a) to (d) and step 6; type in the variable label for HV104.]

The dataset will then appear in the Variable View as follows:

Setting of variable properties should be carried out on all variables in the dataset for easier

understanding and effective analyses.

Tip:

Sometimes the source data file contains data in text format for some variables, such as “male” or “female” instead of 1 and 0. In this case, it is essential to code such variables for easier analysis. PASW Statistics provides automatic coding through the AUTORECODE command. For detailed information on the AUTORECODE command, please refer to the “Base User Guide” for PASW Statistics 17.0.

1.3 Copying File and Variable Properties

The “Copy Data Properties” in the “Data” menu provides the ability to use an external PASW

Statistics data file as a template for defining file and variable properties in the active dataset.

Similarly, properties of variables in the active dataset can also be copied to other variables in the

same dataset.

The “Copy Data Properties” wizard can be used to:

• Copy selected file properties from an external data file or open dataset to the active dataset. File

properties include: documents, file labels, multiple response sets, variable sets, and weighting.

• Copy selected variable properties from an external data file or open dataset to matching

variables in the active dataset. Variable properties include: value labels, missing values, level of

measurement, variable labels, print and write formats, alignment, and column width used in the

Data Editor.

• Copy selected variable properties from one variable in (i) an external data file, (ii) open dataset,

or (iii) the active dataset to many variables in the active dataset.

• Create new variables in the active dataset based on selected variables in an external data file or

open dataset.

When copying data properties, the following general rules apply:

• If an external data file is used as the source, it must be in PASW Statistics format;

• Undefined (empty) properties in the source dataset do not overwrite defined properties in the

designated dataset; and

• Variable properties are copied from the source variable only to target variables of a matching

type--string (alphanumeric) or numeric (including numeric, date, and currency).

Variable properties can be copied from the source file to matching variables in the active dataset.

Variables "match" if both the variable name and type (string or numeric) are the same. For string

variables, the defined length must also be the same.
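
The matching rule described above can be written as a small predicate. The following Python sketch only illustrates the rule and is not PASW code; the `Variable` record, its field names and the `REGION` variable are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Hypothetical description of a variable: name, basic type, string length."""
    name: str
    kind: str          # "string" or "numeric"
    length: int = 0    # defined length; only meaningful for string variables


def variables_match(source: Variable, target: Variable) -> bool:
    """Return True if the source and target variables "match" as described above."""
    # Name and basic type (string or numeric) must both be the same.
    if source.name != target.name or source.kind != target.kind:
        return False
    # For string variables, the defined length must also be the same.
    if source.kind == "string":
        return source.length == target.length
    return True


print(variables_match(Variable("HV104", "numeric"), Variable("HV104", "numeric")))
# True
print(variables_match(Variable("REGION", "string", 20), Variable("REGION", "string", 12)))
# False
```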

Moreover, variables that are not in the active dataset can be created using the properties of the selected variables in the source file. To do this, the source list must be updated to display all variables in the source data file. If you select source variables that do not exist in the active dataset (based on variable name), new variables will be created in the active dataset with the variable names and properties from the source data file.

If the active dataset contains no variables (a blank, new dataset), all variables in the source data file are displayed, and new variables based on the selected source variables are automatically created in the active dataset. This is the easiest way to create a new dataset (like an Excel worksheet) for direct data entry; the dataset without data can also be shared as an electronic codebook.

To copy the data file properties and variable properties, which may be required after importing data from other file formats, first select “Variable View” of the “Data Editor” and follow the steps below:

1. Click “Data” on main menu bar; and

2. Select “Copy Data Properties…” and “Copy Data Properties” wizard will appear;

3. Click the “Browse” button in the bottom right area and select the PASW data file to be used as the source of the properties;

OR, type in the file name with its full address, for example,

“C:\PASW Training\Sample\Data1.sav”

Copying variable and file properties from a well-defined data file to another data file is an easy task in PASW Statistics.

4. Then, click “Next” to proceed to the Step 2 of the Wizard;

The Wizard will scan both the source and target datasets, and display the matching variables from the source file in the left pane and from the active dataset in the right pane. The number of selected variables is displayed at the bottom of the list.

5. Click “Finish” to copy with the default settings, or “Next” to change the settings;

The following settings can be changed in Steps 3 and 4 of the Wizard.

If the Wizard is followed step by step, a summary of what will be copied is displayed in Step 5. After pressing the “Finish” button, whether at the end of Step 2, 3, 4 or 5, the active dataset will have the selected properties of the source PASW data file.

Alternatively, properties can be copied from an open dataset if more than one dataset is open. Just select “An open dataset” as the “Source of the properties” in Step 1, and follow the same steps. Here, new variables from the source dataset will be added to the active dataset if “Create matching variables in the active dataset if they do not already exist” is ticked. All variables (press <Ctrl>A) or only some variables (click the variable names while holding the <Control> key) can be selected from the source list. In this case, the bottom of the active dataset list will display both (i) the number of matching variables, 12 in this example; and (ii) the number of variables to be created, 10 in this example.

[Figure: Data Editor after copying properties. The newly inserted variables are shown; they contain no valid data.]

In the above example, 10 new variables will be added to the active dataset with the same variable names and properties by copying the properties of all variables from the source dataset. It should be noted that the data values are not copied to the active dataset.

PASW Statistics also allows copying variable properties from one variable to another in the same dataset. For example, in the sample dataset, two variables, sex of head of household (HV219) and sex of household member (HV104), share the same codes “1=Male” and “2=Female”, with 9 as the missing value. If the codes have been entered and the missing value has been defined for the head of household (HV219), those properties can be copied to household member (HV104).

To do this, select the third option in “Choose the source of the properties”, which is “The active dataset”, in Step 1 of the Wizard. Then click a source variable, and then click the target variable(s). As usual, the <Control> key must be held down while clicking additional variable names. After selecting all target variables, just click “Finish” to begin the copying process.

With this option, the user must type in appropriate variable labels for the target variables.

[Figure: Variable View after copying. The variable labels of the target variables are the same as that of the source variable; the user must change these variable labels.]

2. DATA MANIPULATION

Preparing for data analysis

The following two steps are essential after setting variable properties in order to conduct an appropriate and productive data analysis:

(1) list the prospective outputs and lay out suitable analytical methods for them; and

(2) check which outputs can be generated directly from the existing datasets, and which outputs may require further manipulation such as sorting; calculation/creation of new variables (temporary or permanent); transformation (coding, grouping, etc.); and creation of new datasets (aggregation, subsetting and merging of the existing datasets).

PASW allows data transformations ranging from simple tasks, such as collapsing categories for analysis, to more advanced tasks, such as creating new variables based on complex equations and conditional statements. In this chapter, some important techniques of data manipulation and transformation will be discussed.

Surveys can provide very rich information. However, most survey datasets are not yet ready for analysis or for producing the output tables needed to construct EFA monitoring indicators.

Example:

The working dataset contains data extracted from a household survey, with personal records of all household members and the variables: age, sex, schooling status, and the class/grade currently attended. The requirement is to produce the “age-specific enrolment rate (ASER) for children aged 6 to 14 by sex” from this dataset. It is impossible to compute the ASER directly from the working dataset since:

(a) total number of children aged 6 to 14 by single year of age by sex (which is

denominator); and

(b) number of children aged 6 to 14 who are currently attending school by single year

of age by sex (which is numerator), are not available in the current dataset.

This task requires the following steps:

(a) Extracting the cases for aged 6-14 only;

(b) Counting of all children, irrespective of schooling or not, by age and sex, for

denominator;

(c) Counting of children who are currently attending school by age and sex, for

numerator; and

(d) Calculation of ASER by age and sex.

Step (a) can be carried out with the “case selection” command, while the “aggregate” command is suitable for Steps (b) and (c), and the “compute” command creates the new variable, ASER, in Step (d).
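
The logic of Steps (a) to (d) can be sketched outside PASW as follows. This Python fragment is only an illustration under stated assumptions: the records are made-up sample data, and the tuple layout `(age, sex, currently_attending)` is hypothetical, not the layout of the survey file.

```python
from collections import Counter

# Hypothetical person records: (age, sex, currently_attending)
persons = [
    (5, "F", False),
    (6, "F", True),
    (6, "F", False),
    (6, "M", True),
    (7, "M", True),
    (7, "M", True),
    (15, "M", True),
]


def age_specific_enrolment_rate(records, min_age=6, max_age=14):
    """Return {(age, sex): ASER in percent} for ages min_age..max_age."""
    # Step (a): keep only the cases aged 6 to 14.
    selected = [r for r in records if min_age <= r[0] <= max_age]
    # Step (b): denominator, all children by age and sex.
    total = Counter((age, sex) for age, sex, _ in selected)
    # Step (c): numerator, children currently attending school by age and sex.
    attending = Counter((age, sex) for age, sex, att in selected if att)
    # Step (d): ASER = numerator / denominator * 100.
    return {key: 100.0 * attending[key] / total[key] for key in total}


aser = age_specific_enrolment_rate(persons)
for key in sorted(aser):
    print(key, aser[key])
```

In PASW the same result is obtained through case selection, aggregation and computation; the sketch only shows why both counts are needed before the rate can be computed.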

2.1 Changing, Inserting and Deleting Data, Cases and Variables

Changing the identification (or properties) of a variable:

To change a property of a variable, for example the variable name, select the cell with the variable name that you want to change in “Variable View” and type in the new name. All variable properties can be changed in this way in “Variable View”. Care must be taken when changing variable types: if a string variable is changed to numeric, all alphanumeric data values become missing values (“.”), and only blanks (zero-length strings) will be obtained if the variable is later changed back to string type. This may also happen with some other data types.

If data values are to be changed, select “Data View”, locate the cell and type in the new value, one cell after another, as in a spreadsheet program.

Adding variables or cases to an existing dataset:

For example, a variable for education level, “EdLevel”, should be added to give a better understanding of the educational attainment of all household members. To add a new variable, select “Variable View” and right-click the row number where the new variable is to be inserted. PASW Statistics will insert the variable before the existing variable on that row with the name “Var00001”, “Var00002”, “Var00003”, and so on. The variable type of a newly created variable is numeric with F8.2 format (width 8, 2 decimal places). There will be no variable label and no value labels. The user can input or import the variable attributes for new variables, as presented in the above section, including variable name, type, width and decimal places, variable label and measurement level. Where applicable, value labels should also be defined.

In the PASW Statistics Data Editor, it is simple to change the value of a specific cell, or the properties of a variable, such as name, type, label, value labels and measurement scale.

[Figure: Variable View. Select a row and right-click to get the pop-up menu, then click “Insert Variable”; type in the variable name and edit the properties as necessary.]

A new variable can also be inserted in “Data View” by clicking the existing variable name before which the new variable is to be inserted. Then go to “Variable View” and change the properties. To delete a variable, click the variable name (in Data View) or the row number (in Variable View) and press the “Delete” key.

Inserting cases can be carried out only in “Data View”. Select the row (or several contiguous rows) where the new case(s) are to be inserted, right-click and select “Insert Cases”. Similarly, selecting case(s) and pressing the “Delete” key will delete the selected cases. Alternatively, you can use the Clear command in the Edit menu.

2.2 Computing New Variables

Creation of new variables from existing variables is a common and essential task in data analysis.

Example:

In many annual school censuses, the total service of primary school teachers is recorded in months for better accuracy. However, it often needs to be summarized, or related to other variables, in years. A new variable “service in year” must then be computed as “service in month” divided by 12.

Case study:

The sample dataset extracted from the “Bangladesh Demographic and Health Survey 2007” contains the highest education level (HV106) and highest year of education (HV107) for all household members. However, there is no educational attainment in the usual “Grade” or “Grade-level” form, that is, “Primary 2” or “Secondary 4”, and so on. To study the highest grade-level attended by adult household members (aged 15 and above), a new variable “Grade” must be calculated from the two existing variables as:

Grade = HV106 * 10 + HV107, for HV106 = 0, 1, 2, 3 and HV107 is not 98; and

Grade = Missing, if HV106 = 8 (Don’t know) or HV107 = 98 (Don’t know).

To calculate the new variable “Grade”, the “Compute Variable” command is available under

“Transform” menu in the Data Editor. To create a new variable:

1. Click “Transform” on main menu bar; and

2. Click “Compute Variable” item and “Compute Variable” window will appear.

3. Fill-in “Target Variable” name, and optionally, the type and label of new variable

can also be set by clicking the button under target variable name;

Use Compute to get values for a variable, an existing one or a newly created one, based on numeric transformations of other variables.

[Figure: “Compute Variable” window with steps 1 to 6 marked. Callout: compute only for the cases which are not “unknown” for both education variables and Age > 15.]

4. Build the numeric expression from the existing variables together with numbers, PASW Statistics built-in functions, and operators such as +, -, >, <, etc.;

5. If only the cases that meet certain criteria are to be included, press the “If…” button located at the lower left corner of the window and fill in the conditions; and

6. Click “OK” to complete the task.

A new variable, “Grade”, has been added to the current dataset at the end of the variable list. Although a new variable name was provided here, the target of the “Compute” command can also be an existing variable. After creating a new variable, it is important to define it thoroughly by setting labels, missing values and measurement level.
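
The rule for “Grade” given in the case study can be checked with a few lines of code. The Python sketch below is an illustration only, not PASW syntax; the function name is hypothetical, `None` stands for a missing value, and the age filter follows the text (aged 15 and above).

```python
def compute_grade(hv106, hv107, age):
    """Return Grade = HV106 * 10 + HV107, or None (missing) when not applicable."""
    if age is None or age < 15:         # restrict to adults (aged 15 and above)
        return None
    if hv106 not in (0, 1, 2, 3):       # 8 = don't know; other codes treated as missing
        return None
    if hv107 is None or hv107 == 98:    # 98 = don't know
        return None
    return hv106 * 10 + hv107


print(compute_grade(1, 2, 20))   # 12, i.e. "Primary 2"
print(compute_grade(2, 4, 30))   # 24, i.e. "Secondary 4"
print(compute_grade(8, 0, 40))   # None
```

Multiplying the level by 10 keeps the level in the tens digit and the year in the units digit, so the combined code stays readable and sorts in the expected order.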

2.3 Recoding

Recoding is a common task in data preparation. Sometimes, values (or categories or codes) of a nominal or ordinal variable require regrouping for further analyses. For example, grouping a single-year population into school-going age groups is essential for calculating education indicators. Sometimes, data entered in text format, for example area names, should be changed into numeric values for ease of analysis. These tasks can be carried out with the following PASW commands:

1. Automatic Recode;

2. Recode into Same Variables; and

3. Recode into Different Variables.

Automatic Recode

Automatic recoding is useful for string variables with a limited number of different values, for example, male or female; urban, suburban, rural or remote. When the existing categorization of a variable is no longer needed after recoding, the “Recode into Same Variables” option can be selected; otherwise, select “Recode into Different Variables” to maintain the original variable.

To perform automatic recoding:

1. Click “Transform” on main menu bar; and

2. Click “Automatic Recode”, and a new window will appear;

3. Select one variable and send it to the area under “Variable -> New Name”;

4. Type appropriate name for the recoded variable in “New Name” box;

5. Click “Add New Name” button; Repeat Steps 3, 4, and 5 for all variables to recode.

6. Select whether to recode starting from the “Lowest value” or “Highest value”;

7. Select whether to “use the same recoding scheme for all (selected) variables”, and whether to “treat blank string values as user-missing” or not; and

8. Click “OK” to complete the task.
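
What automatic recoding does to a string variable can be sketched as follows. This is an illustrative Python fragment, not the AUTORECODE implementation: it assigns consecutive integer codes to the distinct values in sorted order and keeps the original strings as value labels. Returning `None` for blank strings is a simplification of the “user-missing” option, and the function name and sample values are hypothetical.

```python
def auto_recode(values, descending=False):
    """Map each distinct non-blank string to a consecutive integer code starting at 1.

    Returns (codes, value_labels), where value_labels maps code -> original string.
    """
    # Start from the lowest value by default, or from the highest if requested.
    distinct = sorted({v for v in values if v != ""}, reverse=descending)
    code_of = {text: code for code, text in enumerate(distinct, start=1)}
    codes = [code_of.get(v) for v in values]   # blank strings become None
    value_labels = {code: text for text, code in code_of.items()}
    return codes, value_labels


residence = ["urban", "rural", "remote", "urban", "suburban"]
codes, labels = auto_recode(residence)
print(codes)    # [4, 2, 1, 4, 3]
print(labels)   # {1: 'remote', 2: 'rural', 3: 'suburban', 4: 'urban'}
```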

RECODE changes, rearranges, or consolidates the values of an existing variable. RECODE can be executed on a value-by-value basis or for a range of values.

Then, two new variables “Division” and “SES” will be added to the current dataset with the

following coding schemes (codes and value labels).

In some cases, more than one variable shares the same values; for example, ‘Sex of head of household (HV219)’ and ‘Sex of household member (HV104)’ must have only two valid values, “Male” and “Female”. Similarly, several variables can take just “Yes”, “No” and non-response or missing value; for example, ‘Usual resident (HV102)’, ‘Slept last night (HV103)’ and ‘Member still in school (HV110)’ are such variables in the sample dataset. To recode such a group of variables, just tick the checkbox “Use the same recoding scheme for all (selected) variables” in Step 7.

The following exhibit shows the automatic recoding of two variables, HV103 and HV102.

‘Automatic recode’ is simple and useful for exploring a newly imported file, or for beginners.

[Figure: Automatic Recode of HV103 and HV102 with steps 7 and 8 marked. All properties are the same for both variables.]

Recode into Different Variables

“Recode into Different Variables” is the most useful recoding procedure for general users. In this procedure, users can select all the recode options, and both old and new variables are maintained in the dataset. Before manual recoding, it is important to look at the frequency distribution of the variable under study. The variable “Highest education level (HV106)” will be used as an example in this section. The frequency table for the variable HV106 is as follows:

Here, 6 different items, ‘9’, ‘DK’, ‘Higher’, ‘No education, preschool’, ‘Primary’ and ‘Secondary’, are listed as valid values of the variable. According to the codebook of the DHS Survey, ‘9’ represents the missing value and ‘DK’ represents ‘Do not know’. Since the variable under study is ‘educational attainment’, it is valid only for those aged 6 and above. Thus, it is logical to code as follows for the population (household members) aged 6 and above:

0 = No education, preschool

1 = Primary

2 = Secondary

3 = Higher

8 = DK, and

9 = (system) missing value.

To do this,

1. Click “Transform” on main menu bar; and

2. Click “Recode into Different Variables” and a new window will appear;

3. Select the variable “Highest education level (HV106)” and send it to the area “Input Variable -> Output Variable:”;

4. Input a new “Name” and appropriate variable “Label” for the output variable, and

click “Change” button to set new variable name and label;

5. Click “Old and New Values” button and a new window will appear for setting;

In “Old and New Values” window:

(i) Type in the old value (or a range), e.g. “Primary”;

(ii) Type in new value, e.g. “1”; and

(iii) Press “Add” button to add transformation rule into the process;

(iv) Repeat above steps for all pairs of values and click “Continue” to complete

selection and return to main recode window;

6. Click “If…” button and a new window will appear for case selection setting;

In “If Cases” window:

(a) Select “Include if case satisfies condition:” button;

(b) Construct (or type in) the condition, e.g. “HV105 > 5”; and

(c) Click “Continue” to return to main recode window; and

7. Click “OK” on the “Recode into Different Variables” window to complete the task.
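
The recoding rules and the case condition set in the steps above amount to a lookup table plus a filter. The Python sketch below illustrates this logic only; the names `RECODE_MAP` and `recode_education` are hypothetical, and `None` stands for the system-missing value.

```python
# Old value -> new value, as listed in the coding scheme above.
RECODE_MAP = {
    "No education, preschool": 0,
    "Primary": 1,
    "Secondary": 2,
    "Higher": 3,
    "DK": 8,
}


def recode_education(hv106_text, age):
    """Return the numeric EdLevel code, or None (system-missing)."""
    if age is None or age <= 5:          # case condition from the text: HV105 > 5
        return None
    # "9" (missing) and any unlisted value are left as system-missing.
    return RECODE_MAP.get(hv106_text)


print(recode_education("Primary", 12))                  # 1
print(recode_education("No education, preschool", 4))   # None
print(recode_education("9", 30))                        # None
```

The original text values are not modified, which mirrors the fact that “Recode into Different Variables” keeps both the old and the new variable.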

After creating a new variable with the recode command, all necessary properties must be set for the new variable, such as variable format (type, width and decimal places), value labels, missing values, etc.

The new variable can be observed as follows:

Similar steps are carried out for “Recode into Same Variables”.

[Figure: “Recode into Different Variables” windows with steps 3 to 6, (i) to (iv) and (a) to (c) marked, and the resulting variable in the Data Editor. Notes: no value labels yet, just set width and decimal places; since age (HV105) is < 6 yr, EdLevel is “Missing”; since age (HV105) is > 6 yr, EdLevel code is “0”.]

Visual Binning

PASW Statistics also provides “Visual Binning” under the “Transform” menu to create new variables automatically by grouping contiguous values of existing variables into a limited number of distinct categories. Visual Binning can help to:

• Create categorical variables from continuous scale variables. For example, use the scale variable “age” to create a new categorical variable that contains 5-year age groups.

• Collapse a large number of ordinal categories into a smaller set of categories. For example,

collapse the twenty 5-year age groups into 5 groups: 0-19, 20-39, 40-59, 60-79, and 80+.

To conduct visual binning, first select a scale variable (HV105 Age of household members) and

follow the steps below:

1. Click “Transform” on main menu bar; and

2. Click “Visual Binning” and a new window will appear;

3. In the “Visual Binning” window:

(i) select the scale variable(s) to bin and move those variables into “Variables

to Bin” pane; and

(ii) click the “Continue” button when the selection is complete;

PASW Statistics will analyze the selected variables, and present a graphical distribution of

the variable after binning in the new “Visual Binning” window. Here,

4. Input an appropriate “name” for the binned variable;

5. Input variable “label” for the binned variable; and

6. Click the “Make Cutpoints…” button to define the cutpoints for the binning; the “Make Cutpoints” window will appear;

Cutpoints can be constructed based on three options: (i) equal width intervals; (ii) equal percentiles based on scanned cases; and (iii) cutpoints at the mean and selected standard deviations (1, 2 or 3 SD) based on scanned cases.

Generally, making cutpoints with equal width intervals is more common and

suitable in analyzing household surveys on education.

In the “Make Cutpoints” window:

7. Input “4” as first cutpoint location since the first age group of common 5-year

interval is 0-4;

8. Input “5” as the Width (or class interval), and the “number of cutpoints” will be

filled automatically, 19 in this example;

9. Click “Apply”, and the Visual Binning window will reappear with the intervals set.

Then, in the main “Visual Binning” window:

10. Click “Make Labels” button to generate value labels automatically and the user can

change labels as appropriate; and

11. Finally, click “OK” to create a new binned variable called “Age”.

As usual, properties of the new binned variable must be checked and changed as necessary.
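
The equal-width cutpoints used in this example (first cutpoint 4, width 5, 19 cutpoints) can be reproduced with a short calculation. The Python sketch below is an illustration only; it assumes that each cutpoint is the upper end of its bin (values less than or equal to the cutpoint fall in that bin) and that ages are non-negative, and the function names are hypothetical.

```python
from bisect import bisect_left

FIRST_CUTPOINT = 4
WIDTH = 5
NUMBER_OF_CUTPOINTS = 19

# Cutpoints 4, 9, 14, ..., 94 define 20 bins: 0-4, 5-9, ..., 90-94, 95+.
cutpoints = [FIRST_CUTPOINT + WIDTH * i for i in range(NUMBER_OF_CUTPOINTS)]


def bin_age(age):
    """Return the 1-based bin number; each cutpoint is the upper end of its bin."""
    return bisect_left(cutpoints, age) + 1


def bin_label(bin_number):
    """Return a simple label such as "0-4" or "95+"."""
    if bin_number > len(cutpoints):
        return f"{cutpoints[-1] + 1}+"
    upper = cutpoints[bin_number - 1]
    return f"{upper - WIDTH + 1}-{upper}"


for age in (4, 5, 97):
    print(age, bin_age(age), bin_label(bin_age(age)))
```

Nineteen cutpoints give twenty bins, because the last bin collects every value above the final cutpoint.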

The frequency table of the variable “Age” is as follows:

3. DATA PREPARATION

After checking and editing the dataset, setting the variable properties, and recoding as necessary, the dataset is ready for the preparation of data analyses.

Before making any analysis:

(1) list the prospective outputs and lay out suitable analytical methods for them; and

(2) check which outputs can be generated directly from the existing datasets, and which may require further manipulation such as sorting; calculation/creation of new variables (temporary and/or permanent); transformation (coding, grouping, etc.); and creation of new datasets (aggregation, subsetting and merging of the existing datasets).

PASW Statistics allows data transformations ranging from simple tasks, such as collapsing categories for analysis, to more advanced tasks, such as creating new variables based on complex equations and conditional statements. In this chapter, some important techniques of data manipulation and transformation will be discussed.

For effective data analysis, users must prepare the dataset efficiently. The most frequently used data preparation techniques include sorting and selecting cases.

Example:

The working dataset contains data extracted from a household survey with personal records of all household members. The variables include: age, sex, schooling status, and the class/grade currently attended; and the requirement is to produce the “age-specific enrolment rate (ASER) for children aged 6 to 14 by sex”. In this situation, it is impossible to compute the ASER directly from the working dataset, since the analyst needs a dataset with:

(a) total number of children aged 6 to 14 by single year of age by sex [which is denominator];

(b) number of children aged 6 to 14 who are currently attending school by single year of age by

sex [which is numerator], before computing age-specific enrolment rate, ASER.

This requires:

(a) selection of cases (extracts cases of aged 6-14);

(b) aggregation of personal data to get grouped data by age and sex, that is, counting of all

children irrespective whether schooling or not, and of children who are currently attending

school, by age and sex; and

(c) calculation of ASER by age and sex.

[Note: The calculation is much easier and simpler if “Custom Tables” option is installed.]

3.1 Selecting Cases

Selection of cases is essential whenever a specific subset of the data is to be analyzed based on set criteria, for example, to study the percentage of “out-of-school girls aged 6-14”. To do this:

1. Click “Data” on main menu bar; and

2. Click “Select Cases”, which is the second last item on the list;

3. The “Select Cases” window will appear; select “If condition is satisfied”; and

4. Click “If” button and a new window “Select Cases: If” will appear.

5. Construct selection statement using variables, operators and functions;

then, click “Continue”;

6. Select output option:

i. Filter out unselected cases;

ii. Copy selected cases to a new dataset (provide the new dataset name); or

iii. Delete unselected cases;

7. Click “OK” button and a new Data Editor window will appear with selected cases.

Select Cases provides several methods for selecting a subgroup of cases based on criteria that include variables and complex expressions. Users can also select a random sample of cases.

There are three output options:

“Filter out unselected cases” - a cross sign (X) is put on the unselected cases, as the following picture shows. The unselected cases will not be used in subsequent analyses. To restore the original dataset, run Select Cases again with the option to select all cases.

“Copy selected cases to a new dataset” - this creates a new dataset and leaves the current dataset intact. Users can switch between the original dataset and the newly created dataset, or use both datasets together through PASW syntax.

“Delete unselected cases” - this deletes all unselected cases from the current dataset. With this option, the original dataset cannot be restored; it is therefore important to save the original dataset beforehand. The sub-dataset, which contains only the selected cases, should also be saved under an appropriate name as soon as the selection process is complete.

The following cross-tabulation provides the percentage of out-of-school girls aged 6-14 by single year of age.
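
The selection condition and the three output options described above can be sketched in a few lines. The Python fragment below is only an illustration of the behaviour, not PASW code; the records and field names are made up and are not DHS variable names.

```python
# Hypothetical records; the field names are illustrative only.
cases = [
    {"id": 1, "age": 8,  "sex": "F", "in_school": False},
    {"id": 2, "age": 10, "sex": "F", "in_school": True},
    {"id": 3, "age": 12, "sex": "M", "in_school": False},
    {"id": 4, "age": 16, "sex": "F", "in_school": False},
]


def is_selected(case):
    """Selection condition: out-of-school girls aged 6 to 14."""
    return case["sex"] == "F" and 6 <= case["age"] <= 14 and not case["in_school"]


# Option i: filter. All cases stay; a 0/1 flag marks which ones analyses should use.
filter_flags = [1 if is_selected(c) else 0 for c in cases]

# Option ii: copy the selected cases to a new dataset; the original is left intact.
new_dataset = [dict(c) for c in cases if is_selected(c)]

# Option iii: delete the unselected cases from the working dataset itself.
working = list(cases)
working[:] = [c for c in working if is_selected(c)]

print(filter_flags)                      # [1, 0, 0, 0]
print([c["id"] for c in new_dataset])    # [1]
print(len(cases), len(working))          # 4 1
```

Only the third option changes the working data irreversibly, which is why the original file should be saved first.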

3.2 Sorting Cases

Cases can be sorted in ascending or descending order based on one or more variables in the dataset. In the sample dataset, households can be sorted by wealth index to observe the characteristics of households of similar wealth status. Moreover, some PASW Statistics commands require a pre-sorted dataset; for example, the “aggregate” command requires the dataset to be sorted by the break variable(s). Sorting can be carried out through the “Sort Cases” command under the “Data” menu as follows:

1. Click “Data” on main menu bar; and

2. Click “Sort Cases”. Then, “Sort Cases” window will appear;

3. Select the first key variable and send to “Sort by” pane and set “Sort Order”;

Repeat Step 3 for all key variables in the order of importance;

4. Click “OK” button to start sorting.

The following example sorts the current dataset by two variables: ‘Education in single year (HV108)’ in ascending order and ‘Age of head of household (HV220)’ in descending order.
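
A two-key sort with mixed directions, as in this example, can be sketched as follows. The Python fragment is an illustration with made-up values; it relies on both keys being numeric, so the descending key can simply be negated.

```python
# Hypothetical household records with the two sort keys from the example.
households = [
    {"HV108": 5, "HV220": 40},
    {"HV108": 0, "HV220": 35},
    {"HV108": 5, "HV220": 62},
    {"HV108": 0, "HV220": 51},
]

# First key HV108 ascending; second key HV220 descending (negated).
sorted_households = sorted(households, key=lambda r: (r["HV108"], -r["HV220"]))

for row in sorted_households:
    print(row["HV108"], row["HV220"])
# 0 51
# 0 35
# 5 62
# 5 40
```

The second key only decides the order among cases that tie on the first key, which is why the key variables are entered in order of importance.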

Sorted data:

SORT CASES reorders the sequence of cases in the active dataset based on the values of one or more variables. Optionally, cases can be sorted in ascending or descending order, or in combinations of ascending and descending order for different variables.

3.3 Rearranging Variables

Sometimes the original dataset does not present the variables in a convenient order; for example, education-related variables may be spread over several locations. On other occasions, linked variables are so far apart that the linkage cannot be observed visually. In such cases, the associated or linked variables, or the variables under investigation, can be grouped into a new dataset or moved to the top of the variable list.

Relocating Variables

To move a variable from its current position to a new one, just click the selected variable, then drag and drop it at the desired position in “Variable View” or “Data View”. For example, to place “Line number of head of household (HV218)” in the second position in the list:

1. Select the variable by clicking on the row number (HV218 at row 6) on Variable View; and

2. “Drag and Drop” at the desired location (in this example, after the first variable in the list).

A red hairline shows the position where the dragged variable would be placed if dropped at that moment.

Relocating variables does not have any impact on the results of data analyses. However, it makes it easier to decide which variables to use for getting the required outputs.

[Figure: Variable View. A thin red line shows the destination; the variable is shown at its new location after moving.]

Variable Sets

When the dataset contains many variables, it is recommended to define and use “Variable Sets”. “Define Variable Sets” under the Utilities menu creates subsets of variables to display in the Data Editor and in the variable lists in dialog boxes. Defined variable sets are saved with PASW format data files.

A variable set can be defined with any combination of numeric and string variables, and a variable

can belong to multiple sets. The order of variables in the set has no effect on the display order of

the variables in the Data Editor or variable lists in dialog boxes.

Two variable sets “Education” and “HH_Head” are defined in the following example with nine

variables in “Education” variable set and eight in the other with four common variables.

To create a variable set:

1. Click “Utilities” on main menu bar; and

2. Click “Define Variable Sets”, and a new window will appear;

3. In the “Define Variable Sets” window, first enter the set name following the PASW naming convention (up to 64 bytes long; any characters, including blanks, are valid);

4. Select and put variables into the “Variables in Set” pane;

5. Click “Add Set” button to create the variable set;

Define as many sets as needed by repeating steps 3-5.

6. Click “Close” button to complete creation of variable sets.

It is strongly recommended to save the dataset under a new name after defining the variable sets.

In this example, the dataset is saved as “BDPR50FL2.sav”.

To use a variable set:

1. Click “Utilities” on main menu bar; and

2. Click “Use Variable Sets”, and a new window with the list of variable sets will appear;

The list of available variable sets includes all variable sets defined, plus two built-in sets:

(i) ALLVARIABLES: contains all variables in the data file, including new variables

created during a session;

(ii) NEWVARIABLES: contains only new variables created during the current session;

(iii) Education: the first user-defined variable set containing 9 variables; and

(iv) HH_Head: the second user-defined variable set containing 8 variables.

3. In “Use Variable Sets” window, first, check the desired variable set(s) and uncheck all

others under “Select variable sets to apply”;

At least one variable set must be selected. If ALLVARIABLES is selected, any other

selected sets will not have any effect, since this set contains all variables. In this example,

“Education” variable set is selected.

4. Click “OK” to complete selection and the following new Data View will appear.

5. To get all variables back, click “Show All Variables” under Utilities menu.
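The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the behaviour stated in the text (sets may overlap, display order follows the data file, ALLVARIABLES overrides any other selection), not how PASW implements it. The membership of each set below is hypothetical; the real “Education” and “HH_Head” sets contain nine and eight variables.

```python
# Hypothetical file order and set membership, for illustration only.
all_variables = ["HV101", "HV104", "HV105", "HV106", "HV108", "HV270"]
variable_sets = {
    "Education": {"HV108", "HV106", "HV105"},
    "HH_Head": {"HV101", "HV104", "HV105"},  # HV105 belongs to both sets
}


def visible_variables(selected_sets):
    """Return the variables shown when the given sets are checked."""
    if not selected_sets:
        raise ValueError("At least one variable set must be selected.")
    if "ALLVARIABLES" in selected_sets:
        # This set contains every variable, so other selections have no effect.
        return list(all_variables)
    wanted = set().union(*(variable_sets[name] for name in selected_sets))
    # Display order follows the data file, not the order within a set.
    return [name for name in all_variables if name in wanted]


print(visible_variables(["Education"]))
print(visible_variables(["Education", "HH_Head"]))
print(visible_variables(["ALLVARIABLES", "Education"]))
```

A variable that belongs to two selected sets, such as HV105 here, is still displayed only once.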

[Figure: “Use Variable Sets” dialog and the resulting Data View, displaying the 9 variables of the “Education” set]

4. DATA VALIDATION

Why is data validation required?

With rapidly expanding computing power and increasing storage capacity at reasonable cost, many recent surveys have been designed to collect more items (resulting in more variables) with better coverage (i.e., larger sample sizes and thus more cases in PASW). This creates a heavier workload for the data handlers: coding staff, entry clerks, and data editors. With time pressure to complete the task on the one hand and shortcomings in staff training and recruitment on the other, the quality of the data passed from data manager to analyst is often in question. In some cases, surveys were planned without a step to check the coding, let alone to verify the data entered.

Education data analysts typically obtain survey data concerning education from various sources and therefore have no way to recheck the coding or data entry. It is thus important to use validation rules to check the validity and consistency of the data before using the dataset.

Validation rules

Generally, there are three types of rules in validating a dataset:

1. Single-variable rules

2. Cross-variable rules, and

3. Multi-case rules.

In PASW Statistics 17.0, these rules are not available in the base system; they are part of the optional “Data Preparation” add-on module. However, the same tasks can be carried out with common PASW Statistics commands, and this is easier if the user understands the PASW syntax (programming) language.

The first two types, single-variable rules and cross-variable rules, require understanding “case

selection” which was discussed in the previous section. The third type of rules is more complicated

and it may need several steps of data manipulations such as creating temporary variables, matching,

aggregation and selection of cases.

PASW Statistics provides a procedure, “Identify Duplicate Cases” in the “Data” menu, to identify duplicate cases in a data file, which is the most important part of the third type, multi-case rules.

This section introduces the simplest data validation procedures, which are nevertheless powerful in pointing out improper or invalid cases and values.

Validate Data helps identify suspicious and invalid cases, variables, and data values in the active dataset.


4.1 Validation with Single-Variable Rules

These rules consist of a set of checks applied to a single variable. Checks for out-of-range or invalid values and for missing values normally fall into this category. Examples: a value of 5 entered for “highest education level (HV106)”, where the only valid codes are 0, 1, 2, 3 and 8; or values other than 1 and 2 (“Male” and “Female”) entered in “sex of household members (HV104)”.

Validation consists of three stages, followed by editing of the invalid cases. The first stage is obtaining the valid values or ranges from the codebook; for example, the valid values for HV104 (sex) are 1 and 2 only, so any other value is invalid.

The second stage is constructing a frequency table. If no invalid values appear in the frequency table, the variable under observation is “valid” under the single-variable rule. If irrelevant values are observed, for example “3” in the variable representing sex, the erroneous cases must be located. Thus, the third stage is using “Select Cases” to split out and inspect the invalid cases.

To check the validity of “sex of household members (HV104)”, follow the steps:

1. Click “Analyze” on main menu bar;

2. Click “Descriptive Statistics”;

3. Then click “Frequencies”; and

4. In the “Frequencies” window, select the variable to study (HV104) and click “OK” to construct the frequency table.

Validation rules that check internal inconsistencies, such as invalid values and cases within a variable, are known as Single-Variable Rules.

[Figure: “Frequencies” dialog (steps 1 to 4) and the resulting frequency table for HV104, with the invalid values for sex marked]

In the above frequency table, 5 cases with the values 3, 4, and 5 are invalid. Therefore, it is necessary to find out which cases contain these invalid values by carrying out the third stage: “case selection” of the invalid cases.

To select invalid cases:

1. Click “Data” on main menu bar;

2. Click “Select cases”;

3. On “Select cases” window, check the option button “If condition is satisfied” and

click “If” button;

4. On “Select cases: If” window, type in criteria: “not (HV104=1 or HV104=2)” or

“~(HV104=1 | HV104=2)” and click “Continue”;

5. Check “Copy selected cases to new dataset” option button and provide the new

dataset name, e.g. “Invalid_Cases”; and

6. Click “OK” to execute the case selection command.
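The three stages above (valid codes from the codebook, a frequency table, and selection of the invalid cases) can be sketched outside PASW as follows. This is a minimal illustration in Python; the case numbers and values are made up, and only the valid codes 1 and 2 for HV104 come from the text.

```python
from collections import Counter

# Hypothetical sample: (case number, HV104 code). Not taken from the survey file.
cases = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 5), (6, 2), (7, 4)]

# Stage 1: valid values taken from the codebook.
VALID_SEX_CODES = {1, 2}


def frequency_table(rows):
    """Stage 2: count how often each HV104 value occurs."""
    return Counter(value for _, value in rows)


def select_invalid(rows, valid_codes):
    """Stage 3: same test as the condition 'not (HV104=1 or HV104=2)'."""
    return [(case_no, value) for case_no, value in rows if value not in valid_codes]


freq = frequency_table(cases)
invalid_cases = select_invalid(cases, VALID_SEX_CODES)

print(sorted(freq.items()))  # every distinct value with its count
print(invalid_cases)         # the cases to inspect and correct
```

Any value that shows up in the frequency table but is absent from the codebook list ends up in the selected subset, which plays the role of the “Invalid_Cases” dataset.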


The output, a new dataset, contains only the 5 invalid cases (shown below after moving variable HV104 to the second position for a better view):

In this situation, the user must decide whether to delete the entire case from the dataset, to change the invalid values to “missing values”, or to check other datasets that may hold the correct values and use them to correct the invalid values in the current dataset.

[Figure: “Invalid_Cases” dataset, showing the case number and the invalid values for sex]

4.2 Cross-Variable Rules

Cross-variable rules use cross-tabulations instead of frequency tables to determine whether invalid cases exist, and apply a slightly different rule for the conditional selection of invalid cases.

In the sample dataset, “highest educational level (HV106)” shows no invalid cases when checked alone with the Frequencies command. However, when it is cross-checked against “age of the household members (HV105)”, a few suspicious entries appear, as follows:

From the above cross-tabulation of age and highest education level, one can easily judge that the 2 cases of “age 4 in primary education” and the 1 case of “age 12 in higher education” are invalid. Moreover, there are a few more cases in all education levels which are not reliable (or on the margin). There are a few options for developing cross-variable validation rules:

Option 1 – to sift out all suspicious cases (invalid and marginal ones):

i) with primary education at age 5 or below (the official entrance age is 6),

ii) with secondary education at age 10 or below (the official starting age is 6+5=11), and

iii) with higher education at age 15 or below (the official starting age is 6+5+5=16).

Rules for checking inconsistencies in a variable through the values of other variables in the same case are called Cross-Variable Rules.

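[Figure: frequency table of HV106 (no visible invalid values) and cross-tabulation of age (HV105) by highest education level (HV106), with cells marked as Invalid, On the margin, or Valid]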

Option 2 – to review only the certainly invalid cases, one can use the following cross-variable rules, which allow a grace period (early entrance) of one year:

i) with primary education at age 4 or below (the official entrance age is 6, but 5 is allowed),

ii) with secondary education at age 9 or below (the official starting age is 6+5=11, but 10 is allowed), and

iii) with higher education at age 14 or below (the official starting age is 6+5+5=16, but 15 is allowed).

Then, the “If” statements to be used in case selection are:

Option 1: (HV105 <= 5 and HV106 = 1) or (HV105 <= 10 and HV106 = 2) or (HV105 <= 15 and HV106 = 3)

Option 2: (HV105 < 5 and HV106 = 1) or (HV105 < 10 and HV106 = 2) or (HV105 < 15 and HV106 = 3)

And the following outputs will be obtained after running appropriate case selection procedures as

presented in the previous section.
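To see how the two “If” statements differ, the sketch below applies both to a handful of made-up records. It is a Python illustration only: the records are hypothetical, and the level codes (1 = primary, 2 = secondary, 3 = higher) are assumed from the way the conditions above pair each code with an education level.

```python
# Hypothetical records: (case number, HV105 age, HV106 highest education level).
records = [
    (101, 4, 1),   # age 4 with primary education
    (102, 5, 1),   # age 5 with primary education
    (103, 10, 2),  # age 10 with secondary education
    (104, 12, 3),  # age 12 with higher education
    (105, 8, 1),   # plausible
    (106, 16, 3),  # plausible
]


def option_1(age, level):
    """Flags invalid and marginal cases (no grace period)."""
    return (age <= 5 and level == 1) or (age <= 10 and level == 2) or (age <= 15 and level == 3)


def option_2(age, level):
    """Flags only certainly invalid cases (one-year grace period)."""
    return (age < 5 and level == 1) or (age < 10 and level == 2) or (age < 15 and level == 3)


flagged_option_1 = [case for case, age, level in records if option_1(age, level)]
flagged_option_2 = [case for case, age, level in records if option_2(age, level)]

print(flagged_option_1)
print(flagged_option_2)
```

Every case flagged under Option 2 is also flagged under Option 1; the cases flagged only by Option 1 are the marginal ones.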

[Figure: case selection output for Option 1 (both invalid and marginal cases) and Option 2 (only certainly invalid cases), showing the case number, age, and education level to be checked and corrected]

4.3 Multi-Case Rules

The multi-case rules are defined by a procedure (a sequence of logical expressions) that flags invalid cases. The most common and useful application of multi-case rules is checking whether there are duplicates in the dataset: a household member entered twice or more, two heads in a single household, two persons in the same household with the same personal ID, and so on.

PASW Statistics allows checking duplicate cases and inspection of unusual cases. Follow the steps

below to check duplicate cases:

1. Click “Data” on main menu bar; and

2. Select “Identify Duplicate Cases”. Then, a new window will appear;

3. Select the variables used to identify duplicate cases (or press Ctrl+A to select all and then deselect unnecessary variables) and move them to the space below “Define matching cases by:”;

4. Set the options:

(a) “Sort within matching group” - select the variable(s) from the remaining ones in the list as the key for sorting within the matching groups;

(b) “Sort” - if a key variable for sorting is selected, define the sort order;

(c) “Variables to create” – tick the check box if a frequency table showing how many duplicates were detected is wanted, or to mark which cases are the duplicates; then also specify:

i. which is the primary case: the first or the last case among the duplicates?

ii. whether to count all duplicate cases sequentially or to count only the non-primary cases (the primary case is not considered a duplicate);

(d) Tick “Move matching cases to the top” to make the duplicates easier to review; and

(e) Tick “Display frequencies for created variables” if required;

5. Click “OK” to proceed.

A user-defined rule that can be applied to a single variable or a combination of variables in a group of cases is a Multi-Case Rule.
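The idea behind the steps above, grouping cases that match on the chosen variables and marking one case per group as primary, can be sketched in Python as below. This is not the PASW implementation; the variable names HHID and LINE and all values are hypothetical and serve only to illustrate the grouping.

```python
from collections import defaultdict

# Hypothetical cases; HHID and LINE are made-up identifier variables.
rows = [
    {"HHID": 1, "LINE": 1, "HV104": 1, "HV105": 35},
    {"HHID": 1, "LINE": 2, "HV104": 2, "HV105": 30},
    {"HHID": 1, "LINE": 2, "HV104": 2, "HV105": 30},  # repeat of the previous case
    {"HHID": 2, "LINE": 1, "HV104": 1, "HV105": 41},
    {"HHID": 2, "LINE": 1, "HV104": 1, "HV105": 41},  # repeat of the previous case
]


def flag_duplicates(rows, keys, primary="last"):
    """Label each case 'primary' or 'duplicate' within its matching group."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row[key] for key in keys)].append(index)

    labels = [None] * len(rows)
    for members in groups.values():
        primary_index = members[-1] if primary == "last" else members[0]
        for index in members:
            labels[index] = "primary" if index == primary_index else "duplicate"
    return labels


labels = flag_duplicates(rows, keys=["HHID", "LINE", "HV104", "HV105"], primary="last")
print(labels)
print(labels.count("duplicate"), "duplicate case(s) found")
```

A case with no match forms a group of one and is labelled primary, so only the extra copies are counted as duplicates.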

With the above options set, the result of checking for duplicate cases is displayed in the following frequency table:

The above frequency table shows that there are 6 duplicates among the 1,889 cases. These may all be copies of one case (a single primary case that, together with its 6 duplicates, forms a group of 7 cases identical in all variables), or there may be 6 pairs of duplicates (6 primary cases with one duplicate each). It is necessary to review the dataset to understand the nature of the duplicates and decide how to deal with them. The following exhibit shows the groups of duplicates displayed at the top of the dataset.

After the validation checks, the dataset should be edited as and where necessary. After data validation and preparation, the next step is analyzing the “clean data” using the appropriate PASW procedures under the “Analyze” menu.

[Figure: duplicate cases moved to the top of the dataset, shown as six Primary/Duplicate pairs; the values of all variables are the same in both cases of each pair]

5. TIPS AND EXERCISES

5.1 Tips: Do and Don’t

i) Do… request documents such as project proposals, questionnaire sets, codebooks, fieldwork documents, and survey reports when approaching agencies/departments for survey data;

Don’t… judge usefulness on the spot, and do not leave behind any survey documents or datasets that are available in survey agencies/departments.

ii) Do… understand, check and edit the metadata (a set of data that describes and gives information about other data) before using a secondary dataset;

Don’t… leave any variable without a proper definition: variable label, value labels, missing values and measurement level (scale, ordinal or nominal).

iii) Do… save the dataset under an appropriate filename whenever changes have been made, and record properly what was changed from the earlier version;

Don’t… save the current dataset under the original filename after making changes; never replace the original data file with an edited one.

iv) Do… copy variable properties whenever available;

Don’t… leave them as they are after copying (check and edit as necessary).

v) Do… define and use variable sets for ease of analysis, and create new subset datasets by selecting variables as well as cases;

Don’t… change variable type and measurement level without sound understanding.

vi) Do… recode string variables into numeric codes using “Automatic Recode”, and use “Visual Binning” for continuous variables (or numeric variables with many distinct values) to reduce the number of categories;

Don’t… recode into the same variable, since it is irreversible (the original variable can easily be deleted when it is no longer needed).

vii) Do… validate data through single-variable and cross-variable rules and check for duplicate cases before conducting any analysis;

Don’t… change the values in the dataset based on imagination or self-imposed assumptions. Always contact the primary data source for corrections, or omit those cases if there are not many.


5.2 Self-evaluation

Do you understand how to set variable properties in PASW Statistics?
Very well / Somewhat well / Not so much / Almost None

Are you confident that you can do the following in an active dataset?

o Compute a new variable:
Confident / Somewhat confident / Not so much / Not at all

o Recode into a different variable:
Confident / Somewhat confident / Not so much / Not at all

o Select cases with girls under 15:
Confident / Somewhat confident / Not so much / Not at all

o Sort cases by wealth index factor score and highest education attained:
Confident / Somewhat confident / Not so much / Not at all

o Check erroneous values in a variable (validate with single/cross variable rule):
Confident / Somewhat confident / Not so much / Not at all

o Check existence of duplicate cases in the dataset:
Confident / Somewhat confident / Not so much / Not at all

Do you understand visual binning?
Very well / Somewhat well / Not so much / Almost None

5.3 Hands-on Exercises

1) Import the attached “data1(tab).dat” and define all variables appropriately.

2) From the dataset obtained from Exercise 1 above, recode all string variables.

3) Create single-variable rules to check the validity of three education related variables.

4) Create two multi-variable rules to check the validity of (i) current schooling status of

household members, and (ii) education in single year of household members.

5) Find duplicate cases from the current dataset and propose how to handle those cases.


Module B4:

Basic Data Analysis Techniques in PASW Statistics

Contents:

1. Reports
1.1 Codebook
1.2 Case Summaries: Listing Selected Cases
1.3 OLAP Cubes (Online Analytical Processing Cubes)

2. Descriptive Statistics
2.1 Frequencies
2.2 Descriptive
2.3 Explore
2.4 Crosstabs
2.5 Ratio Statistics

3. Tips and Exercises
3.1 Tips: Do and Don’t
3.2 Self-evaluation
3.3 Hands-on Exercises

4. Annexe: Web Links for Further Study on SPSS/PASW Statistics

Purpose and learning outcomes:

To introduce basic data analysis techniques in PASW

To understand how to operate PASW to get the required outputs (tables and charts)

To know how to interpret PASW output


1. REPORTS

The first command group under the ANALYZE menu is REPORTS. The REPORT procedures can provide all univariate statistics available in the DESCRIPTIVES procedure and the subpopulation means available in the MEANS procedure. In addition, some statistics available in the report procedures, such as computations involving aggregated statistics, are not directly accessible in any other procedure.

By default REPORT provides a complete report format, but a variety of table elements can be customized, including column widths, titles, footnotes, and spacing. Because it is flexible and the output has so many components, it is often efficient to preview report output using a small number of cases until the format that best suits the needs is found, especially when listing individual cases.

The REPORT command group comprises Codebook, OLAP Cubes, and Summarize, the last containing “Case Summaries”, “Report Summaries in Rows” and “Report Summaries in Columns”.

Codebook

This procedure reports the dictionary information and summary statistics for all or specified

variables and multiple response sets in the active dataset.

Summarize procedure (or Case Summaries)

“Case summaries” produces subgroup statistics for variables within categories of one or

more grouping variables. All levels of grouping variable are cross-tabulated. Summary

statistics for each variable across all categories are also displayed. The order in which the

statistics are displayed can be chosen. Moreover, data values in each category can be listed

or suppressed. With large datasets, either all cases or only the first n cases can be listed.

Report Summaries in Rows

It produces reports in which different summary statistics are laid out in rows. Case listings

are also available, with or without summary statistics; and

Report Summaries in Columns

Produces summary reports in which different summary statistics appear in separate

columns.

OLAP Cubes (Online Analytical Processing Cubes)

It calculates totals, means, and other univariate statistics for continuous summary variables

within categories of one or more categorical grouping variables. A separate layer in the

table is created for each category of each grouping variable.

Procedures in the REPORT command group can provide all univariate statistics available in other procedures. In addition, computations involving aggregated statistics are directly accessible only in the REPORT procedures.

Of these, Codebook and OLAP Cubes are among the most essential procedures for education data analysts.


1.1 Codebook

Summary statistics produced by Codebook for the nominal and ordinal variables, and multiple

response sets include counts and percents. For scale variables, summary statistics include mean,

standard deviation, and quartiles. As such, codebook is very useful for preliminary analysis.

To obtain a codebook of the current dataset:

1. Click “Analyze” on main menu bar;

2. Click “Reports”; and

3. Click again “Codebook”, and a new window will appear with complete variable list.

4. Select and send the variables to “Codebook Variables” pane;

Here, just three variables with different measurement scales are chosen. And,

5. Click “OK” to proceed with the default settings for Output and Statistics.

Codebook reports such dictionary information as variable names, variable labels, value labels, and missing values. It also provides summary statistics for all or specified variables and multiple response sets in the active dataset.


The output table obtained by the above procedure for the first variable is:

Since the first selected variable, “HV009 – Number of household members”, is a scale variable, the statistics produced for it are the mean, standard deviation and three quartile values. For the other variables, “HV024 – Division” (nominal) and “HV025 – Type of place of residence” (ordinal), only the count and percentage of each valid value (category) are provided.


In the Codebook procedure, the measurement level of a variable can be changed temporarily by pointing at the variable and clicking the right mouse button. The following exhibit shows changing the measurement level of “HV270 – Wealth index” from “ordinal” to “scale”. Keep in mind that this change is temporary and applies only within the Codebook procedure.

The following are the options available in the Codebook command at its default settings.

[Figure: Codebook dialog; select the variable, click the right mouse button, and change “Ordinal” to “Scale”]

The following output table is the codebook for “HV009 – Number of household members” after: (i) temporarily changing the measurement level to “Ordinal”; (ii) selecting “Measurement level” and “Weight status”; and (iii) displaying only “Percent” in the statistics options.

[Figure: Codebook output for HV009. Annotations: (i) the individual values are displayed because the measurement level was temporarily changed to “Ordinal”; (ii) file information as set in (ii), where the real measurement level “Scale” is displayed; (iii) only “Percent” is displayed, as set in (iii)]

1.2 Case Summaries: Listing Selected Cases

Case Summaries under Reports is useful for filtering and listing cases with specified characteristics, for example, to list “20 out-of-school children aged 6-14 with the lowest socio-economic status in the sample households” with their age, sex, highest education level, etc.

Note that before running Case Summaries, the dataset must be (A) limited to the household members aged 6-14 who are out of school (use “Select Cases”), and (B) sorted in ascending order of “Wealth index factor score” (use “Sort Cases”).

The following exhibits briefly explain the preparatory steps before executing “Case Summaries”.

After the preparatory tasks, the “PASW Statistics Data Editor” shows the “Out_of_School_6_to_14” dataset with the selected cases sorted in ascending order of “HV271 – Wealth index factor score”. The original sample dataset contains 53,413 cases altogether, while the filtered dataset contains only the 974 cases of out-of-school children aged 6-14.

Occasionally, a listing of selected cases with a limited number of variables is required for validity (error) checking, reporting, printing and presentation purposes. Case Summaries can help in such tasks.

[Figure: preparatory steps. A – “Select Cases”, with the condition for selecting ages 6-14 only, the condition for selecting only “out-of-school”, and the new dataset for the selected cases only. B – “Sort Cases”]

After completing data preparation work, follow the steps to execute “Case Summaries” command:

1. Click “Analyze” on main menu bar;

2. Click “Reports”; and

3. Click again on “Case Summaries”. Then, a new window will appear with the

complete list of variables in the current dataset.

4. In “Case Summaries” window, select the variables in desired sequence;

5. Set number of cases to display in the “Limit cases to first”, for example, 20 ;

6. Click “OK” button to create a case summary report.
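The whole workflow (select cases, sort cases, then list the first n) can be summarised in a short Python sketch. The records are made up, and the `in_school` flag is a hypothetical stand-in for whatever variable identifies school attendance in the real dataset; HV105 (age) and HV271 (wealth index factor score) are the variables named in the text.

```python
# Hypothetical household members; "in_school" is an assumed attendance flag.
people = [
    {"case": 1, "HV105": 12, "in_school": False, "HV271": -95883},
    {"case": 2, "HV105": 5, "in_school": False, "HV271": -99000},   # too young
    {"case": 3, "HV105": 14, "in_school": False, "HV271": -102597},
    {"case": 4, "HV105": 10, "in_school": True, "HV271": -97000},   # attends school
    {"case": 5, "HV105": 9, "in_school": False, "HV271": -95184},
    {"case": 6, "HV105": 15, "in_school": False, "HV271": -98000},  # too old
]


def list_poorest_out_of_school(people, limit):
    """Select ages 6-14 out of school, sort by wealth score, keep the first cases."""
    selected = [p for p in people if 6 <= p["HV105"] <= 14 and not p["in_school"]]
    selected.sort(key=lambda p: p["HV271"])  # ascending: poorest first
    return selected[:limit]                  # counterpart of "Limit cases to first"


listing = list_poorest_out_of_school(people, limit=2)
for person in listing:
    print(person["case"], person["HV105"], person["HV271"])
```

Sorting before limiting matters: limiting an unsorted selection would list the first cases in file order rather than the poorest ones.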

[Figure: “Case Summaries” dialog (steps 1 to 6), with the selected variables (HV105), (HV104), (HV101), (HV109) and (HV025)]

The output table of the above procedure is as follows:

The following table was copied from the PASW Statistics Viewer and pasted directly into MS Word. Then a few minor touches to the output layout were applied in MS Word.

Listing of Out-of-school Children from Poorest Households (a)

Sex      No.   Case Number   Division     Age of household members   Relationship to head   Educational attainment   Wealth index factor score (5 decimals)
Male       1             1   Sylhet                             14   Son/daughter           Incomplete primary                                      -102597
Male       2             4   Dhaka                              12   Son/daughter           Incomplete primary                                       -97182
Male       3             8   Sylhet                             12   Son/daughter           Incomplete primary                                       -95883
Male       4             9   Sylhet                             12   Other relative         Complete primary                                         -95883
Male       5            10   Rajshahi                           11   Son/daughter           Incomplete primary                                       -95868
Male       6            11   Dhaka                              12   Son/daughter           Incomplete primary                                       -95747
Male       7            12   Barisal                             9   Son/daughter           Incomplete primary                                       -95184
Male       8            14   Rajshahi                           10   Son/daughter           Incomplete primary                                       -94601
Male       9            15   Barisal                            12   Son/daughter           Incomplete primary                                       -94539
Male      10            17   Rajshahi                           12   Son/daughter           Incomplete secondary                                     -94185
Male      11            19   Rajshahi                           13   Son/daughter           Incomplete secondary                                     -93875
Male      12            20   Rajshahi                           14   Son/daughter           Incomplete primary                                       -93649
Male    Mean                                                 11.92                                                                                -95766.08
Female     1             2   Dhaka                              10   Son/daughter           Incomplete primary                                       -97793
Female     2             3   Barisal                             8   Son/daughter           Incomplete primary                                       -97330
Female     3             5   Dhaka                              10   Son/daughter           Incomplete primary                                       -97182
Female     4             6   Barisal                            13   Son/daughter           Incomplete primary                                       -96696
Female     5             7   Chittagong                         14   Grandchild             Incomplete primary                                       -96592
Female     6            13   Dhaka                              11   Son/daughter           Complete primary                                         -95028
Female     7            16   Dhaka                              12   Son/daughter           Incomplete primary                                       -94331
Female     8            18   Dhaka                              10   Son/daughter           Incomplete primary                                       -93976
Female  Mean                                                 11.00                                                                                -96116.00
Total   Mean                                                 11.55                                                                                -95906.05

a. Limited to first 20 cases.


The next table shows the same list of 20 out-of-school children, but by “Division”.

“Report Summaries in Rows” produces reports in which different summary statistics are laid out

in rows. Case listings are also available, with or without summary statistics. Similarly, “Report

Summaries in Columns” can provide summary reports, in which different summary statistics

appear in separate columns.

The outputs of both commands are in text format and cannot be manipulated with pivot table techniques. Moreover, all such outputs can also be created with the “Case Summaries” command described before.

The following table shows the summary statistics obtained from the “Case Summaries” command without displaying individual cases. The variable selected for the summary statistics is “number of years effectively studied by a household member (HV108 – Education in single year)”. The report provides the following statistics:

(i) number of cases;

(ii) mean years of study (average of HV108);

(iii) standard error of the mean; and

(iv) median years of study, by:

a. sex,

b. residence, and

c. division, without listing individual cases.
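For readers who want to see how the four statistics listed above are computed, here is a minimal Python sketch for a single grouping variable (sex). The values are made up, and the standard error is taken as the sample standard deviation divided by the square root of the number of cases, which is the usual definition and is assumed here to match the PASW output.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical (sex, HV108 years of education) pairs.
rows = [
    ("Male", 2), ("Female", 4), ("Male", 4), ("Female", 5),
    ("Male", 3), ("Male", 3), ("Female", 3),
]


def summarize(values):
    """N, mean, standard error of the mean, and median for one group."""
    n = len(values)
    se_mean = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return {
        "N": n,
        "mean": statistics.mean(values),
        "se_mean": se_mean,
        "median": statistics.median(values),
    }


groups = defaultdict(list)
for sex, years in rows:
    groups[sex].append(years)

summary = {sex: summarize(values) for sex, values in groups.items()}
summary["Total"] = summarize([years for _, years in rows])

for name, stats in summary.items():
    print(name, stats)
```

Adding residence and division as further grouping keys would reproduce the layout of the table, one row per combination.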

Case Summaries: Education in single years, by sex of household member

                           |            Male               |           Female              |            Total
Residence     Division     |   N  Mean  Std.Err.  Median   |   N  Mean  Std.Err.  Median   |   N  Mean  Std.Err.  Median
Urban         Barisal      |  22  3.36     0.387    3.00   |  17  3.71     0.605    4.00   |  39  3.51     0.338    3.00
Urban         Chittagong   |  35  3.11     0.366    3.00   |  35  3.83     0.372    4.00   |  70  3.47     0.263    3.00
Urban         Dhaka        |  54  3.22     0.314    4.00   |  60  3.23     0.340    3.00   | 114  3.23     0.231    3.00
Urban         Khulna       |  22  2.95     0.419    3.50   |  16  4.19     0.467    5.00   |  38  3.47     0.324    4.00
Urban         Rajshahi     |  27  3.26     0.448    4.00   |  20  2.75     0.497    2.50   |  47  3.04     0.332    3.00
Urban         Sylhet       |  41  2.95     0.356    3.00   |  19  3.42     0.509    3.00   |  60  3.10     0.291    3.00
Urban         Total        | 201  3.14     0.153    3.00   | 167  3.46     0.184    3.00   | 368  3.29     0.118    3.00
Rural         Barisal      |  54  2.98     0.296    3.00   |  30  3.67     0.344    4.00   |  84  3.23     0.228    4.00
Rural         Chittagong   |  63  2.90     0.230    3.00   |  45  4.02     0.352    4.00   | 108  3.37     0.205    3.50
Rural         Dhaka        |  71  2.73     0.262    2.00   |  58  3.72     0.268    4.00   | 129  3.18     0.192    3.00
Rural         Khulna       |  33  3.48     0.289    4.00   |  15  4.00     0.569    3.00   |  48  3.65     0.265    3.50
Rural         Rajshahi     |  54  3.09     0.280    3.50   |  32  3.41     0.401    3.00   |  86  3.21     0.230    3.00
Rural         Sylhet       |  99  3.24     0.208    3.00   |  52  3.62     0.350    4.00   | 151  3.37     0.182    3.00
Rural         Total        | 374  3.05     0.105    3.00   | 232  3.72     0.146    4.00   | 606  3.31     0.087    3.00
Urban+Rural   Barisal      |  76  3.09     0.238    3.00   |  47  3.68     0.306    4.00   | 123  3.32     0.189    4.00
Urban+Rural   Chittagong   |  98  2.98     0.197    3.00   |  80  3.94     0.255    4.00   | 178  3.41     0.161    3.00
Urban+Rural   Dhaka        | 125  2.94     0.201    3.00   | 118  3.47     0.218    3.50   | 243  3.20     0.149    3.00
Urban+Rural   Khulna       |  55  3.27     0.241    4.00   |  31  4.10     0.360    4.00   |  86  3.57     0.205    4.00
Urban+Rural   Rajshahi     |  81  3.15     0.238    4.00   |  52  3.15     0.312    3.00   | 133  3.15     0.189    3.00
Urban+Rural   Sylhet       | 140  3.16     0.180    3.00   |  71  3.56     0.288    3.00   | 211  3.29     0.154    3.00
Urban+Rural   Total        | 575  3.08     0.087    3.00   | 399  3.61     0.115    4.00   | 974  3.30     0.070    3.00

Std.Err. = Standard error of mean.


1.3 OLAP Cubes (Online Analytical Processing Cubes)

The OLAP Cubes procedure creates a separate layer in the table for each category of every grouping variable. The summary variables are quantitative (continuous variables measured on an interval or ratio scale), and the grouping variables are categorical. The values of categorical variables can be numeric or string.

OLAP Cubes provides a wide variety of summary statistics such as: sum, number of cases, mean,

median, grouped median, standard error of the mean, minimum, maximum, range, variable value of the

first category of the grouping variable, variable value of the last category of the grouping variable,

standard deviation, variance, kurtosis, standard error of kurtosis, skewness, standard error of skewness,

percentage of total cases, percentage of total sum, percentage of total cases within grouping variables,

percentage of total sum within grouping variables, geometric mean, and harmonic mean.

Some of the optional subgroup statistics, such as the mean and standard deviation, are based on normal

theory and are appropriate for quantitative variables with symmetric distributions. OLAP cube uses

the pivot table techniques, but with specific statistics and output options which cannot be obtained

from other procedures such as cross-tabulation.

Example: OLAP Cubes

Among the variables in the sample dataset, only “HV108 – Education in single year” is the

education related continuous (interval or ratio scale) variable. Since the continuous variable(s) must

be selected as “Summary” variable, HV108 is selected in this example. Thus, the following exhibits

demonstrate how “OLAP Cubes” is useful in exploring the “average number of study years by the

adult household members” by four grouping variables: Sex; Age Group; Residence and Division.

Before using “OLAP Cubes” procedure, only the adult household members (aged 15 and above)

must be selected using “Case Selection”.

The OLAP Cubes procedure can produce a variety of summary statistics for summary variables within categories of one or more grouping variables.

[Figure: preparing the dataset for analyzing adults only, and the “OLAP Cubes” dialog (steps 1 to 8) with summary variable (HV108) and grouping variables (HV105), (GAge), (HV025) and (HV024)]

After selecting only adults:

1. Click “Analyze” on main menu bar;

2. Click “Reports”; and

3. Click “OLAP Cubes”, and a new window will appear with complete variable list.

4. In “OLAP Cubes” window, select Summary and Grouping variables as planned;

5. Click “Statistics” to set the desired summary statistics:

a. By default, the six summary statistics are selected (can leave it as it is);

b. Users can double-click any unselected statistics to be selected and vice versa;

c. Click “Continue” when the selection of summary statistics is complete;

6. Click “Differences” button to compute absolute or percentage differences for all

measures selected in the Statistics dialog box. This step is optional.


The "Differences" dialog box allows calculating percentage and absolute differences:

Differences between Variables calculates differences between pairs of

variables. At least two summary variables must be selected before specifying

differences between variables.

Differences between Groups of Cases calculates differences between pairs of

groups defined by a grouping variable. One or more grouping variables must be

selected in the main dialog box before specifying differences between groups.

The differences are calculated between summary statistics values by

subtracting the value of the “minus” variable/category from the values of the

first in the pair. Percentage differences use the value of the summary statistic of

the second (the Minus) as the denominator.
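The arithmetic described above can be written out as a small Python sketch. The formulas follow the wording of the text (first value minus the “minus” value, with the “minus” value as the denominator for percentages). The two means used in the example are borrowed from the Case Summaries table earlier in this module purely for illustration; they are not OLAP Cubes output.

```python
def absolute_difference(first, minus):
    """Value of the first item in the pair minus the value of the 'minus' item."""
    return first - minus


def percentage_difference(first, minus):
    """Absolute difference expressed as a percentage of the 'minus' value."""
    return (first - minus) / minus * 100


# Illustrative values: mean years of education, Female (first) vs Male (minus).
female_mean = 3.61
male_mean = 3.08

print(round(absolute_difference(female_mean, male_mean), 2))
print(round(percentage_difference(female_mean, male_mean), 1))
```

Swapping the pair changes both the sign and the denominator, so the percentage difference of Male minus Female is not simply the negative of the value above.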

7. Click the “Title” button to create custom table titles. This step is optional.

A title for the output table or a caption (added below the table) can be entered in this step. If the title or caption extends over more than one line, insert \n for wrapping (a line break in the text). Enter the appropriate title and caption, and click the “Continue” button when completed.

8. Click the “OK” button in the “OLAP Cubes” window to create the table with the options set.

When the OLAP Cube has been created, the following output is placed in the Output Viewer. The output provides three summary statistics: number of cases (N), mean, and standard error of the mean for “HV108 – Education in single year” for the entire sample (valid cases).

Although this table seems simple and unattractive, any category of the “Grouping variables” can be selected, as in pivot tables. To do this, double-click the table in the Output Viewer, then click the dropdown icon and select any category in the list. The following exhibit shows the statistics for “Males aged 15-29 who lived in urban areas”.

[Exhibit: OLAP Cube output with the title, caption, statistics and layer marked; the user can change categories from the default “Total” to any item in the dropdown list]

Again, the output table can be pivoted into a more attractive layout, as follows:

[Exhibit: pivoted OLAP Cube output table]

2. DESCRIPTIVE STATISTICS

The most frequently used procedures in PASW Statistics are the Descriptive Statistics. From making an initial analysis and checking validity to extracting education data and constructing indicators from a household survey, “Descriptive Statistics” are essential. Although “Report” can provide similar statistics, “Descriptive Statistics” are user-friendly and provide a wider variety of charts.

2.1 Frequencies

The “Frequencies” procedure can produce such statistics as: frequencies (counts), percentages, cumulative percentages, mean, median, mode, sum, standard deviation, variance, range, minimum and maximum values, standard error of the mean, skewness and kurtosis (both with standard errors), quartiles and percentiles. Moreover, it can produce bar charts, pie charts, and histograms.

For better display in the output table and charts, distinct values can be arranged in ascending or descending order of category labels or of their counts. The frequencies report can be suppressed when a variable has many distinct values. Charts produced by this command can be labeled with frequencies (default) or percentages. To produce a simple frequency table:

1. Click “Analyze” on main menu bar;

2. Click “Descriptive Statistics”;

3. Click “Frequencies”, and a new window will appear with the complete variable list;

4. Select (categorical) variables to produce frequency tables (each variable will have a table);

5. Click “Format” button, and set the output formats:

a. how to order categories in the frequency table (ascending or descending order of values or counts);

b. how to organize the outputs if more than one variable is selected; and

c. whether to suppress the tables of variables with many categories (and the maximum number of categories);

6. Click “OK” button to start creating the frequency tables with selected charts and format.

Frequencies is the procedure to start analyzing a dataset. It provides statistics and graphical displays that are useful for describing all different types of variables.
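For readers who want to see what a frequency table contains, the following is a minimal Python sketch of the counts, percentages and cumulative percentages that the steps above produce. The function name `frequency_table`, its `max_categories` parameter and the sample codes are hypothetical; PASW's own output has more columns (for example, valid percent).

```python
from collections import Counter


def frequency_table(values, max_categories=15):
    """Return rows of (value, count, percent, cumulative percent), sorted by value.

    Returns None when the number of distinct values exceeds max_categories,
    mimicking the option to suppress tables with many categories.
    """
    counts = Counter(values)
    if len(counts) > max_categories:
        return None
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for value in sorted(counts):          # ascending order of values
        cumulative += counts[value]
        rows.append((value,
                     counts[value],
                     100.0 * counts[value] / total,
                     100.0 * cumulative / total))
    return rows


# Hypothetical codes for sex of household member: 1 = male, 2 = female
sex = [1, 2, 2, 1, 2, 2, 1, 2]
table = frequency_table(sex)
for row in table:
    print(row)

# A variable with about 100 distinct values (such as age) is suppressed
ages = list(range(100))
suppressed = frequency_table(ages)
```

The second call returns no table because the 100 distinct values exceed the maximum of 15 categories.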

[Exhibit: “Frequencies” dialog boxes with Steps 1 to 6 and the “Format” options (a) to (c) of Step 5; selected variables: HV104, HV109 and HV105]

The steps shown in the above exhibit produce the following outputs.

It should be noted that only two frequency tables are generated although three variables are selected. This is because PASW Statistics suppressed the frequency table of “HV105 – Age of household member”, since its number of categories (roughly 100) exceeds the set maximum of 15.

Generally, there are two key purposes in using frequencies: (a) to get a frequency table of categorical variables with a limited number of different values, for example, sex, educational attainment, age group, etc.; and (b) to get summary statistics of continuous variables without a frequency table (i.e., for variables on interval or ratio scales whose values differ widely).

Moreover, bar charts, pie charts and histograms can be created automatically for categorical variables with a limited number of different values by clicking the “Charts” button. Then select the chart type and options after the current Step 5. In the above example, a “Pie chart” is appropriate to review the gender composition (HV104) of the sample population, while a “Bar chart” should be used for the education levels (HV109). Therefore, those two variables cannot be charted together in the same run.


Similarly, one can choose types of statistics to be displayed by clicking “Statistics” button after

selecting charts. The following exhibits show the outputs for the variable “HV105 – Age of

household member” without frequency table by age.


2.2 Descriptives

Because the “Descriptives” procedure does not sort values into a frequency table, it is an efficient means of computing summary statistics for continuous variables. Almost all statistics provided in DESCRIPTIVES can also be obtained from other procedures such as FREQUENCIES, MEANS, and EXAMINE.

Although “Frequencies” can also provide univariate statistics, “Descriptives” displays summary statistics for several variables in a single table. It can also calculate and save standardized values (Z-scores). Variables can be ordered by the size of their means (ascending or descending), alphabetically, or by the order in which the user selects the variables (default).

When Z-scores are saved, they are added to the current dataset and are available for analyses and listings. When variables are recorded in different units (e.g., “household members” and “education in single years”), the Z-score transformation places the variables on a common scale for easier visual comparison. Moreover, “Descriptives” is efficient for large files with tens of thousands of cases.
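The Z-score transformation mentioned above can be illustrated with a short Python sketch. It assumes the usual definition, (value minus mean) divided by the sample standard deviation with an n - 1 denominator; that PASW uses exactly this convention is an assumption here, and the two small data lists are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [(v - m) / s for v in values]


# Hypothetical values recorded in different units
household_members = [2, 4, 6, 8, 10]     # persons
education_years = [0, 5, 10, 15, 20]     # years

z_members = z_scores(household_members)
z_education = z_scores(education_years)
print([round(z, 3) for z in z_members])
print([round(z, 3) for z in z_education])
```

Although the two variables are measured in different units, their standardized values coincide in this example because both lists are evenly spaced around their means, which is what "a common scale" means in practice.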

To use Descriptives:

1. Click “Analyze” on main menu bar;

2. Click “Descriptive Statistics”; and

3. Click again “Descriptives”, and a new window will appear with complete variable list;

4. Select continuous (interval or ratio scale) or dichotomous (just 0 and 1) variables;

5. Click “Options”, (i) select the preferred statistics from the lists, (ii) define the order of the

variables to be displayed in the output table, and (iii) click “Continue”;

6. Optionally, tick “Save standardized values as variable” to save the Z-score (or standardized

values) of the selected variable(s) in the current dataset; and

7. Click “OK” button to start calculating summary descriptive statistics.

Descriptives computes univariate statistics, such as mean, standard deviation, minimum, and maximum, for numeric variables and displays them in a single table for better comparison.

Note: Two scale variables: “HV105 – Age of household members”

and “HV108 – Education in single years”, and one dichotomous

nominal variable: “HV110 – Member still in school” are used in this example.

[Exhibit: “Descriptives” dialog boxes with Steps 1 to 7 and the “Options” settings (i) to (iii) of Step 5; selected variables: HV105, HV108 and HV110]

In calculating descriptive statistics (and in most statistical analyses), it is important to check and edit the variables under study so that only valid values enter the analysis. For example, in the variable “HV108 – Education in single years”, code 97 is used for “Inconsistent values”, code 98 represents “DK or Do not know”, and code 99 is “missing values”. In this case, 97, 98 and 99 are not valid years of study and should not be included in the analyses; therefore, define all of those codes as “missing values” so that they are excluded from the computed statistics (see Module B2 to edit missing values).
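The effect of leaving such codes in the data can be seen in a small Python sketch. The codes 97, 98 and 99 come from the text above; the data values and the helper name `valid_values` are hypothetical.

```python
from statistics import mean

# Codes that are not valid years of study (per the text above)
MISSING_CODES = {97, 98, 99}


def valid_values(values, missing_codes=MISSING_CODES):
    """Keep only the values that are not declared as missing codes."""
    return [v for v in values if v not in missing_codes]


# Hypothetical "education in single years" values, including special codes
education = [0, 5, 8, 10, 12, 97, 98, 99]

mean_with_codes = mean(education)              # codes wrongly treated as years
mean_valid = mean(valid_values(education))     # codes excluded as missing
print(mean_with_codes, mean_valid)
```

In this tiny sample the three special codes inflate the mean from 7 to about 41 years, which shows why the distortion is largest when the number of cases is small.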

Two similar “Descriptive Statistics” tables are presented in the above example: (i) constructed with the default missing values, that is, treating the codes 97 and 98 as valid; and (ii) constructed after setting 97 and 98 as missing. Since the number of cases is large, the differences in the summary statistics are minimal. However, if the same calculation is conducted for a subset with a limited number of cases, the differences could be significant. In the above output table, the mean value 0.61 of the variable “member still in school” can be interpreted as “61% of 20,540 persons are still in school”.

The following example presents all available statistics (set in the options) in “Descriptives”.


It should be noted that the descriptive statistics calculated for the variable “HV024 – Division” are useless in any analysis. “HV024” is just a nominal variable with codes 1 to 6, representing the 6 divisions of Bangladesh, and its mean value of 3.48 has no meaning.

One of the significant features of “Descriptives” is its ability to save standardized values (Z-scores) of the selected variables for use in further analyses. To add the Z-scores of a variable to the current dataset, just tick the checkbox next to “Save standardized values as variables”. PASW will then add new variables whose names are formed by prefixing the letter Z to the original variable name; for example, the new variable for the Z-score of “HV009” is simply “ZHV009”.


2.3 Explore

Data screening aims to detect unusual values, extreme values, data gaps, or other peculiarities. By exploring data, users can determine whether the statistical techniques under consideration for further analyses are appropriate or not. It may help in deciding whether to transform the data (in case the technique requires a normal distribution) or to use nonparametric tests.
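One common way to screen for the unusual or extreme values mentioned above is the boxplot rule, which flags values lying more than 1.5 interquartile ranges beyond the quartiles. The Python sketch below illustrates the idea under stated assumptions: the quartiles come from the standard library's "inclusive" method, which may differ slightly from the hinges PASW uses in its boxplots, and the data and function names are hypothetical.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the lower and upper fences: quartiles -/+ k times the interquartile range."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical "education in single years" values with one implausible entry
education = [0, 2, 4, 5, 5, 6, 8, 10, 12, 40]

fences = iqr_fences(education)
outliers = flag_outliers(education)
print(fences, outliers)
```

A flagged value is only a candidate for checking against the questionnaire or codebook; it is not automatically an error.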

Dependent variables, or variables to be explored [List (a) in the following chart], should be quantitative (interval or ratio-level measurements). Factor variables [List (b)], with short string or numeric values, break the dependent variables into groups of cases. The factor variables should have a reasonable number of distinct values, generally not more than 10 categories. The case label variable [List (c): only one variable allowed], used to label outliers in boxplots, can be short string, long string (only the first 15 bytes are used), or numeric. To analyze with Explore:

1. Click “Analyze” on main menu bar;

2. Click “Descriptive Statistics”; and

3. Click again “Explore”. A new “Explore” window will appear with complete variable list;

4. Select continuous (interval or ratio scale) variables to produce univariate statistics;

Because “Explore” produces voluminous outputs, just one variable, “HV108 – Education in single years”, with simple (mostly default) settings is used in the following example.

Explore produces summary statistics and graphical display, either for all cases or separately for groups of cases. It is particularly useful in data screening, outlier identification, description, assumption checking, and characterizing differences among subpopulations (groups of cases).

[Exhibit: “Explore” dialog boxes with Steps 1 to 9 and variable lists (a), (b) and (c); HV108 selected in Step 4; mostly default settings in Steps 5 to 7]

5. Click “Statistics”, set the preferred statistics from the lists, and click “Continue”;

6. Click “Plots”, set the preferred types of plots from the lists, and click “Continue”;

7. Click “Options”, set how to handle the missing values, and click “Continue”;

8. Select “Display” option (only statistics or plots, or both) on “Explore” window; and

9. Click “OK” button to start “Explore”, and the following outputs will be displayed.


By selecting all statistics and available charts, exploring “HV108 – Education in single year”

factored by “HV024 – Division” produced altogether 33 tables and charts as in the following output

(starting from “Case Processing Summary” to “Spread-versus-Level Plot”):


2.4 Crosstabs

“Frequencies” and “Explore” are efficient for univariate statistics, but those procedures cannot provide information on the relationship between categorical variables. For example, frequencies can provide the “number of household heads by education level”, the “number of household heads by sex” or the “number of households by economic status (wealth index)”, but cannot provide the “number of female headed households in the poorest category” or even answer a simple question such as the “percentage of female headed households”.

In Crosstabs, the values of a numeric or short string variable define the categories of each variable. For example, codes “1 and 2”, “male and female” or “M and F” are valid for the variable “sex”. Ordinal variables can be either numeric codes that represent categories (for example, numeric codes “1 to 5” for the variable “Wealth Index”, representing “1 = poorest, 2 = poorer, 3 = middle, 4 = richer, and 5 = richest”) or string values (for example, “a to e” as “a = richest, b = richer, c = middle, d = poorer, and e = poorest”).

In PASW Statistics, the alphabetic order of string values is assumed to reflect the true order of the categories. Therefore, if a string variable with codes “L, M, H” representing “low, medium and high” is used, the order of the categories in the output will be “H, L, M” and the results might be misinterpreted. In general, it is more reliable to use numeric codes and provide appropriate value labels to represent ordinal data.
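The ordering problem is easy to reproduce outside PASW. The Python sketch below sorts the string codes alphabetically and contrasts that with numeric codes carrying value labels; the label dictionary is a hypothetical stand-in for value labels.

```python
# String codes sort alphabetically, which is not the intended low-to-high order
string_codes = ["L", "M", "H"]
alphabetical = sorted(string_codes)
print(alphabetical)

# Numeric codes with value labels keep the intended order
value_labels = {1: "low", 2: "medium", 3: "high"}   # hypothetical value labels
ordered = [value_labels[code] for code in sorted(value_labels)]
print(ordered)
```

With numeric codes the sort order and the substantive order coincide, so tables and ordinal statistics are read in the intended direction.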

Selection of Variables

For cross-tabulation, at least one variable must be selected for the rows and one for the columns of the output table. Other variables can then be added as layers; these are known as “factor” variables. The variables used in the Crosstabs procedure must be categorical (measured on a nominal or ordinal scale) with a limited number of distinct values (generally, fewer than 10). On the other hand, discrete scale variables can also be used to obtain statistics if their range of values is not too large and the table output is suppressed. The factor variables must be categorical.

Statistics Option

In Crosstabs, statistics and measures of association are computed for two-way tables only. If a table

is formed in multi-ways as “row, column, and layer (control) variables”, the Crosstabs procedure

forms one panel of associated statistics and measures for each value of the layer (or a combination

of values for two or more control variables). For example, if “sex” is a layer factor for a table of

“educational attainment” against “wealth index”, the results for a two-way table for the males and

for the females are computed separately.

Crosstabs is one of the procedures producing a variety of statistics, such as:

Chi-square tests of independence/association: for 2 x 2 tables, select Chi-square to calculate the Pearson chi-square, the likelihood-ratio chi-square, Fisher's exact test, and Yates' corrected chi-square (continuity correction). For tables with any number of rows and columns, select Chi-square to calculate the Pearson chi-square and the likelihood-ratio chi-square.

Spearman's rank correlation coefficient (rho) is calculated if both rows and columns contain ordinal variables (numeric data only). When both row and column variables are quantitative, Pearson's correlation coefficient (r), a measure of linear association, is calculated.

For more explanations on statistics please see "PASW Statistics 17 Base User Guide".

Cells Display Option

By default, Crosstabs displays the “count” or the number of cases actually observed in each cell.

Optionally, number of “expected” cases could be selected to display. Similarly, row, column and

total percentages can be displayed in the cells together with the observed number of cases (count).
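The cell contents described above (counts with row, column and total percentages) can be sketched in Python as follows. This is an illustration only; the function names and the ten hypothetical households are not taken from the survey.

```python
from collections import Counter


def crosstab(rows, cols):
    """Count cases for each (row category, column category) pair."""
    return Counter(zip(rows, cols))


def percentages(counts):
    """Return row, column and total percentages for every observed cell."""
    total = sum(counts.values())
    row_totals, col_totals = Counter(), Counter()
    for (r, c), n in counts.items():
        row_totals[r] += n
        col_totals[c] += n
    return {
        cell: {
            "row_pct": 100.0 * n / row_totals[cell[0]],
            "col_pct": 100.0 * n / col_totals[cell[1]],
            "total_pct": 100.0 * n / total,
        }
        for cell, n in counts.items()
    }


# Hypothetical households: place of residence and sex of household head
residence = ["urban"] * 4 + ["rural"] * 6
head_sex = ["male", "male", "male", "female",
            "male", "male", "male", "male", "female", "female"]

counts = crosstab(residence, head_sex)
pct = percentages(counts)
print(counts[("urban", "female")], pct[("urban", "female")])
```

Row percentages answer "what share of urban households are female headed", while column percentages answer "what share of female headed households are urban"; the two are easily confused when reading a crosstab.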

Crosstabs is useful for investigating the relationship between two or more categorical variables by providing information about the intersection of variables.


To uncover the patterns in data contributing to a Chi-square test, three types of residuals (deviates)

that measure the difference between observed and expected frequencies could be displayed.

Unstandardized: the difference between an observed value and the expected value.

Standardized: the residual divided by an estimate of its standard deviation. Standardized

residuals, also known as Pearson residuals, have a mean of 0 and a standard deviation of 1.

Adjusted standardized: the residual for a cell (observed minus expected value) divided by an

estimate of its standard error.
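The Pearson chi-square and the three residuals can be computed by hand for a small table, as in the Python sketch below. It assumes the textbook formulas: expected count = row total x column total / N, standardized residual = residual / sqrt(expected), and adjusted standardized residual = residual / sqrt(expected x (1 - row share) x (1 - column share)). The function name and the 2 x 2 counts are hypothetical, and all expected counts are assumed to be positive.

```python
from math import sqrt


def chi_square_analysis(observed):
    """Return (chi-square, degrees of freedom, per-cell details) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    cells = []
    for i, row in enumerate(observed):
        detail_row = []
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            residual = obs - expected                     # unstandardized
            standardized = residual / sqrt(expected)      # Pearson residual
            adjusted = residual / sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            chi2 += residual ** 2 / expected
            detail_row.append({
                "expected": expected,
                "residual": residual,
                "standardized": standardized,
                "adjusted": adjusted,
            })
        cells.append(detail_row)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, cells


# Hypothetical 2 x 2 table of counts
observed = [[30, 10],
            [20, 40]]
chi2, dof, cells = chi_square_analysis(observed)
print(round(chi2, 4), dof, cells[0][0])
```

Cells with large residuals in absolute value are the ones that contribute most to the chi-square statistic, which is how the residuals help uncover patterns in the data.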

Non-integer weights Option

Cell counts are normally integer values. But if the dataset is weighted by a variable with fractional values (e.g. 1.25), cell counts can be fractional. In that case, either the case weights can be truncated or rounded before the cell counts are calculated, or the cell counts can be truncated or rounded after they are calculated, or fractional cell counts can be used for both table display and statistical calculations.
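The difference between adjusting the weights before summing and adjusting the cell count after summing can be shown with a short Python sketch. The mode names and weights are hypothetical labels for the options described above, and Python's own rounding rule for exact ties may differ from PASW's, so the example avoids ties.

```python
def cell_count(weights, mode="fractional"):
    """Weighted count of the cases falling in one cell, under different options."""
    if mode == "fractional":
        return sum(weights)                      # keep the fractional count
    if mode == "round_cell":
        return round(sum(weights))               # round after summing
    if mode == "truncate_cell":
        return int(sum(weights))                 # truncate after summing
    if mode == "round_case":
        return sum(round(w) for w in weights)    # round each weight first
    if mode == "truncate_case":
        return sum(int(w) for w in weights)      # truncate each weight first
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical case weights for the four cases in one cell
weights = [1.4, 1.4, 1.4, 1.4]

results = {mode: cell_count(weights, mode)
           for mode in ("fractional", "round_cell", "truncate_cell",
                        "round_case", "truncate_case")}
print(results)
```

The same four cases yield counts between 4 and 6 depending on the option, so the chosen option should be reported alongside weighted tables.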

Using Crosstabs

Follow the steps:

1. Click “Analyze” on main menu bar;

2. Click “Descriptive Statistics”;

3. Click “Crosstabs” and a new “Crosstabs” window will appear with complete variable list;

4. Select categorical variables (or scale variables with limited number of different values) and

send to rows, columns and layers (click “Next” to add another layer). Layer variables can be

organized as: all on the same layer (one set of tables per each layer variable) or on different

layers (just one set of tables with cross-layers cells).

5. Select appropriate statistics to be calculated;

In this example, no statistics are selected, although both row and column variables are ordinal and thus chi-square, correlations, Gamma and Kendall's tau are appropriate to calculate.

6. Select the contents of the cells in the cross-tabulation;

7. Set the row order: ascending or descending;

[Exhibit: “Crosstabs” dialog box with Steps 1 to 10, including Steps 4(a) to 4(c); selected variables: HV026, HV270 and HV219]

8. Set whether to get the clustered bar charts;

9. Set whether to suppress tables (or display the main crosstab table); and

10. Click “OK” to start constructing tables and charts as selected.

In this example, no optional settings are set and just two tables, (1) Case Processing Summary, and

(2) basic cross-tabulation table with simple counts in cells, are produced. In cross-tabulation the

missing values are handled list-wise (across variables), and thus it is important to observe the

“number of valid cases” in the “case processing summary” statistics.

If different cell display options, such as number of observed and expected counts; row, column and

total percentages, and residuals, are selected in the Step 6, the following crosstab table is created

after using pivoting capabilities offered in PASW statistics and a few minor touches.

[Exhibit: Crosstabs example and dialog boxes for Steps 5 to 7; all settings in Step 5 through Step 9 are left at their defaults in this example]

Note: The original output table is huge and difficult to read since all statistics are placed together. It was edited by: (1) shortening a long value label; (2) hiding the variable label of HV026; and (3) moving “Statistics” to “LAYER” in the “Pivoting Trays”.

The following tables present percentage distribution of households within “Place of residence” and

within “Wealth index” by “Sex of household head”, which are extracted from the above pivot table.

[Exhibit: pivoted crosstab tables with the newly selected options from Step 6; click the layer dropdown to select what the cells display]

By selecting both the “Display clustered bar charts” and “Suppress tables” options, the following charts will be produced without any output tables except the “Case Processing Summary”:

2.5 Ratio Statistics

In Ratio Statistics, outputs can be sorted by the values of a grouping variable in ascending or descending order. Grouping variables must be measured at the nominal or ordinal level, and it is better to use numeric codes or short strings. The ratio statistics report can be suppressed in the output, and the results can be saved to an external file.

The procedure provides statistics on: central tendency (median, mean, weighted mean); confidence intervals for the mean and median; measures of dispersion (AAD – average absolute deviation, COD – coefficient of dispersion, PRD – price-related differential or index of regressivity, median-centered coefficient of variation, mean-centered coefficient of variation, standard deviation, range, minimum and maximum values); and the concentration index (the percentage of ratios that fall within a user-specified range or within a user-specified percentage of the median ratio).
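A few of the statistics listed above can be sketched in Python to make their meaning concrete. The formulas are the commonly cited definitions and are an assumption here, not a confirmed reproduction of PASW's calculations: weighted mean = sum of numerators / sum of denominators, AAD = mean absolute deviation of the ratios from their median, COD = AAD as a percentage of the median, and PRD = mean ratio / weighted mean ratio. The function name and data are hypothetical.

```python
from statistics import mean, median


def ratio_statistics(numerators, denominators):
    """Return selected summary statistics of the case-level ratios."""
    ratios = [n / d for n, d in zip(numerators, denominators)]
    med = median(ratios)
    avg = mean(ratios)
    weighted = sum(numerators) / sum(denominators)
    aad = mean([abs(r - med) for r in ratios])   # average absolute deviation
    return {
        "median": med,
        "mean": avg,
        "weighted_mean": weighted,
        "aad": aad,
        "cod": 100.0 * aad / med,                # coefficient of dispersion (%)
        "prd": avg / weighted,                   # price-related differential
    }


# Hypothetical households: children attending school over children aged 6-15
attending = [1, 2, 3, 3]
children = [2, 4, 4, 4]

stats = ratio_statistics(attending, children)
print(stats)
```

The weighted mean corresponds to the aggregate ratio for the whole group, which is usually the figure reported as an enrolment or attendance ratio.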

Practical Example:

In analyzing household survey data for participation in general education, the age-specific enrolment ratios for children aged 6-15 by sex can be calculated using the total number of children aged 6-15 (var1) and the number of those who are currently attending primary schools (var2), with sex (or urban/rural residence, division, etc.) as the grouping variable. Moreover, variation in the distribution of ratios between males and females can also be observed.

However, the dataset has no variable that yields the “number of children at age x” when summed within the grouping variable. Therefore, one variable must be created, say “pop”, with the value 1 for each and every child aged 6-15, using the “Compute” command.

Then define the variable label (“Population aged 6-15”) and format (Display: 5 and Decimal: 0).

Ratio Statistics provides a comprehensive list of summary statistics for describing the ratio between two scale variables with positive values.

After creating the new variables, use Ratio Statistics as follows:

1. Click “Analyze” on main menu bar;

2. Click “Descriptive Statistics”;

3. Click “Ratio”. A new “Ratio Statistics” window will appear with the complete variable list;

4. Select two scale variables for “Numerator” and “Denominator”, and a categorical

(nominal or ordinal) variable for “Group” variable;

5. Set whether to sort group variable in ascending or descending order;

6. Set whether to display results or not (just to save in a new file);

7. Set whether to save results to a new data file for further analyses;

8. Click “Statistics” button and select required statistics in “Statistics” window; and

9. Click “OK” button to start constructing statistics as selected.

[Exhibit: “Compute Variable” dialog boxes: computing the first time without an IF condition, and computing the second time with an IF condition]

Warning:

Caution must be taken in using DHS survey data for the “current schooling status”, since DHS asks the question “Whether xx is still in school or not?” only of those who have been to school; those who have never been to school are omitted or treated as “missing”.

To obtain the correct “current schooling status” of every person, another variable must be created, say “schooling”, from “HV110 – Member still in school” by setting “schooling = 1” for the cases with “HV110 = 1”, and “schooling = 0” for all other cases. The new variable “schooling” can be created by using the “Compute” command twice: first, compute “schooling = 0” for all cases; then compute “schooling = 1” for those who are currently attending school, that is, HV110 = 1. Then, set appropriate properties for the new variable.
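The two-pass computation of “schooling” and the creation of “pop” can be mirrored in Python to check the logic before running it in PASW. The records, the use of None for a missing HV110 value, and the helper `attendance_ratio` are hypothetical; only the variable names “schooling”, “pop” and HV110 come from the text.

```python
# Hypothetical household members; None stands for a missing HV110 value
members = [
    {"age": 7,  "sex": "male",   "HV110": 1},
    {"age": 9,  "sex": "female", "HV110": None},  # never attended school
    {"age": 12, "sex": "female", "HV110": 1},
    {"age": 14, "sex": "male",   "HV110": 0},
    {"age": 30, "sex": "male",   "HV110": None},  # outside the 6-15 age range
]

# First pass: compute schooling = 0 for all cases (no IF condition)
for m in members:
    m["schooling"] = 0

# Second pass: compute schooling = 1 only where HV110 = 1 (with IF condition)
for m in members:
    if m["HV110"] == 1:
        m["schooling"] = 1

# pop = 1 for every child aged 6-15, so that summing it gives the population
for m in members:
    m["pop"] = 1 if 6 <= m["age"] <= 15 else 0


def attendance_ratio(records, sex):
    """Children aged 6-15 currently in school divided by all children aged 6-15."""
    pop = sum(r["pop"] for r in records if r["sex"] == sex)
    attending = sum(r["schooling"] for r in records
                    if r["sex"] == sex and r["pop"] == 1)
    return attending / pop


schooling_flags = [m["schooling"] for m in members]
ratio_male = attendance_ratio(members, "male")
ratio_female = attendance_ratio(members, "female")
print(schooling_flags, ratio_male, ratio_female)
```

Without the first pass, the girl who never attended school would be missing rather than 0 and would drop out of the denominator, overstating the female ratio.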


The following exhibit shows both the “statistics options” selected and the “results” obtained.

[Exhibit: “Ratio Statistics” dialog boxes with Steps 1 to 9, the selected statistics options, and the resulting output table]

Normally, the group variable is displayed on the rows and the statistics on the columns. If several statistics are chosen, the output table may be difficult to read or print. In such a case, double-click the table to open the Pivot Table editor. Then apply “Transpose Rows and Columns” under the “Pivot” menu to place the statistics on rows and the groups on columns, which makes the table easier to read.

3. TIPS AND EXERCISES

3.1 Tips: Do and Don’t

i) Do… first use the “codebook” procedure to become acquainted with the household survey dataset if complete documentation is unavailable;

Don’t… waste time searching for or requesting the actual coding scheme, or running frequency tables for all variables.

ii) Do… study the survey questionnaire and “codebook” to select the variables of interest, and make new datasets or variable sets for further analyses;

Don’t… select variables on a “trial and error” basis without studying the proper survey documentation or codebook when analyzing a newly available dataset.

iii) Do… become acquainted with the OLAP Cubes procedure; run several frequency and crosstab tables and practice using the OLAP Cubes;

Don’t… display several variables in multiple layers in one table, since it becomes difficult to get the essence of the statistics displayed, and the table becomes unusable or easily misinterpreted.

iv) Do… master the data preparation and management techniques such as computing new variables, selecting cases, creating new variable sets, data validation, etc.;

Don’t… waste time editing or correcting a secondary household survey dataset (obtained from other sources: departments, agencies, organizations, …).

v) Do… start the analysis by running “Frequencies” on every variable except the continuous (scale) variables with many different values. For the continuous (scale) variables, use the “Descriptives” procedure to explore their basic structure;

Don’t… go into in-depth analyses or calculation of ratio statistics before understanding the variables well.

vi) Do… crosstab variables with intrinsic linkages and export the outputs to spreadsheet software for better presentation, and create and present graphs and charts as appropriate in PASW or Excel;

Don’t… create oversized crosstab tables with multiple layers (use the “pivot” technique to simplify the crosstab tables).

vii) Do… run the crosstab tables (or frequency tables) to get baseline data correctly, and make further calculations and analyses in spreadsheet software;

Don’t… run the “ratio statistics” procedure (or use its outputs) if you are not sure that the process is perfectly correct.

3.2 Self-evaluation

Do you know when to use the codebook procedure in PASW Statistics?
Very well / Somewhat well / Not so much / Almost none

Are you confident that you can run the following procedures in an active dataset?
o Codebook: Confident / Somewhat confident / Not so much / Not at all
o OLAP Cubes: Confident / Somewhat confident / Not so much / Not at all
o Frequencies: Confident / Somewhat confident / Not so much / Not at all
o Crosstabs: Confident / Somewhat confident / Not so much / Not at all
o Ratio Statistics: Confident / Somewhat confident / Not so much / Not at all

Do you think you can demonstrate to your colleague how to run:
o Simple frequency tables: Definitely / Could be / Not so sure / Not at all
o Frequency tables with appropriate charts: Definitely / Could be / Not so sure / Not at all
o Simple crosstab tables: Definitely / Could be / Not so sure / Not at all
o Crosstab tables with layers: Definitely / Could be / Not so sure / Not at all
o Simple OLAP Cubes: Definitely / Could be / Not so sure / Not at all
o Pivoting crosstab tables: Definitely / Could be / Not so sure / Not at all

3.3 Hands-on Exercises

1) Import the attached “data1(tab).dat” and define all variables appropriately, and run the

codebook procedure to check whether you have defined the dataset effectively.

2) From the dataset obtained from Exercise 1 above, recode all string variables, and run

the codebook procedure to check whether you have recoded and defined the dataset

effectively.

3) Begin data analysis with selected procedures of your choice to get education

indicators which are useful for EFA monitoring.

4) Get a recent household survey dataset from your country, then note down the step-by-

step procedure on how to make use of it in education planning, especially for EFA

monitoring.

5) Follow the steps defined in the previous question and get the “data, information and

indicators” which you have defined.


4. ANNEX: WEB LINKS FOR FURTHER STUDY ON SPSS/PASW STATISTICS

1. Central Michigan University. SPSS (PASW) On-Line Training Workshop

(See http://calcnet.mth.cmich.edu/org/spss/index.htm )

2. College of Humanities and Social Sciences. Topics in Multivariate Analysis.

(See http://faculty.chass.ncsu.edu/garson/PA765/index.htm)

3. Creative Research Systems: Survey Research Aids

(See http://www.surveysystem.com/resource.htm )

4. East Carolina University. PASW/SPSS Lessons: Univariate Analysis.

(See http://core.ecu.edu/psyc/wuenschk/SPSS/spss-lessons.htm )

5. Newcastle University. Statistics Support.

(See http://www.ncl.ac.uk/iss/statistics/docs/ )

6. Research Method Knowledge Base.

(See http://www.socialresearchmethods.net/kb/index.php )

7. SPSS Web-Based Training.

(See http://www.spss.com/training/wbt/ )

8. Statistical Exercises Using PASW Statistics.

(See http://www.brad.ac.uk/lss/documentation/pasw-statistics-v17-exercise/statistical-exercises-using-PASW%20Statistics-v17.pdf )

9. UCLA Academic Technology Services. Resources to help you learn and use SPSS.

(See http://www.ats.ucla.edu/stat/spss/default.htm )

10. University of Toronto. SPSS Tutorial.

(See http://www.psych.utoronto.ca/courses/c1/spss/toc.htm )

11. Visual Statistics Studio.

(See http://www.visualstatistics.net/ )


Module B5:

Using Microsoft Excel to Elaborate PASW Outputs for Better Presentation

Contents:

1. MS Excel 2007: Basics
1.1 Result-Oriented User Interface
1.2 New File Formats in Microsoft Office Excel 2007
1.3 Data Handling Capacity of Microsoft Office Excel 2007
1.4 Selected Statistical Functions in Microsoft Office Excel

2. Further Analyses and Presenting Outputs in MS Excel
2.1 Importing PASW Databases into Microsoft Office Excel
2.2 Creating Frequency and Crosstab Tables
2.3 PivotTables (OLAP Cubes)
2.4 Drawing Pivot Charts
2.5 Elaborating PASW Outputs for Better Presentation

3. Tips and Exercises
3.1 Tips: Do and Don’t
3.2 Self-evaluation
3.3 Hands-on Exercises

Purpose and learning outcomes:

To know how to import PASW outputs into MS Excel 2007

To introduce data handling and data analysis using MS Excel 2007

To explore some advanced features of data presentation in MS Excel 2007


1. MICROSOFT OFFICE EXCEL 2007: BASICS

1.1 Result-Oriented User Interface

The layout of the main menu and the contents of the first menu tab, “Home”, are as follows:

Many dialog boxes are replaced with drop-down galleries that display the available options, and descriptive tooltips or sample previews are provided to help in choosing the right option. For example, clicking on “Paste” displays a drop-down gallery whose active options depend on which items are available in the clipboard:

[Exhibit: “Paste” drop-down gallery in four states: (1) no items in clipboard; (2) after copying an Excel range; (3) after copying a picture/image; (4) after copying text from MS Word]

Nowadays, Microsoft Excel is the most widely used spreadsheet software all over the world. The new results-oriented user interface is intended to make it easy to work in Excel 2007. Commands and features are organized on task-oriented tabs that contain logical groups of commands and features. Since the user interface has changed completely, even regular users need to familiarize themselves with its new features and look.

The Office clipboard can store up to 24 items. If the mouse hovers over the icon located at the bottom right corner of the “Paste” menu, “instant help” on the “Clipboard” will be displayed, and if the mouse hovers over “Paste”, the tooltip will be displayed as follows:

If the Clipboard area located at the bottom of the “Paste” menu is clicked, a clipboard pane with all items kept in the clipboard will be displayed. Moreover, online help for the clipboard is available.

For every activity being performed in the new user interface – whether it's formatting or analyzing

data – Excel presents the tools, tips and help that are most useful to successfully complete that task.

As such, the user interface of Office Excel 2007 is helping to obtain the desired results efficiently.

[Exhibit: Clipboard pane when the clipboard is empty and when it holds items, showing samples of items copied from Word and from Excel, a thumbnail of a copied picture/image, and the number of items kept in the clipboard]

1.2 New File Formats in Microsoft Office Excel 2007

The previous versions of Excel (from Excel 2.1 to Excel 2003) use “.xls” for Excel (data) files, “.xla” for add-ins, and “.xlt” for templates. Excel files with the extension “.xls” can hold data sheets, chart sheets and macro sheets. In Excel 2003, “.xml” is used for XML-based spreadsheet or data files (XML = Extensible Markup Language). In Office Excel 2007, the following formats and file extensions are used to distinguish different file types and to improve security:

Excel Workbook (.xlsx): The default Office Excel 2007 XML-based file format. It cannot store VBA macro code or Microsoft Office Excel 4.0 macro sheets (.xlm).

Excel Workbook (code) (.xlsm): The Office Excel 2007 XML-based and macro-enabled file format. It stores VBA macro code or Excel 4.0 macro sheets (.xlm).

Excel Binary Workbook (.xlsb): The Office Excel 2007 Binary file format (BIFF12).

Template (.xltx): The default Office Excel 2007 file format for an Excel template. It cannot store VBA macro code or Excel 4.0 macro sheets (.xlm).

Template (code) (.xltm): The Office Excel 2007 macro-enabled file format for an Excel template. It stores VBA macro code or Excel 4.0 macro sheets (.xlm).

Excel Add-In (.xlam): The Office Excel 2007 XML-based and macro-enabled Add-In, a supplemental program that is designed to run additional code. It supports the use of VBA projects and Excel 4.0 macro sheets (.xlm).

Moreover, the following file types (or filename extensions) of previous versions of Excel are still valid Excel files in Office Excel 2007 and can be opened or saved without conversion to the 2007 format:

Format                  Extension  Description

Excel 97-2003 Workbook  .xls       The Excel 97 - Excel 2003 Binary file format (BIFF8).

Excel 97-2003 Template  .xlt       The Excel 97 - Excel 2003 Binary file format (BIFF8) for an Excel template.

Excel 5.0/95 Workbook   .xls       The Excel 5.0/95 Binary file format (BIFF5).

XML Spreadsheet 2003    .xml       XML Spreadsheet 2003 file format (XMLSS).

XML Data                .xml       XML Data format.

It should be noted that Excel files created in any version can be opened and saved back in their original file type. However, Office Excel 2007 files cannot be opened in earlier versions of MS Excel unless the optional Office update for file format conversion is installed.

1.3 Data Handling Capacity of Microsoft Office Excel 2007

To allow exploration of massive amounts of data in worksheets, Office Excel 2007 supports 1,048,576 rows by 16,384 columns per worksheet (2^34, i.e., about 17 billion cells). This size is beyond what household survey datasets require: it allows about one million cases across about sixteen thousand variables. Therefore, any household survey dataset can be exported to Excel, and further analyses can be conducted in Excel 2007, which is much more familiar to most education planners and administrators.

As seen in the above exhibit, Office Excel 2007 Worksheet is “1 K” (1024) times larger than Excel

2003 worksheet. Although Excel 2007 files can be opened in Excel 2003, the contents of Excel

2007 worksheets which are located outside the Excel 2003 boundaries (65,536 rows x 256

columns) cannot be retrieved into Excel 2003.
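The capacity figures quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch in Python (not part of Excel or PASW Statistics); the constant names are chosen here for readability.

```python
# Worksheet limits quoted in the text.
EXCEL_2007_ROWS, EXCEL_2007_COLS = 1_048_576, 16_384
EXCEL_2003_ROWS, EXCEL_2003_COLS = 65_536, 256

cells_2007 = EXCEL_2007_ROWS * EXCEL_2007_COLS
cells_2003 = EXCEL_2003_ROWS * EXCEL_2003_COLS

print(cells_2007)                # 17179869184, about 17 billion cells
print(cells_2007 == 2 ** 34)     # True
print(cells_2007 // cells_2003)  # 1024, the "1 K" factor
```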

Other improvements in Office Excel 2007 compared to Excel 2003 include the following:

(a) the number of formatting types allowed in the same workbook has increased from 4 thousand in Excel 2003 to unlimited in Excel 2007;

(b) the number of cell references per cell has increased from 8 thousand to a number limited only by available memory;

(c) memory management has been increased from 1 GB to 2 GB;

(d) up to 16 million colors are supported; and

(e) dual processors and multithreaded chipsets are supported.

With such improvements, general performance of Excel has moved forward. Moreover, when using

computers with advanced chipsets, calculations are much faster in large, formula-intensive

worksheets.

[Exhibit: worksheet width in Excel 2003 (256 columns) compared with Excel 2007 (16,384 columns).]

1.4 Selected Statistical Functions in Microsoft Office Excel 2007

Excel 2007 has 346 built-in functions in 12 categories. The categories are summarized below in descending order of the number of functions:

Sr.  Category                          Number  Per cent

1    Statistical functions                 82     23.7%

2    Math and trigonometry functions       60     17.3%

3    Financial functions                   53     15.3%

4    Engineering functions                 39     11.3%

5    Text functions                        27      7.8%

6    Date and time functions               20      5.8%

7    Lookup and reference functions        18      5.2%

8    Information functions                 17      4.9%

9    Database functions                    12      3.5%

10   Cube functions                         7      2.0%

11   Logical functions                      6      1.7%

12   Add-in and Automation functions        5      1.4%

     Total                                346    100.0%

It is difficult to say which Excel functions are required for analyzing household survey data and which are not, since this depends on the experience of the user and the types of output to be generated. The following functions are directly relevant to analyzing a database or refining PASW Statistics output tables.

DAVERAGE Returns the average of selected database entries

DCOUNT Counts the cells that contain numbers in a database

DCOUNTA Counts nonblank cells in a database

DGET Extracts from a database a single record that matches the specified criteria

DMAX Returns the maximum value from selected database entries

DMIN Returns the minimum value from selected database entries

DSTDEV Estimates the standard deviation based on a sample of selected database entries

DSUM Adds the numbers in the field column of records in the database that match the criteria

DVAR Estimates variance based on a sample from selected database entries

AND Returns TRUE if all of its arguments are TRUE

FALSE Returns the logical value FALSE

IF Specifies a logical test to perform

NOT Reverses the logic of its argument

OR Returns TRUE if any argument is TRUE

TRUE Returns the logical value TRUE

ROUND Rounds a number to a specified number of digits

ROUNDDOWN Rounds a number down, toward zero

ROUNDUP Rounds a number up, away from zero

SQRT Returns a positive square root

SUBTOTAL Returns a subtotal in a list or database

SUM Adds its arguments

SUMIF Adds the cells specified by a given criteria

SUMIFS Adds the cells in a range that meet multiple criteria

SUMPRODUCT Returns the sum of the products of corresponding array components

AVERAGE Returns the average of its arguments

AVERAGEA Returns the average of its arguments, including numbers, text, and logical values

AVERAGEIF Returns the average (arithmetic mean) of all the cells in a range that meet a given criteria

AVERAGEIFS Returns the average (arithmetic mean) of all cells that meet multiple criteria.

COUNT Counts how many numbers are in the list of arguments

COUNTA Counts how many values are in the list of arguments

COUNTBLANK Counts the number of blank cells within a range

COUNTIF Counts the number of nonblank cells within a range that meet the given criteria

FREQUENCY Returns a frequency distribution as a vertical array

GEOMEAN Returns the geometric mean

GROWTH Returns values along an exponential trend

HARMEAN Returns the harmonic mean

MAX Returns the maximum value in a list of arguments

MAXA Returns the maximum value in a list of arguments: numbers, text, and logical values

MEDIAN Returns the median of the given numbers

MIN Returns the minimum value in a list of arguments

MINA Returns the smallest value in a list of arguments: numbers, text, and logical values

MODE Returns the most common value in a data set

PERCENTILE Returns the k-th percentile of values in a range

QUARTILE Returns the quartile of a data set

RANK Returns the rank of a number in a list of numbers

STDEV Estimates standard deviation based on a sample

STDEVA Estimates standard deviation based on a sample, including numbers, text, and logical values

TREND Returns values along a linear trend

TRIMMEAN Returns the mean of the interior of a dataset

The detailed descriptions of these functions and examples can be seen in online help of Microsoft

Excel 2007, and thus, will not be elaborated in this module.

2. FURTHER ANALYSES AND PRESENTING OUTPUTS IN MS EXCEL

2.1 Importing PASW Database into Microsoft Office Excel

To read PASW Statistics (*.sav) data files directly in applications that support Open Database

Connectivity (ODBC) or Java Database Connectivity (JDBC), the PASW Statistics data file driver

is required. PASW Statistics itself supports ODBC in the Database Wizard, providing the ability to

leverage the Structured Query Language (SQL) when reading SAV data files in PASW Statistics.

The PASW Statistics data file driver is packaged, together with other drivers that may be required for accessing different types of databases, in a “Data Access Pack (DAP)”, which can be downloaded from the PASW Statistics Website. A version of DAP for Windows, “DAPWin32_5.3_SP2.exe” (file size: 36,624 KB), is provided on the training CD.

After DAP is installed, an “SPSS Inc OEM Connect and ConnectXE for ODBC 5.3” program group appears in the “Start Menu Programs”. Click “ODBC Administrator”, and follow the steps below to give applications with ODBC capabilities access to PASW Statistics data files (*.sav):

1. Click “File DSN” tab; and

2. Click “Add” button to add a new data source.

With extended data handling capacities, it is possible to analyze any dataset from household surveys for assisting EFA Monitoring with Microsoft Excel. However, it is much easier to use other popular data analysis software such as PASW Statistics, then export the outputs to Excel, and elaborate and present with MS Excel.


The “Create New Data Source” user dialogue box will appear. There, all available drivers in

the computer will be listed, and

3. Select “SPSS Inc. 32-Bit Data Driver (*.sav)”; and

4. Click “Next”.

There, a new Data Source Name (DSN) will be requested, and

5. Type in an appropriate DSN (“SPSS-Training” in this example); and

6. Click “Next”.

7. The “Create New Data Source” dialogue will provide summary information on the current setting. If it is correct, click “Finish” to complete creation of a ‘file DSN’.

At this point the program will request to identify the location and fill in correct folder name

with complete “path” of the PASW Statistics data files.

8. In this example type-in: “c:\....\My Documents\SPSS Training\Sample” where all

sample datasets are stored, and Click “OK”;

9. Click “OK” again to complete and exit from “ODBC Data Source Administrator”.


After creation of the new ODBC data source, the newly defined file DSN name, “SPSS-Training”, will be listed in Windows applications with ODBC capabilities. Any PASW data file (*.sav) located in the specified folder can then be accessed from other applications, which can retrieve the full dataset through “existing ODBC connections” or part of it through “Microsoft Query”.

When clicking “Existing Connections” under “Data” menu in Office Excel 2007, “SPSS-Training”

will be displayed as one of the existing external data sources for Excel (see “A” in the following

exhibit). By selecting this connection, one can retrieve any dataset (whole dataset) from the list.

Similarly, when clicking “From Other Sources” and selecting “From Microsoft Query”, one can see

the “SPSS-Training” as a data source (see “B”), and by following the Wizard, users can retrieve

part of a dataset: only cases which satisfied set conditions and only the selected variables.

In short, follow the steps below to import a complete PASW Statistics dataset into Excel 2007:

1. Click “Data” tab;

2. Click the “Existing Connections” button to get the “Existing Connections” dialog box;

3. Select “SPSS-Training” from the list of available connections; and

4. Click “Open” and a complete list of PASW Statistics datasets in the specified folder

(set while creating the file DSN “SPSS-Training”) will be displayed as “Tables”.

5. Select the dataset (by clicking on the name) and click “OK” button;

6. In the import data window, select where to place the imported data, in the “Existing

worksheet” (active worksheet) or in a “New worksheet”. If the “Existing

worksheet” is selected, one can define the place to import data (default is $A$1).

7. Click “OK” to start importing process, which will take a few minutes.


At the end of the importing process, the PASW dataset will be placed on the specified Excel worksheet with a name like “Table_SPSS_Training” and treated as an Excel “Database Table”.

In this example, the file is saved with the name “Excel2.xlsx”. When opening the Excel file with

imported database, Office Excel 2007 will issue a “Security Warning” with the message “Data

connections have been disabled” together with an “Option” tab. If the imported data requires

updating from the source PASW dataset, or requires importing another dataset, the user must

enable the data connection. Otherwise, the user can choose to disable the data connection.

Warning:

Importing data into Excel (as well as into other databases) cannot retrieve metadata (labels,

missing values, etc.), but only data values. Therefore, user must have the codebook of the

dataset (and the survey questionnaire) before doing any analysis. As usual, after successfully

importing PASW datasets, first, the Excel file with imported databases must be saved with an

appropriate name.

In the Excel worksheet, the variable names are placed in the first row, with “Autofilter” enabled for all variables. The “Autofilter” feature helps in checking for invalid entries and in selecting cases that fulfil specified rules. If “Autofilter” is not required, it can be turned off by clicking the filter button; clicking it again turns Autofilter back on.

Example:

To select the cases for children aged 6, click the down arrow next to “HV105”, clear the tick next to “Select All” (to unselect all), tick the box next to 6 and click “OK”. In the following exhibit, the “status bar” (located at the bottom-left corner) shows that there are altogether 53,413 records (or cases) in the database, of which only 1,302 records for children aged 6 are found and selected.

If another variable, sex (HV104), is also filtered to show only “1 (Male)”, the following output will be obtained with only 656 records (boys aged 6).
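The logic of applying two successive filters can be sketched outside Excel as well. The Python fragment below is a minimal illustration under stated assumptions: the records are made-up sample values, and the helper name autofilter is hypothetical; only the variable names HV104 (sex) and HV105 (age) come from the text.

```python
# Hypothetical sample of household-member records; the values are made up.
records = [
    {"HV104": 1, "HV105": 6},
    {"HV104": 2, "HV105": 6},
    {"HV104": 1, "HV105": 7},
    {"HV104": 1, "HV105": 6},
    {"HV104": 2, "HV105": 15},
]

def autofilter(rows, **conditions):
    """Keep only the rows whose fields equal every given condition."""
    return [row for row in rows
            if all(row[field] == value for field, value in conditions.items())]

aged_6 = autofilter(records, HV105=6)      # first filter: age 6
aged_6_boys = autofilter(aged_6, HV104=1)  # second filter: male

print(len(records), len(aged_6), len(aged_6_boys))  # 5 3 2
```

As in the worksheet, the second filter only narrows the rows that survived the first one.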


If the entire worksheet is selected, copied and pasted on a new sheet while a filter is active, only the filtered records (the unhidden rows) are pasted in the new worksheet. Unwanted variables can then be selected and deleted column by column to clean up the Excel database. The final result is exactly the same as a dataset imported through “Microsoft Query”, which is more complicated for those who are not acquainted with manipulating databases (see the steps in the following exhibits).

[Exhibits: importing part of a dataset through “Microsoft Query”. Captions:]

Select the dataset and send the entire set, or individual variables one by one, to the right pane.

Setting Condition 1 to import only the cases of children aged 6.

Setting Condition 2 to import only the cases of “boys”.

The database can be sorted on the selected variables while importing.

Setting the option for the location of the imported database; the query can be saved for future use.

The output (result): there are 656 cases (+1 row for variable names) in the imported database for boys aged 6.

2.2 Creating Frequency and Crosstab Tables

The Excel function “FREQUENCY” is useful for creating an entire frequency table from a range of cells or from a variable in a database table. “COUNTIFS”, on the other hand, can be used to obtain the value for an individual cell of a frequency table or crosstab table.

Using FREQUENCY Function

“FREQUENCY” is a worksheet function under “Statistical functions” category. It counts how often

values occur within a range of values, and then returns a vertical array of numbers. For example,

use FREQUENCY to count the number of males and females among the household members.

Because FREQUENCY returns an array, it must be entered as an array formula.

The following steps construct a table presenting the sex distribution of household members, both in absolute numbers and as a percentage distribution, using the FREQUENCY function. The variable used is “HV104”, with the codes “1=Male” and “2=Female”, in the imported database “SPSS_Training”.

1. Prepare the table structure, formulas and “bin” array as in the following exhibit;

2. Select cell “B3” and type in “=FREQUENCY(SPSS_Training[HV104],$G$3:$G$4)”;

3. Select the range “B3:B4”;

4. Press “F2” to get into formula editing mode, and press “<Ctrl><Shift>ENTER” to reenter

formula as an array formula; and

5. Set the display formats of the number cells and the table, as necessary.
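To make the behaviour of the array formula in the steps above more concrete, the following Python sketch imitates it on a small made-up sample. It assumes the usual binning rule of Excel's FREQUENCY (each count covers values up to and including its bin value and above the previous one, with one extra element for values above the last bin); the function name frequency and the sample data are illustrative only.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Return len(bins) + 1 counts: counts[i] is the number of values
    that are <= bins[i] and greater than the previous bin; the last
    element counts values above the final bin.
    Assumes `bins` is sorted in ascending order."""
    counts = [0] * (len(bins) + 1)
    for value in data:
        counts[bisect_left(bins, value)] += 1
    return counts

hv104 = [1, 2, 2, 1, 2, 1, 2]  # hypothetical sex codes: 1 = Male, 2 = Female
bins = [1, 2]                  # plays the role of the "bin" array in G3:G4

counts = frequency(hv104, bins)
print(counts)                  # [3, 4, 0]

total = sum(counts)
print([round(100 * c / total, 1) for c in counts[:2]])  # [42.9, 57.1]
```

The third element (0 here) is the overflow count; the worksheet example does not display it because only the two-cell range “B3:B4” receives the array.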

Using COUNTIF or COUNTIFS Function

A frequency table can also be constructed by using the COUNTIF or COUNTIFS functions. The frequency table above can be constructed as follows:

1. Prepare the table structure, formulas and “codes” as in the previous example;

2. Select cell “B3”, and type in “=COUNTIF(SPSS_Training[HV104],G3)”;

3. Copy “B3” and paste at “B4”; and

4. Apply the display formats of the number cells and the table, as necessary.

Note: In the formula, “=COUNTIFS(SPSS_Training[HV104],G3)” can also be used in this example.

COUNTIF allows only one condition while COUNTIFS can be used with multiple conditions.

Using COUNTIFS Function to construct a crosstab table

Although the “FREQUENCY” function cannot be used to construct a crosstab table, the “COUNTIFS” function can be used to obtain the value of each and every cell of the table. The following example shows how to construct a more complicated crosstab table of educational attainment (HV109) by sex of household members (HV104) for the population aged 15-24 (Age: HV105):

1. Prepare the table structure, formulas and “codes” for both variables;

2. Select cell “B5”, and type in: =COUNTIFS(SPSS_Training[HV109],$I5,SPSS_Training[HV104],B$14,SPSS_Training[HV105],">14") -

COUNTIFS(SPSS_Training[HV109],$I5,SPSS_Training[HV104],B$14,SPSS_Training[HV105],">24");

Here, the first COUNTIFS counts the population older than 14 (i.e., aged 15 and above) with a specific education level and sex, and the second COUNTIFS counts the population older than 24 (i.e., aged 25 and above) with the same characteristics. Therefore, the difference represents the population aged 15-24.

3. Copy “B5” and paste to the range “B4:C11”; and

4. Complete the formulas, apply the display formats, etc., as necessary to obtain the following output table.
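The subtraction of two COUNTIFS results can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the records are made up, and the helper countifs is a hypothetical stand-in that accepts either a plain value (equality) or a predicate in place of an Excel criteria string such as ">14".

```python
def countifs(rows, **criteria):
    """Count rows for which every criterion holds.
    A criterion is a plain value (equality test) or a one-argument
    predicate standing in for Excel criteria such as ">14"."""
    def matches(row):
        return all(test(row[field]) if callable(test) else row[field] == test
                   for field, test in criteria.items())
    return sum(1 for row in rows if matches(row))

# Hypothetical records: HV109 = educational attainment code,
# HV104 = sex (1 = male, 2 = female), HV105 = age.
rows = [
    {"HV109": 2, "HV104": 1, "HV105": 16},
    {"HV109": 2, "HV104": 1, "HV105": 30},
    {"HV109": 2, "HV104": 2, "HV105": 24},
    {"HV109": 1, "HV104": 1, "HV105": 14},
    {"HV109": 2, "HV104": 1, "HV105": 20},
]

older_than_14 = countifs(rows, HV109=2, HV104=1, HV105=lambda age: age > 14)
older_than_24 = countifs(rows, HV109=2, HV104=1, HV105=lambda age: age > 24)
aged_15_24 = older_than_14 - older_than_24

print(older_than_14, older_than_24, aged_15_24)  # 3 1 2
```

The result for one cell (males with education code 2) equals a direct count of the ages 15 to 24, which is why the difference of the two counts can fill each cell of the crosstab.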

As described above, frequency and crosstab tables can be constructed in Microsoft Office Excel. However, construction of such tables is much more complicated if the sampling procedure requires “weighting”. In that case, construct the tables with “weight on” in PASW Statistics and export the outputs to Microsoft Office Excel for further elaboration and presentation.

2.3 PivotTables (OLAP Cubes)

Unweighted frequency and crosstab tables with multi-layers, which are useful in analyzing

household survey data, can be constructed in Microsoft Office Excel with PivotTable technique. A

PivotTable is an interactive way to quickly summarize large amount of data, to conduct in-depth

analysis and to answer unanticipated questions about the data. It is especially designed for:

Querying large amounts of data in many user-friendly ways;

Subtotaling and aggregating numeric data; summarizing by categories and subcategories,

and creating custom calculations and formulas;

Expanding and collapsing levels of data to focus the results, and drilling down to details

from the summary data for areas of interest;

Moving rows to column or columns to rows to see different summaries of the source data;

Filtering, sorting, grouping, and conditionally formatting the most useful and interesting

subset of data to enable focus on the required information; and

Presenting concise, attractive, and annotated online or printed reports.

In a PivotTable, each column in the source data (or database) becomes a PivotTable field (a ‘field’ in Excel is a ‘variable’ in PASW Statistics) that summarizes multiple rows of information. A value field provides the values to be summarized. By default, data (of the variables) in the “Values” area summarize the underlying source data in the PivotTable using the SUM function for numeric variables and the COUNT function for text (string) variables.

To create a PivotTable, first define its source data, specify a location in the workbook or the database table, and lay out the fields as follows:

1. Select the sheet with imported database and click “Insert” tab in the main menu;

2. Click “PivotTable” button to get the “Create PivotTable” dialog box;

3. Since the active worksheet contains the imported “SPSS_Training” database table, it

will appear automatically in the “Table/Range” selection box. However, users can

change the data source to another table or to a specific range (e.g., A1:C2000);

4. Select where to place the PivotTable: “New Worksheet” or “Existing Worksheet”,

and if “Existing Worksheet” is selected, user should provide the first cell address;

In this example, just leave it as default “New Worksheet”; and

5. Click “OK” to create a new worksheet with “PivotTable creation tools”.


Then, the following new worksheet, equipped with tools to assist in creating a PivotTable, will be created:

The following tools are available for creating, elaborating and editing the PivotTable.

6. From the “PivotTable Field List”, select variables (or fields) and drag and drop them to:

(a) Values: the variables to be summarized (count, sum, etc.);

(b) Row Labels: the variables to be displayed on the rows (can be nested);

(c) Column Labels: the variables to be displayed on the columns (can be nested);

(d) Report Filter: the variables to be used for filtering/subsetting the database.

As soon as a variable is dragged and dropped into a box, the place mark for the PivotTable on the worksheet will be replaced with an actual PivotTable with default settings.

Construction of a “PivotTable” will be demonstrated by creating a crosstab table of “Educational

attainment by Sex for Population Aged 15-24”.

To do this, first decide which variables (or fields) are to be put into which box (“value, row, column or filter”) to get the required table. In this example, educational attainment (HV109) is the key variable to be explored, and its education levels are to be displayed in the rows.

[Exhibit: the newly added sheet, showing the place mark for the PivotTable and the PivotTable Field List (Step 6).]

Therefore, drag HV109 from the PivotTable field list and drop it into both:

(a) value (to count how many persons in each category), and

(b) row (to display education levels in rows).

And, the following PivotTable showing the “frequency of HV109” will be created:

The items displayed in rows can also be selected. For example, eight items (0, 1, 2, 3, 4, 5, 8, and (blank)) are displayed in cells A4 through A11. Since the code “8” represents “unknown” and “(blank)” is simply a “missing value”, these two items, or at least the item “(blank)”, should not be displayed in the frequency table. To hide an item, click the dropdown next to “Row Labels”, uncheck the box next to “(blank)” and click “OK”. In this example, however, this refinement will be carried out only when finalizing the PivotTable.

The next step is to place “Sex (HV104)” into column box to obtain the following crosstab table:

Here, “value labels” can be directly typed into a PivotTable, and the new labels will replace the

defaults. For example, the column labels “1” can be replaced with “Male” and “2” with “Female”.

These fine-tunings will be carried out only when finalizing the PivotTable.

The current PivotTable represents entire household population irrespective of age, but the

requirement is just for the “population aged 15-24”. To fulfill this requirement, the cases must be

filtered by “age”. Therefore, send the variable “age (HV105)” to the “filter” box. It should be noted

that, although the filtering variable is set, the table will be unchanged since no filtering is in place.

Therefore, click on the “dropdown” icon next to “(All)” in cell B2, then, tick “Select Multiple

Items” checkbox, and leave the ticks only for the ages between 15 and 24 inclusively.

The above exhibit presents the PivotTable after tuning the captions (value labels) and column widths. A PivotTable can be copied, in whole or in part, for use in other purposes.

PivotTable is more useful if multiple tables with the same structure are required for different groups

(e.g. for different ages), or presenting the same table with selected rows and/or columns only. For

example, the same table for adults (aged 15+) can be created by clicking dropdown icon next to

“(Multiple Items)” in Cell B2, first, tick “(All)”, and clear off ticks next to “0”, “1”, “2”, …, “14”

(see A). Similarly, to create a table for all adults but with “up to complete primary” education only,

click the dropdown icon next to “Row Labels” and select only the first three categories (see B).

As seen in these examples, PivotTable method is user-friendly, powerful and efficient in analyzing

household survey data, especially for the surveys applying “self-weighting” sampling designs.

[Exhibits A and B: the places in the PivotTable where these changes are applied.]

2.4 Drawing Pivot Charts

PivotChart provides a graphical representation of the data in a PivotTable. The layout and data that

are displayed in a PivotChart can be changed just as in a PivotTable. A PivotChart always has an

associated PivotTable that uses a corresponding layout. Both of them have fields that correspond to

each other, that is, when changing the position of a field in the PivotTable, the corresponding field

in the other report also moves.

In addition to the series, categories, data markers, and axes of standard charts, PivotChart reports have some specialized elements that correspond to the PivotTable, as follows:

Filter field: A field to filter data by specific items. In the example, the “age” field displays data for all ages. To display data for a single age or selected ages, click the drop-down arrow next to (All) and then select one or more ages.

Values field: A field from the underlying source data that provides values to compare or

measure. Depending on the source data of the report, the summary function can be

changed to Average, Count, Product, or another calculation.

Series field: A field that is assigned to a series orientation in a PivotChart. The items in

the field provide the individual data series. In a chart, series are represented in the legend.

Item: Items represent the unique entries in a column or row field, and appear in the drop-down lists for report filter fields, category fields, and series fields. Items in a category field appear as the labels on the category axis of the chart. Items in a series field are listed in the legend and provide the names of the individual data series.

Category field: A field from the source data assigned to a category orientation in a

PivotChart report. It provides the individual categories for which data points are charted.

In a chart, categories usually appear on the x-axis, or horizontal axis, of the chart.

Customizing the chart: The chart type and other options (such as, the titles, the legend

placement, the data labels, the chart location, and so on) can be changed.

A PivotChart can be created automatically when creating a PivotTable or from an existing

PivotTable. To create a PivotChart from an existing PivotTable, follow the steps:

1. Select any place (cell) on the existing PivotTable, two new menu items “Options”

and “Design” will be added (under “PivotTable Tools” group) in the main menu;

2. Click “PivotChart” under “Options” tab to get the “Insert Chart” dialog box;


3. Choose “Chart Type” from the “Insert Chart” dialog box; and

4. Click “OK” to create a basic PivotChart together with a “PivotChart Filter Pane”.

A PivotChart created automatically is only a “draft”; in particular, it has no chart title. It must therefore be edited using the following tools, which become available when an active PivotChart is clicked:


For example, to add “Education Level by Sex (Aged 15+)” as the chart title above the drawing (chart), click “Chart Title” under the “Layout” command, and select the third option, ‘Above Chart’. After a few other touch-ups, such as moving the legend, changing the chart design, and putting border lines around the plot area, the following PivotChart is complete and ready to use.

Another useful adjustment in both PivotTable and PivotChart is to display the “values” not in the

absolute numbers, but in percentages. To review the percentage distribution of education level by

sex for adults:

1. Click on the “dropdown” of the variable in the “value” area;

2. Select “Value Field Settings…”;

3. Select “Show values as” tab in the “Value Field Settings” dialog box;

4. Select “% of column” in “Show values as” dropdown list; and

5. Click “OK” .
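What “% of column” computes can be sketched in a few lines of Python: each cell count is divided by the total of its column. The data and the function name percent_of_column below are illustrative assumptions, not output from the training dataset.

```python
from collections import Counter

# Hypothetical (education level, sex) pairs for adults.
cases = [
    ("Primary", "Male"), ("Primary", "Female"), ("Secondary", "Male"),
    ("Secondary", "Male"), ("Secondary", "Female"), ("Higher", "Female"),
    ("Primary", "Male"), ("Higher", "Female"),
]

counts = Counter(cases)                           # one count per table cell
column_totals = Counter(sex for _, sex in cases)  # one total per column

def percent_of_column(level, sex):
    """Share of a cell in its column total, in per cent."""
    return 100.0 * counts[(level, sex)] / column_totals[sex]

for sex in ("Male", "Female"):
    for level in ("Primary", "Secondary", "Higher"):
        print(sex, level, round(percent_of_column(level, sex), 1))
```

Within each column the percentages add up to 100, so the distribution of education levels can be compared between the sexes even when the column totals differ.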


Then, the following table and chart will be obtained after adjusting display formats, especially

number of decimal places in percentages.

2.5 Elaborating PASW Outputs for Better Presentation

Although PivotTable and PivotChart are user-friendly and efficient ways to present household survey data in both tabular and graphical form, PASW Statistics provides a broader range of methods and options, and is capable of using “weights” for complex sampling techniques. On the other hand, Excel is more familiar to users, and makes it easier to carry out further analyses on output tables from different analyses. Therefore, the best blend is to analyse the dataset in PASW Statistics and to finalize the outputs in Office Excel.

In this section, the use of “weights” in the calculation of the school-age population and the number of children currently attending school in PASW Statistics, and the calculation and presentation of age-specific enrolment rates in Microsoft Office Excel 2007, are demonstrated step by step.

Basics of Weighting

In a town with 2 Wards, there are 100 children aged 6-10 in Ward-1 and 50 in Ward-2. Of those

children, a survey on “schooling status” was conducted by selecting 25 children from Ward-1 and

20 children from Ward-2. It was found out that 5 children (out of 25) from Ward-1 and 6 children

(out of 20) from Ward-2 were not currently in school. Therefore, percentage of out-of-school

children (say, POS) can be estimated as:

POS (Ward-1) = 5 / 25 x 100 = 20.0%

POS (Ward-2) = 6 / 20 x 100 = 30.0%, and

Percentage of out-of-school children in the town can be estimated as:

POS (Ward 1+2) = 11 / 45 x 100 = 24.4%. …………. (1)

POS (Ward 1+2) = (20.0%+30.0%) / 2 = 25.0%. …………. (2)

Although the percentages of out-of-school children by Ward can represent the respective Wards, neither of the above percentages calculated for the entire town represents it correctly. The main reason is that the sample is not “self-weighting”, i.e., it is unbalanced between the two Wards: the sampling fraction for Ward-1 is 25 / 100 or 25.0% while that for Ward-2 is 20 / 50 or 40.0%. In other words, a child in the sample from Ward-1 represents 4 children while a sample child from Ward-2 represents just 2.5 children.

To obtain a correct estimate for the town, it should be calculated as follows:

Since the POS (Ward-1) is 20.0%, it is expected to have 20 out-of-school children (20.0% x 100)

in Ward-1 and it is expected to have another 15 children (30.0% x 50) in Ward-2.

Therefore, there could be 35 out-of-school children out of 150 children aged 6-10, and the POS

for the Town is (35 / 150 x 100 = 23.3%).

Equivalently, the number of out-of-school children in Ward-1 and Ward-2 can be estimated as 5 x 4.0 = 20 (since one child in the sample represents 4 children in Ward-1) and 6 x 2.5 = 15. These numbers, 4.0 and 2.5, are known as “sample weights”, and are normally provided in the datasets.
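The two-Ward example can be reproduced in a short Python sketch, which may help when checking weighted results by hand. The figures are those of the example above; the dictionary layout and the function name sample_weight are illustrative choices, not PASW Statistics terminology.

```python
# Figures from the two-Ward example in the text.
wards = {
    "Ward-1": {"population": 100, "sampled": 25, "out_of_school": 5},
    "Ward-2": {"population": 50, "sampled": 20, "out_of_school": 6},
}

def sample_weight(ward):
    """Number of children in the population represented by one sampled child."""
    return ward["population"] / ward["sampled"]

# Weighted totals: every sampled child counts as `sample_weight` children.
weighted_out = sum(w["out_of_school"] * sample_weight(w) for w in wards.values())
weighted_all = sum(w["sampled"] * sample_weight(w) for w in wards.values())

unweighted_pos = (100 * sum(w["out_of_school"] for w in wards.values())
                  / sum(w["sampled"] for w in wards.values()))
weighted_pos = 100 * weighted_out / weighted_all

print(round(unweighted_pos, 1))  # 24.4, estimate (1), not representative
print(round(weighted_pos, 1))    # 23.3, the weighted estimate for the town
```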

In PASW Statistics, it is easy to apply weights if they are provided in the dataset:

1. Click “Data” on the main menu;

2. Click “Weight Cases…” and the “Weight Cases” dialog box will appear;

3. In the “Weight Cases” dialog box, select “Weight cases by”;

4. Select the variable representing the “weight” (it is HV005 – Sample weight in the DHS dataset); and

5. Click “OK” to complete the weighting process.

When weighting is no longer necessary, select “Do not weight cases” in Step 3 above and click “OK” to stop weighting.
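What “Weight Cases” does to a frequency table can be illustrated outside PASW. The sketch below, in Python, counts cases with and without a weight variable. The records are hypothetical (not real DHS data) and the function name is an assumption. In the standard DHS recode files, HV005 is stored as an integer with six implied decimal places, so it is usually divided by 1,000,000 before use; whether the training dataset already applies that scaling is not stated here, so treat the `scale` argument as an assumption.

```python
from collections import defaultdict

# Hypothetical household-member records (illustration only, not real DHS data).
# hv005 follows the DHS convention of an integer weight with six implied decimals.
records = [
    {"sex": "Male", "hv005": 1500000},
    {"sex": "Male", "hv005": 1000000},
    {"sex": "Female", "hv005": 500000},
    {"sex": "Female", "hv005": 750000},
    {"sex": "Female", "hv005": 1250000},
]


def frequency(records, key, weight_field=None, scale=1_000_000):
    """Count cases by `key`; if `weight_field` is given, sum scaled weights instead."""
    counts = defaultdict(float)
    for record in records:
        weight = record[weight_field] / scale if weight_field else 1.0
        counts[record[key]] += weight
    return dict(counts)


unweighted = frequency(records, "sex")
weighted = frequency(records, "sex", weight_field="hv005")

print(unweighted)
print(weighted)
```

In this made-up sample there are more female than male cases, yet the weighted counts are equal, which is the kind of shift in the percentage distribution that weighting produces.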


The following tables present the population aged 6-10 by sex, with and without weighting.

The differences due to weighting can be observed in the percentage distribution of the population by age and sex. Similarly, the following tables present the weighted and unweighted numbers of children currently attending school (HV110 – Member still in school) by age and sex.


From these two sets of tables, one can calculate, in Excel, the proportion of children currently attending school by age and sex, or the percentage of out-of-school children by age and sex.

Since it is easier to export all outputs from the PASW Statistics Viewer at once, first clear unnecessary outputs, such as logs, notes, and case processing summaries; then export to Excel:

1. Click “File” on the main menu;

2. Click “Export…” and the “Export Output” dialog box will appear;

3. In the “Export Output” dialog box:

a) Set “Objects to Export” to “All”;

b) Set “Document Type” to “Excel (*.xls)”;

c) Provide the “File Name” with folder path; and

d) Click “OK” to begin exporting outputs to Excel.


At the end of this process, an Excel file, “POS.xls”, will be placed in the specified folder with four cross-tabulation tables exported from PASW. The file can be seen in the following exhibit.

The required percentage of children in school can be obtained easily by adding three more columns to the last two tables and dividing the number of children in school by the total number of children of the respective age and sex. This is not so simple in PASW. The percentages of children in school by age and sex are presented in the following tables, with titles and captions rephrased.
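The division described above is the same for every cell of the table. As a minimal sketch in Python, assuming the three added columns are male, female, and both sexes combined, the calculation looks like this. The counts are made-up illustration values, not figures from the exported DHS tables, and the function names are hypothetical.

```python
# Hypothetical counts (illustration only, not taken from the exported tables).
# Each entry is (children in school, total children) for that age and sex.
counts = {
    6: {"Male": (40, 50), "Female": (45, 50)},
    7: {"Male": (54, 60), "Female": (57, 60)},
}


def percent_in_school(in_school, total):
    """Children in school as a percentage of all children, to one decimal place."""
    return round(in_school / total * 100, 1)


def enrolment_table(counts):
    """Build the three percentage columns: Male, Female and Both sexes."""
    table = {}
    for age, by_sex in counts.items():
        row = {
            sex: percent_in_school(in_school, total)
            for sex, (in_school, total) in by_sex.items()
        }
        both_in_school = sum(in_school for in_school, _ in by_sex.values())
        both_total = sum(total for _, total in by_sex.values())
        row["Both sexes"] = percent_in_school(both_in_school, both_total)
        table[age] = row
    return table


rates = enrolment_table(counts)
for age, row in rates.items():
    print(age, row)
```

Note that the “Both sexes” rate is computed from the summed counts rather than by averaging the male and female percentages, for the same reason that estimate (2) in the weighting example is unreliable.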


To demonstrate the visualization of data through charts, the percentage of children in school by age (or Age-Specific Enrolment Rate) will be presented in “3-D Clustered Column” and “Line” charts, which are appropriate for the data. To create a 3-D Clustered Column chart, follow these steps:

1. On the table, (a) select Cell ‘A36’, unmerge it, and type “Age” into Cell ‘B37’; similarly, (b) select Cell ‘A43’, unmerge it, and type “6-10” into Cell ‘B43’;

2. Select the “Data Source” for the chart: age (B37:B43 - X-axis), and the percentage of children in school for males (F37:F43 - Series 1) and for females (G37:G43 - Series 2);

3. Click “Insert” on the main menu;

4. Click “Column” to get the list of available Column Charts;

When the user hovers the mouse over “Column”, a concise but useful tip pops up: “Column charts are used to compare values across categories”. Similar tips pop up when pointing at other chart types.

5. Click the “3-D Clustered Column” icon, the first one under the “3-D Column” group;

The following draft chart, based on the selected data, is then displayed instantly.

6. The next step is to finalize the chart in Excel:

a. Click on the chart, and then click “Layout” under “Chart Tools”;

b. Click “Chart Title”, select “Above Chart” to insert a space for the chart title, and type “Age-Specific Enrolment Rate by Sex, Aged 6-10” into that space; and

c. Click “Axis Titles”, set “Primary Horizontal Axis Title” to “Title Below Axis”, and type “Age” into the space that appears.

At this stage, the chart is usable. However, it could be polished further, for example:

d. To change the location of the legend (just select it, then drag and drop it at the new location);

e. To change the gap width between items (select one series, right-click to get the pop-up menu, click “Format data series”, and set “Gap width/depth”);

f. To change the series colour (select one series, right-click to get the pop-up menu, click “Format data series”, and set the colour in “Fill”);

g. To format any … (select that item, right-click to get the pop-up menu, and set); and

h. To move or resize the chart, chart title, legend, etc.


The following chart will be obtained after a few final touches:

The same procedure is used to create a line graph, except that the selected data range should cover ages 6 to 10 but not the total (aged 6-10). Line charts are normally used to display trends over time or age; including the total (aged 6-10) in the series would therefore mislead viewers.

It is also useful to include a series for both sexes in the line graph, and the differences can be observed more clearly if the rate axis begins at 60% instead of 0%. A few such adjustments to the above line chart yield the following final version.


3. TIPS AND EXERCISES

3.1 Tips: Dos and Don’ts

i) Do… export to Excel from PASW Statistics with data in “labels” as much as possible, rather than exporting only numeric data values;

Don’t… import PASW Statistics datasets to Excel without having the codebook (the coding scheme used in creating the dataset) or the questionnaire with codes.

ii) Do… practice importing a PASW Statistics dataset into Office Excel 2007, and check the correctness of the database table in Excel by constructing frequency tables;

Don’t… edit the imported database before saving it, or leave the computer with unsaved files.

iii) Do… autofilter on one or more fields (variables) to extract data meeting certain criteria or to review invalid cases (data validation);

Don’t… forget to release the autofilter from fields that are no longer in use; otherwise, cases will be filtered wrongly.

iv) Do… practice using, and use, PivotTable and PivotChart as and where appropriate;

Don’t… edit a PivotTable or PivotChart too heavily or undo many times in a row; this may slow the computer down or make it hang completely.

v) Do… use the PivotTable technique to create frequency and crosstab tables, and check the outputs thoroughly;

Don’t… trust computer outputs blindly. Don’t use those tables and charts in presentations or dissemination before checking them thoroughly.


3.2 Self-evaluation

Are you able to work with Microsoft Excel 2007 to:

a. import SPSS dataset
Very well / Somewhat well / Not so much / Almost None

b. select some rows (cases) using auto-filter
Very well / Somewhat well / Not so much / Almost None

c. create frequency table
Very well / Somewhat well / Not so much / Almost None

d. construct a two-way (crosstab) table
Very well / Somewhat well / Not so much / Almost None

e. develop a Pivot Table
Very well / Somewhat well / Not so much / Almost None

f. create a Pivot Chart
Very well / Somewhat well / Not so much / Almost None

Are you confident that you can export selected output tables from PASW Statistics to Microsoft Office Excel 2007?
Confident / Somewhat confident / Not so much / Not at all

Are you confident that you can elaborate PASW Statistics output tables in Microsoft Office Excel 2007?
Confident / Somewhat confident / Not so much / Not at all

3.3 Hands-on Exercises

1) Import the attached “BDPR50FL(Validate).sav” into Excel.

2) Using the dataset obtained in Exercise 1 above, validate the database table in Excel for various errors, and recommend, with reasons, whether the imported database is valid to use.

3) Import the attached “BDPR50FL1.sav” into Excel and extract the cases with “out-of-school children aged 6-10”.

4) Create a PivotTable and PivotCharts to present the “percentage of out-of-school children aged 6-15” by Division.