DATA HANDLING

USING SPSS 19

Authored by Laine Ruus, EDINA and Data Library, University of Edinburgh

Edited by Robin Rice and Pauline Ward

August 2014

Based on material first written by Joan Corbett, National Centre for Social Research

as part of the Research Data MANTRA (Management Training) course

http://datalib.edina.ac.uk/mantra

This work is licensed under the Creative Commons Attribution 4.0 License.

To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.




Course Outline

Section                                      Paragraph numbers

Introduction                                 1-3
Data log file and the SPSS environment       4-7
Running SPSS                                 8-12
Creating an SPSS system file                 13-17
Output Viewer                                18-21
Syntax Window                                22-23
Displaying and saving dataset information    24-26
Descriptive statistics and missing values    27-42
Compute and recode                           43-54
Add variables and add cases                  55-62
Getting your data out of SPSS                63-69

Appendix A: Online resources

Appendix B: Selected inter-system limitations on filenames, variable and value names etc

Appendix C: Common file and variable transformations and their corresponding SPSS commands

Introduction

The objective of this exercise is to introduce you to some techniques for using SPSS as well as some

other tools to support your research data management (RDM) activities during the course of your research. It

is not about doing statistical analysis using SPSS, but rather how to organize and document your data

management activities in the context of using SPSS for your analyses.

1. When embarking on the exploration of a new research question, after the literature review, and

the formulation of preliminary hypotheses, the next task is generally to begin to identify (a) what

variables you need in order to test your hypothesis, (b) what datafiles (if any) are available that

contain those variables, and (c) what software to use that has the statistical routines and

capabilities you require.

2. The questions you need to be able to answer, vis-à-vis any software you decide to use, are (a)

does the software support the statistical analyses that are most appropriate for my research

question and data? (b) how robust are the measures that it will produce? (c) will it support the

data exploration and data transformations I will need to perform? (d) how will I get my data into


the software (ie what file formats can it read)?, and (e) most importantly, how can I get my data

out of that software (along with any transformations, computations etc) so that I can read it into

other software for other analyses, or store it in a software-neutral format for the longer term?

This practical assumes you have decided to use SPSS for your analyses, at least in part.

Advantages to SPSS include: flexible input and output capabilities, good metadata management

capabilities, and fairly intuitive command names, as well as good documentation.

3. The data being used in this exercise are a subset of variables and cases from:

Norris, Pippa. (2010). May 6th 2010 British General Election Constituency Results Release 5.0

Subset rev. by J. Corbett. [Dataset]. Harvard University. John F. Kennedy School of Government.

Available: http://www.pippanorris.com/

You’ll notice that the citation specifies that this is ‘Release 5.0’. An important part of data

management is keeping track of dataset versions and documenting the changes that have

happened between versions. The web page describing this data set has this information.

Data log file

4. As part of managing your data it is essential to create your own documentation as you work through your analyses. It is good practice to set up a data log right at the start of a project. Use this to keep track of things like the versions and locations of datafiles and documentation, notes about variables and values, and variable and data transformations, as well as the files containing output from your analyses.

5. What software you use to manage your data log is a matter of personal choice. One example of a data log format, using Microsoft Word, can be found in ‘ScotMP_data_log.doc’. Some researchers prefer to use a word processor (eg MS Word), others a format-neutral text editor, such as Notepad or EditPad Lite, and yet others (including me) prefer the table handling and sorting capability of Microsoft Excel (see the file ‘scotmp_dm.xls’). Open a new Excel spreadsheet, and eg on sheet 1, enter, in successive columns:

- Current date (YYYYMMDD)
- The input file location and format
- The output file location, name and format
- A comment as to what was done between input and output.

Then:

- Give the sheet a name, eg ‘data log’
- Before you do anything else, save this Excel file (assign a location and name that you will remember), but leave it open.



Hint: in order to get the correct path and filename of any file in a Windows environment, locate the file in Windows Explorer, and:

Alternative 1: Click in the address bar showing the path at the top of the Windows Explorer window. The display will toggle between read-friendly display, and the full path display. Copy and paste the full path display, or…

Alternative 2: Click on the file to select it. Then right-click the mouse, and select ‘Properties’. The exact path and filename will show in the ‘Location’ field of the properties window, and the filename will show in the first dialogue box. Both path and filename can be copied and pasted into your data log.

6. Note: Especially if you are in the habit of working in different computer environments, it is not recommended that you use blanks in file or folder names. Different operating systems treat embedded blanks differently. Instead, use hyphens, underscores, or CamelCase to separate words to make names more readable. Ie, not ‘variable list.xls’ but ‘variable_list.xls’ or ‘VariableList.xls’.

7. It is good practice to assume that you may not always be using SPSS, or the same version of SPSS

for your analyses. You may need to migrate data from/to different computing environments

(Windows, Mac, Unix) and/or different statistical software (SAS, Stata, R, etc). Therefore you

also need to be aware of limitations in file names, variable names, and other metadata such as

variable labels, value labels, and missing values codes when moving data among different

operating systems and software packages, some of which are listed in Appendix B.

Running SPSS

8. When you run SPSS in Windows, two windows are opened automatically:

– a Data editor window - empty until you open a data file or begin to enter variable values, after

which it will have two views, a Variable View and a Data View,

– an Output window, to which your commands and output will be written.

Additional windows which can be opened from File > New or File > Open are:

o a Syntax window, in which you can ‘paste’ syntax from the drop-down menu choices, enter

syntax directly, edit and run syntax,

o a Script window, in which you can enter, and edit, Python scripts.

Three additional windows, in addition to dialogue windows etc., may or may not open

depending on the procedures you are running: (a) a Pivot table editor window, (b) a Chart editor

window, and (c) a Text output editor window.

9. Open SPSS through your programs menu: Start > IBM SPSS Statistics 19. If a dialog box appears

asking you whether you wish to open an existing data source, click ‘Cancel’. Before starting to

use the data, use Edit > Options to make some recommended changes to SPSS environment

defaults.


10. Then select Edit > Options. The Options box has several tabs. Select the General tab and make

sure that, under ‘Variable Lists’, ‘Display names’ and ‘File’ are selected. This will make sure that

the variables in the dataset are displayed by variable name rather than by variable label and that

variables are listed in the same order as they occur in the dataset – knowing this order is

essential when referring to ranges of variables.

11. It is also useful to see the variable names and values in any output. By default SPSS only shows

the labels, not the names or codes. Click on the ‘Output Labels’ tab, and under both ‘Outline

Labeling’ and ‘Pivot Table Labeling’, select the options to show:

o Variables in item labels shown as: ‘Names and Labels’,

o Variable values in item labels shown as: ‘Values and Labels’.


12. Finally, select the ‘Viewer’ tab and check that the ‘Display commands in the log’ checkbox

(bottom left of the screen) is checked. This will ensure that your output file contains not only the

results of any commands (syntax), but also the syntax itself. This is useful for checking for errors,

as well as serving as a reminder of the details of recodes and other variable transformations, etc. Click

‘OK’.
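The output settings in paragraphs 11 and 12 can probably also be set from a syntax file, which has the advantage of leaving a record in your output. The following is a minimal sketch, not a definitive equivalent of the dialogue boxes: the SET keywords shown are believed to correspond to the options described above, but you should confirm them in the Command Syntax Reference for your version of SPSS.

```spss
* Pivot tables: show variable names and labels, values and value labels.
SET TVARS=BOTH TNUMBERS=BOTH.
* Outline pane: show variable names and labels, values and value labels.
SET OVARS=BOTH ONUMBERS=BOTH.
* Echo each command in the output log (assumed to match the Viewer checkbox).
SET PRINTBACK=ON.
```

Putting these lines at the top of a syntax file means the same settings are applied whenever that file is run, including on a different computer.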

Creating an SPSS system file

13. In common with most statistical packages, SPSS needs a variety of information in order to read a

dataset: (a) the data, and (b) instructions as to how to read the data. In its simplest form, SPSS

reads a data file (eg ‘scotmp_raw.txt’), a syntax file (eg ‘scotmp_raw.sps’), and using the input

instructions in the syntax (.sps) file, converts that data and metadata into its preferred format,

the system file, which only exists during your current SPSS session unless you save it.

Note: SPSS can read (and write) a variety of formats. See Appendix D for a detailed list of

software dependent formats and the commands to read and write them. SPSS can also read

more complex file formats, such as multiple records per case, mixed files, and hierarchical files.

14. First let’s look at one type of external, raw data file. Run Notepad (Start > All programs >

Accessories > Notepad) and open the file ‘scotmp_raw.txt’ (from the zip file accompanying this

document – you will need to extract the files to a location to which you can browse from SPSS,

see paragraph 15).

[Figure: SPSS reads the syntax file (Scotmp_raw.sps) and the data file (Scotmp_raw.txt) and produces an SPSS system file.]


This is the simplest type of ‘fixed format’ file, with one unit of observation (case) in each row,

and the same variable in the same column or adjacent columns of each case; in this instance, the

case is a seat or constituency in the 2010 election, and the variables describe characteristics of

the seat. Using the cursor to move around in the file, determine:

How many cases (rows) are there in this dataset?

What is the highest column number that contains data?

15. In SPSS, use File > Open > Syntax, and browse to locate and open the syntax file

‘scotmp_raw.sps’. Notice that it contains 4 main sections:

o A data statement, which tells SPSS the type of file to read, where to locate it, as

well as a list of the variables to read, variable names to assign to them, column locations

of each variable, and whether the variable is numeric or alphabetic (string),

o A variable labels section, in which descriptive labels are assigned to each variable,

o A value labels section, in which descriptive labels are assigned to values of the variables

that require description,

o A missing values section, which assigns certain variable values as missing, which affects

how they are used in statistical analyses, data transformations, and case selection.
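The four sections described above can be sketched as follows. This is an illustration only and is not the content of ‘scotmp_raw.sps’: the variable names (caseid, name, party, birthyr), the column locations, the labels and the codes are all hypothetical, and the file specification is left as a placeholder.

```spss
* 1. Data statement: file type, location, variables and column locations.
* All names and columns here are hypothetical.
DATA LIST FILE='[path and filename]' FIXED RECORDS=1
  /1 caseid 1-3 name 5-24 (A) party 26 birthyr 28-31.
* 2. Variable labels.
VARIABLE LABELS
  caseid 'Case identifier'
  name 'Name of person'
  party 'Party code'
  birthyr 'Year of birth'.
* 3. Value labels, only for variables whose codes need describing.
VALUE LABELS party 1 'Party A' 2 'Party B'.
* 4. Missing values.
MISSING VALUES birthyr (9999).
EXECUTE.
```

In this sketch ‘(A)’ marks name as a string variable; variables without a format are read as numeric.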

16. Next, use Windows Explorer to locate the file ‘scotmp_raw.txt’, determine what the path to the

filename is on your computer (see text box in item 5), and substitute that information for the

file=’[path and filename]’ in the syntax file.

17. Click and drag to select the syntax file contents, down to and including the ‘execute.’ at the end

of the file, and click on the large green arrowhead (the ‘Run’ icon) in the SPSS tool bar to run it.

Output Viewer

18. In SPSS 19, the Output Viewer is labelled ‘Output1 [Document1] IBM SPSS Statistics

Viewer’. It is the window in which output from your procedures is displayed, as well as the

syntax that generated it (as a result of the options chosen in item 12 above). This output can be

saved. By default the output file is saved as an SPSS-dependent format with default filename

‘Output[n]’, and requires SPSS to be read; you should therefore use File > Export to save it in MS

Word, .txt, .html or .pdf format, with a meaningful filename. For data management this can be

an important tool for keeping track of what you have done to your data and to keep results to

which you want to refer back.
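Saving and exporting the Viewer contents can also be done with syntax, which again leaves a record of where the output went. A minimal sketch, assuming the OUTPUT SAVE and OUTPUT EXPORT commands are available in your version; the path and filenames are placeholders, not files supplied with this exercise.

```spss
* Save the Viewer contents in native SPSS format (hypothetical filename).
OUTPUT SAVE OUTFILE='[path]\[date]scotmp_ingest.spv'.
* Export the same contents to PDF so they can be read without SPSS.
OUTPUT EXPORT
  /CONTENTS EXPORT=VISIBLE
  /PDF DOCUMENTFILE='[path]\[date]scotmp_ingest.pdf'.
```

Record both filenames in your data log as you would for any other output file.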


19. Checking: (1) check the Output viewer window for Error messages, as well as the bottom of the

syntax window; (2) click on the Data Editor window, and check both the variable view, and the

data view, for anything that looks not quite right. If there are errors, try to figure out what they

are: normally, fix the first error first, and then rerun the job – usually, errors have a cascading

effect, and fixing the first can eliminate later errors. No visible problems? – you have a successful

file ingest, congratulations!

20. Save the SPSS system file. Open the Data editor window, select File > Save as and save the file

with format SPSS Statistics (*.sav). Now record this file in your Data log file.

21. The Output window now contains a table of what variables were read, in what column locations,

as well as names, labels and values assigned to them, and the formats automatically assigned by

SPSS. Save this output file, and record it in your data log file.

Syntax window

22. You can carry out all your data analysis and manipulation (including creating new variables) in

SPSS using drop-down menus. However, for data management purposes, the risk of doing

everything that way is that there may be no record of what you have done. You may need to

come back to your data and reproduce an analysis at a later date. Good data management will

help you and colleagues to trace back if you need to check any results.

23. Saving output files with commands, as described above, can be one way to do this. With SPSS

you can also analyse and manipulate your data using SPSS command language, rather than the

drop-down menus, which is edited and saved in a ‘syntax file’. For some procedures, this is

actually easier and more customisable than using the drop-down menus.

In the exercises that follow you will mainly be using syntax to explore the dataset and to create

new variables.

Three very important rules to remember about SPSS syntax:

- All commands must end with a full stop.

- Commands are not case sensitive. This document gives commands in UPPER CASE to distinguish

the required portions of the command from the variable portions (eg variable names).

- You can add comments before and after commands, but not in the middle of commands.
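The three rules above can be illustrated with a short sketch. It assumes that the scotmp data are open and uses the variable win10 from that dataset; the three FREQUENCIES commands are intended to be equivalent.

```spss
* A comment goes before or after a command and also ends with a full stop.
FREQUENCIES VARIABLES=win10.
* The same command in lower case is treated identically.
frequencies variables=win10.
* A command may run over several lines, with the full stop only at the end.
FREQUENCIES
  VARIABLES=win10.
```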

Displaying and saving dataset information

24. You can list the variables in a dataset, in their natural order, by clicking on ‘Variable View’ at the

bottom left hand corner of the SPSS data editor window and scrolling up and down the list. It is

also advisable to produce a variable list in your output file that can be copied into Excel.

Select File > Display Data File Information > Working File.


25. In the Output Viewer log you will see that this procedure has produced two tables, one with the

list of variables and the other with the value labels. You will also see the command DISPLAY

DICTIONARY in the output window. You could have produced the same tables by typing that

command into a Syntax file and running it.
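As a sketch of the syntax route, the first command below is the one named above; the other two are variations that are believed to be valid but should be checked against the Command Syntax Reference. The variables yob and win10 are taken from the scotmp dataset.

```spss
* Full dictionary for all variables in the working file.
DISPLAY DICTIONARY.
* Dictionary information for selected variables only.
DISPLAY DICTIONARY /VARIABLES=yob win10.
* Variable names and labels only.
DISPLAY LABELS.
```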


26. Double click on the ‘Variable Information’ table in the Output window to open the table in a

separate window, from which it can be copied (Edit>Select>Table, then Ctrl+C or Edit > Copy)

and pasted (Ctrl+V) onto sheet 2 of the Data log file. Rename sheet 2 with the name of the

source file and what it contains (eg scotmp variable list). You can then do the same with the

table of value labels, copying it onto a third worksheet in the Data log file. This log file can

function as a handy tool for documenting variable and value transformations as well as file

handling information.

[Screenshots: Variable labels; Value labels]

Descriptive statistics

27. You can use SPSS to look at the number of cases in each category of the variables (using the

frequencies command), check for undocumented codes, missing values, etc. It is important to

run these basic procedures to familiarise yourself with any new dataset and to check that there

are no problems.

28. Nominal and ordinal (aka categorical) variables: Frequencies can be run through the drop-down

menus by clicking on Analyze > Descriptive Statistics > Frequencies, selecting the variables and

then clicking OK. Do this now (for this example we have chosen bithplac, constbir and win10 –

all 3 are nominal variables, according to the Measure column in the Data Editor variable view)

and then look at the Output window.


29. In the output, in addition to the frequencies there are now two additional pieces of information -

the SPSS syntax that produced the output, and the name of the dataset. You can also see from

the first table in the output that all 3 variables have data for all 59 constituencies (all 3 have 0s in

the ‘Missing’ row). We will look at the concept of ‘missing values’ later.
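The syntax echoed in the output should look roughly like the sketch below. The variable names come from the example above; the /ORDER subcommand is what the dialogue typically adds and is an assumption here, since the exact pasted text may differ between versions.

```spss
FREQUENCIES VARIABLES=bithplac constbir win10
  /ORDER=ANALYSIS.
```

Typing or pasting this into a Syntax window and running it should reproduce the same three frequency tables.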

30. Continuous variables: To generate descriptive statistics for those variables labelled as ‘scale’ in

the Data Editor variable view, the type of information provided by frequencies is usually not very

informative. We need a different command. Keep the Data Editor variable view window open in

the background (a handy reference to the variable names labelled ‘scale’), and select Analyze >

Descriptive Statistics > Descriptives.

31. Select the scale variables you want to look at in the left window, move them to the right

window, and click on Options.


32. Make sure that Mean, Std deviation, Range, Minimum and Maximum are selected, click

‘Continue’, and ‘OK’ on the previous dialogue screen.

33. If, in the previous example, you click on the ‘Paste’ button, instead of ‘OK’, SPSS will write the

current syntax, with all selected options, into your current Syntax file, instead of running it. Here

you need to edit the syntax if you want to make changes, highlight it, then click on the Run icon.

34. Your Output window should now list the scale variables selected, showing their count (‘N’),

minimum, maximum, range, mean and standard deviation (spread around the mean), as well as

the SPSS commands that generated the output.

Had you neglected to copy the syntax into the Syntax window using the paste button earlier, you

could now mouse-over the SPSS commands in the output box (they will be outlined by a yellow

box), and copy (click R-mouse button, select Copy) and paste (select Syntax window, click R-

mouse button, select Paste) them from the Output viewer into the Syntax window.

35. You can also edit the commands in the Syntax window to show the first and last variable names

in the data file to produce one list of descriptive statistics for all the variables in the file:

DESCRIPTIVES VARIABLES=seat to tuscsex1 /STATISTICS=MEAN STDDEV RANGE MIN MAX.

The resulting output shows that SPSS is perfectly capable of weeding out the string variables for

which it cannot compute the requested measures:


36. You can build up a set of commands in your syntax file quite quickly, which is useful for initial

exploration of the data. You should also add your own notes to the syntax file – anything that

you type with an asterisk (*) in front of it will be treated as a comment by SPSS (don’t forget the

full stop at the end of every command or comment). It is good practice to use comments to give

each set of commands a header explaining what the syntax is doing, and if you are working as

part of a team, who wrote the syntax and the date it was written. If you highlight and run

comments, with the additional syntax to which it refers, the comments will also be echoed in

your Output Window.
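A comment header of the kind described above might look like the following sketch. The layout and the bracketed placeholders are only one possible convention, not a format prescribed by SPSS; each comment line starts with an asterisk and ends with a full stop.

```spss
* ------------------------------------------------------------.
* Purpose: frequencies of the winning party for initial checks.
* Author: [your name].
* Date written: [YYYYMMDD].
* ------------------------------------------------------------.
FREQUENCIES VARIABLES=win10.
```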

37. The output and syntax files should be saved for future reference and you can use your data log

to record that they have been created. Output files in native SPSS format have an extension .spv

(.spo in versions prior to SPSS 18); syntax files are flat ASCII text files, with the extension .sps. As

your Data log file grows, you may find it easier to add new information at the top of the table (ie

reverse chronological order) rather than the bottom.

38. You may find it useful to set up separate sub-folders for your syntax and output files. During

your research project you will inevitably build up a number of files. Alternatively, collect all

syntax files, output files, and revised data files (where applicable) in one subdirectory, to

distinguish them from other analyses of other data files. If you make a habit of adding the date

in YYYYMMDD format at the beginning of each filename, the file list will display in chronological

order by default.

39. During the course of your research you will often have to create your own variables, ie derived

variables. Using the drop-down menus for this purpose is not recommended, because an

essential part of good data management is keeping a detailed record of how new variables have

been created and syntax files or output files with embedded syntax are the best way of doing

this.


However, if in your own work you find that you still prefer to use menus, you should make sure

to always either paste what you have done into a syntax file or save the output file with the

commands in it.

Defining missing values

40. The variable ‘yob’ contains each winning candidate’s year of birth. If you have not already run

Frequencies on this variable, do so now. You will note that the earliest year of birth in the data

set is ‘1941’, and the latest is ‘9999’, with a frequency of 4 cases. We assume that in these 4

constituencies the year of birth of the winning candidate was unknown.

41. A more useful variable than year of birth, for analysis and interpretation of the data, is age at the

time of the 2010 election. This can easily be computed using SPSS’s compute command.

However, when creating the new variable we do not want to include the cases with ‘9999’ in the

calculations. This is an example of where the ‘missing values’ function in SPSS is useful.

42. By setting ‘9999’ to be a missing value the cases with that value will be ignored in any

subsequent analyses. Use the commands below to define ‘9999’ as a missing value and to give it

a value label that indicates why it is not a valid value. Type the following commands into your

syntax editor window, and run them.

MISSING VALUES yob (9999).

ADD VALUE LABELS yob 9999 "year not known".

FREQUENCIES yob.

Your output file should now show that there are 4 cases with the label ‘year not known’, flagged as

missing in the SPSS system file.

Creating derived variables (compute)

43. Now compute a new variable consisting of candidate’s age at the time of the election in 2010:

* calculating MP's age in 2010.

COMPUTE mpage=2010-yob.

EXECUTE.

VARIABLE LABELS mpage "MP age".

FREQUENCIES mpage.

The new frequencies table of mpage shows that there are 4 cases flagged as Missing by the

system (aka ‘System missing’). These should be the same 4 cases for which we defined ‘9999’ as

a missing value in ‘yob’. But with any compute operation, you should always check your results.


44. One way to check is using the SELECT IF command. Be very careful when using SELECT IF in a

syntax file; if you do not use the command ‘TEMP’ or ‘TEMPORARY’ before the SELECT IF, the

cases that were not selected will be dropped from the active dataset.

* checking value of YOB assigned missing against MPAGE system missing cases.

TEMP.

SELECT IF sysmis (mpage).

FREQUENCIES yob.

The output confirms that indeed the 4 system missing values are the original cases that we

flagged as missing in ‘yob’.

45. It is good practice to give any system missing cases a value. It is sometimes useful to use a

negative number for missing values so they don’t get confused with genuine values. Many large

Government datasets, for example the Scottish Health Survey, use this convention.

* adding a missing value to mpage.

IF sysmis (mpage) mpage=-9.

ADD VALUE LABELS mpage -9 'year of birth not known'.

MISSING VALUES mpage (-9).

EXECUTE.

FREQUENCIES mpage.

46. When creating a new variable it can be useful to have something to check the values against.

Since there are only 59 cases in the dataset it is possible to list the value of the new variable

along with the year of birth and the candidate’s name. You can use this to check a few by doing

the calculation manually. You could also check the computed values against external sources.

* Listing the variable mpage.

LIST winner10 win10 mpage yob.

Creating derived variables (recode)

47. It may be more useful to have a variable showing the age group of the MP rather than the

individual years of age. You can use the recode command to group the age variable that you


have just created. ALWAYS recode into a new variable, otherwise you will overwrite the

existing one and lose its original values.

48. Missing values are treated differently in RECODE. In the COMPUTE command, the value that we

had set to MISSING was ignored in the calculations. In RECODE we define what we want to

happen to it. We want the new variable to also have a value of -9 and for that to be a missing

value in the new variable.

*creating age group variable mpagegrp.

RECODE mpage (20 thru 29=1) (30 thru 39=2) (40 thru 49=3) (50 thru 59=4) (60 thru 69=5) (-9=-9)
  INTO mpagegrp.

EXECUTE.

VARIABLE LABELS mpagegrp "Age group of MP".

VALUE LABELS mpagegrp
  1 "20 to 29"
  2 "30 to 39"
  3 "40 to 49"
  4 "50 to 59"
  5 "60 to 69"
  -9 "age not known".

MISSING VALUES mpagegrp (-9).

FREQUENCIES mpagegrp.

49. To check the recode, one option, with this few values for the scale variable, is to run a crosstab

of the two variables. Using the drop-down menus, Analyze > Descriptive Statistics > Crosstabs,

the output looks like this. In the Case Processing Summary you can see that 4 cases are missing;

they are not included in the output crosstab.


50. Using syntax, however, you can specify that you want the missing values included in the

crosstabs. You can set up a crosstabulation syntax using the drop-down menus, click PASTE to

put the syntax into your Syntax file, and add the line ‘/MISSING=INCLUDE’, so that the syntax

looks like the following:

* checking that recode has worked.

DATASET ACTIVATE DataSet1.

CROSSTABS
  /TABLES=mpage BY mpagegrp
  /FORMAT=AVALUE TABLES
  /MISSING=INCLUDE
  /CELLS=COUNT
  /COUNT ROUND CELL.

51. The output now looks like this and includes the cases with missing values (‘-9’).
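With more cases or a wider age range, a crosstab of individual ages becomes unwieldy. A more compact check, offered here as a sketch rather than as part of the original exercise, is to list the minimum and maximum of mpage within each category of mpagegrp; it assumes both variables exist as created in paragraphs 43 to 48.

```spss
* Each age group should show a minimum and maximum inside its own range.
MEANS TABLES=mpage BY mpagegrp
  /CELLS=COUNT MIN MAX.
```

Cases with the value -9 are defined as missing on both variables, so they should be excluded from this table by default.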

52. Now that the data file has been changed, it should be saved with a new file name and recorded

in the Data log.

* Save file as a new version.

SAVE OUTFILE="[path]\[date]scotmp_v2.sav" / MAP.

53. The above example shows one way of naming files to keep track of successive versions. Another

method is to begin each filename with the YYYYMMDD of the date on which it was created, eg:

20140519scotmp.sav

20140521scotmp.sav

Filenames that begin with a standardized date format (as above) will be sorted in chronological

order when listed.

54. It is important to keep a list of derived variables and the syntax files that created them in your

Data log, eg in a new sheet labelled ‘Derived variables’. As you work through your analyses you

can also add notes to remind you if you have to correct any of the derived variables.


ADDING DATA TO A DATA FILE – adding variables

55. If you have two (or more) datasets that have an identifying variable in common between them it is possible to use MATCH FILES to add the variables from one data file to the other. MATCH FILES can match up to 50 files in one operation, as long as:

- All files have the same unique identifying variable (key variable) in common (numeric or string)

- All files are sorted in ascending order of the key variable

- All files have no duplicate cases, or the same number of duplicate cases

- All variables to be merged have unique variable names. Variables with duplicate variable names will either be dropped from the merged file, or must be renamed.

As a first step, open the two additional files election.sav and census.sav, generate lists of the variables in them, and copy the variable lists to your Data log file (see 24-26 above). These lists will allow you to easily see what variables the 3 data files have in common. In this case all 3 files have the variable ‘seat’ in common. Also, all 3 files (scotmp_v2.sav, election.sav and census.sav) have the seat as the unit of observation, and in all three datasets each row represents one seat, so this is a simple one-to-one merge.

56. In SPSS it is easy to determine whether a variable has a unique value for each case. The

best way to do this is through the drop-down menus. Open the dataset ‘election.sav’. It contains variables that describe the election results in each constituency.

To check whether ‘seat’ is a unique identifier in the datafile, select Data > Identify Duplicate Cases from the drop-down menus.


The syntax for this procedure is somewhat complex, and therefore it is easier to perform

this procedure using the drop-down menus:

DATASET ACTIVATE DataSet2.

* Identifying Duplicate Cases.

SORT CASES BY seat(A).

MATCH FILES
  /FILE=*
  /BY seat
  /FIRST=PrimaryFirst
  /LAST=PrimaryLast.

DO IF (PrimaryFirst).
COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE MatchSequence=MatchSequence+1.
END IF.

LEAVE MatchSequence.

FORMATS MatchSequence (f7).

COMPUTE InDupGrp=MatchSequence>0.

SORT CASES InDupGrp(D).

MATCH FILES
  /FILE=*
  /DROP=PrimaryFirst InDupGrp MatchSequence.

VARIABLE LABELS PrimaryLast 'Indicator of each last matching case as Primary'.

VALUE LABELS PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.

VARIABLE LEVEL PrimaryLast (ORDINAL).

FREQUENCIES VARIABLES=PrimaryLast.

EXECUTE.

This syntax creates a variable ‘PrimaryLast’ with a value of 1 for the last case found with each value

of seat (the ‘primary’ case) and a value of 0 for any earlier case that shares that value of seat (a duplicate). In this

dataset each value of seat is unique, so every case has the value 1.
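If you only need to know whether the key is unique, a shorter check is possible. The following is a sketch of an alternative technique, not the syntax that the Identify Duplicate Cases dialogue generates: it counts the cases that share each value of seat. The variable name n_seat is hypothetical, and MODE=ADDVARIABLES is assumed to be available in your version.

```spss
* Count how many cases share each value of the key variable.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=seat
  /n_seat=N.
* If seat is a unique identifier, every case has n_seat equal to 1.
FREQUENCIES VARIABLES=n_seat.
* Remove the helper variable again before saving the file.
DELETE VARIABLES n_seat.
```

Any value of n_seat greater than 1 points to a duplicated key that needs investigating before the files are merged.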

57. Before closing the ‘election.sav’ dataset we will delete the new PrimaryLast variable, sort the dataset by ‘seat’ and save it. But first we need to make sure exactly which dataset we are performing these operations on.

* Sorting election.sav on key variable, and dropping 'PrimaryLast'.

DATASET CLOSE All.

GET FILE="M:\mantra\spss_practicum\scotmp_analysis2\election.sav" / DROP=PrimaryLast.

SORT CASES BY seat(A).

SAVE OUTFILE="M:\mantra\spss_practicum\scotmp_analysis2\election_sorted.sav" / MAP.

We can then do the same with the datasets scotmp.sav and census.sav:

- Check for duplicates in the variable ‘seat’
- Delete the PrimaryLast variable
- Sort by ‘seat’
- Save the sorted file into a new file.

58. In contrast to the syntax for checking for duplicates, the syntax for MATCH FILES is actually quite simple:

*Merging scotmp.sav, election.sav, and census.sav on 'seat'.

DATASET CLOSE All.

MATCH FILES
  /FILE="M:\mantra\spss_practicum\scotmp_analysis2\20140526_scotmp_sorted.sav"
  /FILE="M:\mantra\spss_practicum\scotmp_analysis2\election_sorted.sav"
  /FILE="M:\mantra\spss_practicum\scotmp_analysis2\census_sorted.sav"
  /BY seat.

EXECUTE.

SAVE OUTFILE="M:\mantra\spss_practicum\scotmp_analysis2\20140526_scotmp_merged.sav"
  /KEEP=all
  /MAP.

Alternatively, use the drop-down menus: Data > Merge Files > Add Variables.
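After a merge it is worth confirming that every case was found in every input file. One way to do this, sketched here under assumptions, is to add an /IN flag after each /FILE subcommand. The flag names (in_scotmp, in_election, in_census) are hypothetical, and the paths are placeholders for the sorted files created above.

```spss
* Merge on seat and record which input files contributed to each case.
MATCH FILES
  /FILE='[path]\scotmp_sorted.sav' /IN=in_scotmp
  /FILE='[path]\election_sorted.sav' /IN=in_election
  /FILE='[path]\census_sorted.sav' /IN=in_census
  /BY seat.
* In a clean one-to-one merge, every flag equals 1 for every case.
FREQUENCIES VARIABLES=in_scotmp in_election in_census.
```

A flag value of 0 identifies a seat that is missing from that input file; the flags can be dropped before the merged file is saved.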

ADDING DATA TO A DATA FILE – adding cases

59. Another common file transformation activity is to add cases to an existing data file. In this instance, the ‘scotmp_merged.sav’ dataset contains only Scottish constituencies. In order to compare them with English/Welsh constituencies, we need to add cases describing those constituencies as well.

60. Open the ‘scotmp_merged.sav’ dataset. From the drop-down menus, select Data > Merge files > Add cases.

61. You will be prompted for the name of an SPSS system file from which to add cases, either one that is already open or one that is saved on your system. The file must be saved as an SPSS system file before running this procedure.


62. The next dialogue screen lists, in the right window, all variables held in common between the

two files being merged, and in the left window, those that occur in only one or the other file,

with an indication as to which file each variable is from. These unmatched variables can be

included in the output data set (move them into the right window), but will have system missing

values assigned for those cases in which the variables do not occur. Notice that ‘seat’ is spelled

‘Seat’ in the britmp.sav file.

Variables must match on several aspects in order to be ‘paired’ between the two files: (a) the variable names must match, and the match is case sensitive; (b) the variable must be of the same type, string or numeric, in both files (string variables are flagged by ‘>’); (c) string variables must have the same length. Notice how the ‘seat’ variable is defined in the two files.

As you can see from the pasted syntax below, SPSS’s default is to first rename all unmatched variables, and then drop them. When SPSS does something unexpected, it is a good idea to paste the syntax into your Syntax file and check what is happening, in case you missed or misinterpreted some defaults.

*Adding British constituencies.

DATASET ACTIVATE DataSet3.

ADD FILES /FILE=*

/RENAME (aws10 bithplac bnpppc10 bnpsex10 change05 conppc10 consex10 constbir greenppc greensex greenvt1 ks24mi_1 ks24mi_2 ks24mi_3 ks24migr ks24tpop labppc10 labsex10 ldppc10 ldsex10 mpage mpagegrp mpsex10 other10 seat snpppc10 snpsex10 totalvot tusc10 tuscppc1 tuscsex1 tuscvt10 ukipppc1 ukipsex1 unemploy urbanrur uv010000 uv010001 uv0100_1 uv0100_2 uv0100_3 uv0100_4 uv0100_5 uv0100_6 uv0100_7 uv0100_8 uv0100_9 uv010_10 uv010_11 uv010_12 uv016000 uv016001 uv0160_1 uv0160_2 uv0160_3 uv0160_4 uv0160_5 uv0160_6 uv0160_7 uv0160_8 uv0160_9 uv068000 uv068001 uv0680_1 uv0680_2 uv0680_3 uv0680_4 uv0680_5 uv0680_6 uv0680_7 win05 win10 winner10 yob=d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40 d41 d42 d43 d44 d45 d46 d47 d48 d49 d50 d51 d52 d53 d54 d55 d56 d57 d58 d59 d60 d61 d62 d63 d64 d65 d66 d67 d68 d69 d70 d71 d72 d73)

/FILE='DataSet2'

/RENAME (bnp0510 con0510 Con_LDswing Greenvt10 lab0510 Lab_Conswing Lab_LDswing ld0510 Maj05 Maj10 pc05 PC10 PCvt10 Seat Seatch10 Turn0510 TypeChange05_10 ukip0510 Win05 Win10=d74 d75 d76 d77 d78 d79 d80 d81 d82 d83 d84 d85 d86 d87 d88 d89 d90 d91 d92 d93)

/DROP=d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40 d41 d42 d43 d44 d45 d46 d47 d48 d49 d50 d51 d52 d53 d54 d55 d56 d57 d58 d59 d60 d61 d62 d63 d64 d65 d66 d67 d68 d69 d70 d71 d72 d73 d74 d75 d76 d77 d78 d79 d80 d81 d82 d83 d84 d85 d86 d87 d88 d89 d90 d91 d92 d93.

EXECUTE.

You can directly edit the ‘britmp.sav’ file, in the Variable View of the Data Editor, to eliminate eg mixed-case variable names, and edit the length of the ‘seat’ string variable so it is the same in both files, and then retry Add cases.
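The same harmonisation can also be done in syntax, which leaves a record of the change in your Syntax file. The following is a minimal sketch rather than the course's own method: the [path] placeholder stands for wherever ‘britmp.sav’ is saved, and the width A40 is an assumption that should be replaced by the width reported for ‘seat’ in ‘scotmp_merged.sav’.

```spss
* Sketch: harmonise the Seat variable in britmp.sav by syntax.
GET FILE="[path]\britmp.sav".
* Change the capitalisation so the name matches seat in scotmp_merged.sav.
RENAME VARIABLES (Seat=seat).
* A40 is a placeholder width, not taken from the data.
ALTER TYPE seat (A40).
* Confirm the new name and width.
DISPLAY DICTIONARY /VARIABLES=seat.
```

Choose a width at least as long as the longest existing value, since ALTER TYPE truncates string values that do not fit the new width.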

You may also want to add a region variable, to flag the difference between the Scottish constituencies and those elsewhere in the UK. You can easily enter a small number of values of a new variable manually in the Data Editor. Start by selecting Edit > Insert variable from the drop-down menu.

Try Add cases again, until you are happy with the result you are getting in the merged output

file. Assign it a name and save it, and log the new file in your data log file. You should also save

the revised versions of the ‘britmp.sav’ file and the ‘scotmp_merged.sav’ file.
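Once the variables pair correctly, Add cases can also be run directly from syntax, with an IN flag standing in for a manually entered region variable. This is a sketch under stated assumptions: ‘scotmp_merged.sav’ is the active dataset, [path] stands for the location of ‘britmp.sav’, and the flag name scottish is hypothetical.

```spss
* Sketch: add the britmp.sav cases and flag the source of each case.
* Assumes scotmp_merged.sav is the active dataset.
ADD FILES
  /FILE=* /IN=scottish
  /FILE="[path]\britmp.sav".
VALUE LABELS scottish 1 'Case from scotmp_merged.sav' 0 'Case added from britmp.sav'.
FREQUENCIES VARIABLES=scottish.
```

The frequency table gives a quick check that the number of cases from each file is what you expect before you save and log the merged file.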

Writing a raw data file

63. As important as reading in a data set is writing it out in another format, either for use in a

different statistical package, or for long-term preservation. Software such as Stat/Transfer can

be used to transfer data among software packages or to write raw data in a generic format.

Alternatively, use the appropriate SPSS command to write the outfile in an appropriate format

(see Appendix D for a list of SPSS outfile commands and the formats SPSS can write). Note that

SPSS no longer writes out an SPSS syntax file (although Stat/Transfer does).
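As an illustration of writing the active dataset out for another package, SAVE TRANSLATE can produce delimited text or another package's native format. The sketch below uses the same [path] and [filename] placeholders as the syntax in paragraph 66; the choice of CSV and Stata as targets is an assumption for the example.

```spss
* Sketch: export the active dataset to two common exchange formats.
SAVE TRANSLATE OUTFILE="[path]\[filename].csv"
  /TYPE=CSV /FIELDNAMES /CELLS=VALUES /REPLACE.
SAVE TRANSLATE OUTFILE="[path]\[filename].dta"
  /TYPE=STATA /VERSION=8 /EDITION=SE /MAP /REPLACE.
```

FIELDNAMES writes the variable names as the first row of the CSV file, and CELLS=VALUES writes data values rather than value labels, so the variable and value lists still need to be saved separately.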

64. First check the variable types and variable names. The most generic, non-controversial and flexible data format is fixed-field format ASCII. It can handle many file structures, including varieties of hierarchical files, and is not sensitive to commas and/or blanks embedded in variables.

65. Next, generate the variable information from File > Display Data File Information > Working File

for the file you need to save, or run DISPLAY DICTIONARY from the Syntax window. Check

carefully to make sure that eg Print Format and Write Format are the same for each variable.

Save the variable and value lists.

66. The syntax to write a fixed field format file in SPSS is:


DISPLAY DICTIONARY.

WRITE OUTFILE="[path]\[filename].txt" TABLE / ALL.

EXECUTE.

If there is a chance of the output records exceeding 8,192 characters in length, use the following structure, with a rough, over-generous estimate of the output record length:

FILE HANDLE [nickname] / NAME="[path]\[filename].txt" / LRECL=10000.

WRITE OUTFILE=[nickname] TABLE / ALL.

EXECUTE.

67. When writing a raw data file, make sure to SAVE the output table of where the variables are

written to, as well as the file information generated above – this is your only record of what

variable is in what column(s).
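One way to see why that table matters is to read the raw file back in with DATA LIST, which needs exactly the column positions the table records. The sketch below is illustrative only: the two variables and their column positions are placeholders, not the real layout of the file, and must be replaced with the positions from your own output table.

```spss
* Sketch: read the raw file back in, using column positions from the WRITE table.
* The variables and columns below are placeholders, not the real layout.
DATA LIST FILE="[path]\[filename].txt" FIXED
  /seat 1-40 (A) mpage 41-43.
* Compare the result with the same command run on the original system file.
DESCRIPTIVES VARIABLES=mpage.
```

If the statistics match those from the original system file, the column positions have been recorded correctly.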

68. Open the data file in a format-neutral editor, such as Notepad, which will show you the number of cases as well as the actual length of the data records (position the cursor at the end of a line to see the column number shown in the status bar along the bottom of the screen). Check this column number against the output table generated by SPSS to make sure that the final column numbers match.

69. Save your raw data file (file extensions such as .txt or .dat), the output file (in a software neutral

format), and the file in which you have stored the variable and value lists. And finally, of course,

update and save your Data log!

Here concludeth today’s lesson………………………….


Appendix A: Online resources

IBM SPSS Statistics 19 manuals

http://www.unt.edu/rss/class/Jon/SPSS_SC/Manuals/SPSS_Manuals.htm

IBM SPSS Statistics 19 Command Syntax Reference

http://public.dhe.ibm.com/software/analytics/spss/support/Stats/Docs/19.0/Server/User_Manuals/English/IBM_SPSS_Statistics_Command_Syntax_Reference.pdf

IBM SPSS online documentation (Note: earliest version available is 20.0)

http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp

Raynald’s SPSS tools

http://www.spsstools.net/

University of Edinburgh. Information Services. SPSS

http://www.ed.ac.uk/schools-departments/information-services/research-support/research-computing/statistics/supported-pkgs/spss/overview


Appendix B: Selected inter-system limitations on filenames, variable and value names etc.

FILE NAMES

Operating systems, and versions, differ as to the file name conventions they accept. VM/CMS requires filenames consisting of 8+8+1 characters. Older versions of DOS and Windows require short (max. 8 characters) filenames + a 3-character extension.

Later versions of Windows accept longer extensions (*.html rather than *.htm, or *.sas7bdat) and multiple extensions (*.tar.gz)

Mac operating systems do not require filename extensions. It is therefore very important to keep track of file format information.

Windows and Mac allow filenames with embedded blanks, whereas these need to be surrounded by quotes in Unix/Linux operating systems.

VARIABLE NAMES

In SPSS, variable names must begin with a letter or one of the characters ‘@’, ‘#’ or ‘$’; names beginning with ‘#’ or ‘$’ have special functions (scratch and system variables). Variable names should not end in a period, since this is the command terminator in SPSS.

Case sensitivity: SAS will convert variable names ‘mpage’ and ‘MPage’ to ‘MPAGE’ for purposes of analysis (ie treat all 3 versions as one and the same variable), though not for purposes of display. In Stata, however, these are treated as 3 different variables. In SPSS, references to existing variable names are not case sensitive, but a new variable name is stored with the capitalisation used when it is created.

Variable names longer than 8 characters are truncated when exported to SPSS versions pre 12.0, SPSS .por files, SAS pre-V7, and Stata versions pre-7.

Special characters such as ‘@’, ‘#’ and ‘$’ are not allowed in SAS variable names; they are replaced with underscores. In Stata, the only allowable characters are letters, numbers, and underscores.

SAS variable names must begin with a letter or an underscore ‘_’.

Tab characters embedded in string variables are preserved in tab-delimited export formats.

VARIABLES AND VALUES

When writing out files in Stata 5-6 and Intercooled 7-8 formats, only the first 2,047 variables are saved.

Variables defined in ranges (eg in SAS, ‘V1’-‘V100’) may lose their variable and value labels.

All SPSS user-defined missing values are mapped to a single system-missing value in SAS.

Variable labels longer than 40 bytes are truncated when exported to SAS v6.

Variable labels longer than 80 bytes are truncated when exported to Stata.

Value labels longer than 80 bytes are truncated when exported to Stata.


Appendix C: Common file and variable transformations and their corresponding SPSS commands

File transformations                  SPSS syntax

- Sort cases                          SORT CASES
- Sort variables                      SORT VARIABLES
- Transpose (cases and variables)     FLIP
- Merge – add cases                   ADD FILES
- Merge – add variables               MATCH FILES
- Weight cases                        WEIGHT
- Split files                         SPLIT FILE
- Aggregate data                      AGGREGATE

Variable transformations

- Compute new variables               COMPUTE
- Recode                              RECODE
- Rank cases                          RANK
- Random number generation            Transform > Random Number Generators
- Count occurrences                   COUNT
- Shift values                        SHIFT VALUES
- Time series operations              CREATE, RMV, SEASON, DATE, SPECTRA
