1 Coding Michael J. Levin Harvard Center for Population and Development Studies...


1

Coding

Michael J. Levin

Harvard Center for Population and Development Studies

[email protected]

2

Coding

• Currently, most countries scan their censuses, but frequently continue to key their surveys

• Still, certain variables need to be translated from words into numbers

• Coding is the process of converting responses into machine-readable numbers and alphanumerics

3

Coding Considerations: Investment

• When developing a coding scheme, census and survey staff must consider the returns on each investment of time, energy and funds

• Coding considerations are relatively minor for small countries or small surveys, since the amount of processing is much less than for a census

4

Coding Considerations: Software

• Some packages can easily accept and work with alphanumeric data

• However, most packages have difficulties categorizing and performing calculations (sums, percentages, medians etc.) when non-numeric data are included

5

Coding Considerations: Software

• Codes that consist entirely of alphabetic characters, or of a combination of alphabetic characters and numbers (alphanumerics), should be avoided whenever possible

• When forms are scanned, alphanumerics are not a great problem, but many computer packages require considerable manipulation in their use

• In many cases, editing programs require that alpha characters be placed between quotation marks, or be marked in some other manner, in order to process them
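As a rough illustration of why mixed alphabetic and numeric codes complicate calculations, the sketch below separates purely numeric codes from everything else before computing a median. The function name `split_numeric` and the sample entries are hypothetical and do not come from any particular editing package.

```python
from statistics import median

def split_numeric(raw_codes):
    """Separate purely numeric codes from alphabetic, mixed, or blank entries."""
    numeric, rejected = [], []
    for code in raw_codes:
        text = code.strip()
        # Accept only non-empty strings made of the ASCII digits 0-9.
        if text.isascii() and text.isdigit():
            numeric.append(int(text))
        else:
            rejected.append(code)
    return numeric, rejected

# Hypothetical keyed or scanned entries for a single item.
raw = ["1", "3", "A", "2", " ", "2B"]
numeric, rejected = split_numeric(raw)

print(median(numeric))  # median of the numeric codes only
print(rejected)         # entries that need recoding or special handling
```

With all-numeric codes the first step is unnecessary, which is the practical argument for avoiding alphanumerics.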

6

Coding Considerations: Editing

• Scanned data don’t suffer as much from additional columns of information

• For example, for codes 1 through 9, the scanner may pick up an alpha character, or a blank, or a stray mark converted to some readable character

• These issues are readily handled in the edit as described later

• But when two columns are used for an item, for example relationship, scanning will introduce errors that would not be present if a single column were used

7

Coding Considerations: Editing

• When two columns are used for an item, say codes 1 to 10, you introduce a whole new realm of errors

• Instead of legal values 1 to 9, you now have values coming in that could range anywhere from 0 to 99, as well as the aforementioned alpha characters, blanks, and stray marks

• In most cases, the subject specialists provide the edit specifications for the item, but these extra values automatically increase the time and complexity of the edit, and could decrease the quality of the final data set.
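The growth in possible errors can be counted directly. The following sketch, with a hypothetical helper `illegal_values`, lists every numeric value a field of a given width can hold that is not a legal code, using the legal ranges from the bullets above (1 to 9 in one column, 1 to 10 in two columns).

```python
def illegal_values(columns, legal):
    """List every numeric value the field can hold that is not a legal code."""
    possible = range(10 ** columns)  # 0-9 for one column, 0-99 for two
    return [value for value in possible if value not in legal]

one_column = illegal_values(1, set(range(1, 10)))   # legal codes 1 to 9
two_columns = illegal_values(2, set(range(1, 11)))  # legal codes 1 to 10

print(len(one_column), len(two_columns))  # prints "1 90"
```

Adding a single legal code by widening the field raises the number of illegal numeric values the edit must handle from 1 to 90, before alpha characters, blanks, and stray marks are even considered.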

8

Coding Considerations: Editing
I. Common Problems

• When the editors receive a value of 13 for relationship, they must start making strategic decisions about what to do with this value.

– Was it meant to be 3, and the 1 is erroneous?

– Was it meant to be 10, and the 3 is wrong?
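The ambiguity can be made concrete with a small sketch that enumerates the legal codes reachable from an invalid value by dropping one digit or changing one digit. This is only an illustration of the editors' dilemma, not an editing rule from the source; the function name and the assumption that legal codes run from 1 to 10 are hypothetical.

```python
def candidate_corrections(value, legal):
    """Return legal codes that differ from `value` by one dropped or one changed digit."""
    candidates = set()
    for i in range(len(value)):
        # Case 1: one digit is a stray mark, so drop it.
        dropped = value[:i] + value[i + 1:]
        if dropped and int(dropped) in legal:
            candidates.add(int(dropped))
        # Case 2: one digit was misread, so try every other digit in its place.
        for digit in "0123456789":
            if digit != value[i]:
                changed = value[:i] + digit + value[i + 1:]
                if int(changed) in legal:
                    candidates.add(int(changed))
    return sorted(candidates)

print(candidate_corrections("13", set(range(1, 11))))  # prints [1, 3, 10]
```

Besides the 3 and the 10 raised in the bullets, the value 1 is equally reachable if the 3 is treated as the stray mark, so the data alone cannot settle which code was intended.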

9

Coding Considerations: Editing
II. Common Problems

• Many countries could have up to 12 items of information on fertility (children in the household, children elsewhere, children dead etc.)

• The issue here is how many digits each of those items should have

– When two columns are used, the number of boys in the house could be anywhere from 0 to 99

– When only one column is used, the numbers can only range from 0 to 9

10

Coding Considerations: Editing
II. Common Problems

• Since it is extremely unlikely that a woman would have more than 9 boys in the household, using two digits introduces a high probability of picking up stray marks or scanning misreads, such as reading a 9 for a 0 and recording 91 children instead of 01.

• However, for total children in the house, total children elsewhere, total children dead, and total children, two columns might be more appropriate

• Many of these decisions depend on the fertility levels in the country

11

Coding Considerations: Good Practice

• The following set of standard codes covers the majority of relationships for most countries:

1. Head of household (or householder)

2. Spouse

3. Child

4. Adopted or step-child

5. Sibling

6. Parent

7. Grandchild

8. Other relative

9. Nonrelative

• Some countries add a “0” code for head of household and can then add a 10th category to the others.

12

Coding Considerations: Good Practice

• Many countries, particularly those experiencing the HIV/AIDS epidemic, need much more detailed information than can be provided by these codes.

• Specific information on children-in-law, parents-in-law, grandparents, nieces and nephews, and so forth becomes crucial in analyzing the HIV/AIDS situation in a country

• In this situation, additional codes are required for the statistical office to carry out its mission, and so two-digit codes are required.

13

Coding Considerations: Good Practice

• Once the decision is made to use two columns, the subject matter specialists for this item may choose to give each column its own significance. For example:

Code  Relationship          Code  Relationship
10    Head of Household     31    Parent
11    Spouse                32    Parent-in-Law
12    Sibling               33    Uncle/Aunt
13    Sibling’s Spouse      41    Grandchild
21    Child                 77    Other Relative
22    Adopted Child         88    Non-Relative
23    Step Child            90    Institutional Population
24    Niece/Nephew
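A sketch of how such codes might be used in processing follows. The code list is taken from the example above; the reading of the first digit as generation relative to the head is an assumption inferred from the groupings (the source does not state it), and the names `GENERATION` and `describe` are hypothetical.

```python
RELATIONSHIP = {
    10: "Head of Household", 11: "Spouse", 12: "Sibling", 13: "Sibling's Spouse",
    21: "Child", 22: "Adopted Child", 23: "Step Child", 24: "Niece/Nephew",
    31: "Parent", 32: "Parent-in-Law", 33: "Uncle/Aunt",
    41: "Grandchild",
    77: "Other Relative", 88: "Non-Relative", 90: "Institutional Population",
}

# Assumption: the first digit is read as generation relative to the head.
GENERATION = {
    1: "same generation as head",
    2: "one generation below head",
    3: "one generation above head",
    4: "two generations below head",
}

def describe(code):
    """Return (label, generation group) for a legal code, or None otherwise."""
    label = RELATIONSHIP.get(code)
    if label is None:
        return None  # not a legal code; left for the edit to resolve
    return label, GENERATION.get(code // 10, "no generation implied")

print(describe(24))
print(describe(88))
```

Because the first digit carries meaning, a tabulation by generation needs only `code // 10` rather than a separate recode of every category.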

14

Coding Considerations: Good Practice

• This type of coding should be considered for certain social and economic variables.

• Ethnicity:

– the major tribal or ethnic grouping would be in the first of two columns and the minor tribal or ethnic grouping (like a sect) would be in the second

• Occupation/Industry:

– the first digit would be for the major occupation/industry, the second digit for the minor occupation/industry, and the third digit for the specific occupation or industry

• Note: Most international coding schemes, by the United Nations agencies, the U.S. Census Bureau, and others, already have the levels embedded in the codes, so the statistical office does not have to do any additional work.
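The benefit of embedded levels shows up at tabulation time: the same three-digit code can be counted at the major, minor, or specific level simply by truncating it. The sketch below assumes codes are stored as three-character strings; the occupation codes are hypothetical and do not come from any published classification.

```python
from collections import Counter

def levels(code):
    """Split a three-digit code into its major, minor, and specific levels."""
    return code[:1], code[:2], code[:3]

def tabulate(codes, level):
    """Count records at level 1 (major), 2 (minor), or 3 (specific)."""
    return Counter(code[:level] for code in codes)

# Hypothetical occupation codes for five persons.
occupations = ["412", "415", "421", "731", "412"]

print(levels("412"))
print(tabulate(occupations, 1))
print(tabulate(occupations, 2))
```

No lookup table is needed to move between levels, which is one reason a structured code is preferable to a simple sequential list.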

15

Coding Considerations: Common Codes

• A set of common codes for closely related variables can reduce coding errors and assist the data processors during the edit

• Common codes also allow data processors, where appropriate, to use an entry from one item to determine another

– For example, in many countries, place codes (birthplace, parental birthplace, previous residence, work place), language, ethnicity/race, and citizenship are very similar

– A common coding scheme for “place” might be developed as three-digit codes with the first digit representing the continent, the second the region, and the third the specific country.
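One practical consequence of a common scheme is that a single code list can validate every place item on the record. The following is a minimal sketch under that assumption; the three-digit codes, country labels, and field names are all hypothetical.

```python
# Hypothetical three-digit place codes: continent, region, country.
PLACE_CODES = {"111": "Country A", "112": "Country B", "231": "Country C"}

# Items assumed to share the common place scheme.
PLACE_ITEMS = ("birthplace", "previous_residence", "work_place")

def invalid_place_items(record):
    """Return the place items whose value is missing or not in the common code list."""
    return [item for item in PLACE_ITEMS if record.get(item) not in PLACE_CODES]

def same_continent(code_a, code_b):
    """Compare two place codes at the continent level (first digit)."""
    return code_a[0] == code_b[0]

record = {"birthplace": "111", "previous_residence": "999", "work_place": "231"}
print(invalid_place_items(record))
print(same_continent("111", "112"))
```

With separate code lists per item, each of the three checks would need its own table and its own edit specification.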

16

Coding Considerations: Common Codes

• The structure of coding can facilitate the coding process as well as later processing during editing, tabulation and analysis

• For large countries with many immigrants or ethnic groups, codes based on continent, region and country, with different codes or digits assigned to each, would be preferable to a simple listing

• National census/statistical offices can also use country numerical codes developed by international organizations such as the United Nations Statistics Division (United Nations, 1999).

17

Coding Considerations: Common Codes

Group                 Birthplace  Citizenship  Language  Ethnicity
France/French         10          10           10        10
Spain/Spanish         20          20           20        20
Latin America         25          25           20        25
Philippines/Filipino  30          30                     30
Ilokano                                        32
Tagalog                                        32
England/English       40          40           40        40
Canada                50          50           40        50
USA                   52          52           40        52

Examples of common codes for selected items
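The table suggests how an entry for one item could be used to determine another. The sketch below fills a missing language code from the birthplace code, using only the rows whose values are unambiguous in the table. Whether such an assignment is appropriate is a decision for the editing team; the record layout and the function name `assign_language` are hypothetical.

```python
# Birthplace code -> language code, from the unambiguous rows of the table:
# France, Spain, Latin America, England, Canada, USA.
LANGUAGE_BY_BIRTHPLACE = {10: 10, 20: 20, 25: 20, 40: 40, 50: 40, 52: 40}

def assign_language(record):
    """Return a copy of the record with a missing language filled from birthplace."""
    edited = dict(record)
    if edited.get("language") is None:
        # Stays None when the birthplace has no single language in the lookup.
        edited["language"] = LANGUAGE_BY_BIRTHPLACE.get(edited.get("birthplace"))
    return edited

print(assign_language({"birthplace": 50, "language": None}))
print(assign_language({"birthplace": 25, "language": 40}))
```

A reported language is never overwritten, and a birthplace with several possible languages is left unresolved rather than guessed.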

18

Coding Considerations: Final Notes

• If the items in a group on a questionnaire are not independent of one another, national census/survey staff probably should not ask all of them. The editing team must decide, on a case-by-case basis, when to use other items directly for assignment, and when to use other available variables

• When definitions differ between censuses (or between a census and a survey) for variables such as work or ethnicity, the national census/statistical office must decide how to take these changes into account, both for currently edited data and for datasets from the prior census, in order to show trends. If the original, unedited data are available, data processors can make changes to the appropriate edits and rerun all of them.