What’s in a Name? Accounting for Naming Conventions in NCHS Data Linkages

What’s in a Name? Accounting for Naming Conventions in NCHS Data Linkages Eric A. Miller National Center for Health Statistics (NCHS) 2012 FCSM Statistical Policy Seminar December 4, 2012


Page 1: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

What’s in a Name? Accounting for Naming Conventions in NCHS Data Linkages

Eric A. Miller
National Center for Health Statistics (NCHS)

2012 FCSM Statistical Policy Seminar
December 4, 2012

Page 2: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

“Two men say they’re Jesus. One of them must be wrong.”

Mark Knopfler, Dire Straits

Page 3: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

What Does This Have to Do With Data Quality?

• One reason for data sharing is data linkage
• Assessing the quality of linked data is different from assessing a standalone dataset
• The quality of variables from a specific source doesn’t matter if the linkage is poor
• Problems with linkage can produce poor quality data
  – Are the data fit for use? + Are the data fit for linkage?

Page 4: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Names

• Names are commonly used in data linkages
• Important to account for name differences and naming conventions to produce a high quality linked data file

Page 5: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Quick Background on Data Linkage

• Deterministic
  – Exact match on linkage variables
    • Frank ≠ Francis
• Probabilistic
  – Accounts for imperfect data
  – Probability of a match
    • Frank ≈ Francis
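The contrast between the two approaches can be sketched in a few lines. This is an illustration only, not the NCHS or NDI algorithm: the field list and the m- and u-probabilities (the chance that a field agrees for a true match and for a non-match) are hypothetical values chosen for the example, and the scoring follows the general agreement-weight idea used in probabilistic linkage.

```python
import math

# Hypothetical (m, u) probabilities per field: m = P(field agrees | true match),
# u = P(field agrees | non-match). These values are assumptions for illustration.
FIELDS = {
    "first": (0.90, 0.01),
    "last": (0.95, 0.005),
    "birth_year": (0.98, 0.02),
}


def deterministic_match(a, b):
    """Deterministic linkage: every linkage variable must agree exactly."""
    return all(a[field] == b[field] for field in FIELDS)


def match_weight(a, b):
    """Probabilistic linkage: sum log-likelihood-ratio weights over the fields.

    Agreement adds log2(m/u); disagreement adds log2((1-m)/(1-u)), which is negative.
    """
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


frank = {"first": "FRANK", "last": "SMITH", "birth_year": 1950}
francis = {"first": "FRANCIS", "last": "SMITH", "birth_year": 1950}

print(deterministic_match(frank, francis))  # the first names differ, so no exact match
print(match_weight(frank, francis))         # agreement on the other fields still counts
```

Under these assumed weights, the disagreement on first name lowers the score but does not rule the pair out, which is the sense in which Frank ≈ Francis.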

Page 6: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Caveats of Data Linkage

• It’s not perfect

Prince ?Prince Rogers Nelson

Prince

Some things are out of our control!

Page 7: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Caveats of Data Linkage

• Varying levels of quality for linkage variables can substantially increase workload
  – Clean-up, reformatting
  – Clerical review
• Analysis of insufficiently linked data can produce biased estimates

Page 8: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Example - Hispanic Paradox

• Despite having a higher risk profile, Hispanics have been found to have lower mortality rates compared to non-Hispanic whites

Markides and Coreil (1986). Public Health Reports; 101: 253-265

Page 9: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Mortality Rate per 100,000 Among Women in 1986-1990 National Health Interview Survey Linked to 1991 National Death Index

[Bar chart; y-axis: rate per 100,000]

            Age 18-44   Age 45-64   Age 65+
White-NH           80         642      3504
Black-NH          182         969      3928
Hispanic           97         480      2438

Liao et al. (1998). Mortality Patterns among Adult Hispanics: Findings from the NHIS, 1986 to 1990. AJPH.

Page 10: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Potential Reasons for Paradox

• Health selective immigration
• Salmon bias (return migration)
• Advantageous health behaviors and social support
• Data quality / Insufficient linkage

Page 11: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Potential Reasons for Paradox

• Data quality / Insufficient linkage
  – Naming conventions for Hispanics differ from other US populations
    • Use of mother’s and father’s surname
    • May not have single middle name
  – Less likely to have social security number
    • Especially among older adults and foreign born

Page 12: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Percent of “True” Matches for Hispanics and Non-Hispanic Whites by Foreign-Born Status

                            Hispanic                  Non-Hispanic White
                            Foreign-born   US-born    Foreign-born   US-born
Class 1 (“True”) matches    32.5%          50.0%      57.4%          62.5%

Class 1: records agree on at least 8 digits of SSN as well as first and last name, middle initial, and birth year (+/- 3 years)

Joseph Lariscy. Differential record linkage by Hispanic ethnicity and age in linked mortality studies: Implications for the epidemiologic paradox. J of Aging and Health (2011); 23: 1263-1284.
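The Class 1 definition above reads naturally as a rule on a pair of records. The sketch below is one possible reading, not the code used in the cited study: the record layout (dictionaries with `ssn`, `first`, `last`, `middle_initial`, `birth_year`) is hypothetical, and "agree on at least 8 digits" is assumed to mean position-by-position agreement of two 9-digit SSNs.

```python
def ssn_digits_agreeing(ssn_a, ssn_b):
    """Count positions where two SSN strings carry the same digit (assumed reading)."""
    return sum(1 for x, y in zip(ssn_a, ssn_b) if x == y)


def is_class1(a, b):
    """Return True if a record pair meets the Class 1 ("true" match) criteria."""
    return (
        len(a["ssn"]) == 9
        and len(b["ssn"]) == 9
        and ssn_digits_agreeing(a["ssn"], b["ssn"]) >= 8
        and a["first"] == b["first"]
        and a["last"] == b["last"]
        and a["middle_initial"] == b["middle_initial"]
        and abs(a["birth_year"] - b["birth_year"]) <= 3
    )
```

A record with no SSN, or with two surnames recorded differently in the survey and on the death certificate, cannot satisfy this rule, which is how the naming and SSN issues on the previous slide translate into fewer Class 1 matches.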

Page 13: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

What does this have to do with NCHS?

• NCHS Record Linkage Program
  – Links survey data with data collected from administrative records
  – Designed to maximize the scientific value of the NCHS population-based surveys
  – Examine factors that influence chronic disease, disability, health care utilization, morbidity, and mortality

Page 14: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Linked NCHS surveys

• National Health Interview Survey (NHIS)

• 1999-2004 NHANES, NHANES III, and NHANES II

• NHANES I Epidemiologic Follow-up Study (NHEFS)

• The Second Longitudinal Study of Aging (LSOA II)

• National Nursing Home Survey (NNHS)

Page 15: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Linked Administrative Records

• National Death Index
• Medicare and Medicaid enrollment and claims
• Social Security Administration Retirement and Disability
• Pilot projects
  – Florida Cancer Data System
  – Texas Supplemental Nutrition Assistance Program (SNAP)

Page 16: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Case Study: NCHS Survey linkage with the NDI

• National Death Index (NDI)
  – A national file of identifying death record information (beginning with 1979 deaths)
  – Every four years we send a file of survey participants to NDI to conduct a linkage and identify participant deaths
  – We take additional steps to try to improve the linkage

Page 17: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

NDI Matching Algorithm

• Social Security Number
• First name
• Middle initial
• Last name
• Month of birth
• Year of birth
• Sex
• Father’s surname
• State of birth
• Race
• State of residence
• Marital Status

Page 18: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Unweighted percent of NHIS sample adults aged 18 or older, refusing to provide SSN, 1997-2009

Page 19: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

NCHS Record Linkage Program

• To make sure we provide research quality data, we spend a lot of time processing the data to increase the chance of finding a true match
  – Try to increase the number of matches while minimizing false matches
• Addressing name clean-up and naming conventions is a major activity

Page 20: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Methods – Name Clean-up

• Fix invalid characters
• Compress spaces
• Remove titles/descriptors/suffixes
  – e.g. Mr., baby, jr.
• Linkage uses NYSIIS phonetic codes
  – Accounts for misspellings or unusual spellings
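The first three clean-up steps can be sketched as a small function. This is a minimal illustration under stated assumptions, not the NCHS processing code: the `TITLES` set is a hypothetical short list, folding accented letters to plain ASCII is an assumption about how invalid characters might be handled, and the NYSIIS phonetic coding step is not implemented here.

```python
import re
import unicodedata

# Hypothetical list of titles/descriptors/suffixes to drop; a real list would be longer.
TITLES = {"MR", "MRS", "MS", "DR", "BABY", "JR", "SR", "II", "III"}


def clean_name(raw):
    """Upper-case a name, fix invalid characters, compress spaces, drop titles."""
    # Fold accented letters to their base letter (assumption: ASCII-only output).
    folded = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    name = folded.upper()
    # Replace anything other than letters, apostrophes, spaces, and hyphens.
    name = re.sub(r"[^A-Z' -]", " ", name)
    # Compress runs of whitespace and trim the ends.
    name = re.sub(r"\s+", " ", name).strip()
    # Remove titles, descriptors, and suffixes.
    tokens = [token for token in name.split(" ") if token not in TITLES]
    return " ".join(tokens)


print(clean_name("Mr.  John   Smith, Jr."))
```

Applying the same function to both the survey file and the administrative file keeps the two sides comparable before any phonetic coding is done.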

Page 21: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Methods – Name Clean-up

• Create alternate records
  – Sent with original record
    • Among women, substitute surnames for last name
    • Nicknames (using a look-up table)
      – Substituting Elizabeth for Beth

Page 22: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Nickname Lookup Table

SEX   NICKNAME   PROPER NAME
M     ABE        ABRAHAM
F     AGGIE      AGNES
M     AL         ALBERT
M     ALEX       ALEXANDER
M     ALF        ALFRED
F     ALLIE      ALBERTA
M     ANDY       ANDREW

Example: If first name=‘Andy’ then alternate record first name=‘Andrew’
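The lookup rule in the example can be sketched as a dictionary keyed on sex and nickname. The record layout (a dictionary with `sex` and `first`) and the function name are hypothetical; the table entries are the ones shown on the slide.

```python
# Entries copied from the slide; keys are (sex, nickname), values are proper names.
NICKNAMES = {
    ("M", "ABE"): "ABRAHAM",
    ("F", "AGGIE"): "AGNES",
    ("M", "AL"): "ALBERT",
    ("M", "ALEX"): "ALEXANDER",
    ("M", "ALF"): "ALFRED",
    ("F", "ALLIE"): "ALBERTA",
    ("M", "ANDY"): "ANDREW",
}


def nickname_alternate(record):
    """Return an alternate record with the proper first name, or None if no entry."""
    proper = NICKNAMES.get((record["sex"], record["first"].upper()))
    if proper is None:
        return None
    alternate = dict(record)  # copy so the original record is sent unchanged
    alternate["first"] = proper
    return alternate


original = {"sex": "M", "first": "Andy", "last": "Smith"}
print(nickname_alternate(original))
```

The original record is left untouched so that both versions can be submitted for linkage.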

Page 23: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Methods – Name Clean-up

• Accounting for Hispanic and Asian naming conventions
  – Hispanic
    • Hispanic nickname lookup table
    • switch middle and last
  – Asian
    • switch first and last
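The two switches can be sketched as one helper that builds an alternate record. How the program decides which convention applies to a given participant is not stated on the slide, so the `convention` argument and the record layout here are hypothetical.

```python
def swapped_alternate(record, convention):
    """Build an alternate record by switching name fields for a naming convention.

    convention is a hypothetical flag: "hispanic" switches middle and last,
    "asian" switches first and last. Returns None when nothing would change.
    """
    alternate = dict(record)
    if convention == "hispanic":
        alternate["middle"], alternate["last"] = record["last"], record["middle"]
    elif convention == "asian":
        alternate["first"], alternate["last"] = record["last"], record["first"]
    else:
        return None
    return alternate if alternate != record else None
```

Each alternate is sent alongside the original record, so a death certificate that recorded the names in the other order still has a chance of linking.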

Page 24: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Hispanic Lookup Table

Sex   Formal Name   Nicknames
F     Adelina       Deli, Lina
F     Adelaida      Ade, Adela
M     Adrián        Adri
F     Adriana       Adri
M     Alberto       Alber, Albertito, Beto, Berto, Tico, Tuco, Tito
M     Alejandro     Ale, Álex, Alejo, Jandro, Jano, Sandro
F     Alejandra     Sandra, Ale, Álex, Aleja, Jandra, Jana
M     Alfonso       Alfon, Fon, Fonso, Fonsi, Poncho
F     Alicia        Ali, Licha

Page 25: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Alternate Records Example

Number First Middle Last

1 David Américo Arias Ortiz

2 David Américo Ortiz

3 David Américo Arias

4 David Américo

5 Big Papi
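Records 2 and 3 in the example keep one surname at a time from the two-part last name. A minimal sketch of that step follows, assuming a hypothetical record layout and that the two surnames are separated by a space; it does not attempt the other alternates in the table.

```python
def surname_alternates(record):
    """Return alternate records that use each part of a two-part surname alone."""
    parts = record["last"].split()
    if len(parts) < 2:
        return []  # single surname: nothing to split
    alternates = []
    # Last part first, then first part, matching the order of records 2 and 3.
    for surname in (parts[-1], parts[0]):
        alternate = dict(record)
        alternate["last"] = surname
        if alternate not in alternates:
            alternates.append(alternate)
    return alternates


original = {"first": "David", "middle": "Américo", "last": "Arias Ortiz"}
for alternate in surname_alternates(original):
    print(alternate["first"], alternate["middle"], alternate["last"])
```

Every alternate adds another chance of a false link, which is why the added records call for closer review of candidate pairs.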

Page 26: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Conclusions

• Care needs to be taken to avoid false links
  – Alternate records increase the number of potential matches
• If two men claim they’re Jesus, they can both be wrong
  – Need a higher level of scrutiny to determine that a pair of records match

Page 27: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Conclusions

• Accounting for name differences and naming conventions improves quality of the linked-data product
• Hope our efforts to account for Hispanic and Asian naming conventions reduce potential bias
  – Need to evaluate

Page 28: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Important Considerations

• How are names collected?
• How are the names recorded?
• More likely to have formal names versus nicknames?
  – Surveys may differ from official documents
• Are maiden names (surnames) available?
• Are there consistent rules for recording names?

Page 29: What’s in a  Name? Accounting for Naming Conventions in NCHS Data Linkages

Acknowledgements

• Dr. Jennifer Parker
• Dr. Dean Judson

Thank you