Use of SAS-Based Natural Language Processing to Identify Incident and Recurrent Malignancies
Justin A. Strauss, MA
Research Associate III
Kaiser Permanente Southern California
May 1, 2012 • 2012 HMORN Conference • Seattle, Washington
Co-Authors & Funding
• Chun R. Chao, PhD
• Marilyn L. Kwan, PhD
• Syed A. Ahmed, MD
• Joanne E. Schottinger, MD
• Virginia P. Quinn, PhD
Acknowledgements & Funding
• Mayra Martinez, Michelle McGuire, Melissa Preciado, Nirupa Ghai, and Jeff Slezak (KPSC); Lawrence Kushi (KPNC); Debra Ritzwoller (KPCO); Joan Warren (NCI); Jianyu Rao and Jiaoti Huang (UCLA)
• Funding was provided by KPSC Community Benefit and the Cancer Research Network
Malignancy Identification
• Malignancy identification is important for clinical and epidemiologic cancer research.
• Limited quality and availability of incident and recurrent malignancy data within health plans.
• Delayed availability of incident malignancy data from cancer registries.
• Few registries track cancer recurrences.
• Manual chart abstraction is slow and expensive.
• Previous research has shown electronic diagnosis codes (e.g., ICD-9) to be unreliable.
Natural Language Processing
• Natural language processing (NLP) can be used to identify and extract information from electronic clinical text, including incident and recurrent malignancy data.
• Increasing opportunity for NLP with adoption of electronic clinical systems in patient care delivery.
• Despite its potential value in clinical and research settings, NLP usage has been relatively sparse. Contributing factors may include:
• Technical complexity
• Systems integration requirements
• Habitual use of existing methods
SCENT Overview
• A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to identify incident and recurrent malignancies using text from pathology reports.
• SCENT is currently being implemented in two research studies at Kaiser Permanente Southern California (KPSC):
• Intervention to improve medication adherence among breast cancer patients.
• Differences in the prognosis of prostate cancer patients according to their genetic factors.
• Use of SAS programming minimizes implementation barriers and increases availability for multisite research.
Description of Methods
• SCENT identifies non-negated clinical concepts within pathology report text.
• Built using SAS Base (does not require Text Miner add-on).
• Makes extensive use of SAS hash objects and regular expressions.
• Includes components for preprocessing, matching, negation and uncertainty detection, extracting diagnostic information (e.g., staging and Gleason score), and classifying report malignancy status.
• Flexibility to assign codes using a variety of coding systems.
• Validation used a subset of SNOMED 3.x (~1000 concepts).
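As an illustration of the diagnostic-information extraction step mentioned above, the following Python sketch pulls a Gleason score out of report text with a regular expression. It is not SCENT's SAS code: the pattern, the function name `extract_gleason`, and the sample strings are all hypothetical, and it assumes scores are written in the common "3+4=7" style.

```python
import re

# Hypothetical pattern; the slides do not show SCENT's actual expressions.
# Assumes the score is written as "primary + secondary", optionally "= total".
GLEASON = re.compile(
    r"gleason\s+(?:score|grade|sum)?\s*:?\s*(\d)\s*\+\s*(\d)(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return primary, secondary, and total Gleason values, or None if absent."""
    match = GLEASON.search(text)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    # Use the stated total when present; otherwise add the two patterns.
    total = int(match.group(3)) if match.group(3) else primary + secondary
    return {"primary": primary, "secondary": secondary, "total": total}


if __name__ == "__main__":
    # Made-up report fragment, for illustration only.
    print(extract_gleason("PROSTATIC ADENOCARCINOMA GLEASON SCORE 3+4=7"))
```

A production rule set would need many more variants (for example spelled-out scores or tertiary patterns), which this sketch does not attempt.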
SCENT Process Diagram
[Flow diagram; the labels and example text recovered from it are listed below.]
• Clinical Concepts (Excel): Type (morphology, topology, or procedural); Code (SNOMED 3.X); Class (malignant, basaloid, benign, or N/A); Description (concept description).
• Concept Dictionary (SAS).
• Pathology Text (Research Database): Text (raw text segment from report); Line (sequential text segment identifier).
• Step labels: Loop Concepts, Examine Segments, Tokenize Words, Clean, Enhance, Regular Expressions, Match Tokens, Check Negation, Code Matches, Extract Data (Disease Extent, Diagnostic Certainty, Tumor Staging, Gleason Score).
• Example concept: Code M-85033; Description "intraductal papillary adenocarcinoma with invasion".
[intraductal][papillary][adenocarcinoma][with][invasion]
[((intra)?duct(al)?)][papillar(y|ies)][adenocarcinoma[ls]?]
• Example text segments:
[moderately-differentiated ductal adenocarcinoma with papillary][features.][the tumor involves 0.6 cm of one core.]
[moderately-differentiated ductal adenocarcinoma with papillary features.][the tumor involves 0.6 cm of one core.]
• Preprocessed text, tokenized:
[moderately] [differentiated] [ductal] [adenocarcinoma] with [papillary] [features]
• Negation expressions:
free (of|from)
not? (support[a-z]*|identified)
non(?!small|hodgkins)
• Coded text:
moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features
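The matching and negation steps in the diagram can be sketched in Python as follows. The token patterns, negation expressions, and tag format are copied from the slide; how SCENT actually combines them is not shown, so the logic here is an assumption: a concept matches when every token pattern is found somewhere in the segment, the tagged span runs from the first to the last matched token, and any negation expression in the same segment suppresses coding. The names `find_concept`, `is_negated`, and `code_segment` are illustrative only.

```python
import re

# Patterns copied from the process diagram (concept M-85033).
CONCEPT = {
    "snm": "m85033",
    "type": "m",
    "cls": "3",
    "tokens": [r"(intra)?duct(al)?", r"papillar(y|ies)", r"adenocarcinoma[ls]?"],
}

# Negation expressions copied from the diagram.
NEGATION_CUES = [
    r"free (of|from)",
    r"not? (support[a-z]*|identified)",
    r"non(?!small|hodgkins)",
]


def find_concept(segment, concept):
    """Return (start, end) spanning all concept tokens, or None if one is missing."""
    spans = []
    for pattern in concept["tokens"]:
        match = re.search(r"\b(?:" + pattern + r")\b", segment)
        if match is None:
            return None
        spans.append(match.span())
    return min(s for s, _ in spans), max(e for _, e in spans)


def is_negated(segment):
    """Assumption: a negation cue anywhere in the segment negates the concept."""
    return any(re.search(cue, segment) for cue in NEGATION_CUES)


def code_segment(segment, concept):
    """Wrap a non-negated concept match in the tag format shown on the slide."""
    span = find_concept(segment, concept)
    if span is None or is_negated(segment):
        return segment
    start, end = span
    open_tag = f"<nlp snm={concept['snm']} type={concept['type']} class={concept['cls']}>"
    close_tag = f"</nlp snm={concept['snm']}>"
    return segment[:start] + open_tag + segment[start:end] + close_tag + segment[end:]


if __name__ == "__main__":
    text = "moderately differentiated ductal adenocarcinoma with papillary features"
    print(code_segment(text, CONCEPT))
```

Matching tokens independently of their order is what lets the dictionary entry "intraductal papillary adenocarcinoma" hit the report phrase "ductal adenocarcinoma with papillary" in the slide's example.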
Sample Report Coding
Preprocessed Text
LEFT BREAST CORE BIOPSY TWO O CLOCK.<BR>
INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.<BR>
NO CALCIFICATION IS IDENTIFIED.<BR>
NO VASCULAR INVASION IS IDENTIFIED.<BR>
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
Coded Text
<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK.<BR>
INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2.<BR>
NO CALCIFICATION IS IDENTIFIED.<BR>
NO VASCULAR INVASION IS IDENTIFIED.<BR>
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
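The sample coding can be approximated with a small dictionary lookup applied segment by segment. This Python sketch uses a plain `dict` as a stand-in for the SAS hash object; the three phrases and codes come from the coded sample, while the literal-phrase matching and the helper names `tag` and `code_report` are simplifying assumptions rather than SCENT's method.

```python
import re

# Phrase -> (SNOMED code, type, class); entries taken from the coded sample report.
DICTIONARY = {
    "LEFT BREAST": ("T04030", "T", None),
    "BIOPSY": ("P1140", "P", None),
    "DUCTAL CARCINOMA": ("M85003", "M", "3"),
}


def tag(phrase):
    """Wrap a dictionary phrase in the tag format used in the coded sample."""
    snm, ctype, cls = DICTIONARY[phrase]
    attrs = f"SNM={snm} TYPE={ctype}" + (f" CLASS={cls}" if cls else "")
    return f"<NLP {attrs}>{phrase}</NLP SNM={snm}>"


def code_report(report):
    """Code each <BR>-delimited segment against every dictionary phrase."""
    coded = []
    for segment in report.split("<BR>"):
        for phrase in DICTIONARY:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            segment = re.sub(pattern, lambda m: tag(m.group(0)), segment)
        coded.append(segment)
    return "<BR>".join(coded)


if __name__ == "__main__":
    print(code_report("LEFT BREAST CORE BIOPSY TWO O CLOCK.<BR>"
                      "INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2."))
```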
Validation Study
• To validate SCENT, trained chart abstractors reviewed electronic pathology reports.
• Random samples of breast (n=400) and prostate (n=400) cancer patients.
• Patients diagnosed at KPSC between 2000 and 2007.
• Reports included from six months post-diagnosis through end of 2008.
• In total, 206 breast and 186 prostate cancer patients contributed 490 and 425 eligible reports, respectively.
• SCENT classifications were compared with those of abstractors.
Classification Concordance
Rows are SCENT classifications; columns are abstractor classifications. Cells show % of the abstractor column total, with N in parentheses.

| SCENT Classification | Benign | Cancer Recurrence | Other Primary Cancer | Suspicious | Kappa |
| Breast Cancer (Total) | (436) | (32) | (18) | (4) | 0.96 |
| Benign | 99.8 (435) | - | - | 25.0 (1) | |
| Cancer Recurrence | - | 100.0 (32) | - | - | |
| Other Primary Cancer | 0.2 (1) | - | 100.0 (18) | 50.0 (2) | |
| Suspicious | - | - | - | 25.0 (1) | |
| Prostate Cancer (Total) | (356) | (29) | (36) | (4) | 0.95 |
| Benign | 99.4 (354) | - | 5.6 (2) | - | |
| Cancer Recurrence | - | 96.6 (28) | 2.8 (1) | - | |
| Other Primary Cancer | 0.6 (2) | 3.4 (1) | 91.7 (33) | - | |
| Suspicious | - | - | - | 100.0 (4) | |

Note: incident contralateral breast malignancies were considered to be recurrences.
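The kappa values can be checked against the cell counts with a short Python sketch of Cohen's kappa. The slide does not say which kappa variant was used; this assumes the unweighted form and treats the "-" cells as zero counts. The function name `cohens_kappa` is illustrative.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square agreement table of counts."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Chance agreement: product of matching row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Rows = SCENT, columns = abstractor; order: benign, cancer recurrence,
# other primary cancer, suspicious. "-" cells entered as 0 (assumption).
BREAST = [[435, 0, 0, 1], [0, 32, 0, 0], [1, 0, 18, 2], [0, 0, 0, 1]]
PROSTATE = [[354, 0, 2, 0], [0, 28, 1, 0], [2, 1, 33, 0], [0, 0, 0, 4]]

if __name__ == "__main__":
    print(round(cohens_kappa(BREAST), 2), round(cohens_kappa(PROSTATE), 2))
```

Under these assumptions the rounded results agree with the reported 0.96 and 0.95.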
SCENT Performance Metrics

| | Sensitivity* | Specificity* | PPV* | NPV* |
| Breast Cancer | 1.00 (0.93-1.00) | 0.99 (0.98-1.00) | 0.94 (0.85-0.98) | 1.00 (0.99-1.00) |
| Prostate Cancer | 0.97 (0.89-0.99) | 0.99 (0.98-1.00) | 0.97 (0.89-0.99) | 0.99 (0.98-1.00) |

* Shown with Wilson's 95% confidence interval.
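The slide does not state how the four classifications were collapsed into a binary outcome. The Python sketch below assumes that "cancer recurrence" and "other primary cancer" count as positive and that "benign" and "suspicious" count as negative, a grouping that appears consistent with the reported point estimates. It also computes the Wilson score interval with z = 1.96. The counts are the concordance cell counts with "-" read as zero, and the function names are illustrative.

```python
from math import sqrt


def wilson(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def binary_metrics(table, positive=(1, 2)):
    """Collapse a SCENT-by-abstractor table into (hits, total) per metric.

    Assumption: indices in `positive` (recurrence, other primary) are
    malignant; benign and suspicious are treated as non-malignant.
    """
    tp = fp = fn = tn = 0
    for i, row in enumerate(table):          # i indexes the SCENT class
        for j, count in enumerate(row):      # j indexes the abstractor class
            if i in positive and j in positive:
                tp += count
            elif i in positive:
                fp += count
            elif j in positive:
                fn += count
            else:
                tn += count
    return {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }


# Rows = SCENT, columns = abstractor; benign, recurrence, other primary, suspicious.
BREAST = [[435, 0, 0, 1], [0, 32, 0, 0], [1, 0, 18, 2], [0, 0, 0, 1]]

if __name__ == "__main__":
    for name, (hits, total) in binary_metrics(BREAST).items():
        low, high = wilson(hits, total)
        print(f"{name}: {hits / total:.2f} ({low:.2f}-{high:.2f})")
```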
Conclusions
• Favorable results suggest SCENT can identify and extract information about primary and recurrent malignancies from pathology reports.
• Rapid cancer case identification.
• Improved measurement accuracy of common study endpoint.
• SCENT has the potential to expedite chart reviews by narrowing the search and highlighting relevant concepts.
• Generalized utility for extracting standardized disease scores and other clinical information.
• SCENT is proof of concept for SAS-based NLP that can be easily shared between institutions to support research.
Limitations & Next Steps
• SCENT has a number of limitations, including:
• Unable to disambiguate and contextualize identified clinical concepts without part-of-speech (POS) tagging.
• More susceptible to changes in text structure and increased linguistic variability than statistical NLP approaches.
• General purpose NLP (e.g., cTAKES) likely to perform better outside of pathology.
• Next steps include:
• Release SCENT source code and requisite support files.
• Optimize current functionality and assess feasibility of adding methods (e.g., POS tagging, n-grams, statistical classifiers).
• Attempt to identify non-pathologically diagnosed malignancies using radiology reports and clinical progress notes.
• Quantify cost savings associated with SCENT-assisted chart reviews.
Questions?