SVM Based Tool for Automatic Biomedical Literature Classification Abhishek Dabral


Outline

• Introduction and Motivation: PubMed database, CDC’s epidemiological database, Support Vector Machines (SVM)
• Feature Selection: TFIDF and Z-Score
• Experimental design: Keyword set, Training sets, Test sets
• Performance Metrics: Sensitivity, Specificity, Accuracy and Positive predictive value
• SVM Tool and Present status

PubMed database (www.pubmed.gov)

PubMed is a service of the U.S. National Library of Medicine (NLM) and National Institutes of Health (NIH) that includes over 18 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s.

PubMed database Search

Boolean queries using keywords

Manual scanning of the retrieved records for relevance

Time Consuming

Error Prone

Incomplete

CDC’s Human Genome Epidemiology Network (HuGENet™)

CDC: Centers for Disease Control and Prevention

An integrated, searchable knowledge base of genetic associations and human genome epidemiology. (http://www.hugenavigator.net)

CDC’s weekly Human Genome Epidemiology (HuGE) published literature abstraction process

[Figure: flow diagram. Box labels: Complex Query; Select by titles; Eligible articles; Manual review process; Excluded articles (sci lit, non-HuGE); Identified article; ProCite; Upload data tables; HTML pages; Articles indexed by gene, disease, factor, or topic; Access pending further review of article]

Complex query CDC uses for searching PubMed

(((((((((((((((((((genetic[All Fields] AND ((((("disease"[MeSH Terms] OR ("disease susceptibility"[MeSH Terms] OR predisposition[Text Word])) OR disease[Text Word]) OR defect[Text Word]) OR susceptibility[Text Word]) OR ("counseling"[MeSH Terms] OR counseling[Text Word]))) OR (("disease susceptibility"[MeSH Terms] OR susceptibility[Text Word]) AND (("genes"[MeSH Terms] OR gene[Text Word]) OR ("genes"[MeSH Terms] OR genes[Text Word])))) OR (((("mutation"[MeSH Terms] OR mutation[Text Word]) OR (("genes"[MeSH Terms] OR gene[Text Word]) AND ("mutation"[MeSH Terms] OR mutation[Text Word]))) OR (("mutation"[MeSH Terms] OR mutations[Text Word]) AND ("genes"[MeSH Terms] OR gene[Text Word]))) OR (("mutation"[MeSH Terms] OR mutation[Text Word]) AND ("genes"[MeSH Terms] OR gene[Text Word])))) OR ("hereditary diseases"[MeSH Terms] OR genetic disorder[Text Word])) OR (genetic[All Fields] AND ((("TEST"[Substance Name] OR ("TEST"[Substance Name] OR test[Text Word])) OR ("research design"[MeSH Terms] OR testing[Text Word])) OR study[All Fields]))) OR ("genetic screening"[MeSH Terms] OR genetic screening[Text Word])) OR (genetic[All Fields] AND ("risk"[MeSH Terms] OR risk[Text Word]))) OR ("polymorphism (genetics)"[MeSH Terms] OR ("polymorphism (genetics)"[MeSH Terms] OR polymorphism[Text Word]))) OR (((("genotype"[MeSH Terms] OR ("genotype"[MeSH Terms] OR genotype[Text Word])) OR genotyping[All Fields]) OR ("haplotypes"[MeSH Terms] OR haplotype[Text Word])) OR ("haplotypes"[MeSH Terms] OR haplotypes[Text Word]))) OR ((("genome"[MeSH Terms] OR genome[Text Word]) OR genomic[All Fields]) OR ("Genomics"[MeSH Terms] OR genomics[Text Word]))) OR (((gene-environment) OR (gene AND environment)) AND interaction[Text Word])) OR (((genetic[Text Word] OR gene[Text Word]) OR allelic[All Fields]) AND ((variant[All Fields] OR variants[All Fields]) OR (("epidemiology"[MeSH Subheading] OR "epidemiology"[MeSH Terms]) OR frequency[Text Word])))) OR (variant[All Fields] AND (("alleles"[MeSH Terms] OR 
allele[Text Word]) OR ("alleles"[MeSH Terms] OR alleles[Text Word])))) OR ("heterozygote detection"[MeSH Terms] OR Heterozygote Detection[Text Word])) OR ((Neonatal[All Fields] OR ("infant, newborn"[MeSH Terms] OR newborn[Text Word])) AND (("diagnosis"[MeSH Subheading] OR "mass screening"[MeSH Terms]) OR Screening[Text Word]))) OR germline[All Fields]) OR somatic[All Fields]) OR ("human genome project"[MeSH Terms] OR human genome project[Text Word])) AND ((((((((((((((((((((("epidemiology"[Subheading] OR "epidemiology"[MeSH Terms]) OR epidemiology[Text Word]) OR ("public health"[MeSH Terms] OR public health[Text Word])) OR ((("alleles"[MeSH Terms] OR allele[Text Word]) OR allelic[All Fields]) AND ((("epidemiology"[MeSH Subheading] OR "epidemiology"[MeSH Terms]) OR frequency[Text Word]) OR frequencies[All Fields]))) OR ("public policy"[MeSH Terms] OR policy[Text Word])) OR (("education"[Subheading] OR "education"[MeSH Terms]) OR education[Text Word])) OR "prevalence"[MeSH Terms]) OR prevalence[Text Word]) OR ("prevention and control"[Subheading] OR prevention[Text Word])) OR ("risk"[MeSH Terms] OR risk[Text Word])) OR ((((((((population[Text Word] OR (a number of) OR genetic[All Fields]) OR comparative[All Fields]) OR prospective[All Fields]) OR cohort[All Fields]) OR cross-section[All Fields]) OR cross-sectional[All Fields]) OR case-control[All Fields]) AND (studies OR study[All Fields]))) OR (clinical trial[All Fields] OR randomized controlled trial[All Fields])) OR (("drug interactions"[MeSH Terms] OR interactions[Text Word]) OR (("interpersonal relations"[MeSH Terms] OR "drug interactions"[MeSH Terms]) OR interaction[Text Word]))) OR ("questionnaires"[MeSH Terms] OR questionnaire[Text Word])) OR (("sensitivity and specificity"[MeSH Terms] OR sensitivity[Text Word]) OR ("sensitivity and specificity"[MeSH Terms] OR specificity[Text Word]))) OR ((((case[All Fields] OR cases[All Fields]) OR ("patients"[MeSH Terms] OR patients[Text Word])) OR (study[All Fields] AND 
group[All Fields])) OR (((((("prevention and control"[MeSH Subheading] OR control[Text Word]) OR controls[All Fields]) OR (healthy[All Fields] AND subjects[All Fields])) OR ("child"[MeSH Terms] OR children[Text Word])) OR ("adult"[MeSH Terms] OR adults[Text Word])) OR individuals[All Fields]))) OR (((("association"[MeSH Terms] OR association[Text Word]) OR ("association"[MeSH Terms] OR associations[Text Word])) OR ("disease"[MeSH Terms] OR disease[Text Word])) AND (("genes"[MeSH Terms] OR gene[Text Word]) OR ("genes"[MeSH Terms] OR genes[Text Word])))) OR oversight[All Fields]) OR ((("genotype"[MeSH Terms] OR genotype[All Fields]) OR allelic[All Fields]) AND distribution[Text Word])) OR (((("genotype"[MeSH Terms] OR genotype[Text Word]) AND ("phenotype"[MeSH Terms] OR phenotype[Text Word])) OR genotype-phenotype[All Fields]) AND correlation[All Fields])) OR (("ethics"[MeSH Terms] OR ethics[Text Word]) OR ethical[All Fields]))) AND "2004/1/29 8.00"[MHDA]:"2004/2/4 8.00"[MHDA])

Distribution of articles in PubMed and HuGE Published Literature: Week of March 2-9, 2005

Set                        Articles
PubMed (MeSH dates)        35,311
HuGE complex query         2,245
Relevant based on titles   342
HuGE (no reviews)          91

SVM is a machine learning algorithm that performs binary and multiway classification of the data

Advantages

Good generalization

Computational efficiency

Robust in high dimensions

Maps non-linearly separable training vectors in the input space to a linearly separable, higher-dimensional feature space

Finds a separating hyperplane with maximal margin in that higher-dimensional space
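The mapping idea can be illustrated with a small sketch. The XOR-style points, the feature map `phi`, and the hand-picked hyperplane `w`, `b` below are illustrative assumptions and not part of the original tool; a real SVM would learn the maximal-margin hyperplane from the training data rather than use fixed values.

```python
# Four XOR-style points: no straight line separates them in the 2-D input space.
points = [((0, 0), -1), ((1, 1), -1), ((0, 1), +1), ((1, 0), +1)]

def phi(x):
    """Hypothetical feature map: add the product x1*x2 as a third dimension."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

# Hand-picked hyperplane w.z + b = 0 in the 3-D feature space (not learned).
w, b = (1.0, 1.0, -2.0), -0.5

def classify(x):
    """Return +1 or -1 depending on the side of the hyperplane phi(x) falls on."""
    z = phi(x)
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return +1 if score > 0 else -1

predictions = [classify(x) for x, _ in points]
labels = [y for _, y in points]
print(predictions == labels)
```

After the mapping, a single plane separates the two classes, which is the situation an SVM with a suitable kernel exploits.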

[Figure: SVM workflow. A training set of labeled feature vectors (class +1 or -1 followed by 0/1 feature values) is passed to SVM Learn, which uses a kernel to produce a model. SVM Classify applies the model to a test set of unlabeled vectors (class 0) and outputs the result. Source labels in the diagram: CDC, PubMed.]

SVM classifies a set of items into a set of pre-defined categories

SVMlight

Implementation of Support Vector Machines (SVMs) in C

Learning module (svm_learn) – to train the model

Classification module (svm_classify) - used to apply the learned model to new examples

File Format

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

Training example (class label followed by feature:value pairs)

1 6:0.0198403253586671 15:0.0339873732306071 29:0.0360280968798065 31:0.0378103484117687
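A minimal sketch of writing one example in this file format follows. The helper name `to_svmlight_line` and the choice to skip zero-valued features are assumptions made for illustration; SVMlight expects feature numbers in increasing order, so the sketch sorts them.

```python
def to_svmlight_line(target, features, info=None):
    """Format one example as '<target> <feature>:<value> ... # <info>'.

    features maps integer feature ids to float values; zero-valued
    features are skipped and ids are written in increasing order.
    """
    parts = [f"{target:+d}" if target else "0"]
    for fid in sorted(features):
        if features[fid] != 0:
            parts.append(f"{fid}:{features[fid]}")
    line = " ".join(parts)
    if info:
        line += f" # {info}"
    return line

example = to_svmlight_line(+1, {15: 0.0339873732306071, 6: 0.0198403253586671})
print(example)
```

A target of 0 marks an example whose class is unknown, which is how the test set is encoded.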

Ranked keywords as features for SVM : TFIDF weighting scheme

1. TFIDF method (Term Frequency x Inverse Document Frequency)

TF measures the number of times a word occurs in the HuGE set of abstracts.

IDF measures the information content of a word: its rarity across all the abstracts in the background set.

TFIDF is defined as

$$idf_a = \log \frac{N}{df_a}$$

$$tfidf_a^H = tf_a^H \times idf_a$$

$idf_a$ : Inverse document frequency of word a in the background set
$df_a$ : Number of documents (abstracts) in the background set in which word a occurs
$N$ : Total number of abstracts in the background set
$tfidf_a^H$ : Weight of the word a to the HuGE abstracts H
$tf_a^H$ : Number of times word a occurs in the set of HuGE abstracts H
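The weighting above can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the function name `tfidf_weights`, whitespace tokenization, the natural logarithm, and skipping words that never occur in the background set are all choices made here, and the toy abstracts are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(huge_abstracts, background_abstracts):
    """Return {word: tf * log(N / df)} following the definitions above.

    tf is counted over the HuGE abstracts; df and N come from the
    background set. Words absent from the background set are skipped
    (assumption), since df = 0 leaves idf undefined.
    """
    n_docs = len(background_abstracts)
    df = Counter()
    for doc in background_abstracts:
        df.update(set(doc.lower().split()))   # count each word once per abstract
    tf = Counter()
    for doc in huge_abstracts:
        tf.update(doc.lower().split())        # count every occurrence
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf if df[w] > 0}

# Hypothetical toy data.
huge = ["gene polymorphism risk", "gene risk study"]
background = ["gene study", "protein study", "cell study", "polymorphism risk"]
weights = tfidf_weights(huge, background)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)
```

Words that are frequent in the HuGE abstracts but rare in the background set rise to the top of the ranking, while a word such as "study" that appears in most background abstracts sinks to the bottom.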

Ranked keywords as features for SVM : Z-Score weighting scheme

2. Z-Score method

Z-Score is defined as

$$z = \frac{p_1 - p_2}{\sqrt{p q \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$

p and q are defined as

$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$

$$q = 1 - p$$

$p_1$ : Probability with which a given word occurs in the HuGE abstracts
$p_2$ : Probability with which that particular word occurs in the background set
$n_1$ : Total word count in the HuGE abstracts
$n_2$ : Total word count in the background set
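A short sketch of the Z-Score computation for a single word follows, assuming the standard two-proportion form with a pooled probability p and q = 1 - p. The function name `z_score` and the word counts in the example are hypothetical.

```python
import math

def z_score(count_huge, n1, count_background, n2):
    """Two-proportion z-score for one word.

    count_huge / n1 is p1 (HuGE abstracts); count_background / n2 is p2
    (background set). n1 and n2 are total word counts.
    """
    p1 = count_huge / n1
    p2 = count_background / n2
    p = (n1 * p1 + n2 * p2) / (n1 + n2)   # pooled probability
    q = 1.0 - p
    return (p1 - p2) / math.sqrt(p * q * (1.0 / n1 + 1.0 / n2))

# Hypothetical counts: a word seen 50 times in 1,000 HuGE words
# and 100 times in 10,000 background words.
z = z_score(50, 1000, 100, 10000)
print(round(z, 2))
```

A large positive score marks a word that is over-represented in the HuGE abstracts relative to the background set; a word with the same relative frequency in both sets scores zero.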

Experimental design with different keyword sets, training sets and test sets

Keyword sets
1. Z-Score top 100 keywords
2. Z-Score top 500 keywords
3. Z-Score all 784 keywords
4. TFIDF top 100 keywords
5. TFIDF top 500 keywords
6. TFIDF top 750 keywords
7. TFIDF top 1010 keywords
8. TFIDF top 2010 keywords

Training sets
1. 11000 +ves and 11000 -ves
2. 11000 +ves and 5600 -ves

Test sets (one week's worth of articles each)
1. Feb 12, 2004
2. Apr 1, 2004
3. Apr 8, 2004
4. Jun 3, 2004

Conversion of the abstracts in the training and test sets into an abstract vs keyword matrix. CL denotes Class, K denotes keyword

Training Set: known HuGE and non-HuGE abstracts, converted to SVM format

Abstract   CL   K1   K2   K3   ...
1          +1   1    0    1    ...
2          -1   0    0    1    ...
3          +1   1    1    0    ...
4          +1   0    1    1    ...

Test Set: unknown HuGE and non-HuGE articles, converted to SVM format

Abstract   CL   K1   K2   K3   ...
1          0    1    0    0    ...
2          0    0    0    1    ...
3          0    1    1    0    ...
4          0    0    1    0    ...

HuGE : Human Genome Epidemiology
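The conversion into an abstract vs keyword matrix can be sketched as below. The helper name `to_binary_row`, the keyword list, and the example abstracts are hypothetical, and matching is a plain whitespace-token lookup, whereas the original pipeline also applies stemming and a stop list.

```python
def to_binary_row(abstract, keywords, label=0):
    """Return [CL, K1, K2, ...]: class label followed by 1/0 keyword flags.

    label is +1 (HuGE) or -1 (non-HuGE) for training abstracts and 0 for
    unclassified test abstracts.
    """
    tokens = set(abstract.lower().split())
    return [label] + [1 if k in tokens else 0 for k in keywords]

keywords = ["gene", "polymorphism", "risk"]          # K1, K2, K3 (hypothetical)
training = [("gene variant and risk", +1), ("protein folding risk", -1)]
matrix = [to_binary_row(text, keywords, cl) for text, cl in training]
test_row = to_binary_row("polymorphism in a gene", keywords)
print(matrix, test_row)
```

Each row has the same length regardless of the abstract's length, which is what lets the SVM treat every abstract as a point in the same keyword space.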

Three metrics to evaluate SVM performance

$$\text{Sensitivity (Sn)} = \frac{TP}{TP + FN}$$

$$\text{Specificity (Sp)} = \frac{TN}{TN + FP}$$

$$PPV = \frac{TP}{TP + FP}$$

SVM        Human Expert   Category
Positive   Positive       True Positive (TP)
Positive   Negative       False Positive (FP)
Negative   Negative       True Negative (TN)
Negative   Positive       False Negative (FN)

The classification of abstracts by human expert was used as the “gold standard” against which SVM classifications were evaluated
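The three metrics can be computed directly from paired SVM and expert labels, as in the sketch below. The function name `evaluate` and the six example labels are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def evaluate(svm_labels, expert_labels):
    """Compute Sn, Sp and PPV, treating expert labels as the gold standard.

    Labels are +1 (positive) or -1 (negative).
    """
    pairs = list(zip(svm_labels, expert_labels))
    tp = sum(1 for s, e in pairs if s == +1 and e == +1)
    fp = sum(1 for s, e in pairs if s == +1 and e == -1)
    tn = sum(1 for s, e in pairs if s == -1 and e == -1)
    fn = sum(1 for s, e in pairs if s == -1 and e == +1)
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "PPV": tp / (tp + fp),
    }

# Hypothetical labels for six abstracts.
svm = [+1, +1, +1, -1, -1, -1]
expert = [+1, +1, -1, -1, -1, +1]
metrics = evaluate(svm, expert)
print(metrics)
```

Because negatives far outnumber positives in a week of PubMed articles, even a small false positive rate can pull PPV well below sensitivity and specificity.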

Evaluation of SVM Performance with different keyword sets

[Figure: bar chart, "Average performance of SVM with different keyword sets as features". x-axis: Evaluation metrics (Sn, Sp, PPV); y-axis: Percentage (0 to 100); series: TFIDF top 100, TFIDF top 500, TFIDF top 750, TFIDF top 1010, TFIDF top 2010, Z-Score top 100, Z-Score top 500, Z-Score all 784]

• Best results were obtained (93.6% average sensitivity, 91.45% average specificity, 50% average PPV) using the top 2010 keywords obtained from TFIDF

SVM classification outperformed human expert classification

Upon re-evaluation, it was found that SVM was able to pick up many more positive articles missed by the traditional curation process

SVM performance with and without the complex query

SVM overall performance (with complex query)

Sensitivity 96.3%
Specificity 96.8%
PPV 80.6%

Training set : corrected training set of equal number of positives and negatives
Keyword set : union of results based on TFIDF top 2010 and Z-Score all 784

SVM performance directly on the PubMed abstracts (without complex query)

Sensitivity 89.7%
Specificity 98.4%
PPV 33.3%

PubMed search based on Entrez date (EDAT)
Three random dates chosen: 5/17/2001, 9/11/2001 and 7/13/2001

SVM Tool: Getting it all together

[Figure: PubMed Query, then SVM Tool, then HuGE Articles]

Entrez ETools

Entrez Programming Utilities: tools that provide access to Entrez data outside of the regular web query interface

Customized the ESearch and EFetch routines to get a list of abstracts from PubMed data.

Examples:

In PubMed, display PMIDs 12345 and 9997 in html retrieval mode and abstract retrieval type: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12345,9997&retmode=html&rettype=abstract

Search in PubMed for the term cancer for the Entrez date from the last 60 days and retrieve the first 100 IDs and translations using the history parameter: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&reldate=60&datetype=edat&retmax=100&usehistory=y
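The two example URLs can be assembled programmatically, as in this sketch. It only builds the URL strings and makes no network request; the function names `esearch_url` and `efetch_url` are hypothetical, while the base URL and query parameters are taken from the examples on this slide.

```python
from urllib.parse import urlencode

BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"  # base URL as given on the slide

def esearch_url(term, reldate, retmax):
    """Build an ESearch URL restricted by Entrez date (edat)."""
    params = {
        "db": "pubmed", "term": term, "reldate": reldate,
        "datetype": "edat", "retmax": retmax, "usehistory": "y",
    }
    return BASE + "esearch.fcgi?" + urlencode(params)

def efetch_url(pmids, retmode="html", rettype="abstract"):
    """Build an EFetch URL for a list of PMIDs."""
    params = {
        "db": "pubmed", "id": ",".join(str(p) for p in pmids),
        "retmode": retmode, "rettype": rettype,
    }
    # safe="," keeps the comma-separated id list unescaped in the URL.
    return BASE + "efetch.fcgi?" + urlencode(params, safe=",")

search = esearch_url("cancer", 60, 100)
fetch = efetch_url([12345, 9997])
print(search)
print(fetch)
```

In a full pipeline, the IDs returned by the ESearch call would be passed to EFetch to retrieve the abstracts that form the test set.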

From Abstracts to Test Set

[Figure: processing pipeline. Language labels in the diagram: PERL, C]

1. Abstracts.txt is split into abstract1.txt, abstract2.txt, abstract3.txt, ...
2. The test set is built from the individual abstracts:
   Extract keywords
   + Stem keywords (Porter’s Stemming)
   + Stop list (frequent words not to be included: ’and’, ’or’, ...)
   + Keyword ranking (weighting)
3. PERL code: run svm_classify.exe and create a markup (HTML) file to display +ve and -ve abstracts
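The final step can be sketched as follows. svm_classify writes one decision value per test example, and the sign of that value gives the predicted class. Everything else here is an assumption for illustration: the original step was written in PERL, and the function names, example abstracts, prediction values, and HTML layout are hypothetical.

```python
import html

def split_by_prediction(abstracts, prediction_lines):
    """Pair each abstract with its decision value and split by sign."""
    positives, negatives = [], []
    for text, line in zip(abstracts, prediction_lines):
        score = float(line.strip())
        (positives if score > 0 else negatives).append((score, text))
    return positives, negatives

def to_html(positives, negatives):
    """Render a minimal HTML page listing +ve and -ve abstracts."""
    def section(title, items):
        rows = "".join(
            f"<li>{html.escape(text)} ({score:+.2f})</li>" for score, text in items
        )
        return f"<h2>{title}</h2><ul>{rows}</ul>"
    body = section("+ve abstracts", positives) + section("-ve abstracts", negatives)
    return f"<html><body>{body}</body></html>"

# Hypothetical abstracts and prediction-file lines.
abstracts = ["Gene X polymorphism & cancer risk", "Protein folding kinetics"]
predictions = ["1.27\n", "-0.84\n"]
pos, neg = split_by_prediction(abstracts, predictions)
page = to_html(pos, neg)
print(page)
```

Keeping the decision value next to each abstract lets a curator review borderline cases, those with scores near zero, first.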

SVM Tool (first prototype)

Present Status (GAPscreener)

http://www.hugenavigator.net/


• We combined the notion of keyword extraction based on the known ranking schemes of TFIDF and Z-Score, and used the keywords as features to drive an SVM-based classifier.

• SVM performed better than other supervised learning techniques. We obtained an accuracy of up to 96.8% using SVM.

• Our methodology outperformed a human expert working on the problem full-time for 4 years, by identifying an additional 20% of HuGE abstracts that were missed in human inspection.

Questions?

Backup Slides

The kernel function: Mapping points in input space to feature space

A kernel K is a function such that for any points $x, x'$ in the input space

$$K(x, x') = \phi(x) \cdot \phi(x')$$

where $\phi$ is a mapping to a feature space.

Different Kernel Functions (KF)

Polynomial KF : $K_{poly}(x, x') = (x \cdot x')^d$

Radial basis KF : $K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$

Sigmoid KF : $K(x, x') = \tanh(k \, x \cdot x' + \theta)$

User Defined KF : Any user defined function

[Figure: non-linearly separable data in input space; mapping data from input space to feature space]
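The kernel definition can be checked numerically with a short sketch. The function names and default parameter values are hypothetical, the radial basis function is written in its standard Gaussian form with $2\sigma^2$ in the denominator, and `phi` is the well-known explicit feature map for the degree-2 polynomial kernel on 2-D inputs.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def k_poly(x, y, d=2):
    """Polynomial kernel (x . x')^d."""
    return dot(x, y) ** d

def k_rbf(x, y, sigma=1.0):
    """Gaussian radial basis kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def k_sigmoid(x, y, k=1.0, theta=0.0):
    """Sigmoid kernel tanh(k x . x' + theta)."""
    return math.tanh(k * dot(x, y) + theta)

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
# K(x, x') computed in input space equals phi(x) . phi(x') in feature space.
print(math.isclose(k_poly(x, y), dot(phi(x), phi(y))))
```

The kernel gives the feature-space dot product without ever constructing `phi(x)`, which is what keeps SVM training tractable when the feature space is very high-dimensional.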

SVM performance with the corrected training set and with different training set sizes

SVM performance with corrected training set (percentage)

Training set                                  Sn     Sp     PPV
Previous training set (5363 +ve, 5362 -ve)    92.2   91     59.7
Corrected training set (5363 +ve, 5362 -ve)   95.4   92.3   62.8

[Figure: line chart, "Performance of SVM with different training set sizes". x-axis: Training set size (100, 500, 1000, 2000, 5000, 10724, 17050); y-axis: Percentage (0 to 100); series: Sensitivity, Specificity, PPV]

Even with small training sets, SVM was able to pick up the right model parameters and gave reasonable results

Correcting the few errors in the training set only slightly improved the results, indicating the robustness of SVM

Towards higher sensitivity in classification with a bias for positives in the training set

Average performance of SVM with different training sets (percentage)

Training set              Sn      Sp      PPV
11400 +ve and 11300 -ve   93.75   91.5    50.5
11400 +ve and 5300 -ve    96.25   86.75   39.25

Twice the number of positives over negatives in the training set

Positives weighted two, four and eight fold compared to negatives in the training set

Sensitivity improved at the cost of specificity and PPV

Average SVM performance with positives weighted higher than negatives in training set (percentage)

Positive weights   Sn     Sp     PPV
2 fold             97.1   93     59
4 fold             98.8   73.8   28.1
8 fold             99.4   64.8   22.7

Union of results using keywords based on TFIDF and Z-Score methods

[Figure: bar chart, "Average performance of SVM from the union of results (TFIDF top 2010 & Z-Score all 784 results)". x-axis: Evaluation metrics (Sn, Sp, PPV); y-axis: Percentage (0 to 100); series: TFIDF top 2010, Z-Score all 784, Union of results]

SVM(TFIDF)     SVM(Z Score)   Human expert   Category
Positive(TP)   Positive(TP)   Positive       Positive(TP)
Positive(TP)   Negative(FN)   Positive       Positive(TP)
Negative(FN)   Positive(TP)   Positive       Positive(TP)
Positive(FP)   Positive(FP)   Negative       Positive(FP)
Negative(TN)   Negative(TN)   Negative       Negative(TN)
Negative(TN)   Positive(FP)   Negative       Negative(TN)
Positive(FP)   Negative(TN)   Negative       Negative(TN)
Negative(FN)   Negative(FN)   Positive       Negative(FN)

A sensitivity of 96.1%, a specificity of 95.5%, and a PPV of 65.7% were achieved