svm_AD
Uploaded by abhishek-dabral
Outline
• Introduction and Motivation: PubMed database, CDC’s epidemiological database, Support Vector Machines (SVM)
• Feature Selection: TFIDF and Z-Score
• Experimental design: Keyword sets, Training sets, Test sets
• Performance Metrics: Sensitivity, Specificity, Accuracy and Positive predictive value
• SVM Tool and Present status
PubMed is a service of the U.S. National Library of Medicine (NLM) and the National Institutes of Health (NIH) that includes over 18 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s.
PubMed database (www.pubmed.gov)
PubMed database Search
Boolean queries using keywords
Manual scanning of the retrieved records for relevance
Time Consuming
Error Prone
Incomplete
CDC’s Human Genome Epidemiology Network (HuGENet™)
CDC: Centers for Disease Control and Prevention
An integrated, searchable knowledge base of genetic associations and human genome epidemiology. (http://www.hugenavigator.net)
[Flowchart: CDC’s weekly Human Genome Epidemiology (HuGE) published literature abstraction process. Boxes shown: Complex Query; Select by titles; Manual review process; Eligible articles; Identified article; Excluded articles (sci lit, non-HuGE); Pending further review of article; ProCite; Access; Upload data tables; HTML pages; Articles indexed by gene, disease, factor, or topic.]
Complex query CDC uses for searching PubMed
(((((((((((((((((((genetic[All Fields] AND ((((("disease"[MeSH Terms] OR ("disease susceptibility"[MeSH Terms] OR predisposition[Text Word])) OR disease[Text Word]) OR defect[Text Word]) OR susceptibility[Text Word]) OR ("counseling"[MeSH Terms] OR counseling[Text Word]))) OR (("disease susceptibility"[MeSH Terms] OR susceptibility[Text Word]) AND (("genes"[MeSH Terms] OR gene[Text Word]) OR ("genes"[MeSH Terms] OR genes[Text Word])))) OR (((("mutation"[MeSH Terms] OR mutation[Text Word]) OR (("genes"[MeSH Terms] OR gene[Text Word]) AND ("mutation"[MeSH Terms] OR mutation[Text Word]))) OR (("mutation"[MeSH Terms] OR mutations[Text Word]) AND ("genes"[MeSH Terms] OR gene[Text Word]))) OR (("mutation"[MeSH Terms] OR mutation[Text Word]) AND ("genes"[MeSH Terms] OR gene[Text Word])))) OR ("hereditary diseases"[MeSH Terms] OR genetic disorder[Text Word])) OR (genetic[All Fields] AND ((("TEST"[Substance Name] OR ("TEST"[Substance Name] OR test[Text Word])) OR ("research design"[MeSH Terms] OR testing[Text Word])) OR study[All Fields]))) OR ("genetic screening"[MeSH Terms] OR genetic screening[Text Word])) OR (genetic[All Fields] AND ("risk"[MeSH Terms] OR risk[Text Word]))) OR ("polymorphism (genetics)"[MeSH Terms] OR ("polymorphism (genetics)"[MeSH Terms] OR polymorphism[Text Word]))) OR (((("genotype"[MeSH Terms] OR ("genotype"[MeSH Terms] OR genotype[Text Word])) OR genotyping[All Fields]) OR ("haplotypes"[MeSH Terms] OR haplotype[Text Word])) OR ("haplotypes"[MeSH Terms] OR haplotypes[Text Word]))) OR ((("genome"[MeSH Terms] OR genome[Text Word]) OR genomic[All Fields]) OR ("Genomics"[MeSH Terms] OR genomics[Text Word]))) OR (((gene-environment) OR (gene AND environment)) AND interaction[Text Word])) OR (((genetic[Text Word] OR gene[Text Word]) OR allelic[All Fields]) AND ((variant[All Fields] OR variants[All Fields]) OR (("epidemiology"[MeSH Subheading] OR "epidemiology"[MeSH Terms]) OR frequency[Text Word])))) OR (variant[All Fields] AND (("alleles"[MeSH Terms] OR 
allele[Text Word]) OR ("alleles"[MeSH Terms] OR alleles[Text Word])))) OR ("heterozygote detection"[MeSH Terms] OR Heterozygote Detection[Text Word])) OR ((Neonatal[All Fields] OR ("infant, newborn"[MeSH Terms] OR newborn[Text Word])) AND (("diagnosis"[MeSH Subheading] OR "mass screening"[MeSH Terms]) OR Screening[Text Word]))) OR germline[All Fields]) OR somatic[All Fields]) OR ("human genome project"[MeSH Terms] OR human genome project[Text Word])) AND ((((((((((((((((((((("epidemiology"[Subheading] OR "epidemiology"[MeSH Terms]) OR epidemiology[Text Word]) OR ("public health"[MeSH Terms] OR public health[Text Word])) OR ((("alleles"[MeSH Terms] OR allele[Text Word]) OR allelic[All Fields]) AND ((("epidemiology"[MeSH Subheading] OR "epidemiology"[MeSH Terms]) OR frequency[Text Word]) OR frequencies[All Fields]))) OR ("public policy"[MeSH Terms] OR policy[Text Word])) OR (("education"[Subheading] OR "education"[MeSH Terms]) OR education[Text Word])) OR "prevalence"[MeSH Terms]) OR prevalence[Text Word]) OR ("prevention and control"[Subheading] OR prevention[Text Word])) OR ("risk"[MeSH Terms] OR risk[Text Word])) OR ((((((((population[Text Word] OR (a number of) OR genetic[All Fields]) OR comparative[All Fields]) OR prospective[All Fields]) OR cohort[All Fields]) OR cross-section[All Fields]) OR cross-sectional[All Fields]) OR case-control[All Fields]) AND (studies OR study[All Fields]))) OR (clinical trial[All Fields] OR randomized controlled trial[All Fields])) OR (("drug interactions"[MeSH Terms] OR interactions[Text Word]) OR (("interpersonal relations"[MeSH Terms] OR "drug interactions"[MeSH Terms]) OR interaction[Text Word]))) OR ("questionnaires"[MeSH Terms] OR questionnaire[Text Word])) OR (("sensitivity and specificity"[MeSH Terms] OR sensitivity[Text Word]) OR ("sensitivity and specificity"[MeSH Terms] OR specificity[Text Word]))) OR ((((case[All Fields] OR cases[All Fields]) OR ("patients"[MeSH Terms] OR patients[Text Word])) OR (study[All Fields] AND 
group[All Fields])) OR (((((("prevention and control"[MeSH Subheading] OR control[Text Word]) OR controls[All Fields]) OR (healthy[All Fields] AND subjects[All Fields])) OR ("child"[MeSH Terms] OR children[Text Word])) OR ("adult"[MeSH Terms] OR adults[Text Word])) OR individuals[All Fields]))) OR (((("association"[MeSH Terms] OR association[Text Word]) OR ("association"[MeSH Terms] OR associations[Text Word])) OR ("disease"[MeSH Terms] OR disease[Text Word])) AND (("genes"[MeSH Terms] OR gene[Text Word]) OR ("genes"[MeSH Terms] OR genes[Text Word])))) OR oversight[All Fields]) OR ((("genotype"[MeSH Terms] OR genotype[All Fields]) OR allelic[All Fields]) AND distribution[Text Word])) OR (((("genotype"[MeSH Terms] OR genotype[Text Word]) AND ("phenotype"[MeSH Terms] OR phenotype[Text Word])) OR genotype-phenotype[All Fields]) AND correlation[All Fields])) OR (("ethics"[MeSH Terms] OR ethics[Text Word]) OR ethical[All Fields]))) AND "2004/1/29 8.00"[MHDA]:"2004/2/4 8.00"[MHDA])
Distribution of articles in PubMed and HuGE Published Literature: Week of March 2-9, 2005
[Figure: nested distribution of articles. Labels: PubMed (MeSH dates); HuGE complex query; Relevant based on titles; HuGE (no reviews). Counts shown: 2,245; 91; 342; 35,311.]
SVM is a machine learning algorithm that performs binary and multiway classification of the data
Advantages
Good generalization
Computational efficiency
Robust in high dimensions
Maps non linearly separable training vectors in input space to linearly separable higher dimensional feature space
Finds a separating hyper plane with maximal margin in that higher dimensional space
SVM classifies a set of items into a set of pre-defined categories
[Diagram: a labeled training set (rows that start with +1 or -1, followed by binary feature values such as 1 1 0 1 1 0 ...) is passed to SVM Learn, which uses a kernel to produce a model. SVM Classify applies the model to a test set whose rows start with 0 (class unknown) and produces the result. Source labels: CDC, PubMed.]
SVMlight
Implementation of Support Vector Machines (SVMs) in C
Learning module (svm_learn) – to train the model
Classification module (svm_classify) - used to apply the learned model to new examples
File Format
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
# training examples
Class Label   Feature : Value
1 6:0.0198403253586671 15:0.0339873732306071 29:0.0360280968798065 31:0.0378103484117687
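To make the file format concrete, here is a minimal Python sketch that writes one example line in that layout. The function name `format_svmlight_line` and the dictionary input are illustrative assumptions, not part of SVMlight or of the original tool (which the slides describe as PERL and C code).

```python
def format_svmlight_line(target, features, info=None):
    """Format one example as '<target> <feature>:<value> ... # <info>'.

    target   -- +1, -1, or 0 (0 marks an example whose class is unknown)
    features -- dict mapping integer feature number to float value (hypothetical input shape)
    info     -- optional free-text comment appended after '#'
    """
    if target not in (1, -1, 0):
        raise ValueError("target must be +1, -1 or 0")
    label = "+1" if target == 1 else str(target)
    # Emit feature:value pairs in increasing feature-number order.
    pairs = " ".join(f"{idx}:{val}" for idx, val in sorted(features.items()))
    line = f"{label} {pairs}" if pairs else label
    if info:
        line += f" # {info}"
    return line


print(format_svmlight_line(1, {15: 0.25, 6: 0.5}, info="abstract 1"))
```

Sorting the pairs keeps the feature numbers increasing, which is the order SVMlight expects in its input files.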
Ranked keywords as features for SVM : TFIDF weighting scheme
1. TFIDF method (Term Frequency x Inverse Document Frequency)
TF measures the number of times a word occurs in the HuGE set of abstracts.
IDF measures the information content of a word: its rarity across all the abstracts in the background set.
IDF is defined as
\[ idf_a = \log \frac{N}{df_a} \]
TFIDF is defined as
\[ tfidf^H_a = tf^H_a \times idf_a \]
idf_a : Inverse document frequency of word a in the background set
df_a : Number of documents (abstracts) in the background set in which word a occurs
N : Total number of abstracts in the background set
tfidf^H_a : Weight of the word a to the HuGE abstracts H
tf^H_a : Number of times word a occurs in the set of HuGE abstracts H
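The TFIDF definitions on this slide (idf as the log of N over df, tfidf as tf times idf) can be sketched in a few lines of Python. This is an illustration only: the function name, the pre-tokenized input lists, the natural logarithm, and the choice to skip words absent from the background set are all assumptions that the slides do not specify.

```python
import math
from collections import Counter


def tfidf_weights(huge_abstracts, background_abstracts):
    """Return {word: tf * idf} for words in the HuGE abstracts.

    Both arguments are lists of token lists (one list per abstract).
    """
    n_docs = len(background_abstracts)            # N
    df = Counter()                                # df_a: background abstracts containing a
    for tokens in background_abstracts:
        df.update(set(tokens))
    tf = Counter()                                # tf_a^H: occurrences of a across HuGE abstracts
    for tokens in huge_abstracts:
        tf.update(tokens)
    # Assumption: words never seen in the background set are skipped (idf undefined).
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf if df[w] > 0}


huge = [["gene", "risk"], ["gene"]]
background = [["gene"], ["risk", "study"], ["study"], ["study", "gene"]]
for word, weight in sorted(tfidf_weights(huge, background).items()):
    print(word, round(weight, 3))
```

Sorting the resulting weights in descending order would give a ranked keyword list, from which a "top k" set could be taken.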
Ranked keywords as features for SVM : Z-Score weighting scheme
2. Z-Score method
Z-Score is defined as
\[ z = \frac{p_1 - p_2}{\sqrt{p\,q\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
p and q are defined as
\[ p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}, \qquad q = 1 - p \]
p1 : Probability with which a given word occurs in the HuGE abstracts
p2 : Probability with which that particular word occurs in the background set
n1 : Total word count in the HuGE abstracts
n2 : Total word count in the background set
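As a worked illustration of the Z-Score weighting, the sketch below computes z for one word from raw counts. It assumes the standard two-proportion form (with a square root in the denominator) and that p1 and p2 are estimated as relative frequencies; the function name and arguments are hypothetical.

```python
import math


def z_score(count_huge, n1, count_background, n2):
    """Two-proportion z-score for one word.

    count_huge       -- occurrences of the word in the HuGE abstracts
    n1               -- total word count in the HuGE abstracts
    count_background -- occurrences of the word in the background set
    n2               -- total word count in the background set
    """
    p1 = count_huge / n1              # assumed estimate of p1
    p2 = count_background / n2        # assumed estimate of p2
    p = (n1 * p1 + n2 * p2) / (n1 + n2)
    q = 1 - p
    # Undefined when p is 0 or 1 (word absent from, or filling, both sets).
    return (p1 - p2) / math.sqrt(p * q * (1 / n1 + 1 / n2))


print(round(z_score(30, 1000, 10, 1000), 2))
```

A large positive z indicates a word that is over-represented in the HuGE abstracts relative to the background set, which is what makes it a candidate keyword.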
Experimental design with different keyword sets, training sets and test sets
Keyword sets
1. Z-Score top 100 keywords
2. Z-Score top 500 keywords
3. Z-Score all 784 keywords
4. TFIDF top 100 keywords
5. TFIDF top 500 keywords
6. TFIDF top 750 keywords
7. TFIDF top 1010 keywords
8. TFIDF top 2010 keywords
Training sets
1. 11000 +ves and 11000 -ves
2. 11000 +ves and 5600 -ves
Test sets (a week's worth of articles each)
1. Feb 12, 2004
2. Apr 1, 2004
3. Apr 8, 2004
4. Jun 3, 2004
Training Set: known HuGE and non-HuGE abstracts, converted to SVM format

Abstract   CL   K1   K2   K3   ...
1          +1   1    0    1    ...
2          -1   0    0    1    ...
3          +1   1    1    0    ...
4          +1   0    1    1    ...

Test Set: unknown HuGE and non-HuGE articles, converted to SVM format

Abstract   CL   K1   K2   K3   ...
1          0    1    0    0    ...
2          0    0    0    1    ...
3          0    1    1    0    ...
4          0    0    1    0    ...
Conversion of the abstracts in the training and test sets into an abstract vs keyword matrix. CL denotes Class, K denotes keyword
HuGE : Human Genome Epidemiology
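The conversion into an abstract vs keyword matrix can be sketched as follows. The sketch assumes already tokenized abstracts and binary presence values (1 if the keyword occurs, 0 otherwise), as drawn on the slide; weighted values could be substituted for the 1s. The function name is hypothetical.

```python
def to_feature_rows(abstracts, keywords, labels=None):
    """Build one row per abstract: [CL, K1, K2, ...].

    abstracts -- list of token lists
    keywords  -- ordered list of keywords (the columns K1, K2, ...)
    labels    -- +1/-1 per abstract for a training set, or None for a
                 test set, where the class column is filled with 0
    """
    rows = []
    for i, tokens in enumerate(abstracts):
        present = set(tokens)
        label = labels[i] if labels is not None else 0
        rows.append([label] + [1 if k in present else 0 for k in keywords])
    return rows


keywords = ["gene", "polymorphism", "risk"]
train = to_feature_rows([["gene", "risk"], ["risk"]], keywords, labels=[+1, -1])
test = to_feature_rows([["polymorphism"]], keywords)
print(train)
print(test)
```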
Three metrics to evaluate SVM performance
\[ \text{Sensitivity } (Sn) = \frac{TP}{TP + FN} \]
\[ \text{Specificity } (Sp) = \frac{TN}{TN + FP} \]
\[ PPV = \frac{TP}{TP + FP} \]

SVM        Human Expert   Category
Positive   Positive       True Positive (TP)
Positive   Negative       False Positive (FP)
Negative   Negative       True Negative (TN)
Negative   Positive       False Negative (FN)

The classification of abstracts by human expert was used as the “gold standard” against which SVM classifications were evaluated
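The three metrics can be computed directly from paired SVM and expert labels. The following is a small sketch under the assumption that labels are coded +1 (positive) and -1 (negative); the function name and the choice to return `None` for an empty denominator are illustrative.

```python
def evaluate(svm_labels, expert_labels):
    """Return Sn, Sp and PPV, treating the expert labels as the gold standard."""
    tp = fp = tn = fn = 0
    for svm, expert in zip(svm_labels, expert_labels):
        if svm == 1 and expert == 1:
            tp += 1
        elif svm == 1 and expert == -1:
            fp += 1
        elif svm == -1 and expert == -1:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # None when the denominator is empty, rather than dividing by zero.
        return num / den if den else None

    return {
        "Sn": ratio(tp, tp + fn),    # sensitivity
        "Sp": ratio(tn, tn + fp),    # specificity
        "PPV": ratio(tp, tp + fp),   # positive predictive value
    }


print(evaluate([1, 1, -1, -1, 1], [1, -1, -1, 1, 1]))
```

Because relevant articles are rare in a week of PubMed records, even a small false positive rate can pull PPV well below sensitivity and specificity.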
Evaluation of SVM Performance with different keyword sets
[Bar chart: Average performance of SVM with different keyword sets as features. X axis: evaluation metrics (Sn, Sp, PPV). Y axis: percentage (0 to 100). Series: TFIDF top 100, TFIDF top 500, TFIDF top 750, TFIDF top 1010, TFIDF top 2010, Z-Score top 100, Z-Score top 500, Z-Score all 784.]
• Best results (93.6% average sensitivity, 91.45% average specificity, 50% average PPV) were obtained using the top 2010 keywords from TFIDF
SVM classification outperformed human expert classification
Upon re-evaluation, it was found that SVM picked up many more positive articles that the traditional curation process had missed
SVM performance with and without the complex query
SVM overall performance (with complex query)
Sensitivity 96.3%
Specificity 96.8%
PPV 80.6%
Training set: corrected training set of equal number of positives and negatives
Keyword set: union of results based on TFIDF top 2010 and Z-Score all 784
SVM performance directly on the PubMed abstracts (without complex query)
Sensitivity 89.7%
Specificity 98.4%
PPV 33.3%
PubMed search based on Entrez date (EDAT)
Three random dates chosen: 5/17/2001, 9/11/2001 and 7/13/2001
Entrez Programming Utilities - tools that provide access to Entrez data outside of the regular web query interface
Customized the ESearch and EFetch routines to get a list of Abstracts from PubMed data.
Examples:
In PubMed display PMIDs 12091962 and 9997 in html retrieval mode and abstract retrieval type: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12091962,9997&retmode=html&rettype=abstract
Search in PubMed for the term cancer for the entrez date from the last 60 days and retrieve the first 100 IDs and translations using the history parameter: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&reldate=60&datetype=edat&retmax=100&usehistory=y
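The two example requests can be assembled programmatically. The sketch below only builds the URLs, using the base address and parameters that appear in the examples; it makes no network call, and the function names are hypothetical rather than the customized ESearch and EFetch routines themselves.

```python
from urllib.parse import urlencode

# Base address as it appears in the examples.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, reldate, retmax, db="pubmed"):
    """URL that searches `db` for `term` over the last `reldate` days (Entrez date)."""
    params = {
        "db": db,
        "term": term,
        "reldate": reldate,
        "datetype": "edat",
        "retmax": retmax,
        "usehistory": "y",
    }
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids, db="pubmed"):
    """URL that fetches the abstracts for a list of PMIDs in HTML mode."""
    params = {
        "db": db,
        "id": ",".join(str(p) for p in pmids),
        "retmode": "html",
        "rettype": "abstract",
    }
    # safe="," keeps the comma-separated ID list readable in the URL.
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params, safe=",")


print(esearch_url("cancer", reldate=60, retmax=100))
print(efetch_url([12091962, 9997]))
```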
Entrez ETools
From Abstracts to Test Set
[Pipeline diagram]
Abstracts.txt
→ abstract1.txt, abstract2.txt, abstract3.txt, ...
→ Extract keywords + Stem keywords (Porter’s Stemming) + Stop List (frequent words not to be included: ’and’, ’or’, …) + Keyword ranking (Weighting)
→ Test Set
→ Run svm_classify.exe + Create a markup (HTML) file to display +ve and -ve abstracts
Stage labels shown in the diagram: PERL, C, C, PERL Code
SVM Tool (first prototype)
• We combined the notion of keyword extraction based on known ranking schemes of TFIDF and Z-Score and used the keywords as features to drive an SVM based classifier.
• SVM performed better than other supervised learning techniques. We obtained an accuracy of up to 96.8% using SVM.
• Our methodology outperformed a human expert working on the problem full-time for 4 years, by identifying an additional 20% of HuGE abstracts that were missed in human inspection.
Questions?
The kernel function: Mapping points in input space to feature space
A kernel K is a function such that for any points x, x' in the input space,
\[ K(x, x') = \phi(x) \cdot \phi(x') \]
where \phi is a mapping from the input space to a feature space.
Different Kernel Functions (KF)
Polynomial KF : \[ K_{poly}(x, x') = (x \cdot x')^d \]
Radial basis KF : \[ K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\sigma^2}\right) \]
Sigmoid KF : \[ K(x, x') = \tanh(k\, x \cdot x' + \theta) \]
User Defined KF : Any user defined function
[Figure: non linearly separable data; mapping data from input space to feature space]
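The kernel functions listed on this slide can be written out in plain Python as a rough sketch. The parameter names and defaults are assumptions, and the radial basis form used here divides the squared distance by sigma squared; other texts use 2 sigma squared, which only rescales sigma. The last function is an illustrative feature map for a degree 2 polynomial kernel in two dimensions, included to show the identity K(x, x') = phi(x) . phi(x') numerically.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, d=2):
    return dot(x, y) ** d


def rbf_kernel(x, y, sigma=1.0):
    # Assumed convention: exp(-||x - y||^2 / sigma^2).
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma ** 2)


def sigmoid_kernel(x, y, k=1.0, theta=0.0):
    return math.tanh(k * dot(x, y) + theta)


def phi_poly2(x):
    """Explicit feature map whose dot product equals (x . y)^2 for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = [1.0, 2.0], [3.0, 4.0]
print(polynomial_kernel(x, y, d=2))       # kernel evaluated in input space
print(dot(phi_poly2(x), phi_poly2(y)))    # same value via the feature space
```

The two printed values agree up to floating point rounding, which is the point of the kernel trick: the classifier can work in the higher dimensional feature space without ever computing phi explicitly.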
SVM performance with the corrected training set and with different training set sizes
SVM performance with corrected training set (percentage)

Training set                                   Sn     Sp     PPV
Previous training set (5363 +ve, 5362 -ve)     92.2   91     59.7
Corrected training set (5363 +ve, 5362 -ve)    95.4   92.3   62.8
[Line chart: Performance of SVM with different training set sizes. X axis: training set size (100, 500, 1000, 2000, 5000, 10724, 17050). Y axis: percentage (0 to 100). Series: Sensitivity, Specificity, PPV.]
Even with small training sets, SVM was able to pick up the right model parameters and gave reasonable results
Correcting the few errors in training set only slightly improved the results indicating robustness of SVM
Towards higher sensitivity in classification with a bias for positives in the training set
Average performance of SVM with different training sets (percentage)

Training set                            Sn      Sp      PPV
Training set 11400 +ve and 11300 -ve    93.75   91.5    50.5
Training set 11400 +ve and 5300 -ve     96.25   86.75   39.25
Twice the number of positives over negatives in the training set
Positives weighted two, four and eight fold compared to negatives in the training set
Sensitivity improved at the cost of specificity and PPV
Average SVM performance with positives weighted higher than negatives in training set (percentage)

Positive weights   Sn     Sp     PPV
2 fold             97.1   93     59
4 fold             98.8   73.8   28.1
8 fold             99.4   64.8   22.7
Union of results using keywords based on TFIDF and Z-Score methods
[Bar chart: Average performance of SVM from the union of results (TFIDF top 2010 & Z-Score all 784 results). X axis: evaluation metrics (Sn, Sp, PPV). Y axis: percentage (0 to 100). Series: TFIDF top 2010, Z-Score all 784, Union of results.]
SVM(TFIDF)      SVM(Z Score)    Human expert   Category
Positive(TP)    Positive(TP)    Positive       Positive(TP)
Positive(TP)    Negative(FN)    Positive       Positive(TP)
Negative(FN)    Positive(TP)    Positive       Positive(TP)
Positive(FP)    Positive(FP)    Negative       Positive(FP)
Negative(TN)    Negative(TN)    Negative       Negative(TN)
Negative(TN)    Positive(FP)    Negative       Negative(TN)
Positive(FP)    Negative(TN)    Negative       Negative(TN)
Negative(FN)    Negative(FN)    Positive       Negative(FN)
A sensitivity of 96.1%, specificity of 95.5% and PPV of 65.7% is achieved
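One way to read the combination table on this slide is sketched below. Under this reading, an article the expert marked positive counts as a true positive when either SVM run flags it, while an article the expert marked negative counts as a false positive only when both runs flag it. That reading is an interpretation of the table rows, and the function name is hypothetical.

```python
def combined_category(tfidf_positive, zscore_positive, expert_positive):
    """Category of one article after combining the TFIDF and Z-Score runs.

    All three arguments are booleans (True = classified as positive).
    """
    if expert_positive:
        # Either run finding the article is enough to count it as found.
        return "TP" if (tfidf_positive or zscore_positive) else "FN"
    # For expert-negative articles the table records FP only when both runs agree.
    return "FP" if (tfidf_positive and zscore_positive) else "TN"


for row in [(True, False, True), (False, True, False), (True, True, False)]:
    print(row, combined_category(*row))
```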