AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm...
-
Upload
lucas-rogers -
Category
Documents
-
view
218 -
download
0
Transcript of AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm...
AAAI05 Tutorial on Bioinformatics & Machine Learning
Jinyan Li & Limsoon WongInstitute for Infocomm Research
21 Heng Mui Keng TerraceSingapore
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
1. Intro to Biology & Bioinformatics
2. TIS Predictionfeature generation &
feature selection
3. Tx Optimization for Childhood ALL
emerging patterns & PCL
4. Analysis of Proteomics Data decision tree ensembles &
CS4
5. Prognosis based on Gene
Expression extreme sample selection &
ERCOF
6. Subgraphs in Protein
Interactions closed patterns
7. Mining Errors from Bio
Databasesassociation rules
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Tutorial Plan
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Coverage
• Data:– DNA sequences, gene expression profiles,
proteomics mass-spectra, protein sequences, protein structures, & biomedical text/annotations
• Feature selection: 2, t-statistics, signal-to-noise, entropy, CFS
• Machine learning and data mining: – Decision trees, SVMs, emerging patterns,
closed patterns, graph theories
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Body and Cell
• Our body consists of a number of organs
• Each organ composes of a number of tissues
• Each tissue composes of cells of the same type
• Cells perform two types of function– Chemical reactions
needed to maintain our life
– Pass info for maintaining life to next generation
• In particular– Protein performs
chemical reactions– DNA stores & passes info– RNA is intermediate
between DNA & proteins
DNA
• Stores instructions needed by the cell to perform daily life function
• Consists of two strands interwoven together and form a double helix
• Each strand is a chain of some small molecules called nucleotides
Francis Crick shows James Watson the model of DNA in their room number 103 of the Austin Wing at the Cavendish Laboratories, Cambridge
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
A C G T U
Classification of Nucleotides
• 5 diff nucleotides: adenine(A), cytosine(C), guanine(G), thymine(T), & uracil(U)
• A, G are purines. They have a 2-ring structure• C, T, U are pyrimidines. They have a 1-ring
structure• DNA only uses A, C, G, & T
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
A
T
10Å
G
C
10Å
Watson-Crick Rule
• DNA is double stranded in a cell • One strand is reverse complement of the
other• Complementary bases:
– A with T (two hydrogen-bonds)– C with G (three hydrogen-bonds)
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Chromosome
• DNA is usually tightly wound around histone proteins and forms a chromosome
• The total info stored in all chromosomes constitutes a genome
• In most multi-cell organisms, every cell contains the same complete set of chromosomes – May have some small diff due to mutation
• Human genome has 3G bases, organized in 23 pairs of chromosomes
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Gene
• A gene is a sequence of DNA that encodes a protein or an RNA molecule
• About 30,000 – 35,000 (protein-coding) genes in human genome
• For gene that encodes protein– In Prokaryotic genome, one gene corresponds
to one protein– In Eukaryotic genome, one gene may
correspond to more than one protein because of the process “alternative splicing”
Central Dogma
• Gene expression consists of two steps– Transcription
DNA mRNA
– Translation mRNA Protein
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Genetic Code• Start codon: ATG (code for M)• Stop codon: TAA, TAG, TGA
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Protein
• A sequence composed from an alphabet of 20 amino acids– Length is usually 20 to
5000 amino acids – Average around 350
amino acids
• Folds into 3D shape, forming the building block & performing most of the chemical reactions within a cell
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Classification of Amino Acids
• Amino acids can be classified into 4 types
• Positively charged (basic)– Arginine (Arg, R)– Histidine (His, H)– Lysine (Lys, K)
• Negatively charged (acidic)– Aspartic acid (Asp, D)– Glutamic acid (Glu, E)
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Classification of Amino Acids• Polar (overall
uncharged, but uneven charge distribution. can form hydrogen bonds with water. they are called hydrophilic)– Asparagine (Asn, N)– Cysteine (Cys, C)– Glutamine (Gln, Q)– Glycine (Gly, G)– Serine (Ser, S)– Threonine (Thr, T)– Tyrosine (Tyr, Y)
• Nonpolar (overall uncharged and uniform charge distribution. cant form hydrogen bonds with water. they are called hydrophobic)– Alanine (Ala, A)– Isoleucine (Ile, I)– Leucine (Leu, L)– Methionine (Met, M)– Phenylalanine (Phe, F)– Proline (Pro, P)– Tryptophan (Trp, W)– Valine (Val, V)
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
BioinformaticsApplications• Bio Data Searching• Gene finding• Cis-regulatory DNA• Gene/Protein Network• Protein/RNA Structure
Prediction• Evolutionary Tree
reconstruction• Infer Protein Function• Disease Diagnosis,
Prognosis, & Treatment Optimization, ...
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Part 2:Translation Initiation Site Recognition
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT............................................................ 80................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
A Sample cDNA
• What makes the second ATG the TIS?
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Approach
• Training data gathering• Signal generation
– k-grams, distance, domain know-how, ...
• Signal selection– Entropy, 2, CFS, t-test, domain know-how...
• Signal integration– SVM, ANN, PCL, CART, C4.5, kNN, ...
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences• 13503 ATG sites• 3312 (24.5%) are TIS• 10191 (75.5%) are non-TIS• Use for 3-fold x-validation expts
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Signal Generation
• K-grams (ie., k consecutive letters)– K = 1, 2, 3, 4, 5, …– Window size vs. fixed position– Up-stream, downstream vs. any where in window– In-frame vs. any frame
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Too Many Signals
• For each k, there are 4k * 3 * 2 k-grams• If we use k = 1, 2, 3, 4, 5, we have
24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
Need to do signal selection– t-stats, 2, CFS, signal-to-noise, entropy, gini
index, info gain, info gain ratio, ...
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Signal Selection Basic Idea
• Choose a signal w/ low intra-class distance• Choose a signal w/ high inter-class distance
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Signal Selection by T-Statistics
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Signal Selection by 2
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Sample K-GramsSelected by CFS
• Position –3• in-frame upstream ATG• in-frame downstream
– TAA, TAG, TGA, – CTG, GAC, GAG, and GCC
Kozak consensus
Leaky scanning
Stop codon
Codon bias?
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Signal Integration
• kNN– Given a test sample, find the k training
samples that are most similar to it. Let the majority class win
• SVM– Given a group of training samples from two
classes, determine a separating plane that maximises the margin of error
• Naïve Bayes, ANN, C4.5, CS4, ...
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
TP/(TP + FN) TN/(TN + FP) TP/(TP + FP) Accuracy
Naïve Bayes 84.3% 86.1% 66.3% 85.7%
SVM 73.9% 93.2% 77.9% 88.5%
Neural Network 77.6% 93.2% 78.8% 89.4%
Decision Tree 74.0% 94.4% 81.1% 89.4%
Results (3-fold x-validation)
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Technique Comparisons
• Pedersen&Nielsen [ISMB’97]
– Neural network– No explicit features
• Zien [Bioinformatics’00]
– SVM+kernel engineering
– No explicit features
• Hatzigeorgiou [Bioinformatics’02]
– Multiple neural networks
– Scanning rule– No explicit features
• This approach– Explicit feature
generation– Explicit feature
selection– Use any machine
learning method w/o any form of complicated tuning
– Scanning rule is optional
F
L
I
MV
S
P
T
A
Y
H
Q
N
K
D
E
C
WR
G
A
T
E
L
R
S
stop
How about using k-grams from the translation?
mRNAprotein
Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong
Amino-Acid Features
Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong
Amino-Acid Features
Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong
Amino Acid K-Grams Discovered (by entropy)
Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
ATGpr
Ourmethod
Validation Results (on Chr X and Chr 21)
• Using top 100 features selected by entropy and train SVM on Pedersen & Nielsen’s
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Part 3:Treatment Optimization of Childhood Leukemia
Image credit: FEER
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Childhood ALL
• Major subtypes are: – T-ALL, E2A-PBX, TEL-
AML, MLL genome rearrangements, BCR-ABL, Hyperdiploid>50
• Diff subtypes respond differently to same Tx
Over-intensive Tx – Development of
secondary cancers– Reduction of IQ
Under-intensiveTx – Relapse
• The subtypes look similar
• Conventional diagnosis– Immunophenotyping– Cytogenetics– Molecular diagnostics
Unavailable in most ASEAN countries
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Image credit: Affymetrix
Single-Test Platform ofMicroarray & Machine
Learning
Overall Strategy
Diagnosis of subtype
Subtype- dependentprognosis
Risk-stratifiedtreatmentintensity
• For each subtype, select genes to develop classification model for diagnosing that subtype
• For each subtype, select genes to develop prediction model for prognosis of that subtype
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Subtype Diagnosis by PCL
• Gene expression data collection• Gene selection by 2• PCL Classifier training by emerging
pattern• Classifier tuning (optional for some
machine learning methods)• Apply PCL for diagnosis of future cases
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
EmergingPatterns
• An emerging pattern is a set of conditions– usually involving several features– that most members of a class satisfy – but none or few of the other class satisfy
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
PCL: Prediction by Collective Likelihood
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Childhood ALL Subtype Diagnosis Workflow
A tree-structureddiagnostic workflow was recommended byour doctor collaborator
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Training and Testing Sets
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
The classifiers are all applied to the 20 genes selected by 2 at each level of the tree
Accuracy of Various Classifiers
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Understandability of EP & PCL
• E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs
• These give us the diagnostic rule
Conclusions
Conventional Tx:• intermediate intensity to everyone 10% suffers relapse 50% suffers side effects costs US$150m/yr
Our optimized Tx:• high intensity to 10%• intermediate intensity to 40%• low intensity to 50%• costs US$100m/yr
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
•High cure rate of 80%• Less relapse
• Less side effects• Save US$51.6m/yr
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Part 4:Topology of Protein Interaction Networks: Hubs, Cores, Bipartites
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Courtesy of JE Wampler
Protein Representations
• Four types of representations for proteins: primary, secondary, tertiary, and quaternay
• Protein interactions play an important role in inter cellular communication, in signal transduction, in the regulation of gene expression
Complex based
Computational Methods
Motif Discovery
Domain Domain
Interactions
Sequence based
Structure based
(Docking)
Experimental
Methods(Phage Display,
mutagenesis, two hybrid)
Interaction Determination Approaches
Copyright © 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Protein Interaction Databases
• PDB---Protein Databank (http://www.rcsb.org/pdb/; reference article: H.M. Berman, etal: The Protein Data Bank. NAR, 28 pp. 235-242 (2000) ) structure info; complex info.
• BIND---The Biomolecular Interaction Network Database (http://binddb.org; Bader et al. NAR, 29. 2001)
• The GRID: The General Repository for Interaction Datasets http://biodata.mshri.on.ca/grid/servlet/Index
• DIP---Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu/). Xml format.
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Maslov & Sneppen, Science, v296, 2002
A power law: Most neighbors of highly connected nodes have rather low connectivity. The highly connected nodes are called hubs
318 edges329 nodes
In nucleus ofS. cerevisiae
Graph Representation of Protein Interactions
Yeast SH3 domain-domainInteraction network:
394 edges, 206 nodesTong et al. Science, v295. 2002
8 proteins containing SH35 binding at least 6 of them
K-cores in Protein Interaction Graphs
Copyright © 2005 by Jinyan Li and Limsoon Wong
Bipartite Subgraphs
Copyright © 2005 by Jinyan Li and Limsoon Wong
SH3 Proteins SH3-Binding Proteins
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Some Definitions
• A graph G=VG, EG, where VG is a set of vertices and EG is a set of edges. No self loops, no parallel edges, undirected
• A graph H= VH, EH is a subgraph of G, if VH is contained in VG and EH is contained in EG
• H = V1V2, EH is bipartite subgraph of G if
– H is a subgraph of G, and
– V1V2 = {}, and
– every edge in EH joins a vertex in V1 and a vertex in V2
V1 V2
Example Bipartite Subgraph of Protein Interaction
Network
Copyright © 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Maximal Complete Bipartite Subgraphs
• A complete bipartite subgraph H = V1V2, EH of G is said to be maximal if there is no other complete bipartite subgraph H’ = V’1 V’2, EH’ of G such that V1 V’1 and V2 V’2
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
The problem previously studied by Epptein (1994, Info Proc Lett);Makino & Uno (2004, SWAT); Zaki & Ogihara (1998).
Study of Protein Interaction Network Topology
• List all maximal complete bipartite subgraphs from a protein interaction network (Step 1)
• Discover motifs from these protein groups (Step 2)
• Form binding motif pairs (Step 3)
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Efficient Mining of Maximal Complete Bipartite Subgraphs• Equiv to mining of closed patterns from
adjacency matrix of G• Why:
– the number of closed patterns is even, – is precisely double of the number of the
maximal complete bipartite subgraphs, and – there is one-to-one correspondence betw a
maximal complete bipartite subgraph & a closed pattern pair
• Many state-of-the-art algorithms for mining closed patterns
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
A is symmetrical, main diagonal is 0
Adjacency Matrix: A Special Transactional Database (TDB)• Let G be a graph with VG = {v1, v2, …, vp}
• Adjacency matrix A of G is a p x p matrix defined by
A[i,j] = 1 if (vi, vj) EG
0 otherwise
Support threshold = 2
)(CfgC TDBPasquier etal, ICDT 99; Grahne & Zhu,
FIMI03
TDB, Freq Patterns, Equiv Classes, & Closed
Patterns
Copyright © 2005 by Jinyan Li and Limsoon Wong
A protein interaction network taken from GRIDBreitkreutz, Genome Biol. 2003
4904 proteins, 17440 interactions, ave size of neighborhood is 3.56
Some Expt Results
Mining algorithm---FPclose* Grahne & Zhu, 2003
Copyright © 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Fixed Point Idea
• The idea: f(x)=x• Instantiation in life science, f(DNA)=DNA,
f(proteinType) = proteinType (Meng et al. Bioinformatics, 2004)
• The variable x can be a pair of AA segment patterns, then after the transformations by f, the stable pattern is binding motif pair. (Li and Li, Bioinformatics, 2005)
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Part 5:Treatment Prognosis for DLBC Lymphoma
Image credit: Rosenwald et al, 2002
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Diffuse Large B-Cell Lymphoma
• DLBC lymphoma is the most common type of lymphoma in adults
• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• Intl Prognostic Index (IPI) – age, “Eastern Cooperative
Oncology Group” Performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
• Not very good for stratifying DLBC lymphoma patients for therapeutic trials
Use gene-expression profiles to predict outcome of chemotherapy?
“extreme”sampleselection
ERCOF
Copyright © 2005 by Jinyan Li and Limsoon Wong
Our Approach
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Short-term Survivors v.s. Long-term Survivors
T: sampleF(T): follow-up time
E(T): status (1:unfavorable; 0: favorable)c1 and c2: thresholds of survival time
Short-term survivors who died within a short
period
F(T) < c1 and E(T) = 1
Long-term survivors who were alive after a
long follow-up time
F(T) > c2
Extreme Sample Selection
ERCOFEntropy-Based
Rank Sum Test
& Correlati
on Filtering
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
• llm(X) = sign(kk*Yk* (Xk•X) + b)
• svm(X) = sign(kk*Yk* (Xk• X) + b)
svm(X) = sign(kk*Yk* K(Xk,X) + b)
where K(Xk,X) = (Xk• X)
SVM Kernel Function
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
LinearLearningMachine
SVM =LLM +Kernel
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Linear Kernel SVM regression function
bixTKyaTG ii
i ))(,()(
T: test sample, x(i): support vector,yi: class label (1: short-term survivors; -1: long-term survivors)
Transformation function (posterior probability)
)(1
1)(
TGeTS
))1,0()(( TS
S(T): risk score of sample T
Risk Score Construction
Knowledge Discovery from Gene Expression of “Extreme” Samples
“extreme”sampleselection:< 1 yr vs > 8 yrs
knowledgediscovery from gene expression
240 samples
80 samples26 long-
term survivors
47 short-term survivors
7399genes
84genes
T is long-term if S(T) < 0.3
T is short-term if S(T) > 0.7 Copyright © 2005 by Jinyan Li and Limsoon Wong
From Rosenwald et al., NEJM 2002
80 test casesp-value of log-rank test: < 0.0001Risk score thresholds: 0.7, 0.3
Kaplan-Meier Plots
Copyright © 2005 by Jinyan Li and Limsoon Wong
Improvement over IPICases rated as IPI lowcan still be splitp-value = 0.0063
W/o sample selection (p =0.38)
No clear difference on the overall survival of the 80 samples in the validation group of DLBCL study, if no training sample selection conducted
Merit of “Extreme” Samples
Copyright © 2005 by Jinyan Li and Limsoon Wong
W/ sample selection (p < 0.0001)
Effectiveness of ERCOF
Copyright © 2005 by Jinyan Li and Limsoon Wong
22Total wins 38 46 47 48 51 47 67
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Conclusions
• Selecting extreme cases as training samples is an effective way to improve patient outcome prediction based on gene expression profiles and clinical information
• ERCOF is a systematic feature selection method that is very useful – ERCOF is very suitable for SVM, 3-NN, CS4,
Random Forest, as it gives these learning algos highest no. of wins
– ERCOF is suitable for Bagging also, as it gives this classifier the lowest no. of errors
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Part 6:Decision Tree Based In-Silico Cancer Diagnosis
• Bagging, • Boosting, • Random Forest, • Randomization Trees, & • CS4
A B A
B A
x1
x2
x4
x3
> a1
> a2
Structure of Decision Trees
• If x1 > a1 & x2 > a2, then it’s A class
• C4.5 & CART, two of the most widely used• Easy interpretation, but accuracy generally
unattractive
Leaf nodes
Internal nodes
Root node
B
A
Copyright © 2005 by Jinyan Li and Limsoon Wong
Elegance of Decision Trees
• Diagnosis of central nervous system in hematooncologic patients (Lossos et al, Cancer, 2000)
• Prediction of neurobehavioral outcome in head-injury survivors (Temin et al, J. of Neurosurgery, 1995)
• Reconstruction of molecular networks from gene expression data (Soinov et al, Genome Biology, 2003)
AB
B AA
Copyright © 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Single coverage & Fragmentation Problem
For Decision trees
Decision Tree Construction
• Steps– Select the “best”
feature as the root node of the whole tree
– After partition by this feature, select the best feature (wrt the subset of training data) as the root node of this sub-tree
– Recursively, until the partitions become pure or almost pure
• 3 Measures to Evaluate Which Feature is Best– Gini index– Information gain– Information gain ratio
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
• h1, h2, h3 are indep classifiers w/ accuracy = 60%
• C1, C2 are the only classes
• t is a test instance in C1
• h(t) = argmaxC{C1,C2} |{hj {h1, h2, h3} | hj(t) = C}|
• Then prob(h(t) = C1)
= prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1) +
prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2) +
prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1) +
prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)
= 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60%= 64.8%
Motivating Example
Bagging: Bootstrap Aggregating, Breiman 1996
50 p + 50 nOriginal training set
48 p + 52 n 49 p + 51 n 53 p + 47 n…A base inducer such as C4.5
A committee H of classifiers: h1 h2 …. hk
Draw 100 samples with replacement (injecting radomness)
Equal voting
Copyright © 2005 by Jinyan Li and Limsoon Wong
AdaBoost: Adaptive Boosting, Freund & Schapire 1995
100 instanceswith equal weight
A classifier h1
error
If error is 0 or >0.5 stop
Otherwise re-weight correctly classified instances by: e1/(1-e1)
Renormalize total weight to 1
100 instanceswith different weights
A classifier h2
error
Samples misclassifiedby hi
Copyright © 2005 by Jinyan Li and Limsoon Wong
Random Forest Breiman 2001
50 p + 50 nOriginal training set
48 p + 52 n 49 p + 51 n 53 p + 47 n…A base inducer (not C4.5 but revised)
A committee H of classifiers: h1 h2 …. hk
Equal voting
Copyright © 2005 by Jinyan Li and Limsoon Wong
Rootnode
Original n number offeatures
Selection is from mtry
number of randomly chosen features
A Revised C4.5 as Base Classifier
Similar to Bagging, but the base inducer is not the standard C4.5Make use twice of randomness
Copyright © 2005 by Jinyan Li and Limsoon Wong
Rootnode
Original n number of features
Select one randomly from{feature 1: choice 1,2,3 feature 2: choice 1, 2, . . . feature 8: choice 1, 2, 3} Total 20 candidates
Equal voting on the committee of such decision trees
Randomization Trees Dietterich 2000
Make use of randomness in the selection of the best split point
Copyright © 2005 by Jinyan Li and Limsoon Wong
1
2
k
tree-1
tree-2
tree-k
total k trees
root nodes
CS4: Cascading & Sharing for Decision Trees, Li et al 2003
• Selection of root nodes is in a cascading manner!
• Doesn’t make use of randomness• Not equal voting
Copyright © 2005 by Jinyan Li and Limsoon Wong
BaggingRandom
Forest
AdaBoost.M1
Randomization Trees
CS4
Rules may not be correct whenapplied to training data
Rules correct
Summary of Ensemble Classifiers
Copyright © 2005 by Jinyan Li and Limsoon Wong
Example of Decision Tree
• Yeoh et al., Cancer Cell 1:133-143, 2002; Differentiating MLL subtype from other subtypes of childhood leukemia
• Training data (14 MLL vs 201 others), Test data (6 MLL vs 106 others), Number of features: 12558
Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!
Copyright © 2005 by Jinyan Li and Limsoon Wong
Performance of CS4
• Whether top-ranked features have similar gain ratios
• Whether cascading trees have similar training performance
• Whether the trees have similar structure
• Whether the expanding tree committees can reduce the test errors gradually
• Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30; 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28.
Copyright © 2005 by Jinyan Li and Limsoon Wong
Discovery of Diagnostic Biomarkers for Ovarian
Cancer• Motivation: cure rate ~
95% if correct diagnosis at early stage
• Proteomic profiling data obtained from patients’ serum samples
• The first data set by Petricoin et al was published in Lancet, 2002
• Data set of June-2002.• 253 samples: 91 controls
and 162 patients suffering from the disease; 15154 features (proteins, peptides, precisely, mass/charge identities)
SVM: 0 errors; Naïve Bayes: 19 errors; k-NN: 15 errors.
Copyright © 2005 by Jinyan Li and Limsoon Wong
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Part 7:Mining Errors from Bio Databases
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Data Cleansing, Koh et al, DBiDB 2005
• 11 types & 28 subtypes of data artifacts– Critical artifacts (vector
contaminated sequences, duplicates, sequence structure violations)
– Non-critical artifacts (misspellings, synonyms)
• > 20,000 seq records in public contain artifacts
• Identification of these artifacts are impt for accurate knowledge discovery
• Sources of artifacts– Diverse sources of data
• Repeated submissions of seqs to db’s
• Cross-updating of db’s
– Data Annotation• Db’s have diff ways for data
annotation• Data entry errors can be introduced• Different interpretations
– Lack of standardized nomenclature
• Variations in naming• Synonyms, homonyms, & abbrevn
– Inadequacy of data quality control mechanisms
• Systematic approaches to data cleaning are lacking
A Classificati
on of Errors
Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic
RECORD
SINGLE SOURCE DATABASE
Invalid values
Ambiguity
Incompatible schema
ATTRIBUTE
Uninformative sequences
Undersized sequences
Annotation error
Dubious sequences
Sequence redundancy
Data provenance flaws
Cross-annotation error
Sequence structure violation
Vector contaminated sequence
Erroneous data transformation
MULTIPLESOURCE DATABASE
• Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases , 3,327 protein sequences are shorter than four residues (as of Sep, 2004).
• In Nov 2004, the total number of undersized protein sequences increases to 3,350.
• Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep, 2004).
• In Nov 2004, the total number of undersized nucleotide sequences increases to 1,711.
Undersized protein sequences in major databases
218 171
42
528
116 151
1015
364 383
12351
1253 2 120 0 23
0
200
400
600
800
1000
1200
1 2 3
Sequence Length
Num
ber
of r
ecor
ds
DDBJ
EMBL
GenBank
PDB
SwissProt
PIR
Undersized nucleotide sequences in major databases
2 3 924
5573 69
115
81
233228
108 108
5167
6
40 45
77104
0
50
100
150
200
250
1 2 3 4 5
Sequence LengthN
um
ber
of
reco
rds
DDBJ
EMBL
GenBank
PDB
Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic
Example Meaningless Seqs
RECORD
SINGLE SOURCE DATABASE
Invalid values
Ambiguity
Incompatible schema
ATTRIBUTE
Overlapping intron/exon
Annotation error
Dubious sequences
Sequence redundancy
Data Provenance flaws
Cross-annotation error
Sequence structure violation
Vector contaminated sequence
Erroneous data transformation
MULTIPLESOURCE DATABASE
• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• rpb7+ RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.
Example Overlapping Intron/Exon
Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic
RECORD
SINGLE SOURCE DATABASE
Invalid values
Ambiguity
Incompatible schema
ATTRIBUTE
Annotation error
Dubious sequences
Sequence redundancy
Data provenance flaws
Cross-annotation error
Sequence structure violation
Vector contaminated sequence
Erroneous data transformation
MULTIPLESOURCE DATABASE
Replication of sequence informationDifferent views
Overlapping annotations of the same sequence
Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
Example Seqs w/ Identical Info
Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic
Association Rule Mining
for De-duplication
Compute similarity scores from known duplicate pairs
Select matching criteria
Generate association rules
Detect duplicates using the rules
Copyright © 2005 by Jinyan Li and Limsoon Wong.
Features to Match
Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
AAG39642 AAG39643 AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0
AAG39642 Q9GNG8 AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0
P00599 PSNJ1W AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
P01486 NTSREB AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
O57385 CAA11159 AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0
S32792 P24663 AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0
P45629 S53330 AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
LE1.0 PD0 SQ1.0 (99.7%)SP1 PD0 SQ1.0 (97.1%)SP1 LE1.0 PD0 SQ1.0 (96.8%)DB0 PD0 SQ1.0 (93.1%)DB0 LE1.0 PD0 SQ1.0 (92.8%)DB0 SP1 PD0 SQ1.0 (90.4%)DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%)RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%)RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%)
Similarity scores of known
duplicate pairs
Association rule mining
Frequent item-set with support
Association Rule Mining
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Scorpion venom dataset containing 520 records
695 duplicate pairs are collectively identified.
Snake PLA2 venom dataset containing 780 records
Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)
251 duplicate pairs 444 duplicate pairs
scorpion AND (venom OR toxin) serpentes AND venom AND PLA2
Expert annotation
Dataset
Duplicates detected by association rules
6
36.3
0.3
49.4
5.2
32.7
0.11.8 2.4 3.8
5.77.5 7.9 9.4
0
10
20
30
40
50
60
Association rulesF
P%
an
d F
N%
FP% FN% x 1000
S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)Rule 7
S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)Rule 6
S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)Rule 5
S(Seq)=1^ M(PDB)=0 ^ M(DB)=0 (93.1%)Rule 4
S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)Rule 3
S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)Rule 2
S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)Rule 1
Results
Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic
Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates.
Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.
Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates.
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Acknowledgements
TIS Prediction
• Huiqing Liu, Roland Yap, Fanfan Zeng
Treatment Optimization for Childhood ALL
• James Downing, Huiqing Liu, Allen Yeoh
Prognosis based on Gene Expression Profiling
• Huiqing Liu
Analysis of Proteomics Data
• Huiqing Liu
Subgraphs in Protein Interactions
• Haiquan Li, Donny Soh, Chris Tan, See Kiong Ng
Mining Errors from Bio Databases
• Vladimir Brusic, Judice Koh, Mong Li Lee