Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics.


Limsoon Wong
Laboratories for Information Technology, Singapore

From Datamining to Bioinformatics

What is Bioinformatics?

Themes of Bioinformatics

Bioinformatics = Data Mgmt + Knowledge Discovery

Data Mgmt = Integration + Transformation + Cleansing

Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics

To the patient: Better drug, better treatment

To the pharma: Save time, save cost, make more $

To the scientist: Better science

From Informatics to Bioinformatics

Integration Technology (Kleisli)

Cleansing & Warehousing (FIMM)

MHC-Peptide Binding (PREDICT)

Protein Interactions Extraction (PIES)

Gene Expression & Medical Record Datamining (PCL)

Gene Feature Recognition (Dragon)

Venom Informatics

1994 1996 1998 2000 2002

8 years of bioinformatics R&D in Singapore

ISS KRDL LIT

Quick Samplings

Epitope Prediction

TRAP-559AA:
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSEEVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLNLNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRSLLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVILTDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNRFLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEKTASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQCEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENIIDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQKPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDNQNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGNRHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHEKPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVPGAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results

Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity.

Prediction by BIMAS matrix for HLA-A*1101 (candidates ranked 1 to 100), number of experimental binders per rank segment: 19 (52.8%), 5 (13.9%), 12 (33.3%).

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis

Looking for patterns that are valid, novel, useful, and understandable

age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y

Gene Expression Analysis

Classifying gene expression profiles: find stable differentially expressed genes, find significant gene groups, derive coordinated gene expression

Medical Record & Gene Expression Analysis Results

PCL, a novel “emerging pattern” method

Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks

Works well for gene expressions

Cancer Cell, March 2002, 1(2)

Behind the Scene

Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang

Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, and many more:

students, folks from geneticXchange, MolecularConnections, and other collaborators….

Questions?

A More Detailed Account

Jonathan’s rules: Blue or Circle. Jessica’s rules: All the rest.

What is Datamining?

Whose block is this?

Jonathan’s blocks

Jessica’s blocks

What is Datamining?

Question: Can you explain how?

The Steps of Data Mining

Training data gathering

Signal generation: k-grams, colour, texture, domain know-how, ...

Signal selection: entropy, χ², CFS, t-test, domain know-how, ...

Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Recognition


A Sample cDNA

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

What makes the second ATG the translation initiation site?

Signal Generation

K-grams (i.e., k consecutive letters), k = 1, 2, 3, 4, 5, …
Window size vs. fixed position
Upstream, downstream vs. anywhere in window
In-frame vs. any frame
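As a concrete illustration, the signal-generation step can be sketched in Python. The function name, window size, and feature-naming scheme below are my own illustrative choices, not the exact ones used in the talk:

```python
# Sketch: count in-frame k-grams in windows upstream and downstream of a
# candidate ATG.  Stepping by 3 keeps only in-frame k-grams.
from collections import Counter

def kgram_features(seq, atg_pos, k=3, window=99):
    """Count in-frame k-grams up- and downstream of the ATG at atg_pos."""
    up = seq[max(0, atg_pos - window):atg_pos]
    down = seq[atg_pos + 3:atg_pos + 3 + window]
    feats = Counter()
    for region, label in ((up, "up"), (down, "down")):
        for i in range(0, len(region) - k + 1, 3):   # in-frame only
            feats[f"{label}:{region[i:i+k]}"] += 1
    return feats

# ATG starts at index 6 of this toy sequence:
f = kgram_features("GCCGCCATGGCTGAATAA", 6, k=3, window=6)
```

Here `f` records, for example, that the codon GCC occurs twice in-frame upstream of the ATG.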

[Bar chart: k-gram frequencies of A, C, G, T for seq1, seq2, seq3]

Too Many Signals

For each value of k, there are 4^k × 3 × 2 k-grams

If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!

This is too many for most machine learning algorithms
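The slide's arithmetic checks out under the reading that each k contributes 4^k k-grams times 3 frame options times 2 direction options, with the leading 4 presumably counting the four single-letter features on their own (my interpretation of the slide, not stated in it):

```python
# Verify the feature count from the slide:
# 4 + 24 + 96 + 384 + 1536 + 6144 = 8188
terms = [4] + [4**k * 3 * 2 for k in range(1, 6)]
total = sum(terms)
```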

Signal Selection (Basic Idea)

Choose a signal w/ low intra-class distance.
Choose a signal w/ high inter-class distance.

Which of the following 3 signals is good?

Signal Selection (e.g., t-statistics)
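The formula on this slide did not survive the transcript; below is the standard unequal-variance two-sample t-score commonly used to rank how well a signal separates two classes (a sketch, not necessarily the exact variant used in the talk):

```python
# t = |mean(xs) - mean(ys)| / sqrt(var(xs)/n + var(ys)/m)
# A signal with low intra-class spread and high inter-class gap scores high.
from statistics import mean, variance

def t_score(xs, ys):
    """Two-sample t-score of a signal's values in classes xs and ys."""
    n, m = len(xs), len(ys)
    se = (variance(xs) / n + variance(ys) / m) ** 0.5
    return abs(mean(xs) - mean(ys)) / se

# Well-separated classes score far higher than overlapping ones:
good = t_score([5.0, 5.1, 4.9], [1.0, 1.1, 0.9])
bad = t_score([5.0, 1.0, 3.0], [4.9, 1.1, 3.1])
```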

Signal Selection (e.g., MIT-correlation)

Signal Selection (e.g., χ²)
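For a binary signal (e.g., "k-gram present / absent") against two classes, the χ² score comes from the usual 2×2 contingency-table shortcut; a sketch, with made-up counts:

```python
# chi2 = N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))  for the 2x2 table
#            signal present   signal absent
# class 1         a                b
# class 2         c                d
def chi2_2x2(a, b, c, d):
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

# Signal present in 30/40 class-1 samples but only 5/40 class-2 samples:
score = chi2_2x2(30, 10, 5, 35)
```

A signal distributed identically in both classes scores zero.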

Signal Selection (e.g., CFS)

Instead of scoring individual signals, how about scoring a group of signals as a whole?

CFS: A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other.

Homework: find a formula that captures the key idea of CFS above

Sample k-grams Selected

Position –3                        → Kozak consensus
In-frame upstream ATG              → Leaky scanning
In-frame downstream TAA, TAG, TGA  → Stop codon
CTG, GAC, GAG, and GCC             → Codon bias

Signal Integration

kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.

SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin.

Naïve Bayes, ANN, C4.5, ...
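The kNN rule described above can be sketched in a few lines (Euclidean distance and the helper name are illustrative choices):

```python
# Minimal kNN: find the k training samples nearest the test sample
# and let the majority class among them win.
from collections import Counter

def knn_classify(train, test_point, k=3):
    """train: list of (feature_vector, label) pairs."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    nearest = sorted(train, key=lambda s: dist(s[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B")]
label = knn_classify(train, (0.5, 0.5), k=3)
```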

Results (on Pedersen & Nielsen’s mRNA)

Method          Sensitivity   Specificity   Precision     Accuracy
                TP/(TP + FN)  TN/(TN + FP)  TP/(TP + FP)
Naïve Bayes     84.3%         86.1%         66.3%         85.7%
SVM             73.9%         93.2%         77.9%         88.5%
Neural Network  77.6%         93.2%         78.8%         89.4%
Decision Tree   74.0%         94.4%         81.1%         89.4%
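The column headings above are the standard evaluation metrics; for reference, they can be computed from a confusion matrix as follows (the counts in the example are made up for illustration, not taken from the table):

```python
# Standard classifier metrics from TP/TN/FP/FN counts.
def metrics(tp, tn, fp, fn):
    return {
        "sensitivity": tp / (tp + fn),   # TP/(TP+FN)
        "specificity": tn / (tn + fp),   # TN/(TN+FP)
        "precision":   tp / (tp + fp),   # TP/(TP+FP)
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }

m = metrics(tp=74, tn=188, fp=12, fn=26)
```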

Acknowledgements

Roland Yap, Zeng Fanfan, A.G. Pedersen, H. Nielsen

Questions?

Common Mistakes

Self-fulfilling Oracle

Consider this scenario:
Given classes C1 and C2 w/ explicit signals.
Use χ² on C1 and C2 to select signals s1, s2, s3.
Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%.
Is the accuracy really 90%? What can be wrong with this?

Phil Long’s Experiment

Let there be classes C1 and C2 w/ 100000 features having randomly generated values.
Use χ² to select 20 features.
Run k-fold x-validation on C1 and C2 w/ these 20 features.
Expect: 50% accuracy. Get: 90% accuracy!
Lesson: choose features within each fold.
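A shrunken, pure-Python version of this experiment makes the leak visible: the features are purely random noise, yet selecting them on the full data before cross-validation inflates accuracy well above chance. Sizes (2000 features instead of 100000), the mean-gap feature score standing in for χ², and the nearest-centroid classifier are all simplifications of mine:

```python
# Demonstrate the selection-bias trap: "cheating" feature selection sees
# the test folds, "fair" selection is redone inside each fold.
import random

random.seed(0)
n, d, topk = 40, 2000, 20
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [i % 2 for i in range(n)]          # labels independent of features

def select(rows, labels):
    """Pick the topk features with the largest class-mean gap."""
    def gap(j):
        a = [r[j] for r, l in zip(rows, labels) if l == 1]
        b = [r[j] for r, l in zip(rows, labels) if l == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(d), key=gap, reverse=True)[:topk]

def centroid_acc(feats, train_idx, test_idx):
    """Nearest-class-centroid accuracy using only the chosen features."""
    cent = {}
    for c in (0, 1):
        idx = [i for i in train_idx if y[i] == c]
        cent[c] = [sum(X[i][j] for i in idx) / len(idx) for j in feats]
    hits = 0
    for i in test_idx:
        dist = {c: sum((X[i][j] - cent[c][f]) ** 2
                       for f, j in enumerate(feats)) for c in (0, 1)}
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(test_idx)

folds = [list(range(f * 10, (f + 1) * 10)) for f in range(4)]
cheat_feats = select(X, y)             # selection sees ALL data, test folds too
biased, unbiased = [], []
for test_idx in folds:
    train_idx = [i for i in range(n) if i not in test_idx]
    fair_feats = select([X[i] for i in train_idx], [y[i] for i in train_idx])
    biased.append(centroid_acc(cheat_feats, train_idx, test_idx))
    unbiased.append(centroid_acc(fair_feats, train_idx, test_idx))

biased_acc = sum(biased) / 4           # inflated, well above 50%
unbiased_acc = sum(unbiased) / 4       # hovers around chance
```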

Apples vs Oranges

Consider this scenario:
Fanfan reported 89% accuracy on his TIS prediction method.
Hatzigeorgiou reported 94% accuracy on her TIS prediction method.
So Hatzigeorgiou's method is better.
What is wrong with this conclusion?

Apples vs Oranges

Differences in datasets used: Fanfan's expt used Pedersen's dataset; Hatzigeorgiou used her own dataset.

Differences in counting: Fanfan's expt was on a per-ATG basis; Hatzigeorgiou's expt used the scanning rule and thus was on a per-cDNA basis.

When Fanfan ran the same dataset and counted the same way as Hatzigeorgiou, he got 94% too!

Questions?