The Challenge of Predicting Gene Function Ross D. King Department of Computer Science University of Wales, Aberystwyth
Date posted: 18-Dec-2015

Page 1: The Challenge of Predicting Gene Function n Ross D. King n Department of Computer Science n University of Wales, Aberystwyth.

The Challenge of Predicting Gene Function

Ross D. King Department of Computer Science University of Wales, Aberystwyth

Gene Function Prediction

The most important revelation from the sequenced genomes is that the functions of typically only 60-70% of the predicted genes are known with any confidence.

The new science of functional genomics is dedicated to determining the function of genes of unassigned function, and to further detailing the function of genes with purported function.

Data Mining Prediction

We have developed a method for predicting the functional class of gene products based on ILP/Relational data mining.

The idea is to learn a reliable predictive function from examples of genes whose products have known function.

We then apply this function to genes whose functional class is unknown.

We call this approach: Data Mining Prediction (DMP).

Predicting Gene Function in Yeast

We will demonstrate our approach using ORFs in yeast (Saccharomyces cerevisiae).

● Using the MIPS functional classification scheme
● For those ORFs whose function is currently unknown
● Using 5 types of data:

1. Sequence statistics
2. Homology (sequence similarity)
3. Predicted secondary structure
4. Expression (microarray)
5. Phenotype

We want to map from sequence to function class

[Figure: four sequences (Sequence 1 to Sequence 4) mapped onto two function classes (Function Class 1 and Function Class 2).]

Classification Schemes 1

MIPS/GeneOntology

1,0,0,0 "METABOLISM"
2,0,0,0 "ENERGY"
3,0,0,0 "CELL CYCLE AND DNA PROCESSING"
4,0,0,0 "TRANSCRIPTION"
5,0,0,0 "PROTEIN SYNTHESIS"
6,0,0,0 "PROTEIN FATE (folding, modification, destination)"
8,0,0,0 "CELLULAR TRANSPORT AND TRANSPORT MECHANISMS"
10,0,0,0 "CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM"
11,0,0,0 "CELL RESCUE, DEFENSE AND VIRULENCE"
13,0,0,0 "REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT"
14,0,0,0 "CELL FATE"
29,0,0,0 "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS"
30,0,0,0 "CONTROL OF CELLULAR ORGANIZATION"
40,0,0,0 "SUBCELLULAR LOCALISATION"
62,0,0,0 "PROTEIN ACTIVITY REGULATION"
63,0,0,0 "PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT"
67,0,0,0 "TRANSPORT FACILITATION"
98,0,0,0 "CLASSIFICATION NOT YET CLEAR-CUT"
99,0,0,0 "UNCLASSIFIED PROTEINS"

Classification Schemes 2

1,0,0,0 "METABOLISM"
1,1,0,0 "amino acid metabolism"
1,2,0,0 "nitrogen and sulfur metabolism"
1,3,0,0 "nucleotide metabolism"
1,4,0,0 "phosphate metabolism"
1,5,0,0 "C-compound and carbohydrate metabolism"
1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0 "secondary metabolism"

Hierarchy of classes

Classification Schemes 3

1,0,0,0 "METABOLISM"
1,1,0,0 "amino acid metabolism"
1,1,1,0 "amino acid biosynthesis"
1,1,4,0 "regulation of amino acid metabolism"
1,1,7,0 "amino acid transport"
1,1,10,0 "amino acid degradation (catabolism)"
1,1,99,0 "other amino acid metabolism activities"
1,2,0,0 "nitrogen and sulfur metabolism"
1,3,0,0 "nucleotide metabolism"
1,4,0,0 "phosphate metabolism"
1,5,0,0 "C-compound and carbohydrate metabolism"
1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0 "secondary metabolism"

... and ORFs may have multiple functions too!

Hierarchy of classes
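
The four-part codes listed on these slides carry the hierarchy themselves: a class's level appears to be the number of leading non-zero fields, and its parents are obtained by zeroing trailing fields. A minimal Python sketch under that assumption (the function names are illustrative and do not come from the slides):

```python
def parse_code(code):
    """Turn a code such as '1,1,4,0' into a tuple of ints."""
    return tuple(int(part) for part in code.split(","))


def level(code):
    """Level in the hierarchy = number of leading non-zero fields."""
    depth = 0
    for field in parse_code(code):
        if field == 0:
            break
        depth += 1
    return depth


def ancestors(code):
    """Proper ancestors of a class, most general first."""
    fields = parse_code(code)
    result = []
    for depth in range(1, level(code)):
        parent = fields[:depth] + (0,) * (len(fields) - depth)
        result.append(",".join(str(f) for f in parent))
    return result


def expand(labels):
    """Close a set of class codes under the ancestor relation.

    Useful because an ORF may carry several classes at once.
    """
    closed = set(labels)
    for code in labels:
        closed.update(ancestors(code))
    return closed


print(level("1,1,4,0"))      # 3
print(ancestors("1,1,4,0"))  # ['1,0,0,0', '1,1,0,0']
```

Expanding a label set this way gives one consistent set of labels per level, which is what a level-by-level learner needs.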

Sequence Data

478 attributes in total

field | description | type
aa_rat_X | % of amino acid X in the protein | real
seq_len | length of the protein sequence | int
aa_rat_pair_X_Y | % of the amino acids X and Y consecutively | real
mol_wt | molecular weight of the protein | int
theo_pI | theoretical pI (isoelectric point) | real
atomic_comp_X | atomic composition of X (C,H,N,O,S) | real
aliphatic_index | aliphatic index | real
hydro | grand average of hydropathy | real
strand | the DNA strand | 'w' or 'c'
position | the number of exons (no. of start positions) | int
cai | codon adaptation index | real
motifs | number of PROSITE motifs | int
tmSpans | number of transmembrane spans | int
chromosome | chromosome number | 1..16,mit
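
As an illustration of how the composition attributes could be derived, here is a small Python sketch for aa_rat_X and aa_rat_pair_X_Y. The slides do not say which denominator is used for the pair percentages; this sketch assumes the number of adjacent pairs, and the function names are hypothetical.

```python
from collections import Counter


def aa_ratios(seq):
    """Percentage of each amino acid in the sequence (aa_rat_X)."""
    seq = seq.lower()
    if not seq:
        return {}
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def aa_pair_ratios(seq):
    """Percentage of each consecutive pair X,Y (aa_rat_pair_X_Y).

    Assumption: the denominator is the number of adjacent pairs.
    """
    seq = seq.lower()
    if len(seq) < 2:
        return {}
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {pair: 100.0 * n / total for pair, n in pairs.items()}


print(aa_ratios("aabc"))      # {'a': 50.0, 'b': 25.0, 'c': 25.0}
print(aa_pair_ratios("aab"))  # {'aa': 50.0, 'ab': 50.0}
```

With 20 amino acids this gives 20 single and 400 pair attributes, which accounts for most of the 478 attributes mentioned on the slide.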

Homology data

YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....

[Figure: the sequence is searched with PSI-BLAST against the NRDB sequence database, giving a list of hits.]

gene     organism           score
tfc      baker's yeast      0.0
sfc3     fission yeast      1.0e-18
wsv442   white spot virus   2.1
cg9463   fruit fly          2.9
f1l3     Arabidopsis        3.0

sfc3: keyword(membrane), length(358), dbref(prosite), dbref(embl)

We look up the associated information from SwissProt

Predicted Secondary Structure Data

mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...

We record length and relative positions of the secondary structure elements.

This is relational data.
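
A rough Python sketch of how the length and relative position of each element could be read off a predicted-structure string such as the one on this slide (a = alpha, b = beta, c = coil). The tuple layout and function names are assumptions made for illustration only.

```python
from itertools import groupby


def elements(ss):
    """Run-length encode a secondary-structure string.

    Returns a list of (type, start, length) with 0-based start positions.
    """
    result = []
    position = 0
    for kind, run in groupby(ss):
        length = len(list(run))
        result.append((kind, position, length))
        position += length
    return result


def neighbours(ss):
    """Pairs of adjacent elements, e.g. a coil followed by an alpha helix."""
    els = elements(ss)
    return list(zip(els, els[1:]))


print(elements("ccaaab"))
# [('c', 0, 2), ('a', 2, 3), ('b', 5, 1)]
```

Each ORF yields a different number of elements and neighbour pairs, which is why this data does not fit a single fixed-width table.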

Expression Data

Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)

• Microarray experiments to measure expression changes in yeast under a variety of conditions, including cell cycle, heat shock, diauxic shift.

• Short time series data, numerical-valued

          0       7       14      21
YBR166C   0.33   -0.17    0.04   -0.07
YOR357C  -0.64   -0.38   -0.32   -0.29
YLR292C  -0.23    0.19   -0.36    0.14
YGL112C  -0.69   -0.89   -0.74   -0.56
...

Phenotype Data

• Data from knockout gene growth experiments
• Many missing values
• 69 attributes x 1461 ORFs of known function
• 991 genes of unknown function
• Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)

s = sensitive (less growth)
w = wild-type (no observable effect)
r = resistant (more growth)
n = no data

growth medium      deleted ORF
                   YAL001C  YAL019W  YAL021C  YAL029C
calcofluor white   w        n        n        n
sorbitol           n        s        n        w
benomyl            n        w        n        w
...
H2O2               w        w        n        r

What are the Machine Learning Issues?

• Large volume of data
• Missing data
• Accurate results required
• Intelligible results required
• Class hierarchy
• Multiple labels
• Relational data

Relational vs Propositional

orf       time0   time7   time14
yal001c   0.34    0.52    0.48
yal002w   0.76    0.82    0.89
yal003w   0.77    0.46    0.78
yal004c   0.38    0.50    0.49

orf       SwissProtID   e-val
yal001c   p03415        2e-4
yal001c   p08640        8e-58
yal002w   p32583        6e-52
yal002w   p08775        3e-42

SwissProtID   keyword
p03415        apoptosis
p03415        repeat
p03415        zinc
p08640        membrane

Propositional: single table, fixed number of columns/attributes

Relational: multiple tables, multiple values

Data Mining Prediction (DMP)

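[Figure: DMP workflow. The entire database is split 2/3 : 1/3 into data for rule creation and test data. The data for rule creation is split 2/3 : 1/3 again into training data and validation data. PolyFARM and C4.5 rule generation on the training data give all rules; the best rules are selected using the validation data; rule accuracy is measured on the test data to give the results.]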
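
The nested 2/3 : 1/3 split in the DMP diagram can be sketched in a few lines of Python. This is only an outline under stated assumptions: a random shuffle with a fixed seed is assumed, and the function names are hypothetical; the slides do not say how ORFs were assigned to each part.

```python
import random


def split_two_thirds(items, rng):
    """Shuffle and split into (first 2/3, remaining 1/3)."""
    items = list(items)
    rng.shuffle(items)
    cut = (2 * len(items)) // 3
    return items[:cut], items[cut:]


def dmp_split(orfs, seed=0):
    """Return (training, validation, test) following the DMP diagram."""
    rng = random.Random(seed)
    # Outer split: data for rule creation vs held-out test data.
    rule_creation, test = split_two_thirds(orfs, rng)
    # Inner split: training data vs validation data.
    training, validation = split_two_thirds(rule_creation, rng)
    return training, validation, test


training, validation, test = dmp_split(range(9))
print(len(training), len(validation), len(test))  # 4 2 3
```

Rules are generated on the training part, filtered on the validation part, and scored once on the test part, so the reported accuracy is not inflated by rule selection.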

Warmr

Warmr is an ILP algorithm developed by Dehaspe et al.

It is an ILP version of the well-known Apriori data mining algorithm.

It is designed to find frequent patterns in a Datalog database.
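
For readers unfamiliar with Apriori, the levelwise idea can be sketched in Python as follows. This is plain propositional Apriori over sets of items, not Warmr: Warmr searches first-order (Datalog) queries rather than itemsets. The example data in the usage line is invented.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while current:
        # Count the candidates of size k and keep the frequent ones.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Build size k+1 candidates from frequent size k sets, then prune
        # any candidate that has an infrequent subset of size k.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


baskets = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
print(sorted("".join(sorted(s)) for s in apriori(baskets, 0.5)))
# ['a', 'ab', 'ac', 'b', 'c']
```

The pruning step is what keeps the search tractable: a pattern is only counted if every more general pattern was already found to be frequent.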

PolyFARM

• First-order association rule mining
• Finding all frequent first-order patterns in the data
• Distributed on a Beowulf cluster
• 47,034 homology patterns, f > 5%
• 19,628 structure patterns, f > 2%

[Clare & King PADL 2003]

hom(SPID, close) ^ sq_len(SPID, short) ^ classification(SPID, ecoli)
A close homology to a short protein in E. coli

struc(Pos1, a) ^ neighbour(Pos1, Pos2, c) ^ neighbour(Pos2, Pos3, a) ^ coil_dist(high)
Contains alpha-coil-alpha with a high overall coil distribution

Propositionalisation

          patt1  patt2  patt3  patt4  ...  patt47034
YAL001C   0      1      0      0      ...  1
YAL002W   0      1      1      0      ...  1
YAL003W   1      0      0      1      ...  0
YAL004W   1      1      0      0      ...  1
YAL005C   0      0      0      0      ...  1
...

Transforming relational data into boolean attributes
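
A toy Python sketch of this transformation, using the homology and keyword facts shown on the "Relational vs Propositional" slide. The two patterns and the e-value threshold of 1e-50 are invented for illustration; the real patterns are the frequent first-order patterns found by PolyFARM.

```python
# Relational facts copied from the "Relational vs Propositional" slide.
homology = {
    "yal001c": [("p03415", 2e-4), ("p08640", 8e-58)],
    "yal002w": [("p32583", 6e-52), ("p08775", 3e-42)],
}
keywords = {
    "p03415": {"apoptosis", "repeat", "zinc"},
    "p08640": {"membrane"},
}


# Hypothetical patterns: each asks "does some homolog exist such that ...?"
def has_membrane_homolog(orf):
    return any("membrane" in keywords.get(sp, set())
               for sp, _ in homology.get(orf, []))


def has_close_homolog(orf, threshold=1e-50):
    return any(e_val <= threshold for _, e_val in homology.get(orf, []))


patterns = [has_membrane_homolog, has_close_homolog]


def propositionalise(orfs):
    """One row of 0/1 attributes per ORF, one column per pattern."""
    return {orf: [int(p(orf)) for p in patterns] for orf in orfs}


print(propositionalise(["yal001c", "yal002w"]))
# {'yal001c': [1, 1], 'yal002w': [0, 1]}
```

Once every ORF has a fixed-length boolean row, a propositional learner such as C4.5 can be applied directly.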

Dichotomic Search 1

As an alternative to the WARMR data-mining approach, we developed a frequent pattern finding method based on dichotomic search.

This approach uses domain-specific logics as intermediates between propositional logic and predicate logic.

Dichotomic Search 2

Most existing algorithms traverse the search space in either a top-down or a bottom-up fashion. We propose a new approach based on dichotomic search, which explores the search space in both directions, allowing larger steps.

Dichotomic search combines completeness (w.r.t. concepts), non-redundancy, and flexibility.

Ferre, S. & King, R.D. (2005). Fundamenta Informaticae

C4.5

Open source decision tree algorithm
• propositional learning
• commonly used
• produces interpretable rules
• reliable
• fast
• accurate

Made modifications for:
• multiple labels
• hierarchical labels

[Clare & King Bioinformatics 2002]
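
One way a decision-tree split criterion can be adapted to multiple labels is to sum a binary entropy over the classes, so that an example carrying several labels contributes to each of them. The Python sketch below shows that idea together with a plain information-gain calculation for a numeric threshold. It is an assumption that this matches the modification used in the cited work, and C4.5 proper uses gain ratio rather than raw gain.

```python
import math


def multilabel_entropy(label_sets, classes):
    """Sum over classes of the binary entropy of 'has class c'."""
    n = len(label_sets)
    total = 0.0
    for c in classes:
        p = sum(c in labels for labels in label_sets) / n
        for x in (p, 1.0 - p):
            if x > 0.0:
                total -= x * math.log2(x)
    return total


def information_gain(values, label_sets, classes, threshold):
    """Entropy reduction from splitting on value <= threshold."""
    left = [l for v, l in zip(values, label_sets) if v <= threshold]
    right = [l for v, l in zip(values, label_sets) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(label_sets)
    remainder = (len(left) / n) * multilabel_entropy(left, classes)
    remainder += (len(right) / n) * multilabel_entropy(right, classes)
    return multilabel_entropy(label_sets, classes) - remainder


labels = [{"metabolism"}, {"metabolism"}, {"transport"}, {"transport"}]
print(information_gain([1, 2, 3, 4], labels, ["metabolism", "transport"], 2))
# 2.0
```

In the usage line the threshold separates the two classes perfectly, so the gain equals the full starting entropy (one bit per class).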

[Figure: example decision tree with tests on aa_ratio_pair_p_y, strand (w / c) and aa_rat_a, numeric thresholds at 6.4 and 0.232, and leaf classes metabolism, transport, cell fate and transcription.]

Results

• Many rules from each data type
• Rules at each level of hierarchy
• Some classes are much easier to predict than others (for example "protein synthesis" at 71-93%, "energy" at 20-47%)
• Good levels of accuracy on held out test data
• Many predictions for ORFs of unknown function (some function at some level is predicted for 96% of the ORFs of unknown function)
• Some rules explainable by biology -> scientific knowledge discovery

Clare & King (2003) Bioinformatics suppl. 2, 42-49

Accuracy Table

Datatype   Level 1   Level 2   Level 3   Level 4   all
Seq        55        55        33        0         71
Struc      49        43        0         0         58
Hom        65        38        69        20        55
Expr       42        37        35        0         75
Phen       75        40        7         0         68

Expression Data Rule

If in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
and in the micro-array experiment (YPD stationary phase) the ORF expression is > -1.06
then the function of this ORF is "pheromone response, mating type determination, sex-specific proteins"

Accuracy on training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made
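
Written as code, the rule above is just a conjunction of threshold tests, and the quoted accuracies are the fraction of matching ORFs that have the predicted class. A small Python sketch (the dictionary keys and the example expression values are hypothetical):

```python
def pheromone_rule(expr):
    """True if the ORF matches the expression rule on this slide.

    expr maps an experiment name to the ORF's expression value.
    """
    return (expr["sorbitol incubation"] > -0.25
            and expr["nitrogen depletion"] <= -1.29
            and expr["YPD stationary phase"] > -1.06)


def accuracy(correct, fired):
    """Percentage of ORFs matched by the rule that have the predicted class."""
    return 100.0 * correct / fired


# Hypothetical ORF, values invented for illustration.
orf = {"sorbitol incubation": 0.10,
       "nitrogen depletion": -1.50,
       "YPD stationary phase": -0.20}
print(pheromone_rule(orf))      # True
print(round(accuracy(11, 12)))  # 92
print(round(accuracy(3, 4)))    # 75
```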

Structure Rule

• 80% accurate on test data
• Most matching ORFs belong to the Mitochondrial Carrier Family
• These have 6 long transmembrane alpha-helices of about 20-30 amino acids
• Why do we notice alpha-helices of length 10-14?

If true: coil (of length 3) followed by alpha (10 <= length < 14)
and true: coil (of length 1 or 2) followed by alpha (10 <= length < 14)
and true: coil (of length 3) followed by alpha (3 <= length < 6)
and false: coil followed by beta followed by coil (c-b-c)
and false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
then the function of this ORF is "mitochondrial transport"

Alignment

YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251
YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241
YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310
YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271
YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250
YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246
YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261
YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239
YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300
YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242
YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302
YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255

YJL133W -SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310
YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300
YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364
YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325
YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310
YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303
YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312
YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289
YMR166C HPHVTNGRPAALSNSISLSLRTVYQSEGVLGFFSGVGPRFVWTSVQSSIMLLLYQMTLRG 360
YDL198C ---FDNPESG------LRIVKNTLKNEGVTAFFKGLTPKLLTTGPKLVFSFALAQSLIPR 293
YGR257C ---NSDPKGGNRSRNMFKFLETIWRTEGLAALYTGLAARVIKIRPSCAIMISSYEISKKV 359
YDL119C ----SKFTNS------FNTFTSIVKNENVLKLFSGLSMRLARKAFSAGIAWGIYEELVKR 305

Alignment

YJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251
YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241
YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310
YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271
YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250
YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246
YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261
YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239
YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300
YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242
YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302
YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255

YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310
YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300
YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364
YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325
YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310
YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303
YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312
YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289
YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360
YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293
YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359
YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305

Homology rule

If the ORF is not weakly homologous to a protein in klebsiella
and is strongly homologous to a protein in desulfurococcales
and is strongly homologous to a short protein in cyprinidae
then the function of this ORF is "Protein fate (folding, modification, destination)"

• This rule is 100% accurate on test data
• Almost all matching ORFs are from the 20S proteasome subunit for degradation of proteins
• These subunits exist in archaea and eukaryotes, but only in one specific branch of bacteria (actinomycetes).

Application of DMP to Bacterial Genomes

Successful for both M. tuberculosis and E. coli.

Of the ORFs with no assigned function >40% were predicted to have a function at one or more levels of the class hierarchy.

It was found that many of the predictive rules were more general than possible using sequence homology.

References

King et al. (2000) KDD 2000

King et al. (2000) Yeast (Comparative and Functional Genomics)

King et al. (2001) Bioinformatics

Example Rule (level 2 E. coli)

If the ORF is not predicted to have a β-strand of length 3
and a homologous protein from class Chytridiomycetes was found
then its functional class is “Cell processes, Transport/binding proteins”

12/13 (86%) correct on Test Set - probability of this result occurring by chance is estimated at 4 x 10^-7. 24 ORFs of unknown function are predicted by the rule.

16 ORFs now with putative or confirmed function - 93.8% accurate predictions
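
The slide does not say how the chance probability was estimated. One standard way to get a figure of this kind is a binomial tail: the probability of at least k correct predictions out of n if each prediction were right only with the class's prior frequency p. The Python sketch below assumes that model, and the prior used in the usage line is hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Hypothetical prior: suppose 20% of ORFs belong to the predicted class.
print(binomial_tail(12, 13, 0.2))
```

The smaller the class prior, the less likely it is that a rule reaches a given hit rate by chance, which is why accuracies well below 100% can still be highly significant when there are many classes.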

Experimental Confirmation

The original bacterial ORF predictions were made over three years ago.

In the intervening time many more ORFs have been sequenced, making traditional homology-based prediction methods more accurate and sensitive, and the functions of some ORFs have been determined by wet biology.

The E. coli genome has been re-annotated by Monica Riley’s group.

“Wet” Biology Confirmation

A number of predictions have been confirmed or falsified by new “wet” experimental data.

This new data is biased towards hard classes. Despite this the results are still good:
– Level 2: 23 predictions - 47.8% accuracy
– Level 3: 23 predictions - 43.4% accuracy

This is very much better than random as there are many classes.

Confirmation of “Wet” Predictions

ORF | Rule | Predicted Class | Confirmed Function | Result
b0805 | 8 | Cell envelope | Outer membrane protein | C
b1519 | 15 | Degradation of small molecules | Trans-aconitate methyltransferase | C
b1533 | 43 | Transport/binding proteins | Cysteine pathway metabolite transport | C
b1981 | 42 | Transport/binding proteins | Shikimate and dehydroshikimate transport protein | C
b1981 | 56 | Transport/binding proteins | Shikimate and dehydroshikimate transport protein | C
b2210 | 15 | Degradation of small molecules | Malate:quinone oxidoreductase | C
b2392 | 43a | Transport/binding proteins | High-affinity manganese transporter | C
b2392 | 43b | Transport/binding proteins | High-affinity manganese transporter | C
b2392 | 54 | Transport/binding proteins | High-affinity manganese transporter | C
b2924 | 45 | Transport/binding proteins | Component of the MscS mechanosensitive channel – “new gene family” | C
b3839 | 43 | Transport/binding proteins | Essential component of translocase | C
b0103 | 42 | Transport/binding proteins | dephospho-CoA kinase | W
b0103 | 41 | Transport/binding proteins | dephospho-CoA kinase | W
b0103 | 43 | Transport/binding proteins | dephospho-CoA kinase | W
b1822 | 15 | Degradation of small molecules | 23S rRNA m1G745 methyltransferase | W
b2530 | 35 | Global regulatory functions | cysteine desulfurase | W
b2392 | 14 | Degradation of small molecules | High-affinity manganese transporter | W
b2889 | 50 | Energy metabolism, carbon | Isopentenyl diphosphate isomerase | W
b3222 | 54 | Transport/binding proteins | ManNAc kinase | W
b3223 | 39 | Ribosome constituents | ManNAc epimerase | W
b3337 | 28 | Laterally acquired elements | regulatory or redox component | W
b3338 | 39 | Ribosome constituents | Periplasmic endochitinase | W
b3569 | 32 | Laterally acquired elements | transcriptional regulator of xylose utilization | W
b3955 | 8 | Cell envelope | Required for invasion of brain microvascular endothelial cells | EF
b3955 | 18 | Energy metabolism, carbon | Required for invasion of brain microvascular endothelial cells | EA
b3955 | 20 | Energy metabolism, carbon | Required for invasion of brain microvascular endothelial cells | EA

Extension to Arabidopsis Genome

Collaborative project with the Institute of Grassland and Environmental Research and the University of Nottingham.

Large increase in data: 6,000 (yeast) -> 25,000 ORFs.

Large amount of micro-array data from the Nottingham Arabidopsis Stock Centre.

The increase in data (hundreds of MB) is a challenge to our machine learning algorithms.

Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136.

Results

Accuracy comparable to yeast and bacteria

Large fraction of genes of currently unknown function are predicted.

Some rules could be interpreted in terms of known biology

Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136.

Gibberellin Biosynthesis Prediction

Gibberellin is an important plant hormone.

Chosen because of interesting phenotypes – often extreme size.

Insertion of a promoter to overproduce gene product.

Result
– 2 days earlier flowering
– Average leaf number and weight increased at 21 days.

This phenotype is consistent with prediction.

[Figure: Leaf number. Bar chart of number of leaves (axis 0 to 18) against days after sowing (21, 24, 28, 31, 34).]

Leaf number increases more rapidly in the mutant (yellow bars) than in wildtype Landsberg erecta (blue bars)

Paclobutrazol (P) (inhibitor of gibberellin) abolishes the difference between mutant (M) and wildtype (L)

C = control

[Figure: Average leaf number at 21 days, Expt 4. Bar chart for treatments LC, MC, LP and MP; x-axis "Treatment", y-axis labelled "Days" on the slide (0.0 to 8.0).]

Availability

All rules and data available at http://www.aber.ac.uk/compsci/Research/bio/dss/

All predictions available at http://www.genepredictions.org

ILP 2005 Challenge 1

Yeast function prediction data used as a community challenge: http://www.protein-logic.com/

The intention of the challenge was to provide a real-world data set to test how far we have progressed in the field of ILP and multi-relational data mining. The questions we wanted to answer were: Are the tools up to the job? Do they scale? Do they handle noisy, sparse and complex data?

ILP 2005 Challenge 2

A. J. Knobbe, E. K. Y. Ho, R. Malik: ILP Challenge 2005: The Safarii MRDM environment.

C. Perlich: Approaching the ILP 2005 challenge: Class-Conditional Bayesian Propositionalization for Genetic Classification.

J. Struyf, C. Vens, T. Croonenborghs, S. Dzeroski, H. Blockeel: Applying Predictive Clustering Trees to the Inductive Logic Programming 2005 Challenge Data.

F. Riguzzi: A Simple Approach to a Multi-Label Classification Problem.

Propositional Approach

Zafer Barutcuoglu, Robert E. Schapire and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics (in press)

Hierarchy of SVMs. Uses a Bayesian net to combine predictions.

Conclusions

• Data mining and machine learning are powerful tools for functional genomics.

• The DMP method can be successfully applied to different genomes (bacterial, yeast, Arabidopsis) to predict gene functional class.

• Micro-array data is a useful component in DMP.

• Biological insight can be extracted from DMP rules.

• The structure of gene prediction problems makes them an exciting test bed for machine learning methods.

Acknowledgements

Amanda Clare, Aberystwyth
Andreas Karwath, Freiburg (Aberystwyth)
Luc Dehaspe, PharmaDM
Helen Ougham, IGER

BBSRC

The Need for Logic to Represent Scientific Knowledge

Logic is the best understood way to represent knowledge.

Traditional statistics, machine learning, and data mining are based on propositional logic.

For some problems we require a richer description language, i.e. first-order predicate calculus.

Using logic programming (predicate calculus) we can incorporate deduction, abduction, and induction.

Inductive Logic Programming

Inductive Logic Programming (ILP) uses logic programs (first-order predicate calculus) to learn with: they describe examples, theories, and background knowledge.

For certain types of problem ILP is a powerful data analysis technique, giving more accurate and more comprehensible results than conventional methods.

It has been successfully applied to a number of biological/chemical problems.

ILP for Science

The key advantage of ILP for scientific applications is that it allows the use of compact relational representations that are natural for scientists. This allows rules that are understandable in domain terms to be formed automatically.

This advantage comes at a computational cost. However, non-technical reasons are probably the greatest barrier to adoption of ILP. For example, it is very difficult to explain the benefits of ILP to domain experts.

Prediction of Lethality

Instead of using micro-array data to predict the functional class of a gene, we have been using the same approach to predict whether a gene knock-out will be lethal (grown in a rich medium).

If false: the function of the ORF is cell cycle
and true: the function of the ORF is rRNA transcription
and in the micro-array experiment (cell cycle) the ORF expression is > -0.79
then the knockout is lethal.

Example Rule: Test accuracy 82% (Default 21%).

Summary Results

Using voting (2 or more rules agree on a prediction)
– Level 2: 128 ORFs predicted - 87.5% accuracy
– Level 3: 23 ORFs predicted - 91.3% accuracy

All predictions
– Level 2: 335 ORFs predicted - 64.5% accuracy
– Level 3: 204 ORFs predicted - 44.6% accuracy
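
The voting filter can be sketched in Python as follows: a prediction is kept only when at least two distinct rules assign the same class to the same ORF. The input format is an assumption; the example triples reuse ORF, rule and class names from the "Confirmation of “Wet” Predictions" slide.

```python
from collections import Counter


def vote(predictions, min_votes=2):
    """Keep (orf, class) pairs predicted by at least min_votes distinct rules.

    predictions: iterable of (orf, rule_id, predicted_class) triples.
    """
    counts = Counter()
    for orf, rule_id, cls in set(predictions):  # set() drops repeated triples
        counts[(orf, cls)] += 1
    return {pair for pair, n in counts.items() if n >= min_votes}


predictions = [
    ("b2392", "43a", "Transport/binding proteins"),
    ("b2392", "54", "Transport/binding proteins"),
    ("b2392", "14", "Degradation of small molecules"),
    ("b0805", "8", "Cell envelope"),
]
print(vote(predictions))
# {('b2392', 'Transport/binding proteins')}
```

Voting trades coverage for accuracy, which matches the figures above: fewer ORFs receive a prediction, but a larger share of those predictions are correct.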