Mining Plant Pathogen Genomes for Effectors


Presentation given as part of the EMBO Workshop on Plant-Microbe Interactions, at The Sainsbury Laboratory, Norwich, 20th June 2012. This presentation describes bioinformatic and statistical considerations for the prediction of plant pathogen effectors from genome sequences and annotation, with several literature examples.

Transcript of Mining Plant Pathogen Genomes for Effectors

Mining  pathogen  genomes  for  effectors  

Leighton  Pritchard  

The overall goal
• Starting from a genome sequence, identify genes that code for candidate effectors (or, starting from the gene product complement, identify candidate effectors)

What is an effector?
• A molecule produced by the pathogen that (directly?) modifies host molecular/biochemical 'behaviour', e.g.
  - Inhibits enzyme action (Cladosporium fulvum AVR2, AVR4; Phytophthora infestans EPIC1, EPIC2B; P. sojae glucanase inhibitors)
  - Cleaves a protein target (Pseudomonas syringae AvrRpt2)
  - (De-)phosphorylates a protein target (Pseudomonas syringae AvrRPM1, AvrB)
  - Additional component in, or retargeting of, a host system, e.g. E3 ligase activity (P. syringae AvrPtoB; P. infestans Avr3a)
  - Regulatory control (Xanthomonas campestris AvrBs3, TAL effectors)

What is an effector?
• No unifying biochemical mechanism; may act inside or outwith the host cell
• No formal, agreed definition (direct/indirect action; structural damage – PCWDEs, etc.)
• No single 'test for candidate effectors'
  - Really testing for protein family membership and/or evidence of 'effector-like behaviour'
  - A general sequence classification problem (functional annotation)
  - Many possible bioinformatic/computational approaches
  - No big red button

Surgery  without  knife  skills?  

Before we start…

Four cards: A   F   4   7

"If a card has a vowel on one side, it has an even number on the other side." Which card(s) are useful to turn over to test this proposition?

Candidate selections: A and 7; F and 4; A and 4; F and 7

Wason Selection Task: confirmation bias, context. (Only turning A and 7 can falsify the rule; F and 4 are uninformative.)

Why is this relevant?

The four 'cards' are now: effector, not effector, RxLR, not RxLR.

"If a protein has an RxLR motif, it is an effector." Which experiments are useful to perform to test this proposition?

Effector  Club  

The  first  rule  of  finding  effectors  is:  

You  are  not  finding  effectors  

Effector Club
• Classification of sequences is modelling
  - a simplified representation of reality
  - criteria based on known effectors
• Identifies candidate effectors
  - experimental verification required
• A general bioinformatic problem
  - specifics vary for each classifier (model)

Sequence space
• An abstract concept: each point is a sequence
• Distance reflects sequence similarity (d1 < d2)
• Given a known exemplar (red), define a distance from the example ≈ 'similar'
• 'Similar' sequences are assigned the same class (e.g. function)
• Given several known exemplars, define a centre and a distance that includes the examples, then classify 'similar' sequences

Finding effectors
• Simple:
  1. Have one or more examples of your effector (class)
  2. Define some kind of appropriate threshold of similarity
  3. Check all the gene/gene product sequences in the genome against that threshold

There are 50 slides to go… it's not that simple

It's not that simple
• How do we define 'distance'?
• How large a 'distance' do we take?
• How do we know we've chosen a sensible 'distance'?

Characteristics of known effectors
• Modularity
  - Delivery: localisation/translocation domain(s)
  - Activity: functional/interaction domain(s)
• Sequence motifs
  - Localisation/translocation domain(s) typically common to an effector class (e.g. RxLR, T3E, CHxC)
  - Functional domain(s) may be common to an effector class (e.g. TAL), or divergent (e.g. RxLR, T3E in general)

Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Collmer A, Lindeberg M, Petnicki-Ocwieja T, Schneider DJ, Alfano JR (2002) Genomic mining type III secretion system effectors in Pseudomonas syringae yields new picks for all TTSS prospectors. Trends in Microbiology 10: 462–469.
Dong S, Yu D, Cui L, Qutob D, Tedman-Jones J, et al. (2011) Sequence variants of the Phytophthora sojae RXLR effector Avr3a/5 are differentially recognized by Rps3a and Rps5 in soybean. PLoS ONE 6: e20172. doi:10.1371/journal.pone.0020172.
Bouwmeester K, Meijer HJG, Govers F (2011) At the frontier: RXLR effectors crossing the Phytophthora-host interface. Frontiers in Plant-Microbe Interactions. doi:10.3389
Boch J, Scholze H, Schornack S, Landgraf A, Hahn S, et al. (2009) Breaking the code of DNA binding specificity of TAL-type III effectors. Science 326: 1509–1512. doi:10.1126/science.1178811.

Characteristics of known effectors
• 'Arms races' occur:
  - Host defences track effector evolution
  - Effectors evade host defences
• Divergence of effectors under selection pressure
  - Diversifying selection; divergence may result from evasion of detection, rather than change of biochemical 'function'
• Effectors may be found preferentially in characteristic locations
  - P. infestans 'gene-sparse' regions

Raffaele S, Win J, Cano LM, Kamoun S (2010) Analyses of genome architecture and gene expression reveal novel candidate virulence factors in the secretome of Phytophthora infestans. BMC Genomics 11: 637. doi:10.1186/1471-2164-11-637.

Characteristics of known effectors
• Application of 'filters': reduce the number of sequences to check
• Presence/absence filters (see the sketch below):
  - SignalP (export signal)
  - RxLR/T3SS (translocation signal)
  - Expression (used by the pathogen)
  - Positive selection (suggests an arms race)
  - etc.
• Workflows (e.g. Galaxy, Taverna) are useful here

Fabro G, Steinbrenner J, Coates M, Ishaque N, Baxter L, et al. (2011) Multiple candidate effectors from the oomycete pathogen Hyaloperonospora arabidopsidis suppress host plant immunity. PLoS Pathog 7: e1002348. doi:10.1371/journal.ppat.1002348.
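As a concrete illustration of a presence/absence filter, here is a minimal Python sketch (assuming Biopython and a hypothetical input file secreted_proteins.fasta) that keeps only sequences with an RxLR-like match in their N-terminal region; the regular expression and window used here are illustrative only, not the published motif definition.

    import re
    from Bio import SeqIO  # Biopython

    # Illustrative only: look for R-x-L-R within residues 30-60;
    # real studies define the motif, window and extra criteria (e.g. EER) precisely.
    RXLR = re.compile(r"R.LR")

    def passes_rxlr_filter(record, start=30, end=60):
        """Return True if an RxLR-like match occurs in the given N-terminal window."""
        return RXLR.search(str(record.seq)[start:end]) is not None

    # "secreted_proteins.fasta" is a hypothetical file of predicted secreted proteins
    candidates = [rec for rec in SeqIO.parse("secreted_proteins.fasta", "fasta")
                  if passes_rxlr_filter(rec)]
    print(len(candidates), "sequences pass the RxLR presence filter")

Each such filter removes candidates, so the order and stringency of the filters in a workflow affect the final candidate list.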

Redefining sequence space
• Effectors may share a common module, but otherwise be dissimilar
• We can emphasise sequence similarity by focusing on regions common to an effector class, e.g. T3SS, L-FLAK
  - this is essentially 'redefining' sequence space
  - brings known effectors 'closer together'
  - may bring non-effectors with similar sequence closer, too

(Illustration: sequences such as SSMMMAAAAAAAA and SSMMMBBBBBBBB share the module SSMMM but differ elsewhere. Comparing whole sequences keeps them apart; comparing only the shared domain pulls them together and pushes the non-domain regions away.)

Building a classifier
• How do we define 'distance'?
• How large a 'distance' do we take?
• How do we know we've chosen a sensible 'distance'?

Defining a distance
• Sequence identity (optimal alignment)
• Derived score (based on sequence identity/alignment)
  - Bit score in BLAST
  - E-value in BLAST
• Derived score (based on other measures, not alignment)
  - Bit score in HMMer
• Clustering (not strictly a distance)
  - Sequence identity (e.g. CD-HIT)
  - MCL

(we're really assessing criteria for class membership)

Defining a distance: sequence identity
• Distance between sequences ≈ difference between sequences
  - sequence identity: the proportion of identical symbols, e.g. as reported in BLAST output:

        Score = 95.3 bits (51), Expect = 3e-24
        Identities = 161/212 (76%), Gaps = 15/212 (7%), Strand=Plus/Plus

  - Gotchas: not always symmetrical, and dependent on alignment parameters (see the sketch below)

Pairwise alignment in Jalview – the reported identity depends on which sequence is given first:

    Score = 4970, Length of alignment = 533
    Sequence Solyc11g008000.1.1 : 1 - 529 (length 529), Sequence Solyc11g005920.1.1 : 1 - 688 (length 688)
    Percentage ID = 32.83

    Score = 5040, Length of alignment = 533
    Sequence Solyc11g005920.1.1 : 1 - 688 (length 688), Sequence Solyc11g008000.1.1 : 1 - 529 (length 529)
    Percentage ID = 32.46

BLASTP on the same pair of protein sequences with different substitution matrices:

    Query = PGSC0003DMP400054265 PGSC0003DMT400079995 (length 660)
    Subject = PGSC0003DMP400054263 PGSC0003DMT400079992 (length 182)
    BLOSUM80: Score = 31.4 bits (64), Expect = 4e-05, Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)
    BLOSUM45: Score = 36.8 bits (113), Expect = 7e-07, Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
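A small sketch of why 'percentage identity' is not a single well-defined number: even for one fixed alignment, the value depends on which length you divide by (alignment length, or the ungapped length of either sequence), which is one source of the asymmetry shown above. The two short aligned strings below are made up for illustration.

    def percent_identity(aligned_a, aligned_b, denominator="alignment"):
        """Percent identity of two gapped, aligned strings of equal length.

        denominator: 'alignment' (all columns), or 'a'/'b' (ungapped length of that sequence).
        """
        assert len(aligned_a) == len(aligned_b)
        identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
        lengths = {
            "alignment": len(aligned_a),
            "a": len(aligned_a.replace("-", "")),
            "b": len(aligned_b.replace("-", "")),
        }
        return 100.0 * identical / lengths[denominator]

    a = "ACDEFG-HIKLMN"
    b = "ACDEYGAHIK--N"
    for denom in ("alignment", "a", "b"):
        print(denom, round(percent_identity(a, b, denom), 1))   # 69.2, 75.0, 81.8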

Defining a distance: beyond identity
• Sequence identity treats each aligned position as a yes/no match (see the nucleotide BLAST alignment above)
• We can instead quantify similarity in 'bits'

Defining a distance: bit score and E-value
• Bit score and E-value can be used as distance measures
• I prefer (normalised) bit scores
  - Small changes in score → large changes in E
  - E varies linearly with database size and query length; the normalised score (λS) is independent of database size

    E = k m n e^(-λS)

The same alignment scored with different substitution matrices (same sequence pair as above):

    BLOSUM80: Score = 31.4 bits (64), Expect = 4e-05
    BLOSUM45: Score = 36.8 bits (113), Expect = 7e-07

The same search against databases of different sizes (see the sketch below):

    db size 5 sequences:   Score = 36.8 bits (113), Expect = 7e-07;  Score = 30.0 bits (89), Expect = 8e-05
    db size 483 sequences: Score = 36.8 bits (113), Expect = 1e-06;  Score = 30.0 bits (89), Expect = 1e-04
    db size 644 sequences: ***** No hits found *****
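A worked sketch of the E = k m n e^(-λS) relationship: for a fixed raw score, the E-value scales linearly with the search space (m × n) while the normalised bit score does not change. The λ and k values below are placeholders, not fitted Karlin-Altschul parameters for any real scoring system.

    import math

    lam, k = 0.267, 0.041     # placeholder Karlin-Altschul parameters
    raw_score = 113           # raw alignment score, S
    query_len = 660           # m

    bits = (lam * raw_score - math.log(k)) / math.log(2)   # normalised (bit) score
    print("bit score = %.1f (independent of database size)" % bits)

    for db_len in (1_000, 100_000, 10_000_000):            # n: total database length
        evalue = k * query_len * db_len * math.exp(-lam * raw_score)
        print("database length %10d: E = %.2e" % (db_len, evalue))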

Defining a distance: alignment v profile

Alignments compare two sequences; profiles capture information from several sequences.

    ACATAT
    TCAACT
    ACACGC
    AGAATC
    ACAGAA

• consensus: ACAAAT
• regular expression: [AT][CG]A[ACGT][ACGT][TCA]  or  [AT]-[CG]-A-X(2)-{G}
• PSSM (position-specific scoring matrix; counts per column, see the sketch below):

        1  2  3  4  5  6
    A   4  0  5  2  2  1
    C   0  4  0  1  1  2
    G   0  1  0  1  1  0
    T   1  0  0  1  1  2

• hidden Markov model (HMM)
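A minimal sketch that builds the position count matrix above (the PSSM as raw counts) from the five example sequences; a real profile would convert the counts to log-odds scores against background frequencies.

    from collections import Counter

    seqs = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]

    # One Counter per column: counts[i][base] = number of sequences with `base` at position i
    counts = [Counter(column) for column in zip(*seqs)]

    print("    " + "  ".join(str(i + 1) for i in range(len(counts))))
    for base in "ACGT":
        print(base + "   " + "  ".join(str(col.get(base, 0)) for col in counts))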

Defining a distance: bit scores in HMMer
• HMMer works differently to BLAST: profile HMMs
  - Statistical model of a multiple sequence alignment (not a pairwise sequence alignment)
  - phmmer and jackhmmer are the equivalents of BLASTP and PSI-BLAST
• Explicit statistical representation of alignment uncertainty
• Sequence scores, not alignment scores
• The bit score is a 'log-odds' bit score (base-2 logarithm, hence 'bits'):

    log-odds = log2[ P(sequence matches alignment) / P(sequence matches null model) ]

• The null model is a control; the choice of null model can be important
  - Sequence matches the alignment better than the null → log-odds > 0
  - Sequence matches the null better than the alignment → log-odds < 0
  - Sequence matches the alignment and the null equally well → log-odds ≈ 0
• Bit scores are easy to read from HMMer output, e.g. an hmmsearch with the Pfam NAM domain (PF02365.10, 'No apical meristem (NAM) protein'); a toy log-odds calculation follows below:

    --- full sequence ---
    E-value    score   bias   Sequence
    3.1e-54    171.0   0.1    StNac1_5
    5.5e-54    170.2   0.1    NbNac1_1
    4e-53      167.4   0.1    NbNac2_1
    1.5e-52    165.6   0.1    StNac2_5

    (per-domain annotation and alignments follow in the full output)

Goritschnig S, Krasileva KV, Dahlbeck D, Staskawicz BJ (2012) Computational prediction and molecular characterization of an oomycete effector and the cognate Arabidopsis resistance gene. PLoS Genetics 8: e1002502. doi:10.1371/journal.pgen.1002502.
Haas BJ, Kamoun S, Zody MC, Jiang RHY, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393–398. doi:10.1038/nature08358.
Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.
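A toy numeric sketch of the 'log-odds in bits' idea: score a query against a simple position-specific probability profile (built from the counts above, with a pseudocount) and against a uniform null model. This only illustrates the sign convention on the slide; it is not how HMMER computes scores (a profile HMM also models insertions and deletions).

    import math
    from collections import Counter

    seqs = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]
    alphabet = "ACGT"

    # Per-column probabilities with a +1 pseudocount; the null model is uniform (p = 0.25)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + len(alphabet)
        profile.append({b: (counts.get(b, 0) + 1) / total for b in alphabet})

    def log_odds_bits(query):
        """Sum over positions of log2[ P(base|profile) / P(base|null) ]."""
        return sum(math.log2(profile[i][b] / 0.25) for i, b in enumerate(query))

    for query in ("ACAAAT", "GGGGGG"):
        print(query, round(log_odds_bits(query), 2))   # positive for the consensus, negative for GGGGGG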

Defining a distance: composition
• Sometimes, sequence comparison doesn't tell you much (e.g. type III effector signals)
• Can use 'bulk properties' of sequence composition
• Many ways to derive a 'distance' (one simple example is sketched below)

Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, et al. (2009) Sequence-based prediction of type III secreted proteins. PLoS Pathog 5: e1000376. doi:10.1371/journal.ppat.1000376.
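One simple way (of many) to turn 'bulk composition' into a distance: represent each protein by its amino-acid frequency vector and take the Euclidean distance between the vectors. This is only a minimal sketch; published type III effector predictors use richer composition features and trained classifiers, and the two sequences here are invented.

    import math
    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def composition(seq):
        """Amino-acid frequency vector (length 20) for a protein sequence."""
        counts = Counter(seq)
        return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    seq1 = "MSERINQLLAKSTTSDAGKLLQ"
    seq2 = "MKLSERRRSSSSTTPPPQQNNA"
    print(round(euclidean(composition(seq1), composition(seq2)), 3))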

Defining a 'distance': clustering
• Not really a distance, more a bound: the candidates are the sequences that cluster with your known examples

Defining a 'distance': CD-HIT clusters
• Clustering tool, online at http://weizhong-lab.ucsd.edu/cd-hit/ (greedy strategy sketched below)
  - Sequences are sorted by decreasing length
  - The first sequence is the representative of the first cluster: 'seen'
  - Consider each remaining sequence in turn and compare it with the 'seen' set:
    - Similarity of the sequence with a 'seen' sequence > threshold? Merge it into that cluster
    - Otherwise it starts a new cluster: 'seen'
• Fast, but can be sensitive to sequence set composition (use multi-step clustering); you need to test the clusters for robustness
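A minimal sketch of the greedy, length-sorted clustering strategy described above. CD-HIT itself uses fast word filters before computing identities; the identity function here is a crude stand-in (and only sensible for near-identical-length sequences), so treat this purely as an illustration of why results can depend on input composition and ordering.

    def crude_identity(a, b):
        """Fraction of matching positions over the shorter sequence (illustrative only)."""
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return matches / min(len(a), len(b))

    def greedy_cluster(seqs, threshold=0.9):
        """CD-HIT-style greedy clustering: longest first, join the first representative above threshold."""
        clusters = []                      # each cluster: [representative, member, member, ...]
        for seq in sorted(seqs, key=len, reverse=True):
            for cluster in clusters:
                if crude_identity(cluster[0], seq) >= threshold:
                    cluster.append(seq)
                    break
            else:
                clusters.append([seq])     # becomes the representative of a new cluster
        return clusters

    seqs = ["MKLVVFAAAT", "MKLVVFAAAS", "MPQRSTWYHH", "MKLVVF"]
    for cluster in greedy_cluster(seqs, threshold=0.8):
        print(cluster)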

Defining a 'distance': MCL clustering
• Clustering algorithm (used in TribeMCL, OrthoMCL)
  - Markov Clustering algorithm
  - Finds clusters in networks
• Use BLAST to generate all-vs-all pairwise comparisons
  - The results form a network (similarity graph)
• Given such a network (see the sketch below):
  - Expansion (raise the matrix to a power) – 'spreads links'
  - Inflation (scaling) – 'thickens strong links'
• Repeated application of the expansion/inflation cycle results in the formation of clusters

(Figure: input similarity graph → repeated expansion and inflation → clustering)

• One key parameter: the inflation value
  - Need to cluster over several inflation values to confirm robustness (consistency of clustering), e.g.:

    Inflation value   Clusters
    1.4               3
    2.0               6
    4.0               18
    6.0               33
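A minimal numpy sketch of the expansion/inflation cycle on a small similarity graph (two triangles joined by a single edge). Real MCL implementations (the mcl program behind TribeMCL/OrthoMCL) add pruning and a more careful cluster read-out, so this is only an illustration of the two operations.

    import numpy as np

    def normalise_columns(M):
        return M / M.sum(axis=0)

    def mcl(adjacency, inflation=2.0, expansion=2, n_iter=50):
        """Markov Clustering: alternate expansion (matrix power) and inflation (elementwise power)."""
        M = normalise_columns(adjacency + np.eye(len(adjacency)))    # add self-loops
        for _ in range(n_iter):
            M = np.linalg.matrix_power(M, expansion)                 # expansion: spreads links
            M = normalise_columns(M ** inflation)                    # inflation: thickens strong links
        # Simplistic cluster read-out: each surviving (attractor) row defines a cluster
        clusters = {tuple(np.flatnonzero(row > 1e-6)) for row in M if row.max() > 1e-6}
        return sorted(clusters)

    # Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)

    print(mcl(A, inflation=2.0))   # expect two clusters: (0, 1, 2) and (3, 4, 5)

Increasing the inflation parameter gives more, smaller clusters (as in the table above); decreasing it gives fewer, larger ones.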

Defining a distance
• Sequence identity – scores an alignment (symmetry?)
• Derived score (based on sequence identity/alignment)
  - Bit score in BLAST – scores an alignment (substitution matrix)
  - E-value in BLAST – scores an alignment (sensitive to query/database size and substitution matrix)
• Derived score (based on other measures)
  - Bit score in HMMer – scores a sequence relative to a model (null model?)
• Clustering
  - Sequence identity (e.g. CD-HIT) – can be sensitive to sequence order (multi-step? test for robustness? CD-HIT uses sequence identity)
  - MCL – needs all-vs-all pairwise comparisons (test for robustness; uses the BLAST E-value by default)

Many definitions of distance
• How do we define 'distance'?
• How large a 'distance' (or what clustering resolution) do we take?
• How do we know we've chosen a sensible 'distance'?

How large a distance do we allow?
• How do we define 'distance'?
• How large a 'distance' (or what clustering resolution) do we take?
• How do we know we've chosen a sensible 'distance'?

Confusion Matrix
• Our distance/boundary classifies sequences as 'in' or 'out' ('red' = known positives, 'blue' = negatives)
• Changing the distance/bound results in various degrees of success… (see the sketch below)

A tight boundary (rows: true class; columns: classification):

          IN                   OUT
    Red   1 (true positive)    5 (false negative)
    Blue  1 (false positive)   36 (true negative)

    False positive rate    FP/(FP+TN) = 1/37  = 0.03
    False negative rate    FN/(TP+FN) = 5/6   = 0.83
    Sensitivity            TP/(TP+FN) = 1/6   = 0.17
    Specificity            TN/(FP+TN) = 36/37 = 0.97
    False discovery rate   FP/(FP+TP)

A looser boundary:

          IN   OUT
    Red    5    2
    Blue   4   33

    False positive rate    0.11
    False negative rate    0.29
    Sensitivity            0.71
    Specificity            0.89

A looser boundary still:

          IN   OUT
    Red    7    0
    Blue  14   23

    False positive rate    0.38
    False negative rate    0
    Sensitivity            1
    Specificity            0.62
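A small helper that reproduces the numbers above from 2x2 confusion-matrix counts, which is handy when comparing candidate thresholds.

    def confusion_stats(tp, fn, fp, tn):
        """Standard rates from confusion-matrix counts."""
        return {
            "false positive rate": fp / (fp + tn),
            "false negative rate": fn / (tp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (fp + tn),
            "false discovery rate": fp / (fp + tp),
        }

    # (tp, fn, fp, tn) for the three boundaries shown above
    for name, counts in [("tight", (1, 5, 1, 36)),
                         ("medium", (5, 2, 4, 33)),
                         ("loose", (7, 0, 14, 23))]:
        print(name, {k: round(v, 2) for k, v in confusion_stats(*counts).items()})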

ROC Curve
• To assess how well a method performs, we can use a ROC (Receiver Operating Characteristic) curve: sensitivity plotted against false positive rate as the threshold varies
• Typically, we use the area under the curve (AUC) to choose between methods (see the sketch below)
• A random classifier lies on the diagonal; better performance lies towards the top left
• The 'best' parameter setting for a method is typically near the apex of the curve

(Figure: ROC curve for a classifier against the random diagonal; axes are Sensitivity and False Positive Rate, both from 0 to 1)
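A minimal sketch of building a ROC curve and its AUC directly from classifier scores and true labels, sweeping the threshold over every observed score; the scores and labels here are invented.

    def roc_points(scores, labels):
        """Return (false positive rate, sensitivity) pairs, one per threshold."""
        pos = sum(labels)
        neg = len(labels) - pos
        points = [(0.0, 0.0)]
        for threshold in sorted(set(scores), reverse=True):
            predicted = [s >= threshold for s in scores]
            tp = sum(p and l for p, l in zip(predicted, labels))
            fp = sum(p and not l for p, l in zip(predicted, labels))
            points.append((fp / neg, tp / pos))
        points.append((1.0, 1.0))
        return points

    def auc(points):
        """Area under the ROC curve by the trapezoid rule."""
        pts = sorted(points)
        return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
    labels = [1, 1, 0, 1, 0, 1, 0, 0]    # 1 = known effector, 0 = negative example
    print("AUC =", round(auc(roc_points(scores, labels)), 3))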

F-measure
• We can 'game' ROC statistics by increasing the number of irrelevant 'negative' examples
  - Increasing TN 'improves' the false positive rate and specificity
• Can use precision and recall instead (see the sketch below)

          IN   OUT
    Red    1    5
    Blue   1   36

    Precision (PPV)        TP/(TP+FP)
    Recall = sensitivity   TP/(TP+FN)
    FDR = 1 - PPV          FP/(TP+FP)

• Precision: the proportion of positive predictions that are accurate
• Recall: the proportion of positive examples recovered (sensitivity)
• F1 = 2 × (precision × recall)/(precision + recall)
• The F-measure indicates which set of parameters (which distance) performs 'best'
• Several F-measures are available that weight precision and recall differently

(Figure: F-measure for three parameter settings)
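A sketch of precision, recall and the general F-beta measure (F1 is the beta = 1 case); changing beta shifts the weight between recall and precision, which is the 'several F-measures' point above. The counts reuse the confusion matrices from the earlier slides.

    def f_beta(tp, fp, fn, beta=1.0):
        """Weighted harmonic mean of precision and recall."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    for name, (tp, fp, fn) in [("tight", (1, 1, 5)), ("medium", (5, 4, 2)), ("loose", (7, 14, 0))]:
        print(name,
              "F1 =", round(f_beta(tp, fp, fn), 2),
              "F0.5 =", round(f_beta(tp, fp, fn, beta=0.5), 2))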

How large a distance do we allow?
• Assign known 'positive' and 'negative' examples
• Vary the distance and take the F-measure
• Choose the distance that gives the best performance

Confusion Matrix
• BUT: how do we know that we've chosen a suitable distance?
  - Training set choice is critical

          IN   OUT
    Red    5    2
    Blue   4   33

    False positive rate    0.11
    False negative rate    0.29
    Sensitivity            0.71
    Specificity            0.89

Training set choice
• Train the classifier on known examples: looks good…

Unrepresentative examples
• …but the training set may be a biased/unrepresentative sample…

Overfitting
• …or the classifier may 'fit' the known positives unfeasibly tightly

How large a distance do we allow?
• How do we define 'distance'?
• How large a 'distance' (or what clustering resolution) do we take?
• How do we know we've chosen a sensible 'distance'?

How do we know we've chosen a suitable distance?
• How do we define 'distance'?
• How large a 'distance' (or what clustering resolution) do we take?
• How do we know we've chosen a sensible 'distance'?

A trip to the doctor, part I
• Routine medical checkup
• Test for disease X (horrible, unpleasant, potentially suppurating)
• The test has sensitivity (i.e. predicts disease where there is disease) of 95%
• The test has a false positive rate (i.e. predicts disease where there is no disease) of 1%
• Your test is positive
• What is the probability that you have disease X?

    0.01      0.05      0.50      0.95      0.99

How do we know we've chosen a suitable distance?
• How do we define 'distance'?
• How large a 'distance' do we take?
• How do we know we've chosen a sensible 'distance'?

Cross-validation
• Estimation of classifier performance depends on
  - the distance measure
  - the composition of the training set ('positives' and 'negatives')
• Cross-validation gives an objective measure of performance
• Many strategies are available, including:
  - leave-one-out (LOO)
  - k-fold cross-validation
  - repeated (random) subsampling
• Essentially: always keep a hold-out set (not used to train)

k-fold cross-validation
• No cross-validation:
  - one training set, no test (hold-out/validation) set
  - risks overfitting
• Validation:
  - one training set, one test (hold-out/validation) set
  - test the performance of the classifier on unseen data
• 2-fold cross-validation:
  - two runs, each with one training set and one test set
  - swap the training and test sets, collate the results
• 3-fold cross-validation:
  - three runs, each with one training set and one test set
• k-fold cross-validation (sketched below):
  - k runs, each with a training set of n - (n/k) items and a test set of n/k items (n items in the dataset, k > 1)
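A minimal sketch of generating the k train/test splits described above (random assignment of items to k roughly equal folds, one hold-out fold per run); scikit-learn's KFold does the same job with more options.

    import random

    def k_fold_splits(items, k, seed=0):
        """Yield (training_set, test_set) pairs for k-fold cross-validation."""
        shuffled = list(items)
        random.Random(seed).shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]       # k roughly equal folds
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test

    data = list(range(10))
    for run, (train, test) in enumerate(k_fold_splits(data, k=5), start=1):
        print("run%d: train=%s test=%s" % (run, sorted(train), sorted(test)))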

After cross-validation

    False positive rate    0.11
    False negative rate    0.29
    Sensitivity            0.71
    Specificity            0.89
    Precision              0.56

• Use cross-validation to find the 'best' method and parameters
• Cross-validation gives you estimated performance metrics on unseen data
• Apply the 'best' method to the complete dataset for prediction

A trip to the doctor, part II
• Test for disease X (horrible, unpleasant, potentially suppurating)
• The test has sensitivity (i.e. predicts disease where there is disease) of 95%
• The test has a false positive rate (i.e. predicts disease where there is no disease) of 1%
• Your test is positive
• To calculate the probability that the test correctly determines whether you have the disease, you need to know the baseline occurrence (see the sketch below):

    Baseline occurrence: 1%   ⇒  P(disease|+ve) = 0.490
    Baseline occurrence: 80%  ⇒  P(disease|+ve) = 0.997
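A quick check of the two numbers above using Bayes' theorem, P(disease|+ve) = sensitivity × baseline / (sensitivity × baseline + FPR × (1 - baseline)):

    def p_disease_given_positive(sensitivity, false_positive_rate, baseline):
        """Posterior probability of disease given a positive test (Bayes' theorem)."""
        true_pos = sensitivity * baseline
        false_pos = false_positive_rate * (1 - baseline)
        return true_pos / (true_pos + false_pos)

    for baseline in (0.01, 0.80):
        print("baseline %.0f%%: P(disease|+ve) = %.3f"
              % (100 * baseline, p_disease_given_positive(0.95, 0.01, baseline)))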

What is the baseline occurrence for effectors?
• We usually rely on predictions for the expected baseline
• Bacterial genomes: ≈4500 genes
  - Type III effectors: 1–10% (Arnold et al. 2009); 1–2% (Collmer et al. 2002); 1% (Boch and Bonas, 2010)
• Oomycete/fungal genomes: ≈20000 genes
  - RxLRs: 120–460 (1–2%; Whisson et al. 2007); ≤563 (≲2%; Haas et al. 2009)
  - CRNs: 19–196 (≲1%; Haas et al. 2009)
  - CHxC: ≈30 (<1%; Kemen et al. 2011)
• We need to take care over the interpretation of results:
  - a prediction method with a 5% false negative rate and a 1% false positive rate, applied at a 1% baseline and predicting 500 effectors, gives P(effector|positive test) ≈ 0.5

A lesson from the literature?
• "The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of 71% and [specificity] of 85%."
  - Sensitivity [P(+ve|T3E)] = 0.71; FPR [1 - specificity; P(+ve|not T3E)] = 0.15
  - Base rate [P(T3E)] ≈ 3%; genes = 4500
  - We expect P(T3E|+ve) ≈ 0.13 (checked in the sketch below)
  - (and a significant number of false positives, up to 15% of the genome…)

    P(T3E|+ve) = P(+ve|T3E) P(T3E) / [ P(+ve|T3E) P(T3E) + P(+ve|not T3E) P(not T3E) ]
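Plugging the published figures into the same Bayes calculation confirms the estimate of roughly 0.13 quoted above:

    sens, fpr, base = 0.71, 0.15, 0.03    # sensitivity, false positive rate, T3E base rate

    posterior = sens * base / (sens * base + fpr * (1 - base))
    print("P(T3E|+ve) = %.2f" % posterior)                     # ~0.13

    genes = 4500
    print("expected false positives ~ %.0f of %d genes" % (fpr * (1 - base) * genes, genes))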

A lesson from the literature?  l “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate (Table 1)”

0.038 × 5169 × 0.13 ≈ 26 expected true effectors among the positive calls [No. +ve × P(T3E|+ve)]
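A minimal sketch of where the 26 comes from, using only the figures quoted above (0.038 as the fraction of genes called positive, 5,169 genes, and the ≈0.13 posterior); the variable names are illustrative:

genes = 5169
frac_called_positive = 0.038      # fraction of genes flagged by the classifier
posterior = 0.13                  # P(T3E | +ve) from Bayes' rule with the quoted figures

positives = frac_called_positive * genes            # ~196 positive calls
expected_true = positives * posterior               # ~26 expected true effectors among them
expected_false = positives * (1 - posterior)        # the remaining ~170 calls expected to be wrong
print(round(positives), round(expected_true), round(expected_false))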

Director’s  Commentary:  Finding  RxLRs  

l Supplementary from Whisson et al. (2007)  l  Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.

l Not  perfect  

l Detail  of  one  way  to  construct  a  classifier  

Building a training set  l Starting point: 49 candidate sequences (reference set)

l Known:  l  Contain (putatively) RxLR-EER motif

l  All  but  one  transcribed  (i.e.  not  bad  gene  calls)  

l Assumed:  

l  Presence of signal peptide and RxLR-EER categorises effectors

Building a training set  l SignalP 3.0 (Bendtsen et al. 2004) to predict locations of signal peptides.

l  SignalP also has statistical performance estimates:

l Settings:

l  HMM cutoff probability = 0.9

l  Cleavage site between positions 10 and 40 inclusive

l  Justification: use in previous studies by others
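A minimal sketch of applying that filter, assuming the SignalP 3.0 HMM output has already been parsed into (identifier, signal peptide probability, predicted cleavage position) tuples; the tuple layout and example values are illustrative, not the pipeline actually used:

# Hypothetical pre-parsed SignalP 3.0 HMM results: (seq_id, hmm_sprob, cleavage_pos)
candidates = [
    ("seq1", 0.97, 21),
    ("seq2", 0.85, 19),   # fails the probability cutoff
    ("seq3", 0.99, 44),   # cleavage site outside the 10-40 window
]

PROB_CUTOFF = 0.9                  # HMM cutoff probability
CLEAVAGE_WINDOW = range(10, 41)    # cleavage site between positions 10 and 40 inclusive

passed = [seq_id for seq_id, sprob, cleavage in candidates
          if sprob >= PROB_CUTOFF and cleavage in CLEAVAGE_WINDOW]
print(passed)    # only seq1 survives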

Building  a  training  set  l Of  49,  four  sequences  failed  

l  One carried forward on experimental grounds (highly-expressed)

l Training  set  now  has  46  sequences  

l But seven of these actually have no recognisable RxLR-EER motif, so are discarded

l Training  set  now  has  39  sequences  

Building a classifier  l We have a recognisable motif, with substantial local variation and indels

l  Therefore chose a profile HMM

l  Use HMMER software

l Profile HMMs are sensitive to the quality of the alignment

l Therefore treat the alignment as a parameter of the HMM (alignments can differ substantially!)

Building  a  classifier  l Anchored  at  RxLR  and  EER  

Building  a  classifier  l ClustalW  

Building a classifier  l T-Coffee

Building  a  classifier  l Parameters  modified  for  HMM  

l  Alignment package (no alignment, anchored, Clustal, DiAlign, T-Coffee) on default settings

l  Full-length and truncated (no signal peptide) alignments to test for influence of signal peptide region on classifier

l  Plus one alignment of RxLR-EER plus flanking region only (‘cropped’)

l HMM  built  for  each  of  eleven  alignments  

l  Default  parameters  

l Once  built,  the  HMM  is  the  classifier.  

Building  a  classifier  

Truncating sequences reshapes sequence space

hmmbuild --amino <output> <alignment>
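A minimal sketch of driving that step for several alignments at once from Python (HMMER 3 command-line syntax is assumed; the alignment and proteome filenames are placeholders):

import subprocess
from pathlib import Path

# Placeholder alignment files, one per alignment strategy
alignments = ["anchored_full.sto", "clustal_full.sto", "tcoffee_full.sto", "cropped.sto"]

for aln in alignments:
    hmm_file = Path(aln).with_suffix(".hmm")
    # Build a protein profile HMM from the alignment, default parameters
    subprocess.run(["hmmbuild", "--amino", str(hmm_file), aln], check=True)
    # Score the predicted proteome against the profile (tabular output; HMMER 3 only)
    subprocess.run(["hmmsearch", "--tblout", f"{hmm_file}.tbl", str(hmm_file), "proteome.fasta"],
                   check=True)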

Testing the classifiers

Only positive examples: How well does a classifier cover them?

Testing the classifiers  l Eleven classifiers to test

l Step 1: Consistency test  l  Does the classifier correctly call as positive the sequences used to train it?

l  Estimates recovery of the information in the training set

l Step 2: Recovery of full sequences  l  Estimates performance of classifier on complete sequence data

(Charts: classifier recovery for the SigP-RxLR-Cterm, RxLR-Cterm and RxLR sequence sets)

Testing the classifiers

Only positive examples: How well does a classifier recover unseen sequence?

Testing the classifiers  l Step 3: Leave-One-Out Crossvalidation

l  But only have positive examples!

l  Removes possibility that classifier matches on basis of having ‘seen’ a sequence before

Testing the classifiers  l Leave-one-out (LOO) crossvalidation:

l  k  runs,  each  with  one  training  set,  one  test  set  (n  items  in  dataset,  k=n)  

(Diagram: k runs; in each run one sequence is held out as the test set and the remaining n-1 form the training set)
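With only positive examples, each LOO run simply asks whether a classifier trained on the other n-1 sequences still recovers the held-out one. A minimal sketch, with train_hmm and hmm_score standing in for the re-align/hmmbuild/hmmsearch steps (they are not functions from any particular library):

def loo_recovery(sequences, train_hmm, hmm_score, threshold=0.0):
    # Fraction of held-out positives recovered by models trained on the remaining sequences
    recovered = 0
    for i, held_out in enumerate(sequences):
        training = sequences[:i] + sequences[i + 1:]   # leave one sequence out
        model = train_hmm(training)                    # re-align and build the profile HMM
        if hmm_score(model, held_out) >= threshold:    # does the unseen sequence still match?
            recovered += 1
    return recovered / len(sequences)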

Better match to classifier than to control (charts for the SigP-RxLR-Cterm, RxLR-Cterm and RxLR sequence sets)

Testing the classifiers  l Step 4: Tests on negative samples

l  Completely shuffled sequences

l  Shuffled downstream of the signal peptide only

l  Replace RxLR-EER with AAAA-AAA

No classifier identifies a false positive (no classifier matches on sequence composition alone); a construction sketch for these negative controls follows the results below.

(some recognition on the basis of the signal peptide, for the sequences shuffled downstream of the signal peptide only)

(some recognition on sequence other than the motif, for the sequences with RxLR-EER replaced by AAAA-AAA)
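A minimal sketch of how the three kinds of negative control described above could be constructed from a positive sequence (plain string manipulation; the motif patterns follow the slide, everything else is illustrative):

import random
import re

def shuffle_seq(seq):
    # Whole-sequence shuffle: same composition, but no motif or signal peptide
    residues = list(seq)
    random.shuffle(residues)
    return "".join(residues)

def shuffle_after_signal_peptide(seq, cleavage_pos):
    # Keep the signal peptide, shuffle everything downstream of the cleavage site
    return seq[:cleavage_pos] + shuffle_seq(seq[cleavage_pos:])

def mask_motif(seq):
    # Replace RxLR with AAAA and EER with AAA, leaving the rest of the sequence intact
    seq = re.sub(r"R.LR", "AAAA", seq, count=1)
    return re.sub(r"EER", "AAA", seq, count=1)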

Choosing  a  classifier  l The  ‘cropped’  classifier  has:  

l  100% recovery of positive training sequences

l  0% recovery of negative test sequences

l Some variation in classifier performance on whole genome:

Oranges are not the only fruit  l Other classifiers had been proposed, e.g. Bhattacharjee et al. (2006):

l  Presence of signal peptide, with cleavage site in first 40aa

l  Regular expression test:

l  R.LR.{,40}[ED][ED][KR] in first 100aa after cleavage site (sketched below)

l Can choose between methods, or report range of predictions
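A minimal sketch of that regular-expression test, assuming the signal peptide cleavage position is already known for each sequence (e.g. from SignalP); this is one reading of the rule, and the helper name is illustrative:

import re

# Bhattacharjee-style pattern: RxLR, then up to 40 residues, then [ED][ED][KR]
# ({0,40} is equivalent to the slide's {,40})
RXLR_EER = re.compile(r"R.LR.{0,40}[ED][ED][KR]")

def is_candidate(seq, cleavage_pos):
    # Cleavage site within the first 40 aa, and the motif within 100 aa after it
    if cleavage_pos > 40:
        return False
    return bool(RXLR_EER.search(seq[cleavage_pos:cleavage_pos + 100]))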

So how did it work out…?  l Refined all RxLR predictions to ‘priority set’ of ≈200 for cloning

l First set of 46 candidate effectors (07/11):  l  25 host interactors detected by Y2H

l  Localisation data for 41 candidates

l  Silencing phenotypes for 19 candidates

l  22 putative orthologues with P. capsici

l Currently:  l  44 silencing phenotypes

Transient expression in leaf of GFP-fused RxLR candidate, showing plasma membrane localisation

Acknowledgements  l Phytophthora  groups  at  JHI  

l  (Paul  Birch,  Steve  Whisson,  Dave  Cooke)  

l Bacteriology  groups  at  JHI  l  (Ian  Toth,  Nicola  Holden)  

l  Imaging  at  JHI  

l  (Petra  Boevink)  

l Numerous statisticians

l  (David  Broadhurst,  Andy  Woodward,  BioSS)  

Sequence  space  

CD-HIT sequence ordering  l  “Algorithm limitations: […]

Let say, there are two clusters: cluster #1 has A, X and Y where A is the representative, and cluster #2 has B and Z where B is the representative. The problem is that even if Y is more similar to B than to A, it can still be in cluster #1 because Y first hits A during the clustering process.”

l  http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide