Mining Plant Pathogen Genomes for Effectors

Mining pathogen genomes for effectors Leighton Pritchard

description

Presentation given as part of the EMBO Workshop on Plant-Microbe Interactions, at The Sainsbury Laboratory, Norwich, 20th June 2012. This presentation describes bioinformatic and statistical considerations for the prediction of plant pathogen effectors from genome sequences and annotation, with several literature examples.

Transcript of Mining Plant Pathogen Genomes for Effectors

Page 1: Mining Plant Pathogen Genomes for Effectors

Mining pathogen genomes for effectors

Leighton Pritchard

Page 2: Mining Plant Pathogen Genomes for Effectors

The overall goal
• Starting from a genome sequence, identify genes that code for candidate effectors (or, starting from the gene product complement, identify candidate effectors)

Page 6: Mining Plant Pathogen Genomes for Effectors

What is an effector?
• Molecule produced by a pathogen that (directly?) modifies host molecular/biochemical 'behaviour', e.g.
  • Inhibits enzyme action (Cladosporium fulvum AVR2, AVR4; Phytophthora infestans EPIC1, EPIC2B; P. sojae glucanase inhibitors)
  • Cleaves protein target (Pseudomonas syringae AvrRpt2)
  • (De-)phosphorylates protein target (Pseudomonas syringae AvrRPM1, AvrB)
  • Additional component in/retargeting host system, e.g. E3 ligase activity (P. syringae AvrPtoB; P. infestans Avr3a)
  • Regulatory control (Xanthomonas campestris AvrBs3, TAL effectors)

Page 8: Mining Plant Pathogen Genomes for Effectors

What is an effector?
• No unifying biochemical mechanism; may act inside or outwith host cell
• No formal, agreed definition (direct/indirect action; structural damage – PCWDEs, etc.)
• No single 'test for candidate effectors'
  • Really testing for protein family membership and/or evidence of 'effector-like behaviour'
  • A general sequence classification problem (functional annotation)
  • Many possible bioinformatic/computational approaches
  • No big red button

Page 11: Mining Plant Pathogen Genomes for Effectors

Surgery without knife skills?

Page 12: Mining Plant Pathogen Genomes for Effectors

Before we start…

A   F   4   7

"If a card has a vowel on one side, it has an even number on the other side."
Which card(s) are useful to turn over to test this proposition?

Page 14: Mining Plant Pathogen Genomes for Effectors

Before we start…

A F    4 7    A 7    F 4    A 4    F 7

Wason Selection Task: confirmation bias, context
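The logic of the card task can be made explicit with a short sketch. It assumes, as in the standard form of the Wason task, that every card has a letter on one side and a number on the other; the function and variable names are illustrative only. A card is worth turning over only if its hidden face could reveal a vowel paired with an odd number.

```python
VOWELS = set("AEIOU")

def could_falsify(visible_face: str) -> bool:
    """Return True if turning this card over could break the rule
    'a vowel on one side implies an even number on the other'."""
    if visible_face.isalpha():
        # A vowel might hide an odd number; a consonant is unconstrained.
        return visible_face.upper() in VOWELS
    # An odd number might hide a vowel; an even number cannot break the rule.
    return int(visible_face) % 2 == 1

cards = ["A", "F", "4", "7"]
useful = [card for card in cards if could_falsify(card)]
print(useful)  # ['A', '7']
```

Turning over the 4 can only ever agree with the rule, never refute it, which is why choosing it is usually read as confirmation bias.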

Page 15: Mining Plant Pathogen Genomes for Effectors

Why is this relevant?

effector   not effector   RxLR   not RxLR

"If a protein has an RxLR motif, it is an effector."
Which experiments are useful to perform to test this proposition?

Page 16: Mining Plant Pathogen Genomes for Effectors

Effector Club

The first rule of finding effectors is:

You are not finding effectors

Page 17: Mining Plant Pathogen Genomes for Effectors

Effector Club
• Classification of sequences is modelling
  • simplified representation of reality
  • criteria based on known effectors
  • Identifies candidate effectors
    • experimental verification required
• General bioinformatic problem
  • specifics vary for each classifier (model)

Page 20: Mining Plant Pathogen Genomes for Effectors

Sequence space

An abstract concept

Page 21: Mining Plant Pathogen Genomes for Effectors

Sequence space

Each point is a sequence

Page 22: Mining Plant Pathogen Genomes for Effectors

Sequence space

d1 < d2: distance reflects sequence similarity

Page 23: Mining Plant Pathogen Genomes for Effectors

Sequence space

Known exemplar: red

Page 24: Mining Plant Pathogen Genomes for Effectors

Sequence space

Define distance from the example ≈ 'similar'

Page 25: Mining Plant Pathogen Genomes for Effectors

Sequence space

'similar' sequences are same class (e.g. function)

Page 26: Mining Plant Pathogen Genomes for Effectors

Sequence space

Known exemplars: red

Page 27: Mining Plant Pathogen Genomes for Effectors

Sequence space

Define a centre, and a distance that includes the examples

Page 28: Mining Plant Pathogen Genomes for Effectors

Sequence space

Classify 'similar' sequences
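The picture built up over these sequence-space slides can be sketched in code. This is a toy illustration rather than a recommended method: it assumes equal-length sequences, uses the proportion of mismatched positions as the distance, and takes as the centre the exemplar whose furthest fellow exemplar is nearest. All names and sequences are made up.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def fit_ball(exemplars):
    """Pick a centre and a radius that together include every exemplar."""
    centre = min(exemplars,
                 key=lambda c: max(p_distance(c, e) for e in exemplars))
    radius = max(p_distance(centre, e) for e in exemplars)
    return centre, radius

def classify(seq: str, centre: str, radius: float) -> bool:
    """Call a sequence 'similar' if it falls inside the ball."""
    return p_distance(seq, centre) <= radius

exemplars = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]  # hypothetical known members
centre, radius = fit_ball(exemplars)
print(centre, radius)                         # ACGTACGT 0.125
print(classify("ACGTACGC", centre, radius))   # True
print(classify("TTTTACGT", centre, radius))   # False
```

Anything inside the ball is only a candidate: the ball was drawn around the known examples, not around the class itself.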

Page 32: Mining Plant Pathogen Genomes for Effectors

Finding effectors
• Simple:
  1. Have one or more examples of your effector (class)
  2. Define some kind of appropriate threshold of similarity
  3. Check all the gene/gene product sequences in the genome against that threshold

There are 50 slides to go… it's not that simple
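The three 'simple' steps map onto a very small program. The sketch below is an assumption-laden stand-in: it uses Python's `difflib.SequenceMatcher` ratio as the similarity score purely for illustration (a real analysis would use an alignment or profile tool), and the sequences, gene names and the 0.8 threshold are invented.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]; not an alignment score.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def find_candidates(examples, genome, threshold):
    """Step 3: check every sequence in the genome against the threshold."""
    candidates = []
    for name, seq in genome.items():
        best = max(similarity(seq, example) for example in examples)
        if best >= threshold:
            candidates.append(name)
    return candidates

examples = ["MRLSRALRAA"]                 # step 1: known example(s)
threshold = 0.8                           # step 2: chosen threshold
genome = {"gene1": "MRLSRALRAA",          # identical to the example
          "gene2": "WWWWWWWWWW"}          # shares no residues with it
candidates = find_candidates(examples, genome, threshold)
print(candidates)  # ['gene1']
```

Every choice hidden in this sketch (the score, the threshold, the examples) is one of the questions the following slides pick apart.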

Page 33: Mining Plant Pathogen Genomes for Effectors

It's not that simple
• How do we define 'distance'?
• How large a 'distance' do we take?
• How do we know we've chosen a sensible 'distance'?

Page 36: Mining Plant Pathogen Genomes for Effectors

Characteristics of known effectors
• Modularity
  • Delivery: localisation/translocation domain(s)
  • Activity: functional/interaction domain(s)
• Sequence motifs
  • Localisation/translocation domain(s) often common to effector class (e.g. RxLR, T3E)
  • Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E)

Page 37: Mining Plant Pathogen Genomes for Effectors

Characteristics of known effectors
• Modularity
  • Delivery: localisation/translocation domain(s)
  • Activity: functional/interaction domain(s)

Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Collmer A, Lindeberg M, Petnicki-Ocwieja T, Schneider DJ, Alfano JR (2002) Genomic mining type III secretion system effectors in Pseudomonas syringae yields new picks for all TTSS prospectors. Trends in Microbiology 10: 462–469.

Page 38: Mining Plant Pathogen Genomes for Effectors

Characteristics of known effectors
• Modularity
  • Delivery: localisation/translocation domain(s)
  • Activity: functional/interaction domain(s)

Dong S, Yu D, Cui L, Qutob D, Tedman-Jones J, et al. (2011) Sequence Variants of the Phytophthora sojae RXLR Effector Avr3a/5 Are Differentially Recognized by Rps3a and Rps5 in Soybean. PLoS ONE 6: e20172. doi:10.1371/journal.pone.0020172.t004.
Bouwmeester K, Meijer HJG, Govers F (2011) At the frontier; RXLR effectors crossing the Phytophthora-host interface. Frontiers in Plant-Microbe Interactions 10.3389

Page 39: Mining Plant Pathogen Genomes for Effectors

Characteristics of known effectors
• Modularity
  • Delivery: localisation/translocation domain(s)
  • Activity: functional/interaction domain(s)
• Sequence motifs
  • Localisation/translocation domain(s) typically common to effector class (e.g. RxLR, T3E, CHxC)
  • Functional domain(s) may be common to effector class (e.g. TAL), or divergent (e.g. RxLR, T3E in general)

Boch J, Scholze H, Schornack S, Landgraf A, Hahn S, et al. (2009) Breaking the code of DNA binding specificity of TAL-type III effectors. Science 326: 1509–1512. doi:10.1126/science.1178811.

Page 40: Mining Plant Pathogen Genomes for Effectors

Characteristics of known effectors
• "Arms Races" occur:
  • Host defences track effector evolution
  • Effectors evade host defences
• Divergence of effectors under selection pressure
  • Diversifying selection; divergence may result from evasion of detection, rather than change of biochemical 'function'
• Effectors may be found preferentially in characteristic locations
  • P. infestans 'gene sparse' regions

Raffaele S, Win J, Cano LM, Kamoun S (2010) Analyses of genome architecture and gene expression reveal novel candidate virulence factors in the secretome of Phytophthora infestans. BMC Genomics 11: 637. doi:10.1186/1471-2164-11-637.

Page 41: Mining Plant Pathogen Genomes for Effectors

Characteristics of known effectors
• Application of 'filters': reduce the number of sequences to check
  • Presence/absence filters:
    • SignalP (export signal)
    • RxLR/T3SS (translocation signal)
    • Expression (used by pathogen)
    • Positive selection (suggests arms race)
    • etc…
• Workflows (e.g. Galaxy, Taverna) useful here

Fabro G, Steinbrenner J, Coates M, Ishaque N, Baxter L, et al. (2011) Multiple candidate effectors from the oomycete pathogen Hyaloperonospora arabidopsidis suppress host plant immunity. PLoS Pathog 7: e1002348. doi:10.1371/journal.ppat.1002348.
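A chain of presence/absence filters might be sketched as below. Everything here is hypothetical scaffolding: the `Protein` record, its boolean fields (assumed to have been filled in beforehand from a signal-peptide predictor and from expression data), the toy sequences, and the choice to search for R-x-L-R only in the first 100 residues are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class Protein:
    name: str
    sequence: str
    has_signal_peptide: bool   # assumed result of a separate SignalP run
    expressed: bool            # assumed result of expression profiling

RXLR = re.compile(r"R.LR")     # R, any residue, L, R

def has_rxlr(protein: Protein, search_window: int = 100) -> bool:
    # Illustrative choice: only look near the N-terminus.
    return RXLR.search(protein.sequence[:search_window]) is not None

FILTERS = [
    lambda p: p.has_signal_peptide,
    has_rxlr,
    lambda p: p.expressed,
]

def apply_filters(proteins):
    """Keep only proteins that pass every presence/absence filter."""
    return [p.name for p in proteins if all(f(p) for f in FILTERS)]

proteins = [
    Protein("cand1", "M" + "A" * 25 + "RSLR" + "E" * 20, True, True),
    Protein("no_motif", "M" + "A" * 50, True, True),
    Protein("no_signal", "M" + "A" * 25 + "RSLR" + "E" * 20, False, True),
]
shortlist = apply_filters(proteins)
print(shortlist)  # ['cand1']
```

Each filter only shrinks the list to check; what survives is still a list of candidates, and each filter has its own false negatives.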

Page 42: Mining Plant Pathogen Genomes for Effectors

Redefining sequence space
• Effectors may share common module, but otherwise be dissimilar.
• We can emphasise sequence similarity by focusing on the common region
  • this is essentially 'redefining' sequence space
  • brings known effectors 'together'
  • may bring non-effectors with similar sequence closer, too

SSMMMAAAAAAAA
SSMMMBBBBBBBB

Page 43: Mining Plant Pathogen Genomes for Effectors

Sequence space

Comparing whole sequences

AAAAAAAA
BBBBBBB

Page 44: Mining Plant Pathogen Genomes for Effectors

Redefining sequence space
• Effectors may share common module, but otherwise be dissimilar.
• We can emphasise similarity by focusing on regions common to an effector class, e.g. T3SS, L-FLAK
  • this is essentially redefining sequence space
  • brings known effectors 'closer together'
  • may bring non-effectors with similar sequence closer, too

SSMMMAAAAAAAA
SSMMMBBBBBBBB
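The effect of restricting the comparison to the shared region can be checked directly on the two toy sequences from the slide. The sketch assumes position-by-position comparison of equal-length strings; the third sequence is a made-up non-effector that happens to carry the same MMM stretch.

```python
def identity(a: str, b: str) -> float:
    """Proportion of identical positions in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

effector_a = "SSMMMAAAAAAAA"
effector_b = "SSMMMBBBBBBBB"
non_effector = "QQMMMCCCCCCCC"   # hypothetical decoy

whole = identity(effector_a, effector_b)                 # 5 of 13 positions
common = identity(effector_a[:5], effector_b[:5])        # shared SSMMM prefix
decoy = identity(effector_a[2:5], non_effector[2:5])     # MMM stretch only

print(round(whole, 3), common, decoy)  # 0.385 1.0 1.0
```

Focusing on the common region pulls the two effectors together, and, as the last value shows, can pull in unrelated sequences that share the short stretch.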

Page 45: Mining Plant Pathogen Genomes for Effectors

Sequence space

Pull domains together, push non-domains away

Page 46: Mining Plant Pathogen Genomes for Effectors

Building a classifier
• How do we define 'distance'?
• How large a 'distance' do we take?
• How do we know we've chosen a sensible 'distance'?

Page 47: Mining Plant Pathogen Genomes for Effectors

Defining a distance
• Sequence identity (optimal alignment)
• Derived score (based on sequence identity/alignment)
  • Bit score in BLAST
  • E-value in BLAST
• Derived score (based on other measures)
  • Bit score in HMMer
• Clustering
  • Sequence identity (e.g. CD-HIT)
  • MCL

Page 49: Mining Plant Pathogen Genomes for Effectors

Defining a distance
• Sequence identity
• Derived score (based on sequence identity/alignment)
  • Bit score in BLAST
  • E-value in BLAST
• Derived score (based on other measures) [not alignment]
  • Bit score in HMMer
• Clustering
  • Sequence identity (e.g. CD-HIT)
  • MCL

Page 50: Mining Plant Pathogen Genomes for Effectors

Defining a distance
• Sequence identity
• Derived score (based on sequence identity/alignment)
  • Bit score in BLAST
  • E-value in BLAST
• Derived score (based on other measures)
  • Bit score in HMMer
• Clustering (not strictly a distance)
  • Sequence identity (e.g. CD-HIT)
  • MCL

(we're really assessing criteria for class membership)

Page 51: Mining Plant Pathogen Genomes for Effectors

Defining a distance: sequence identity
• Distance between sequences ≈ difference between sequences
• sequence identity: proportion of identical symbols
• e.g. BLAST output
• Gotchas: not always symmetrical; dependent on alignment parameters!

 Score = 95.3 bits (51), Expect = 3e-24
 Identities = 161/212 (76%), Gaps = 15/212 (7%)
 Strand=Plus/Plus

Query  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTGGTGATGC-ACTACCT-CTGC  58
            |||||||||||||||||||||||||||||||||||||||  | |  | ||||||| | ||
Sbjct  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTCATCAC-CTACTACCTGCGGC  59

Query  59   CGGCGGT-GC-GCCGGCCTCCCCATCGCCGTCCCCATCATCGCCGAGGTCGACCTCTACA  116
             |  | | || | ||||  |  || || ||    |  |||||||||||||||| ||| |||
Sbjct  60   AGAAGATCGCCGACGGCGGCTTCA-CGGCGAGGGC--CATCGCCGAGGTCGATCTCAACA  116

Query  117  AGTTCGATCCATGGCATCTCCCA-AGAATGGCGCTGTACGGC-GAGAAGGAGTGGTACTT  174
            ||| ||| |||||| |||||||| |||| |||     | ||  |||||||| ||||||||
Sbjct  117  AGTGCGAGCCATGGGATCTCCCAGAGAA-GGCAAAA-ATGGGAGAGAAGGAATGGTACTT  174

Query  175  CTTCTCCCCTC-GGGACCGCAAGTACCCGAAC  205
            ||| |  |||  |||| || |||||||| |||
Sbjct  175  CTT-TAGCCTAAGGGATCGAAAGTACCC-AAC  204
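One reason percentage identity is a slippery distance is that the same alignment gives different values depending on the denominator. The sketch below uses the final block of the nucleotide alignment on this slide (query 175 to 205, subject 175 to 204); the function name and the three denominators reported are illustrative choices, not a standard.

```python
def identity_stats(aligned_a: str, aligned_b: str, gap: str = "-") -> dict:
    """Identity of one gapped pairwise alignment under three denominators."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    matches = sum(x == y and x != gap for x, y in zip(aligned_a, aligned_b))
    columns = len(aligned_a)                      # alignment length, gaps included
    len_a = len(aligned_a.replace(gap, ""))       # ungapped length of first sequence
    len_b = len(aligned_b.replace(gap, ""))       # ungapped length of second sequence
    return {
        "matches": matches,
        "over_alignment": matches / columns,
        "over_a": matches / len_a,
        "over_b": matches / len_b,
    }

query = "CTTCTCCCCTC-GGGACCGCAAGTACCCGAAC"
sbjct = "CTT-TAGCCTAAGGGATCGAAAGTACCC-AAC"
stats = identity_stats(query, sbjct)
print(stats["matches"], stats["over_alignment"], stats["over_b"])  # 24 0.75 0.8
```

BLAST reports identities over the alignment length; other tools may divide by one sequence's length, which is one way 'identity' stops being symmetrical.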

Page 52: Mining Plant Pathogen Genomes for Effectors

Defining a distance: sequence identity
• Distance between sequences ≈ difference between sequences
• sequence identity: proportion of identical symbols
• e.g. BLAST output
• Gotchas: not always symmetrical; dependent on alignment parameters!

Score = 4970
Length of alignment = 533
Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
Percentage ID = 32.83

Score = 5040
Length of alignment = 533
Sequence Solyc11g005920.1.1 : 1 - 688 (Sequence length = 688)
Sequence Solyc11g008000.1.1 : 1 - 529 (Sequence length = 529)
Percentage ID = 32.46

(pairwise alignment in Jalview)

Page 53: Mining Plant Pathogen Genomes for Effectors

Defining a distance: sequence identity
• Distance between sequences ≈ difference between sequences
• sequence identity: proportion of identical symbols
• e.g. BLAST output
• Gotchas: not always symmetrical; dependent on alignment parameters!

Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182

 Score = 31.4 bits (64), Expect = 4e-05, Method: Composition-based stats.
 Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)

Query  34   GFRFHPTDEELVLYYLKRKICRRRILLDA---IAETDVY-KWEPEDLPDLSKLKTGD---  86
Sbjct  7    GFRFSPTDAEAVTFLL--RFIAGKFMDDSGFITTHVDTYSEQEPWDIYSHGVPCCNDDND  64

Query  87   -RQWFFFSPRDRKYPNGARSNRASKHGYWKATGKDRIITCNSRAV-GVKKTLVFYKGRAP  144
Sbjct  65   CTQYRFFITTLKKKSESRYSRNVGNKGSWKQQDKSKPVRKKGGPVIGYKKSMC-YKNKGY  123

Query  145  VGERTDWVMHEYTM  158
Sbjct  124  KQEDGHWLMKEYDL  137

(BLASTP, BLOSUM80 matrix)

Page 54: Mining Plant Pathogen Genomes for Effectors

Defining a distance: sequence identity
• Distance between sequences ≈ difference between sequences
• sequence identity: proportion of identical symbols
• e.g. BLAST output
• Gotchas: not always symmetrical; dependent on alignment parameters!

Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182

 Score = 36.8 bits (113), Expect = 7e-07, Method: Composition-based stats.
 Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)

Query  31   FPPGFRFHPTDEELVLYYLKRKICRRRILLDAIAETDVYKW---EPEDLPDLSKLKTGDR  87
Sbjct  4    LEEGFRFSPTDAEAVTFLL-RFIAGKFMDDSGFITTHVDTYSEQEPWDIYSHGVPCCNDD  62

Query  88   ----QWFFFSPRDRKYPNGARSNRASKHGYWKATGKDRIITCNSRAVGVKKTLVFYKGRA  143
Sbjct  63   NDCTQYRFFITTLKKKSESRYSRNVGNKGSWKQQDKSKPVRKKGGPVIGYKKSMCYKNKG  122

Query  144  PVGERTDWVMHEYTMDEEELKRCQNAQDYYALYKVFKKS  182
Sbjct  123  YKQEDGHWLMKEYDLSTYILDKFDKDCRDIVLCAIKKRT  161

(BLASTP, BLOSUM45 matrix)

Page 55: Mining Plant Pathogen Genomes for Effectors

Defining a distance: beyond identity

Query  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTGGTGATGC-ACTACCT-CTGC  58
            |||||||||||||||||||||||||||||||||||||||  | |  | ||||||| | ||
Sbjct  1    GCCGCCGGGGTTCAGGTTCCACCCGACGGACGAGGAGCTCATCAC-CTACTACCTGCGGC  59

Query  59   CGGCGGT-GC-GCCGGCCTCCCCATCGCCGTCCCCATCATCGCCGAGGTCGACCTCTACA  116
             |  | | || | ||||  |  || || ||    |  |||||||||||||||| ||| |||
Sbjct  60   AGAAGATCGCCGACGGCGGCTTCA-CGGCGAGGGC--CATCGCCGAGGTCGATCTCAACA  116

Query  117  AGTTCGATCCATGGCATCTCCCA-AGAATGGCGCTGTACGGC-GAGAAGGAGTGGTACTT  174
            ||| ||| |||||| |||||||| |||| |||     | ||  |||||||| ||||||||
Sbjct  117  AGTGCGAGCCATGGGATCTCCCAGAGAA-GGCAAAA-ATGGGAGAGAAGGAATGGTACTT  174

Query  175  CTTCTCCCCTC-GGGACCGCAAGTACCCGAAC  205
            ||| |  |||  |||| || |||||||| |||
Sbjct  175  CTT-TAGCCTAAGGGATCGAAAGTACCC-AAC  204

Identity ≈ yes/no
We can quantify similarity in 'bits'

Page 57: Mining Plant Pathogen Genomes for Effectors

Defining a distance: bit score and E-value
• Bit score and E-value can be used as distance measures.
• I prefer (normalised) bit scores
• Small changes in score → large changes in E
• E varies linearly with database size and query length; λS independent of database size

$E = k m n e^{-\lambda S}$

BLOSUM80:
Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182
 Score = 31.4 bits (64), Expect = 4e-05, Method: Composition-based stats.
 Identities = 37/134 (28%), Positives = 53/134 (40%), Gaps = 12/134 (8%)

BLOSUM45:
Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182
 Score = 36.8 bits (113), Expect = 7e-07, Method: Composition-based stats.
 Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)

Page 58: Mining Plant Pathogen Genomes for Effectors

Defining a distance: bit score and E-value
• Bit score and E-value can be used as distance measures.
• I prefer (normalised) bit scores
• Small changes in score → large changes in E
• E varies linearly with database size and query length; λS independent of database size

$E = k m n e^{-\lambda S}$

Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182
 Score = 36.8 bits (113), Expect = 7e-07, Method: Composition-based stats.
 Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
 Score = 30.0 bits (89), Expect = 8e-05, Method: Composition-based stats.
 Identities = 23/114 (21%), Positives = 36/114 (32%), Gaps = 0/114 (0%)
(db size: 5 sequences)

Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182
 Score = 36.8 bits (113), Expect = 1e-06, Method: Composition-based stats.
 Identities = 39/159 (25%), Positives = 57/159 (36%), Gaps = 8/159 (5%)
 Score = 30.0 bits (89), Expect = 1e-04, Method: Composition-based stats.
 Identities = 23/114 (21%), Positives = 36/114 (32%), Gaps = 0/114 (0%)
(db size: 483 sequences)

Query= PGSC0003DMP400054265 PGSC0003DMT400079995 Length=660
Subject= PGSC0003DMP400054263 PGSC0003DMT400079992 Length=182
***** No hits found *****
(db size: 644 sequences)
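The scaling behaviour on this slide follows from the formula. With the normalised bit score defined as S' = (λS − ln k) / ln 2, the expression E = k·m·n·e^(−λS) becomes E = m·n·2^(−S'). The sketch below uses that form with invented lengths; it is not expected to reproduce the E-values shown above, because BLAST works with effective (edge-corrected) lengths and composition adjustments that are not modelled here.

```python
import math

def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2**(-S') for a normalised bit score S'.

    query_len (m) and db_len (n) are treated as plain residue counts,
    which is a simplification of what BLAST actually uses.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

base = expect_value(36.8, query_len=660, db_len=10_000)        # made-up database size
bigger_db = expect_value(36.8, query_len=660, db_len=20_000)   # database doubled
weaker_hit = expect_value(30.0, query_len=660, db_len=10_000)  # 6.8 bits lower

print(bigger_db / base)              # 2.0: E is linear in database size
print(round(weaker_hit / base, 1))   # roughly 111: E is exponential in the score
```

The bit score itself is unchanged by the database size, which is one argument for preferring it over E as a distance.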

Page 60: Mining Plant Pathogen Genomes for Effectors

Defining a distance: alignment v profile

ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA

Alignments compare two sequences
Profiles capture information from several sequences

Page 61: Mining Plant Pathogen Genomes for Effectors

Defining a distance: alignment v profile

ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA

consensus: ACAAAT

Alignments compare two sequences
Profiles capture information from several sequences

Page 62: Mining Plant Pathogen Genomes for Effectors

Defining a distance: alignment v profile

ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA

consensus: ACAAAT

regular expression:
[AT][CG]A[ACGT][ACGT][TCA]
[AT]-[CG]-A-X(2)-{G}

Alignments compare two sequences
Profiles capture information from several sequences

Page 63: Mining Plant Pathogen Genomes for Effectors

Defining a distance: alignment v profile

ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA

consensus: ACAAAT

regular expression:
[AT][CG]A[ACGT][ACGT][TCA]
[AT]-[CG]-A-X(2)-{G}

PSSM:
   1 2 3 4 5 6
A  4 0 5 2 2 1
C  0 4 0 1 1 2
G  0 1 0 1 1 0
T  1 0 0 1 1 2

Alignments compare two sequences
Profiles capture information from several sequences

Page 64: Mining Plant Pathogen Genomes for Effectors

Defining a distance: alignment v profile

ACATAT
TCAACT
ACACGC
AGAATC
ACAGAA

consensus: ACAAAT

regular expression:
[AT][CG]A[ACGT][ACGT][TCA]
[AT]-[CG]-A-X(2)-{G}

PSSM:
   1 2 3 4 5 6
A  4 0 5 2 2 1
C  0 4 0 1 1 2
G  0 1 0 1 1 0
T  1 0 0 1 1 2

hidden Markov model (HMM)

Alignments compare two sequences
Profiles capture information from several sequences
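The count matrix and the regular expression on these slides can both be derived mechanically from the five aligned sequences. The sketch below does that with plain Python; variable names are illustrative, and the generated pattern writes single-letter columns as `[A]` rather than `A`, which matches the same strings.

```python
import re

aligned = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]

# Transpose the alignment so that each tuple is one column.
columns = list(zip(*aligned))

# Position-specific counts: one row per base, one entry per column.
counts = {base: [col.count(base) for col in columns] for base in "ACGT"}

# A regular expression allowing, at each position, any base seen in that column.
pattern = "".join("[" + "".join(sorted(set(col))) + "]" for col in columns)
regex = re.compile(pattern)

for base in "ACGT":
    print(base, counts[base])
print(pattern)  # [AT][CG][A][ACGT][ACGT][ACT]
```

The regular expression only records whether a base was ever seen at a position, whereas the counts keep how often, which is the extra information a profile carries.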

Page 65: Mining Plant Pathogen Genomes for Effectors

Defining a distance: bit scores in HMMer
• HMMer works differently to BLAST: profile HMMs
  • Statistical model of multiple sequence alignment (not pairwise sequence alignment)
  • phmmer and jackhmmer equivalents of BLASTP and PSIBLAST
• Explicit statistical representation of alignment uncertainty
• Sequence scores, not alignment scores
• Bit score is 'log-odds' bit score:

$\text{log-odds} = \log\left(\frac{P(\text{sequence matches alignment})}{P(\text{sequence matches null model})}\right)$

Goritschnig S, Krasileva KV, Dahlbeck D, Staskawicz BJ (2012) Computational prediction and molecular characterization of an oomycete effector and the cognate Arabidopsis resistance gene. PLoS Genetics 8: e1002502. doi:10.1371/journal.pgen.1002502.
Haas BJ, Kamoun S, Zody MC, Jiang RHY, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393–398. doi:10.1038/nature08358.
Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.

Page 68: Mining Plant Pathogen Genomes for Effectors

Defining a distance: bit scores in HMMer
• HMMer works differently to BLAST: profile HMMs
  • Statistical model of multiple sequence alignment (not pairwise sequence alignment)
  • phmmer and jackhmmer equivalents of BLASTP and PSIBLAST
• Explicit statistical representation of alignment uncertainty
• Sequence scores, not alignment scores
• Bit score is 'log-odds' bit score:

$\text{log-odds} = \log\left(\frac{P(\text{sequence matches alignment})}{P(\text{sequence matches null model})}\right)$

Null model is a control
Choice of null model can be important

Page 69: Mining Plant Pathogen Genomes for Effectors

Defining a distance: bit scores in HMMer
• HMMer works differently to BLAST: profile HMMs
  • Statistical model of multiple sequence alignment (not pairwise sequence alignment)
  • phmmer and jackhmmer equivalents of BLASTP and PSIBLAST
• Explicit statistical representation of alignment uncertainty
• Sequence scores, not alignment scores
• Bit score is 'log-odds' bit score:

$\text{log-odds} = \log\left(\frac{P(\text{sequence matches alignment})}{P(\text{sequence matches null model})}\right)$

Sequence matches alignment better than control (null) → log-odds > 0
Sequence matches control (null) better than alignment → log-odds < 0
Sequence matches alignment and control (null) equally → log-odds ≈ 0
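The sign behaviour of a log-odds score can be seen with a much simpler model than a profile HMM. The sketch below is only an analogy: it scores a sequence against independent per-column base frequencies (with a pseudocount of 1, an arbitrary choice) estimated from the five toy sequences used in the alignment v profile slides, against a uniform null model. It has no insert or delete states and is not how HMMer computes its scores.

```python
import math

aligned = ["ACATAT", "TCAACT", "ACACGC", "AGAATC", "ACAGAA"]

def column_probabilities(seqs, pseudocount: float = 1.0):
    """Per-column base probabilities with additive smoothing."""
    probs = []
    for col in zip(*seqs):
        total = len(col) + 4 * pseudocount
        probs.append({b: (col.count(b) + pseudocount) / total for b in "ACGT"})
    return probs

def log_odds_bits(seq: str, probs, null_prob: float = 0.25) -> float:
    """log2 of P(seq | model) / P(seq | null), summed over columns."""
    return sum(math.log2(p[base] / null_prob) for base, p in zip(seq, probs))

probs = column_probabilities(aligned)
print(log_odds_bits("ACAAAT", probs) > 0)   # True: fits the model better than the null
print(log_odds_bits("GTGCCG", probs) < 0)   # True: fits the null better than the model
```

Changing `null_prob` (for example to reflect a biased base composition) shifts every score, which is the sense in which the choice of null model matters.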

Page 70: Mining Plant Pathogen Genomes for Effectors

Defining  a  distance:  bit  scores  in  HMMer  

Query:       NAM  [M=129]
Accession:   PF02365.10
Description: No apical meristem (NAM) protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence  Description
    ------- ------ -----    ------- ------ -----   ---- --  --------  -----------
    3.1e-54  171.0   0.1    5.3e-54  170.3   0.1    1.4  1  StNac1_5
    5.5e-54  170.2   0.1    8.8e-54  169.6   0.1    1.3  1  NbNac1_1
      4e-53  167.4   0.1    6.3e-53  166.8   0.1    1.3  1  NbNac2_1
    1.5e-52  165.6   0.1    3.3e-52  164.5   0.1    1.6  1  StNac2_5

Domain annotation for each sequence (and alignments):
>> StNac1_5
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !  170.3   0.1   5.3e-54   5.3e-54       1     128 [.      28     156 ..      28     157 .. 0.97

  Alignments for each domain:
  == domain 1  score: 170.3 bits;  conditional E-value: 5.3e-54
  PF02365.10   1 lppGfrFhPtdeelvveyLkkkvegkkleleevikevdiykvePwdLp..akvkaeekewyfFskrdkkyatgkrknratksgyWkatgkdkevlskkg 97
                 lp+G+rF+Ptdeelv++yL+ k++g + ++ +vi+evdi+k+ePwdLp  ++v+++++ew+fF+++d+ky++g+r nrat++gyWkatgkd+++++kkg
    StNac1_5  28 LPVGYRFRPTDEELVNHYLRLKINGADSQV-SVIREVDICKLEPWDLPdlSVVESHDNEWFFFCPKDRKYQNGQRLNRATERGYWKATGKDRNIVTKKG 125
                 699************************999.99***************888899999****************************************** PP

  PF02365.10  98 elvglkktLvfykgrapkgektdWvmheyrl 128
                 +++g+kktLv+y grap+g++t+Wv+heyr+
    StNac1_5 126 AKIGMKKTLVYYIGRAPEGKRTHWVIHEYRA 156
                 *****************************96 PP

• Easy to read bit scores from HMMer output

Page 71: Mining Plant Pathogen Genomes for Effectors

Defining a distance: composition

• Sometimes, sequence comparison doesn’t tell you much (e.g. T3 effector signals)
• Can use ‘bulk properties’ of sequence composition
• Many ways to derive a ‘distance’

Greenberg JT, Vinatzer BA (2003) Identifying type III effectors of plant pathogens and analyzing their interaction with plant cells. Curr Opin Microbiol 6: 20–28.
Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, et al. (2009) Sequence-based prediction of type III secreted proteins. PLoS Pathog 5: e1000376. doi:10.1371/journal.ppat.1000376.

Page 73: Mining Plant Pathogen Genomes for Effectors

Defining  a  ‘distance’:  clustering  

Not really a distance, more a bound: sequences that cluster with your known examples

Page 74: Mining Plant Pathogen Genomes for Effectors

Defining a ‘distance’: CD-HIT clusters

• Clustering tool, online at http://weizhong-lab.ucsd.edu/cd-hit/
  ◦ Sequences sorted by decreasing length
  ◦ First sequence is representative of first cluster: ‘seen’
  ◦ Consider each remaining sequence in turn: compare with ‘seen’ set
    ▪ Similarity of sequence with ‘seen’ sequence > threshold? Merge into cluster
    ▪ Otherwise start new cluster: ‘seen’
• Fast, but can be sensitive to sequence set composition (use multi-step).
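The greedy procedure described on this slide can be sketched as follows. This is not CD-HIT's implementation: the `identity` function is a deliberately naive, hypothetical stand-in (position-by-position matching with no alignment and no word filtering), and `greedy_cluster` is an illustrative name.

```python
def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): matching positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering: longest sequence first."""
    clusters = []  # list of (representative, members) pairs: the 'seen' set
    for seq in sorted(seqs, key=len, reverse=True):
        for representative, members in clusters:
            if identity(seq, representative) > threshold:
                members.append(seq)  # similar enough: merge into cluster
                break
        else:
            clusters.append((seq, [seq]))  # otherwise start a new cluster
    return clusters


if __name__ == "__main__":
    toy = ["MKTLLVAAAA", "MKTLLVAAAG", "GGGGSSSSPP", "MKTLLV"]
    for representative, members in greedy_cluster(toy, threshold=0.8):
        print(representative, members)
```

Because each sequence is only compared with existing representatives, the result depends on which sequences happen to be seen first, which is why robustness checks are worthwhile.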

Page 80: Mining Plant Pathogen Genomes for Effectors

Defining a ‘distance’: CD-HIT clusters

Need to test clusters for robustness

Page 82: Mining Plant Pathogen Genomes for Effectors

Defining a ‘distance’: MCL clustering

• Clustering algorithm (used in TribeMCL, OrthoMCL)
  ◦ Markov Clustering Algorithm
  ◦ Finds clusters in networks
• Use BLAST to generate all-vs-all pairwise comparisons
  ◦ Results are a network (similarity graph)
• Given such a network:
  ◦ Expansion (raise to power) – ‘spreads links’
  ◦ Inflation (scaling) – ‘thickens strong links’

Page 84: Mining Plant Pathogen Genomes for Effectors

Defining a ‘distance’: MCL clustering

Repeated application of the expansion/inflation cycle results in the formation of clusters.
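The expansion/inflation cycle can be sketched in a few lines of plain Python. This is a teaching sketch under stated assumptions, not the MCL program itself: it uses dense matrices, adds self-loops, squares the matrix for expansion, runs a fixed number of iterations rather than testing convergence, and reads clusters off the non-zero rows. All function names are hypothetical.

```python
def normalise(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: raise the matrix to the power 2 ('spreads links')."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: elementwise power r, then rescale ('thickens strong links')."""
    return normalise([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, iterations=50):
    n = len(adjacency)
    # Add self-loops (assumption: weight 1.0) and make columns sum to 1
    m = normalise([[adjacency[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each non-zero row lists the members of one cluster
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Two triangles (0,1,2) and (3,4,5) joined by a single weak link 2-3
    graph = [[0, 1, 1, 0, 0, 0],
             [1, 0, 1, 0, 0, 0],
             [1, 1, 0, 0.1, 0, 0],
             [0, 0, 0.1, 0, 1, 1],
             [0, 0, 0, 1, 0, 1],
             [0, 0, 0, 1, 1, 0]]
    for value in (1.4, 2.0, 4.0):
        print(value, mcl(graph, inflation=value))
```

In a real analysis the edge weights would come from the all-vs-all BLAST comparison, and the clustering would be repeated over several inflation values.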

Page 85: Mining Plant Pathogen Genomes for Effectors

Defining a ‘distance’: MCL clustering

[Figure: alternating Expansion and Inflation steps, from the Input network to the final Clustering]

Page 86: Mining Plant Pathogen Genomes for Effectors

Defining a ‘distance’: MCL clustering

• One key parameter: inflation value
  ◦ Need to cluster over several inflation values to confirm robustness (consistency of clustering)

Inflation value    Clusters
1.4                3
2.0                6
4.0                18
6.0                33

Page 87: Mining Plant Pathogen Genomes for Effectors

Defining a distance

• Sequence identity – scores alignment (symmetry?)
• Derived score (based on sequence identity/alignment)
  ◦ Bit score in BLAST – scores alignment (substitution matrix)
  ◦ E-value in BLAST – scores alignment (sensitive to query/db size, substitution matrix)
• Derived score (based on other measures)
  ◦ Bit score in HMMer – scores sequence relative to model (null model?)
• Clustering
  ◦ Sequence identity (e.g. CD-HIT) – can be sensitive to sequence order (multi-step? test for robustness? CD-HIT uses sequence identity)
  ◦ MCL – needs all-v-all pairwise (test for robustness; uses BLAST E-value by default)

Page 88: Mining Plant Pathogen Genomes for Effectors

Many definitions of distance

• How do we define ‘distance’?
• How large a ‘distance’ (or what clustering resolution) do we take?
• How do we know we’ve chosen a sensible ‘distance’?

Page 89: Mining Plant Pathogen Genomes for Effectors

How large a distance do we allow?

• How do we define ‘distance’?
• How large a ‘distance’ (or what clustering resolution) do we take?
• How do we know we’ve chosen a sensible ‘distance’?

Page 90: Mining Plant Pathogen Genomes for Effectors

Confusion Matrix

• Our distance/boundary classifies sequences as ‘in’ or ‘out’
  ◦ ‘red’ or ‘blue’
• Changing distance/bound results in various degrees of success…

Confusion matrix:

        IN    OUT
Red      1      5
Blue     1     36

Page 91: Mining Plant Pathogen Genomes for Effectors

Confusion Matrix

        IN                     OUT
Red     1 (true positive)      5 (false negative)
Blue    1 (false positive)     36 (true negative)

Page 92: Mining Plant Pathogen Genomes for Effectors

Confusion Matrix

False positive rate      FP/(FP+TN)
False negative rate      FN/(TP+FN)
Sensitivity              TP/(TP+FN)
Specificity              TN/(FP+TN)
False discovery rate     FP/(FP+TP)

Page 93: Mining Plant Pathogen Genomes for Effectors

Confusion Matrix

        IN    OUT
Red      1      5
Blue     1     36

False positive rate      1/37 = 0.03
False negative rate      5/6 = 0.83
Sensitivity              1/6 = 0.17
Specificity              36/37 = 0.97
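The rates for this table can be reproduced with a short sketch. It assumes that Red/IN is the true positive count, Red/OUT the false negatives, Blue/IN the false positives and Blue/OUT the true negatives; `rates` is a hypothetical helper name.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Summary statistics from the four cells of a confusion matrix."""
    return {
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (tp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false discovery rate": fp / (fp + tp),
    }


if __name__ == "__main__":
    # Red: 1 IN, 5 OUT; Blue: 1 IN, 36 OUT
    for name, value in rates(tp=1, fn=5, fp=1, tn=36).items():
        print(f"{name}: {value:.2f}")
```

The sketch does no guarding against empty rows or columns, so a matrix with no positive predictions (TP + FP = 0) would raise a division error.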

Page 94: Mining Plant Pathogen Genomes for Effectors

Confusion Matrix

        IN    OUT
Red      5      2
Blue     4     33

False positive rate      0.11
False negative rate      0.29
Sensitivity              0.71
Specificity              0.89

Page 95: Mining Plant Pathogen Genomes for Effectors

Confusion Matrix

        IN    OUT
Red      7      0
Blue    14     23

False positive rate      0.38
False negative rate      0
Sensitivity              1
Specificity              0.62

Page 96: Mining Plant Pathogen Genomes for Effectors

ROC Curve

• To assess how well a method performs, can use ROC (Receiver Operating Characteristic) curve
• Typically, we use area under the curve (AUC) to choose between methods

[Figure: ROC curve. x-axis: False Positive Rate (0 to 1); y-axis: Sensitivity (0 to 1); series: Classifier, Random]

Page 97: Mining Plant Pathogen Genomes for Effectors

ROC Curve

[Figure: the same ROC curve, annotated with the direction of ‘better performance’]

Page 98: Mining Plant Pathogen Genomes for Effectors

ROC Curve

• To assess how well a method performs, can use ROC (Receiver Operating Characteristic) curve
• The ‘best’ parameter setting for a method is typically near the apex.

[Figure: ROC curve. x-axis: False Positive Rate (0 to 1); y-axis: Sensitivity (0 to 1); series: Classifier, Random]
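A ROC curve and its AUC can be sketched from a list of classifier scores and known labels. The scores below are invented for illustration, the sketch assumes no tied scores, and `roc_points` and `auc` are hypothetical helper names; the area is taken with the trapezium rule.

```python
def roc_points(scores, labels):
    """(false positive rate, sensitivity) pairs as the threshold is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one example at a time, highest score first
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezium rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # invented classifier scores
    labels = [1, 1, 0, 1, 0, 0]               # 1 = known positive
    curve = roc_points(scores, labels)
    print(curve)
    print(f"AUC = {auc(curve):.3f}")
```

A random classifier follows the diagonal and has an AUC of about 0.5, so values closer to 1 indicate better separation of positives from negatives.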

Page 99: Mining Plant Pathogen Genomes for Effectors

F-measure

        IN    OUT
Red      1      5
Blue     1     36

False positive rate      FP/(FP+TN)
False negative rate      FN/(TP+FN)
Sensitivity              TP/(TP+FN)
Specificity              TN/(FP+TN)

• We can ‘game’ ROC statistics by increasing irrelevant ‘negative’ examples
  ◦ Increasing TN ‘improves’ false positive rate and specificity
• Can use precision and recall instead

Page 100: Mining Plant Pathogen Genomes for Effectors

F-measure

Precision (PPV)          TP/(TP+FP)
Recall = sensitivity     TP/(TP+FN)
FDR = 1 − PPV            FP/(TP+FP)

Page 101: Mining Plant Pathogen Genomes for Effectors

F-measure

• Precision: Proportion of accurate positive predictions
• Recall: Proportion of positive examples recovered (sensitivity)
• F1 = 2 × (precision × recall) / (precision + recall)
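Precision, recall and F1 for the example confusion matrix can be worked through as below. The weighted `f_beta` variant uses the standard F-beta formula, which is an addition for illustration rather than something shown on the slide; the function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """F-measure; beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # Red: 1 IN (TP), 5 OUT (FN); Blue: 1 IN (FP), 36 OUT (TN)
    p = precision(tp=1, fp=1)
    r = recall(tp=1, fn=5)
    print(f"precision = {p:.2f}, recall = {r:.2f}, F1 = {f_beta(p, r):.2f}")
```

Note that the true negative count does not appear anywhere in the calculation, so adding irrelevant negatives cannot inflate the score.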

Page 102: Mining Plant Pathogen Genomes for Effectors

F-measure

• The F-measure indicates which set of parameters (which distance) is ‘best’
• Several F-measures available that weight precision and recall differently

[Figure: bar chart of F-measure (y-axis, 0 to 0.7) for three parameter settings (1, 2, 3)]

Page 103: Mining Plant Pathogen Genomes for Effectors

How large a distance do we allow?

• Assign known ‘positive’ and ‘negative’ examples
• Vary distances and take F-measure
• Choose distance that gives the best performance

Page 106: Mining Plant Pathogen Genomes for Effectors

Confusion Matrix

• BUT: how do we know that we’ve chosen a suitable distance?
  ◦ Training set choice is critical

        IN    OUT
Red      5      2
Blue     4     33

False positive rate      0.11
False negative rate      0.29
Sensitivity              0.71
Specificity              0.89

Page 107: Mining Plant Pathogen Genomes for Effectors

Training  set  choice  

Train  classifier  on  known  examples:  looks  good…  

Page 108: Mining Plant Pathogen Genomes for Effectors

Unrepresentative examples

…but training set biased/unrepresentative sample…

Page 109: Mining Plant Pathogen Genomes for Effectors

Overfitting

…or ‘fits’ known positives unfeasibly tightly

Page 111: Mining Plant Pathogen Genomes for Effectors

How do we know we’ve chosen a suitable distance?

• How do we define ‘distance’?
• How large a ‘distance’ (or what clustering resolution) do we take?
• How do we know we’ve chosen a sensible ‘distance’?

Page 112: Mining Plant Pathogen Genomes for Effectors

A trip to the doctor, part I

• Routine medical checkup
• Test for disease X (horrible, unpleasant, potentially suppurating)
• Test has sensitivity (i.e. predicts disease where there is disease) of 95%
• Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
• Your test is positive
• What is the probability that you have disease X?

Page 116: Mining Plant Pathogen Genomes for Effectors

A trip to the doctor, part I

• What is the probability that you have disease X?

Options: 0.01   0.05   0.95   0.99   0.50

Page 118: Mining Plant Pathogen Genomes for Effectors

Cross-validation

• Estimation of classifier performance depends on
  ◦ distance measure
  ◦ composition of training set (‘positives’ and ‘negatives’)
• Cross-validation gives objective measure of performance
• Many strategies available, including:
  ◦ leave-one-out (LOO)
  ◦ k-fold crossvalidation
  ◦ repeated (random) subsampling
• Essentially: always keep a hold-out set (not used to train)

Page 122: Mining Plant Pathogen Genomes for Effectors

k-fold crossvalidation

• No crossvalidation:
  ◦ One training set
  ◦ No test (hold-out/validation) set
  ◦ Risks overfitting

[Figure legend: Training Set, Test Set]

Page 123: Mining Plant Pathogen Genomes for Effectors

k-fold crossvalidation

• Validation:
  ◦ One training set, one test (hold-out/validation) set
  ◦ Test performance of classifier on unseen data

[Figure legend: Training Set, Test Set]

Page 124: Mining Plant Pathogen Genomes for Effectors

k-fold crossvalidation

• 2-fold crossvalidation:
  ◦ Two runs, each with one training set, one test set
  ◦ Swap training and test sets, collate results

[Figure: Training Set and Test Set partitions for run1 and run2]

Page 125: Mining Plant Pathogen Genomes for Effectors

k-fold crossvalidation

• 3-fold crossvalidation:
  ◦ Three runs, each with one training set, one test set

[Figure: Training Set and Test Set partitions for run1, run2 and run3]

Page 126: Mining Plant Pathogen Genomes for Effectors

k-fold crossvalidation

• k-fold crossvalidation:
  ◦ k runs, each with one training set, one test set (n items in dataset, k>1)

[Figure: Training Set and Test Set partitions for run1, run2, … runk; each test set holds n/k items and each training set n − (n/k) items]
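The partitioning behind k-fold crossvalidation can be sketched as a small generator. It is a minimal illustration: `k_fold_splits` is a hypothetical name, the items are split in their given order (in practice the data would usually be shuffled first), and when n is not a multiple of k the first few folds get one extra item.

```python
def k_fold_splits(n: int, k: int):
    """Yield (train_indices, test_indices) for each of k runs over n items."""
    if not 1 < k <= n:
        raise ValueError("need 1 < k <= n")
    indices = list(range(n))
    # Fold sizes differ by at most one item
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                 # about n/k items
        train = indices[:start] + indices[start + size:]   # the remaining items
        yield train, test
        start += size


if __name__ == "__main__":
    for run, (train, test) in enumerate(k_fold_splits(n=10, k=3), start=1):
        print(f"run{run}: train={train} test={test}")
```

Every item is used for testing exactly once across the k runs, and never in the same run in which it was used for training. Setting k = n gives leave-one-out.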

Page 127: Mining Plant Pathogen Genomes for Effectors

After crossvalidation

False positive rate      0.11
False negative rate      0.29
Sensitivity              0.71
Specificity              0.89
Precision                0.56

• Use crossvalidation to find ‘best’ method & parameters
• Crossvalidation gives you estimated performance metrics on unseen data
• Apply ‘best’ method to complete dataset for prediction

Page 128: Mining Plant Pathogen Genomes for Effectors

A trip to the doctor, part II

• Test for disease X (horrible, unpleasant, potentially suppurating)
• Test has sensitivity (i.e. predicts disease where there is disease) of 95%
• Test has false positive rate (i.e. predicts disease where there is no disease) of 1%
• Your test is positive
• To calculate the probability that the test correctly determines whether you have the disease, you need to know the baseline occurrence.

Page 130: Mining Plant Pathogen Genomes for Effectors

A trip to the doctor, part II

• Baseline occurrence: 1% ⇒ P(disease|+ve) = 0.490
• Baseline occurrence: 80% ⇒ P(disease|+ve) = 0.997
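Both figures follow from Bayes' theorem applied to the test's sensitivity (95%), false positive rate (1%) and the baseline occurrence. A minimal sketch, with `posterior` as a hypothetical helper name:

```python
def posterior(sensitivity: float, false_positive_rate: float,
              baseline: float) -> float:
    """P(disease | positive test) by Bayes' theorem."""
    true_positives = sensitivity * baseline
    false_positives = false_positive_rate * (1 - baseline)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for baseline in (0.01, 0.80):
        p = posterior(sensitivity=0.95, false_positive_rate=0.01,
                      baseline=baseline)
        print(f"baseline {baseline:.0%}: P(disease|+ve) = {p:.3f}")
```

With a rare condition, the small false positive rate applies to a very large healthy population, so false positives are about as numerous as true positives.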

Page 131: Mining Plant Pathogen Genomes for Effectors

What is the baseline occurrence for effectors?

• Usually rely on predictions for expected baseline
• Bacterial genomes: ≈4500 genes
  ◦ Type III effectors: 1-10% (Arnold et al. 2009); 1-2% (Collmer et al. 2002); 1% (Boch and Bonas, 2010)
• Oomycete/fungal genomes: ≈20000 genes
  ◦ RxLRs: 120-460 (1-2%; Whisson et al. 2007); ≤563 (≲2%; Haas et al. 2009)
  ◦ CRNs: 19-196 (≲1%; Haas et al. 2009)
  ◦ CHxC: ≈30 (<1%; Kemen et al. 2011)
• We need to take care over result interpretation:
  ◦ Prediction method with 5% false negative rate and 1% false positive rate, with 1% baseline, predicting 500 effectors:
    ▪ P(effector|positive test) ≈ 0.5

Page 135: Mining Plant Pathogen Genomes for Effectors

A lesson from the literature?

• “The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of 71% and [specificity] of 85%.”
  ◦ Sensitivity [P(+ve|T3E)] = 0.71; FPR [1 − specificity; P(+ve|not T3E)] = 0.15
  ◦ Base rate [P(T3E)] ≈ 3%; Genes = 4500
  ◦ We expect P(T3E|+ve) ≈ 0.13
  ◦ (and a significant number, up to 15% of the genome, of false positives…)

$$P(\mathrm{T3E} \mid +\mathrm{ve}) = \frac{P(+\mathrm{ve} \mid \mathrm{T3E})\,P(\mathrm{T3E})}{P(+\mathrm{ve} \mid \mathrm{T3E})\,P(\mathrm{T3E}) + P(+\mathrm{ve} \mid \overline{\mathrm{T3E}})\,P(\overline{\mathrm{T3E}})}$$
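The same numbers can be turned into expected gene counts, which makes the size of the false positive problem concrete. This is a sketch using only the figures quoted on the slide (4500 genes, 3% base rate, 71% sensitivity, 85% specificity); `expected_counts` is a hypothetical helper name.

```python
def expected_counts(genes: int, base_rate: float,
                    sensitivity: float, specificity: float):
    """Expected true positives, false positives and P(effector | +ve)."""
    effectors = genes * base_rate
    non_effectors = genes - effectors
    true_positives = sensitivity * effectors
    false_positives = (1 - specificity) * non_effectors
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


if __name__ == "__main__":
    genes = 4500
    tp, fp, ppv = expected_counts(genes, base_rate=0.03,
                                  sensitivity=0.71, specificity=0.85)
    print(f"expected true positives:  {tp:.0f}")
    print(f"expected false positives: {fp:.0f} ({fp / genes:.1%} of genome)")
    print(f"P(T3E|+ve) = {ppv:.2f}")
```

Under these assumptions the false positives outnumber the true positives several times over, even though sensitivity and specificity both look respectable in isolation.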

Page 138: Mining Plant Pathogen Genomes for Effectors

A lesson from the literature?

• “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate (Table 1)”

0.038 × 5169 × 0.13 ≈ 26 [No. +ve × P(T3E|+ve)]

Page 139: Mining Plant Pathogen Genomes for Effectors

Director’s Commentary: Finding RxLRs

• Supplementary from Whisson et al. (2007)

  ◦ Whisson SC, Boevink PC, Moleleki L, Avrova AO, Morales JG, et al. (2007) A translocation signal for delivery of oomycete effector proteins into host plant cells. Nature 450: 115–118. doi:10.1038/nature06203.

• Not perfect

• Detail of one way to construct a classifier

Page 142: Mining Plant Pathogen Genomes for Effectors

Building a training set

• Starting point: 49 candidate sequences (reference set)

• Known:

  ◦ Contain (putatively) RxLR-EER motif

  ◦ All but one transcribed (i.e. not bad gene calls)

• Assumed:

  ◦ Presence of signal peptide and RxLR-EER categorises effectors

Page 145: Mining Plant Pathogen Genomes for Effectors

Building a training set

• SignalP 3.0 (Bendtsen et al. 2004) to predict locations of signal peptides.

  ◦ SignalP also has statistical performance estimates:

• Settings:

  ◦ HMM cutoff probability = 0.9

  ◦ Cleavage site between positions 10 and 40 inclusive

  ◦ Justification: use in previous studies by others
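
The two settings above amount to a simple filter on SignalP output. This is a minimal sketch, assuming the predictions have already been parsed into records. The `SignalPResult` container and its field names are hypothetical, and treating the cutoff as "greater than or equal to 0.9" is also an assumption.

```python
from typing import NamedTuple


class SignalPResult(NamedTuple):
    """Hypothetical container for one parsed SignalP HMM prediction."""
    seq_id: str
    hmm_probability: float   # signal peptide probability from the HMM
    cleavage_position: int   # predicted cleavage site position


def passes_signalp_filter(result, min_prob=0.9, min_pos=10, max_pos=40):
    """Apply the slide's thresholds: probability cutoff and cleavage window."""
    return (result.hmm_probability >= min_prob
            and min_pos <= result.cleavage_position <= max_pos)


# Made-up records for illustration only.
predictions = [
    SignalPResult("seq_a", 0.99, 21),   # passes both tests
    SignalPResult("seq_b", 0.85, 22),   # probability below cutoff
    SignalPResult("seq_c", 0.97, 45),   # cleavage site outside window
]
kept = [p.seq_id for p in predictions if passes_signalp_filter(p)]
print(kept)
```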

Page 147: Mining Plant Pathogen Genomes for Effectors

Building a training set

• Of 49, four sequences failed the SignalP criteria

  ◦ One carried forward on experimental grounds (highly-expressed)

• Training set now has 46 sequences

• But seven of these actually have no recognisable RxLR-EER motif, so are discarded

• Training set now has 39 sequences

Page 151: Mining Plant Pathogen Genomes for Effectors

Building a classifier

• We have a recognisable motif, with substantial local variation and indels

  ◦ Therefore chose profile HMM

  ◦ Use HMMER software

• Profile HMMs sensitive to quality of alignment

• Therefore treat alignment as a parameter of the HMM (much difference between alignments!)

Page 154: Mining Plant Pathogen Genomes for Effectors

Building a classifier

• Anchored at RxLR and EER

Page 155: Mining Plant Pathogen Genomes for Effectors

Building a classifier

• ClustalW

Page 156: Mining Plant Pathogen Genomes for Effectors

Building a classifier

• T-Coffee

Page 157: Mining Plant Pathogen Genomes for Effectors

Building a classifier

• Parameters modified for HMM

  ◦ Alignment package (no alignment, anchored, Clustal, DiAlign, T-Coffee) on default settings

  ◦ Full-length and truncated (no signal peptide) alignments to test for influence of signal peptide region on classifier

    ▪ Plus one alignment of RxLR-EER plus flanking region only (‘cropped’)

• HMM built for each of eleven alignments

  ◦ Default parameters

• Once built, the HMM is the classifier.

Page 158: Mining Plant Pathogen Genomes for Effectors

Building a classifier

Truncating sequences reshapes sequence space

Page 160: Mining Plant Pathogen Genomes for Effectors

Building a classifier

• HMM built for each of eleven alignments

hmmbuild --amino <output> <alignment>
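
Building one HMM per alignment is a loop over alignment files. The sketch below only assembles the command lines in the form shown on the slide and does not run HMMER. The file names and the `hmms` output directory are hypothetical.

```python
from pathlib import Path


def hmmbuild_command(alignment, hmm_dir="hmms"):
    """Build the argument list for one hmmbuild call (nothing is executed here)."""
    alignment = Path(alignment)
    hmm_out = Path(hmm_dir) / (alignment.stem + ".hmm")
    # Same form as on the slide: hmmbuild --amino <output> <alignment>
    return ["hmmbuild", "--amino", str(hmm_out), str(alignment)]


# Hypothetical alignment files; the study described here used eleven alignments.
alignments = ["alignments/clustal_full.aln", "alignments/tcoffee_truncated.aln"]
commands = [hmmbuild_command(a) for a in alignments]
for cmd in commands:
    print(" ".join(cmd))   # pass cmd to subprocess.run(cmd, check=True) to execute
```

Keeping command construction separate from execution makes it easy to inspect every command before committing to a run.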

Page 161: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

Only positive examples: How well does a classifier cover them?

Page 162: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

• Eleven classifiers to test

• Step 1: Consistency test

  ◦ Does the classifier correctly call as positive the sequences used to train it?

  ◦ Estimates recovery of the information in the training set

• Step 2: Recovery of full sequences

  ◦ Estimates performance of classifier on complete sequence data

[Figure panels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]

Page 163: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

Only positive examples: How well does a classifier recover unseen sequence?

Page 165: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

• Step 3: Leave-One-Out Crossvalidation

  ◦ But only have positive examples!

  ◦ Removes possibility that classifier matches on basis of having ‘seen’ a sequence before

Page 166: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

• Leave-one-out (LOO) crossvalidation:

  ◦ k runs, each with one training set, one test set (n items in dataset, k=n)

[Figure: the dataset split into Training Set and Test Set for run 1, run 2, …, run k]
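
The LOO splitting scheme can be sketched in a few lines. The sequence identifiers are placeholders, and the classifier rebuild and scoring step is left as a comment because it depends on the tools used.

```python
def leave_one_out(items):
    """Yield (training_set, test_item) pairs, one per item (k = n runs)."""
    for i, held_out in enumerate(items):
        training = items[:i] + items[i + 1:]   # everything except item i
        yield training, held_out


# Hypothetical sequence identifiers standing in for the training sequences.
sequences = ["seq1", "seq2", "seq3", "seq4"]
for run, (training, test) in enumerate(leave_one_out(sequences), start=1):
    # In the real procedure a classifier would be rebuilt from `training`
    # and then asked whether it recovers `test`.
    print(f"run{run}: train on {training}, test on {test}")
```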

Page 167: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

• Step 3: Leave-One-Out Crossvalidation

  ◦ But only have positive examples!

  ◦ Removes possibility that classifier matches on basis of having ‘seen’ a sequence before

[Figure panels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]

Better match to classifier than to control

Page 168: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

• Step 4: Tests on negative samples

  ◦ Completely shuffled sequences

  ◦ Shuffled downstream of the signal peptide only

  ◦ Replace RxLR-EER with AAAA-AAA

No classifier identifies a false positive (no classifier matches on sequence composition alone)
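
The three kinds of negative sample listed above could be generated along these lines. This is a sketch under assumptions: the toy sequence is made up, the signal peptide length is passed in by the caller, and the motif is located with a simple regular expression rather than the alignment-based approach described in the talk.

```python
import random
import re


def shuffle_all(seq, rng):
    """Completely shuffle a sequence, preserving composition only."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def shuffle_downstream(seq, cleavage_pos, rng):
    """Keep the first cleavage_pos residues (signal peptide); shuffle the rest."""
    return seq[:cleavage_pos] + shuffle_all(seq[cleavage_pos:], rng)


def mask_motif(seq):
    """Replace the first RxLR with AAAA and the first downstream EER with AAA."""
    match = re.search(r"R.LR", seq)
    if match is None:
        return seq
    head = seq[:match.start()] + "AAAA"
    tail = seq[match.end():].replace("EER", "AAA", 1)
    return head + tail


rng = random.Random(0)                           # fixed seed for repeatability
toy = "MKLLSVAALLAVSSAHA" + "TTRSLRGDDSEERAAKP"   # made-up sequence, not real data
print(shuffle_all(toy, rng))
print(shuffle_downstream(toy, 17, rng))
print(mask_motif(toy))
```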

Page 169: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

• Step 4: Tests on negative samples

  ◦ Completely shuffled sequences

  ◦ Shuffled downstream of the signal peptide only

  ◦ Replace RxLR-EER with AAAA-AAA

(some recognition on basis of signal peptide)

[Figure panels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]

Page 170: Mining Plant Pathogen Genomes for Effectors

Testing the classifiers

• Step 4: Tests on negative samples

  ◦ Completely shuffled sequences

  ◦ Shuffled downstream of the signal peptide only

  ◦ Replace RxLR-EER with AAAA-AAA

(some recognition on sequence other than motif)

[Figure panels: SigP-RxLR-Cterm, RxLR-Cterm, RxLR]

Page 171: Mining Plant Pathogen Genomes for Effectors

Choosing a classifier

• The ‘cropped’ classifier has:

  ◦ 100% recovery of positive training sequences

  ◦ 0% recovery of negative test sequences

• Some variation in classifier performance on whole genome:

Page 173: Mining Plant Pathogen Genomes for Effectors

Oranges are not the only fruit

• Other classifiers had been proposed, e.g. Bhattacharjee et al. (2006):

  ◦ Presence of signal peptide, with cleavage site in first 40aa

  ◦ Regular expression test:

    ▪ R.LR.{,40}[ED][ED][KR] in first 100aa after cleavage site

• Can choose between methods, or report range of predictions
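
The regular expression test above translates directly into code. This is a sketch under assumptions: the cleavage position comes from a separate signal peptide predictor, "in first 100aa" is read as requiring the whole match to fall inside that window, and the example sequences are made up.

```python
import re

# The slide's pattern; {0,40} is written out explicitly (same meaning as {,40}).
RXLR_EER = re.compile(r"R.LR.{0,40}[ED][ED][KR]")


def has_rxlr_eer(seq, cleavage_pos, max_cleavage=40, window=100):
    """Return True if the regex matches within `window` residues after cleavage.

    cleavage_pos is the number of residues in the predicted signal peptide,
    assumed here to come from a separate tool such as SignalP.
    """
    if cleavage_pos > max_cleavage:
        return False
    mature_start = seq[cleavage_pos:cleavage_pos + window]
    return RXLR_EER.search(mature_start) is not None


# Made-up sequences for illustration only.
positive = "M" + "A" * 19 + "TTQ" + "RSLR" + "G" * 12 + "EER" + "K" * 80
negative = "M" + "A" * 19 + "TTQ" + "GSLG" + "G" * 12 + "QQQ" + "K" * 80
print(has_rxlr_eer(positive, cleavage_pos=20))
print(has_rxlr_eer(negative, cleavage_pos=20))
```

Running this alongside an HMM-based classifier on the same sequences is one way to report a range of predictions rather than a single number.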

Page 175: Mining Plant Pathogen Genomes for Effectors

So how did it work out…?

• Refined all RxLR predictions to ‘priority set’ of ≈200 for cloning

• First set of 46 candidate effectors (07/11):

  ◦ 25 host interactors detected by Y2H

  ◦ Localisation data for 41 candidates

  ◦ Silencing phenotypes for 19 candidates

  ◦ 22 putative orthologues with P. capsici

• Currently:

  ◦ 44 silencing phenotypes

[Figure: Transient expression in leaf of GFP-fused RxLR candidate, showing plasma membrane localisation]

Page 176: Mining Plant Pathogen Genomes for Effectors

Acknowledgements

• Phytophthora groups at JHI

  ◦ (Paul Birch, Steve Whisson, Dave Cooke)

• Bacteriology groups at JHI

  ◦ (Ian Toth, Nicola Holden)

• Imaging at JHI

  ◦ (Petra Boevink)

• Numerous statisticians

  ◦ (David Broadhurst, Andy Woodward, BioSS)

Page 177: Mining Plant Pathogen Genomes for Effectors

Sequence space

Page 178: Mining Plant Pathogen Genomes for Effectors

CD-Hit sequence ordering

• “Algorithm limitations: […]

Let say, there are two clusters: cluster #1 has A, X and Y where A is the representative, and cluster #2 has B and Z where B is the representative. The problem is that even if Y is more similar to B than to A, it can still be in cluster #1 because Y first hits A during the clustering process.”

  ◦ http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide
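
The ordering effect in the quoted passage can be reproduced with a toy greedy clustering loop. This is a deliberately simplified sketch, not CD-HIT's actual implementation: the similarity scores are invented to match the A/B/X/Y/Z scenario, and `greedy_cluster` is a name introduced here.

```python
def greedy_cluster(order, similarity, threshold):
    """Assign each item to the first representative it matches, in input order."""
    clusters = {}                      # representative -> list of members
    for item in order:
        for rep in clusters:           # representatives in order of creation
            if similarity(item, rep) >= threshold:
                clusters[rep].append(item)
                break                  # first hit wins; later reps are not checked
        else:
            clusters[item] = [item]    # no hit: item becomes a new representative
    return clusters


# Made-up pairwise similarities matching the quoted scenario.
SCORES = {
    frozenset("AX"): 0.95, frozenset("AY"): 0.91, frozenset("AB"): 0.50,
    frozenset("BZ"): 0.95, frozenset("BY"): 0.97,
}


def toy_similarity(a, b):
    """Look up a symmetric toy similarity; unlisted pairs score 0."""
    return SCORES.get(frozenset((a, b)), 0.0)


print(greedy_cluster(["A", "B", "X", "Y", "Z"], toy_similarity, threshold=0.9))
```

In this toy setup Y ends up with A even though its score against B is higher, because A is checked first and already clears the threshold.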