Gathering Sequences: BLAST - T-Coffee Selecting Diverse Sequences (Opus II) Selecting Diverse...


Transcript of Gathering Sequences: BLAST - T-Coffee Selecting Diverse Sequences (Opus II) Selecting Diverse...

  • Gathering Sequences: BLAST

    Common Mistake: Sequences Too Closely Related

    PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE

    PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE

    PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE

    PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE

    PRVA_RAT SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE

    PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE

    :**::*.*******:***:* :****************..::******:***********

    PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES

    PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES

    PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES

    PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES

    PRVA_RAT DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES

    PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES

    :*** ******.******.**** *:************.:******:**

    -IDENTICAL SEQUENCES BRING NO INFORMATION FOR THE MULTIPLE SEQUENCE ALIGNMENT

    -MULTIPLE SEQUENCE ALIGNMENTS THRIVE ON DIVERSITY…
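
One way to act on this point before aligning is to drop near-duplicate sequences. The sketch below is an illustration only and is not part of the original slides: it takes the first 60-column block of the alignment shown above, computes pairwise percent identity, and greedily keeps a sequence only if it is not too similar to any sequence already kept. The function names and the `max_identity` cutoff are hypothetical; a sensible cutoff depends on the protein family.

```python
# First 60-column block from the slide (already aligned, no gaps in this block).
ALIGNED = {
    "PRVA_MACFU": "SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE",
    "PRVA_HUMAN": "SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE",
    "PRVA_GERSP": "SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE",
    "PRVA_MOUSE": "SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE",
    "PRVA_RAT":   "SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE",
    "PRVA_RABIT": "AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE",
}


def percent_identity(a, b):
    """Percent of aligned columns with identical residues (gap-gap columns ignored)."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(pairs)


def filter_redundant(aligned, max_identity):
    """Greedy filter: keep a sequence only if it is at most `max_identity`
    percent identical to every sequence kept so far (assumed heuristic)."""
    kept = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept
```

In this block PRVA_MACFU and PRVA_HUMAN differ at only 2 of 60 columns, so almost any cutoff treats them as redundant; the six sequences together say little more than one of them alone.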

  • Selecting Diverse Sequences (Opus I)

    Respect Information!

    -This Alignment Is not Informative about the relation Between TPCC_MOUSE and the rest of the sequences.

    -A better Spread of the Sequences is needed

    PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA

    PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA

    PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA

    PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA

    PRVA_RAT ------------------------------------------SMTDLLS----AEDIKKA

    PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA

    TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

    : :*. .*::::

    PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI

    PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI

    PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI

    PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI

    PRVA_RAT IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI

    PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI

    TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

    :. . * .*..:*: *: * *. :::..:*:::**: .*:*: :** :

    PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-

    PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-

    PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-

    PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-

    PRVA_RAT LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-

    PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-

    TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

    *: . .. :: .: : *: ***:.**:*. :** ::

  • Selecting Diverse Sequences (Opus II)

    PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE

    PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE

    PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE

    PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE

    PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE

    PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE

    PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE

    : *: .: . .* .:*. * ** *: * : * :* * **:**

    PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-

    PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG

    PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-

    PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-

    PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-

    PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES

    PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA

    :** .*:.* .* *: ** :: .* **** **::** **

    -A REASONABLE Model Now Exists.

    -Going Further: Remote Homologues.

  • Aligning Remote Homologues

    PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA

    PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA

    PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA

    PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG

    PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA

    PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA

    PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA

    TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI

    TPCS_PIG -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI

    TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

    : ::

    PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI

    PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV

    PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF

    PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF

    PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF

    PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF

    PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF

    TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI

    TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI

    TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

    : . .: .. . *: * : * :* : .*:*: :** .

    PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-

    PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-

    PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--

    PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-

    PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--

    PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--

    PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--

    TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ

    TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ

    TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

    :: .. :: : :: .* :.** *. :** ::

    Going Further…

    PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI

    PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF

    PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF

    TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI

    TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI

    TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

    TPC_PATYE SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI

    . : .. . :: . : * :* : .* *. : * .

    PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--

    PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--

    PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---

    TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-

    TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-

    TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-

    TPC_PATYE LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA

    : . :: : :: * :..* :. :** ::

  • WHAT MAKES A GOOD ALIGNMENT

    -THE MORE DIVERGENT THE SEQUENCES, THE BETTER

    -THE FEWER INDELS, THE BETTER

    -NICE UNGAPPED BLOCKS SEPARATED WITH INDELS

    -DIFFERENT CLASSES OF RESIDUES WITHIN A BLOCK:

    •Completely Conserved

    •Conserved For Size and Hydropathy

    •Conserved For Size or Hydropathy

    -THE ULTIMATE EVALUATION IS A MATTER OF PERSONAL JUDGEMENT AND KNOWLEDGE.
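
The marker line printed under each alignment block in these slides uses `*`, `:` and `.`. The sketch below shows one way such a line can be derived, assuming the Clustal convention: `*` for an identical column, `:` when all residues fall in one "strong" group, `.` when they fall in one "weak" group. Reading `:` and `.` as the slide's "size and hydropathy" and "size or hydropathy" classes is an interpretation, not something the slides state, and the group lists are quoted from memory of the Clustal documentation, so treat them as an assumption.

```python
# Residue groups assumed from the Clustal convention (not given in the slides).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Classify one alignment column as '*', ':', '.' or ' '."""
    residues = set(column)
    if "-" in residues:
        return " "          # any gap: no conservation symbol
    if len(residues) == 1:
        return "*"          # completely conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"          # all residues within one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."          # all residues within one weak group
    return " "


def consensus_line(rows):
    """Build the marker line for a list of equal-length aligned rows."""
    return "".join(column_symbol(col) for col in zip(*rows))


# Second block of the closely related parvalbumin alignment from the first slide.
BLOCK = [
    "DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES",  # PRVA_MACFU
    "DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES",  # PRVA_HUMAN
    "DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES",  # PRVA_GERSP
    "DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES",  # PRVA_MOUSE
    "DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES",  # PRVA_RAT
    "EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES",  # PRVA_RABIT
]
print(consensus_line(BLOCK))
```

A block dominated by `*` is conserved but uninformative; a block that mixes `*`, `:` and `.` with few blank columns is closer to what the list above describes.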

    DO NOT OVERTUNE!!!

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-

    wheat ANKLKGEYNKAIAAYNKGESA

    trybr AEKDKERYKREM---------

    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    DO NOT PLAY WITH PARAMETERS IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

    chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

    wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

    mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. :*: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-

    wheat ANKLKGEYNKAIAAYNKGESA

    trybr AEKDKERYKREM---------

    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

  • TUNING or NOT TUNING?

    -MOST METHODS ARE TUNED FOR WORKING WELL ON AVERAGE

    -PARAMETER BEHAVIOUR DOES NOT NECESSARILY FOLLOW THE THEORY (e.g. Substitution Matrices).

    -A GOOD ALIGNMENT IS USUALLY ROBUST (i.e. Changes little).

    -TUNE IF YOU WANT TO CONVINCE YOURSELF.

    -PARAMETERS TO TUNE USUALLY INCLUDE:

    •GOP / GEP

    •MATRIX

    •SENSITIVITY Vs SPEED
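
To make the GOP / GEP pair concrete, here is a minimal sketch of an affine gap penalty. It assumes the convention cost = GOP + GEP × length; some programs charge GEP × (length − 1) instead, so this is one possible definition rather than the one any particular aligner uses. The function names and the values 10 and 0.5 are illustrative only.

```python
import re


def affine_gap_cost(length, gop, gep):
    """Cost of a single gap: one opening penalty plus a per-position extension."""
    if length <= 0:
        return 0.0
    return gop + gep * length


def total_gap_cost(row, gop, gep):
    """Sum the affine cost over every run of '-' in one aligned row."""
    return sum(affine_gap_cost(len(run), gop, gep)
               for run in re.findall(r"-+", row))


# Same number of gap positions, laid out as one long gap or four short ones.
one_long = total_gap_cost("AB----CD", gop=10, gep=0.5)
four_short = total_gap_cost("A-B-C-D-", gop=10, gep=0.5)
print(one_long, four_short)
```

With a high GOP relative to GEP, scattered single-residue gaps cost far more than one long gap, which pushes an aligner toward ungapped blocks separated by a few indels.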

    GOP / GEP

    Substitution Matrices (Etzold et al. 1993)

    Gonnet 61.7

    Blosum50 59.7

    Pam250 59.2

    KEEP A BIOLOGICAL PERSPECTIVE

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-

    wheat ANKLKGEYNKAIAAYNKGESA

    trybr AEKDKERYKREM---------

    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-

    wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS

    trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG

    mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

    * *** .:: ::... : * . . . : * . *: *

    chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G-

    wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA

    trybr RKVYEEMAEKDKERY----K--RE-M-------

    mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----

    : : * : .* :

    DIFFERENT PARAMETERS
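
A quick, rough way to compare the two alignments above is to count how many separate gaps each one opens. The sketch below does this for the first block of each; the variable and function names are hypothetical, and a gap count is only a crude proxy that does not replace looking at the alignment.

```python
import re

# First block of the first alignment shown on this slide.
FIRST_ALIGNMENT = [
    "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD",   # chite
    "--DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE",   # wheat
    "KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP",   # trybr
    "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP",   # mouse
]

# First block of the alignment labelled "DIFFERENT PARAMETERS".
DIFFERENT_PARAMETERS = [
    "AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-",  # chite
    "-DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS",  # wheat
    "-K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG",  # trybr
    "----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS",  # mouse
]


def gap_runs(row):
    """Number of separate gaps (maximal runs of '-') in one aligned row."""
    return len(re.findall(r"-+", row))


def total_gap_runs(rows):
    """Total number of separate gaps over all rows of a block."""
    return sum(gap_runs(row) for row in rows)


print(total_gap_runs(FIRST_ALIGNMENT), total_gap_runs(DIFFERENT_PARAMETERS))
```

The second alignment opens several times as many gaps as the first and has almost no ungapped blocks left, which is the pattern the slides warn against.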

  • REPEATS

    THERE IS A PROBLEM WHEN TWO SEQUENCES DO NOT CONTAIN THE SAME NUMBER OF REPEATS

    IT IS THEN BETTER TO MANUALLY EXTRACT THE REPEATS AND TO ALIGN THEM. INDIVIDUAL REPEATS CAN BE RECOGNIZED USING DOTTER
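
The idea behind spotting repeats in a dot plot can be sketched in a few lines: compare a sequence against itself and look for off-diagonal runs of matches. This is not Dotter's algorithm, only a toy illustration of the principle; the function names, the window size and the test sequence are all made up for the example.

```python
from collections import Counter


def self_dotplot(seq, window, min_matches):
    """Return (i, j) pairs with i < j whose windows agree at >= min_matches positions."""
    hits = []
    n = len(seq) - window + 1
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(1 for k in range(window) if seq[i + k] == seq[j + k])
            if matches >= min_matches:
                hits.append((i, j))
    return hits


def repeat_offsets(seq, window=5, min_matches=5):
    """Count hits per diagonal offset (j - i); a crowded off-diagonal suggests a repeat."""
    return Counter(j - i for i, j in self_dotplot(seq, window, min_matches))


# Made-up sequence: the same 10-residue unit twice, separated by a 4-residue linker.
toy = "MKTAYIAKQR" + "GGSG" + "MKTAYIAKQR"
print(repeat_offsets(toy).most_common(1))
```

Here every hit lies on the diagonal at offset 14, the distance between the two copies; once the repeat boundaries are known, the units can be cut out and aligned to each other as the slide suggests.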

    Choosing The Right Method

    [Table: columns PROBLEM, PROGRAM, METHOD; programs listed: ClustalW, ClustalW, MSA, DIALIGN II, DIALIGN II]

    Source: BaliBase, Thompson et al, NAR, 1999

  • Examples of Mistakes

    Playing With Blocks: Mbh1

  • Playing With Blocks: tRNA Synthases

    Playing With Blocks: RTase

  • Conclusion

    The Best Alignment Method:

    •Your Brain

    •The Right Data

    The Best Evaluation:

    •Your Eyes

    •Experimental Information (SwissProt)

    What Can I Conclude:

    •Homology => Information Extrapolation

    How Can I go Further?:

    •Prosite Patterns.

    •Prosite Profiles.