biopywork at uga

Transcript of biopywork at uga

  • 8/9/2019 biopywork at uga

    1/55

    IOB Workshop: Biopython
    A programming toolkit for bioinformatics

    Eric Talevich

    Institute of Bioinformatics, University of Georgia

    Mar. 29, 2012
  • 8/9/2019 biopywork at uga

    2/55

    Getting started with Biopython
  • 8/9/2019 biopywork at uga

    3/55

    Installing Python

    Biopython is a library for the Python programming language.

    First, you'll need these installed:

    Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)

    IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.

    Now, start an interactive session in IDLE. [1]

    [1] On your own, check out IPython (http://ipython.scipy.org/). It's an enhanced Python interpreter that feels somewhat like R.
  • 8/9/2019 biopywork at uga

    4/55

    Installing Python packages

    Biopython is a Python package. There are a few standard ways to install Python packages:

    From source: Download from PyPI [2], unpack and install with the included setup.py script.

    easy_install: Install setuptools from source [3], then use the easy_install command to fetch and install all other packages by name:
        $ easy_install

    pip: Like easy_install, use pip [4] to manage packages:
        $ pip install

    [2] http://pypi.python.org/pypi/
    [3] http://pypi.python.org/pypi/setuptools
    [4] http://pypi.python.org/pypi/pip
  • 8/9/2019 biopywork at uga

    5/55

    Installing NumPy, matplotlib and Biopython

    Biopython relies on a few other Python packages for extra functionality. We'll use these:

    numpy: efficient numerical functions and data structures (for Bio.PDB)

    matplotlib: plotting (for Bio.Phylo)

    Then finally:

    biopython: the reason we're here today

    (Biopython, NumPy, matplotlib, setuptools and pip are also packaged for many Linux distributions.)
  • 8/9/2019 biopywork at uga

    6/55

    Testing

    Check your Biopython installation:

    >>> import Bio
    >>> print Bio.__version__

    Import a NumPy-based component:

    >>> from Bio import PDB

    Show a simple plot:

    >>> from matplotlib import pyplot
    >>> pyplot.plot(range(5), range(5))
    >>> pyplot.show()
  • 8/9/2019 biopywork at uga

    7/55

    Let's start using
  • 8/9/2019 biopywork at uga

    8/55

    Biopython

    1 Sequences and alignments
        The Seq object
        SeqIO and the SeqRecord object

    2 NCBI EUtils and BLAST
        EUtils: Entrez Programming Utilities
        NCBI Blast
        External programs

    3 Phylogenetics

    4 Protein structures
  • 8/9/2019 biopywork at uga

    9/55

    Sequences and Alignments
  • 8/9/2019 biopywork at uga

    10/55

    The Seq object

    >>> from Bio.Seq import Seq
    >>> myseq = Seq("AGTACACTGGT")
    >>> myseq
    Seq('AGTACACTGGT', Alphabet())
    >>> print myseq
    AGTACACTGGT
    >>> myseq.transcribe()
    Seq('AGUACACUGGU', RNAAlphabet())
    >>> myseq.translate()
    Seq('STL', ExtendedIUPACProtein())
  • 8/9/2019 biopywork at uga

    11/55

    A Seq object consists of:

    data: the underlying Python character string

    alphabet: DNA, RNA, protein, etc.

    It supports most Python string methods:

    >>> myseq.count("GT")
    2

    And some biology-specific methods, too:

    >>> myseq.reverse_complement()
    Seq('ACCAGTGTACT', Alphabet())

    Intrigued? Read on:

    >>> help(Seq)
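
For intuition, the reverse complement can be reproduced on a plain Python string. This is only a sketch for unambiguous DNA letters (A, C, G, T); the stand-alone function `reverse_complement` and the `COMPLEMENT` table are made up for illustration and are not Biopython's implementation, which also covers ambiguity codes and RNA.

```python
# Plain-Python sketch (no Biopython required); handles only A, C, G, T.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    # Walk the sequence backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(dna))


result = reverse_complement("AGTACACTGGT")
print(result)  # ACCAGTGTACT, the same letters the slide shows
```

Applying the function twice gives back the original string, which is a quick sanity check for any complement table.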

  • 8/9/2019 biopywork at uga

    12/55

    SeqIO: Sequence Input/Output

    Sequence data is stored in many different file formats. Bio.SeqIO supports:

    abi      fastq    phylip     swiss
    ace      genbank  pir        tab
    clustal  ig       qual       uniprot-xml
    embl     imgt     seqxml
    emboss   nexus    sff
    fasta    phd      stockholm

    Manually fetch some data from the PDB website: [5]

    1ATP.fasta: two protein sequences, FASTA format

    1ATP.pdb: the 3D structure, for later

    [5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP
  • 8/9/2019 biopywork at uga

    13/55

    The SeqIO API

    SeqIO provides four functions:

    parse: Iteratively parse all elements in the file

    read: Parse a one-element file and return the element

    write: Write elements to a file

    convert: Parse one format and immediately write another

    Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
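
To make "iteratively parse" concrete, here is a minimal sketch of a lazy FASTA reader in plain Python. The function name `iter_fasta` and the two toy records are assumptions for illustration; this is not how Bio.SeqIO is implemented, and it ignores the metadata that SeqRecord objects carry. Like SeqIO.parse, it hands back one record at a time rather than loading the whole file.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs one record at a time from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input (hypothetical data) wrapped in a file-like object.
example = io.StringIO(u">seq1 demo\nACGT\nACGT\n>seq2\nGGCC\n")
records = list(iter_fasta(example))
print(records)
```

Calling `list(...)` on the generator mirrors converting the SeqIO.parse iterator to a list when all records are needed at once.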

  • 8/9/2019 biopywork at uga

    15/55

    The SeqRecord object

    SeqIO.parse returns SeqRecords. SeqRecord wraps a Seq object and attaches metadata.

    1 Pass the file name to the SeqIO parser; specify FASTA format:

    from Bio import SeqIO
    seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
    print seqrecs

    2 To see all records at once, convert the iterator to a list:

    allrecs = list(seqrecs)
    print allrecs[0]
    print allrecs[0].seq
  • 8/9/2019 biopywork at uga

    18/55

    Example: Shuffled sequences

    Given a real DNA sequence, create a background set of randomized sequences with the same composition.

    Procedure:

    1 Read the source sequence from a file
        Use Bio.SeqIO

    2 In a loop:
        Shuffle the sequence
            Use random.shuffle from Python's standard library
        Create a new SeqRecord from the shuffled sequence
            Because SeqIO.write works with SeqRecords

    3 Write the shuffled SeqRecords to another file
  • 8/9/2019 biopywork at uga

    19/55

    import random
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    orig_rec = SeqIO.read("gi2.gb", "genbank")
    alphabet = orig_rec.seq.alphabet
    out_recs = []
    for i in xrange(1, 31):
        nucleotides = list(orig_rec.seq)
        random.shuffle(nucleotides)
        new_seq = Seq("".join(nucleotides), alphabet)
        new_rec = SeqRecord(new_seq,
                            id="shuffle" + str(i))
        out_recs.append(new_rec)

    SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
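
The "same composition" property is easy to verify. The sketch below works on plain strings rather than Seq objects so it runs without Biopython; the helper `shuffled_copies`, the seed, and the short toy sequence are assumptions for illustration only.

```python
import random
from collections import Counter


def shuffled_copies(sequence, n_copies, seed=None):
    """Return n_copies shuffled versions of a sequence string."""
    rng = random.Random(seed)  # private generator, reproducible when seeded
    copies = []
    for _ in range(n_copies):
        letters = list(sequence)
        rng.shuffle(letters)  # in-place shuffle keeps every letter
        copies.append("".join(letters))
    return copies


original = "AGTACACTGGT"
background = shuffled_copies(original, 30, seed=0)

# Each shuffled copy must contain exactly the same letter counts.
same_composition = all(Counter(s) == Counter(original) for s in background)
print(same_composition)
```

Because shuffling only reorders letters, the check holds for any seed; it is mainly useful as a guard if the shuffling step is later replaced with something more elaborate.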

  • 8/9/2019 biopywork at uga

    20/55

    Example: ORF translation

    Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.

    Biopython can help with each piece of this problem:

    1 Parse the given unannotated DNA sequences (SeqIO.parse)

    2 Get the template strand's sequence (Seq.reverse_complement)

    3 Translate both strands into protein sequences (Seq.translate)

    4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)

    5 Split sequences at stop codons (Seq.split("*"))

    6 Write translated sequences to a new file (SeqIO.write)
  • 8/9/2019 biopywork at uga

    21/55

    def translate_six_frames(seq, table=1):
        """Translate a nucleotide sequence in 6 frames.

        Returns an iterable of 6 translated protein
        sequences.
        """
        rev = seq.reverse_complement()
        for i in range(3):
            # Coding (Crick) strand
            yield seq[i:].translate(table)
            # Template (Watson) strand
            yield rev[i:].translate(table)
  • 8/9/2019 biopywork at uga

    22/55

    def translate_orfs(sequences, min_prot_len=60):
        """Find and translate all ORFs in sequences.

        Translates each sequence in all 6 reading frames,
        splits sequences on stop codons, and produces an
        iterable of all protein sequences of length at least
        min_prot_len.
        """
        for seq in sequences:
            for frame in translate_six_frames(seq):
                for prot in frame.split("*"):
                    if len(prot) >= min_prot_len:
                        yield prot
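
The two string operations at the heart of this example, frame shifting by slicing and splitting on the stop symbol, can be tried on plain strings. This is a sketch, not part of the workshop code: `frame_offsets`, `split_orfs`, and the toy DNA and protein strings are hypothetical, and no real translation is performed here.

```python
def frame_offsets(dna):
    """Return the three forward reading frames of a DNA string."""
    return [dna[i:] for i in range(3)]


def split_orfs(protein, min_prot_len=60):
    """Yield stop-free segments of a translated frame, dropping short ones."""
    for segment in protein.split("*"):  # "*" marks a stop codon
        if len(segment) >= min_prot_len:
            yield segment


frames = frame_offsets("ATGGCCTAA")
# Toy translated frame with three stop codons (made-up data).
kept = list(split_orfs("MKT*AL*MSTLQRWV*", min_prot_len=3))
print(frames)
print(kept)
```

The length filter is what keeps the output manageable: most stop-free segments in random frames are short, so a small `min_prot_len` lets many spurious fragments through.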

  • 8/9/2019 biopywork at uga

    23/55

    from Bio import SeqIO
    from Bio.SeqRecord import SeqRecord

    if __name__ == "__main__":
        import sys
        infile = sys.stdin
        outfile = sys.stdout
        records = SeqIO.parse(infile, "fasta")
        seqs = (rec.seq for rec in records)
        proteins = translate_orfs(seqs)
        # Wrap each ORF in a SeqRecord so SeqIO.write can handle it
        seqrecs = (SeqRecord(seq, id="orf" + str(i))
                   for i, seq in enumerate(proteins))
        SeqIO.write(seqrecs, outfile, "fasta")
  • 8/9/2019 biopywork at uga

    24/55

    AlignIO and the Alignment object

    Alignment: a set of sequences with the same length and alphabet.

    Use AlignIO just like SeqIO:

    >>> from Bio import AlignIO
    >>> aln = AlignIO.read("PF01601.sto", "stockholm")
    >>> print aln
    SingleLetterAlphabet() alignment with 22 rows and 730 columns
    NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
    NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
    NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
    NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
    NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
    NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
    NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
    ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
    ...
    DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449
  • 8/9/2019 biopywork at uga

    25/55

    Snack Time
  • 8/9/2019 biopywork at uga

    26/55

    EUtils and BLAST
  • 8/9/2019 biopywork at uga

    29/55

    EUtils: Entrez Programming Utilities

    Access NCBI's online services:

    from Bio import Entrez, SeqIO
    Entrez.email = "[email protected]"

    Request a GenBank record:

    handle = Entrez.efetch(db="protein", id="69316",
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "gb")

    Specify multiple IDs in one query:

    handle = Entrez.efetch(db="protein",
                           id="349839,349840",
                           rettype="fasta", retmode="text")
    records = SeqIO.parse(handle, "fasta")
  • 8/9/2019 biopywork at uga

    30/55

    Interlude: SeqRecord attributes

    seq: the sequence (Seq) itself

    id: primary ID for the sequence, e.g. accession number (string)

    name: common name/id for the sequence, like GenBank LOCUS id

    description: human-readable description of the sequence

    letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores

    annotations: dictionary of additional unstructured info

    features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence

    dbxrefs: list of database cross-references (strings)
  • 8/9/2019 biopywork at uga

    31/55

    from Bio import Entrez, SeqIO
    Entrez.email = "[email protected]"
    handle = Entrez.efetch(db="nucleotide", id="M95169",
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print record
    print record.features[10]
    sliced = record[20000:]  # Last 25% of the genome
    print sliced

    from Bio.Seq import Seq
    from Bio.Alphabet import generic_protein
    translations = [f.qualifiers["translation"]
                    for f in record.features[1:]]
    proteins = [Seq(t[0], generic_protein)
                for t in translations]
  • 8/9/2019 biopywork at uga

    32/55

    NCBI Blast

    BLAST can be used either standalone or through NCBI's server.

    Online:

    >>> from Bio.Blast import NCBIWWW
    >>> result_handle = NCBIWWW.qblast(
    ...     "blastp", "nr", query_string)

    Standalone:

    Legacy (blastall):

    >>> from Bio.Blast.Applications import BlastallCommandline
    >>> help(BlastallCommandline)

    New hotness (Blast+):

    >>> from Bio.Blast.Applications import NcbiblastpCommandline
    >>> help(NcbiblastpCommandline)
  • 8/9/2019 biopywork at uga

    33/55

    Parsing BLAST output

    BLAST produces reports in plain-text and XML format.

    Biopython requests XML by default.

    >>> from Bio.Blast import NCBIWWW, NCBIXML
    >>> result_handle = NCBIWWW.qblast("blastp",
    ...     "nr", query_string)
    >>> blast_record = NCBIXML.read(result_handle)
    >>> print blast_record
  • 8/9/2019 biopywork at uga

    34/55

    # Search for homologs of a protein sequence

    from Bio import SeqIO
    from Bio.Blast import NCBIWWW, NCBIXML

    # Read and reformat the query sequence
    seqrec = SeqIO.read("gi2.gb", "gb")
    query = seqrec.format("fasta")

    # Submit an online BLAST query
    # (This takes some time to run)
    result_handle = NCBIWWW.qblast("blastx", "nr", query)
  • 8/9/2019 biopywork at uga

    35/55

    # 1. Save the BLAST results as an XML file
    with open("aprotinin_blast.xml", "w") as save_file:
        save_file.write(result_handle.read())
    result_handle.close()

    # NB: The BLAST result handle can only be read once
    # Reload it from the file
    with open("aprotinin_blast.xml") as result_handle:
        blast_record = NCBIXML.read(result_handle)
  • 8/9/2019 biopywork at uga

    36/55

    # 2. Display a histogram of BLAST hit scores

    def get_scores(alignments):
        for aln in alignments:
            for hsp in aln.hsps:
                yield hsp.score

    scores = list(get_scores(blast_record.alignments))

    # Draw the histogram
    import pylab
    pylab.hist(scores, bins=20)
    pylab.title("Scores of %d BLAST hits" % len(scores))
    pylab.xlabel("BLAST score")
    pylab.ylabel("# hits")
    pylab.show()

    # Save a copy for later
    pylab.savefig("aprotinin_scores.png")
  • 8/9/2019 biopywork at uga

    37/55

    Figure: Histogram of BLAST scores generated by pylab
  • 8/9/2019 biopywork at uga

    38/55

    # 3. Extract the sequences of high-scoring BLAST hits

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    def get_hsps(alignments, threshold):
        for aln in alignments:
            for hsp in aln.hsps:
                if hsp.score >= threshold:
                    yield SeqRecord(Seq(hsp.sbjct),
                                    id=aln.accession)
                    break

    best_seqs = get_hsps(blast_record.alignments, 321)
    SeqIO.write(best_seqs, "aprotinin.fasta", "fasta")
  • 8/9/2019 biopywork at uga

    39/55

    Calling other external programs

    Biopython has wrappers for other command-line programs in:

    Bio.Blast.Applications: the Blast+ suite

    Bio.Align.Applications: Muscle, ClustalW, ...

    Bio.Emboss.Applications: needle, water, ...

    Let's re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
  • 8/9/2019 biopywork at uga

    40/55

    from Bio import AlignIO
    from Bio.Align.Applications import MuscleCommandline
    from StringIO import StringIO

    # Construct the shell command
    muscle_cmd = MuscleCommandline(input="aprotinin.fasta")

    # Execute the command
    # Get output (the alignment) and any error messages
    muscle_out, muscle_err = muscle_cmd()

    # Read the alignment back in
    align = AlignIO.read(StringIO(muscle_out), "fasta")

    # Format the alignment for Phylip
    AlignIO.write([align], "aprotinin.phy", "phylip")
  • 8/9/2019 biopywork at uga

    41/55

    Phylogenetics

  • 8/9/2019 biopywork at uga

    42/55

    Phylogenetic tree I/O

    Start with:

    >>> from Bio import Phylo

    Input and output of trees is just like SeqIO:

    read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats

    write: to any of the formats supported by read/parse

    convert: between two formats in one step

    Use StringIO to load strings directly:

    >>> from cStringIO import StringIO
    >>> handle = StringIO("((A,B),(C,(D,E)));")
    >>> tree = Phylo.read(handle, "newick")
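
A note for readers on Python 3: the `cStringIO` module no longer exists there, and `io.StringIO` from the standard library plays the same role. The sketch below only demonstrates the file-like behaviour of such a handle, using the Newick string from the slide; it does not call Biopython, and the variable names are illustrative.

```python
from io import StringIO

# Wrap a Newick string so it can be passed wherever a file handle is expected.
handle = StringIO(u"((A,B),(C,(D,E)));")

first_read = handle.read()   # consumes the whole buffer
second_read = handle.read()  # empty: the read position is now at the end
handle.seek(0)               # rewind to reuse the same handle
rewound = handle.read()

print(first_read)
```

The empty second read is the same effect that makes any handle single-use unless it is rewound or reopened.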

  • 8/9/2019 biopywork at uga

    43/55

    What's in a tree?

    Make a tree with branch lengths:

    >>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
    ...     "(C:2,(D:1,E:1):1):1);"), "newick")

    View the object structure of the entire tree:

    >>> print tree

    Draw an ASCII-art (plain text) representation:

    >>> Phylo.draw_ascii(tree)

    ... OK, let's do it properly now:

    >>> Phylo.draw(tree)
  • 8/9/2019 biopywork at uga

    44/55

    Modify the tree

    Check the tree object for its methods:

    >>> help(tree)

    Try a few:

    >>> tree.get_terminals()
    >>> clade = tree.common_ancestor("A", "B")
    >>> clade.color = "red"
    >>> tree.root_with_outgroup("D", "E")
    >>> tree.ladderize()
    >>> Phylo.draw(tree)
  • 8/9/2019 biopywork at uga

    45/55

    External applications

    Biopython wraps a number of external programs for phylogenetics. We're not going to use them now, but here's where to find them:

    Bio.Phylo.PAML: PAML wrappers & helpers

    Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you'd like to see sooner?)

    Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
  • 8/9/2019 biopywork at uga

    46/55

    Protein structures
  • 8/9/2019 biopywork at uga

    48/55

    Going 3D: The PDB module

    Load a structure:

    >>> from Bio import PDB
    >>> parser = PDB.PDBParser()
    >>> struct = parser.get_structure("1ATP",
    ...     "1ATP.pdb")

    Inspect the object hierarchy:

    >>> list(struct)
    >>> model = struct[0]
    >>> list(model)
    >>> chain = model["E"]
    >>> list(chain)
    >>> residue = chain[15]
    >>> list(residue)
  • 8/9/2019 biopywork at uga

    49/55

    Figure: The SMCRA object hierarchy

  • 8/9/2019 biopywork at uga

    50/55

    Extracting a peptide sequence

    Get the amino acid sequence through a Polypeptide object:

    >>> from Bio import PDB
    >>> parser = PDB.PDBParser()
    >>> struct = parser.get_structure("1ATP",
    ...     "1ATP.pdb")
    >>> ppb = PDB.PPBuilder()
    >>> peptides = ppb.build_peptides(struct)
    >>> for pep in peptides:
    ...     print pep.get_sequence()
  • 8/9/2019 biopywork at uga

    51/55

    Calculating RMSD

    Given two aligned structures, filter a list of target residues for high RMS deviation.

    Input:
        list of residue positions (integers)
        two equivalent chains from aligned protein models; residue numbers must match
        minimum RMSD value (float)

    Output: list of residue positions, filtered

    Procedure:
        1 Extract coordinates of Cα atoms
        2 If available (not glycine), extract Cβ coordinates, too
        3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
        4 Compare to the given RMSD threshold
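
Since the structures are assumed to be aligned already, the quantity being compared is the RMSD of the coordinates as given, with no further superposition. A plain-Python sketch of that calculation follows; the function `rmsd` and the toy coordinates are assumptions for illustration, standing in for the Cα and Cβ positions of one residue in each chain.

```python
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        # Squared distance between each pair of matching atoms.
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


# Toy data: two atoms per residue, each displaced by 1 unit in the second model.
ref_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
cmp_atoms = [(0.0, 0.0, 1.0), (1.5, 1.0, 0.0)]
value = rmsd(ref_atoms, cmp_atoms)
print(value)
```

With only one or two atoms per residue the value is effectively a per-residue displacement, which is why a single threshold is enough to pick out the deviating positions.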

  • 8/9/2019 biopywork at uga

    52/55

    from Bio.SVDSuperimposer import SVDSuperimposer
    from numpy import array

    def filt_rms(resids, refchain, cmpchain, threshold=0.5):
        sup = SVDSuperimposer()
        for res in resids:
            refres = refchain[res]
            cmpres = cmpchain[res]
            coord1 = [refres["CA"].get_coord()]
            coord2 = [cmpres["CA"].get_coord()]
            if refres.has_id("CB") and cmpres.has_id("CB"):
                # Not glycine
                coord1.append(refres["CB"].get_coord())
                coord2.append(cmpres["CB"].get_coord())
            sup.set(array(coord1), array(coord2))
            # RMSD of the coordinates as given (structures are already aligned)
            rmsd = sup.get_init_rms()
            if rmsd >= threshold:
                yield res
  • 8/9/2019 biopywork at uga

    53/55

    Figure: Superimposed structures, with selected deviating residues

  • 8/9/2019 biopywork at uga

    54/55

    Further reading

    Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html

    Biopython wiki: http://biopython.org/

    This presentation: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
  • 8/9/2019 biopywork at uga

    55/55

    Thanks

    'Preciate it.

    Gracias