BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple...

63
BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl PDB BLAST Some of the slides are based on previous material by J. Grefenstette

Transcript of BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple...

Page 1: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 1

Topics

Advanced SQL for Multiple Table Databases JOINS

Objected Oriented Perl BioPerl

PDB BLAST

Some of the slides are based on previous material by J. Grefenstette

Page 2: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Multiple Table DatabasesSuppose we want to store info such as:+----------+----------+----------------+-----------+------------+| name | city | country | city size | population |+----------+----------+----------------+-----------+------------+| Wu | Shanghai | China | 17500000 | 1321851888 || Lakshmi | Bombay | India | 21600000 | 1129866154 || Prabhu | Delhi | India | 21500000 | 1129866154 || Susan | New York | United States | 21900000 | 301139947 || Richard | Chicago | United States | 9800000 | 301139947 || Robert | London | United Kingdom | 12000000 | 60776238 || Ali | London | United Kingdom | 12000000 | 60776238 || Sohyoung | Seoul | South Korea | 23400000 | 49044790 |+----------+----------+----------------+-----------+------------+Problem: When population of city or country changes, we have to update many records

data may become inconsistent if we update, say, the city size for one person!

Solution: Design Database so that information is only stored in one place

called Normalizing the Database Uses several tables to eliminate many-to-one relationships within a table Queries can join several tables together

JOINS

2BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 3: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Multiple Table DatabasesThe PERSON table:

mysql> select * from person;+----+----------+----------+| id | name | city |+----+----------+----------+| 1 | Susan | New York || 2 | Robert | London || 3 | Ali | London || 4 | Richard | Chicago || 5 | Wu | Shanghai || 6 | Lakshmi | Bombay || 7 | Sohyoung | Seoul || 8 | Prabhu | Delhi |+----+----------+----------+8 rows in set (0.00 sec)

JOINS

3BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 4: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Multiple Table DatabasesThe CITY table:

mysql> select * from city limit 10;+-------------+---------------+------------+| name | country | population |+-------------+---------------+------------+| Tokyo | Japan | 33600000 || Seoul | South Korea | 23400000 || Mexico City | Mexico | 22400000 || New York | United States | 21900000 || Bombay | India | 21600000 || Delhi | India | 21500000 || Sao Paulo | Brazil | 20600000 || Los Angeles | United States | 18000000 || Shanghai | China | 17500000 || Osaka | Japan | 16700000 |+-------------+---------------+------------+10 rows in set (0.00 sec)

JOINS

4BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 5: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Multiple Table DatabasesThe COUNTRY table:

mysql> select * from country limit 10;+---------------+------------+| name | population |+---------------+------------+| China | 1321851888 || India | 1129866154 || United States | 301139947 || Indonesia | 234693997 || Brazil | 190010647 || Pakistan | 169270617 || Bangladesh | 150448339 || Russia | 141377752 || Nigeria | 135031164 || Japan | 127467972 |+---------------+------------+10 rows in set (0.01 sec)

JOINS

5BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 6: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 6

Wikipedia Definition of a JOIN

An SQL JOIN clause combines records from two or more tables in a database. It creates a set that can be saved as a table or used as is. A JOIN is a means for combining fields from two tables by using values common to each. ANSI standard SQL specifies four types of JOINs: INNER, OUTER, LEFT, and RIGHT. In special cases, a table (base table, view, or joined table) can JOIN to itself in a self-join.

JOINS

Page 7: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

JOINing Two TablesTo JOIN two tables, use:SELECT columns From table1, table2 WHERE join conditions

mysql> select * from person, city where person.city = city.name;+----+----------+----------+----------+----------------+------------+| id | name | city | name | country | population |+----+----------+----------+----------+----------------+------------+| 1 | Susan | New York | New York | United States | 21900000 || 2 | Robert | London | London | United Kingdom | 12000000 || 3 | Ali | London | London | United Kingdom | 12000000 || 4 | Richard | Chicago | Chicago | United States | 9800000 || 5 | Wu | Shanghai | Shanghai | China | 17500000 || 6 | Lakshmi | Bombay | Bombay | India | 21600000 || 7 | Sohyoung | Seoul | Seoul | South Korea | 23400000 || 8 | Prabhu | Delhi | Delhi | India | 21500000 |+----+----------+----------+----------+----------------+------------+8 rows in set (0.01 sec)

JOINS

7BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 8: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Ambiguous Column NamesTo name a field, use table.column if column name appears

in multiple tables:

mysql> select person.name, person.city, country, populationfrom person, city where person.city = city.name;

+----------+----------+----------------+------------+| name | city | country | population |+----------+----------+----------------+------------+| Susan | New York | United States | 21900000 || Robert | London | United Kingdom | 12000000 || Ali | London | United Kingdom | 12000000 || Richard | Chicago | United States | 9800000 || Wu | Shanghai | China | 17500000 || Lakshmi | Bombay | India | 21600000 || Sohyoung | Seoul | South Korea | 23400000 || Prabhu | Delhi | India | 21500000 |+----------+----------+----------------+------------+8 rows in set (0.00 sec)

JOINS

8BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 9: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Aliases for Column TablesYou can provide an ALIAS for any column name:

mysql> select person.name, person.city, city.country,city.population as 'city size'from person, city where person.city = city.name;

+----------+----------+----------------+-----------+| name | city | country | city size |+----------+----------+----------------+-----------+| Susan | New York | United States | 21900000 || Robert | London | United Kingdom | 12000000 || Ali | London | United Kingdom | 12000000 || Richard | Chicago | United States | 9800000 || Wu | Shanghai | China | 17500000 || Lakshmi | Bombay | India | 21600000 || Sohyoung | Seoul | South Korea | 23400000 || Prabhu | Delhi | India | 21500000 |+----------+----------+----------------+-----------+8 rows in set (0.00 sec)

JOINS

9BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 10: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

WHERE conditions in JOINSUse WHERE conditions to limit which records are

returned.Example: show date for people from India

mysql> select person.name, person.city, city.country, city.population as 'city size' from person, city where person.city = city.name and city.country = 'India';

+---------+--------+---------+-----------+| name | city | country | city size |+---------+--------+---------+-----------+| Lakshmi | Bombay | India | 21600000 || Prabhu | Delhi | India | 21500000 |+---------+--------+---------+-----------+2 rows in set (0.00 sec)

JOINS

10BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 11: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

JOIN Three TablesCan JOIN three of more tables.Example: Show both country and city population for each person:

mysql> select person.name, person.city, city.country, city.population as 'city size', country.population from person, city, country where person.city = city.name and city.country = country.name;

+----------+----------+----------------+-----------+------------+| name | city | country | city size | population |+----------+----------+----------------+-----------+------------+| Susan | New York | United States | 21900000 | 301139947 || Robert | London | United Kingdom | 12000000 | 60776238 || Ali | London | United Kingdom | 12000000 | 60776238 || Richard | Chicago | United States | 9800000 | 301139947 || Wu | Shanghai | China | 17500000 | 1321851888 || Lakshmi | Bombay | India | 21600000 | 1129866154 || Sohyoung | Seoul | South Korea | 23400000 | 49044790 || Prabhu | Delhi | India | 21500000 | 1129866154 |+----------+----------+----------------+-----------+------------+8 rows in set (0.00 sec)

JOINS

11BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 12: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

ORDER BYSame as above, by sort results by country

population:

mysql> select person.name, person.city, city.country, city.population as 'city size', country.population from person, city, country where person.city = city.name and city.country = country.nameorder by country.population desc;

+----------+----------+----------------+-----------+------------+| name | city | country | city size | population |+----------+----------+----------------+-----------+------------+| Wu | Shanghai | China | 17500000 | 1321851888 || Lakshmi | Bombay | India | 21600000 | 1129866154 || Prabhu | Delhi | India | 21500000 | 1129866154 || Susan | New York | United States | 21900000 | 301139947 || Richard | Chicago | United States | 9800000 | 301139947 || Robert | London | United Kingdom | 12000000 | 60776238 || Ali | London | United Kingdom | 12000000 | 60776238 || Sohyoung | Seoul | South Korea | 23400000 | 49044790 |+----------+----------+----------------+-----------+------------+8 rows in set (0.00 sec)

JOINS

12BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 13: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 13

BioPerl

Object Oriented Programming (OOP) Intro to BioPerl

Bio::Seq Bio::SeqIO Bio::DB Bio::Perl Others

BioPerl OOP

Page 14: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Perl Modules A Module is a file containing Perl code

The file should have a .pm extension

Code within a module is brought into a program with the use statement:

use moduledir::module;

where moduledir is the directory in which the module file exists. the moduledir can be omitted if module is in current directory

or in the standard Perl directories (specified by the @INC list)

You can access data and subroutines by using “full name”use myModule;

$myModule::counter = 3;

myModule::translate($dna);

BioPerl OOP

14BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 15: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Object-oriented programming (OOP) is a software engineering technique for modularizing code

Think about a program as a set of interacting objects

Objects have attributes data that describes the object

Objects also have actions (or methods) code that controls how an object interacts with other objects

Main ideas: Hide the details of an object’s data

so it can be changed easily (by the developer) users don’t need to know excess details

All interactions are performed via the object’s methods

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 15

Object Oriented Programming

BioPerl OOP

Page 16: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 16

Basics of Perl Objects

In Procedural Programming, data and functions to operate on data that are kept separate. Example:$reverse_complement = revcom( $sequence );

In Object-Oriented Programming, both data and functions are encapsulated into the same construct (called objects). Example: $reversed_obj = $seq_obj->revcom();

$seq_obj is called an object, and revcom is called a method

In other words, the object $seq_obj “knows” how to calculate and return its own reverse complement.

BioPerl OOP

Page 17: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Basics of Perl Objects Objects are separated into types called classes

The class of an object defines both the data that it can hold and the methods that it knows.

A specific object that has a defined class is referred to as an instance of that class.

Perl uses the term "module" rather than "class".

An object in Perl is a special kind of hash reference.

In Bioperl the data that the object contains is stored in a single, complex hash and the object, like $seq_obj, is a reference to this hash. In addition, the methods that the object can use are also stored in this hash as particular kinds of references.

BioPerl OOP

17BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,

and BLAST)

Page 18: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

18

BioPerl(www.bioperl.org) A collection of modules that facilitates the development of Perl

scripts for bioinformatics applications Some Simple Procedural Functions

Bio::Perl Sequences

Bio::Seq Bio:SeqIO Bio::PrimarySeq

Multiple Sequence Alignments Bio::SimpleAlign

Access to Databases Bio::DB::GenBank

Access to Tools Bio::Tools::Blast Bio::Tools::BPlite

See Table 12.1 of Tisdall

BioPerl

Page 19: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 19

# these methods return strings, and accept strings in some cases:

$seqobj->seq(); # string of sequence$seqobj->subseq(5,10); # part of the sequence as a string$seqobj->accession_number(); # when there, the accession number$seqobj->alphabet(); # one of 'dna','rna',or 'protein'$seqobj->seq_version() # when there, the version$seqobj->keywords(); # when there, the Keywords line$seqobj->length() # length$seqobj->desc(); # description$seqobj->primary_id(); # a unique id for this sequence regardless # of its display_id or accession number$seqobj->display_id(); # the human readable id of the sequence

Most of these features are included in GenBank records.

Methods on Bio::Seq Objects

BioPerl

Page 20: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Using Bio::Seq Each Class contains a method called new that returns an new object of the

class The new function takes arguments that initialize the data belonging to the

new object Once an object is created, you can access all the methods associated with its

class by calling its attached functions

use Bio::Seq; # module to handle sequence data

# create a new sequence object$seq_obj = Bio::Seq->new(-seq => 'ATGGGGGTGGTGGTACCCT', -id => 'human_id', -accession_number => 'AL000012', -alphabet => 'dna');

# the seq() method returns the object's sequence as a stringprint "DNA: ", $seq_obj->seq(), "\n";

# create a new sequence object contain protein sequence$protein_obj = $seq_obj->translate();print "Protein: ", $protein_obj->seq(), "\n";

OUTPUT:DNA: ATGGGGGTGGTGGTACCCTProtein: MGVVVP

BioPerl

20BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,

and BLAST)

Page 21: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Using Bio::Seq and Bio::SeqIO#!/usr/bin/perl

# File: bioseqio.pl

use strict;

use warnings;

use Bio::Seq;

use Bio::SeqIO; # module to read/write many sequence formats

# create a new sequence object

my $seq_obj = Bio::Seq->new(-seq => 'ATGGGGGTGGTGGTACCCT',

-id => 'human_id',

-accession_number => 'AL000012',

-alphabet => 'dna');

# create a new FASTA file handler object (NOTE: ">" for writing)

my $seqio_obj = Bio::SeqIO->new(-file=>">myseq.fsa",'-format'=>'fasta');

# tell the file handler to print the sequence object

$seqio_obj->write_seq($seq_obj);

% cat myseq.fsa>human_idATGGGGGTGGTGGTACCCT

BioPerl

21BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,

and BLAST)

Page 22: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Reading FASTA file with Bio::SeqIO#!/usr/bin/perl# File: read_fasta.pluse strict;use warnings;use Bio::Seq;use Bio::SeqIO; # module to read/write many sequence formats

my $file = "seq3.fsa"; #this is a FASTA file with 3 seqs

# create a new FASTA file handler object (no ">" for reading!)my $seqio_obj = Bio::SeqIO->new(-file=> $file, -format=>"fasta");

# read sequences from the FASTA filewhile (my $seq_obj = $seqio_obj->next_seq()) { my $seq_string = $seq_obj->seq(); print "$seq_string\n";}OUTPUT:

ATTGGGTGCGCGTGCNCCTTCC

aaaaaatcatactacatgtagggtaca

aaaaaatcatact

BioPerl

22BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 23: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

A Sample GenBank Record - ILOCUS NM_007553 1541 bp mRNA linear ROD 07-

JAN-2002DEFINITION Mus musculus bone morphogenetic protein 2 (Bmp2), mRNA.ACCESSION NM_007553VERSION NM_007553.1 GI:6680793KEYWORDS .SOURCE house mouse. ORGANISM Mus musculus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;

Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;

Murinae; Mus.REFERENCE 1 (bases 1 to 1541) AUTHORS Feng,J.Q., Harris,M.A., Ghosh-Choudhury,N., Feng,M.,

Mundy,G.R. and Harris,S.E. TITLE Structure and sequence of mouse bone morphogenetic protein-2

gene (BMP-2): comparison of the structures and promoter regions

of BMP-2 and BMP-4 genes JOURNAL Biochim. Biophys. Acta 1218 (2), 221-224 (1994) MEDLINE 94289485REFERENCE 2 (bases 1 to 1541) AUTHORS Heller,L.C., Li,Y., Abrams,K.L. and Rogers,M.B. TITLE Transcriptional regulation of the Bmp2 gene. Retinoic acid induction in F9 embryonal carcinoma cells and Saccharomyces cerevisiae JOURNAL J. Biol. Chem. 274 (3), 1394-1400 (1999) MEDLINE 99098878 PUBMED 9880512COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to

final NCBI review. The reference sequence was derived from

L25602.1.

FEATURES Location/Qualifiers source 1..1541 /organism="Mus musculus" /db_xref="taxon:10090" /chromosome="2" /map="2 76.1 cM" /cell_type="spleen" gene 1..1541 /gene="Bmp2" /db_xref="LocusID:12156" /db_xref="MGD:88177" CDS 357..1541 /gene="Bmp2" /codon_start=1 /db_xref="LocusID:12156" /db_xref="MGD:88177" /product="bone morphogenetic protein 2" /protein_id="NP_031579.1" /db_xref="GI:6680794"

/translation="MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAAASSRPLSRP

SEDVLSEFELRLLSMFGLKQRPTPSKDVVVPPYMLDLYRRHSGQPGAPAPDHRLERAA

SRANTVRTFHQLEAVEELPEMSGKTARRFFFNLSSVPSDEFLTSAELQIFREQIQEAL

GNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWESFDVTPAVMRWTTQG

HTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHK

REKRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADH

LNSTNHAIVQTLVNSVNSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCG CR"

BioPerl

23BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 24: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 24

A Sample GenBank Record - II

misc_feature 492..1154 /note="TGFb_propeptide; Region: TGF-beta

propeptide" misc_feature 1236..1538 /note="TGFB; Region: Transforming growth

factor-beta (TGF-beta) family" misc_feature 1236..1538 /note="TGF-beta; Region: Transforming

growth factor beta like domain"BASE COUNT 361 a 444 c 414 g 322 tORIGIN 1 cgcactcctc cccctgctcg aggctgtgtg tcagcacttg

gctggagact tcttgaactt 61 gccgggagag tgacttgggc tccccacttc gcgccggtgt

cctcgcccgg cggatccagt 121 cttgccgcct ccagcccgat cacctctctt cctcagcccg

ctggcccacc ccaagacaca 181 gttccctaca gggagaacac ccggagaagg aggaggaggc

gaagaaaagc aacagaagcc 241 cagttgctgc tccaggtccc tcggacagag ctttttccat

gtggagactc tctcaatgga 301 cgtgccccct agtgcttctt agacggactg cggtctccta

aagccgcagg tcgaccatgg 361 tggccgggac ccgctgtctt ctagtgttgc tgcttcccca

ggtcctcctg ggcggcgcgg 421 ccggcctcat tccagagctg ggccgcaaga agttcgccgc

ggcatccagc cgacccttgt 481 cccggccttc ggaagacgtc ctcagcgaat ttgagttgag

gctgctcagc atgtttggcc 541 tgaagcagag acccaccccc agcaaggacg tcgtggtgcc

cccctatatg ctagatctgt 601 accgcaggca ctcaggccag ccaggagcgc ccgccccaga

ccaccggctg gagagggcag

661 ccagccgcgc caacaccgtg cgcacgttcc atcaactaga agccgtggag gaacttccag

721 agatgagtgg gaaaacggcc cggcgcttct tcttcaattt aagttctgtc cccagtgacg

781 agtttctcac atctgcagaa ctccagatct tccgggaaca gatacaggaa gctttgggaa

841 acagtagttt ccagcaccga attaatattt atgaaattat aaagcctgca gcagccaact

901 tgaaatttcc tgtgaccaga ctattggaca ccaggttagt gaatcagaac acaagtcagt

961 gggagagctt cgacgtcacc ccagctgtga tgcggtggac cacacaggga cacaccaacc

1021 atgggtttgt ggtggaagtg gcccatttag aggagaaccc aggtgtctcc aagagacatg

1081 tgaggattag caggtctttg caccaagatg aacacagctg gtcacagata aggccattgc

1141 tagtgacttt tggacatgat ggaaaaggac atccgctcca caaacgagaa aagcgtcaag

1201 ccaaacacaa acagcggaag cgcctcaagt ccagctgcaa gagacaccct ttgtatgtgg

1261 acttcagtga tgtggggtgg aatgactgga tcgtggcacc tccgggctat catgcctttt

1321 actgccatgg ggagtgtcct tttccccttg ctgaccacct gaactccact aaccatgcca

1381 tagtgcagac tctggtgaac tctgtgaatt ccaaaatccc taaggcatgc tgtgtcccca

1441 cagagctcag cgcaatctcc atgttgtacc tagatgaaaa tgaaaaggtt gtgctaaaaa

1501 attatcagga catggttgtg gagggctgcg ggtgtcgtta g

//

BioPerl

Page 25: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

#!/usr/bin/perl#File: readgb.pl

use strict;use warnings;use Bio::Seq;use Bio::SeqIO;my $inputfile = $ARGV[0];

# create file input objectmy $in = Bio::SeqIO->new(-file=> $inputfile, -format=> 'GenBank'); while (my $seq = $in->next_seq()){ # print out parts of the sequence print "Sequence ", $seq->id(), " with accession ", $seq->accession_number, " and description:\n", $seq->desc, "\n\n";}

Reading GenBank files with Bio::SeqIO

Bio::SeqIO provides methods for reading and printing files in many file formats: FASTA, GenBank, SwissProt, GCG, EMBL, etc

BioPerl

25BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,

and BLAST)

Page 26: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 26

[binf:~/binf634/bioperl] jsolka% ./readgb.pl genbank_sample.txtSequence NM_007553 with accession NM_007553 and description:Mus musculus bone morphogenetic protein 2 (Bmp2), mRNA.

The Program in Action

BioPerl

Page 27: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 27

Direct Access to Databases

You can create new Bio::Seq objects by importing them directly from a remote database using the Bio::DB family of modules GenBank: Bio::DB::GenBank GenPept Bio::DB::GenPept SwissProt Bio::DB::SwissProt GDB Bio::DB::GDB ACEDB Bio::DB:Ace

BioPerl

Page 28: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Fetching a Sequence from GenBank#!/usr/bin/perluse strict;use warnings;use Bio::Seq;use Bio::DB::GenBank;# File: getgb.pl

# get GI number from command linemy ($id) = @ARGV;

# create a new GenBank handle objectmy $gb = new Bio::DB::GenBank();

# Fetch the sequence from GenBank, returning a seq_objectmy $seq_obj = $gb->get_Seq_by_id($id);

# print out parts of the sequenceprint "Id = ", $id, "\n", "Accession = ", $seq_obj->accession_number, "\n", "Description = ", $seq_obj->desc, "\n", "Sequence = ", $seq_obj->seq, "\n";

BioPerl

28BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 29: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 29

[binf:~/binf634/bioperl] jsolka% getgb.pl 1293614Id = 1293614Accession = AAA98665Description = TCP1-beta [Saccharomyces cerevisiae].Sequence = SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM

•Check out this site for more information•http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

The Program in Action

BioPerl

Page 30: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Bio::Perl has a number of easy to use functions, including

get_sequence - gets sequence from standard internet databases read_sequence - reads a sequence from a file read_all_sequences - reads all sequences from a file new_sequence - makes a bioperl sequence from a string write_sequence - writes one or more sequences to a file translate - provides a translation of a sequence translate_as_string - provides translation of a sequence as string blast_sequence - BLASTs sequence against databases at NCBI write_blast - writes a blast report out to a file

Bio::Perl

BioPerl

30BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,

and BLAST)

Page 31: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

The Bio::Perl module provides some simple access functions, for example, this script will retrieve a GenBank sequence and write it out in two different formats

#!/usr/bin/perl# File: getgenbank.pluse strict;use warnings;use Bio::Perl;

# this script will only work with an internet connection# on the computer it is run onmy $seq_object = get_sequence('genbank',"ROA1_HUMAN");

# write the sequence in FASTA formatwrite_sequence(">roa1.fasta",'fasta',$seq_object);

# write the sequence in GenBank formatwrite_sequence(">roa1.genbank",'genbank',$seq_object);

exit;

Bio::Perl

BioPerl

31BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,

and BLAST)

Page 32: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

% cat roa1.fasta>ROA1_HUMAN Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing proteinSKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKALSKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGGSGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF

% cat roa1.genbankLOCUS ROA1_HUMAN 371 aa linear HUMAN 01-MAR-1989DEFINITION Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing protein) (Single-strand binding protein) (hnRNP core protein A1).ACCESSION P09651KEYWORDS Nuclear protein; RNA-binding; Repeat; Ribonucleoprotein; Methylation; Phosphorylation; Alternative splicing; 3D-structure; Polymorphism.SOURCE Human ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.ORIGIN 1 sksespkepe qlrklfiggl sfettdeslr shfeqwgtlt dcvvmrdpnt krsrgfgfvt 61 yatveevdaa mnarphkvdg rvvepkravs redsqrpgah ltvkkifvgg ikedteehhl 121 rdyfeqygki evieimtdrg sgkkrgfafv tfddhdsvdk iviqkyhtvn ghncevrkal 181 skqemasass sqrgrsgsgn fgggrgggfg gndnfgrggn fsgrggfggs rggggyggsg 241 dgyngfgndg gyggggpgys ggsrgygsgg qgygnqgsgy ggsgsydsyn ngggrgfggg 301 sgsnfggggs yndfgnynnq ssnfgpmkgg nfggrssgpy ggggqyfakp rnqggyggss 361 ssssygsgrr f// 32

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 33: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 33

LWP: fetch WWW documents To automate WWW access LWP::Simple - procedural interface to LWP Example of usage:use LWP::Simple;

$url = "http://www.whatever.com/data.html";

$page = get($url);

if ($page)

{ # do something }

else { print "Problems getting $url"; } See the following URL for an example

http://stein.cshl.org/genome_informatics/chervitz/perl-modules/ http://oreilly.com/openbook/webclient/ch05.html

BioPerl

Page 34: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 34

Protein Data Bank The Protein Data Bank (PDB) is

a repository of protein data structure

http://www.rcsb.org/pdb/home/home.do

The nature of proteins Primary sequences Secondary structures

Alpha helices Beta strands Turns

Tertiary structures Beta sheets Alpha-alpha units

Quaternary structures Enzyme Virus

PDB

Page 35: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 35

Overview of PDB

ASCII flat files that are human readable

One structure per file

Problems with consistency of the formats Code you create today may not work

tomorrow

BioPerlPDB

Page 36: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 36

Files and Folders

PDB is distributed as files within directories

There are in-depth discussions of how to handle these directory structures within Chapter 11 of Tisdall We will talk a little about recursion next

week

PDB

Page 37: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 37

An Example PDB File - IHEADER IMMUNOGLOBULIN 10-MAR-99 43C9 TITLE CRYSTALLOGRAPHIC STRUCTURE OF THE ESTEROLYTIC AND TITLE 2 AMIDOLYTIC 43C9 ANTIBODY COMPND MOL_ID: 1; COMPND 2 MOLECULE: IMMUNOGLOBULIN (LIGHT CHAIN); COMPND 3 CHAIN: A, C, E, G; COMPND 4 FRAGMENT: FV; COMPND 5 ENGINEERED: YES; COMPND 6 MOL_ID: 2; COMPND 7 MOLECULE: IMMUNOGLOBULIN (HEAVY CHAIN); COMPND 8 CHAIN: B, D, F, H; COMPND 9 FRAGMENT: FV; COMPND 10 ENGINEERED: YES SOURCE MOL_ID: 1; SOURCE 2 ORGANISM_SCIENTIFIC: MUS MUSCULUS; SOURCE 3 ORGANISM_COMMON: MOUSE; SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 5 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3); SOURCE 6 MOL_ID: 2; SOURCE 7 ORGANISM_SCIENTIFIC: MUS MUSCULUS; SOURCE 8 ORGANISM_COMMON: MOUSE; SOURCE 9 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 10 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3) KEYWDS IMMUNOGLOBULIN EXPDTA X-RAY DIFFRACTION

AUTHOR M.M.THAYER,E.D.GETZOFF,V.A.ROBERTS

REVDAT 1 18-AUG-99 43C9 0

JRNL AUTH M.M.THAYER,E.H.OLENDER,A.S.ARVAI,C.K.KOIKE,

JRNL AUTH 2 I.L.CANESTRELLI,J.D.STEWART,S.J.BENKOVIC,

JRNL AUTH 3 E.D.GETZOFF,V.A.ROBERTS

JRNL TITL STRUCTURAL BASIS FOR AMIDE HYDROLYSIS CATALYZED BY

JRNL TITL 2 THE 43C9 ANTIBODY

JRNL REF J.MOL.BIOL. V. 291 329 1999

JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070

REMARK 1

REMARK 1 REFERENCE 1

REMARK 1 AUTH V.A.ROBERTS,J.STEWART,S.J.BENKOVIC,E.D.GETZOFF

REMARK 1 TITL CATALYTIC ANTIBODY MODEL AND MUTAGENESIS IMPLICATE

REMARK 1 TITL 2 ARGININE IN TRANSITION- STATE STABILIZATION

REMARK 1 REF J.MOL.BIOL. V. 235 1098 1994

REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 0070

REMARK 1 REFERENCE 2 ...

PDB

Page 38: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

An Example PDB File - IIREMARK 800 REMARK 800 HIS 91 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G) - REMARK 800 NUCLEOPHILE THAT FORMS A COVALENT BOND TO THE REMARK 800 SUBSTRATE REMARK 800 REMARK 800 ARG 96 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G)- REMARK 800 SIDE CHAIN STABILIZES NEGATIVE CHARGE FORMED IN REMARK 800 THE TRANSITION STATES OF THE REACTION REMARK 800 REMARK 900 REMARK 900 RELATED ENTRIES REMARK 900 RELATED ID: 43CA RELATED DB: PDB DBREF 43C9 A 1 107 GB 207840 M68968 1 113 DBREF 43C9 B 0 112 GB 207840 M68968 127 244 DBREF 43C9 C 1 107 GB 207840 M68968 1 113 DBREF 43C9 D 0 112 GB 207840 M68968 127 244 DBREF 43C9 E 1 107 GB 207840 M68968 1 113 DBREF 43C9 F 0 112 GB 207840 M68968 127 244 DBREF 43C9 G 1 107 GB 207840 M68968 1 113 DBREF 43C9 H 0 112 GB 207840 M68968 127 244SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE

SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS SEQRES 1 B 118 GLY GLN VAL GLN LEU VAL GLU SER GLY PRO GLY LEU VAL SEQRES 2 B 118 ALA PRO SER GLN SER LEU SER ILE THR CYS THR VAL SER SEQRES 3 B 118 GLY ILE SER LEU SER ARG TYR ASN VAL HIS TRP VAL ARG SEQRES 4 B 118 GLN SER PRO GLY LYS GLY LEU GLU TRP LEU GLY MET ILE SEQRES 5 B 118 TRP GLY GLY GLY SER ILE GLU TYR ASN PRO ALA LEU LYS SEQRES 6 B 118 SER ARG LEU SER ILE SER LYS ASP ASN SER LYS SER GLN SEQRES 7 B 118 ILE PHE LEU LYS MET ASN SER LEU GLN THR ASP ASP SER SEQRES 8 B 118 ALA MET TYR TYR CYS VAL SER TYR GLY TYR GLY GLY ASP SEQRES 9 B 118 ARG PHE SER TYR TRP GLY GLN GLY THR LEU VAL THR VAL SEQRES 10 B 118 SER ...FORMUL 9 HOH *162(H2 O1) SHEET 1 A 4 MET A 4 THR A 7 0 SHEET 2 A 4 VAL A 19 LYS A 24 -1 N LYS A 24 O THR A 5 SHEET 3 A 4 ASP A 70 ILE A 75 -1 N ILE A 75 O VAL A 19 SHEET 4 A 4 PHE A 62 SER A 67 -1 N SER A 67 O ASP A 70 SHEET 1 B 5 SER A 10 SER A 14 0 SHEET 2 B 5 THR A 102 LYS A 107 1 N LYS A 103 O LEU A 11 SHEET 3 B 5 ALA A 84 GLN A 90 -1 N TYR A 86 O THR A 102 SHEET 4 B 5 LEU A 33 GLN A 38 -1 N GLN A 38 O ASP A 85 SHEET 5 B 5 PRO A 44 VAL A 48 -1 N VAL A 48 O TRP A 35 SHEET 1 C 4 GLN B 3 SER B 7 0 SHEET 2 C 4 LEU B 18 SER B 25 -1 N SER B 25 O GLN B 3 SHEET 3 C 4 GLN B 77 MET B 82 -1 N MET B 82 O LEU B 18 ...

PDB

38BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB,

and BLAST)

Page 39: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 39

An Example PDB File-IIIATOM 1 N ASP A 1 7.422 50.761 33.806 1.00 64.60 N ATOM 2 CA ASP A 1 8.287 50.204 34.870 1.00 77.57 C ATOM 3 C ASP A 1 8.848 48.834 34.503 1.00 77.32 C ATOM 4 O ASP A 1 9.752 48.357 35.183 1.00 82.14 O ATOM 5 CB ASP A 1 7.507 50.107 36.205 1.00 86.78 C ATOM 6 CG ASP A 1 8.392 49.691 37.423 1.00 92.57 C ATOM 7 OD1 ASP A 1 9.630 49.891 37.411 1.00 93.28 O ATOM 8 OD2 ASP A 1 7.826 49.183 38.422 1.00 96.45 O ATOM 9 N VAL A 2 8.333 48.184 33.458 1.00 70.66 N ATOM 10 CA VAL A 2 8.864 46.859 33.118 1.00 63.77 C ATOM 11 C VAL A 2 10.397 47.012 32.964 1.00 58.68 C ATOM 12 O VAL A 2 10.879 47.598 32.010 1.00 62.61 O ATOM 13 CB VAL A 2 8.149 46.194 31.855 1.00 62.22 C ...CONECT 6928 6353

MASTER 381 0 0 13 80 0 4 24 7254 8 16

76 END

PDB

Page 40: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 40

PDB File Format - I

In-depth discussions are available here http://www.wwpdb.org/docs.html

PDB files are composed of 80 columns that begin with one of several record names and end with a newline

Blank columns are padded with spaces A record type is one or more lines with the

same record name Each record type has a different number of

fields with them

PDB

Page 41: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 41

PDB File Formats -II

Record Types DBREF

Reference to the entry in the sequence database SEQADV

Identification of conflicts between pdb and the named sequence database

SEQRES Primary sequence of backbone residues

MODRES Identification of modification to standard residues

PDB

Page 42: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 42

Our Example PDB File – The DBREF Section

DBREF 43C9 A 1 107 GB 207840 M68968 1 113

DBREF 43C9 B 0 112 GB 207840 M68968 127 244

DBREF 43C9 C 1 107 GB 207840 M68968 1 113

DBREF 43C9 D 0 112 GB 207840 M68968 127 244

DBREF 43C9 E 1 107 GB 207840 M68968 1 113

DBREF 43C9 F 0 112 GB 207840 M68968 127 244

DBREF 43C9 G 1 107 GB 207840 M68968 1 113

DBREF 43C9 H 0 112 GB 207840 M68968 127 244

This tells us there is a PDB file called pdb43cp.ent and that the residues in chain A are labeled from 1 to 107 in the original GenBank database and the ID number is 207840 and the name m68968 are assigned to that database and the residues are numbered 1 to 113 in PDB

PDB

Page 43: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Our Example PDB File – The SEQRES Section

SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS

Col Data Type Field Def

1-6 Record name

“SEQRES”

9-10 Integer serNum Serial # of the SEQRES rec for the current chain. Starts at 1 and increments by 1. Resets for a new chain.

12 Character Chain ID Chain identifier may be a single character including a blank.

14-17 Integer numRes Number of residues in the chain. Repeated in every record.

20-22 Residue name

resName Residue name

24-26 Residue name

resName Residue name

PDB

43

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 44: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 44

Using BioPerl to Parse a pdb File

#!/usr/local/bin/perl -w

# Parsing PDB reports with BioPerl's Bio::Structure:IO module

use strict;use Bio::Structure::IO;

my $inFile;

# Prompt the user for the file name if it's not an argument

# NOTE: PDB file must be in text (not html) formatif (! $ARGV[0]){ print "What is the PDB file to parse? ";

# Get input and remove the newline character at the end

chomp ($inFile = <STDIN>);}else{ $inFile = $ARGV[0];}

my $structio = Bio::Structure::IO->new(-file => $inFile);

my $struc = $structio->next_structure; for my $chain ($struc->get_chains) { my $chainid = $chain->id; # one-letter chaincode if present,

'default' otherwise for my $res ($struc-

>get_residues($chain)) { my $resid = $res->id; # format is 3-lettercode - dash -

residue number, e.g. PHE-20 my $atoms = $struc->get_atoms($res); # actually a list of atom objects,

used here to get a count print join "\t", $chainid,$resid,

$atoms,"\n"; }}

BioPerl and PDB

Page 45: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 45

The Code in Action ./pdb_parse_0.pl pdb43c9.ent >

pdb_parse_0.outA ASP-1 8A VAL-2 7A VAL-3 7A MET-4 8A THR-5 7A GLN-6 9A THR-7 7A PRO-8 7A SER-9 6A SER-10 6A LEU-11 8A ALA-12 5...

BioPerl and PDB

Page 46: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 46

The Code in Action (cont.)

H GLY-106 4H THR-107 7H LEU-108 8H VAL-109 7H THR-110 7H VAL-111 7H SER-112 7default HOH-1 1default HOH-2 1default HOH-3 1default HOH-4 1default HOH-5 1...default HOH-440 1default HOH-449 1default HOH-454 1default HOH-456 1default HOH-463 1default HOH-469 1default HOH-472 1

BioPerl and PDB

Page 47: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 47

BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome)

The Basic Local Alignment Search Tool (BLAST)

Tests a query against a library of known sequences to find similarity

Collection of programs query to database pairs

nucleotide-nucleotide protein-nucleotide protein-protein nucleotid protein

BLAST

Page 48: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 48

BLAST Advantages of a Local Installation Versus Public Server

Using public server Databases used by BLAST such as GenBank

are always up to date

Using local server Proprietary sequences Sequences that are run time and time again

BLAST

Page 49: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 49

String Matching and Homology BLAST is basically a string matching program Biological string matching looks for similarity as an

indication of homology Similarity between the query and the sequences in the

database may be measured by the percent identity, or the number of bases in the query that exactly match a corresponding region of a sequence from the database.

It may be measured by the degree of conservation , which finds matches between equivalent (redundant) codons or between amino acid residues with similar properties that don’t alter the function of a protein

Two sequences are or are not homologous, there’s no degree of homology

Beginning PERL FOR Bioinformatics, Tisdall

BLAST

Page 50: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 50

Homology It is difficult to find a single consistent definition of homology

“In genetics, the term "homolog" is used both to refer to a homologous protein, and to the gene (DNA sequence) encoding it.”

-- Wikipedia

In-depth philosophical discussions can be found in

G. P. Wagner, “The Biological Homology Concept,” Annu. Rev. Ecol. Syst. 1989. 20:51~9.

BLAST

Page 51: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 51

String Matching References Algorithms on String, Trees, and Sequences:

Computer Science and Computational Biology, Dan Gusfield: Cambridge University Press, 1997.

Defacto Standard

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

These two complement each other nicely

BLAST

Page 52: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 52

Outputs of a BLAST Search - I

Standard outputs A set of scores and statistics on the matches based

on the raw score S, various parameters of the scoring algorithm, and properties of the query and database

The raw score S A measure of similarity and the size of the match

BLAST lists hits by their E value The E value (expect) value of a match estimates the

chance that the string match (allowing for gaps) occurs in a randomly generate database of the same size and composition.

BLAST

Page 53: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 53

Outputs of a BLAST Search - II

The closer the E value is to 0 the less likely that the match occurred by chance The better the match

For BLASTN E value less than 1 may be a solid hit E value less than 10 may be worth looking

at

BLAST

Page 54: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 54

Screen Capture of Example BLAST File Using First 8 Lines (400 Bases) of the File sample.dna From Chapter 8 of Tisdall as a Query Against the nt BLAST Database

BLAST

Some additional tutorial information can be found athttp://www.ncbi.nlm.nih.gov/Class/minicourses/

Page 55: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 55

A Closer Look at This File

BLAST

Page 56: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 56

Output from Clicking on the First Line

BLAST

Page 57: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 57

A Little Further Down in the File

BLAST

Page 58: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 58

Another Example BLAST File (Text Version) - I

BLASTN 2.2.2 [Jan-08-2002]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),

"Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|16972687|gb|BM083928.1|BM083928imageqc_2_2001/smj416bdff41.x1 Soares_NFL_T_GBC_S1 Homo

sapiens cDNAclone IMAGE:4597167 3', mRNA sequence (505 letters)

Database: Non-redundant nucleotide - Mon Feb 18 09:50:08 EST 2002

1,171,752 sequences; 566,364,212 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

ref|NM_017857.1| Homo sapiens slingshot 3 (SSH-3), mRNA 957 0.0

ref|XM_040057.1| Homo sapiens slingshot 3 (SSH-3), mRNA 957 0.0

dbj|AK000522.1|AK000522 Homo sapiens cDNA FLJ20515 fis, clone KA... 957 0.0

gb|BC004210.1|BC004210 Homo sapiens, Similar to hypothetical pro... 954 0.0

gb|BC004176.1|BC004176 Homo sapiens, Similar to hypothetical pro... 954 0.0

gb|AF085851.1|HUMYI53C10 Homo sapiens full length insert cDNA cl... 944 0.0

ref|NM_018276.1| Homo sapiens slingshot 3 (SSH-3), mRNA 615 e-174

dbj|AK001790.1|AK001790 Homo sapiens cDNA FLJ10928 fis, clone OV... 615 e-174

emb|AL353811.12|AL353811 Human DNA sequence from clone RP11-482I... 46 0.004

emb|AL035531.17|HS483L3 Human DNA sequence from clone RP3-483L3 ... 44 0.014

>ref|NM_017857.1| Homo sapiens slingshot 3 (SSH-3), mRNA Length = 2064

Score = 957 bits (483), Expect = 0.0 Identities = 492/494 (99%), Gaps = 1/494 (0%) Strand = Plus / Minus

Query: 4 aataaccaaatatgaaaatgtgttttatttctcagtacaaagccagatactgtaaggcta 63

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct: 2046 aataaccaaatatgaaaatgtgttttatttctcagtacaaagccagatactgtaaggcta 1987

BLAST

Page 59: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 59

Another Example BLAST File (Text Version) - II

…Matrix: blastn matrix:1 -3Gap Penalties: Existence: 5, Extension: 2Number of Hits to DB: 1,627,832Number of Sequences: 1171752Number of extensions: 1627832Number of successful extensions: 453460Number of sequences better than 5.0e-02: 10length of query: 505length of database: 566,364,212effective HSP length: 19effective length of query: 486effective length of database: 544,100,924effective search space: 264433049064effective search space used: 264433049064T: 0A: 40X1: 6 (11.9 bits)X2: 15 (29.7 bits)S1: 12 (24.3 bits)S2: 22 (44.1 bits

BLAST

Page 60: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

Parsing BLAST reports with SearchIO#!/usr/local/bin/perl -w

# Parsing BLAST reports with BioPerl's Bio::SearchIO module# WI Biocomputing course - Bioinformatics for Biologists -

October 2003

# See help at http://www.bioperl.org/wiki/HOWTO:SearchIO for all data that can be extracted

use Bio::SearchIO;

# Prompt the user for the file name if it's not an argument# NOTE: BLAST file must be in text (not html) formatif (! $ARGV[0]){ print "What is the BLAST file to parse? ";

# Get input and remove the newline character at the end chomp ($inFile = <STDIN>);}else{ $inFile = $ARGV[0];}

$report = new Bio::SearchIO( -file=>"$inFile", -format => "blast");

print "QueryAcc\tHitDesc\tHitSignif\tHSP_rank\t\%ID\teValue\tHSP_length\n";

# Go through BLAST reports one by one

while($result = $report->next_result) { # Go through each each matching sequence while($hit = $result->next_hit) { # Go through each each HSP for this

sequence while ($hsp = $hit->next_hsp) { # Print some tab-delimited data

about this HSP print $result->query_accession, "\

t"; print $hit->description, "\t"; print $hit->significance, "\t"; print $hsp->rank, "\t"; print $hsp->percent_identity, "\t"; print $hsp->evalue, "\t"; print $hsp->hsp_length, "\n"; } } }

BioPerl and BLAST

60

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

Page 61: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 61

Output from blast_parse_0.plQueryAcc HitDesc HitSignif HSP_rank %ID eValue HSP_lengthBM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 0.0 1 99.5951417004049

0.0 494BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 0.0 1 99.5951417004049

0.0 494BM083928.1 Homo sapiens cDNA FLJ20515 fis, clone KAT09889 0.0 1 99.5951417004049

0.0 494BM083928.1 Homo sapiens, Similar to hypothetical protein FLJ10928, clone MGC:4436 IMAGE:2958967, mRNA,

complete cds0.0 1 99.5934959349593 0.0 492BM083928.1 Homo sapiens, Similar to hypothetical protein FLJ10928, clone MGC:2772 IMAGE:2958967, mRNA,

complete cds0.0 1 99.5934959349593 0.0 492BM083928.1 Homo sapiens full length insert cDNA clone YI53C10 0.0 1

99.5893223819302 0.0 487BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 1e-174 1 99.685534591195

1e-174 318BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 1e-174 2 100 6e-48

98BM083928.1 Homo sapiens slingshot 3 (SSH-3), mRNA 1e-174 3 98.1132075471698

1e-18 53BM083928.1 Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN

PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16) 1e-174 1 99.685534591195 1e-174318

BM083928.1 Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16) 1e-174 2 100 6e-48 98

BM083928.1 Homo sapiens cDNA FLJ10928 fis, clone OVARC1000473, weakly similar to DUAL SPECIFICITY PROTEIN PHOSPHATASE 3 (EC 3.1.3.48) (EC 3.1.3.16) 1e-174 3 98.1132075471698 1e-1853

BM083928.1 Human DNA sequence from clone RP11-482I10 on chromosome 9 Contains STSs and GSSs, complete sequence [Homo sapiens] 0.004 1 100 0.004 23

BM083928.1 Human DNA sequence from clone RP3-483L3 on chromosome 6p25.1-25.3 Contains STSs and GSSs, complete sequence [Homo sapiens] 0.014 1 100 0.014 22

BioPerl and BLAST

Page 62: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 62

Summary BioPerl offer several modules that make processing sequence

data convenient

Can fetch sequence data and GenBank record directly from GenBank

More info Online documentation:% perldoc Bio::Perl% perldoc Bio::DB::GenBank% perldoc Bio::SeqIO http://www.bioperl.org/wiki http://www.bioperl.org/wiki/HOWTOs

BioPerl

Page 63: BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)1 Topics Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl.

BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST) 63

Homework

Read Chapter 12 Exercise 12.4 Exercise 12.6