Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction...

31
Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition to the NCBI on the east coast One precursor was Margaret Dayhoff’s Atlas of Protein Sequence and Structure In 1987 genbank fit onto a few 360 KB floppy disks. Genbank uses a flat file database format (see http://en.wikipedia.org/wiki/Flat_file_database ) NCBI does not use a relational databank (as in Oracle, peoplesoft) NCBI stores data in ASN.1 format ( http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One ), which allows to hardwire crosslinks to other data bases, and makes retrieval of related information fast. NCBI’s sample record (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) contains links to most the fields used in the gbk flatfile. In the genbank records at NCBI the links connect to the features (i.e. the pubmed record, or the encoded protein Genbank

Transcript of Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction...

Page 1: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

• Founded in 1982 at the Los Alamos National Laboratory• Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups • 1989-92 transition to the NCBI on the east coast • One precursor was Margaret Dayhoff’s Atlas of Protein Sequence and Structure• In 1987 genbank fit onto a few 360 KB floppy disks. • Genbank uses a flat file database format (see http://en.wikipedia.org/wiki/Flat_file_database)• NCBI does not use a relational databank (as in Oracle, peoplesoft)• NCBI stores data in ASN.1 format (http://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One),

which allows to hardwire crosslinks to other data bases, and makes retrieval of related information fast.

• NCBI’s sample record (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) contains links to most the fields used in the gbk flatfile.

• In the genbank records at NCBI the links connect to the features (i.e. the pubmed record, or the encoded protein sequence) --- not easy to work with.

Genbank

Page 2: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Dr. Margaret Belle (Oakley) Dayhoff March 11, 1925 – February 5, 1983

Among other things, we owe her the first nucleotide and protein data bank, the PAM substitution matrix, and the single letter amino acid code. (Image from wikipedia)

Page 3: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Atlas of Protein Sequences 1972 (cont)

The Atlas also contained RNA sequences, and PAM matrix for nucleotides

Page 4: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Atlas of Protein Sequences 1972 (cont) Contained phylogenetic reconstructions that went back in time to far before the Last Unversal Common Ancestor (LUCA) aka the cenancestor of all living cellular organisms alive today.

tRNA phylogeny

Page 5: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

PAM 250 log (odds) matrix

Page 6: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Uses of Blast in bioinformatics

The Blast web tool at NCBI is limited:• custom and multiple databases are not available• “time-out” before long searches are completed

Also: The command-line allows the user to run commands repeatedly

What if researcher wants to use tBlastN to find all olfactory receptors in the mosquito? Or, if you want to check the presence of a (pseudo)gene in a preliminary genome assembly?Answer: Use Blast from command-line

by B

ob F

riedm

an

Page 7: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Types of Blast searching

• blastp compares an amino acid query sequence against a protein sequence database

• blastn compares a nucleotide query sequence against a nucleotide sequence database

• blastx compares the six-frame conceptual protein translation products of a nucleotide query sequence against a protein sequence database

• tblastn compares a protein query sequence against a nucleotide sequence database translated in six reading frames

• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

by B

ob F

riedm

an

Page 8: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Routine BlastP search

FASTA formatted textor Genbank ID#

Protein database

Run

by B

ob F

riedm

an

Page 9: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Kerfeld and Scott, PLoS Biology 201199

E value Threshold

• Alignments will be reported with E-values less than or equal to the expect values threshold– Setting a larger E

threshold will result in more reported hits

– Setting a smaller E threshold will result in fewer reported hits

Page 10: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Restrict by taxonomic group

Filter repetitive regions

Statistical cut-off

Similarity matrix (cost of gaps)

BlastP parameters

by B

ob F

riedm

an

Size of words in look-up table

Page 11: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Kerfeld and Scott, PLoS Biology 20111111

BLAST as an Experiment: Parameters to manipulate in a BLAST search

• Expect• Word size• Matrix• Gap costs• Filter• Mask

Page 12: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Kerfeld and Scott, PLoS Biology 20111212

BLAST Phase 1: Segmenting the query sequence; scoring potential word matches--compile

• BLAST:– Segments the query

sequence into pieces (“words”)

• Default word length: 3 amino acids or 11 nucleic acids

– Creates a list of synonyms and their scores for comparing query words to target words

• Uses scoring matrix to calculate scores for synonyms that might be found in the database

– Saves the scores (and synonyms) exceeding a given threshold T

Page 13: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Kerfeld and Scott, PLoS Biology 20111313

BLAST Phase 2: Scanning the database

• BLAST – Scans the database for matches to the word list with

acceptable T values

– Requires two matches (“hits”) within the target sequence

– Sets aside sequences with matches above T for further analysis

…………..SWITEASFSPPGIM…..SWI PGI

Possible match from the database

Words

Page 14: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Kerfeld and Scott, PLoS Biology 20111414

BLAST Phase 3: Extending the hits

• BLAST– Searches 5’ and 3’ of the word hit on both the query and target

sequence – Adds up the score for sequence identity or similarity until value exceeds

S– Alignment is dropped from subsequent analyses if value never exceeds

S

Page 15: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Kerfeld and Scott, PLoS Biology 20111515

Selecting Scoring Matrices

• Choose a matrix appropriate to the suspected degree of sequence identity between the query and its target sequences

• PAM: empirically derived for close relatives

• BLOSUM: empirically derived for distant relatives

More divergent

BLOSUM45 PAM240

BLOSUM62 PAM180

BLOSUM80 PAM120

BLOSUM90 PAM30

Less divergent

Page 16: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Establishing a significant “hit”

E-value is the Expected number of sequence (HSPs) matches in database of n number of sequences• database size is arbitrary• multiple testing problem• E-value calculated from many assumptions• E-value depends on size of data bank.

Examples:E-value = 1 = expect the match to occur in the database by chance 1x

E-value = .05 = expect 5% chance of match occurring

E-value = 1x10-20 = strict match between protein domains

Blast’s E-value indicates statistical significance of a sequence matchKarlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. PNAS 87:2264-8

by B

ob F

riedm

an

Page 17: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Blast databases• EST - Expression Sequence Tags; cDNA• wgs – whole genome shotgun reads• Reference genome sequences• NR - non-redundant DNA or amino acid sequence database• NT - NR database excluding EST, STS, GSS, HTGS• PDB - DNA or amino acid sequences accompanied by 3d

structures• STS - Sequence Tagged Sites; short genomic markers for

mapping• Swissprot - well-annotated amino-acid sequences

• Also, to obtain organism-specific sequence set:

ftp://ftp.ncbi.nih.gov/genomes/

by B

ob F

riedm

an

Page 18: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

More databases

by B

ob F

riedm

an

Page 19: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

And more databasesby

Bob

Frie

dman

Page 20: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Example of web based BLAST program: BLASTP sequence: vma1 gi:137464

BLink provides similar information

Page 21: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Effect of low complexity filter

BUT the most common sequences are simple repeats

Page 22: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Custom databases

Custom databases can include private sequence data, non-redundant gene sets based on genomic locations, merging of genetic data from specific organisms

It’s also faster to search only the sequence data that is necessary

Can search against sequences with custom names

by B

ob F

riedm

an

Page 23: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Formatting a custom database

Format sequence data into Fasta format

Example of Fasta format:>sequence 1AAATGCTTAAAAA>sequence 2AAATTGCTAAAAGA

Convert Fasta to Blast format by using FormatDB program from command-line:formatdb -p F -o T -i name_of_fasta_file

(formatdb.log is a file where the results are logged from the formatting operation)

by B

ob F

riedm

an

Page 24: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

BlastP search of custom database

by B

ob F

riedm

an

Page 25: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

The favored operating system flavor in computational biology is UNIX/LINUX. The command line is similar to DOS.Some of the frequently used commands are here

Command Line

pwdls ls –lchmodchmod a+x blastall.shchmod 755 *.shcd cd ..cd $HOMEpasswd

ps ps auxrmmorecat vi (text editor)psps auxsshsftp

For windows an “ok” ssh program is putty. UConn also has a site license for the ssh program from ssh.com

Page 26: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

UNIXBasic UNIX commands ls, cd, chmod, cp, rm, mkdir, more (or) less, vi, ps, kill -9, manA brief listing is here

chmod is a particular pain in the ... . Under unix every file has an owner and the owner, his group and everyone else have permissions to read, write and/or execute the file (or they don’t). If you want to see which permissions are currently assigned to your files, type ls -l at the command prompt. chmod a+x *.pl gives everyone execute permission for all files that end with .pl the * is a wildcard. (warning don't ever use rm in conjunction with *) For more on chmod type  ”man chmod“.(In the OSX GUI you can control click at a file, and change permissions in the info box). Most ssh clients (FUGU and SSH) allow you to use a GUI to change file permissions (in FUGU ctrl click).

Page 27: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Unix - command line interfaceIf you tried to execute a command, and you made a mistake, for example, you mistyped a file name, you can recall the last command using the up arrow (down arrow for more recent).

If you are tired typing long filenames, you can use the tab key to complete the line, provided there is only one way to complete the line. E.g: cd /Desktop could be replaced by cd /D<tab>If there are two or more choices you hear a boing, if you hit <tab> again, you get a list of choices.

Page 28: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

characters at the end of linesFile tranfers from Windows to UNIX and return: End of Line characters are a problem. Under Windows DO NOT use notepad, it does not understand UNIX newline symbols ‘\n’. Best write your programs under UNIX using vi or vim (or any other editor you are comfortable with)2nd best is to use a text editor like textwrangler (very nice and free program for UNIX). Like vi and vim it provides context dependent coloring.3rd best is to remove end of line symbols in a UNIX editor or use sed (Stream EDitor) after you transferred the file: sed s/.$// name_of_WINDOWS_infile > name_of_UNIX_outfile (This replaces the last non letter character before the eol ($) with nothing)

Some versions of office allow to change files as UNIX textfiles, but ...

A related problem is encountered by Mac users. Most text editors will use MAC carriage returns at the end of the line. Most unix programs will not be able to handle these. In a terminal window you could use the following command to convert your file:tr ’\r' ’\n' < name_of_the_Mac_file > name_of_the_unix_file If you are working in a GUI environment, you also could use the convertNewLines.app program (install it in your application folder, drag the file you want to convert into the icon). The program is available here.

The EoL confusion is very inconvenient, but there really is no easy solution, tough luck; and you better know about this in case something goes wrong.

Page 29: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Special characters:

\n #newline\t #tab

Page 30: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

To move files between local PC and server:

Page 31: Founded in 1982 at the Los Alamos National Laboratory Initially managed at Stanford in conjunction with the BIOSCI/Bionet news groups 1989-92 transition.

Example: SSH to bbcsrv3.biotech.uconn.edu

A) Move files to bbcxsrv p_abyssi.faa and t_maritima.faa (using ssh or enter afp://bbcxsrv.biotech.uconn.edu in finder -> Go -> connect to server)

(check options for blastall and formatdb)

qloginformatdb -i p_abyssi.faa -o T -p Tblastall -i t_maritima.faa -d p_abyssi.faa -o

blast.out -p blastp -e 10 -m 8 -a2

./extract_lines.pl blast.outPerl script that only retains the first hit and gets rid of comment linessftp results

load into spreadsheetsort data, do histogram …

the extract_lines.pl script is here (you can sftp it into your account, you’ll need to chmod 755 extr*.pl afterwards)