ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl...

27
Lecture 1: Introduction to Bioinformatics January 30, 2017 ICQB Introduction to Computational & Quantitative Biology (G4120) Spring 2017 Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology & Immunology

Transcript of ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl...

Page 1: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

ICQBIntroduction to Computational & Quantitative Biology (G4120) Spring 2017 Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology & Immunology

Page 2: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Growth of GenBank

Page 3: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

The Genomics Era

Archaea 176Bacteria 3,021Eukarya 59

Source: GOLD Release v.5 May 28, 2014, genomesonline.org and Su, Andrew (2013) Cumulative sequenced genomes, dx.doi.org/10.6084/m9.figshare.723384

Page 4: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

1977 Maxam-Gilbert and Sanger sequencing1980 øX174 (5,386 bp)1981 Human mitochondria (16,569 bp)1981 Poliovirus (7,440 bp) 1990 Human Genome Project1992 The Institute for Genomic Research1994 RK2 (60,099 bp)1995 Haemophilus influenzae (1.8 Mb) 1995 Mycoplasma genitalium (0.58 Mb)1996 Methanococcus jannaschii (1.6 Mb)1996 Saccharomyces cerevisiae (12.1 Mb)1997 Escherichia coli (4.7 Mb)1998 Celera, Inc. 1998 Caenorhabditis elegans (97 Mb)2000 Drosophila melanogaster (180 Mb)2000 Arabidopsis thaliana (115 Mb)2001 Salmonella typhimurium (4.8 Mb)2001 Homo sapiens (2.9 Gb)2002 Mus musculus (2.6 Gb)2003 Nanoarchaeum equitans (0.49 Mb)2004 Legionella pneumophila (3.4 Mb)2005 Pan troglodytes (2.8 Gb)2006 454 Pyrosequencer2007 Illumina Genome Analyzer and SOLiD2007 Illumina HiSeq2010 Ion Torrent2011 Illumina MiSeq and PacBio RS2013 PacBio RS II2014 Oxford Nanopore MinION

History of GenomicsOverview of Published Genomes

Source: David W. Ussery (2004) Genome Update: 161 prokaryotic genomes sequenced, and counting, Microbiology150: 261-263.

Growth of Sequenced Prokaryotic Genomes

Page 5: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Exponential Growth of Biological Data and Computing Power

GenBank“From 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months.”Source: www.ncbi.nlm.nih.gov/genbank/statistics

Moore’s LawOver the history of computing hardware, the number of transistors in a dense integrated circuit doubles approximately every 18 to 24 months.Source: Moore, Gordon E. (1965) Cramming more components onto integrated circuits. Electronics: 114-117 (with subsequent adjustments).

Page 6: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Moore’s Law

Page 7: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Dealing with exponentially increasing biological data…

…requires assistance.

Page 8: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

c. 40,000 B.C.

Lebombo Bone Tally StickA baboon fibula with 29 notches discovered in the Lebombo Mountains of Africa.

Page 9: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

History of Computing40,000 BC Tally systems Africa & Europe

20,000 BC Prime system Africa

2400 BC Abacus Sumer & Babylon

100 BC Antikythera mechanism Greece

1500 Mechanical calculator Leonardo da Vinci

1621 Slide rule William Oughtred

1642 Arithmetic Machine Blaise Pascal

1822 Difference Engine Charles Babbage

1831 Computer program Lady Ada Lovelace

1936 Z1 Computer Konrad Zuse

1936 Turing Machine Alan Turing

1938 Boolean Circuits Claude Shannon

1943 COLOSSUS Alan Turing

1945 von Neumann Machine John von Neumann

1947 Transistor William Shockley, John Bardeen & Walter Brattain

1958 Integrated Circuit Jack Kilby & Robert Noyce

Page 10: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Computational Biology • Data

Sequencers, FACS, scanners, microscopes, etc.

• Analysis Software, scripting, programming, etc.

• StorageDatabases, local, network or cloud storage and backup, etc.

• SharingWeb, Internet, email, portable or cloud storage, etc.

Page 11: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

History of Bioinformatics1869 DNA Johann Friedrich Miescher

1924 Chromosomal DNA Robert Feulgen

1928 Transforming principle Franklin Griffith

1944 DNA transformation Oswald Avery, Maclyn McCarty & Colin MacLeod

1948 Information Theory Claude Shannon

1949 Chargaff ’s Rule Erwin Chargaff

1953 Double helix James Watson & Francis Crick

1955 Protein sequencing Fred Sanger

1961 Codons Sidney Brenner & Francis Crick

1966 Genetic code Marshall Nirenberg, Robert Holley & Har Khorana

1970 Restriction enzyme Hamilton Smith, Johns Hopkins

1970 Needleman-Wunsch S. Needleman & C. Wunsch

1971 MEDLINE NIH/NLM

1977 DNA sequencing Allan Maxam & Walter Gilbert/Frederick Sanger

1977 Staden programs Roger Staden

1981 Smith-Waterman Temple Smith & Michael Waterman

1982 GenBank LANL/EMBL/NCBI

1988 NCBI NIH/NLM1988 FASTA William Pearson & David Lipman

1988 DNA Strider Christian Marck

1990 BLAST Stephen Altschul & David Lipman, NCBI

1994 DNA computer Leonard Adelman

1997 PubMed NCBI

Page 12: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Sequence AnalysisSeqMatrix E. coli promoter output: DNA Location: 3,075Spacer Length: 11Similarity Score: 55.29

CGACATTGCTTGACCC <11> GCGTGTTCAATTCG

Page 13: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Phylogeny

Page 14: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Data Visualization

Page 15: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

MultimediaL27758. Birmingham IncP-a...[gi:508311] Related Sequences, PubMed, Taxonomy

LOCUS BIACOMGEN 60099 bp DNA linear BCT 08-JUL-1994 DEFINITION Birmingham IncP-alpha plasmid (R18, R68, RK2, RP1, RP4) complete genome. ACCESSION L27758 VERSION L27758.1 GI:508311 KEYWORDS complete genome. SOURCE Birmingham IncP-alpha plasmid (plasmid Birmingham IncP-alpha plasmid, kingdom Prokaryotae) DNA. ORGANISM Birmingham IncP-alpha plasmid broad host range plasmids. REFERENCE 1 (bases 1 to 60099) AUTHORS Pansegrau,W., Lanka,E., Barth,P.T., Figurski,D.H., Guiney,D.G., Haas,D., Helinski,D.R., Schwab,H., Stanisich,V.A. and Thomas,C.M. TITLE Complete nucleotide sequence of Birmingham IncP-alpha plasmids: compilation and comparative analysis JOURNAL J. Mol. Biol. 239, 623-663 (1994) MEDLINE 94285211 FEATURES Location/Qualifiers source 1..60099 /organism="Birmingham IncP-alpha plasmid" /plasmid="Birmingham IncP-alpha plasmid" /db_xref="taxon:35419" BASE COUNT 10839 a 18681 c 18448 g 12131 t ORIGIN 1 ttcacccccg aacacgagca cggcacccgc gaccactatg ccaagaatgc ccaaggtaaa 61 aattgccggc cccgccatga agtccgtgaa tgccccgacg gccgaagtga agggcaggcc 121 gccacccagg ccgccgccct cactgcccgg cacctggtcg ctgaatgtcg atgccagcac 181 ctgcggcacg tcaatgcttc cgggcgtcgc gctcgggctg atcgcccatc ccgttactgc 241 cccgatcccg gcaatggcaa ggactgccag cgccgcgatg aggaagcggg tgccccgctt 301 cttcatcttc gcgcctcggg cctcgaggcc gcctacctgg gcgaaaacat cggtgtttgtetc.

Page 16: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

AlgorithmsAn algorithm is simply a series of steps used to solve a problem. One of a computer’s great strengths is its ability to rapidly and accurately repeat recursive steps in an algorithm. This ability is essential to the data processing functions of modern computers. Many algorithms of specific interest to biologists, such as the BLAST algorithm, could not be practically applied without computers.

The first bioinformatics algorithms for comparing sequences to each other attempted to find the best possible (optimal) alignment. With the tremendous growth of sequence data, many modern search algorithms use heuristic (rule of thumb) approaches which may not find the absolutely best solution, but will quickly find a good solution.

Algorithms can improve as the underlying problem is better understood. Early algorithms for searching sequence data for significant matches depended on consensus sequences. It rapidly became clear that biologically significant sequences rarely perfectly matched a consensus, and more sophisticated approaches were adopted, including the use of matrices, Markov chains and hidden Markov models.

Page 17: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Optimal vs. Heuristic AlgorithmsOptimal Algorithm • Finds the optimal solution to a problem.• Often uses an approach called dynamic programming, which solves the problem by breaking it

into smaller subproblems, which are separately solved, then sequentially reassembled to solve the entire problem.

• This approach was first applied to solving biological sequence comparison problems by Saul Needleman and Christian Wunsch in 1970, and can be used to solve either a global comparison problem (Needleman-Wunsch) or a local comparison problem (Smith-Waterman).

Heuristic Algorithm• Solves a problem by using rules of thumb to reach a solution. The solution is not guaranteed to be

an optimal solution, but is generally arrived at far faster than using an optimal solution approach such as dynamic programming.

• In sequence comparison, well known heuristic approaches include the BLAST algorithm, developed by Stephen Altschul in 1990, and the FASTA algorithm, developed by William Pearson in 1988.

• Although a heuristic algorithm such as BLAST may be 100 to 1,000 times faster than an optimal algorithm such as Smith-Waterman, it may miss matches found by the optimal algorithm.

Page 18: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Search Algorithms in Bioinformatics

Global Alignment Search• Needleman-Wunsch algorithm, Needleman & Wunsch, 1970• Finds the best complete alignment of two sequences that maximizes the number of matches and

minimizes the number of gaps.

Local Alignment Search• Smith-Waterman algorithm, Smith & Waterman, 1981• Makes an optimal alignment of the best segment of similarity between two sequences. • Often better for comparing sequences of different lengths, or when looking at a particular region

of interest.

Heuristic Approximations to Smith-Waterman• FASTA, Pearson, 1988• BLAST, Altschul, 1990• Gapped BLAST and PSI-BLAST, Altschul, 1997• BLAT, Kent, 2002• DELTA-BLAST, Boratyn, 2012

Page 19: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

BLAST (Heuristic) vs. Smith-Waterman (Optimal)

Page 20: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Search Algorithm VariablesGaps• When scoring an alignment, penalties are assigned for creating and extending gaps. The longer the

gap, the greater the penalty.• Can vary the gap penalties. This will likely change your results, so take note.

Dayhoff Substitution Matrices • For protein comparison, all of these search algorithms use Dayhoff substitution matrices which

encode likelihoods of an amino acid substitution. • When scoring an alignment, penalties are assigned for what the substitution matrix considers poor

substitutions, the worse the substitution, the greater the penalty.• Can get different results depending on which substitution matrix you use.

Blosum 62Blosum 62 is the default matrix for NCBI BLAST protein comparison. It is optimized for known close homologies.

PAM 250PAM 250 is a matrix which is optimized for known distant homologies.

Page 21: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Blosum 62 (Close) vs. PAM 250 (Distant)

Page 22: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Programming in BioinformaticsComputer programming is simply how we instruct computers to perform tasks for us. These are some of the programming techniques and languages commonly used in bioinformatics:

• Regular expressions (regex)

• Shell scripting (e.g bash), pipelines and redirects (Unix) and macros

• Structured Query Language (SQL)

• Perl practical extraction and reporting programming language

• Python programming language

• BioPython programming libraries

• R statistical computing and graphics language

• MATLAB numerical computing environment and language

• Java general object oriented programming language

Page 23: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

Bioinformatics ResourcesBooksBiological Sequence Analysis by Richard Durbin, et al. BLAST: An Essential Guide to the BASIC Local Alignment Search Tool by Ian Korf, Mark Yandell & Joseph BedellMastering Regular Expressions by Jeffrey E. F. FriedlPractical Computing for Biologists by Stephen Haddock & Casey DunnBioinformatics: Sequence and Genome Analysis, Second Edition by David W. MountDeveloping Bioinformatics Computer Skills by Cynthia Gibas (forthcoming edition)Introduction to Computational Biology: Maps, Sequences and Genomes by Michael S. Waterman (forthcoming edition)

JournalsBioinformatics (formerly CABIOS)Briefings in BioinformaticsJournal of Computational BiologyPLoS Computational BiologyNucleic Acids Research

Websiteshttp://www.ncbi.nlm.nih.gov National Center for Biotechnology Information http://www.sanger.ac.ukWellcome Trust Sanger Institutehttp://www.ebi.ac.uk European Bioinformatics Institute (EMBL-EBI) http://www.iscb.org International Society for Computational Biology http://www.google.com Google

Page 24: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 1: Introduction to BioinformaticsJanuary 30, 2017

National Center for Biotechnology Information (NCBI)NCBI https://www.ncbi.nlm.nih.gov• PubMed (over 26 million citations), PubMed Central (over 4.2 million full text articles from over

6,000 journals), textbooks, other reference material • GenBank, RefSeq, CDD, MMDB and other sequence and structure databases • Over 2 trillion base pairs (2 terabases) from over 512,000 species• Prokaryotic genome data and browsers (nearly 23,000 complete microbial genomes, nearly 7,000 viruses, over 8,000 plasmids, environmental samples and additional sequences)

• Eukaryotic genome data and browsers (nearly 4,000 complete eukaryotic genomes, over 9,000 organelles, numerous genomes in progress)

• BLAST, PSI-BLAST, DELTA-BLAST, Primer-BLAST and VAST search tools

NCBI BLAST and BLAST Tutorialshttps://blast.ncbi.nlm.nih.gov/

https://www.ncbi.nlm.nih.gov/Class/BLAST/blast_course.short.html

https://www.ncbi.nlm.nih.gov/Class/BLAST/blast_course.html

Page 25: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 2: Introduction to ComputingJanuary 30, 2017

Regular ExpressionsA regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.

Wildcards   . A period matches any character except a line break (i.e. a carriage return).    ^ A caret matches the beginning of a line (unless used in a character class).   $ A dollar sign matches the end of line (unless used in a character class).   

Character Classes[...] To match a set of characters, place square brackets around them. [agct] will match an a, g, c or t. [^...] To exclude a set of characters, place a caret after the opening bracket. [^agct] will not match an a, g, c

or t, but will match any other character.- To specify a range of characters, use a hyphen within the backets. [0-9] will match any digit.

Quantifiers* An asterisk matches zero or more occurrences of the specified class or character that precedes it. .* will

match no or any characters.+ A plus sign matches one or more occurrences of the specified class or character that precedes it. [0-9]+ will

match at least one digit.

Escape Character\ A backslash acts as an escape character, allowing you to search for wildcard or special characters, e.g. \. will

actually match a period, and \\ will match a backslash.

Page 26: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 2: Introduction to ComputingJanuary 30, 2017

Regular Expressions and TextWrangler

Grep with TextWranglerTextWrangler is an example of a free Macintosh application (the commercial version is BBEdit) with a built-in grep utility which can be used through its Find function. To use it, simply make sure the Grep option is checked in the Find dialogue. A similar utility for Windows is grepWin.

For example, to strip out all non-DNA characters (such as line numbers or spaces) from a text file containing a DNA sequence (such as r751.dna) in TextWrangler, one can simply enter [^acgt] in Find, make sure Grep is checked and Entire Word is unchecked, leave Replace blank, then select Replace All (by default, it is case insensitive).

To strip all numbering and blank spaces from a text file containing a protein sequence (such as gag.aa), one can first enter \d in Find, then Replace with nothing, then enter a space in Find, and Replace with nothing.

Useful TextWrangler Special Character Matches

\r Line break (carriage return)    \n Unix line break (line feed)    \t Tab    \f Page break (form feed)

\d Any digit [0-9]    \D Any non-digit character (including carriage return)   

\s Any whitespace character (space, tab, carriage return, line feed, form feed)   \S Any non-whitespace character (any character not included by \s)

Page 27: ICQB 2017 Lecture1 - Columbia University€¦ · modern computers. Many algorithms of ... • Perl practical extraction and reporting programming language ... Introduction to Computational

Lecture 2: Introduction to ComputingJanuary 30, 2017

Regular Expression with Subpatterns

SubpatternsA subpattern consists of a simple or complex pattern enclosed in a pair of parentheses. Subpatterns allow you to reorder the data as it is replaced. This is a feature of certain grep implementations such as the one in TextWranger or in the Perl programming language.

& The entire matched pattern (replacement only).(x)The pattern x is remembered (search only).\1, \2, ..., \99 The nth subpattern in the entire search pattern.

If you have: You will get: ccc Find: [acgt$]+ Codon: cccgcg Replace: Codon: & Codon: gcgatg Codon: atgcga Codon: cga

If you have: You will get: 215 A Find: ^(.*) ([^ ]+)$ A: 21562 C Replace: \2: \1 C: 6215 G G: 1544 T T: 44