Introduction to Bioinformatics
Patricia M. Palagi
Swiss Institute of Bioinformatics (SIB)
PI Group (PIG)
Bioinformatics: definition
The application of computer science to molecular biology
In particular to the study of macromolecules such as proteins, nucleic acids and oligosaccharides (sugars)
Some synonyms for molecular bioinformatics
• Computational biology
• Biocomputing
• Genome computing
• Sequence analysis (restrictive)
Molecular bioinformatics is sometimes confused with...
• «Bio-inspired» computer sciences (artificial life, neural networks, genetic algorithms);
• Biomathematics or biostatistics;
• Modelling of biological systems.
2 components of bioinformatics
• Databases
– Nucleic acid sequence databases (EMBL / GenBank / DDBJ) and protein sequence databases (SWISS-PROT / TrEMBL);
– Databases specialized for genomics (FlyBase, OMIM), mutations, 3D structures (PDB), 2D gels (SWISS-2DPAGE), references (Medline), etc.;
– More than 1,000 are currently available;
– They can generally be accessed from the Web;
– Sizes range from <10 Kb to >10 Gb;
– Frequency of update: from daily (EMBL) to annually.
2 components of bioinformatics
• Tools
– Programs to analyze raw experimental results (from sequencing machines, mass spectrometers, etc.);
– Programs to analyze the intrinsic properties of DNA or protein sequences;
– Sequence comparison and similarity search tools;
– Micro-array analysis software;
– Three-dimensional structure visualization and modelling tools;
– These software tools are either part of commercial packages or are available to all on the WWW.
Some important facts on bioinformatics
• It is a discipline that complements but does not supplant experimental research;
• It can help plan experiments, not replace the experiments;
• It is not cheap;
• Good bioinformatic studies take significant amounts of time;
• Like anywhere else: some garbage in, lots of garbage out!
Bioinformatics and the discovery process in biology
• Discoveries are made through studies of anomalies;
• Computer analysis tends to smooth out the ‘spikes’ of anomalies;
• We need to make sure that we do not throw the baby out with the bathwater.
A common fallacy
• Genome projects are providing massive amounts of data.
• Yes, they are providing lots of sequence data, but they lack information on "proteins" and provide no characterization data;
• The amount of data is relatively small in absolute terms. Compared to images, sequence data does not cause real problems in terms of storage or processing.
Genome sizes (in base pairs)
Viroid                                        300
Small phage (virus infecting a bacterium)     2,000
HIV                                           10,000
Herpes virus                                  150,000
Mycoplasma genitalium (parasitic bacterium)   600,000
Bacteria                                      1 to 13 million
Baker’s yeast                                 13 million
Drosophila (fruit fly)                        180 million
Fugu (fish)                                   360 million
Human                                         3.2 billion
Pine                                          68 billion
Salamander                                    81 billion
Amoeba                                        670 billion
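To make the storage claim concrete, the arithmetic can be sketched in a few lines of Python. The genome sizes are the ones listed above; the 2-bits-per-base packing and the helper names are assumptions made for illustration, not something the slides specify.

```python
# Genome sizes in base pairs, taken from the list above.
GENOME_SIZES_BP = {
    "Mycoplasma genitalium": 600_000,
    "Baker's yeast": 13_000_000,
    "Drosophila": 180_000_000,
    "Human": 3_200_000_000,
    "Amoeba": 670_000_000_000,
}

BITS_PER_BASE = 2  # assumption: A, C, G and T packed into 2 bits each


def packed_size_bytes(n_bases, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store n_bases, rounded up to a whole byte."""
    return (n_bases * bits_per_base + 7) // 8


def human_readable(n_bytes):
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


for name, size in GENOME_SIZES_BP.items():
    print(f"{name:24s}{human_readable(packed_size_bytes(size)):>12s}")
```

Under this packing assumption the human genome fits in about 800 MB, which is small next to typical image or raw instrument data.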
CCCCTGACGACCGATTCAAAAACCACTTTCCTCTTTTACGGCGCCCTAGCGCTATGGCGGTGAAGACTGCTTGACATTAACATGCCTGTTGAGGCTAGAGAATCCATGCGAAGGCGGTTCGGAAACTGCTTCGAAGGCGTGGGGTGGTGCGGGTGGGATTTGAACCCACGCAGGCCTACGCCATCGGGTCCTAAGCCCGACCCCTTTGGCCAGGCTCGGGCACCCCCGCACCGTGTAGTCTTTAGGTTTAGCTTTCAGGGTTAAAACGGTTTAACACTCATGAGTATCACTGGGCTGGCTACTGGGCTCTGCATTCCCGAGGCCATGCTGCCCGTGAGGAATAACGGGTCTGAGGAGCCGTTGACAGGTTGCCATTTGGCCTTGCCCCCAAAAGTGATGCTGTGGATCACGACCTCCTCGGAGGAGGGGAGCCTCAGCATACACTTTATAAAAGGCTTTAAGGGTTTAGCCGGATAATGTTGTTGGGGCGTGCAGCGGCAAGTGCTGCAGCTCATGGGTATGGTATGCGGCTTTGCCTGGTGATGCGGTTTGGCCCCCGTTGTCTGCGACGTCTGCGGTGTTAGGAGGGCTGTGGTGCTGCAGCCACACGGGAAGGCGGCTCTGCAGGGAGTGCTTTAGGGAGGATATAGTGGGGAGGGTCAGGAGGGAGGTTGAGAGGTGGGGGATGATAGGCCCTGGGGAGACGGTCCTCCTAGGCCTGAGCGGCGGTAAGGACAGCTATGTCCTGCTGGACCCTCTCCGAGATAGTCGGGCCCTCGAGGCTGGTGGCGGTGTCTATAGTGGAGGGCATACCGGGGTACAACAGGGAGGGAGATATCGAGAAGATCAGGAGGGTGGCCGCGGCTAGGGGCGTCGACGTGATAGTGACGAGCATAAGGGAGTATGGGGGCCAGCCTCTATGAGATATACTCCAGGGCCCGAGGGAGGGGGGCGGGCCACGCCGCCTGCACCTACTGCGGCATAAGCAGGAGGAGGATACTTGCCCTCTACGCCCGCCTCTACGGCGCCCACAAGGTCGCTACGGCCCACAACCACGACGAGGCGCAGACAGCTATAGTGAACTTCCTCAGGGGGGACTGGGTTGGCATGCTGAAAACACACCCCCTCTACAGGAGCGGGGGCGAGGACCTGGTTCCAAGGATAAAGCCTCTTAGGAAAGTCTACGAGTGGGAGACGGCCAGCTATGGTACTCCACCGCTACCCCATCCAGGAGGCTGAATGCCCCTTCATAAACATGAACCCAACCCTCAGGGCGAGGGTGAGGACGGCCCTGAGGGTGCTAGAGGAGAGGAGCCCGGGCACCCTGCTCAGGATGATGGAGAGGCTCGACGAGGATGAGGCCGCTGGCCCAGGCCATGAAGCCCTCCTCCCTAGGCAGGTGCGAGAGATGCGGGGAGCCGACCAGCCCGAAGAGGAGGCTCTGCAAGCTCTGCGAGCTCCTGGAGGAGGCCGGGTTCCAGGAGCCCATCTACGCGATCGCAGGGAGCAAGAGATTAAGGCTTCAGAGCCCCACCGCTAGCCCTGGGTGAACGCGCTATGGCAAAGCCAAAGGTTAGCCTGCCGGAGGATGTGGAGCCCCCCAAGGCTATAGTCAAGAAGCCTAGGCTAGTGAAGCTAGGCCCCGTAGACCCGGGGAGGAGGGGAAGGGGGTTCAGCCTAGGCGAGCTCGCGGAGGCTGGGCTAGACGCTAAAAAGGCGAGGAAGCTTGGCCTGCACGTGGACACGAGGAGGAGGACGGTCCACCCGTGGAACGTGGAGGCCCTCAAGAAGTATATAGAGAGGCTTAGAGGCGGGCGTAGAGGTCTAGACCCCGGGGCTATATACTACCACTTCGCCCTCCCCATTATACTATCCACATCCACCCTGGCCCTCCCCACCTCCAGGACCTCAATATCCCCCTCAGCCCTGGTGTACACGCTCAAAGACGGCTCCCTGTAAGGCCCTGGTCACCACCCCCACGTGAATCACCCCTCCCGCGT
GTACGGCGGCTATAAGCCCCCTCTCCCAGCCCTCCCGGAGGACGCGGAGCCCGGAGCCTACTCCGACCCTACCGCCCCTCCTCGCCACAACCACTATGTCCCCGTCAATCTCACCATAGAGGGCGGCTGGGTGTAGGGCCTTGAGGGCCTCGTGGGCCAGAGGCTCCCCCCGGAATATCGGCGCGCCAACTATCTCGGCCTCGCCGGGCCTGACCCTCCTCTCCCTCCCTCCCGAGGTCCTAAGGGCTATCAGCCTCTTATGAAGAGCCCTCTCCCCCCGGCTCTTGCCCGCCTCTCCAGCCAGCCTCTCCACAGACAGAGTGTCAAGCCCCCACACCCTCTCGAGCAGCCTGGCCCGTCGGCTGGCTATGCCCACCGCGACTACAAGCCTTGCTCTAGAGGCTATGGGGGCTGCCTTAGACTCGAGCCCCTCCCACAGTGATATCCAGCCATCTGTATCCACTACCACCTGGCTGGCCAGTGAGGCCAATCTAGATGCGCAGGCGAGGTAGCGGGACTCCGACCCCCGGGGGGTGAAGCCGCCGACGAAACACGGCTCACTCGAGAACGAGTCGTCTAGGCCCGGGACGGCCACGCCCTGTGGAGACGCCAGCGCCATAAACCCCGGGGCGAAGACCTCGTTCTGGCCTATATCCGCCGACAGCAGTCTATACCCACCACCGCCCCTGTTAACTATCCAAGCCGCTATGCTCTTACCGGAGTCGCTCGGCCCCACAATAGCCACCCTGCCCCGCTGAGAGGCCTCCCTGGCTATGGAGTCGAACCTGTTGTAAGCCTCCTCCACGCCCCCTGTGGAGACTACACCGGACACAATAGCCCTCCCCTCAACCCTGGCGACCGACCTGCCTGCAGGGACCACTAGAGTAGAGCCCTCCCCCAGCCTTCCACCCAAAACCTCTGCAGCACCCTCTACAACCTCTATCCTCCCCGGGCCGCGGACTAGCGCCGAGCCCCATGCAATCTCCACAGGCAAAGCTTTAAACCCCCAGGTAAGATATGTGAACCGGGCCGCGGTAGTATAGCCTGGACTAGTATGCGGGCCTGTCAAGGGCCCCGCCTCCGCCCCACCCTCATTCTACTACACGCTTATCAGGATAAACAGCCGGGCAAACGTTTTTAACCCCGCCGAAATTCATACTCCCGGGGCGGAGGCGGGCCTGCGGAGAGCCCGTGACCCGGGTTCAAATCCCGGCCGCGGCGCCAATAATCCTCGCGGCCCGCCTTCAAGACTCACTAAACCCCGGTTGAGCACCCGCAGCATCGATGCTAAGGCTCGAGCCATGCATAGCCGCGGGGGGTGGGGGGATTTGGCGAGGCCTGTTGAGGCGGTAAAGAGGCTGCTGGAGAGGTGGCTGGAGGGTAGGAGGAGGGGTTATGTCCTTACGCTTGTAGCTCTTAGAAGGCTTGAGGAGAGGGGGGAGGAGGCTACTGTAGAGAGTAGGGAGGAGGGCCTGAGGATTCTGGAGAGGACGGAGGGGAGGATAGACTGGGGTGTTACTAGGGATGAGTACACTGTCAACATGGTCTCCAGCGTTCTTCGCGAGCTGGCCGAGAGCGGCCTTGTCGAGATGGTGGACGGCGGGAGGAGCGTCAGGTACAGGATAGCGAGGGATGCTGAGGAGGAGTTCCTCTCCAGCTTCGGCCACCTCCTGCAGCTTGTGAGGATGCCGAAGTAGCGTTAAAGCCCTAGGTGCCAGAGGCCGCCGGAGGCTAAGAGGCCGATGAAGGCCTTGAGAGGTGCCGCCAAGCTATCCCTATCCCTGCTGCTCTTTTGGGCTAGCTACTCGATCTACTACACTATAACGAGGCGTGCTGTAGAGGAGGGCCTAGGAGAGGGATCCTACCTCCTGGGCGTCTTGATGTCGGGGGCTGAGGAGGCGCCGCTCGCGTCAATAGTCCTTGGCTACCTGGCGGACAGGCTAGGCTACCGCTTACCCCTGGCCCTGGGCCTGTTTGAGGCTGGGCTGGTCGCTGCAATG
GCCTTCACCCCCCTAGAGACCTACCCCATACTGGCTGGGGCTGCGTCGCTAGTCTACGCCTCATACTCCGCCCTAATGGGCCTCGTCCTGGGTGAGAGCGGGGGGAGCGGCTTCAGGTACAGTGTTATAGCAGCCTTCGGCAGCCTTGGCTGGGCTCTCGGCGGGTTGGCGGGGGGAGCGGCTTACTCCCGCCTGGGGTCACTGGGGCTAGTGGCCGCAGCCCTCATGGCCGCCTCATACCTAGTCGCCCTCTCAGCCTCGCCCCCCCGCGGCGGCGCGGCGCCCAGTGTGGGGGAGACGATAACCGCTCTGAAGGGGGTTCTGCCCCTATTTGCAAGCCTCTCAACCAGCTGGGCGGCGGGCTTCTTCTTCGGGGCTGCCAGCATAAGGCTTAGCGAGGCGCTCGAGAGCCCTATCGCCTACGGGCTAGTGCTGACCACCGTCCCCGCACTCCTAGGCTTCCTGGCGAGGCCTGCGGCGGGCAGGCTGGTCGACAAGGCCGGGGCTGTAGTGCTTGCGTTGTCCAACGCGGCATACTCCCTTCTCGCCCTAGTTTTCGGCCTGCCCACCAGTCCGGCCCTGCTGGCCCTTGCATGGAGCCTGCCCCTATACCCCTTTAGGGATGCCGCCGCGGCCATCGCAGTTAGCAGCAGGCTTGAGAGGCTGCAGGCGACGGCCGCGGGGCTGCTCTCAGCGAGCGAGAGCGTCGGCGGCGCTGCAACCCTTGCCCTGGCACTGCTCCTGGATGGGGGGTTTAGGGAGATGATGACGGCTTCAATAGCCCTTATGCTCCTCTCCACCCTACTCCTCGCAGACCACTCTACGGCTCCACGCCGAGAGCCCTGTCCCCGGCGTCGCCAAGGCCCGGCACTATGAAGTAGTTCTCGTCCAGCTCGGGGTCTAGGGCTAGCGTGTATATGGGGGTGTCGCCGTAGAGGGATGATATGTACTCGACGCCCCTGGACGCTATTATAGAGCCTATAACGACCTTGCTGGCCCCCCTGTCTCTGGCCAGCCTCACGGCCTCCGCCACAGTCTTGCCCGTGGCCAGCATCGGGTCTAGAACGACGGCGGGGCCGTCGAACATGCGGGGTAGCCTGGAGTAGTAGATCTATCTTGAGCCTGCCCGGCTCCTCGACCCTCCTGGCTGCTACGAGGGCTATCCTCGCCTCCGGCATCATCGAGGCGAAACCCTCTACCATGGGGAGGCTAGCCCCGAGTATCCCTACGAGGTAGACGGGCCCCGCTGGCGCCAGCTCCGCCTTAGCCCCCAGGGGGGTCTCCACCTCCTCCTCCACCCACCCGAGCTCGCCCGCAATGTACACCGCCAGTATGGAGCCCGCTATCCTGACGTACCTCCTAAACTCCGGGAACCCGGTTGTCCGGTCCCTGAGAACCTTGAGGACGTAGGCTAGGGGTGTTTCGCCCCCAATAACCCTAACTGCCGCCACCATGGGAACCTCTAGGTAGTGGTTGAGGCTCCGGAGCTTAAGAGGGTTAAACTCCAGGATGGCCACCTGGGTGCCGCCGGGGATTGGACAGTAGGGTTCTAGAGTCCGCGAGAGCCCTATCCCGCTACCCCCTCTGCGACCGCTGCCTCGGCAGGCTCTTCGCTAGGCTTGGGAGAGGCTGGAGCAATAGGGAGCGGGGAGAGGCTGTCAAGAGGGTTCTGGTGATGGAGCTTCACAGGAGGGTCCTCGAGGGGGATGAGGCGTTGAAAACCCTGGTCTCTGCAGCTCCGAACATAGGGGAGGTGGCAAGGGATGTCGTGGAGCACCTCTCCCCAGGTTCCTACAGGGAGGGCGGCCCATGCGCTGTCTGCGGCGGGCGGCTGGAGAGTGTTATAGCCTCAGCGGTGGAGGGGTACAGGCTGCTAAGGGCTTACGATATCGAGAGGTTCGTAGTCGGGGTCCGGCTAGAGAGAGGTGTTGCCATGGCTGAGGAGGAGGTAAAGCTGGCCGCCGGCGCCGGGTACGGCGAGTCCATTAAGGCTGAGATCA
GGAGGGAGGTCAAGCTCCTGGTGAGCCGGGGTGGAGTGACCGTGGACTTCGACAGCCCTGAAGCGACCCTAATGGTGGAGTTCCCCGGGGGCGGGGTTGACATACAGGTCAACAGCCTGCTCTACAAGGCTAGGTACTGGAAGCTTGCCAGGAACATAAGGGCATACTGGCCCACGCCAGAGGGGCCGAGGTACTTCAGCGTGGAGCAGGCTCTATGGCCGGTTCTAAAGCTCACTGGGGGGGAGAGGCTGGTTGTACACGCTGCTGGCAGGGAGGATGTAGACGCCAGGATGCTGGGCAGCGGGAGGCCGATAGTCGAGGTCAAGTCGCCTAGGCGCAGGAGGATCCCGCTTGAGGAGCTGGAGGCGGCCGCCAACGCCGGCGGGAAGGGGCTGGTTAGGTTCAGGTTCGAGACGGCTGCCAAGCGTGCCGAGGTCGCGCTTTACAAGGAGGAGACTGCGGTTAGGAAGGTGTACCGCGCCCTGGTAGCGGTGGAGGGTGGTGTTAGTGAGGTGGATGTTGAAGGGTTGAGGAGGGCTCTCGAGGGCGCGGTTATAATGCAGAGGACGCCCTCCAGGGTCCTCCATAGGAGGCCGGATATACTGAGGAGGAGGCTCTACAGCCTAGACTGCAGCCCCCTGGAGGGGGCGCCTCTGATGGAGTGCATATTGGAGGCGGAAGGGGGTCTCTACATCAAGGAGCTGGTCAGCGGTGATGGCGGGAGAACCAGGCCAAGCTTCGCTGAGGTCCTCGGCAGGGATGTGTGTATAGAGCTCGACGTGGTGTGGGTGGAGCATGAAGCTCCAGCCGCACCCGGCTAAAGCTAAATTAAGCTGGGCTGAGCAAAATACCGGGGGGAGCGTAGGTTGGTCAAGGCACCTAGAGGCTATAGGAACAGGACTAGGAGGCTGAGGAAGCCTGTGAGGGAGAAGGGCAGCATACCCAGGCTCAGCACCTACCTTAGGGAGTACAGGGTGGGCGATAAGGTGGCTATAATCATAAACCCCTCCTTCCCAGACTGGGGCATGCCCCACAGGAGGTTCCACGGGCTGACGGGAACGGTGGGGAAGAGGGGCGAGGCCTACGAGGTAGAGGTCTATCTGGGTAGGAAGAGGAAGACCCTCTTCGTCCCCCCCGTGCACCTCAAACCCCTCAGCACAGCCGCCGAGAGGCGGGGCAGCTAGAGCTGTCCCCACGGTTCCACGCTGGAGGGGGTGCTAGTGTTGGAGAGGAGGATCCTAGAGTATAAGGCGGTGCCCTACCAGGTAGCCAAGAAGTATATGTACGAGAGGGTTAGGGAGGGCGACATAATATCGATACAGGAGTCGACTTGGGAGTACTTCAGGAAGGTAGTGTTCTGCGACCCGGAGGCTGCCTCCGAGCTTGTTGAGGAGATTGTGAAGGAGGGTGTCAGCCGTGAGGCGCGGCGAACATCGCGAGCATATGCCCCAAGACCGAGGGCGAGCTCAGGAGCATTCTCGAGATGGACAGGAGCATAACCTCCGTACACATGGCTAGCAAACTGTACCCCATAGTTTCCAAATACTGCAAGGACTAGACCCCGCCCCCCTTCAGCCCGGGGATTAACAGTTTAATCTCCGCGTCCCAACCATATTTATGTTGATAGCGGCTGTACGGAGAGTGTTGAGAAGTGTCTAGACCCCGCCCCCGCGACAGGAAGCCCCCCCACCAGGGGAGGCCGCAGCCCCACATCGCCGCCCTTGAGGTGGAGGCTATAGTTCTGGACTACATACCCGAGGGCTACCCGAGAGACCCCCACAGGGAGCACCGCAGTAAGCCCGTCGTTCAGCTCGGGGTTAGGAGGCTGCACCTAGTCGACGGTGTCCCCCTCCATGAGGTCGATATACTGGAGCGGGTCACCCTGGCTAGGGAGGTTGTGTATAGCGTCCCCATAGTGGCCCGGCTCCCCGGGGGGGTCGAGAGGAGGGTGAAAAGTGTTAGTCGCGGTAACATGCCTCCCCGGCCAGGCGCGGGAGGGC
GGGGTCAGGGAGATATACTGCTACCCCCTCTCCTACGCCGACCAGGCGACCCTGGAGGCGCTGCAGCAGCTCCTGGGTGAGGGGGACGAGAGGCACAGGTATATACTTGTGTCCCCCGACAAGCTCTCCGAGGTGGCCAGAGGTCACGGCCTCTCGGGGAAGATAGTGAGCACGCCCAGAGACCCTATATCCTACCAGGACCTCACCGACGTCGCCAGGGCTACGCTGCCGGACGCTGTGAGGAAGCTGGTCAGGGAGAGGGACTTCTTCGTGGAGTTCTTCAACGTGGCCGAGCCGATAAACATAAGGATACACGCGCTGGAGGCCCTAAAGGGTGTGGGTAAGAAGATGGCTAGGCACCTCCTCCTCGAGAGGGAGAGGCGTAGGTTCACGAGTTTCGAGGAGGTGAAGA
The human genome is 380,000 times longer than the sequence shown here
Cost: $2.7 billion
The computational challenges in bioinformatics
• Currently they lie in two different aspects of the bioinformatics «galaxy»:
– High-throughput raw data acquisition, tracking and preliminary analysis. Current genomics (DNA sequencing), transcriptomics (micro-array) and proteomics (MS) projects require large amounts of storage space (terabytes), lots of computing power and specialized software systems such as LIMS (Laboratory Information Management Systems).
– Modelling, whether in the field of 3D structure (homology modelling, docking, etc.) or in attempts to model life processes (pathways, cellular development, etc.).
Some bioinformatics research and service centers
• National Center for Biotechnology Information (NCBI) in the USA;
• European Bioinformatics Institute (EBI) in the UK;
• Swiss Institute of Bioinformatics (SIB);
• Australian National Genome Information Service (ANGIS);
• Canadian Bioinformatics Resource (CBR);
• Peking Center of Bioinformatics (CBI);
• Singapore BioInformatics Centre (BIC);
• South-African National Bioinformatics Institute (SANBI).
www.ncbi.nlm.nih.gov
www.ebi.ac.uk
www.isb-sib.ch
ExPASy
A page to navigate through multiple sites in bioinformatics: www.expasy.org/alinks.html
The contents of the SWISS-PROT protein knowledgebase
• Sequences!
• ANNOTATIONS
– Function(s); role(s)
– Post-translational modifications
– Domains
– Subcellular location
– Protein/protein interactions
– Similarities
– Diseases, mutagenesis
– Conflicts and variants
• References
• Taxonomic data
• Keywords
• Cross-references
• Documentation
SWISS-PROT cross-references
• Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, ProDom, SMART
• Nucleotide sequence db: EMBL, GenBank, DDBJ
• 3D/Structural dbs: HSSP, PDB
• Organism-spec. dbs: DictyDb, EcoGene, FlyBase, HIV, Leproma, MaizeDB, Mendel, MGD, MypuList, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, YEPD, Zfin
• Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
• 2D-gel protein dbs: SWISS-2DPAGE, ANU-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE
• Human diseases: MIM
• PTM: CarbBank, GlycoSuiteDB
Some fields of application of bioinformatics
1: Data acquisition
• Examples: DNA sequencing, mass spectrometry (MS), 2D gel image acquisition;
• Software programs tightly linked to the instrumentation hardware;
• The main issues are in the field of signal detection and image analysis;
• There is no biological context at this level.
How to detect spots in 2D gels
How many spots ?
Separation of these 2 spots
2: Preliminary data analysis
• Examples: “base calling” in DNA sequencing, interpretation of mass spectra, detection of spots on 2D gels;
• There is a need for sophisticated algorithms that can extract a maximum of information (optimization of the “signal/noise ratio”);
• These algorithms require some knowledge of the biological context.
DNA sequencing
Program to analyse data from a DNA sequencing machine
Example: pregap4 from Rodger Staden (https://sourceforge.net/projects/staden).
3: Assembly of DNA sequences
• Sequence assembly: the reconstruction of a complete DNA sequence from fragments of 100 to 300 base pairs. These fragments are assumed to overlap;
• The assembled sequence is called a «contig»;
• This step is required for the «shotgun» sequencing method, where all or part of a genome is broken down into small pieces;
• It is not a trivial task because of: (a) sequencing errors; (b) sequence repeats.
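The idea of building a contig from overlapping fragments can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions: the reads are error-free, all come from the same strand, and contain no repeats (exactly the two difficulties named above). The function names and the example reads are made up for illustration; real assemblers are far more elaborate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the two fragments with the longest exact overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: several separate contigs remain
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical reads, each overlapping the next by 5 bases.
reads = ["ATGGCGTGCA", "GTGCAATGCC", "ATGCCGTTAG"]
print(greedy_assemble(reads))
```

With a sequencing error inside an overlap, or a repeat longer than the reads, this exact-match greedy strategy either fails to merge or merges the wrong pair, which is why assembly is not a trivial task.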
4: Coding sequence detection
• How to find genes in genomic sequences;
• A problem whose complexity is directly correlated with the complexity of a genome. It is easy to find genes in bacteria; very difficult in «higher» eukaryotes (human, Drosophila, etc.);
• Various computer methods are used to tackle this problem, using intrinsic (detection of transcription signals, statistical analysis) and extrinsic (similarity with known genes) approaches.
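As a minimal illustration of why bacterial genes are comparatively easy to find, the sketch below scans the three forward reading frames for open reading frames (an ATG followed by an in-frame stop codon). It is a deliberately naive stand-in for the intrinsic methods mentioned above: it ignores the reverse strand, introns and any statistical signal, and the function name and threshold are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs, 0-based and end-exclusive, for forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon (stop included).
    min_codons counts the codons before the stop, ATG included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAACCCGGGTAACC"  # made-up example sequence
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

In eukaryotes a gene is split into exons separated by introns, so a simple scan like this one no longer works and splice-site prediction becomes necessary.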
4: Coding sequence detection
Figure: gene predictions by HMMgene, Netgene2 and Genebuilder on the same sequence (drawn 5’ to 3’), with a summary of results. Predicted positions, with the values given in parentheses on the slide: 1084 (1.00), 1304 (0.77), 1407 (0.89), 1451 (0.90), 1662 (1.00), 1913 (1.00).
Not easy sometimes…
Ex: Chromosome 21
5: DNA sequence analysis
• Analysis of restriction sites (enzymes);
• Detection of regions of low complexity;
• Translation of DNA into protein;
• Detection of sequence repeats such as microsatellites, minisatellites, Alu repeats, Line-1 elements and many others;
• Detection of important non-coding DNA elements such as transcription signals (promoter elements), origins of replication, etc.;
• Detection of tRNA sequences and of other types of RNA (examples: rRNA, uRNA, tmRNA).
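Two of the analyses listed above, restriction site search and translation of DNA into protein, are simple enough to sketch directly. The three enzymes and their recognition sequences are standard textbook examples chosen for illustration (the slides do not name specific enzymes), the codon table is the standard genetic code, and the function names are hypothetical.

```python
from itertools import product

# Recognition sequences of three common restriction enzymes (illustrative choice).
RESTRICTION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_restriction_sites(dna, sites=RESTRICTION_SITES):
    """Map each enzyme name to the 0-based start positions of its recognition sequence."""
    dna = dna.upper()
    hits = {}
    for enzyme, motif in sites.items():
        positions = []
        start = dna.find(motif)
        while start != -1:
            positions.append(start)
            start = dna.find(motif, start + 1)
        hits[enzyme] = positions
    return hits


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with unknown bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(find_restriction_sites("AAGAATTCGGGGATCCTT"))
print(translate("ATGGCCAAGTAA"))
```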
Restriction enzyme (Webcut)
5 enzymes cut 3 times because 4 CDS
6: Similarity searches
• The essential tool in molecular bioinformatics: the comparison of a DNA or protein sequence (“query”) with all or part of the known sequences (“database”);
• No theoretical challenge, but two issues:
– Optimization of computing speed, either using algorithmic shortcuts or specialized hardware;
– Optimization of the use of biological information (how to make these programs “smarter”).
Alignment of 2 sequences. An example
MY-TAIL--ORIS-RICH-
#x ####  x#x# ####
MONTAILLEURESTRICHE
Identities (#), mismatches (x), insertions (-)
BLAST
Statistical measure
BLASTN
BLASTN (nt sequence against ESTs)
Introns
BLASTP
ribosomal protein L24 [Homo sapiens] ribosomal protein L24 [Mus musculus]
7: Protein primary sequence analysis
• Physico-chemical characterization;
• Detection of topogenic regions (i.e. signal sequences, transit peptides) -> sub-cellular localization;
• Detection of transmembrane regions;
• Prediction of functional regions (conserved regions);
• Prediction of post-translational modification (PTM) sites;
• Prediction of antigenicity;
• Search for compositionally biased sequences (i.e. low-complexity sequences, PEST regions, etc.);
• Detection of sequence repeats.
Tools to calculate pI/Mw
Resolving physico-chemical characteristics
Mw, pI and composition
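Molecular weight and amino-acid composition follow directly from the sequence, as the sketch below shows. The average residue masses are commonly tabulated values and are an assumption here (they are not given on the slides); pI is left out because it depends on a chosen set of pK values. The function names are hypothetical.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water).
# Commonly tabulated values, assumed here for illustration.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def molecular_weight(seq):
    """Average molecular weight (Da) of an unmodified protein sequence."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


def composition(seq):
    """Percentage of each amino acid present in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}


peptide = "MKTAYIAKQR"  # made-up example peptide
print(f"Mw = {molecular_weight(peptide):.2f} Da")
print(composition(peptide))
```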
Sub-cellular localisation: PSORT II
Signal sequence: SignalP V1.1
Signal peptide cleavage sites in amino acid sequences
Hydrophobic regions
• Methods based on different scales: numerical values assigned to each of the 20 amino acid types.
Kyte & Doolittle scale:
Ala:  1.8    Leu:  3.8
Arg: -4.5    Lys: -3.9
Asn: -3.5    Met:  1.9
Asp: -3.5    Phe:  2.8
Cys:  2.5    Pro: -1.6
Gln: -3.5    Ser: -0.8
Glu: -3.5    Thr: -0.7
Gly: -0.4    Trp: -0.9
His: -3.2    Tyr: -1.3
Ile:  4.5    Val:  4.2
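A scale-based method boils down to a sliding-window average of per-residue values along the sequence. The sketch below uses the Kyte & Doolittle values listed above, keyed by one-letter amino acid codes; the window size of 9 and the function name are assumptions chosen for the example.

```python
# Kyte & Doolittle hydropathy values from the list above, keyed by one-letter code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9, scale=KYTE_DOOLITTLE):
    """Mean scale value over each window of consecutive residues."""
    values = [scale[aa] for aa in seq.upper()]
    if window < 1 or window > len(values):
        return []
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one residue
        profile.append(total / window)
    return profile


# Made-up sequence: a hydrophobic stretch flanked by charged residues.
seq = "KKDEEKLLIVAVLLIAGVALKKDEER"
for pos, value in enumerate(hydropathy_profile(seq), start=1):
    print(f"{pos:3d} {value:6.2f}")
```

Windows that cover the hydrophobic stretch give strongly positive averages, which is the kind of signal used to flag candidate transmembrane regions.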
ProtScale
• Tool to plot various protein physicochemical parameters along the sequence;
• More than 50 amino-acid scales are available: hydrophobicity/hydrophilicity, secondary structure propensity (alpha helix, beta sheet, turn, etc.), amino-acid composition, number of codons, bulkiness, flexibility, etc.;
• WWW site: www.expasy.org/tools/protparam.html
ProtScale (Kyte & Doolittle)
ProtScale (Chou & Fasman)
Transmembrane regions (TM)
• 13% to 35% of the proteins of genomes are predicted to have one or more TM regions;
• Eukaryotic genomes are richer than microbial genomes in TM-containing proteins;
• All kinds of TM proteins: from 1 to 14 alpha-helical TM regions, different topologies, different target membranes, etc.
Looking for functional regions: ProfileScan
ATPase family
LOGO
ATPase signature
Prediction of post-translational modifications (PTM)
• For the prediction of cleavage sites of signal sequences and transit peptides, see the section on the prediction of topogenic regions;
• To predict some PTMs, a pattern (consensus sequence) can be used. These patterns are found in the PROSITE database;
• Example: potential N-glycosylation sites: N-{P}-[ST]-{P};
• NetOGlyc: neural network for the prediction of mucin-type O-glycosylation sites: www.cbs.dtu.dk/services/NetOGlyc/
• DGPI: prediction of GPI-anchor sites: www.bigfoot.com/~dgpi
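A PROSITE-style pattern such as N-{P}-[ST]-{P} maps naturally onto a regular expression: {P} means "any residue except P" and [ST] means "S or T". The sketch below converts a small subset of the PROSITE syntax and scans a sequence with it; the converter and its function names are illustrative assumptions, not the PROSITE software itself, and a match only indicates a potential site.

```python
import re

# One pattern element: x, a residue, [allowed], {forbidden}, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern (subset of the syntax) to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_pattern(seq, pattern):
    """0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")  # lookahead keeps overlaps
    return [m.start() for m in regex.finditer(seq.upper())]


N_GLYCOSYLATION = "N-{P}-[ST]-{P}"
print(prosite_to_regex(N_GLYCOSYLATION))
print(find_pattern("MNGSANPSANATK", N_GLYCOSYLATION))  # made-up sequence
```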
Phosphorylation site prediction
http://www.cbs.dtu.dk/services/NetPhos/
Sequence   484   ISPTTINTC   0.065   .
Sequence   487   TTINTCGAI   0.029   .
Sequence   499   CFDKTGTLT   0.077   .
Sequence   501   DKTGTLTED   0.845   *T*
Sequence   503   TGTLTEDGL   0.533   *T*
Sulfinator
Sulfation
Glycosylation
8: Multiple sequence alignments
• Alignment of two DNA or protein sequences (binary alignment);
• Alignment of multiple sequences.
CLUSTAL dendrogram
Multiple alignment: a dendrogram
9: RNA folding
• Predicts an optimal secondary structure for an RNA;
• Generally applied to tRNAs and rRNAs, but also to parts of mRNAs;
• Makes use of information on base pairing, local energy minimization and structural constraints.
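The role of base pairing can be illustrated with the Nussinov recurrence, which maximizes the number of nested base pairs by dynamic programming. Note that this is a simpler technique swapped in for illustration: it counts pairs instead of minimizing free energy, so it is not what energy-based folding programs compute. The allowed pairs (Watson-Crick plus G-U wobble), the minimum loop size of 3 and the function name are assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov recurrence).

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]  # best[i][j]: optimum for rna[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # small hairpin: three G-C pairs around an AAA loop
```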
10: Protein secondary and tertiary structure analysis
• Prediction of secondary structure by statistical methods or by neural networks;
• Prediction of the 3D structure directly from the sequence (“ab-initio”). This is still a major challenge!;
• Modeling by homology: prediction of the structure of a new protein similar to one whose structure is already known;
• Simulation of the “docking” of two proteins or between a protein and a small molecule.
Secondary structure prediction
GOR IV
3D structure modelling
Protein sequence
Protein structure
?
11: Phylogenetic analysis
• Reconstruction of the molecular evolution of families of proteins;
• Reconstruction of the evolution of living species; creation of taxonomic trees;
• Reconstruction of the evolution of metabolic pathways.
Reptiles: a paraphyletic group
12: Proteomics tools
• Tools to identify proteins from the results of 2D-gel and mass-spectrometric (MS) experiments;
• They also allow identified proteins to be further characterized by predicting and, in some cases, proving the presence of post-translational modifications;
• This subfield of bioinformatics is also known as “proteomatics” (Appel 1998).