
Post on 21-May-2020


Introduction to Bioinformatics

Patricia M. Palagi

Swiss Institute of Bioinformatics (SIB)

PI Group (PIG)

Bioinformatics: definition

The application of computer science to molecular biology, in particular to the study of macromolecules such as proteins, nucleic acids and oligosaccharides (sugars)

Some synonyms for molecular bioinformatics

• Computational biology
• Biocomputing
• Genome computing
• Sequence analysis (restrictive)

Molecular bioinformatics is sometimes confused with...

• «Bio-inspired» computer sciences (artificial life, neural networks, genetic algorithms);

• Biomathematics or biostatistics;
• Modeling of biological systems.

2 components of bioinformatics

• Databases
– Nucleic acid sequence databases (EMBL / GenBank / DDBJ) and protein sequence databases (SWISS-PROT / TrEMBL);
– Databases specialized for genomics (FlyBase, OMIM), mutations, 3D structures (PDB), 2D gels (SWISS-2DPAGE), references (Medline), etc.;
– More than 1,000 are currently available;
– They can generally be accessed from the Web;
– Sizes range from <10 Kb to >10 Gb;
– Frequency of update: from daily (EMBL) to annually.

• Tools
– Programs to analyze raw experimental results (from sequencing machines, mass spectrometers, etc.);
– Programs to analyze the intrinsic properties of DNA or protein sequences;
– Sequence comparison and similarity search tools;
– Micro-array analysis software;
– Three-dimensional structure visualization and modeling tools;
– These software tools are either part of commercial packages or are available to all on the WWW.

Some important facts on bioinformatics

• It is a discipline that complements but does not supplant experimental research;

• It can help plan experiments, not replace the experiments;

• It is not cheap;
• Good bioinformatic studies take significant amounts of time;
• Like anywhere else: some garbage in, lots of garbage out!

Bioinformatics and the discovery process in biology

• Discoveries are made through studies of anomalies;

• Computer analysis tends to smooth out the ‘spikes’ of anomalies;

• We need to make sure that we do not throw the baby out with the bathwater.

A common fallacy

• Genome projects are providing massive amounts of data.
• Yes, they are providing lots of sequence data, but little information on "proteins" and no characterization data;
• The amount of data is relatively small in absolute terms. Compared to images, sequence data does not cause real problems in terms of storage or processing.

Genome sizes (in base pairs)

Viroid                                        300
Small phage (virus infecting bacteria)        2,000
HIV virus                                     10,000
Herpes virus                                  150,000
Mycoplasma genitalium (parasitic bacterium)   600,000
Bacteria                                      1 to 13 million
Baker’s yeast                                 13 million
Drosophila (fruit fly)                        180 million
Fugu (fish)                                   360 million
Human                                         3.2 billion
Pine                                          68 billion
Salamander                                    81 billion
Amoeba                                        670 billion
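As a rough check of the claim that sequence data is small in storage terms, the sketch below estimates the space needed for a few of the genomes listed. It assumes a plain 2-bit encoding per base (A, C, G, T) with no compression and no annotation; the function name and the encoding are illustrative assumptions, not part of the original slides.

```python
# Genome sizes in base pairs, copied from the list of genome sizes.
GENOME_SIZES_BP = {
    "HIV virus": 10_000,
    "Baker's yeast": 13_000_000,
    "Human": 3_200_000_000,
    "Amoeba": 670_000_000_000,
}


def storage_bytes(base_pairs, bits_per_base=2):
    """Bytes needed to store a sequence at a fixed number of bits per base."""
    # Four bases fit in 2 bits; round up to a whole number of bytes.
    return (base_pairs * bits_per_base + 7) // 8


for name, size in GENOME_SIZES_BP.items():
    print(f"{name}: {storage_bytes(size) / 1e6:,.1f} MB")
```

Under this assumption a human genome takes about 800 MB, which is modest next to the image data produced by many instruments.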

CCCCTGACGACCGATTCAAAAACCACTTTCCTCTTTTACGGCGCCCTAGCGCTATGGCGGTGAAGACTGCTTGACATTAACATGCCTGTTGAGGCTAGAGAATCCATGCGAAGGCGGTTCGGAAACTGCTTCGAAGGCGTGGGGTGGTGCGGGTGGGATTTGAACCCACGCAGGCCTACGCCATCGGGTCCTAAGCCCGACCCCTTTGGCCAGGCTCGGGCACCCCCGCACCGTGTAGTCTTTAGGTTTAGCTTTCAGGGTTAAAACGGTTTAACACTCATGAGTATCACTGGGCTGGCTACTGGGCTCTGCATTCCCGAGGCCATGCTGCCCGTGAGGAATAACGGGTCTGAGGAGCCGTTGACAGGTTGCCATTTGGCCTTGCCCCCAAAAGTGATGCTGTGGATCACGACCTCCTCGGAGGAGGGGAGCCTCAGCATACACTTTATAAAAGGCTTTAAGGGTTTAGCCGGATAATGTTGTTGGGGCGTGCAGCGGCAAGTGCTGCAGCTCATGGGTATGGTATGCGGCTTTGCCTGGTGATGCGGTTTGGCCCCCGTTGTCTGCGACGTCTGCGGTGTTAGGAGGGCTGTGGTGCTGCAGCCACACGGGAAGGCGGCTCTGCAGGGAGTGCTTTAGGGAGGATATAGTGGGGAGGGTCAGGAGGGAGGTTGAGAGGTGGGGGATGATAGGCCCTGGGGAGACGGTCCTCCTAGGCCTGAGCGGCGGTAAGGACAGCTATGTCCTGCTGGACCCTCTCCGAGATAGTCGGGCCCTCGAGGCTGGTGGCGGTGTCTATAGTGGAGGGCATACCGGGGTACAACAGGGAGGGAGATATCGAGAAGATCAGGAGGGTGGCCGCGGCTAGGGGCGTCGACGTGATAGTGACGAGCATAAGGGAGTATGGGGGCCAGCCTCTATGAGATATACTCCAGGGCCCGAGGGAGGGGGGCGGGCCACGCCGCCTGCACCTACTGCGGCATAAGCAGGAGGAGGATACTTGCCCTCTACGCCCGCCTCTACGGCGCCCACAAGGTCGCTACGGCCCACAACCACGACGAGGCGCAGACAGCTATAGTGAACTTCCTCAGGGGGGACTGGGTTGGCATGCTGAAAACACACCCCCTCTACAGGAGCGGGGGCGAGGACCTGGTTCCAAGGATAAAGCCTCTTAGGAAAGTCTACGAGTGGGAGACGGCCAGCTATGGTACTCCACCGCTACCCCATCCAGGAGGCTGAATGCCCCTTCATAAACATGAACCCAACCCTCAGGGCGAGGGTGAGGACGGCCCTGAGGGTGCTAGAGGAGAGGAGCCCGGGCACCCTGCTCAGGATGATGGAGAGGCTCGACGAGGATGAGGCCGCTGGCCCAGGCCATGAAGCCCTCCTCCCTAGGCAGGTGCGAGAGATGCGGGGAGCCGACCAGCCCGAAGAGGAGGCTCTGCAAGCTCTGCGAGCTCCTGGAGGAGGCCGGGTTCCAGGAGCCCATCTACGCGATCGCAGGGAGCAAGAGATTAAGGCTTCAGAGCCCCACCGCTAGCCCTGGGTGAACGCGCTATGGCAAAGCCAAAGGTTAGCCTGCCGGAGGATGTGGAGCCCCCCAAGGCTATAGTCAAGAAGCCTAGGCTAGTGAAGCTAGGCCCCGTAGACCCGGGGAGGAGGGGAAGGGGGTTCAGCCTAGGCGAGCTCGCGGAGGCTGGGCTAGACGCTAAAAAGGCGAGGAAGCTTGGCCTGCACGTGGACACGAGGAGGAGGACGGTCCACCCGTGGAACGTGGAGGCCCTCAAGAAGTATATAGAGAGGCTTAGAGGCGGGCGTAGAGGTCTAGACCCCGGGGCTATATACTACCACTTCGCCCTCCCCATTATACTATCCACATCCACCCTGGCCCTCCCCACCTCCAGGACCTCAATATCCCCCTCAGCCCTGGTGTACACGCTCAAAGACGGCTCCCTGTAAGGCCCTGGTCACCACCCCCACGTGAATCACCCCTCCCGCGT
GTACGGCGGCTATAAGCCCCCTCTCCCAGCCCTCCCGGAGGACGCGGAGCCCGGAGCCTACTCCGACCCTACCGCCCCTCCTCGCCACAACCACTATGTCCCCGTCAATCTCACCATAGAGGGCGGCTGGGTGTAGGGCCTTGAGGGCCTCGTGGGCCAGAGGCTCCCCCCGGAATATCGGCGCGCCAACTATCTCGGCCTCGCCGGGCCTGACCCTCCTCTCCCTCCCTCCCGAGGTCCTAAGGGCTATCAGCCTCTTATGAAGAGCCCTCTCCCCCCGGCTCTTGCCCGCCTCTCCAGCCAGCCTCTCCACAGACAGAGTGTCAAGCCCCCACACCCTCTCGAGCAGCCTGGCCCGTCGGCTGGCTATGCCCACCGCGACTACAAGCCTTGCTCTAGAGGCTATGGGGGCTGCCTTAGACTCGAGCCCCTCCCACAGTGATATCCAGCCATCTGTATCCACTACCACCTGGCTGGCCAGTGAGGCCAATCTAGATGCGCAGGCGAGGTAGCGGGACTCCGACCCCCGGGGGGTGAAGCCGCCGACGAAACACGGCTCACTCGAGAACGAGTCGTCTAGGCCCGGGACGGCCACGCCCTGTGGAGACGCCAGCGCCATAAACCCCGGGGCGAAGACCTCGTTCTGGCCTATATCCGCCGACAGCAGTCTATACCCACCACCGCCCCTGTTAACTATCCAAGCCGCTATGCTCTTACCGGAGTCGCTCGGCCCCACAATAGCCACCCTGCCCCGCTGAGAGGCCTCCCTGGCTATGGAGTCGAACCTGTTGTAAGCCTCCTCCACGCCCCCTGTGGAGACTACACCGGACACAATAGCCCTCCCCTCAACCCTGGCGACCGACCTGCCTGCAGGGACCACTAGAGTAGAGCCCTCCCCCAGCCTTCCACCCAAAACCTCTGCAGCACCCTCTACAACCTCTATCCTCCCCGGGCCGCGGACTAGCGCCGAGCCCCATGCAATCTCCACAGGCAAAGCTTTAAACCCCCAGGTAAGATATGTGAACCGGGCCGCGGTAGTATAGCCTGGACTAGTATGCGGGCCTGTCAAGGGCCCCGCCTCCGCCCCACCCTCATTCTACTACACGCTTATCAGGATAAACAGCCGGGCAAACGTTTTTAACCCCGCCGAAATTCATACTCCCGGGGCGGAGGCGGGCCTGCGGAGAGCCCGTGACCCGGGTTCAAATCCCGGCCGCGGCGCCAATAATCCTCGCGGCCCGCCTTCAAGACTCACTAAACCCCGGTTGAGCACCCGCAGCATCGATGCTAAGGCTCGAGCCATGCATAGCCGCGGGGGGTGGGGGGATTTGGCGAGGCCTGTTGAGGCGGTAAAGAGGCTGCTGGAGAGGTGGCTGGAGGGTAGGAGGAGGGGTTATGTCCTTACGCTTGTAGCTCTTAGAAGGCTTGAGGAGAGGGGGGAGGAGGCTACTGTAGAGAGTAGGGAGGAGGGCCTGAGGATTCTGGAGAGGACGGAGGGGAGGATAGACTGGGGTGTTACTAGGGATGAGTACACTGTCAACATGGTCTCCAGCGTTCTTCGCGAGCTGGCCGAGAGCGGCCTTGTCGAGATGGTGGACGGCGGGAGGAGCGTCAGGTACAGGATAGCGAGGGATGCTGAGGAGGAGTTCCTCTCCAGCTTCGGCCACCTCCTGCAGCTTGTGAGGATGCCGAAGTAGCGTTAAAGCCCTAGGTGCCAGAGGCCGCCGGAGGCTAAGAGGCCGATGAAGGCCTTGAGAGGTGCCGCCAAGCTATCCCTATCCCTGCTGCTCTTTTGGGCTAGCTACTCGATCTACTACACTATAACGAGGCGTGCTGTAGAGGAGGGCCTAGGAGAGGGATCCTACCTCCTGGGCGTCTTGATGTCGGGGGCTGAGGAGGCGCCGCTCGCGTCAATAGTCCTTGGCTACCTGGCGGACAGGCTAGGCTACCGCTTACCCCTGGCCCTGGGCCTGTTTGAGGCTGGGCTGGTCGCTGCAATG
GCCTTCACCCCCCTAGAGACCTACCCCATACTGGCTGGGGCTGCGTCGCTAGTCTACGCCTCATACTCCGCCCTAATGGGCCTCGTCCTGGGTGAGAGCGGGGGGAGCGGCTTCAGGTACAGTGTTATAGCAGCCTTCGGCAGCCTTGGCTGGGCTCTCGGCGGGTTGGCGGGGGGAGCGGCTTACTCCCGCCTGGGGTCACTGGGGCTAGTGGCCGCAGCCCTCATGGCCGCCTCATACCTAGTCGCCCTCTCAGCCTCGCCCCCCCGCGGCGGCGCGGCGCCCAGTGTGGGGGAGACGATAACCGCTCTGAAGGGGGTTCTGCCCCTATTTGCAAGCCTCTCAACCAGCTGGGCGGCGGGCTTCTTCTTCGGGGCTGCCAGCATAAGGCTTAGCGAGGCGCTCGAGAGCCCTATCGCCTACGGGCTAGTGCTGACCACCGTCCCCGCACTCCTAGGCTTCCTGGCGAGGCCTGCGGCGGGCAGGCTGGTCGACAAGGCCGGGGCTGTAGTGCTTGCGTTGTCCAACGCGGCATACTCCCTTCTCGCCCTAGTTTTCGGCCTGCCCACCAGTCCGGCCCTGCTGGCCCTTGCATGGAGCCTGCCCCTATACCCCTTTAGGGATGCCGCCGCGGCCATCGCAGTTAGCAGCAGGCTTGAGAGGCTGCAGGCGACGGCCGCGGGGCTGCTCTCAGCGAGCGAGAGCGTCGGCGGCGCTGCAACCCTTGCCCTGGCACTGCTCCTGGATGGGGGGTTTAGGGAGATGATGACGGCTTCAATAGCCCTTATGCTCCTCTCCACCCTACTCCTCGCAGACCACTCTACGGCTCCACGCCGAGAGCCCTGTCCCCGGCGTCGCCAAGGCCCGGCACTATGAAGTAGTTCTCGTCCAGCTCGGGGTCTAGGGCTAGCGTGTATATGGGGGTGTCGCCGTAGAGGGATGATATGTACTCGACGCCCCTGGACGCTATTATAGAGCCTATAACGACCTTGCTGGCCCCCCTGTCTCTGGCCAGCCTCACGGCCTCCGCCACAGTCTTGCCCGTGGCCAGCATCGGGTCTAGAACGACGGCGGGGCCGTCGAACATGCGGGGTAGCCTGGAGTAGTAGATCTATCTTGAGCCTGCCCGGCTCCTCGACCCTCCTGGCTGCTACGAGGGCTATCCTCGCCTCCGGCATCATCGAGGCGAAACCCTCTACCATGGGGAGGCTAGCCCCGAGTATCCCTACGAGGTAGACGGGCCCCGCTGGCGCCAGCTCCGCCTTAGCCCCCAGGGGGGTCTCCACCTCCTCCTCCACCCACCCGAGCTCGCCCGCAATGTACACCGCCAGTATGGAGCCCGCTATCCTGACGTACCTCCTAAACTCCGGGAACCCGGTTGTCCGGTCCCTGAGAACCTTGAGGACGTAGGCTAGGGGTGTTTCGCCCCCAATAACCCTAACTGCCGCCACCATGGGAACCTCTAGGTAGTGGTTGAGGCTCCGGAGCTTAAGAGGGTTAAACTCCAGGATGGCCACCTGGGTGCCGCCGGGGATTGGACAGTAGGGTTCTAGAGTCCGCGAGAGCCCTATCCCGCTACCCCCTCTGCGACCGCTGCCTCGGCAGGCTCTTCGCTAGGCTTGGGAGAGGCTGGAGCAATAGGGAGCGGGGAGAGGCTGTCAAGAGGGTTCTGGTGATGGAGCTTCACAGGAGGGTCCTCGAGGGGGATGAGGCGTTGAAAACCCTGGTCTCTGCAGCTCCGAACATAGGGGAGGTGGCAAGGGATGTCGTGGAGCACCTCTCCCCAGGTTCCTACAGGGAGGGCGGCCCATGCGCTGTCTGCGGCGGGCGGCTGGAGAGTGTTATAGCCTCAGCGGTGGAGGGGTACAGGCTGCTAAGGGCTTACGATATCGAGAGGTTCGTAGTCGGGGTCCGGCTAGAGAGAGGTGTTGCCATGGCTGAGGAGGAGGTAAAGCTGGCCGCCGGCGCCGGGTACGGCGAGTCCATTAAGGCTGAGATCA
GGAGGGAGGTCAAGCTCCTGGTGAGCCGGGGTGGAGTGACCGTGGACTTCGACAGCCCTGAAGCGACCCTAATGGTGGAGTTCCCCGGGGGCGGGGTTGACATACAGGTCAACAGCCTGCTCTACAAGGCTAGGTACTGGAAGCTTGCCAGGAACATAAGGGCATACTGGCCCACGCCAGAGGGGCCGAGGTACTTCAGCGTGGAGCAGGCTCTATGGCCGGTTCTAAAGCTCACTGGGGGGGAGAGGCTGGTTGTACACGCTGCTGGCAGGGAGGATGTAGACGCCAGGATGCTGGGCAGCGGGAGGCCGATAGTCGAGGTCAAGTCGCCTAGGCGCAGGAGGATCCCGCTTGAGGAGCTGGAGGCGGCCGCCAACGCCGGCGGGAAGGGGCTGGTTAGGTTCAGGTTCGAGACGGCTGCCAAGCGTGCCGAGGTCGCGCTTTACAAGGAGGAGACTGCGGTTAGGAAGGTGTACCGCGCCCTGGTAGCGGTGGAGGGTGGTGTTAGTGAGGTGGATGTTGAAGGGTTGAGGAGGGCTCTCGAGGGCGCGGTTATAATGCAGAGGACGCCCTCCAGGGTCCTCCATAGGAGGCCGGATATACTGAGGAGGAGGCTCTACAGCCTAGACTGCAGCCCCCTGGAGGGGGCGCCTCTGATGGAGTGCATATTGGAGGCGGAAGGGGGTCTCTACATCAAGGAGCTGGTCAGCGGTGATGGCGGGAGAACCAGGCCAAGCTTCGCTGAGGTCCTCGGCAGGGATGTGTGTATAGAGCTCGACGTGGTGTGGGTGGAGCATGAAGCTCCAGCCGCACCCGGCTAAAGCTAAATTAAGCTGGGCTGAGCAAAATACCGGGGGGAGCGTAGGTTGGTCAAGGCACCTAGAGGCTATAGGAACAGGACTAGGAGGCTGAGGAAGCCTGTGAGGGAGAAGGGCAGCATACCCAGGCTCAGCACCTACCTTAGGGAGTACAGGGTGGGCGATAAGGTGGCTATAATCATAAACCCCTCCTTCCCAGACTGGGGCATGCCCCACAGGAGGTTCCACGGGCTGACGGGAACGGTGGGGAAGAGGGGCGAGGCCTACGAGGTAGAGGTCTATCTGGGTAGGAAGAGGAAGACCCTCTTCGTCCCCCCCGTGCACCTCAAACCCCTCAGCACAGCCGCCGAGAGGCGGGGCAGCTAGAGCTGTCCCCACGGTTCCACGCTGGAGGGGGTGCTAGTGTTGGAGAGGAGGATCCTAGAGTATAAGGCGGTGCCCTACCAGGTAGCCAAGAAGTATATGTACGAGAGGGTTAGGGAGGGCGACATAATATCGATACAGGAGTCGACTTGGGAGTACTTCAGGAAGGTAGTGTTCTGCGACCCGGAGGCTGCCTCCGAGCTTGTTGAGGAGATTGTGAAGGAGGGTGTCAGCCGTGAGGCGCGGCGAACATCGCGAGCATATGCCCCAAGACCGAGGGCGAGCTCAGGAGCATTCTCGAGATGGACAGGAGCATAACCTCCGTACACATGGCTAGCAAACTGTACCCCATAGTTTCCAAATACTGCAAGGACTAGACCCCGCCCCCCTTCAGCCCGGGGATTAACAGTTTAATCTCCGCGTCCCAACCATATTTATGTTGATAGCGGCTGTACGGAGAGTGTTGAGAAGTGTCTAGACCCCGCCCCCGCGACAGGAAGCCCCCCCACCAGGGGAGGCCGCAGCCCCACATCGCCGCCCTTGAGGTGGAGGCTATAGTTCTGGACTACATACCCGAGGGCTACCCGAGAGACCCCCACAGGGAGCACCGCAGTAAGCCCGTCGTTCAGCTCGGGGTTAGGAGGCTGCACCTAGTCGACGGTGTCCCCCTCCATGAGGTCGATATACTGGAGCGGGTCACCCTGGCTAGGGAGGTTGTGTATAGCGTCCCCATAGTGGCCCGGCTCCCCGGGGGGGTCGAGAGGAGGGTGAAAAGTGTTAGTCGCGGTAACATGCCTCCCCGGCCAGGCGCGGGAGGGC
GGGGTCAGGGAGATATACTGCTACCCCCTCTCCTACGCCGACCAGGCGACCCTGGAGGCGCTGCAGCAGCTCCTGGGTGAGGGGGACGAGAGGCACAGGTATATACTTGTGTCCCCCGACAAGCTCTCCGAGGTGGCCAGAGGTCACGGCCTCTCGGGGAAGATAGTGAGCACGCCCAGAGACCCTATATCCTACCAGGACCTCACCGACGTCGCCAGGGCTACGCTGCCGGACGCTGTGAGGAAGCTGGTCAGGGAGAGGGACTTCTTCGTGGAGTTCTTCAACGTGGCCGAGCCGATAAACATAAGGATACACGCGCTGGAGGCCCTAAAGGGTGTGGGTAAGAAGATGGCTAGGCACCTCCTCCTCGAGAGGGAGAGGCGTAGGTTCACGAGTTTCGAGGAGGTGAAGA

The human genome is 380,000 times longer than the sequence shown here

Cost: $2.7 billion

The computational challenges in bioinformatics

• Currently they lie in two different aspects of the bioinformatics «galaxy»:
– High-throughput raw data acquisition, tracking and preliminary analysis. Current genomics (DNA sequencing), transcriptomics (micro-array) and proteomics (MS) projects require large amounts of storage space (terabytes), lots of computing power and specialized software systems such as LIMS (Laboratory Information Management Systems);
– Modeling, whether of 3D structures (homology modeling, docking, etc.) or of life processes (pathways, cellular development, etc.).

Some bioinformatics research and service centers

• National Center for Biotechnology Information (NCBI) in the USA;

• European Bioinformatics Institute (EBI) in the UK;
• Swiss Institute of Bioinformatics (SIB);
• Australian National Genome Information Service (ANGIS);
• Canadian Bioinformatics Resource (CBR);
• Peking Center of Bioinformatics (CBI);
• Singapore BioInformatics Centre (BIC);
• South-African National Bioinformatics Institute (SANBI).

www.ncbi.nlm.nih.gov

www.ebi.ac.uk

www.isb-sib.ch

ExPASy

A page to navigate through multiple sites in bioinformatics: www.expasy.org/alinks.html

The contents of the SWISS-PROT protein knowledgebase

• Sequences!
• ANNOTATIONS
• References
• Taxonomic data
• Keywords
• Cross-references
• Documentation

• Function(s); role(s)
• Post-translational modifications
• Domains
• Subcellular location
• Protein/protein interactions
• Similarities
• Diseases, mutagenesis
• Conflicts and variants

SWISS-PROT

– Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, ProDom, SMART
– Nucleotide sequence dbs: EMBL, GenBank, DDBJ
– 3D/Structural dbs: HSSP, PDB
– Organism-spec. dbs: DictyDb, EcoGene, FlyBase, HIV, Leproma, MaizeDB, Mendel, MGD, MypuList, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, YEPD, Zfin
– Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
– 2D-gel protein dbs: SWISS-2DPAGE, ANU-2DPAGE, COMPLUYEAST-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, Siena-2DPAGE
– Human diseases: MIM
– PTM: CarbBank, GlycoSuiteDB

Some fields of application of bioinformatics

1: Data acquisition
• Examples: DNA sequencing, mass spectrometry (MS), 2D gels image acquisition;
• Software programs tightly linked to the instrumentation hardware;
• The main issues are in the field of signal detection and image analysis;
• There is no biological context at this level.

How to detect spots in 2D gels

How many spots ?

Separation of these 2 spots

2: Preliminary data analysis

• Examples: “base calling” in DNA sequencing, interpretation of mass spectra, detection of spots on 2D-gels;
• There is a need for sophisticated algorithms that can extract a maximum of information (optimization of the “signal/noise ratio”);
• These algorithms require some knowledge of the biological context.

DNA sequencing

Program to analyse data from a DNA sequencing machine

Example: pregap4 from Rodger Staden, https://sourceforge.net/projects/staden

3: Assembly of DNA sequences

• Sequence assembly: the reconstruction of a complete DNA sequence from fragments of 100 to 300 base pairs. These fragments are supposed to overlap;
• The assembled sequence is called a «contig»;
• This step is required for the «shotgun» sequencing method, where all or part of a genome is broken down into small pieces;
• It is not a trivial task because of: (a) sequencing errors; (b) sequence repeats.
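The idea of building a contig from overlapping fragments can be sketched with a toy greedy assembler. This is only an illustration under strong assumptions: reads are error-free, all come from the same strand, and overlaps are exact. Real assemblers have to cope with the sequencing errors and repeats mentioned in this section; the function names and the short example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the two fragments that share the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: several separate contigs remain
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


reads = ["ATGGCGTGCA", "GTGCATTACC", "TTACCGGATA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With repeated regions, several merges can look equally good, which is one reason the greedy choice can produce a wrong contig.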

4: Coding sequence detection

• How to find genes in genomic sequences;
• A problem whose complexity is directly correlated with the complexity of a genome. It is easy to find genes in bacteria; very difficult in «higher» eukaryotes (human, Drosophila, etc.);
• Various computer methods are used to tackle this problem. They combine intrinsic approaches (detection of transcription signals, statistical analysis) and extrinsic approaches (similarity with known genes).
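The simplest intrinsic approach is to scan for open reading frames (ORFs). The sketch below looks for ATG ... stop stretches in the three forward reading frames only. It is a minimal illustration that fits intron-free bacterial genes at best; it is not how HMMgene, Netgene2 or Genebuilder work, and the function name and the min_codons threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A fuller version would also scan the reverse complement and, for eukaryotes, would need a model of exons and introns.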

4: Coding sequence detection

[Figure: predictions of HMMgene, Netgene2 and Genebuilder for the same sequence (5’ to 3’), followed by a summary of results. Scored positions in the summary: 1084 (1.00), 1304 (0.77), 1407 (0.89), 1451 (0.90), 1662 (1.00), 1913 (1.00).]

Not easy sometimes…

Ex: Chromosome 21

5: DNA sequence analysis

• Analysis of restriction sites (enzymes);
• Detection of regions of low complexity;
• Translation of DNA into protein;
• Detection of sequence repeats such as microsatellites, minisatellites, Alu repeats, Line-1 elements and many others;
• Detection of important non-coding DNA elements such as transcription signals (promoter elements), origins of replication, etc.;
• Detection of tRNA sequences and of other types of RNA (examples: rRNA, uRNA, tmRNA).
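Two of the tasks listed here, locating restriction sites and translating DNA into protein, are simple enough to sketch directly. The example assumes the standard genetic code and uses the EcoRI recognition site GAATTC as the enzyme example; the function names are illustrative and unrelated to Webcut.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_sites(dna, site):
    """0-based start positions of every occurrence of a recognition site."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)  # step by 1 to allow overlaps
    return positions


print(translate("ATGAAATTTGGGTAA"))
print(find_sites("AAGAATTCGGGAATTCTT", "GAATTC"))
```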

Restriction enzyme (Webcut)

[Figure: Webcut restriction map with regions numbered 1 to 4]

5 enzymes cut 3 times because 4 CDS

6: Similarity searches
• The essential tool in molecular bioinformatics: the comparison of a DNA or protein sequence (“query”) with all or part of the known sequences (“database”);
• No theoretical challenge; but two issues:
– Optimization of computing speed, using either algorithmic shortcuts or specialized hardware;
– Optimization of the use of biological information (how to make these programs “smarter”).

Alignment of 2 sequences. An example

    MY-TAIL--ORIS-RICH-
    #x ####  x#x# ####
    MONTAILLEURESTRICHE

Identities (#), mismatches (x), insertions (-)
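An alignment like this one can be produced by dynamic programming. The sketch below is a basic global alignment (Needleman-Wunsch) with a linear gap penalty, printed with the same # / x / - symbols. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions, and this is not the heuristic that BLAST uses; the alignment it returns may differ from the one on the slide because several alignments can share the best score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, top row, symbol row, bottom row)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, mid, bottom = [], [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            mid.append("#" if a[i - 1] == b[j - 1] else "x")
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            mid.append(" ")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            mid.append(" ")
            j -= 1
    return (score[-1][-1], "".join(reversed(top)),
            "".join(reversed(mid)), "".join(reversed(bottom)))


best, top, mid, bottom = needleman_wunsch("MYTAILORISRICH", "MONTAILLEURESTRICHE")
print(best)
print(top)
print(mid)
print(bottom)
```

The table has one cell per pair of positions, so time and memory grow with the product of the two lengths. That cost is why database searches rely on the algorithmic shortcuts mentioned above.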

BLAST

Statistical measure

BLASTN

BLASTN (nt sequence against ESTs)

Introns

BLASTP

ribosomal protein L24 [Homo sapiens] ribosomal protein L24 [Mus musculus]

7: Protein primary sequence analysis
• Physico-chemical characterization;
• Detection of topogenic regions (i.e. signal sequences, transit peptides) -> sub-cellular localization;
• Detection of transmembrane regions;
• Prediction of functional regions (conserved regions);
• Prediction of post-translational modification (PTM) sites;
• Prediction of antigenicity;
• Search for compositionally biased sequences (i.e. low complexity sequences, PEST regions, etc.);
• Detection of sequence repeats.

Tools to calculate pI/Mw

Resolving physico-chemical characteristics

Mw, pI and composition
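Composition and molecular weight follow directly from the sequence. The sketch below assumes the commonly tabulated average residue masses (in daltons) and adds one water molecule for the free termini. It ignores post-translational modifications and does not attempt the pI calculation; the names used are illustrative.

```python
from collections import Counter

# Average masses of amino acid residues in daltons (assumed standard values).
AVERAGE_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01524  # one H2O for the N- and C-terminal groups


def composition(protein):
    """Count how many times each amino acid occurs."""
    return Counter(protein.upper())


def molecular_weight(protein):
    """Average molecular weight of an unmodified protein chain."""
    counts = composition(protein)
    residues = sum(AVERAGE_RESIDUE_MASS[aa] * n for aa, n in counts.items())
    return residues + WATER_MASS


print(dict(composition("MKFG")))
print(round(molecular_weight("MKFG"), 2))
```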


Sub-cellular localisation: PSORT II

Signal sequence: SignalP V1.1

Signal peptide cleavage sites in amino acid sequences

Hydrophobic regions

• Methods based on different scales: numerical values assigned to each of the 20 amino acid types.

Kyte & Doolittle scale:

Ala:  1.8    Leu:  3.8
Arg: -4.5    Lys: -3.9
Asn: -3.5    Met:  1.9
Asp: -3.5    Phe:  2.8
Cys:  2.5    Pro: -1.6
Gln: -3.5    Ser: -0.8
Glu: -3.5    Thr: -0.7
Gly: -0.4    Trp: -0.9
His: -3.2    Tyr: -1.3
Ile:  4.5    Val:  4.2

ProtScale
• Tool to plot various protein physicochemical parameters along the sequence;
• More than 50 amino-acid scales are available: hydrophobicity/hydrophilicity, secondary structure propensity (alpha helix, beta sheet, turn, etc.); amino-acid composition; number of codons; bulkiness; flexibility; etc.;
• WWW site: www.expasy.org/tools/protparam.html
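A scale-based profile amounts to a sliding-window average of per-residue values. The sketch below uses the Kyte & Doolittle values listed in this section with an unweighted window. The window size of 9 and the function name are assumptions, and ProtScale itself offers further options (window weighting, other scales) that are not reproduced here.

```python
# Kyte & Doolittle hydropathy values, one-letter codes.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


profile = hydropathy_profile("MKTLLVAILLAVLAGCSSDEKRR", window=9)
print([round(v, 2) for v in profile])
```

Runs of strongly positive window averages point to hydrophobic stretches, which is the basis of simple transmembrane-region detection.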

ProtScale (Kyte & Doolittle)

ProtScale (Chou & Fasman)

Transmembrane regions (TM)

• 13% to 35% of the proteins of genomes are predicted to have one or more TM regions;

• Eukaryotic genomes are richer than microbial genomes in TM-containing proteins;

• All kinds of TM proteins: from 1 to 14 alpha-helical TM regions, different topologies, different target membranes, etc.

ProfileScan: looking for functional regions

ATPase family

LOGO

ATPase signature

Prediction of post-translational modifications (PTM)

• For the prediction of cleavage sites of signal sequences and transit peptides, see the section on the prediction of topogenic regions;

• To predict some PTM’s a pattern (consensus sequence) can be used. These are found in the PROSITE database;

• Example: potential N-glycosylation sites: N-{P}-[ST]-{P};
• NetOGlyc; neural network for the prediction of mucin-type O-glycosylation sites: www.cbs.dtu.dk/services/NetOGlyc/

• DGPI; prediction of GPI-anchor sites: www.bigfoot.com/~dgpi
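A pattern such as N-{P}-[ST]-{P} maps directly onto a regular expression: {P} means "any residue except P" and [ST] means "S or T". The sketch below converts simple PROSITE-style patterns and reports 1-based match positions. It only handles the elements named in its docstring, and the function names and test sequence are hypothetical. A match is only a potential site, as the slide says.

```python
import re

ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a regex.

    Supported elements: x, single residues, [..], {..}, and (n) or (n,m)
    repetition counts. Anchors (< and >) are not supported.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(protein, pattern):
    """Return (1-based position, matched text) for every match."""
    # A lookahead is used so that overlapping matches are all reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("MNGSAANPTKNVTG", "N-{P}-[ST]-{P}"))
```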

Phosphorylation site prediction

http://www.cbs.dtu.dk/services/NetPhos/

Sequence  484  ISPTTINTC  0.065  .
Sequence  487  TTINTCGAI  0.029  .
Sequence  499  CFDKTGTLT  0.077  .
Sequence  501  DKTGTLTED  0.845  *T*
Sequence  503  TGTLTEDGL  0.533  *T*

Sulfinator

Sulfation

Glycosylation

8: Multiple sequence alignments
• Alignment of two DNA or protein sequences (binary alignment);
• Alignment of multiple sequences.

CLUSTAL dendrogram

Multiple alignment: a dendrogram

9: RNA folding

• Predicts an optimal secondary structure for an RNA;
• Generally applied to tRNAs and rRNAs, but also to parts of mRNAs;
• Makes use of information on base pairing, local energy minimization and structural constraints.
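The flavour of these predictions can be shown with a much simpler stand-in: instead of minimizing free energy, the sketch below maximizes the number of base pairs by dynamic programming (a Nussinov-style recursion). It assumes Watson-Crick and G-U pairs, a minimum hairpin loop of 3 unpaired bases, and no pseudoknots; real folding programs use experimentally derived energy parameters instead.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in the sequence."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```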

10: Protein secondary and tertiary structure analysis

• Prediction of secondary structure by statistical methods or by neural networks;

• Prediction of the 3D structure directly from the sequence (“ab initio”). This is still a major challenge!

• Modeling by homology: prediction of the structure of a new protein similar to one whose structure is already known;

• Simulation of the “docking” of two proteins or between a protein and a small molecule.

Secondary structure prediction

GOR IV

3D structure modelling

Protein sequence

Protein structure

?

11: Phylogenetic analysis
• Reconstruction of the molecular evolution of families of proteins;
• Reconstruction of the evolution of living species; creation of taxonomic trees;
• Reconstruction of the evolution of metabolic pathways.
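Many tree-building methods start from pairwise distances computed on a multiple alignment. As a minimal illustration, the sketch below computes the p-distance (the fraction of compared sites that differ) for every pair of aligned sequences. The sequence names and data are made up, and columns with a gap in either sequence are simply skipped, which is one of several possible conventions.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where either sequence has a gap.
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(aligned):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q])
            for p in names for q in names}


# Hypothetical aligned sequences, for illustration only.
alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACCTAC-A",
}
matrix = distance_matrix(alignment)
print(matrix[("seq1", "seq2")], matrix[("seq1", "seq3")])
```

Such a matrix is the usual input to distance-based tree methods; the tree construction itself is not shown.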

Reptiles: a paraphyletic group

12: Proteomics tools

• Tools to identify proteins from the results of 2D-gel and mass-spectrometric (MS) experiments;

• They also allow identified proteins to be further characterized by predicting and, in some cases, proving the presence of post-translational modifications;

• This subfield of bioinformatics is also known as “proteomatics” (Appel 1998).