Bioinformatics
![Page 1: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/1.jpg)
Bioinformatics
Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly
2010-2011
Lecture 2: Databases
Aleppo University, Faculty of Technical Engineering, Department of Biotechnology
![Page 2: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/2.jpg)
Main Lines
• Different database types
• Types of data within databases
• The FASTA format
• The EMBL flat-file format
• Gene file format
• Protein databank (PDB) format
• Literature databases
• Create a local database
![Page 3: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/3.jpg)
Different database types
• Primary databases: They store the raw data that come directly from experiments, e.g. GenBank, the European Molecular Biology Laboratory (EMBL) database, the DNA Data Bank of Japan (DDBJ) and the Protein Data Bank (PDB).
• Secondary databases: They contain computationally processed or manually curated information derived from the original data in primary databases, e.g. Swiss-Prot and the Protein Information Resource (PIR).
• Tertiary (specialized) databases: They provide the most sophisticated additional information around the raw data and cater to a particular research interest. E.g. FlyBase, the HIV sequence database and the Ribosomal Database Project specialize in a particular organism or a particular type of data.
![Page 4: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/4.jpg)
Different database types
• For example, if the raw data is a genome sequence, a tertiary database might not only provide the location of genes and the amino-acid sequences of the encoded proteins, but also tell the user in which tissue types the genes are expressed. A tertiary database may combine data from several underlying primary or secondary databases.
![Page 5: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/5.jpg)
Types of data within databases
• DNA sequences
• RNA sequences
• RNA secondary structures
• Genes
• Protein structures
• Expression array data (i.e. which gene is expressed and when)
• Metabolic pathways (i.e. protein interaction networks)
• Haplotypes
• Literature
![Page 6: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/6.jpg)
Primary DNA Databases
• GenBank: National Center for Biotechnology Information (NCBI), USA
  http://www.ncbi.nlm.nih.gov/Genbank
• EMBL: European Bioinformatics Institute, UK
  http://www.ebi.ac.uk/embl
• DDBJ (DNA Data Bank of Japan): National Institute of Genetics, Japan
  http://www.ddbj.nig.ac.jp
![Page 7: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/7.jpg)
The FASTA format
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
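As the record above illustrates, a FASTA entry is a header line starting with ">" followed by one or more lines of sequence. The following is a minimal sketch of a reader for that layout; the function name `parse_fasta` and the two toy records are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; join them without separators.
            records[header] += line
    return records


# Hypothetical two-record input, not taken from any database.
example = [
    ">seq1 hypothetical peptide",
    "MFQAFPGDYD",
    "SGSRCSSSPS",
    ">seq2 hypothetical DNA",
    "ACGTTGCA",
]

for header, sequence in parse_fasta(example).items():
    print(header, len(sequence))
```

A dictionary keyed by header is enough for small files; for large files a generator that yields one record at a time would avoid holding everything in memory.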
![Page 8: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/8.jpg)
The EMBL flat-file format
ID   TRBG361 standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium. -
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers
![Page 9: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/9.jpg)
The EMBL flat-file format
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag 120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga 180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata 240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta 300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc 360
     ~ ~ ~ ~ ~ ~ ~
     tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa 1560
     ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt 1620
     tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg 1680
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc 1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac 1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa 1859
![Page 10: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/10.jpg)
The EMBL flat-file format
• The ID line:
ID   TRBG361 standard; mRNA; PLN; 1859 BP.
The ID line is always the first line of a sequence entry. It gives the name of the entry and the length of the sequence in base pairs.
• The XX line:
XX lines are spacer lines that carry no data; they are inserted to improve readability.
• The AC line:
AC   X56734; S46826;
An AC line contains the ACcession number(s) of the sequence entry.
• The SV line:
SV   X56734.1
An SV line contains the Sequence Version of the sequence entry.
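The ID and AC lines described above are simple enough to split with plain string operations. The sketch below assumes the ID layout shown on the slide (entry name and a second word, then semicolon-separated fields ending in the length); the dictionary keys `data_class`, `molecule` and `division` are labels chosen here for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Union


def parse_id_line(line: str) -> Dict[str, Union[str, int]]:
    """Split an ID line of the form shown on the slide into named fields."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    body = line[2:].strip().rstrip(".")
    first, molecule, division, length = [part.strip() for part in body.split(";")]
    name, data_class = first.split()
    return {
        "name": name,
        "data_class": data_class,
        "molecule": molecule,
        "division": division,
        "length_bp": int(length.split()[0]),  # "1859 BP" -> 1859
    }


def parse_ac_line(line: str) -> List[str]:
    """Return the accession numbers listed on an AC line."""
    if not line.startswith("AC"):
        raise ValueError("not an AC line")
    return [acc.strip() for acc in line[2:].split(";") if acc.strip()]


print(parse_id_line("ID   TRBG361 standard; mRNA; PLN; 1859 BP."))
print(parse_ac_line("AC   X56734; S46826;"))
```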
![Page 11: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/11.jpg)
The EMBL flat-file format
• The DT line:
DT   12-SEP-1991 (Rel. 29, Created)
A DT line contains the DaTe on which the sequence entry was created or last updated.
• The DE line:
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
Each DE line contains a DEscription of the sequence entry. The DE line is free-format.
• The KW line:
KW   beta-glucosidase.
Lines starting with KW contain KeyWords, which are used to generate cross-reference indices of the sequence entries.
• The OS line:
OS   Trifolium repens (white clover)
An OS line specifies the Organism Species from which the sequence was derived.
![Page 12: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/12.jpg)
The EMBL flat-file format
• The OC line:
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
An OC (Organism Classification) line contains the taxonomic classification of the organism from which the sequence was derived.
• The reference (RN, RP, RX, RA, RT, RL) lines:
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   Nucleotide and derived amino acid sequence of the cyanogenic
RL   Plant Mol. Biol. 17(2):209-219(1991).
Each such block of lines describes one reference to the original literature. The lines that are present always appear in the order RN, RC, RP, RX, RG, RA, RT, RL.
• The DR line:
DR   GOA; P26204.
A DR line contains a Database cross-Reference to an entry in another database.
![Page 13: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/13.jpg)
The EMBL flat-file format
• The FH line:
FH   Key             Location/Qualifiers
The FH (Feature Header) lines are present only to improve the readability of a sequence entry.
• The FT lines:
FT   source          1..1859
FT                   /db_xref="taxon:3899"
...
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
...
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
The set of FT (Feature Table) lines provides different types of annotation for the sequence of an entry.
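Feature locations such as `14..1495` are inclusive, 1-based ranges, so the length of a feature is `end - start + 1`. The sketch below handles only this simple `start..end` form; the feature table also allows operators such as `complement(...)` and `join(...)`, which are deliberately not covered here. The function names are hypothetical.

```python
import re
from typing import Tuple


def parse_simple_location(location: str) -> Tuple[int, int]:
    """Parse a 'start..end' location into a (start, end) pair of integers."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2))


def feature_length(location: str) -> int:
    """Number of bases covered by an inclusive 'start..end' location."""
    start, end = parse_simple_location(location)
    return end - start + 1


cds_length = feature_length("14..1495")
print(cds_length)       # 1482 bases
print(cds_length % 3)   # 0, i.e. a whole number of codons
```

For the CDS on the slide this gives 1482 bases, which is 494 codons including the stop codon.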
![Page 14: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/14.jpg)
The EMBL flat-file format
• The SQ line:
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
The SQ (SeQuence header) line comes before the lines with the sequence data and summarizes the sequence: its length and its base composition.
• The sequence data line:
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60
...
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa 1859
Lines containing the actual sequence data start with two blank spaces in place of a line code. The sequence is always given from 5' to 3', in chunks of 10 bases with up to 60 bases per line; the number at the end of each line is the position of the last base on that line.
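To recover the raw sequence from the sequence data lines, the 10-base chunks are joined and the trailing position number is dropped. The following is a minimal sketch under that reading of the layout; the function name `read_sequence` is hypothetical, and the two input lines are the first two sequence lines of the TRBG361 entry shown earlier.

```python
from collections import Counter
from typing import Iterable


def read_sequence(lines: Iterable[str]) -> str:
    """Join the base chunks of sequence data lines, dropping position numbers."""
    chunks = []
    for line in lines:
        parts = line.split()
        # The last token on each line is the running base count, not sequence.
        if parts and parts[-1].isdigit():
            parts = parts[:-1]
        chunks.extend(parts)
    return "".join(chunks)


sequence_lines = [
    "     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60",
    "     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag 120",
]

sequence = read_sequence(sequence_lines)
print(len(sequence))      # 120
print(Counter(sequence))  # base composition of this fragment
```

On a complete entry, comparing `len(sequence)` and the `Counter` totals with the numbers on the SQ line is a cheap consistency check.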
![Page 15: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/15.jpg)
Nucleotide abbreviations

| Code | Base description |
|------|------------------|
| G | Guanine |
| A | Adenine |
| T | Thymine |
| C | Cytosine |
| R | Purine (A or G) |
| Y | Pyrimidine (C or T or U) |
| M | Amino (A or C) |
| K | Ketone (G or T) |
| S | Strong interaction (C or G) |
| W | Weak interaction (A or T) |
| H | Not-G (A or C or T); H follows G in the alphabet |
| B | Not-A (C or G or T); B follows A |
| V | Not-T (Not-U) (A or C or G); V follows U |
| D | Not-C (A or G or T); D follows C |
| N | Any (A or C or G or T) |
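The codes above can be written as a lookup table from each code to the set of bases it stands for. The sketch below is limited to DNA (U is left out), and the helper names `matches` and `expand` are hypothetical.

```python
from itertools import product
from typing import List

# Each ambiguity code mapped to the DNA bases it may represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def matches(pattern: str, sequence: str) -> bool:
    """True if an unambiguous sequence is compatible with an ambiguous pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


def expand(pattern: str) -> List[str]:
    """List every unambiguous sequence that an ambiguous pattern can stand for."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in pattern.upper()))]


print(matches("GRN", "GAT"))  # True: R allows A, N allows T
print(matches("GRN", "GCT"))  # False: R does not allow C
print(expand("AY"))           # ['AC', 'AT']
```

Note that `expand` grows exponentially with the number of ambiguous positions, so it is only practical for short patterns such as primers or motifs.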
![Page 16: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/16.jpg)
Gene file format
http://www.ensembl.org/info/data/ftp/index.html
![Page 17: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/17.jpg)
Principles of genome annotation
• Once the sequencing has been completed, the genome sequence is deposited in a database.
• One of the first annotation tasks is usually to find all protein-coding genes within a newly sequenced genome.
• Another task is to locate RNA-encoding genes (as opposed to protein-coding genes, RNA-encoding genes are only transcribed, not translated).
• Annotation is often carried out by several groups. As these groups are not necessarily in the same place, they need to exchange their annotation in a common format that is understandable to all of them.
![Page 18: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/18.jpg)
![Page 19: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/19.jpg)
The gene transfer format (GTF)

| GTF field | Meaning of the field |
|-----------|----------------------|
| name | the name of the sequence on which the feature lies, string (e.g. "Chr22") |
| source | the source of the annotation, string (e.g. "EMBL", i.e. the name of the group or institute that supplied the annotation) |
| feature | either "Exon", "CDS", "Start_Codon" or "Stop_Codon" |
| start | start position of the feature, integer (e.g. 12340, if the feature starts at position 12340) |
| end | end position of the feature, integer (e.g. 12355, if the feature ends at position 12355) |
| score | degree of confidence in the feature's existence and coordinates, rational number (e.g. 6.4231); use "." if no score is given. It is rarely used. |
| strand | either "+" or "-", indicating the DNA strand |
| frame | only used for the features "CDS", "Start_Codon" and "Stop_Codon": either "0", "1" or "2" (use "." for other features) |
| gene_nr | number of the gene, integer (e.g. "42") |
| transcript_nr | number of the transcript of that gene, integer (e.g. "421") |
| exon_nr | number of the exon on which this feature falls (the first "Exon" gets number "1") |
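The fields listed above map naturally onto a small record type. The sketch below assumes lines in the shape used by this lecture's examples: eight whitespace-separated columns followed by `key value;` attribute pairs. Real GTF files are tab-delimited; splitting on any whitespace is a simplification made here so that space-separated slide text also parses. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GtfFeature:
    name: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the column holds "."
    strand: str
    frame: Optional[int]     # None when the column is not 0, 1 or 2
    attributes: Dict[str, str]

    def length(self) -> int:
        """Number of bases covered; coordinates are inclusive."""
        return self.end - self.start + 1


def parse_gtf_line(line: str) -> GtfFeature:
    """Parse one GTF line into a GtfFeature."""
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 8 columns plus an attribute column")
    name, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes: Dict[str, str] = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if item:
            key, value = item.split(None, 1)
            attributes[key] = value.strip('"')
    return GtfFeature(
        name, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand,
        int(frame) if frame in {"0", "1", "2"} else None,
        attributes,
    )


cds = parse_gtf_line(
    "Chr1 src CDS 380 401 . + 0 gene_id 1; transcript_id 1; exon_number 2"
)
print(cds.feature, cds.length(), cds.attributes["exon_number"])
```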
![Page 20: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/20.jpg)
Examples of the GTF format
The next figure symbolically shows a protein-coding gene consisting of three exons that lies on the reverse strand:
Chr22 src Exon 649 700 . - . gene_id 1; transcript_id 1; exon_number 1
Chr22 src CDS 649 700 . - ? gene_id 1; transcript_id 1; exon_number 1
Chr22 src Exon 351 500 . - . gene_id 1; transcript_id 1; exon_number 2
Chr22 src CDS 351 500 . - ? gene_id 1; transcript_id 1; exon_number 2
Chr22 src Exon 150 250 . - . gene_id 1; transcript_id 1; exon_number 3
Chr22 src CDS 153 250 . - ? gene_id 1; transcript_id 1; exon_number 3
Chr22 src Start_Codon 698 700 . - 0 gene_id 1; transcript_id 1; exon_number 1
Chr22 src Stop_Codon 150 152 . - 0 gene_id 1; transcript_id 1; exon_number 3
![Page 21: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/21.jpg)
Examples of the GTF format
The following figure symbolically shows a protein-coding gene consisting of five exons. A valid description of this gene in GTF format would be:
Chr1 src Exon 150 200 . + . gene_id 1; transcript_id 1; exon_number 1
Chr1 src Exon 300 401 . + . gene_id 1; transcript_id 1; exon_number 2
Chr1 src CDS 380 401 . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src Exon 501 650 . + . gene_id 1; transcript_id 1; exon_number 3
Chr1 src CDS 501 650 . + 2 gene_id 1; transcript_id 1; exon_number 3
Chr1 src Exon 700 800 . + . gene_id 1; transcript_id 1; exon_number 4
Chr1 src CDS 700 707 . + 2 gene_id 1; transcript_id 1; exon_number 4
Chr1 src Exon 900 1000 . + . gene_id 1; transcript_id 1; exon_number 5
Chr1 src Start_Codon 380 382 . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src Stop_Codon 708 710 . + 0 gene_id 1; transcript_id 1; exon_number 4
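The frame values of the CDS lines in the forward-strand example can be checked with a short calculation. The sketch below assumes the usual GTF convention that the frame is the number of bases to skip at the start of a CDS segment to reach the first complete codon; the function name `cds_frames` is hypothetical.

```python
from typing import List, Tuple


def cds_frames(cds_segments: List[Tuple[int, int]], strand: str) -> List[int]:
    """Frame of each CDS segment, listed in the order of transcription.

    Segments are (start, end) pairs with inclusive coordinates. On the
    reverse strand the segment with the highest coordinates comes first.
    """
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    frames = []
    consumed = 0  # coding bases seen in earlier segments
    for start, end in ordered:
        # Bases still needed to finish the codon left open by earlier segments.
        frames.append((3 - consumed % 3) % 3)
        consumed += end - start + 1
    return frames


forward_cds = [(380, 401), (501, 650), (700, 707)]
print(cds_frames(forward_cds, "+"))                # [0, 2, 2]
print(sum(e - s + 1 for s, e in forward_cds) % 3)  # 0: whole codons only
```

The result `[0, 2, 2]` agrees with the frame column of the three CDS lines in the five-exon example. The same function can be applied to a reverse-strand gene by passing `"-"` as the strand.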
![Page 22: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/22.jpg)
Protein databank (PDB) format
![Page 23: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/23.jpg)
GrainGenes
• A Genomic Database for Triticeae and Avena
– Genetic maps
– Genes
– Alleles
– Genetic markers
– Phenotypic data
– Quantitative trait loci studies
– Experimental protocols
– Publications
![Page 24: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/24.jpg)
Literature databases
1. PubMed: http://www.ncbi.nlm.nih.gov/pubmed/
2. PubMed Central: http://www.pubmedcentral.gov/
3. HighWire Press: http://highwire.stanford.edu/
4. DRIVER Project: http://www.driver-community.eu/
5. Web of Science®: http://isiknowledge.com
6. arXiv: http://www.arxiv.org
7. CiteSeer: http://citeseer.ist.psu.edu
![Page 25: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/25.jpg)
Create a local database
![Page 26: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/26.jpg)
Create a local database
![Page 27: Bioinformatics](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168f0550346895ddff53d/html5/thumbnails/27.jpg)
Thank You