Bioinformatics


Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly

2010-2011

Lecture 2: Databases

Aleppo University
Faculty of Technical Engineering
Department of Biotechnology

Main Lines
• Different database types
• Types of data within databases
• The FASTA format
• The EMBL format
• Gene file format
• Protein databank (PDB) format
• Literature databases
• Create a local database

• Primary databases
They store the raw data that come directly from experiments, e.g. GenBank, the European Molecular Biology Laboratory (EMBL) database, the DNA Data Bank of Japan (DDBJ) and the Protein Data Bank (PDB).

• Secondary databases
They contain computationally processed or manually curated information, based on original information from primary databases, e.g. SWISS-PROT and the Protein Information Resource (PIR).

• Tertiary (specialized) databases
These databases provide the most sophisticated additional information around the raw data. They cater to a particular research interest. E.g. FlyBase, the HIV sequence database and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.

Different database types

• For example, if the raw data is a genome sequence, a tertiary database might not only provide the location of genes and the encoded amino-acid sequences of the corresponding proteins, but also tell the user in which tissue types the genes are expressed. A tertiary database may combine data from several underlying primary or secondary databases.


Types of data within databases
• DNA sequences
• RNA sequences
• RNA secondary structures
• Genes
• Protein structures
• Expression array data (i.e. which gene is expressed and when)
• Metabolic pathways (i.e. protein interaction networks)
• Haplotypes
• Literature

Primary DNA Databases
• GenBank: National Center for Biotechnology Information (NCBI), USA
  http://www.ncbi.nlm.nih.gov/Genbank
• EMBL: European Bioinformatics Institute, UK
  http://www.ebi.ac.uk/embl
• DDBJ (DNA Data Bank of Japan): National Institute of Genetics, Japan
  http://www.ddbj.nig.ac.jp


>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

The FASTA format
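A FASTA record is a header line that starts with ">" followed by one or more lines of sequence. The following is a minimal Python sketch of a reader for that layout; the function name parse_fasta is illustrative and not part of the lecture, and the example uses only the first 120 residues of the FOSB_MOUSE record shown above.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated without the line breaks.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# First two 60-residue lines of the FOSB_MOUSE example (truncated on purpose).
example = """>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
"""

records = list(parse_fasta(example.splitlines()))
for name, seq in records:
    print(name, len(seq))
```

Because the sequence may be wrapped over any number of lines, the reader joins all lines up to the next ">" header rather than assuming a fixed line width.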

ID   TRBG361 standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
-
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers

The EMBL format

FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ~ ~ ~ ~ ~ ~ ~
     tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa      1560
     ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt      1620
     tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg      1680
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859


• The ID line:
ID   TRBG361 standard; mRNA; PLN; 1859 BP.
The ID line is always the first line of a sequence entry. It gives the name of the sequence and its length in base pairs.

• The XX line:
XX lines are spacer lines, inserted to make the entry easier to read.

• The AC line:
AC   X56734; S46826;
An AC line contains the ACcession number(s) of the sequence entry.

• The SV line:
SV   X56734.1
An SV line contains information on the Sequence Version of the sequence entry.


• The DT line:
DT   12-SEP-1991 (Rel. 29, Created)
A DT line contains the DaTe when the sequence entry was created or updated.

• The DE line:
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
Each DE line contains a DEscription of the sequence entry. The DE line is free-format.

• The KW line:
KW   beta-glucosidase.
Lines starting with KW contain keywords, which are used to generate cross-reference indices of the sequence entries.

• The OS line:
OS   Trifolium repens (white clover)
An OS line specifies the Organism Species from which the sequence entry was derived.


• The OC line:
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
An OC (Organism Classification) line contains the taxonomic classification of the organism from which the sequence entry was derived.

• The reference (RN, RP, RX, RA, RT, RL) lines:
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RL   Plant Mol. Biol. 17(2):209-219(1991).
Each such block of lines contains one reference to the original literature. Within each reference, the lines that are present always appear in the order RN, RC, RP, RX, RG, RA, RT, RL.

• The DR line:
DR   GOA; P26204.
A DR line contains a Database cross-Reference to another database.


• The FH line:
FH   Key             Location/Qualifiers
The FH (Feature Header) lines are present only to improve the readability of a sequence entry.

• The FT lines:
FT   source          1..1859
FT                   /db_xref="taxon:3899"
...
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
...
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL

The set of FT (Feature Table) lines provides different types of annotation for the sequence of a sequence entry.


• The SQ line:
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;

The SQ (SeQuence header) line comes before the lines with the sequence data and summarizes information about the sequence.

• The sequence data line:
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
...
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859

Lines containing the actual sequence data have no two-letter code; they start with two blank spaces instead. The sequence, which is always given from 5' to 3', is written in chunks of 10 bases with up to 60 bases per line.
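Since every line of an entry starts with a two-letter code (or blanks for sequence data), an entry can be read by grouping lines on that code. The sketch below is one possible way to do this in Python and is not an official EMBL tool; the function names parse_embl_entry, sequence_from_fields and declared_composition are made up for illustration, and the sample text is a heavily shortened excerpt of the TRBG361 entry.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_embl_entry(text: str) -> Dict[str, List[str]]:
    """Group the lines of one entry by their two-letter line code.

    Sequence data lines start with blanks and are stored under "  ".
    XX spacer lines carry no information and are skipped.
    """
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        code, content = line[:2], line[2:].strip()
        if code == "XX":
            continue
        fields[code].append(content)
    return dict(fields)


def sequence_from_fields(fields: Dict[str, List[str]]) -> str:
    """Join the sequence lines, dropping spaces and the position counters."""
    return "".join(re.sub(r"[\s\d]", "", c) for c in fields.get("  ", []))


def declared_composition(sq_line: str) -> Tuple[int, Dict[str, int]]:
    """Parse 'Sequence 1859 BP; 609 A; ...' into (length, base counts)."""
    length = int(re.search(r"Sequence (\d+) BP", sq_line).group(1))
    counts = {m.group(2): int(m.group(1))
              for m in re.finditer(r"(\d+) (A|C|G|T|other);", sq_line)}
    return length, counts


# Shortened excerpt: only the first sequence line is included here.
sample = """ID   TRBG361 standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
"""

fields = parse_embl_entry(sample)
sequence = sequence_from_fields(fields)
length, counts = declared_composition(fields["SQ"][0])

print(fields["AC"])                     # accession numbers as written
print(len(sequence))                    # bases read from the excerpt
print(length == sum(counts.values()))   # SQ header is self-consistent
```

The last check uses only the SQ header: the base counts it declares (609 + 314 + 355 + 581 + 0) add up to the declared length of 1859 BP. With a complete entry, the same counts could be compared against the parsed sequence.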


Code  Base description              Mnemonic
G     Guanine
A     Adenine
T     Thymine
C     Cytosine
R     Purine (A or G)
Y     Pyrimidine (C or T or U)
M     Amino (A or C)
K     Keto (G or T)
S     Strong interaction (C or G)
W     Weak interaction (A or T)
H     Not-G (A or C or T)           H follows G in the alphabet
B     Not-A (C or G or T)           B follows A
V     Not-T, not-U (A or C or G)    V follows U
D     Not-C (A or G or T)           D follows C
N     Any (A or C or G or T)

Nucleotide abbreviations
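The abbreviation table can be turned directly into a lookup table. The Python sketch below (DNA only, so U is left out) shows how an ambiguous pattern can be matched against a concrete sequence or expanded into all sequences it stands for; the names IUPAC, matches and expand are illustrative choices, not part of any library.

```python
from itertools import product
from typing import List

# Each code maps to the set of concrete bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def matches(pattern: str, sequence: str) -> bool:
    """True if every base of `sequence` is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), sequence.upper()))


def expand(pattern: str) -> List[str]:
    """List every concrete sequence an ambiguous pattern can represent."""
    choices = (IUPAC[code] for code in pattern.upper())
    return ["".join(combo) for combo in product(*choices)]


print(expand("GAY"))          # ['GAC', 'GAT']
print(matches("GAY", "GAT"))  # True
print(matches("GAY", "GAA"))  # False
```

Note that the number of expansions grows quickly: a pattern with k occurrences of N expands to at least 4^k sequences, so expand is only practical for short patterns.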

Gene file format
http://www.ensembl.org/info/data/ftp/index.html

• Once the sequencing has been completed, the genome sequence is deposited in a database

• One of the first annotation tasks is usually to find all protein-coding genes within a newly sequenced genome.

• Another task is to locate RNA-encoding genes (unlike protein-coding genes, RNA-encoding genes are transcribed but not translated).

• Because the groups that annotate a genome are not necessarily in the same place, they need to exchange their annotation in a common format that all of them understand.

Principles of genome annotation

The gene transfer format (GTF)

GTF field      Meaning of the field
name           the name of the sequence on which the feature lies, string (e.g. "Chr22")
source         the source of the annotation, string (e.g. "EMBL", i.e. the name of the group or institute that supplied the annotation)
feature        either "Exon", "CDS", "Start_Codon" or "Stop_Codon"
start          start position of the feature, integer (e.g. 12340, if the feature starts at position 12340)
end            end position of the feature, integer (e.g. 12355, if the feature ends at position 12355)
score          degree of confidence in the feature's existence and coordinates, rational number (e.g. 6.4231); use "." if no score is given. It is rarely used.
strand         either "+" or "-", indicating the DNA strand
frame          only used for features "CDS", "Start_Codon" and "Stop_Codon": either "0", "1" or "2"; "." otherwise
gene_nr        number of that gene, integer (e.g. "42"); given as gene_id in the attribute column
transcript_nr  number of the transcript of that gene, integer (e.g. "421"); given as transcript_id in the attribute column
exon_nr        number of the exon on which this feature falls (the first "Exon" gets number "1"); given as exon_number in the attribute column
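The field list above maps naturally onto a small record type. The following Python sketch parses one GTF line as it is written in these slides; it assumes the columns are separated by whitespace and that the attribute column has the form "key value;" (real GTF files are tab-separated, and the names GtfRecord and parse_gtf_line are hypothetical).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GtfRecord:
    name: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the file has "."
    strand: str
    frame: Optional[int]     # None when the file has "." (or an unknown value)
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one line into the eight fixed fields plus the attribute column."""
    parts = line.strip().split(None, 8)
    if len(parts) != 9:
        raise ValueError("expected 8 fields followed by attributes")
    name, source, feature, start, end, score, strand, frame, attr_text = parts

    attributes: Dict[str, str] = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return GtfRecord(
        name=name,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        frame=int(frame) if frame in ("0", "1", "2") else None,
        attributes=attributes,
    )


record = parse_gtf_line(
    "Chr1 src CDS 380 401 . + 0 gene_id 1; transcript_id 1; exon_number 2"
)
print(record.feature, record.end - record.start + 1, record.attributes)
```

Positions are 1-based and inclusive, so the length of a feature is end - start + 1, not end - start.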

The next figure symbolically shows a protein-coding gene consisting of three exons which falls on the reverse strand:

Chr22  src  Exon         649  700  .  -  .  gene_id 1; transcript_id 1; exon_number 1
Chr22  src  CDS          649  700  .  -  ?  gene_id 1; transcript_id 1; exon_number 1
Chr22  src  Exon         351  500  .  -  .  gene_id 1; transcript_id 1; exon_number 2
Chr22  src  CDS          351  500  .  -  ?  gene_id 1; transcript_id 1; exon_number 2
Chr22  src  Exon         150  250  .  -  .  gene_id 1; transcript_id 1; exon_number 3
Chr22  src  CDS          153  250  .  -  ?  gene_id 1; transcript_id 1; exon_number 3
Chr22  src  Start_Codon  698  700  .  -  0  gene_id 1; transcript_id 1; exon_number 1
Chr22  src  Stop_Codon   150  152  .  -  0  gene_id 1; transcript_id 1; exon_number 3

Examples of the GTF format

The following figure symbolically shows a protein-coding gene consisting of five exons. A valid description of this gene in GTF format would be:

Chr1  src  Exon         150  200   .  +  .  gene_id 1; transcript_id 1; exon_number 1
Chr1  src  Exon         300  401   .  +  .  gene_id 1; transcript_id 1; exon_number 2
Chr1  src  CDS          380  401   .  +  0  gene_id 1; transcript_id 1; exon_number 2
Chr1  src  Exon         501  650   .  +  .  gene_id 1; transcript_id 1; exon_number 3
Chr1  src  CDS          501  650   .  +  2  gene_id 1; transcript_id 1; exon_number 3
Chr1  src  Exon         700  800   .  +  .  gene_id 1; transcript_id 1; exon_number 4
Chr1  src  CDS          700  707   .  +  2  gene_id 1; transcript_id 1; exon_number 4
Chr1  src  Exon         900  1000  .  +  .  gene_id 1; transcript_id 1; exon_number 5
Chr1  src  Start_Codon  380  382   .  +  0  gene_id 1; transcript_id 1; exon_number 2
Chr1  src  Stop_Codon   708  710   .  +  0  gene_id 1; transcript_id 1; exon_number 4

Examples of the GTF format
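The frame column of a CDS line can be derived from the CDS coordinates alone. The sketch below assumes the usual GTF convention that the frame is the number of bases to skip at the start of a CDS segment (in the direction of transcription) before the first complete codon begins; the function name cds_frames is made up for this illustration. Under that assumption it reproduces the frames 0, 2, 2 listed for the five-exon gene on Chr1, and it gives a value for each CDS of the reverse-strand gene on Chr22.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based and inclusive


def cds_frames(segments: List[Segment], strand: str) -> List[Tuple[Segment, int]]:
    """Return each CDS segment with its frame, in order of transcription."""
    # On the reverse strand, transcription runs from high to low coordinates.
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    coding_bases = 0  # coding bases seen before the current segment
    for start, end in ordered:
        # Bases still needed to complete the codon left open so far.
        frame = (3 - coding_bases % 3) % 3
        result.append(((start, end), frame))
        coding_bases += end - start + 1
    return result


# CDS segments of the five-exon gene on the forward strand (Chr1).
forward = cds_frames([(380, 401), (501, 650), (700, 707)], "+")
# CDS segments of the three-exon gene on the reverse strand (Chr22).
reverse = cds_frames([(153, 250), (351, 500), (649, 700)], "-")

for (start, end), frame in forward + reverse:
    print(start, end, frame)
```

For example, the first CDS segment on Chr1 (380 to 401) holds 22 bases, i.e. seven codons plus one base, so the next segment must skip two bases to reach a complete codon, which gives frame 2.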

Protein databank (PDB) format

GrainGenes
• A Genomic Database for Triticeae and Avena
– Genetic maps
– Genes
– Alleles
– Genetic markers
– Phenotypic data
– Quantitative trait loci studies
– Experimental protocols
– Publications

Literature databases
1. PubMed: http://www.ncbi.nlm.nih.gov/pubmed/
2. PubMed Central: http://www.pubmedcentral.gov/
3. HighWire Press: http://highwire.stanford.edu/
4. DRIVER Project: http://www.driver-community.eu/
5. Web of Science®: http://isiknowledge.com
6. arXiv: http://www.arxiv.org
7. CiteSeer: http://citeseer.ist.psu.edu


Create a local database

Thank You
