Databases in Bioinformatics
![Page 1: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/1.jpg)
Databases in Bioinformatics
From a ppt by Mark Pallen, Prof. of Bacterial Pathogenesis
Univ. Birmingham
![Page 2: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/2.jpg)
Databases in Bioinformatics
• Sequence databases
• Sequence analysis
• Functional genomics
• Literature databases
• Structural databases
• Metabolic pathway databases
• Specialised databases
![Page 3: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/3.jpg)
The definitive source….
http://nar.oxfordjournals.org/content/vol34/suppl_1/index.dtl
![Page 4: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/4.jpg)
DNA Sequence databases
• Main repositories:
– GenBank (US)
  • http://www.ncbi.nlm.nih.gov/Genbank/index.html
– EMBL (Europe)
  • http://www.ebi.ac.uk/embl/
– DDBJ (Japan)
  • http://www.ddbj.nig.ac.jp/
• Primary databases
– DNA sequences are identical
![Page 5: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/5.jpg)
![Page 6: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/6.jpg)
www.ncbi.nlm.nih.gov
![Page 7: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/7.jpg)
PubMed is…
• National Library of Medicine's search service
• >14 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via side bar)
![Page 8: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/8.jpg)
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
ENTREZ THE LIFE SCIENCES ENGINE
![Page 9: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/9.jpg)
Entrez is a search and retrieval system that integrates NCBI databases
![Page 10: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/10.jpg)
OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick and others at JHU (Johns Hopkins University)
![Page 11: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/11.jpg)
Taxonomy Browser is…
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
![Page 12: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/12.jpg)
Structure site includes…
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
![Page 13: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/13.jpg)
How can I use PubMed at NCBI to find literature information?
![Page 14: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/14.jpg)
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published in the United States and in 70 foreign countries.
It has 12 million records dating back to 1966.
![Page 15: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/15.jpg)
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE.
The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.
![Page 16: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/16.jpg)
![Page 17: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/17.jpg)
![Page 18: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/18.jpg)
PubMed search strategies
• Try the tutorial on the left sidebar
• Use Boolean queries
– lipocalin AND disease
• Try using “limits”
• Try “LinkOut” to find external resources
![Page 19: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/19.jpg)
lipocalin AND disease (96 results)
lipocalin OR disease (1.9 million results)
lipocalin NOT disease (729 results)
[Venn diagrams of two sets, 1 and 2, illustrating 1 AND 2, 1 OR 2, and 1 NOT 2]
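The three Boolean operators behave like set operations on the lists of records that match each term. A minimal Python sketch of that idea follows; the record IDs and the two result sets are made up for illustration and are not real PubMed output.

```python
# Hypothetical record IDs returned by two single-term searches.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease        # lipocalin AND disease: records in both sets
either = lipocalin | disease      # lipocalin OR disease: records in either set
only_first = lipocalin - disease  # lipocalin NOT disease: in the first set only

print(sorted(both))
print(sorted(either))
print(sorted(only_first))
```

AND narrows a search, OR broadens it, and NOT excludes records, which matches the relative sizes of the three result counts on the slide.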
![Page 20: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/20.jpg)
Fulltext Literature Databases
• Highwire
• Google Scholar
• Google Print
• Useful for finding information about genes buried in tables in papers, invisible to PubMed
![Page 21: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/21.jpg)
From Highwire
...Stanford University
![Page 22: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/22.jpg)
![Page 23: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/23.jpg)
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
DNA
X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)

RNA
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)

Protein
NP_007635 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record
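The shape of an accession number often hints at the database it comes from. The sketch below is a rough heuristic in Python, not an official specification: the patterns are assumptions that cover only the example styles listed on this slide, the function name `classify_accession` is made up, and real accession formats have more variants. The UniProt pattern is checked before the GenBank nucleotide pattern on the assumption that the one-letter prefixes O, P and Q belong to protein accessions.

```python
import re

# Heuristic patterns only. Order matters: more specific patterns come first.
ACCESSION_PATTERNS = [
    ("RefSeq mRNA", re.compile(r"^NM_\d+(\.\d+)?$")),
    ("RefSeq protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("RefSeq genomic contig", re.compile(r"^NT_\d+(\.\d+)?$")),
    ("dbSNP", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("PDB structure", re.compile(r"^\d[A-Za-z0-9]{3}$")),
    ("SwissProt/UniProt protein", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}(\.\d+)?$")),
    ("GenBank nucleotide", re.compile(r"^[A-Z]\d{5}(\.\d+)?$")),
]


def classify_accession(accession):
    """Return a best-guess database label for an accession string."""
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.match(accession):
            return label
    return "unknown"


for acc in ["X02775", "NT_030059", "rs7079946", "N91759.1", "NM_006744",
            "NP_007635", "AAC02945", "Q28369", "1KT7"]:
    print(acc, "->", classify_accession(acc))
```

A guess like this is only a convenience for routing a query; the authoritative answer is whatever record the database returns for that accession.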
![Page 24: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/24.jpg)
How can I use NCBI (or other sites) to find information about a protein or gene?
![Page 25: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/25.jpg)
![Page 26: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/26.jpg)
![Page 27: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/27.jpg)
![Page 28: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/28.jpg)
![Page 29: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/29.jpg)
![Page 30: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/30.jpg)
FASTA format
![Page 31: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/31.jpg)
Graphics format
![Page 32: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/32.jpg)
Question #4: How can I find information about a particular disease?
Answer: Try OMIM
![Page 33: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/33.jpg)
![Page 34: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/34.jpg)
![Page 35: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/35.jpg)
Sequence Databases
• Annotated sequence databases
– SWISS-PROT, GenBank etc…
– Usage: identifying function, retrieving information
• Low-annotation sequence databases
– EST databases, high-throughput genome sequences
– Usage: discovery of new genes
![Page 36: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/36.jpg)
General Protein Databases
• SWISS-PROT
– Manually curated
– High-quality annotations, less data
• GenPept/TrEMBL
– Translated coding sequences from GenBank/EMBL
– Few annotations, more up to date
• PIR
– Phylogenetic-based annotations
• All 3 now combining efforts to form UniProt (http://www.uniprot.org)
![Page 37: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/37.jpg)
Low-annotation Databases
• ESTs (Expressed Sequence Tags)
– Low-quality sequences generated by high-volume sequencing of the 3' or 5' ends of cDNAs
– http://www-users.med.cornell.edu/~jawagne/cDNA_cloning.html
• High-throughput genome sequences
– Produced by mass-sequencing of genomic DNA
![Page 38: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/38.jpg)
Non-redundant Databases
• Sequence data only: cannot be browsed, can only be searched using a sequence
• Combine sequences from more than one database
• Examples:
– NR Nucleic (GenBank+EMBL+DDBJ+PDB DNA)
– NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein)
![Page 39: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/39.jpg)
Sequence & Structure Databases
• PDB (Protein Databank)
– Stores 3-dimensional atomic coordinates for biological molecules including protein and nucleic acids
– Data obtained by X-ray crystallography, NMR, or computer modeling
– http://www.rcsb.org/pdb/
• MMDB (Molecular Modeling database)
– Over 28,000 3D macromolecular structures, including proteins and polynucleotides
– (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
• SCOP (Structural Classification of Proteins)
– Classification of proteins according to structural and evolutionary relationships
![Page 40: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/40.jpg)
File Formats
• GenBank/GB, GenBank flatfile format
• NBRF format
• EMBL, EMBL flatfile format
• Swissprot
• GCG, single sequence format of GCG software
• DNAStrider, for common Mac program
• Pearson/Fasta, a common format used by Fasta programs and others
• Phylip3.2, sequential format for Phylip programs
• Phylip, interleaved format for Phylip programs (v3.3, v3.4)
• Plain/Raw, sequence data only (no name, document, numbering)
• MSF, multi sequence format used by GCG software
• PAUP's multiple sequence (NEXUS) format
• ASN.1 format used by NCBI
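Of the formats listed, Pearson/Fasta is the simplest to read programmatically: each record starts with a `>` header line, followed by one or more lines of sequence. The Python sketch below is a minimal reader written for illustration; the function name `parse_fasta` and the two example records are made up, and production code would normally rely on an established library instead.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example (one nucleotide, one protein).
example = """>seq1 hypothetical nucleotide record
aaacaaaccaaatatggatt
ttattgtagc
>seq2 hypothetical protein record
MDFIVAIFAL
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```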
![Page 41: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/41.jpg)
EMBL Format
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaa
http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html
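An EMBL flat file is line-oriented: each line begins with a two-letter code (ID, AC, DE, OS, FT, and so on), `XX` lines are spacers, and sequence data lines begin with blanks. A minimal Python sketch of grouping lines by code is shown below; the function name `parse_embl_fields` is made up for illustration, and the sample is a shortened excerpt of the record on this slide.

```python
def parse_embl_fields(text):
    """Group EMBL flat-file lines by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        # Skip spacer lines (XX), the record terminator (//), and lines
        # without a code (blank lines and sequence data lines).
        if code in ("XX", "//") or not code.strip():
            continue
        fields.setdefault(code, []).append(value)
    return fields


# Shortened excerpt of the record shown on the slide.
record = """ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS   Trifolium repens (white clover)
"""

fields = parse_embl_fields(record)
# The AC line holds a semicolon-separated list of accession numbers.
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
print(accessions)
print(fields["OS"][0])
```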
![Page 42: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/42.jpg)
Genbank Format
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
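In a GenBank flat file the keyword sits at the start of the line, and the sequence follows the `ORIGIN` line as numbered blocks of ten bases. The Python sketch below pulls out the accession and the raw sequence; the function names are made up for illustration, the sample is a shortened excerpt of the record on this slide, and the closing `//` terminator is assumed from the usual flat-file layout since the slide truncates the record.

```python
import re


def genbank_accession(text):
    """Return the first accession listed on the ACCESSION line, or None."""
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            return line.split()[1]
    return None


def genbank_sequence(text):
    """Return the sequence between ORIGIN and // (or the end of the text)."""
    in_origin = False
    parts = []
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Drop position numbers and spaces, keep only the letters.
            parts.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(parts)


# Shortened excerpt of the record shown on the slide.
sample = """LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt
//
"""

print(genbank_accession(sample))
print(len(genbank_sequence(sample)))
```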
![Page 43: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/43.jpg)
http://us.expasy.org/sprot/userman.html
Swissprot format
![Page 44: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/44.jpg)
Specialized Sequence Databases
• Focus on a specific type of sequences
• Sequences are often modified or specially annotated
• Usage depends on the database
• Examples:
– Ribosomal RNA databases
– Immunology databases
![Page 45: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/45.jpg)
Protein domain databases
• Pfam (http://www.sanger.ac.uk/Software/Pfam/)
– Collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
• SMART (a Simple Modular Architecture Research Tool) (http://smart.embl-heidelberg.de/help/smart_about.shtml)
– Identification and annotation of genetically mobile domains and the analysis of domain architectures
• CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
– Combines SMART and Pfam databases
– Easier and quicker search
![Page 46: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/46.jpg)
Sequence Motif Databases
• Scan Prosite (http://www.expasy.org/prosite) and PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
– Store conserved motifs occurring in nucleic acid or protein sequences
– Motifs can be stored as consensus sequences, alignments, or using statistical representations such as residue frequency tables
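A consensus motif can be searched for by translating it into a regular expression. The Python sketch below handles a small subset of PROSITE-style pattern syntax (single residues, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, and repeat counts such as `x(2,4)`); anchors and other syntax are not handled. The function name `prosite_to_regex` and the test peptide are made up for illustration, and the pattern shown is the N-glycosylation site pattern `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Optional repeat count, e.g. x(2) or x(2,4).
        repeat = ""
        m = re.match(r"^(.*?)\((\d+(?:,\d+)?)\)$", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":
            core = "."                         # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                     # one residue or an [ABC] set
        parts.append(core + repeat)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
peptide = "MKNGSAANPTL"  # hypothetical test peptide
hits = [m.start() for m in re.finditer(regex, peptide)]
print(regex, hits)
```

Here the first asparagine matches (N-G-S-A) while the second does not, because it is followed by a proline.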
![Page 47: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/47.jpg)
Ribosomal RNA Databases
• RDP (Michigan State University, USA)
– http://rdp.cme.msu.edu/html/
• rRNA database (University of Antwerp, Belgium)
– http://rrna.uia.ac.be/
• Ribosomal RNA sequences are pre-aligned according to their secondary structure
• Usage: creating data sets for molecular phylogeny, especially for microbial taxonomy and identification
![Page 48: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/48.jpg)
Immunological Sequence Databases
• The Kabat Database of Sequences of Proteins of Immunological Interest
– www.hgmp.mrc.ac.uk/Bioinformatics/Databases/kabatp-help.html
– Sequences are classified according to antigen specificity, and available in pre-aligned format
• The Immunogenetics database (IMGT)
– http://imgt.cnusc.fr:8104/
– Focuses on immunoglobulins, T-cell receptors and MHC genes
![Page 49: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/49.jpg)
Genome Databases
• Focus on one organism or group of organisms
• Examples:
– Colibase (E. coli and related species)
  • http://colibase.bham.ac.uk/
– GDB (human)
  • http://www.gdb.org/
– Flybase (Drosophila)
  • http://flybase.bio.indiana.edu/
– WormBase (C. elegans)
  • http://wormbase.org
– AtDB (Arabidopsis)
  • http://www.arabidopsis.org
– SGD (S. cerevisiae)
  • http://genome-www.stanford.edu/Saccharomyces/
![Page 50: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/50.jpg)
Expression Databases
• RNA expression
– Results of microarray experiments measuring the change in specific mRNA content under certain conditions
– ArrayExpress (EBI) and GEO (NCBI)
– Not user friendly
• Proteome databases
– 2D gel electrophoresis images representing the protein content of a cell or tissue under specific conditions
– SWISS 2D PAGE at http://us.expasy.org/ch2d/
![Page 51: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/51.jpg)
Other Database Types
• Literature
– MEDLINE (http://ncbi.nlm.nih.gov/PubMed/)
– HighWire (http://www.highwire.org)
• Variation
– dbSNP (http://ncbi.nlm.nih.gov/SNP/)
– HGBase (http://hgbase.interactiva.de)
• Metabolic pathways
– KEGG (http://kegg.genome.ad.jp/kegg/)
– WIT (http://wit.mcs.anl.gov/WIT2)
• Organisms and nomenclature
– Taxonomies (e.g.: http://ncbi.nlm.nih.gov/Taxonomy/)
– Mendel (http://mbclserver.rutgers.edu/CPGN)
![Page 52: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/52.jpg)
Methods for Accessing Data
• local installation
• screen scraping
• BioPerl
• FTP sites
• DAS
![Page 53: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/53.jpg)
• Screen scraping is a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Screen scraping often involves ignoring binary data (usually images or multimedia data) and formatting elements that obscure the essential, desired text data. Optical character recognition software is a kind of visual scraper.
• There are a number of synonyms for screen scraping, including: data scraping, data extraction, web scraping, page scraping, web page wrapping and HTML scraping (the last four being specific to scraping web pages).
![Page 54: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/54.jpg)
Local Installations
• SRS
– Need to obtain license from Lion Biosciences
• Download data from FTP sites
• Ensembl
– "framework to organise biology around the sequences of large genomes"
– www.ensembl.org
![Page 55: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/55.jpg)
Screen Scraping
• URL spoofing
– construction of URLs that replicate the query
• HTML parsing
– extraction of results from HTML pages returned by query
• Requirements
– HTML module
– knowledge of query mechanism
• Method NOT advocated by most data providers
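The two steps above can be sketched with the Python standard library. This is an illustration only: the base URL `http://example.org/search`, the `term` parameter, and the HTML snippet are all hypothetical stand-ins, and nothing is fetched over the network. A real scraper would need to know the actual query mechanism of the site, and most providers prefer that you use their supported download or programmatic interfaces instead.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Step 1: build a URL that replicates what a search form would send."""
    return base_url + "?" + urlencode(params)


class TableCellParser(HTMLParser):
    """Step 2: collect the text found inside <td> elements of a result page."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


# Hypothetical endpoint and parameter name.
url = build_query_url("http://example.org/search",
                      {"term": "lipocalin AND disease"})
print(url)

# Hypothetical result page, standing in for the HTML a query would return.
page = ("<html><body><table><tr><td>X02775</td><td>RBP4</td></tr>"
        "</table></body></html>")
parser = TableCellParser()
parser.feed(page)
print(parser.cells)
```

Because the parser depends on the exact page layout, any redesign of the site can silently break it, which is one reason the method is discouraged.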
![Page 56: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/56.jpg)
BioPerl
• BioPerl is a collection of modules that facilitates the development of Perl scripts for bioinformatics applications.
• www.bioperl.org
![Page 57: Databases in Bioinformatics](https://reader035.fdocuments.us/reader035/viewer/2022081416/56814967550346895db6baa0/html5/thumbnails/57.jpg)
ReadSeq
• Converts input DNA/AA sequence to specified format
Usage: readseq my1st.seq my2nd.seq -all -format=genbank -output=my.gb
Online Manual: http://www.psc.edu/general/software/packages/readseq/manual.html