Post on 29-May-2020
Biological Databases
What will we discuss today?
• Why are databases the backbone of bioinformatics?
• The basic structure of a database
• Data versus annotation
• Types of DBs: GenBank, PubMed and NCBI
• Query strategies
• Quality of data issues
http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/
Biologists Collect Lots of Data
• Hundreds of thousands of species
• Millions of articles in scientific journals
• Genetic information:
  – gene names (thousands)
  – phenotype of mutants (infinite?)
  – location of genes/mutations on chromosomes
  – linkage (distances between genes)
• High-throughput technology
  – Rapid, inexpensive DNA sequencing
  – Many methods of collecting genotype data
    • Assays for specific polymorphisms
    • Genome-wide SNP chips
• Must have data quality assessment prior to analysis
One sequencer => 1-2Tb/week !!
Curated Biological Data: DNA, nucleotide sequences
Gene boundaries, topology
Gene structure
Introns, exons, ORFs, splicing
Expression data
Mass spectrometry (metabolomics, proteomics)
Post-translational protein modification (PTM)
Curated Biological Data: Proteins, residue sequences
MCTUYTCUYFSTYRCCTYFSCD
Extended sequence information
Secondary structure
Hydrophobicity, motif data
Protein-protein interaction
Curated Biological Data: 3D structures, folds
WHAT is a database?
• A collection of data that needs to be:
  – Structured
  – Searchable
  – Updated (periodically)
  – Cross-referenced
• Challenge:
  – To change “meaningless” data into useful information that can be accessed and analysed in the best way possible.
For example: HOW would YOU organize all biological sequences so that the biological information is optimally accessible?
http://en.wikibooks.org/wiki/Data_Management_in_Bioinformatics
A Spreadsheet can be a Database
• Columns are fields
• Rows are records
• Can search for a term within just one field
• Or combine searches across several fields
SNP ID | SNPSeq ID | Gene | +primer | -primer | Hap A | Hap B | Hap C
D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84 | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT | C | — | A
M-05554_1 | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA | TTTTGAGGTGTTAATGGTTCT | C | — | A
M-05554_2 | X60184 | complement component factor i | ACTTCCAGCCCTGGCTCT | ATATGCCACCAAGAAGCA | A | C | —
M-09947_3 | AF067835 | caspase 8 | TCACAGAGGGAAACATGAAG | CTCCACATTGAACCAAAGCA | G | C | T
M-11415_1 | U02023 | insulin-like growth factor binding protein | GGGAAAAGCCTGAAAGAAGC | AGCTGAAACCGGACATCAAT | T | G | —
D1Mit284_3 | J05234 | nucleolin | TGTTGGAACCGACTTCTTCA | AAGAGTCAAAGAATTTATGGAATGA | G | T | T
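The idea that columns are fields and rows are records can be sketched in a few lines of Python. This is only an illustration: the field names (`snp_id`, `gene`, `hap_a`, ...) and the `search` helper are made up for this sketch, and only four rows of the SNP table are copied in.

```python
# Each record is one row of the table; each key is one field (column).
# "-" stands for the dash shown in the table where no allele is given.
records = [
    {"snp_id": "D1Mit160_1", "gene": "lymphocyte antigen 84",
     "hap_a": "C", "hap_b": "-", "hap_c": "A"},
    {"snp_id": "M-05554_2", "gene": "complement component factor i",
     "hap_a": "A", "hap_b": "C", "hap_c": "-"},
    {"snp_id": "M-09947_3", "gene": "caspase 8",
     "hap_a": "G", "hap_b": "C", "hap_c": "T"},
    {"snp_id": "D1Mit284_3", "gene": "nucleolin",
     "hap_a": "G", "hap_b": "T", "hap_c": "T"},
]


def search(rows, **criteria):
    """Return rows where every named field contains its term (case-insensitive)."""
    return [
        row for row in rows
        if all(term.lower() in row[field].lower()
               for field, term in criteria.items())
    ]


# Search for a term within just one field.
one_field = search(records, gene="caspase")

# Combine searches across several fields.
combined = search(records, hap_a="G", hap_c="T")

print([row["snp_id"] for row in one_field])
print([row["snp_id"] for row in combined])
```

Restricting a term to one field is what makes the search precise: "caspase" is looked up only in the gene field, never in a primer or an ID.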
DBMS
• Internal organization
  – Controls speed and flexibility
• A set of programs that
  – Store
  – Extract
  – Modify
[Diagram: user(s) store, extract, and modify data in the database through the DBMS]
DBMS organisation types
• Flat file databases (flat DBMS)
  – Simple, restrictive, table
• Hierarchical databases (hierarchical DBMS)
  – Simple, restrictive, tables
• Relational databases (RDBMS)
  – Complex, versatile, tables
• Object-oriented databases (ODBMS)
  – Complex, versatile, objects
• Data warehouses and distributed databases
[Diagram: an information system made up of a query system, a storage system, and the data]
Why are flat files still used?
Structured Data
• Repository of information
• Managed and accessed differently
• Flat-file (text)
• Relational (key)
• “Talk” to each other
Relational databases
• Data is stored in multiple related tables
• Data relationships across tables can be either many-to-one or many-to-many
• A few rules allow the database to be viewed in many ways
Relational Databases
• What have we achieved?
  – No repeating information
  – Less storage space
  – Better representation of reality
  – Easy modification/management
  – Easy usage of any combination of records
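A minimal sketch of these points, using Python's built-in `sqlite3` module. The table and column names are assumptions chosen for this example, and the SNP ID `M-09947_4` is invented purely to show a many-to-one relationship (several SNPs pointing at one gene).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE gene (
        accession TEXT PRIMARY KEY,  -- the key that links the tables
        name      TEXT NOT NULL
    );
    CREATE TABLE snp (
        snp_id    TEXT PRIMARY KEY,
        accession TEXT NOT NULL REFERENCES gene(accession)
    );
""")

# The gene name is stored once, however many SNPs refer to it.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("AF067835", "caspase 8"), ("J05234", "nucleolin")],
)
conn.executemany(
    "INSERT INTO snp VALUES (?, ?)",
    [
        ("M-09947_3", "AF067835"),
        ("M-09947_4", "AF067835"),  # hypothetical second SNP in the same gene
        ("D1Mit284_3", "J05234"),
    ],
)

# Combine records from both tables through the shared key.
rows = conn.execute(
    """
    SELECT snp.snp_id, gene.name
    FROM snp JOIN gene ON snp.accession = gene.accession
    WHERE gene.name = ?
    ORDER BY snp.snp_id
    """,
    ("caspase 8",),
).fetchall()
print(rows)
```

Renaming the gene would mean updating a single row in `gene`, which is the "no repeating information" and "easy modification" benefit in practice.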
Three reasons to care …
• Database proliferation
  – Dozens to hundreds at the moment
• More and more scientific discoveries result from inter-database analysis and mining
• Rising complexity of required data combinations
  – E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)
Standard Data Formats
• DNA sequence = ACGT, but what about gaps, unknown letters, etc.?
  – How many letters per line?
  – Spaces, numbers, headers, etc.?
  – Store as a string, code as binary numbers, etc.?
• Use a completely different format for proteins?
Need standard formats!!
FASTA Format
• William Pearson (1985)
• The FASTA format is now universal for all databases and software that handle DNA and protein sequences
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
One header line: it starts with > and ends with a [return]. All other characters are part of the sequence.
Multi-Sequence FASTA file
>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
LDEAIQERDNIVTVNCTVIYLDQCVSWNKCRTSCQTTGASSTRWFHDGCC
ECVGSTCINYGVNESRCRKCPESKGELGDELDDPMEEEMQDFGESMGPFD
GPVNNNY
…
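The FASTA rule (a header line starting with >, everything else is sequence) is simple enough to parse by hand. The following is a minimal sketch, not a replacement for an established parser; the function name `parse_fasta` and the two toy records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Every non-header line belongs to the current sequence.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input, not taken from any real database entry.
example = """>seq1 first toy record
ACGTAC
GTAC
>seq2 second toy record
MRHVQF
"""

parsed = parse_fasta(example)
for header, sequence in parsed:
    print(header.split()[0], len(sequence))
```

Because the sequence lines are simply joined, the parser does not care how many letters per line the file uses, which is one reason the format travels so well between programs.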
Reformatting Data Files
• Much of the routine (yet annoying) work of bioinformatics involves messing around with data files to get them into formats that will work with various software
• Then messing around with the results produced by that software to create a useful summary…
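A typical small reformatting chore is re-wrapping a sequence that arrives in space-separated blocks into fixed-width FASTA lines. The sketch below assumes a line width of 60 and uses two hypothetical helper names, `wrap_sequence` and `to_fasta`; the sequence is the 87-residue FBpp0082232 protein from the multi-sequence example.

```python
def wrap_sequence(seq, width=60):
    """Strip all whitespace, then cut the sequence into lines of `width` letters."""
    cleaned = "".join(seq.split())
    return [cleaned[i:i + width] for i in range(0, len(cleaned), width)]


def to_fasta(header, seq, width=60):
    """Build FASTA text: one '>' header line followed by wrapped sequence lines."""
    return "\n".join([">" + header] + wrap_sequence(seq, width))


# Sequence as it appears in the source file: 50-letter blocks split by a space.
spaced = ("MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV "
          "RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS")

fasta_text = to_fasta("FBpp0082232", spaced, width=60)
print(fasta_text)
```

The letters themselves never change, only the layout, so a quick check that the joined lines equal the original sequence is a cheap safeguard after any reformatting step.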
Accession Numbers!! (keys)
• Databases are designed to be searched by accession numbers (and locus IDs)
• These are guaranteed to be non-redundant, accurate, and not to change.
• Searching by gene names and keywords is doomed to frustration and probable failure
Neither scientists nor computers can be trusted to accurately and consistently annotate database entries!!
Accessing database information
• A request for data from a database is called a query
• Queries can take three forms:
  – Choose from a list of parameters
  – Query by example (QBE)
  – Structured Query Language (SQL)
Web Query
• Most databases have a web-based query tool
• It may be simple…
… or complex
Query Languages
• The standard
  – SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
  – Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
  – Standard interactive and programming language for getting information from and updating a database.
  – RDBMS (SQL), ODBMS (Java, C++, OQL, etc.)
Database Searching
A database can only be searched in ways that it was designed to be searched
Boolean: "AND" and "OR" searches
Bad to search for "human hemoglobin" in a 'Description' field
Much better to search for "Homo sapiens" in 'Organism' AND "HBB" in 'gene name'
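The field-restricted Boolean search can be written down as a query string. The sketch below builds one in the bracketed field-tag style used by NCBI Entrez (`term[Field]`). The helper `fielded_query` is hypothetical, and the exact field names a given database accepts are an assumption that should be checked against that database's search help.

```python
def fielded_query(**fields):
    """Join field-restricted terms with AND, e.g. "Homo sapiens"[Organism]."""
    parts = []
    for field, term in fields.items():
        label = field.replace("_", " ")  # Gene_Name -> Gene Name
        quoted = f'"{term}"' if " " in term else term  # quote multi-word terms
        parts.append(f"{quoted}[{label}]")
    return " AND ".join(parts)


# Restrict each term to the field where it belongs.
query = fielded_query(Organism="Homo sapiens", Gene_Name="HBB")
print(query)
```

Compared with typing "human hemoglobin" into a free-text description field, each term here can only match in the field it was meant for.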
Strategies
• Use accession numbers whenever possible
• Start with broad keywords and narrow the search using more specific terms
• Try variants of spelling, numbers, etc.
• Search all relevant databases
• Be persistent!!
Data versus metadata (annotation)
Primary vs derived data
Heterogeneity in data (scientific data domains)
Gene Ontology
• Biology is a messy science
• Assortment of names, mutants, odd phenotypes
  – “sonic hedgehog”
• Gene Ontology (GO)
  – Molecular function (specific tasks)
  – Biological process (broad biological goal)
  – Cellular component (location)
GIGO (garbage in, garbage out): Data Quality Matters
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.
Some Biological databases …
Some statistics
• More than 1000 different databases
• Generally accessible through the web (useful link: www.expasy.ch/alinks.html)
• Variable size: <100Kb to >10Gb
  – DNA: > 10 Gb
  – Protein: 1 Gb
  – 3D structure: 5 Gb
  – Other: smaller
• Update frequency: daily to annually
NAR Database Issue
• Online collection of biological databases: http://www.oxfordjournals.org/nar/database/c/
Public Sequence Databases
[Diagram: GenBank (NCBI, NIH; searched with Entrez), EMBL (EBI; searched with SRS), and DDBJ (CIB, NIG; searched with getentry) each receive submissions and updates and exchange them with one another]
Same sequence information in all three, but different tools for searching and retrieval
GenBank
• Contains all DNA and protein sequences described in the scientific literature or collected in publicly funded research
• Flat file: composed entirely of text
• Each submitted sequence is a record
• Has fields for Organism, Date, Author, etc.
• Unique identifier for each sequence
  – Locus and Accession #
Growth of Genbank
GenBank Flat File (GBFF)
LOCUS       MUSNGH       1803 bp    mRNA             ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:CHIHIRO@ms.toyama-mpu.ac.jp, Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//
[Slide callouts on the record: Header (title, taxonomy, citation); Features (AA seq); DNA sequence; Fields]
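Because each GBFF field starts with a keyword in the first column and continuation lines are indented, the header part of a record can be read with very little code. The following is a rough sketch under that assumption only. `parse_gbff_header` is a hypothetical helper, it stops at FEATURES, and it keeps just the last value when a keyword such as REFERENCE repeats. The excerpt is a shortened copy of the MUSNGH record.

```python
def parse_gbff_header(text):
    """Collect top-level GBFF fields (keyword in column 1) up to FEATURES."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("FEATURES") or line.startswith("//"):
            break  # header section is over
        if not line.strip():
            continue
        if not line[0].isspace():
            # New top-level keyword such as LOCUS or ACCESSION.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:
            # Indented line: continuation of the current field.
            fields[current] += " " + line.strip()
    return fields


excerpt = """LOCUS       MUSNGH       1803 bp    mRNA             ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
KEYWORDS    neurite extension activity; growth arrest; TA20.
FEATURES             Location/Qualifiers
"""

header = parse_gbff_header(excerpt)
print(header["ACCESSION"])
print(header["DEFINITION"])
```

For real work an established parser is the safer choice, since the full format has sub-keywords, repeated references, and a feature table that this sketch ignores.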
http://www.ncbi.nlm.nih.gov/Genbank
• Once upon a time, GenBank mailed out sequences on CD-ROM disks a few times per year.
• GenBank at least doubles in size every 18 months
• There are approximately 130,671,233,801 bases from 142,284,608 reported sequences in the traditional GenBank divisions as of August 2011.
Distribution of sequence databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> 1994
• WWW 1993 -> ?
• DVD 2001 -> ?
• Mailing hard drives 2009 -> ?
• Many sequences in GenBank correspond to the same gene
• Genomic clones, full-length mRNA, various kinds of ESTs, submitted by different investigators
• RefSeq is the “Reference Sequence” for a gene, as determined by GenBank curators
  – best guess given the current evidence, can change
  – usually based on the longest mRNA
  – usually has both 5’ and 3’ UTR
• Not necessarily reliable
  – A lot is not yet known… e.g., alternative splicing
Last thoughts on Genbank ...
• Often only FASTA files are used (e.g. for BLAST)
• GBFF are simply human-readable versions of these records
• GBFF have become a vehicle for a lot more information than they were meant to carry
• Keep in mind that GenBank is DNA-centric and is a poor vehicle for protein and mRNA expression/interaction information
Many Datasets at NCBI
• The NCBI hosts a huge interconnected database system that, in addition to DNA and protein, includes:
  – Journal articles (PubMed)
  – Genetic diseases (OMIM)
  – Polymorphisms (dbSNP)
  – Gene expression (GEO)
  – Cytogenetics (CGH/SKY/FISH & CGAP)
  – Taxonomy
  – Chemistry (PubChem)
Ensembl at EBI/EMBL
http://genome.cshlp.org/content/14/5/971.full
KEGG: Kyoto Encyclopedia of Genes and Genomes
• Enzymatic and regulatory pathways
• Mapped out by EC number and cross-referenced to genes in all known organisms (wherever sequence information exists)
• Parallel maps of regulatory pathways
http://www.wwpdb.org
Golden Rules
• Use published databases and methods
  – Supported, maintained, trusted by community
• Document what you have done !!!
  – Sequence identification numbers
  – Server, database, program VERSION
  – Program parameters
• Assess reliability of results
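One lightweight way to document a search is to write the items listed above into a small machine-readable record next to the results. This is only a sketch: the helper `provenance_record` and every value passed to it (server, program name, version, parameter) are placeholders invented for illustration, not a real run.

```python
import json


def provenance_record(sequence_ids, server, database, program, version, parameters):
    """Bundle the details worth documenting for one database search."""
    return {
        "sequence_ids": list(sequence_ids),
        "server": server,
        "database": database,
        "program": program,
        "version": version,
        "parameters": dict(parameters),
    }


# All values below are placeholders, not a real analysis.
record = provenance_record(
    sequence_ids=["D25291"],
    server="example.org",
    database="nucleotide",
    program="example-search-tool",
    version="1.0.0",
    parameters={"max_hits": 10},
)

# One JSON line per run is easy to append to a lab notebook file.
log_line = json.dumps(record, sort_keys=True)
print(log_line)
```

Keeping the program version and parameters alongside the accession numbers is what lets someone else, or you a year later, repeat the search and understand why the results may differ.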
Bio-databases: A short word on problems
• Even today we face some key limitations
  – There is no single standard format
    • Every database or program has its own format
  – There is no standard nomenclature
    • Every database has its own names
  – Data is not fully optimized
    • Some datasets have missing information without any indication of it
  – Data errors
    • Data is sometimes of poor quality, erroneous, misspelled
    • Error propagation resulting from computer annotation
What to take home
• Databases are collections of data
  – Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Computers can only create data, they do not give answers
• Learn to use the big reliable databases (e.g. NCBI)
• Open access to sequences is essential for all of the work we do: without it there would be no bioinformatics, no BLAST, no Computational Bioscience Program
• Open access to sequence information is not all that needs to be open. We also need open access to the literature.
RECOMMENDED EXERCISES
http://mibiol.biol.lu.se.webbhotell.ldc.lu.se/Bioinformatics/Exercises/databases.html
http://wiki.bio.dtu.dk/teaching/index.php/Exercise:_Searching_the_GenBank_database
http://biocourse.sanbi.ac.za/wp-content/uploads/2013/02/Biological-Databases-Exercises.pdf