Molecular Biology Databases
![Page 1: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/1.jpg)
Lawrence Hunter, Ph.D.
Director, Computational Bioscience Program
University of Colorado School of Medicine
[email protected]
http://compbio.uchsc.edu/Hunter
Molecular Biology Databases
![Page 2: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/2.jpg)
Tour of the major molecular biology databases
• A database is an indexed collection of information
• There is a tremendous amount of information about biomolecules in publicly available databases.
• Today, we will just look at some of the main databases and what kind of information they contain.
![Page 3: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/3.jpg)
Data about Databases
• Nucleic Acids Research publishes an annual database issue. The 2009 issue lists 1170 editorially selected databases (link on course web site)
• Small excerpt from the A's:
– AARSDB: Aminoacyl-tRNA synthetase sequences
– ABCdb: ABC transporters
– AceDB: C. elegans, S. pombe, and human sequences and genomic information
– ACTIVITY: Functional DNA/RNA site activity
– ALFRED: Allele frequencies and DNA polymorphisms
![Page 4: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/4.jpg)
Located Sequence Features
• Indexing relevant data isn’t always easy
– Naming schemes are always in flux, vary across communities, and are often controversial.
– Descriptions of phenotypes are very difficult to standardize (even many clinical ones)
• Genome sequences provide a clear reference
– A “located sequence feature” (place on a chromosome) is unambiguous and biologically meaningful
– Closely related to the molecular concept of a gene.
![Page 5: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/5.jpg)
What can be discovered about a gene by a database search?
• Best to have specific informational goals:
– Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
– Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
– Structural information: associated protein structures, fold types, structural domains
– Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
– Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
![Page 6: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/6.jpg)
Using a database
• How to get information out of a database:
– Summaries: how many entries, average or extreme values; rates of change, most recent entries, etc.
– Browsing: getting a sense of the kind and quality of information available, e.g. checking familiar records
– Search: looking for specific, predefined information
• “Key” to searching a database:
– Must identify the element(s) of the database that are of interest somehow:
• Gene name, symbol, location or other identifying information.
• Sequences of genes, mRNAs, proteins, etc.
• A cross-reference from another database or a database-generated id.
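The three kinds of key listed above (name or symbol, sequence, cross-reference) can be illustrated with a toy in-memory index. This is only a sketch: the `GeneRecord` and `ToyGeneDatabase` classes, the record ids, the symbols, and the `OTHERDB` cross-references are all made up for illustration and do not correspond to any real database.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    record_id: str   # database-generated id (made up for this sketch)
    symbol: str      # gene symbol (placeholder, not a real gene)
    location: str    # chromosomal location (placeholder)
    sequence: str    # nucleotide sequence (short placeholder)
    xrefs: dict = field(default_factory=dict)  # other database name -> id there


class ToyGeneDatabase:
    """A tiny 'indexed collection': one dictionary per kind of key."""

    def __init__(self, records):
        self.by_symbol = {}
        self.by_sequence = {}
        self.by_xref = {}
        for rec in records:
            # Normalize case so lookups do not depend on how the key was typed.
            self.by_symbol[rec.symbol.upper()] = rec
            self.by_sequence[rec.sequence.lower()] = rec
            for other_db, other_id in rec.xrefs.items():
                self.by_xref[(other_db, other_id)] = rec

    def find(self, *, symbol=None, sequence=None, xref=None):
        """Return the matching record, or None if the key is unknown."""
        if symbol is not None:
            return self.by_symbol.get(symbol.upper())
        if sequence is not None:
            return self.by_sequence.get(sequence.lower())
        if xref is not None:
            return self.by_xref.get(tuple(xref))
        return None


records = [
    GeneRecord("TOY:0001", "GENEA", "chr1:1000-2000", "atgcacttgagc", {"OTHERDB": "X123"}),
    GeneRecord("TOY:0002", "GENEB", "chr2:500-900", "atgagcacagca", {"OTHERDB": "X456"}),
]
db = ToyGeneDatabase(records)
```

Note that the sequence index here only supports exact matches; real sequence databases are searched by similarity, which needs a different kind of index.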
![Page 7: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/7.jpg)
Searching for information about genes and their products
• Gene and gene product databases are often organized by sequence
– Genomic sequence encodes all traits of an organism.
– Gene products are uniquely described by their sequences.
– Similar sequences among biomolecules indicate both similar function and an evolutionary relationship
• Macromolecular sequences provide biologically meaningful keys for searching databases
![Page 8: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/8.jpg)
Searching sequence databases
• Starting from a sequence alone, find information about it
• Many kinds & sources of input sequences
– Genomic, expressed, protein (amino acid vs. nucleic acid)
– Complete or fragmentary sequences
• Goal is to retrieve a set of similar sequences.
– Exact matches are rare, and not always interesting
– Both small differences (mutations) and large ones (in regions not required for function) within “similar” sequences can be biologically important.
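One simple way to put a number on "similar" is percent identity between two sequences that have already been aligned. The sketch below assumes the alignment is given (equal-length strings with `-` marking gaps). The function name `percent_identity` is illustrative, and this is not how BLAST scores hits; it only shows the kind of measure a similarity search ranks by.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared alignment columns where both sequences agree.

    Both inputs must be pre-aligned (same length, '-' for gaps).
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # nothing to compare in this column
        compared += 1
        if x == y:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared
```

For example, `percent_identity("ATGC", "ATGA")` compares four columns and finds three matches, giving 75.0.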
![Page 9: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/9.jpg)
What might we want to know about a sequence?
• Is this sequence similar to any known genes? How close is the best match? Significance?
• What do we know about that gene?
– Genomic (chromosomal location, allelic information, regulatory regions, etc.)
– Structural (known structure? structural domains? etc.)
– Functional (molecular, cellular & disease)
• Evolutionary information:
– Is this gene found in other organisms?
– What is its taxonomic tree?
![Page 10: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/10.jpg)
NCBI and Entrez
• One of the most useful and comprehensive database collections is the NCBI, part of the National Library of Medicine.
– Home to GenBank, PubMed & many other familiar DBs.
• NCBI provides interesting summaries, browsers, and search tools
• Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, chromosomal location, diseases, articles, keywords...
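Entrez searches can also be issued programmatically through NCBI's E-utilities web service. The sketch below only builds an ESearch URL and does not send it. The helper name `esearch_url` is hypothetical, and the field tags in the example query (`[Gene Name]`, `[Organism]`) are assumed to be valid for the `gene` database; check the Entrez help pages before relying on them.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not fetch) an ESearch URL for one Entrez database."""
    params = {
        "db": db,            # which Entrez database to search, e.g. "gene" or "pubmed"
        "term": term,        # the query, in Entrez query syntax
        "retmax": retmax,    # maximum number of ids to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example query; the field tags are an assumption, see the note above.
url = esearch_url("gene", "ADH1A[Gene Name] AND human[Organism]", retmax=5)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. NCBI asks automated users to limit their request rate, so check the current usage guidelines before scripting many queries.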
![Page 11: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/11.jpg)
![Page 12: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/12.jpg)
BLAST: Searching with a sequence
• The goal is to find other sequences that are more similar to the query than would be expected by chance (and therefore are likely homologous).
• Can start with nucleotide or amino acid sequence, and search for either (or both)
• Many options
– E.g. ignore low information (repetitive) sequence, set significance critical value
– Defaults are not always appropriate: READ THE NCBI EDUCATION PAGES!
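"More similar than would be expected by chance" is usually reported as an expect value (E-value): the number of hits with at least this score that a search of this size would produce by chance. A minimal sketch, assuming the standard relation E = m · n · 2^(−S′) between a bit score S′, query length m, and database length n. The function names are illustrative, and real BLAST reports use length-corrected ("effective") values for m and n, so its numbers will differ somewhat.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), where S' is the bit score, m the query
    length and n the total database length (no edge-effect correction).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def is_significant(e_value: float, threshold: float) -> bool:
    """Apply a user-chosen significance cutoff (the 'critical value')."""
    return e_value <= threshold
```

Because E grows with the size of the search space, the same bit score is less convincing against a larger database. This is one reason a default cutoff may not suit every search.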
![Page 13: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/13.jpg)
Main BLAST page
![Page 14: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/14.jpg)
A demonstration sequence
atgcacttgagcagggaagaaatccacaaggactcaccagtctcctggtctgcagagaagacagaatcaacatgagcacagcaggaaaagtaatcaaatgcaaagcagctgtgctatgggagttaaagaaacccttttccattgaggaggtggaggttgcacctcctaaggcccatgaagttcgtattaagatggtggctgtaggaatctgtggcacagatgaccacgtggttagtggtaccatggtgaccccacttcctgtgattttaggccatgaggcagccggcatcgtggagagtgttggagaaggggtgactacagtcaaaccaggtgataaagtcatcccactcgctattcctcagtgtggaaaatgcagaatttgtaaaaacccggagagcaactactgcttgaaaaacgatgtaagcaatcctcaggggaccctgcaggatggcaccagcaggttcacctgcaggaggaagcccatccaccacttccttggcatcagcaccttctcacagtacacagtggtggatgaaaatgcagtagccaaaattgatgcagcctcgcctctagagaaagtctgtctcattggctgtggattttcaactggttatgggtctgcagtcaatgttgccaaggtcaccccaggctctacctgtgctgtgtttggcctgggaggggtcggcctatctgctattatgggctgtaaagcagctggggcagccagaatcattgcggtggacatcaacaaggacaaatttgcaaaggccaaagagttgggtgccactgaatgcatcaaccctcaagactacaagaaacccatccaggaggtgctaaaggaaatgactgatggaggtgtggatttttcatttgaagtcatcggtcggcttgacaccatgatggcttccctgttatgttgtcatgaggcatgtggcacaagtgtcatcgtaggggtacctcctgattcccaaaacctctcaatgaaccctatgctgctactgactggacgtacctggaagggagctattcttggtggctttaaaagtaaagaatgtgtcccaaaacttgtggctgattttatggctaagaagttttcattggatgcattaataacccatgttttaccttttgaaaaaataaatgaaggatttgacctgcttcactctgggaaaagtatccgtaccattctgatgttttgagacaatacagatgttttcccttgtggcagtcttcagcctcctctaccctacatgatctggagcaacagctgggaaatatcattaattctgctcatcacagattttatcaataaattacatttgggggctttccaaagaaatggaaattgatgtaaaattatttttcaagcaaatgtttaaaatccaaatgagaactaaataaagtgttgaacatcagctggggaattgaagccaataaaccttccttcttaaccatt
![Page 15: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/15.jpg)
• Major choices:
– Translation
– Database
– Filters
– Restrictions
– Matrix
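The "Translation" choice decides which BLAST program is used. The sketch below maps the type of the query and the type of the database to the standard program names (blastn, blastp, blastx, tblastn, tblastx). The function `choose_blast_program` and its arguments are hypothetical helpers for illustration, not part of any NCBI toolkit.

```python
def choose_blast_program(query_type: str, db_type: str,
                         compare_as_protein: bool = False) -> str:
    """Pick a BLAST program from the query and database sequence types.

    query_type / db_type: "nucleotide" or "protein".
    compare_as_protein: for nucleotide-vs-nucleotide searches, translate
    both sides and compare at the protein level (tblastx).
    """
    valid = {"nucleotide", "protein"}
    if query_type not in valid or db_type not in valid:
        raise ValueError("types must be 'nucleotide' or 'protein'")

    if query_type == "nucleotide" and db_type == "nucleotide":
        return "tblastx" if compare_as_protein else "blastn"
    if query_type == "protein" and db_type == "protein":
        return "blastp"
    if query_type == "nucleotide" and db_type == "protein":
        return "blastx"   # query is translated in all six reading frames
    return "tblastn"      # protein query vs. translated nucleotide database
```

For the nucleotide demonstration sequence, a search against a protein database would therefore be a blastx search, while a search against a nucleotide database would be blastn (or tblastx for a protein-level comparison).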
![Page 16: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/16.jpg)
Formatted BLAST output
![Page 17: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/17.jpg)
Close hit: Macaque ADH alpha
![Page 18: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/18.jpg)
Distant hit: L-threonine 3-dehydrogenase from a thermophilic bacterium
![Page 19: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/19.jpg)
Parameters
![Page 20: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/20.jpg)
Click on:
![Page 21: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/21.jpg)
…
![Page 22: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/22.jpg)
Taxonomy report (link from “Results of BLAST” page)
![Page 23: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/23.jpg)
What did we just do?
• Identify loci (genes) associated with the sequence. Input was human Alcohol Dehydrogenase 1A
• For each particular “hit”, we can look at that sequence and its alignment in more detail.
• See similar sequences, and the organisms in which they are found.
• But there’s much more that can be found on these genes, even just inside NCBI…
![Page 24: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/24.jpg)
BLink: Precomputed BLAST
![Page 25: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/25.jpg)
Conserved domains
![Page 26: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/26.jpg)
NCBI version of KEGG & EcoCyc
![Page 27: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/27.jpg)
![Page 28: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/28.jpg)
More from Entrez Gene
![Page 29: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/29.jpg)
And more…
![Page 30: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/30.jpg)
![Page 31: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/31.jpg)
PubMed
![Page 32: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/32.jpg)
![Page 33: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/33.jpg)
Gene Expression
![Page 34: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/34.jpg)
Detailed expression information
![Page 35: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/35.jpg)
Genome map view
![Page 36: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/36.jpg)
OMIM
![Page 37: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/37.jpg)
![Page 38: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/38.jpg)
![Page 39: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/39.jpg)
NCBI is not all there is...
• Links to non-NCBI databases (see also “Link Out”)
– Reactome for pathways (also KEGG)
– HGNC for nomenclature
– HPRD protein information
– Regulatory / binding site DBs (e.g. CREB; some not linked)
– IHOP (information hyperlinked over proteins)
• Other important gene/protein resources not linked:
– UniProt (most carefully annotated)
– PDB (main macromolecular structure repository)
– UCSC (best genome viewer & many useful ‘tracks’)
– DIP / MINT (protein-protein interactions)
– More: InterPro, MetaCyc, Enzyme, etc. etc.
![Page 40: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/40.jpg)
![Page 41: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/41.jpg)
![Page 42: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/42.jpg)
Gene Names (not easy!)
![Page 43: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/43.jpg)
Protein reference db
![Page 44: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/44.jpg)
![Page 45: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/45.jpg)
![Page 46: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/46.jpg)
…
…
![Page 47: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/47.jpg)
![Page 48: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/48.jpg)
Take home messages
• There are a lot of molecular biology databases, containing a lot of valuable information
• Not even the best databases have everything (or the best of everything)
• These databases are moderately well cross-linked, and there are “linker” databases
• Sequence is a good identifier, maybe even better than gene name!
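One way to see why a sequence works as an identifier: unlike a gene name, it can be normalized and reduced to a key that is the same wherever the sequence is stored. The sketch below uses a SHA-256 digest of the upper-cased, whitespace-stripped sequence. The function `sequence_key` is hypothetical and is not the checksum scheme of any particular database.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Return a stable key for a sequence, ignoring case and whitespace."""
    # Remove spaces and line breaks (e.g. from FASTA wrapping), unify case.
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()
```

Two copies of the same sequence give the same key however they were formatted, while a single base change gives a different key. A key like this only finds exact matches, so it complements similarity searches such as BLAST rather than replacing them.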
![Page 49: Molecular Biology Databases](https://reader035.fdocuments.us/reader035/viewer/2022062310/568160f8550346895dd03659/html5/thumbnails/49.jpg)
Homework
• Pick a favorite gene (or, if you don’t know any, how about looking up one of my favorites, PPAR-Delta) and gather information about it from at least five distinct resources.
• Readings:
– Nucleic Acids Research online Molecular Biology Database Collection in 2009. Nucl. Acids Res. 2009 37: D1-D4, doi:10.1093/nar/gkn942
• also, browse some of the entries themselves.
– NCBI tutorial, Entrez: Making use of its power.