BTN323: INTRODUCTION TO BIOLOGICAL DATABASES
description
Transcript of BTN323: INTRODUCTION TO BIOLOGICAL DATABASES
![Page 1: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/1.jpg)
BTN323:INTRODUCTION TO
BIOLOGICAL DATABASES
Day1: NCBI Databases and Entrez
Lecturer: Junaid Gamieldien, PhD
NOTE: Most slides derived from NCBI’s field guide
http://www.sanbi.ac.za/training-2/undergraduate-training/
![Page 2: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/2.jpg)
WHAT YOU NEED TO LEARN:
What is a database and what are the features of an ideal db?
What are the relationships/differences between primary and derived sequence databases?
What are the benefits of RefSeq?
Why is data integration useful?
![Page 3: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/3.jpg)
THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of theNational Library of Medicine at NIH
– Establish public databases– Research in computational biology– Develop software tools for sequence analysis– Disseminate biomedical information
Bethesda,MD
![Page 4: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/4.jpg)
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
New HomepageCommon footerCommon footer
New pages!New pages!
![Page 5: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/5.jpg)
WHAT ARE DATABASES?
Structured collection of information.
Consists of basic units called records or entries.
Each record consists of fields, which hold pre-defined data related to the record.
For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence)
![Page 6: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/6.jpg)
THE ‘PERFECT’ DATABASE
Comprehensive, but easy to search.
Annotated, but not “too annotated”.
A simple, easy to understand structure.
Cross-referenced.
Minimum redundancy.
Easy retrieval of data.
![Page 7: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/7.jpg)
THE CENTRAL DOGMA & BIOLOGICAL DATA
Protein structures-Experiments-Models (homologues)
Literature information
Original DNA Sequences(Genomes)
Protein Sequences-Inferred -Direct sequencing
Expressed DNA sequences( = mRNA Sequences= cDNA sequences)Expressed Sequence Tags (ESTs)
![Page 8: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/8.jpg)
NCBI DATABASES AND SERVICES
GenBank primary sequence database
Free public access to biomedical literature PubMed free Medline (3 million searches per day) PubMed Central full text online access
Entrez integrated molecular and literature databases
![Page 9: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/9.jpg)
TYPES OF MOLECULAR DATABASES
Primary Databases Original submissions by experimentalists Content controlled by the submitter
Examples: GenBank, Trace, SRA, SNP, GEO
Derivative Databases Derived from primary data Content controlled by third party (NCBI)
Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain
![Page 10: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/10.jpg)
PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
GenBankGenBank
SequencingSequencingCentersCenters
GA
GAGA
ATT
ATTCCGAGA
ATT
ATTCC
AT
GAGA
ATTCC GAGA
ATTCC
TTGACAAT
TGACTA
ACGTGC
TTGACA
CGTGAATTGACTA
TATAGCCG
ACGTGC
ACGTGCACGTGC
TTGACA
TTGACA
CGTGA
CGTGA
CGTGA
ATTGACTA
ATTGACTAATTGACTA
ATTGACTA
TATAGCCG
TATAGCCGTATAGCCGTATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG
CATT
GAGA
ATTCC GAGA
ATTCC LabsLabs
AlgorithmsAlgorithms
UniGene
CuratorsCurators
RefSeq
GenomeAssembly
TATAGCCGAGCTCCGATACCGATGACAA
Updated continually by NCBI
Updated ONLY by submitters
![Page 11: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/11.jpg)
SEQUENCE DATABASES AT NCBI
Primary GenBank: NCBI’s primary sequence database Trace Archive: reads from capillary sequencers Sequence Read Archive: next generation data
Derivative GenPept (GenBank translations) Outside Protein (UniProt—Swiss-Prot, PDB) NCBI Reference Sequences (RefSeq)
![Page 12: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/12.jpg)
GENBANK - PRIMARY SEQUENCE DB
Nucleotide only sequence database
Archival in nature Historical Reflective of submitter point of view (subjective) Redundant
Data Direct submissions (traditional records) Batch submissions FTP accounts (genome data)
![Page 13: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/13.jpg)
GENBANK - PRIMARY SEQUENCE DB (2) Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
![Page 14: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/14.jpg)
TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession•Stable•Reportable•Universal
Accession•Stable•Reportable•Universal
VersionTracks changes in sequenceVersionTracks changes in sequence
GI numberNCBI internal useGI numberNCBI internal use
well annotatedwell annotated
the sequence is the datathe sequence is the data
![Page 15: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/15.jpg)
DERIVATIVE SEQUENCE DATABASES
![Page 16: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/16.jpg)
FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
GENPEPT: GENBANK CDS TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
![Page 17: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/17.jpg)
REFSEQ: DERIVATIVE SEQUENCE DATABASE Curated transcripts and proteins
Model transcripts and proteins
Assembled Genomic Regions
Chromosome records Human genome microbial organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
![Page 18: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/18.jpg)
SELECTED REFSEQ ACCESSION NUMBERS
mRNAs and Proteins
NM_123456 Curated mRNANP_123456 Curated ProteinNR_123456 Curated non-coding RNAXM_123456 Predicted mRNAXP_123456 Predicted Protein XR_123456 Predicted non-coding RNAGene RecordsNG_123456 Reference Genomic SequenceChromosomeNC_123455 Microbial replicons, organelle genomes, human chromosomesAC_123455 Alternate assembliesAssembliesNT_123456 Contig NW_123456 WGS Supercontig
![Page 19: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/19.jpg)
GENBANK TO REFSEQ
![Page 20: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/20.jpg)
REFSEQS: ANNOTATION REAGENTS
Genomic DNAGenomic DNA((NCNC, , NT, NWNT, NW))
Model mRNAModel mRNA (XM)(XM)(XR)(XR)
Curated mRNACurated mRNA (NM)(NM)(NR)(NR)
Model protein Model protein (XP)(XP)
Curated ProteinCurated Protein (NP)(NP)
Scanning....
= ?
GenBankSequences
RefSeq
![Page 21: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/21.jpg)
REFSEQ BENEFITS
Non-redundancy
Updates to reflect current sequence data and biology
Data validation
Format consistency
Distinct accession series
Stewardship by NCBI staff and collaborators
![Page 22: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/22.jpg)
OTHER DERIVATIVE DATABASES
Expressed Sequences
dbSNP
Structure
Gene
and more…
![Page 23: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/23.jpg)
ENTREZ
FINDING RELEVANT INFORMATION IN NCBI
DATABASES
![Page 24: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/24.jpg)
ENTREZ: A DISCOVERY SYSTEM
Gene
Taxonomy
PubMed abstracts
Nucleotide sequences
Protein sequences
3-D Structure
3 -D Structure
Word weight
VAST
BLASTBLAST
Phylogeny
Hard LinkNeighborsRelated Sequences
NeighborsRelated SequencesBLinkDomains
NeighborsRelated Structures
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered relationships.
•Used less than expected.
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered relationships.
•Used less than expected.
![Page 25: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/25.jpg)
GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databasesThe Entrez system: 38 (and counting) integrated databases
![Page 26: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/26.jpg)
TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
![Page 27: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/27.jpg)
THE PROBLEM
Rapidly growing databases with complex and changing relationships
Rapidly changing interfaces to match the above
Result Many people don’t know:
Where to begin Where to click on a Web page Why it might be useful to click there
![Page 28: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/28.jpg)
GLOBAL NCBI (ENTREZ) SEARCH
colon cancercolon cancer
![Page 29: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/29.jpg)
GLOBAL ENTREZ SEARCH RESULTS
![Page 30: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/30.jpg)
ENTREZ TIP: START SEARCHES IN GENE
Other Entrez DBs
HomoloGene
Entrez Protein
Gene
UniGene
BLink
Homologene:Gene Neighbors
![Page 31: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/31.jpg)
PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]MLH1[Gene Name] AND Human[Organism]
![Page 32: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/32.jpg)
MLH1 GENE RECORD
![Page 33: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/33.jpg)
MLH1:LINKS TO SEQUENCE
![Page 34: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/34.jpg)
GENEVIEW: HUMAN MLH1 VARIATIONS
ATPase domain
![Page 35: BTN323: INTRODUCTION TO BIOLOGICAL DATABASES](https://reader036.fdocuments.us/reader036/viewer/2022062314/568144f1550346895db1c1c1/html5/thumbnails/35.jpg)
‘TAKE HOME MESSAGE’ ADVANTAGES OF DATA INTEGRATION More relevant inter-related information in one
place
Makes it easier to find additional relevant information related to your initial query
Potentially find information indirectly linked, but relevant to your subject of interest uncover non-obvious genetic features that
explain phenotype or disease
Easier to build a ‘story’ based on multiple pieces of biological evidence