25/05/2007
A Very short Introduction to Databases
Question:
What is a database?
The dictionary definition
Function: noun
Date: circa 1962
: a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
- Webster dictionary
WHAT is a database?
A collection of data that needs to be:
• Structured
• Searchable
and should be:
• Updated (periodically)
• Cross-referenced
WHY to use a database?
Challenge: To organise data into useful information that can be accessed and analysed in the best way possible.
For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?
Question:
Which BIOLOGICAL/MEDICAL database do you know?
How could you classify them?
Type of data
Primary (experimental results)
• Genomes
• Protein sequences
• Protein structures
• Interactions
• Expression
• …
Be careful about the experiment
Secondary (derived information/classification)
• Protein folds
• Protein families
• Genome comparisons
• …
Be careful about the source of the primary data
Type of data
Complete
• Can contain many repeated entries with the same primary data (e.g. different experiments for the same protein)
Non-redundant
• Repeated or similar (in a given sense) data are reduced
Be careful about the criteria
Annotation
Raw data
• Primary data with no further information except the details of the experimental procedure
Annotated
• Contains more information besides the primary experimental data (e.g. protein sequence + function, localization, expression information)
• Manually (e.g. Swiss-Prot)
• Automatically (e.g. TrEMBL)
Cross-linked
• A particular form of annotation
Accessibility
• Public
• Private
Careful curation can be expensive
Usefulness
• Useful
• Not useful
A subjective issue, of course
A collection of data is NOT a database
We want to search the data rapidly
That needs a structure
Data organisation
Fundamental unit
"entry" or “record”
Identified by a “name” and/or an “accession number”
Associated information
Schema: structural description of the type of
facts contained in the database
SCHEMA
PROTEIN
• Name/Accession Number
• Organism
• Experimental procedure
• Author
• Publication
• …
• Sequence
MODELS
A schema is organised in a model:
• Flat file
• Relational
• Hierarchical
• Object-oriented
A model can be written in different formats
FLAT FILE Model (Swiss-Prot)
Identification
ID  Identification: mandatory, 1
AC  Accession number(s): mandatory, 1
General
DT  Date: mandatory, 3 (insertion, version, current release)
DE  Description: mandatory, 1 or more, manual
GN  Gene name: facultative, 1 or more, manual
Taxonomy
OS  Organism species: mandatory, 1
OG  Organelle: facultative, 1 or more
OC  Organism classification: mandatory, 1
OX  Organism NCBI_TaxID: facultative
Reference (facultative)
RN  Reference number
RP  Reference position
RC  Reference comment
RX  Reference cross-reference
RA  Reference authors
RT  Reference title
RL  Reference location
Comment
CC  Comments or notes: facultative, 1 or more, manual
DR  Database cross-references: facultative, 1 or more, manual
KW  Keywords: mandatory, 1 or more, manual
FT  Feature table data: facultative, 1 or more, manual
Sequence
SQ  Sequence header: mandatory, 1
    Amino acid sequence: mandatory, 1
//  Termination line: mandatory, 1
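The line-code layout of this flat-file model can be read with very little code. The following is only a minimal sketch: the sample entry is a hypothetical, heavily abbreviated record written for illustration, and it assumes the value starts at the sixth character of each line.

```python
from collections import defaultdict

# Hypothetical, abbreviated entry in the two-letter line-code layout;
# the field values are illustrative only.
SAMPLE_ENTRY = """\
ID   HBA_HUMAN
AC   P69905;
DE   Hemoglobin subunit alpha
OS   Homo sapiens (Human)
KW   Heme; Oxygen transport
SQ   SEQUENCE
     MVLSPADKTN VKAAWGKVGA
//
"""

def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):      # termination line closes the entry
            break
        code, value = line[:2], line[5:].strip()
        if in_sequence:                # sequence lines carry no line code
            sequence.append(value.replace(" ", ""))
        elif code == "SQ":             # sequence header: sequence lines follow
            in_sequence = True
        else:
            fields[code].append(value)
    fields["sequence"] = ["".join(sequence)]
    return dict(fields)

entry = parse_entry(SAMPLE_ENTRY)
print(entry["AC"], entry["OS"], entry["sequence"])
```

Storing every field as a list reflects the "1 or more" cardinality that several line types allow.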
More than a FLAT FILE: ASN.1 Format
More than a FLAT FILE: XML Format
[Figure: formats and resources named on the slide: FASTA, MMDB, UniProt, EMBL, ASN.1, Graphical, GenPept, GenBank, XML, BIND]
FLAT FILE: FASTA Format
>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin subunit alpha
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA
VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK
YR
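As an illustration of how simple the FASTA layout is to process, here is a minimal sketch of a reader (the function name `read_fasta` is an assumption of this example, not part of any library). A line starting with `>` opens a record; every other line extends its sequence.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:          # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)             # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

FASTA_TEXT = """>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin subunit alpha
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA
VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK
YR
"""

records = read_fasta(FASTA_TEXT)
header, sequence = records[0]
accession = header.split("|")[3]   # fourth "|"-separated field of this header
print(accession, len(sequence))
```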
FLAT FILE for exams
SN:Peter SS:Smith F:Math. E:Analysis TN:Michael TS:Red M:30
SN:Peter SS:Smith F:Math. E:Physics TN:Robert TS:White M:28
SN:Mary SS:Baker F:Math. E:Analysis TN:Michael TS:Red M:30
SN:Mary SS:Baker F:Math E:Physics TN:Robert TS:White M:30
SN:John SS:Jones F:Phys E:Analysis TN:Michael TS:Red M:30
SN:Louise SS:Smith F:Phys E:Phys TN:Carol TS:Blue M:29
SN = Student’s Name; SS = Student’s Surname; F = Faculty; E = Exam; TN = Teacher’s Name; TS = Teacher’s Surname; M = Mark
Pitfalls of the FLAT FILE model
Each operation on the database (extraction, update, deletion) requires reading the whole file
It is difficult to ensure consistency
• Different entries with the same code
• Duplicated entries
• The same information must be updated completely in every record that repeats it
• …
Redundancy
• E.g. taxonomy repeated for all the entries from the same organism
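The update pitfall can be made concrete with the exam flat file. In this minimal sketch (function names are illustrative), changing one teacher's surname means scanning every record and rewriting each one that repeats it; the new surname "Redd" is invented for the example.

```python
FLAT_FILE = [
    "SN:Peter SS:Smith F:Math. E:Analysis TN:Michael TS:Red M:30",
    "SN:Peter SS:Smith F:Math. E:Physics TN:Robert TS:White M:28",
    "SN:Mary SS:Baker F:Math. E:Analysis TN:Michael TS:Red M:30",
]

def parse_record(line):
    """Split one 'TAG:value' line into a dictionary."""
    return dict(field.split(":", 1) for field in line.split())

def rename_teacher(lines, old_surname, new_surname):
    """Scan every record; return the updated lines and how many changed."""
    updated, touched = [], 0
    for line in lines:                      # the whole file is always read
        record = parse_record(line)
        if record["TS"] == old_surname:
            record["TS"] = new_surname
            touched += 1
        updated.append(" ".join(f"{k}:{v}" for k, v in record.items()))
    return updated, touched

new_lines, touched = rename_teacher(FLAT_FILE, "Red", "Redd")
print(touched, "records had to be rewritten for one logical change")
```

If any one of those records were missed, the file would hold two spellings of the same teacher, which is the consistency problem described above.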
Relational databases
Data is stored in tables (relations)
• Each row (tuple) of the table represents an entry
• Each column of the table represents an attribute
Data relationships across tables can be either many-to-one or many-to-many
A few rules allow the database to be viewed in many ways
Relational DB of students and exams

| Student’s Name | Student’s Surname | Student’s ID | Faculty | Exam | Teacher’s Name | Teacher’s Surname | Teacher’s ID | Mark |
|---|---|---|---|---|---|---|---|---|
| Peter | Smith | 12 | Math. | Analysis | Michael | Red | 23 | 30 |
| Peter | Smith | 12 | Math. | Physics | Robert | White | 21 | 28 |
| Mary | Baker | 13 | Math. | Analysis | Michael | Red | 23 | 30 |
| Mary | Baker | 13 | Math | Physics | Robert | White | 21 | 30 |
| John | Jones | 45 | Physics | Analysis | Michael | Red | 23 | 30 |
| Louise | Smith | 24 | Physics | Physics | Carol | Blue | 54 | 29 |

The columns are the attributes.
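Eliminating redundancy

| Student’s ID | Exam | Teacher’s ID | Mark |
|---|---|---|---|
| 12 | Analysis | 23 | 30 |
| 12 | Physics | 21 | 28 |
| 13 | Analysis | 23 | 30 |
| 13 | Physics | 21 | 30 |
| 45 | Analysis | 23 | 30 |
| 24 | Physics | 54 | 29 |

| Student’s ID | Student’s Name | Student’s Surname | Faculty |
|---|---|---|---|
| 12 | Peter | Smith | Math |
| 13 | Mary | Baker | Math |
| 24 | Louise | Smith | Physics |
| 45 | John | Jones | Physics |

| Teacher’s ID | Teacher’s Name | Teacher’s Surname |
|---|---|---|
| 21 | Robert | White |
| 23 | Michael | Red |
| 54 | Carol | Blue |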
Eliminating redundancy

| Student’s ID | ExamC | Mark |
|---|---|---|
| 12 | AnalysisM | 30 |
| 12 | PhysicsM | 28 |
| 13 | AnalysisM | 30 |
| 13 | PhysicsM | 30 |
| 45 | AnalysisP | 30 |
| 24 | PhysicsP | 29 |

| Student’s ID | Student’s Name | Student’s Surname | Faculty |
|---|---|---|---|
| 12 | Peter | Smith | Math |
| 13 | Mary | Baker | Math |
| 24 | Louise | Smith | Physics |
| 45 | John | Jones | Physics |

| Teacher’s ID | Teacher’s Name | Teacher’s Surname |
|---|---|---|
| 21 | Robert | White |
| 23 | Michael | Red |
| 54 | Carol | Blue |

| ExamC | Exam | Faculty | Teacher’s ID |
|---|---|---|---|
| AnalysisM | Analysis | Math | 23 |
| PhysicsM | Physics | Math | 21 |
| AnalysisP | Analysis | Physics | 23 |
| PhysicsP | Physics | Physics | 54 |
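The four tables of this last step can be tried out directly. The sketch below loads them into an in-memory SQLite database with Python's standard `sqlite3` module and rebuilds the original wide view with joins. The table and column names (`student`, `teacher`, `exam`, `mark`, ...) are choices made for this example; the rows are those of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, surname TEXT, faculty TEXT);
CREATE TABLE teacher (id INTEGER PRIMARY KEY, name TEXT, surname TEXT);
CREATE TABLE exam (code TEXT PRIMARY KEY, exam TEXT, faculty TEXT,
                   teacher_id INTEGER REFERENCES teacher(id));
CREATE TABLE mark (student_id INTEGER REFERENCES student(id),
                   exam_code TEXT REFERENCES exam(code), mark INTEGER);
""")
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)", [
    (12, "Peter", "Smith", "Math"), (13, "Mary", "Baker", "Math"),
    (24, "Louise", "Smith", "Physics"), (45, "John", "Jones", "Physics")])
conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
    (21, "Robert", "White"), (23, "Michael", "Red"), (54, "Carol", "Blue")])
conn.executemany("INSERT INTO exam VALUES (?, ?, ?, ?)", [
    ("AnalysisM", "Analysis", "Math", 23), ("PhysicsM", "Physics", "Math", 21),
    ("AnalysisP", "Analysis", "Physics", 23), ("PhysicsP", "Physics", "Physics", 54)])
conn.executemany("INSERT INTO mark VALUES (?, ?, ?)", [
    (12, "AnalysisM", 30), (12, "PhysicsM", 28), (13, "AnalysisM", 30),
    (13, "PhysicsM", 30), (45, "AnalysisP", 30), (24, "PhysicsP", 29)])

# Each fact is stored once; the wide table is recovered on demand.
rows = conn.execute("""
SELECT s.name, s.surname, e.exam, t.surname, m.mark
FROM mark m
JOIN student s ON s.id = m.student_id
JOIN exam e ON e.code = m.exam_code
JOIN teacher t ON t.id = e.teacher_id
ORDER BY s.id, e.exam
""").fetchall()
for row in rows:
    print(row)
```

A teacher's surname now lives in exactly one row of `teacher`, so a change there is seen by every query without touching the marks.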
Relational Databases
What have we achieved?
• No repeated information
• Less storage space
• Better representation of reality
• Easy modification/management
• Easy usage of any combination of records
Relational Operations
Object Oriented Model
DBMS
Internal organization
• Controls speed and flexibility
A set of programs that
• Store
• Extract
• Modify
[Figure: USER(S) reach the Database through the DBMS operations Store, Extract and Modify]
DBMS
• Has to manage large numbers of records
• Has to control concurrent accesses to the database
• Has to ensure the persistence of data after usage
• Has to manage the access permissions to data and privacy-related issues
• Must be as efficient as possible (in terms of time and memory requirements)
• Must control backup and recovery events
Server and Browser
SERVER: mediates operations and communications
BROWSER: transmits requests, renders answers
[Figure: USER(S), Browser, Server, Database (Store, Extract, Modify)]
Accessing database information
A request for data from a database is called a query
Queries are usually performed with a query language that, for example, performs the basic relational operations (SELECT, PROJECT, JOIN)
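To make the three basic relational operations concrete, here is a minimal sketch in plain Python, with tables held as lists of dictionaries. The helper names are illustrative, the rows are a small subset of the students-and-exams example, and unlike a strict relational PROJECT this version does not remove duplicate rows.

```python
students = [
    {"student_id": 12, "name": "Peter", "faculty": "Math"},
    {"student_id": 13, "name": "Mary", "faculty": "Math"},
    {"student_id": 45, "name": "John", "faculty": "Physics"},
]
marks = [
    {"student_id": 12, "exam": "Analysis", "mark": 30},
    {"student_id": 12, "exam": "Physics", "mark": 28},
    {"student_id": 45, "exam": "Analysis", "mark": 30},
]

def select(rows, predicate):
    """SELECT: keep the rows (tuples) that satisfy a condition."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """PROJECT: keep only the named columns (attributes)."""
    return [{c: row[c] for c in columns} for row in rows]

def join(left, right, key):
    """JOIN: combine rows of two tables that share the same key value."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# "Which students scored 30, and in which exam?"
top = project(
    select(join(students, marks, "student_id"), lambda row: row["mark"] == 30),
    ["name", "exam"],
)
print(top)
```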
Query Languages
The standard is SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
• Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
• Standard interactive and programming language for getting information from and updating a database.
Query by Accession code or Keyword
Query by Accession code is the easiest BUT usually the accession code is not known
Query by Keyword (on all the attributes) can retrieve large numbers of entries
(for example, a search for “E.Coli” on PubMed would also return the papers of Elisabetta Coli)
Restrict the search by keyword on the suitable attribute
Restricted searches
• By a particular syntax: E.coli [auth], E.coli[ti]
• By selecting appropriate options in the browser
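The effect of restricting a keyword to one attribute can be sketched as follows. This is a toy search over two hypothetical records, not the way PubMed itself is implemented; only the `term[field]` query shape and the tags `auth` and `ti` are taken from the slide.

```python
import re

# Hypothetical records, invented for the example.
RECORDS = [
    {"auth": "Coli E", "ti": "A survey of alpine flora"},
    {"auth": "Rossi M", "ti": "Gene regulation in E.coli"},
]

def search(records, query):
    """Search all attributes, or only one if the query ends in [field]."""
    match = re.fullmatch(r"(.+?)\s*\[(\w+)\]", query.strip())
    if match:
        term, fields = match.group(1), [match.group(2)]
    else:
        term, fields = query.strip(), None
    words = term.lower().replace(".", " ").split()
    hits = []
    for record in records:
        searched = fields or list(record)   # no tag: look at every attribute
        text = " ".join(record[f] for f in searched).lower().replace(".", " ")
        if all(word in text.split() for word in words):
            hits.append(record)
    return hits

print(len(search(RECORDS, "E.coli")))        # unrestricted keyword search
print(search(RECORDS, "E.coli [auth]"))      # author field only
print(search(RECORDS, "E.coli[ti]"))         # title field only
```

The unrestricted query matches both records, including the unrelated paper by an author named Coli, while each restricted query returns only one.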
Boolean operators to refine the query
AND (&)
OR (|)
NOT (!)
Complex queries
Euler-Venn diagrams can help
Examples
Write an expression for finding
• Horse liver alcohol dehydrogenases
• Mammal dehydrogenases not acting on alcohols
• Archaeal and Bacterial globins that do not contain heme
• …
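Boolean operators map directly onto set operations, which is also what the Euler-Venn diagrams depict. The sketch below uses a hypothetical keyword index (entry identifiers E1 to E5 are invented) to express the first two example queries.

```python
# Hypothetical keyword index: keyword -> set of entry identifiers.
INDEX = {
    "horse": {"E1", "E4"},
    "liver": {"E1", "E2"},
    "alcohol": {"E1", "E2", "E3"},
    "dehydrogenase": {"E1", "E2", "E3", "E5"},
    "mammal": {"E1", "E2", "E5"},
}

# AND (&) is set intersection:
# horse AND liver AND alcohol AND dehydrogenase
horse_liver_adh = (INDEX["horse"] & INDEX["liver"]
                   & INDEX["alcohol"] & INDEX["dehydrogenase"])

# NOT (!) is set difference:
# mammal AND dehydrogenase NOT alcohol
mammal_non_alcohol_dh = (INDEX["mammal"] & INDEX["dehydrogenase"]) - INDEX["alcohol"]

# OR (|) is set union:
# horse OR liver
horse_or_liver = INDEX["horse"] | INDEX["liver"]

print(horse_liver_adh, mammal_non_alcohol_dh, horse_or_liver)
```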
Complex queries
By a particular syntax or by browser options
Integrating different databases
Different DBs for the same type of information require collaboration
[Figure: NUCLEOTIDE SEQUENCES databases in the USA, Japan and Europe]
Since 1988, the institutions in charge of the three databases have been linked in the International Collaboration of DNA Sequence Databases
The formats of their entries have common features that allow the sequences to be shared
Researchers send their results to one of the databases
That database sends the entry to the other databases
The first database retains exclusive rights to modify or update the entry
Distributed databases
From local to global attitude
Data appears to be in one location but is most definitely not
Two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A, B, C)
An intricate network for combining and sharing information
Different DBs for the same type can be integrated in a unified DB
• Swiss-Prot (Swiss Institute of Bioinformatics): protein sequences, manually annotated
• TrEMBL (EBI): protein sequences, automatically annotated
• PIR (Georgetown University): protein sequences, manually annotated
Distributed annotations
Many different databases, distributed in many locations, annotate different aspects of the same protein (or gene or structure..)
Is it possible to help the researcher collect the information in a simple way?
The solution of many private companies: Data warehouse
Periodically, one imports data from the databases and stores it (locally) in the data warehouse.
Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data).
Disadvantage: expensive, intensive, needs to be updated.
Advantage: easy control of integrated data-mining pipeline.
Integration (and cross linking) among databases
Quite easy among the databases curated by the same institution:
• NCBI: GenBank, PubMed, SNPs, OMIM, …
• EBI: EMBL nucleotide, UniProt, TrEMBL, ArrayExpress, …
Quite easy also among the most important databases
• UniProt, PDB, GO, Pfam
Internal cross-links in the entries
Common query interfaces
• Entrez, SRS
Common query interface: ENTREZ
ENTREZ crosslinks
An open collaborative solution: Distributed Annotation
Different institutions independently annotate sequences and locally store the results in a database
The information is shared by means of a common Accession Code
The user accesses the information through a single server that queries all the connected servers (selected by the user)
Pitfalls: the results can contain heterogeneous and sometimes contradictory information (e.g. different predictions for the same feature made with different, disagreeing methods)
Distributed Annotation Systems
The Distributed Annotation System (DAS) defines a communication protocol used to exchange biological sequence annotations.
Data distribution, performed by DAS servers, is separated from visualization, which is done by DAS clients.
DAS is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up sequence annotation information from multiple distant web sites, collate the information, and display it to the user in a single view.
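The client-side idea (one client, many annotation servers, one merged view) can be sketched without any networking. Everything here is an assumption made for illustration: the two "servers" are mocked as local functions, the accession `X00001` and its features are invented, and the actual DAS protocol messages are not reproduced.

```python
# Hypothetical annotation servers, mocked as functions so the sketch runs
# offline; a real DAS client would send requests over the network instead.
def domain_server(accession):
    data = {"X00001": [{"start": 5, "end": 60, "type": "domain"}]}
    return data.get(accession, [])

def prediction_server(accession):
    data = {"X00001": [{"start": 1, "end": 20, "type": "signal peptide"},
                       {"start": 5, "end": 60, "type": "disordered region"}]}
    return data.get(accession, [])

def collect_annotations(accession, servers):
    """Ask every selected server and merge the answers into a single view."""
    merged = []
    for name, fetch in servers.items():
        for feature in fetch(accession):
            merged.append({**feature, "source": name})  # remember the origin
    return sorted(merged, key=lambda f: (f["start"], f["end"], f["source"]))

view = collect_annotations("X00001", {"domains": domain_server,
                                      "predictions": prediction_server})
for feature in view:
    print(feature["start"], feature["end"], feature["type"], feature["source"])
```

Keeping the `source` of each feature matters because, as noted among the pitfalls, different servers may annotate the same region in contradictory ways.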
BioSapiens is a Network of Excellence, funded by the European Union's 6th Framework Programme, and made up of bioinformatics researchers from 25 institutions based in 14 countries throughout Europe. The objective of BioSapiens is to provide a large-scale, concerted effort to annotate genome data by laboratories distributed around Europe, using both informatics tools and input from experimentalists.
BioSapiens DAS portal
Some numbers
Biological databases
• Maybe more than 1000 different DBs
• From less than 100 Kb to more than 100 Gb
  • DNA sequences: > 100 Gb
  • Protein sequences: 1.5 Gb
  • Protein structures: 5 Gb
• Update: daily (GenBank), weekly (PDB), …, NEVER
Some Biological Databases
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, …
Nucleic Acids Research: Database Issue
Published in January each year
Reports a database collection, classified in 14 categories:
• Nucleotide Sequence Databases
• RNA Sequence Databases
• Protein Sequence Databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle Databases
• Plant Databases
• Immunological Databases
858 in 2006, 968 in 2007
http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D3/DC1
Bateman’s suggestions for a good DB
(Editor of the NAR DB issue)
When thinking of a name for your database do check if anyone else is using that name already. Calling your new database PDB is almost certainly going to cause confusion. It is also worth checking search engines with your database name, you may be surprised at what it means in other languages.
Do make your data as comprehensive as possible. Try to avoid making the data collection overspecialized. For example, a database of promoters for RNA genes in a single organism is not going to have a wide appeal, but a database of promoters for RNA genes in all organisms would be of wide interest and utility.
Do attribute the original sources of derived data.
Do make sure that you are not breaching any license terms by redistributing data.
Do include estimates of confidence in the data items if applicable.
Do make data available for bulk download as flat files or relational database tables with associated documentation.
Web services and DAS are becoming popular ways to make databases programmatically available. Making these available can stop your website being ground to a halt by users trying to screen scrape all your data.
Do allow users to provide feedback on your data and submit new data. Do respond to user feedback in a timely manner.
The main six database categories
• Sequences
  • proteins;
  • nucleic acids;
• Mapping
  • genomes;
  • chromosomes;
  • …
• 3D structures
• Expression
• Function/Interaction
• Literature, Ontologies
Example sequence entry:
>sp|P56478|IL7_RAT
MFHVSFRYIFGIPPLILVLLPVTSSDCHIKDKDGKAFGSVLMISINQLDKMTGTDSDCPNNEPNFFKKHLCDDTKEAAFLNRAARKLRQFLKMNISEEFNDHLLRVSDGTQTLVNCTSKEEKTIKEQKKNDPCFLKRLLREIKTCWNKILKGSI
[Figure: panels labelled SEQUENCES, MAPPING, 3D, EXPRESSION, FUNCTION/INTERACTIONS, LITERATURE, ONTOLOGIES]
Slide: D. Raimondo
A short word on problems
Even today we face some key limitations
• There is no standard format
  • Every database or program has its own format
• There is no standard nomenclature
  • Every database has its own names
• Data is not fully optimized
  • Some datasets have missing information without any indication of it
• Data errors
  • Data is sometimes of poor quality, erroneous, misspelled
  • Error propagation resulting from computer annotation
How we got the sequence
Sanger chain termination method
Primary DNA sources
• Trace file repositories
• Single read: 500-1000 bp (~golf ball size / jigsaw puzzle)
• Variable quality
WashU-Merck Human EST Project / Trace files
• “Base-calling” is non-trivial
• G, C or nothing?
Assembly is non-trivial!
Institutions
NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/
EBI (European Bioinformatics Institute)
http://www.ebi.ac.uk/
European Bioinformatics Institute
www.ebi.ac.uk
Branch of the EMBL. Started in 1980 in Heidelberg. Since 1995, at the Wellcome Trust Genome Campus
Mission (from the EBI website)
• To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress
• To contribute to the advancement of biology through basic investigator-driven research in bioinformatics
• To provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators
• To help disseminate cutting-edge technologies to industry
Databases at EBI
Many different DBs are curated by EBI. Among others:
• EMBL Nucleotide Database: Europe’s primary collection of nucleotide sequences, maintained in collaboration with GenBank (USA) and DDBJ (Japan)
• UniProt Knowledgebase: a complete annotated protein sequence database
• Macromolecular Structure Database: European project for the management and distribution of data on macromolecular structures
• ArrayExpress: for gene expression data
• Ensembl: provides up-to-date completed metazoan genomes and the best possible automatic annotation
• IntAct: provides a freely available, open source database system and analysis tools for protein interaction data
Databases at EBI
Many different retrieval systems are available. Among them:
• BioMart: a simple and robust data integration system for large-scale data querying, providing researchers with fast and flexible access to biological databases.
• Integr8: the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
• Query ArrayExpress: search the ArrayExpress microarray database.
• SRS: the Sequence Retrieval System can be used to browse the various biological sequence and literature databases the EBI has available.
• UniProt DAS: the distributed annotation system (DAS) is a client-server system in which a single client integrates information from multiple servers.
SRS: Sequence Retrieval System
More than 100 DBs browsable
SRS: Sequence Retrieval System
This interface allows you to:
• perform simple and complex queries across one or several databases;
• view your results in different formats;
• create your own views for your results;
• save results to file;
• launch analysis tools on results;
• link results to different databases.
May 17th - Introduction to Biological Databases
National Center for Biotechnology Information
Created by the National Institutes of Health in 1988.
Mission (from the NCBI website)
• conducts research on fundamental biomedical problems at the molecular level using mathematical and computational methods
• maintains collaborations with several NIH institutes, academia, industry, and other governmental agencies
• supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
• engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
• develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
• develops and promotes standards for databases, data deposition and exchange, and biological nomenclature
Databases at NCBI
ENTREZ: Sequence Retrieval System
[Figure (1993): Entrez links among Literature (MEDLINE abstracts), Nucleotide sequences and Protein sequences. Link labels: term frequency statistics; literature citations in sequence databases; nucleotide sequence similarity; amino acid sequence similarity; coding region features]
The challenge of the information space (Feb 11th 2006):
• Nucleotide records: 68,739,698
• Protein sequences: 7,861,530
• 3D structures in PDB: 43,421
• BIND interactions: 202,695
• KEGG pathways: 35,211
• Human UniGene clusters: 66,488
• Completed genome projects: 320
• Different taxonomy nodes: 296,377
• dbSNP records: 26,430,220
• RefSeq genomic records: 536,571
• RefSeq RNA records: 625,928
• RefSeq protein records: 2,273,764
• PubMed records: 16,082,339
• Free PubMed records: 1,212,220
• OMIM records: 16,521
From Fig 1 of “Entrez search and retrieval system”, Jim Ostell, Chapter 14, the NCBI Handbook.
2003
2004
http://www.ncbi.nih.gov/Database/datamodel/
PubMed
MEDLINE is a bibliographic DB
MEDLINE + ENTREZ = PubMed
http://www.ncbi.nlm.nih.gov/pubmed
http://www.pubmed.gov
PubMed
What is PubMed?
• a literature database specialised in life sciences
• a literature search system
• PubMed is developed and maintained by the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM), Bethesda, USA
• covers several fields such as
  • medicine;
  • dentistry;
  • veterinary sciences;
  • clinical sciences;
  • biological sciences;
  • …
PubMed
includes 16 million citations from 1865 to NOW
more than 4,500 journals are referenced
82,028,000 queries in March 2006 (163,000 in January 1997)
PMID- 16381842
OWN - NLM
STAT- MEDLINE
DA  - 20051229
DCOM- 20060228
PUBM- Print
IS  - 1362-4962 (Electronic)
VI  - 34
IP  - Database issue
DP  - 2006 Jan 1
TI  - The Universal Protein Resource (UniProt): an expanding universe of
      protein information.
PG  - D187-91
AB  - The Universal Protein Resource (UniProt) provides a central
      resource on protein sequences and functional annotation with three
      database…
AD  - Department of Biochemistry and Molecular Biology, Georgetown
      University Medical Center, 3900 Reservoir Road, NW, Washington,
      DC 20057-1414, USA.
FAU - Wu, Cathy H
AU  - Wu CH
FAU - Apweiler, Rolf
AU  - Apweiler R
FAU - Bairoch, Amos
AU  - Bairoch A
FAU - Natale, Darren A
AU  - Natale DA
FAU - Barker, Winona C
AU  - Barker WC
FAU - Boeckmann, Brigitte
AU  - Boeckmann B
FAU - Ferro, Serenella
AU  - Ferro S
FAU - Gasteiger, Elisabeth
AU  - Gasteiger E
FAU - Huang, Hongzhan
AU  - Huang H
FAU - Lopez, Rodrigo
AU  - Lopez R
FAU - Magrane, Michele
AU  - Magrane M
FAU - Martin, Maria J
AU  - Martin MJ
FAU - Mazumder, Raja
AU  - Mazumder R
FAU - O'Donovan, Claire
AU  - O'Donovan C
FAU - Redaschi, Nicole
AU  - Redaschi N
FAU - Suzek, Baris
AU  - Suzek B
LA  - eng
GR  - 1 U01 HG02712-01/HG/NHGRI
GR  - 1R01HGO2273-01/HG/NHGRI
GR  - HHSN266200400061C/HS/AHCPR
PT  - Journal Article
PL  - England
TA  - Nucleic Acids Res
JT  - Nucleic acids research.
JID - 0411011
RN  - 0 (Proteins)
RN  - 0 (Proteome)
SB  - IM
MH  - *Databases, Protein
MH  - Internet
MH  - Proteins/chemistry/classification/physiology
MH  - Proteome/chemistry
MH  - Research Support, N.I.H., Extramural
MH  - Research Support, Non-U.S. Gov't
MH  - Research Support, U.S. Gov't, Non-P.H.S.
MH  - Sequence Analysis, Protein
MH  - Systems Integration
MH  - User-Computer Interface
EDAT- 2005/12/31 09:00
MHDA- 2006/03/01 09:00
AID - 34/suppl_1/D187 [pii]
AID - 10.1093/nar/gkj161 [doi]
PST - ppublish
SO  - Nucleic Acids Res. 2006 Jan 1;34(Database issue):D187-91.

Fields highlighted on the slide:
• PubMed unique identifier (PMID)
• Article identifiers (AID)
• Publication date (DP or PDAT)
• Added to PubMed (EDAT)
• Title (TI)
• Abstract (AB)
• Journal title (TA and JT)
• Affiliation (AD)
• Authors (AU and FAU)
• MeSH terms (MH)
• Citation (SO)
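A tagged record like this one is itself a flat file, and can be read field by field. The following is a minimal sketch, assuming the layout shown in the record (a tag of two to four capitals, a hyphen, then the value, with continuation lines indented); the sample text is a shortened excerpt of that record and `parse_medline` is a name chosen for this example.

```python
import re
from collections import defaultdict

MEDLINE_TEXT = """\
PMID- 16381842
DP  - 2006 Jan 1
TI  - The Universal Protein Resource (UniProt): an expanding universe of
      protein information.
FAU - Wu, Cathy H
AU  - Wu CH
FAU - Apweiler, Rolf
AU  - Apweiler R
MH  - *Databases, Protein
MH  - Internet
"""

TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")

def parse_medline(text):
    """Collect the values of each tag; repeated tags give several values."""
    record = defaultdict(list)
    tag = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag = match.group(1)
            record[tag].append(match.group(2).strip())
        elif tag and line.strip():          # indented continuation line
            record[tag][-1] += " " + line.strip()
    return dict(record)

record = parse_medline(MEDLINE_TEXT)
print(record["PMID"], record["AU"], record["MH"])
print(record["TI"][0])
```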
• The MeSH database• MeSH (Medical Subject Headings) is a controlled vocabulary thesaurus used for indexing PubMed articles. An article, which deals with “Down syndrome” will be indexed with the corresponding MeSH term. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.
All MeSH categories
Diseases
Nervous System Diseases
Congenital, Hereditary, and Neonatal Diseases and Abnormalities
Neurologic Manifestations
Mental Retardation
Neurobehavioral Manifestations
DOWN SYNDROME
Abnormalities
Abnormalities, Multiple
Chromosome Disorders
Genetic Disorders, Inborn
Chromosome Disorders
Functional Categorization
Gene Ontology (GO)
Hierarchical
Controlled vocabulary
Tools such as AmiGO (http://www.genedb.org/amigo/perl/go.cgi) allow browsing the GO hierarchy and retrieving the genes annotated with a given GO term
Functional Categorization
Gene Ontology (GO) http://www.geneontology.org/
Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase
Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
Hierarchic structure
Hierarchic structure
The association between a gene and a GO term is labelled with an evidence code

Evidence Codes
IC: Inferred by Curator
IDA: Inferred from Direct Assay
IEA: Inferred from Electronic Annotation
IEP: Inferred from Expression Pattern
IGC: Inferred from Genomic Context
IGI: Inferred from Genetic Interaction
IMP: Inferred from Mutant Phenotype
IPI: Inferred from Physical Interaction
ISS: Inferred from Sequence or Structural Similarity
NAS: Non-traceable Author Statement
ND: No biological Data available
RCA: Inferred from Reviewed Computational Analysis
TAS: Traceable Author Statement
NR: Not Recorded
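Because every gene-to-term association carries an evidence code, annotations can be filtered by how they were obtained. The sketch below is only an illustration: the gene names are hypothetical, the terms are the examples used on the GO slides, and treating IDA, IEP, IGI, IMP, and IPI as the "experimental" group is an assumption made for this example.

```python
# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IC": "Inferred by Curator",
    "IDA": "Inferred from Direct Assay",
    "IEA": "Inferred from Electronic Annotation",
    "IEP": "Inferred from Expression Pattern",
    "IGC": "Inferred from Genomic Context",
    "IGI": "Inferred from Genetic Interaction",
    "IMP": "Inferred from Mutant Phenotype",
    "IPI": "Inferred from Physical Interaction",
    "ISS": "Inferred from Sequence or Structural Similarity",
    "NAS": "Non-traceable Author Statement",
    "ND": "No biological Data available",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NR": "Not Recorded",
}

# Assumption for this example: codes backed by a laboratory experiment.
EXPERIMENTAL = {"IDA", "IEP", "IGI", "IMP", "IPI"}


def filter_annotations(annotations, allowed=EXPERIMENTAL):
    """Keep (gene, term, code) tuples whose evidence code is in `allowed`."""
    kept = []
    for gene, term, code in annotations:
        if code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {code}")
        if code in allowed:
            kept.append((gene, term, code))
    return kept


# Hypothetical annotations; gene names are placeholders.
annotations = [
    ("geneA", "nucleus", "IDA"),
    ("geneB", "mitosis", "IEA"),
    ("geneC", "DNA helicase", "IMP"),
    ("geneD", "telomere", "ISS"),
]

kept = filter_annotations(annotations)
for gene, term, code in kept:
    print(gene, term, code, "-", EVIDENCE_CODES[code])
```

Passing a different `allowed` set, for example one that only excludes IEA, gives a less strict filter.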
• The main six database categories
• Sequences
  • proteins;
  • nucleic acids;
• Mapping
  • genomes;
  • chromosomes;
  • …
• 3D structures
• Expression
• Function/Interaction
• Literature, Ontologies

Figure panels: SEQUENCES, MAPPING, 3D, EXPRESSION, FUNCTION/INTERACTIONS, LITERATURE, ONTOLOGIES

Example sequence entry (SEQUENCES panel):
>sp|P56478|IL7_RAT
MFHVSFRYIFGIPPLILVLLPVTSSDCHIKDKDGKAFGSVLMISINQLDKMTGTDSDCPNNEPNFFKKHLCDDTKEAAFLNRAARKLRQFLKMNISEEFNDHLLRVSDGTQTLVNCTSKEEKTIKEQKKNDPCFLKRLLREIKTCWNKILKGSI

Slide: D. Raimondo
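The sequence entry on this slide is in FASTA format, with a header that packs the database, accession, and entry name between "|" separators. The following is a minimal sketch of how such a record could be split into its parts; the function name `parse_fasta_record` is hypothetical, and the code assumes that the header ends at IL7_RAT and the sequence starts at the next residue.

```python
def parse_fasta_record(text):
    """Split one FASTA record with a 'db|accession|entry_name' style header."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header, sequence = lines[0], "".join(lines[1:])
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    database, accession, entry_name = header[1:].split("|")[:3]
    return {
        "database": database,                 # "sp" marks a Swiss-Prot entry
        "accession": accession,
        "entry_name": entry_name.split()[0],  # drop any trailing description
        "sequence": sequence,
    }


record_text = """\
>sp|P56478|IL7_RAT
MFHVSFRYIFGIPPLILVLLPVTSSDCHIKDKDGKAFGSVLMISINQLDKMTGTDSDCPNNEPNFFKKHLCDDTKEAAFLNRAARKLRQFLKMNISEEFNDHLLRVSDGTQTLVNCTSKEEKTIKEQKKNDPCFLKRLLREIKTCWNKILKGSI
"""

entry = parse_fasta_record(record_text)
print(entry["database"], entry["accession"], entry["entry_name"])
print(len(entry["sequence"]), "residues")
```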
Primary Sequence Databases
Primary DNA: DDBJ/EMBL/GenBank
Primary protein: GenPept/TrEMBL/UniProt
Annotated protein sequences: Swiss-Prot & PIR -> UniProt
Protein Structure: Protein Data Bank
GenBank (@NCBI)
71,802,595 loci; 75,742,041,056 bases
Sequence repositories - GenBank
GenBank / EMBL / DDBJ
Highly redundant (many versions of same gene)
Cross-updated daily
Version history is recorded
Previous sequence records can be retrieved
Contigs/HTGS (100-200 kb) finishing at different stages: Draft, Finished
Includes genomic DNA, cDNA, ESTs, translated peptides
Non-annotated protein sequences
4,300,304 sequences; 1,677,167,127 amino acids (not including SNPs, alternative splicing)
TrEMBL
Curated database: UniProt/SwissProt
SIB - Swiss Institute of Bioinformatics Protein Knowledgebase / Sequence Database
Highly curated
Experimental evidence evaluated (e.g. modifications)
All the entries checked by Amos Bairoch himself ;-)
ExPASy - Expert Protein Analysis System
Proteomics tools: links + local servers
SwissProt
265,950 sequence entries, 97,521,944 amino acids abstracted from 154,820 references
Structure databases / Protein Data Bank (PDB)
X-ray, NMR biomolecular structures
Protein Data Bank (PDB)
http://www.rcsb.org/pdb/
39,540 structures from 16,306 different sequences
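The database sizes quoted on these slides can be turned into simple per-entry ratios, which give a rough feel for average entry length and for redundancy. This is only a back-of-the-envelope sketch: the counts are copied from the slides, while the labels and the interpretation as averages are assumptions of this example.

```python
# (total, number of entries) as quoted on the slides.
stats = {
    "GenBank bases per locus": (75_742_041_056, 71_802_595),
    "TrEMBL amino acids per sequence": (1_677_167_127, 4_300_304),
    "Swiss-Prot amino acids per entry": (97_521_944, 265_950),
    "PDB structures per distinct sequence": (39_540, 16_306),
}

ratios = {label: total / count for label, (total, count) in stats.items()}

for label, value in ratios.items():
    print(f"{label}: {value:.1f}")
```

A PDB ratio above 1 reflects that the same protein sequence is often solved more than once, for example in different experiments.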
GENOMES
Sequenced genomes

             Complete   Draft/Assembly   In Progress   Total
Archaea            38                6            29      73
Bacteria          449              330           413   1,192
Eukaryota          27              138           184     349
Total             514              474           626   1,614

NCBI, May 2007
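The row and column totals of the sequenced-genomes table can be recomputed from the individual counts, which is a quick consistency check when copying figures from a database summary page. A minimal sketch, with the numbers taken from the table and the variable names chosen for this example:

```python
# Counts per kingdom: (complete, draft/assembly, in progress).
genomes = {
    "Archaea": (38, 6, 29),
    "Bacteria": (449, 330, 413),
    "Eukaryota": (27, 138, 184),
}

# Sum across the three status columns for each kingdom.
row_totals = {kingdom: sum(counts) for kingdom, counts in genomes.items()}

# Sum down each status column across the three kingdoms.
column_totals = tuple(sum(column) for column in zip(*genomes.values()))

grand_total = sum(column_totals)

print(row_totals)
print(column_totals)
print(grand_total)
```

The recomputed values agree with the totals reported on the slide (73, 1,192, and 349 per kingdom; 514, 474, and 626 per status; 1,614 overall).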
Genome Browsers - Portals to the Genomic World
UCSC - University of California, Santa Cruz (U.S.)
http://genome.ucsc.edu/
NCBI - National Center for Biotechnology Information (U.S.)
http://www.ncbi.nlm.nih.gov/Genomes/index.html
Ensembl - European Molecular Biology Laboratory (E.U.)
http://www.ensembl.org/
UCSC - Genome Browser
UCSC - Genome Browser II
NCBI
Ensembl - Genome Browser