
25/05/2007


A Very short Introduction to Databases

Question:

What is a database?

The dictionary definition

Function: noun
Date: circa 1962

: a usually large collection of data organized especially for rapid search and retrieval (as by a computer)

- Webster dictionary

WHAT is a database?

A collection of data that needs to be:
• Structured
• Searchable

and should be:
• Updated (periodically)
• Cross referenced

WHY to use a database?

Challenge: To organise data into useful information that can be accessed and analysed the best way possible.

For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?

Question:

Which BIOLOGICAL/MEDICAL database do you know?

How could you classify them?


Type of data

Primary (experimental results)
• Genomes
• Protein sequences
• Protein structures
• Interactions
• Expression
• …

Be careful about the experiment

Secondary (derived information/classification)
• Protein folds
• Protein families
• Genome comparisons
• …

Be careful about the source of the primary data

Type of data

Complete
• Can contain many repeated entries with the same primary data (e.g. different experiments for the same protein)

Non redundant
• Repeating or similar (in a given sense) data are reduced

Be careful about the criteria
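The warning about criteria can be illustrated with a minimal sketch. The entries below are hypothetical (they are not taken from any real database), and the two criteria (identical sequence, identical protein name) are only assumptions chosen to show that the reduced set depends on the criterion.

```python
def non_redundant(entries, criterion):
    """Keep the first entry seen for each distinct value of criterion(entry)."""
    kept = []
    seen = set()
    for entry in entries:
        key = criterion(entry)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept


# Hypothetical "complete" collection: three experiments on the same protein.
complete = [
    {"id": "exp1", "protein": "HBA_HUMAN", "sequence": "MVLSPADKTN"},
    {"id": "exp2", "protein": "HBA_HUMAN", "sequence": "MVLSPADKTN"},
    {"id": "exp3", "protein": "HBA_HUMAN", "sequence": "MVLSPADKTA"},
]

# Criterion 1: two entries are redundant if their sequences are identical.
by_sequence = non_redundant(complete, lambda e: e["sequence"])
# Criterion 2: two entries are redundant if they describe the same protein.
by_protein = non_redundant(complete, lambda e: e["protein"])

print([e["id"] for e in by_sequence])
print([e["id"] for e in by_protein])
```

The same complete collection shrinks to two entries under the first criterion and to one under the second, which is why the criterion has to be known before a non-redundant set is used.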

Annotation

Raw data
• Primary data without further information except the details of the experimental procedure

Annotated
• Contain more information besides the primary experimental data (e.g. protein sequence + function, localization, expression information)
• Manually (e.g. Swiss-Prot)
• Automatically (e.g. TrEMBL)

Cross linked
• A particular form of annotation

Accessibility
• Public
• Private

Careful curation can be expensive

Usefulness
• Useful
• Not useful

Subjective issue, of course

Collections of data are NOT databases

We want to search them rapidly

This needs a structure


Data organisation

Fundamental unit
• “entry” or “record”
• Identified by a “name” and/or an “accession number”
• Associated information

Schema: structural description of the type of facts contained in the database

SCHEMA

PROTEIN
• Name/Accession Number
• Organism
• Experimental procedure
• Author
• Publication
• …
• Sequence

MODELS

A schema is organised in a model:
• Flat file
• Relational
• Hierarchic
• Object oriented

A model can be written in different formats

FLAT FILE Model (Swiss-Prot)

Identification
ID   Identification             Mandatory, 1
AC   Accession number(s)        Mandatory, 1

General
DT   Date                       Mandatory, 3 (insertion, version, current release)
DE   Description                Mandatory, 1 or more, manual
GN   Gene name                  Facultative, 1 or more, manual

Taxonomy
OS   Organism species           Mandatory, 1
OG   Organelle                  Facultative, 1 or more
OC   Organism classification    Mandatory, 1
OX   Organism NCBI_TaxID        Facultative

Reference (facultative)
RN   Reference number
RP   Reference position
RC   Reference comment
RX   Reference cross-reference
RA   Reference authors
RT   Reference title
RL   Reference location


Comment
CC   Comments or notes          Facultative, 1 or more, manual
DR   Database cross-references  Facultative, 1 or more, manual
KW   Keywords                   Mandatory, 1 or more, manual
FT   Feature table data         Facultative, 1 or more, manual

Sequence
SQ   Sequence header            Mandatory, 1
     Amino acid sequence        Mandatory, 1
//   Termination line           Mandatory, 1

More than a FLAT FILE: ASN.1 Format

More than a FLAT FILE: XML Format

[Figure labels: FASTA, MMDB, UniProt, EMBL, ASN.1, Graphical, GenPept, GenBank, XML, BIND]


FLAT FILE: FASTA Format

>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin subunit alpha
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA
VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK
YR
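As an illustration of how simple the FASTA layout is to read, here is a minimal parsing sketch. The function name `parse_fasta` is hypothetical, and taking the accession from the fourth pipe-separated field is an assumption that only holds for headers laid out like the one on this slide.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


HBA_HUMAN = """\
>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin subunit alpha
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA
VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK
YR
"""

records = list(parse_fasta(HBA_HUMAN.splitlines()))
header, sequence = records[0]
# Assumption: the accession is the fourth "|"-separated field of this header style.
accession = header.split("|")[3]
print(accession, len(sequence))
```

The format carries almost no annotation: everything besides the sequence has to be squeezed into the single header line.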

FLAT FILE for exams

SN:Peter SS:Smith F:Math. E:Analysis TN:Michael TS:Red M:30
SN:Peter SS:Smith F:Math. E:Physics TN:Robert TS:White M:28
SN:Mary SS:Baker F:Math. E:Analysis TN:Michael TS:Red M:30
SN:Mary SS:Baker F:Math E:Physics TN:Robert TS:White M:30
SN:John SS:Jones F:Phys E:Analysis TN:Michael TS:Red M:30
SN:Louise SS:Smith F:Phys E:Phys TN:Carol TS:Blue M:29

SN = Student’s Name    SS = Student’s Surname
F = Faculty            E = Exam
TN = Teacher’s Name    TS = Teacher’s Surname
M = Mark

Pitfalls of the FLAT FILE model

Each operation on the database (extraction, update, deletion) requires reading the whole file

It is difficult to ensure consistency:
• different entries may carry the same code
• entries may be duplicated
• the same information must be updated completely in every record that repeats it

Redundancy
• E.g. taxonomy repeated for all the entries from the same organism
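These pitfalls can be seen directly on the exam flat file. The sketch below is only an illustration: the helper names `parse_record` and `find` are hypothetical, and the parsing assumes that no field value contains a space.

```python
FLAT_FILE = """\
SN:Peter SS:Smith F:Math. E:Analysis TN:Michael TS:Red M:30
SN:Peter SS:Smith F:Math. E:Physics TN:Robert TS:White M:28
SN:Mary SS:Baker F:Math. E:Analysis TN:Michael TS:Red M:30
SN:Mary SS:Baker F:Math E:Physics TN:Robert TS:White M:30
SN:John SS:Jones F:Phys E:Analysis TN:Michael TS:Red M:30
SN:Louise SS:Smith F:Phys E:Phys TN:Carol TS:Blue M:29
"""


def parse_record(line):
    """Turn 'SN:Peter SS:Smith ...' into {'SN': 'Peter', 'SS': 'Smith', ...}."""
    return dict(field.split(":", 1) for field in line.split())


def find(records, **criteria):
    """Linear scan: every query has to look at every record in the file."""
    return [r for r in records if all(r[k] == v for k, v in criteria.items())]


records = [parse_record(line) for line in FLAT_FILE.splitlines() if line.strip()]

# Consistency problem: the same student appears with two spellings of the faculty.
mary_faculties = {r["F"] for r in find(records, SN="Mary", SS="Baker")}

# Redundancy: the teacher's name is stored once per exam record.
red_copies = len(find(records, TN="Michael", TS="Red"))

print(mary_faculties, red_copies)
```

Renaming one teacher would require rewriting every record that repeats the name, and nothing in the file prevents one copy from being missed.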

Relational databases

• Data is stored in tables (relations)
• Each row (tuple) of the table represents an entry
• Each column of the table represents an attribute

Data relationships across tables can be either many-to-one or many-to-many

A few rules allow the database to be viewed in many ways

Relational DB of students and exams

Student’s Name  Student’s Surname  Student’s ID  Faculty  Exam      Teacher’s Name  Teacher’s Surname  Teacher’s ID  Mark
Peter           Smith              12            Math.    Analysis  Michael         Red                23            30
Peter           Smith              12            Math.    Physics   Robert          White              21            28
Mary            Baker              13            Math.    Analysis  Michael         Red                23            30
Mary            Baker              13            Math     Physics   Robert          White              21            30
John            Jones              45            Physics  Analysis  Michael         Red                23            30
Louise          Smith              24            Physics  Physics   Carol           Blue               54            29

The column headings are the attributes.

Eliminating redundancy

Student’s ID  Exam      Teacher’s ID  Mark
12            Analysis  23            30
12            Physics   21            28
13            Analysis  23            30
13            Physics   21            30
45            Analysis  23            30
24            Physics   54            29

Student’s ID  Student’s Name  Student’s Surname  Faculty
12            Peter           Smith              Math
13            Mary            Baker              Math
24            Louise          Smith              Physics
45            John            Jones              Physics

Teacher’s ID  Teacher’s Name  Teacher’s Surname
21            Robert          White
23            Michael         Red
54            Carol           Blue


Eliminating redundancy

Student’s ID  ExamC      Mark
12            AnalysisM  30
12            PhysicsM   28
13            AnalysisM  30
13            PhysicsM   30
45            AnalysisP  30
24            PhysicsP   29

Student’s ID  Student’s Name  Student’s Surname  Faculty
12            Peter           Smith              Math
13            Mary            Baker              Math
24            Louise          Smith              Physics
45            John            Jones              Physics

Teacher’s ID  Teacher’s Name  Teacher’s Surname
21            Robert          White
23            Michael         Red
54            Carol           Blue

ExamC      Exam      Faculty  Teacher’s ID
AnalysisM  Analysis  Math     23
PhysicsM   Physics   Math     21
AnalysisP  Analysis  Physics  23
PhysicsP   Physics   Physics  54

Relational Databases

What have we achieved?
• No repeating information
• Less storage space
• Better reality representation
• Easy modification/management
• Easy usage of any combination of records
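A minimal sketch of the four normalised tables, using Python's built-in `sqlite3` module with an in-memory database. The table and column names are assumptions made for this example, and the new surname "Redd" is a made-up value used only to show that a change is made in a single place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, surname TEXT, faculty TEXT);
CREATE TABLE teachers (id INTEGER PRIMARY KEY, name TEXT, surname TEXT);
CREATE TABLE exams (examc TEXT PRIMARY KEY, exam TEXT, faculty TEXT,
                    teacher_id INTEGER REFERENCES teachers(id));
CREATE TABLE marks (student_id INTEGER REFERENCES students(id),
                    examc TEXT REFERENCES exams(examc), mark INTEGER);
""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", [
    (12, "Peter", "Smith", "Math"), (13, "Mary", "Baker", "Math"),
    (24, "Louise", "Smith", "Physics"), (45, "John", "Jones", "Physics"),
])
conn.executemany("INSERT INTO teachers VALUES (?, ?, ?)", [
    (21, "Robert", "White"), (23, "Michael", "Red"), (54, "Carol", "Blue"),
])
conn.executemany("INSERT INTO exams VALUES (?, ?, ?, ?)", [
    ("AnalysisM", "Analysis", "Math", 23), ("PhysicsM", "Physics", "Math", 21),
    ("AnalysisP", "Analysis", "Physics", 23), ("PhysicsP", "Physics", "Physics", 54),
])
conn.executemany("INSERT INTO marks VALUES (?, ?, ?)", [
    (12, "AnalysisM", 30), (12, "PhysicsM", 28), (13, "AnalysisM", 30),
    (13, "PhysicsM", 30), (45, "AnalysisP", 30), (24, "PhysicsP", 29),
])

# The original wide table can be rebuilt on demand by joining the four tables.
rows = conn.execute("""
    SELECT s.name, s.surname, e.exam, t.surname, m.mark
    FROM marks m
    JOIN students s ON s.id = m.student_id
    JOIN exams e ON e.examc = m.examc
    JOIN teachers t ON t.id = e.teacher_id
    ORDER BY s.id, e.exam
""").fetchall()

# A teacher's surname is stored once, so one UPDATE changes every derived row.
conn.execute("UPDATE teachers SET surname = ? WHERE id = ?", ("Redd", 23))
renamed = conn.execute("""
    SELECT COUNT(*) FROM marks m
    JOIN exams e ON e.examc = m.examc
    JOIN teachers t ON t.id = e.teacher_id
    WHERE t.surname = 'Redd'
""").fetchone()[0]

print(rows[0], renamed)
```

Nothing is lost by splitting the data: the join returns the same six rows as the flat file, while each fact is stored only once.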

Relational Operations

Object Oriented Model

DBMS

Internal organization: controls speed and flexibility

A set of programs that
• Store
• Extract
• Modify

[Diagram: USER(S), DBMS (Store, Extract, Modify), Database]


DBMS

• Has to manage large numbers of records
• Has to control concurrent access to the database
• Has to ensure the persistence of data after usage
• Has to manage the access permissions to data and privacy-related issues
• Must be as efficient as possible (in terms of time and memory requirements)
• Must control backup and recovery events

Server and Browser

SERVER: mediates operations and communications

BROWSER: transmits requests, renders answers

[Diagram: USER(S), Browser, Server (Store, Extract, Modify), Database]

Accessing database information

A request for data from a database is called a query

Queries are usually performed with a query language that, for example, performs the basic relational operations (SELECT, PROJECT, JOIN)
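The three basic relational operations can be sketched in a few lines of plain Python, with a relation represented as a list of dictionaries. This is an illustration of the ideas only, not how a DBMS implements them; the function names and the small tables are assumptions based on the students and exams example.

```python
def select(relation, predicate):
    """SELECT: keep the rows (tuples) that satisfy a condition."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """PROJECT: keep only some columns (attributes), dropping duplicate rows."""
    result = []
    for row in relation:
        reduced = {a: row[a] for a in attributes}
        if reduced not in result:
            result.append(reduced)
    return result


def join(left, right, attribute):
    """JOIN: combine rows of two relations that agree on a shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]


students = [
    {"student_id": 12, "name": "Peter", "surname": "Smith"},
    {"student_id": 13, "name": "Mary", "surname": "Baker"},
]
marks = [
    {"student_id": 12, "exam": "Analysis", "mark": 30},
    {"student_id": 12, "exam": "Physics", "mark": 28},
    {"student_id": 13, "exam": "Analysis", "mark": 30},
]

# "Which students obtained a mark of 30?"
top = select(join(marks, students, "student_id"), lambda r: r["mark"] == 30)
names = project(top, ["name", "surname"])
print(names)
```

A query language such as SQL lets the user state the same question declaratively and leaves the order of the operations to the DBMS.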

Query Languages

• The standard: SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
• Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
• Standard interactive and programming language for getting information from and updating a database.

Query by Accession code or Keyword

Query by Accession code is the easiest BUT usually the accession code is not known

Query by Keyword (on all the attributes) can retrieve big amounts of entries

(for example a search for “E.Coli” on PubMed would report also the papers of Elisabetta Coli)

Restrict the search by keyword on the suitable attribute

Restricted searches

• By particular syntax: E.coli [auth], E.coli[ti]
• By selecting appropriate options on the browser
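The effect of restricting a keyword to one attribute can be shown with a toy search. Everything here is hypothetical: the three records, the field names `auth` and `ti` (borrowed from the tags above), and the matching rule (all query words must occur in the field), which is an assumption and not a description of how PubMed actually matches terms.

```python
import re


def tokens(text):
    """Lower-case words of a string, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(records, query, field=None):
    """Return the ids of records whose field(s) contain every word of the query."""
    wanted = tokens(query)
    fields = [field] if field else ["auth", "ti"]
    return [rec["pmid"] for rec in records
            if any(wanted <= tokens(rec[f]) for f in fields)]


# Hypothetical bibliographic records.
records = [
    {"pmid": 1, "auth": "Coli E", "ti": "A study of protein folding"},
    {"pmid": 2, "auth": "Smith P", "ti": "Gene regulation in E.coli"},
    {"pmid": 3, "auth": "Rossi M", "ti": "Yeast genetics"},
]

everywhere = search(records, "E.coli")          # keyword on all attributes
in_title = search(records, "E.coli", "ti")      # like E.coli[ti]
as_author = search(records, "E.coli", "auth")   # like E.coli[auth]
print(everywhere, in_title, as_author)
```

The unrestricted query returns both the paper about the bacterium and the paper written by the author named Coli; the restricted queries separate the two.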


Boolean operators to refine the query

AND (&)

OR (|)

NOT (!)

Complex queries

Euler-Venn diagrams can help

Examples

Write an expression for finding:
• Horse liver alcohol dehydrogenases
• Mammal dehydrogenases not acting on alcohols
• Archaeal and Bacterial globins that do not contain heme
• …
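One possible way to work through these exercises is to treat each keyword as the set of entries it matches, so that AND, OR and NOT become set intersection, union and difference (the regions of an Euler-Venn diagram). The keyword index below is entirely hypothetical; the entry numbers stand for database records.

```python
# Hypothetical keyword index: keyword -> set of matching entry ids.
index = {
    "horse": {1, 2},
    "liver": {1, 3},
    "alcohol": {1, 4},
    "dehydrogenase": {1, 2, 4, 5},
    "mammal": {1, 2, 3, 5},
    "archaea": {6},
    "bacteria": {7, 8},
    "globin": {6, 7, 8, 9},
    "heme": {7, 9},
}

# horse AND liver AND alcohol AND dehydrogenase
q1 = index["horse"] & index["liver"] & index["alcohol"] & index["dehydrogenase"]

# (mammal AND dehydrogenase) NOT alcohol
q2 = (index["mammal"] & index["dehydrogenase"]) - index["alcohol"]

# ((archaea OR bacteria) AND globin) NOT heme
# Parentheses are explicit because Python's set operators have their own precedence.
q3 = ((index["archaea"] | index["bacteria"]) & index["globin"]) - index["heme"]

print(q1, q2, q3)
```

Note that "Archaeal and Bacterial" in the exercise translates to OR, not AND: an entry belongs to one organism group or the other, so the intersection of the two groups would be empty.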

Complex queries

By particular syntax or by browser options

Integrating different databases

Different DBs for the same type of information require collaboration

NUCLEOTIDE SEQUENCES: USA, JAPAN, EUROPE


Since 1988, the institutions in charge of the three databases are linked in the International Collaboration of DNA Sequence Databases

The formats of their entries have common features that allow the sharing of the sequences

Researchers send their results to one of the databases

That database sends the entry to the other databases

The first database maintains exclusive rights for the modification or the update of the entry

Distributed databases

From local to global attitude

Data appears to be in one location but is most definitely not

Two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A, B, C)

An intricate network for combining and sharing information

Different DBs for the same type can be integrated in a unified DB

• Swiss-Prot (Swiss Institute of Bioinformatics): protein sequences, manually annotated
• TrEMBL (EBI): protein sequences, automatically annotated
• PIR (Georgetown University): protein sequences, manually annotated

Distributed annotations

Many different databases, distributed in many locations, annotate different aspects of the same protein (or gene or structure..)

Is it possible to help researchers collect the information in a simple way?

The solution of many private companies: Data warehouse

Periodically, one imports data from databases and stores it (locally) in the data warehouse.

Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data).

Disadvantage: expensive, intensive, needs to be updated.

Advantage: easy control of integrated data-mining pipeline.

Integration (and cross linking) among databases

Quite easy among the databases curated by the same institution:
• NCBI: GenBank, PubMed, SNPs, OMIM, …
• EBI: EMBL nucleotide, UniProt, TrEMBL, ArrayExpress, …

Quite easy also among the most important databases
• UniProt, PDB, GO, Pfam

Internal cross-links in the entries

Common query interfaces
• Entrez, SRS


Common query interface: ENTREZ

ENTREZ crosslinks

An open collaborative solution: Distributed Annotation

Different institutions independently annotate sequences and locally store the results in a database

The information is shared by means of a common Accession Code

The user accesses the information using a single server that queries all the connected servers (selected by the user)

Pitfalls: the result can contain heterogeneous and sometimes contradictory information (e.g. different predictions for the same feature performed with different, non-agreeing methods)

Distributed Annotation Systems

The Distributed Annotation System (DAS) defines a communication protocol used to exchange biological sequence annotations.

Data distribution, performed by DAS servers, is separated from visualization, which is done by DAS clients.

DAS is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up sequence annotation information from multiple distant web sites, collate the information, and display it to the user in a single view.
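The client-side idea (one client, several annotation servers, one collated view) can be sketched as follows. This is not the DAS protocol itself: the "servers" are plain in-memory dictionaries, and the server names, the accession code `X00001` and the features are all hypothetical.

```python
def collect_annotations(accession, servers):
    """Ask every server for features of one accession and merge them into one view."""
    view = []
    for server_name, database in servers.items():
        for feature in database.get(accession, []):
            # Keep track of where each annotation came from.
            view.append({"source": server_name, **feature})
    return sorted(view, key=lambda f: (f["start"], f["source"]))


# Hypothetical annotation servers, each storing its own results locally,
# all indexed by the same (shared) accession code.
servers = {
    "server_A": {"X00001": [{"start": 10, "end": 30, "type": "helix"}]},
    "server_B": {"X00001": [{"start": 12, "end": 25, "type": "helix"}]},
    "server_C": {"X00001": [{"start": 40, "end": 41, "type": "binding site"}],
                 "X00002": [{"start": 1, "end": 5, "type": "signal"}]},
}

view = collect_annotations("X00001", servers)
for feature in view:
    print(feature["source"], feature["type"], feature["start"], feature["end"])
```

In this made-up example two servers report the same helix with different boundaries; the client shows both and leaves the judgement to the user, which is the kind of contradictory information such a system has to live with.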

BioSapiens is a Network of Excellence, funded by the European Union's 6th Framework Programme, and made up of bioinformatics researchers from 25 institutions based in 14 countries throughout Europe. The objective of BioSapiens is to provide a large scale, concerted effort to annotate genome data by laboratories distributed around Europe, using both informatics tools and input from experimentalists.

BioSapiens DAS portal


Biological databases: some numbers

• Maybe more than 1000 different DBs
• Size: from less than 100 Kb to more than 100 Gb
  • DNA sequences: > 100 Gb
  • Protein sequences: 1.5 Gb
  • Protein structures: 5 Gb
• Update: daily (GenBank), weekly (PDB), …, NEVER

Some Biological Databases

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, Genbank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, …

Nucleic Acids Research: Database Issue

Published in January each year. Reports a database collection, classified in 14 categories:

• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle Databases
• Plant Databases
• Immunological Databases

858 in 2006, 968 in 2007

http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D3/DC1


Bateman’s suggestion for a good DB
(Editor of the NAR DB issue)

When thinking of a name for your database do check if anyone else is using that name already. Calling your new database PDB is almost certainly going to cause confusion. It is also worth checking search engines with your database name, you may be surprised at what it means in other languages.

Do make your data as comprehensive as possible. Try to avoid making the data collection overspecialized. For example, a database of promoters for RNA genes in a single organism is not going to have a wide appeal, but a database of promoters for RNA genes in all organisms would be of wide interest and utility.

Do attribute the original sources of derived data.

Do make sure that you are not breaching any license terms by redistributing data.

Do include estimates of confidence in the data items if applicable.

Do make data available for bulk download as flat files or relational database tables with associated documentation.

Bateman’s suggestion for a good DB

Web services and DAS are becoming popular ways to make databases programmatically available. Making these available can stop your website being ground to a halt by users trying to screen scrape all your data.

Do allow users to provide feedback on your data and submit new data. Do respond to user feedback in a timely manner.

The main six database categories

• Sequences
  • proteins;
  • nucleic acids;
• Mapping
  • genomes;
  • chromosomes;
  • …
• 3D structures
• Expression
• Function/Interaction
• Literature, Ontologies

[Figure panels: SEQUENCES, 3D, MAPPING, EXPRESSION, FUNCTION/INTERACTIONS, LITERATURE, ONTOLOGIES]

Slide: D. Raimondo


A short word on problems

Even today we face some key limitations

• There is no standard format
  • Every database or program has its own format
• There is no standard nomenclature
  • Every database has its own names
• Data is not fully optimized
  • Some datasets have missing information without indications of it
• Data errors
  • Data is sometimes of poor quality, erroneous, misspelled
  • Error propagation resulting from computer annotation


How we got the sequence

Sanger chain termination method

Primary DNA sources
• Trace file repositories
• Single read: 500-1000 bp (~golf ball size / jigsaw puzzle)
• Variable quality

WashU-Merck Human EST Project / Trace files

“Base-calling” non-trivial

G, C or nothing?

Assembly is Non-trivial!

Institutions

NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/

EBI (European Bioinformatics Institute)
http://www.ebi.ac.uk/

European Bioinformatics Institute
www.ebi.ac.uk

Branch of the EMBL. Started in 1980 in Heidelberg. Since 1995, at the Wellcome Trust Genome Campus

Mission (from the EBI website)
• To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress
• To contribute to the advancement of biology through basic investigator-driven research in bioinformatics
• To provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators
• To help disseminate cutting-edge technologies to industry


Databases at EBI

Many different DB are curated by EBI. Among the others:

• EMBL Nucleotide Database - Europe’s primary collection of nucleotide sequences is maintained in collaboration with Genbank (USA) and DDBJ (Japan)
• UniProt Knowledgebase - a complete annotated protein sequence database
• Macromolecular Structure Database - European Project for the management and distribution of data on macromolecular structures
• ArrayExpress - for gene expression data
• Ensembl - Providing up to date completed metazoic genomes and the best possible automatic annotation.
• IntAct - Provides a freely available, open source database system and analysis tools for protein interaction data.

Databases at EBI

Many different retrieval systems are available. Among them:

• BioMart: a simple and robust data integration system for large scale data querying, providing researchers with fast and flexible access to biological databases.
• Integr8: the Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes.
• Query ArrayExpress: search the ArrayExpress microarray database.
• SRS: the Sequence Retrieval System can be used to browse the various biological sequence and literature databases the EBI has available.
• UniProt DAS: the distributed annotation system (DAS) is a client-server system in which a single client integrates information from multiple servers.

SRS: Sequence Retrieval System
More than 100 DB browsable


SRS: Sequence Retrieval System

This interface allows you to:
• perform simple and complex queries across one or several databases;
• view your results in different formats;
• create your own views for your results;
• save results to file;
• launch analysis tools on results;
• link results to different databases.

May 17th - Introduction to Biological Databases

National Center for Biotechnology Information

Created by the National Institutes of Health in 1988.

Mission (from the NCBI website)
• conducts research on fundamental biomedical problems at the molecular level using mathematical and computational methods
• maintains collaborations with several NIH institutes, academia, industry, and other governmental agencies
• supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
• engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
• develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
• develops and promotes standards for databases, data deposition and exchange, and biological nomenclature

Databases at NCBI

ENTREZ: Sequence Retrieval System

[Figure (1993). Nodes: Literature (MEDLINE abstracts), Nucleotide sequences, Protein sequences. Links: term frequency statistics, literature citations in sequence databases, nucleotide sequence similarity, amino acid sequence similarity, coding region features]

The challenge of the information space:

Nucleotide records             68,739,698
Protein sequences               7,861,530
3D structures in PDB               43,421
BIND Interactions                 202,695
KEGG pathways                      35,211
Human Unigene Cluster              66,488
Completed Genome projects             320
Different taxonomy Nodes          296,377
dbSNP records                  26,430,220
RefSeq Genomic records            536,571
RefSeq RNA Records                625,928
RefSeq Protein Records          2,273,764
PubMed records                 16,082,339
Free PubMed records             1,212,220
OMIM records                       16,521

Feb 11th 2006

From Fig. 1 of “Entrez search and retrieval system”, Jim Ostell, Chapter 14, the NCBI Handbook.

2003


2004

http://www.ncbi.nih.gov/Database/datamodel/


PubMed

MEDLINE is a Bibliographic DB
MEDLINE + ENTREZ = PubMed

http://www.ncbi.nlm.nih.gov/pubmed

http://www.pubmed.gov

PubMed

What is PubMed?
• a literature database specialised in life sciences
• a literature search system
• PubMed is developed and maintained by the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM) – Bethesda - USA
• covers several fields such as
  • medicine;
  • dentistry;
  • veterinary sciences;
  • clinical sciences;
  • biological sciences;
  • …

PubMed

includes 16 million citations from 1865 to NOW

more than 4,500 journals are referenced

82,028,000 queries in March 2006 (163,000 in January 1997)


PMID- 16381842
OWN - NLM
STAT- MEDLINE
DA  - 20051229
DCOM- 20060228
PUBM- Print
IS  - 1362-4962 (Electronic)
VI  - 34
IP  - Database issue
DP  - 2006 Jan 1
TI  - The Universal Protein Resource (UniProt): an expanding universe of
      protein information.
PG  - D187-91
AB  - The Universal Protein Resource (UniProt) provides a central
      resource on protein sequences and functional annotation with three
      database…
AD  - Department of Biochemistry and Molecular Biology, Georgetown
      University Medical Center, 3900 Reservoir Road, NW, Washington,
      DC 20057-1414, USA.
FAU - Wu, Cathy H
AU  - Wu CH
FAU - Apweiler, Rolf
AU  - Apweiler R
FAU - Bairoch, Amos
AU  - Bairoch A
FAU - Natale, Darren A
AU  - Natale DA
FAU - Barker, Winona C
AU  - Barker WC
FAU - Boeckmann, Brigitte
AU  - Boeckmann B
FAU - Ferro, Serenella
AU  - Ferro S
FAU - Gasteiger, Elisabeth
AU  - Gasteiger E
FAU - Huang, Hongzhan
AU  - Huang H
FAU - Lopez, Rodrigo
AU  - Lopez R
FAU - Magrane, Michele
AU  - Magrane M
FAU - Martin, Maria J
AU  - Martin MJ
FAU - Mazumder, Raja
AU  - Mazumder R
FAU - O'Donovan, Claire
AU  - O'Donovan C
FAU - Redaschi, Nicole
AU  - Redaschi N
FAU - Suzek, Baris
AU  - Suzek B
LA  - eng
GR  - 1 U01 HG02712-01/HG/NHGRI
GR  - 1R01HGO2273-01/HG/NHGRI
GR  - HHSN266200400061C/HS/AHCPR
PT  - Journal Article
PL  - England
TA  - Nucleic Acids Res
JT  - Nucleic acids research.
JID - 0411011
RN  - 0 (Proteins)
RN  - 0 (Proteome)
SB  - IM
MH  - *Databases, Protein
MH  - Internet
MH  - Proteins/chemistry/classification/physiology
MH  - Proteome/chemistry
MH  - Research Support, N.I.H., Extramural
MH  - Research Support, Non-U.S. Gov't
MH  - Research Support, U.S. Gov't, Non-P.H.S.
MH  - Sequence Analysis, Protein
MH  - Systems Integration
MH  - User-Computer Interface
EDAT- 2005/12/31 09:00
MHDA- 2006/03/01 09:00
AID - 34/suppl_1/D187 [pii]
AID - 10.1093/nar/gkj161 [doi]
PST - ppublish
SO  - Nucleic Acids Res. 2006 Jan 1;34(Database issue):D187-91.

Fields highlighted on the slide:
• PubMed unique identifier (PMID)
• Article identifiers (AID)
• Publication date (DP or PDAT)
• Added to PubMed (EDAT)
• Title (TI)
• Abstract (AB)
• Journal title (TA and JT)
• Affiliation (AD)
• Authors (AU and FAU)
• MeSH terms (MH)
• Citation (SO)
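A record in this tagged layout is another flat file, and it can be read with a few lines of code. The sketch below is a minimal reader under two assumptions: that each field starts with a tag of up to four characters followed by "- ", and that continuation lines are indented. The function name `parse_medline` is hypothetical, and the sample is an abridged copy of the record shown on the slide.

```python
SAMPLE = """\
PMID- 16381842
DP  - 2006 Jan 1
TI  - The Universal Protein Resource (UniProt): an expanding universe of
      protein information.
PG  - D187-91
FAU - Wu, Cathy H
AU  - Wu CH
FAU - Apweiler, Rolf
AU  - Apweiler R
MH  - *Databases, Protein
MH  - Internet
SO  - Nucleic Acids Res. 2006 Jan 1;34(Database issue):D187-91.
"""


def parse_medline(text):
    """Return {tag: [values]}; repeated tags (AU, MH, ...) accumulate in a list."""
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:4].strip() and line[4:6] == "- ":
            # New field: the tag occupies the first four columns.
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:
            # Indented continuation line: append to the last value of the current tag.
            record[tag][-1] += " " + line.strip()
    return record


record = parse_medline(SAMPLE)
print(record["PMID"][0], record["AU"], record["TI"][0])
```

Because every attribute is labelled, a search can be restricted to one of them (for instance authors only), which is what the field qualifiers in a PubMed query rely on.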

The MeSH database

MeSH (Medical Subject Headings) is a controlled vocabulary thesaurus used for indexing PubMed articles. An article which deals with “Down syndrome” will be indexed with the corresponding MeSH term. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.

[Figure: position of DOWN SYNDROME in the MeSH tree. Root: All MeSH categories, then Diseases. Nodes shown: Nervous System Diseases; Neurologic Manifestations; Neurobehavioral Manifestations; Mental Retardation; Congenital, Hereditary, and Neonatal Diseases and Abnormalities; Abnormalities; Abnormalities, Multiple; Genetic Disorders, Inborn; Chromosome Disorders]

Functional Categorization

Gene Ontology (GO)
• Hierarchical
• Controlled vocabulary

Tools such as AmiGO (http://www.genedb.org/amigo/perl/go.cgi) allow the user to browse the GO hierarchy and to retrieve genes annotated with a given code

Functional Categorization

Gene Ontology (GO) http://www.geneontology.org/

Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase

Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions

Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex

Hierarchic structure



Evidence Codes

The association between a gene and a GO term is labelled with an evidence code:

IC: Inferred by Curator
IDA: Inferred from Direct Assay
IEA: Inferred from Electronic Annotation
IEP: Inferred from Expression Pattern
IGC: Inferred from Genomic Context
IGI: Inferred from Genetic Interaction
IMP: Inferred from Mutant Phenotype
IPI: Inferred from Physical Interaction
ISS: Inferred from Sequence or Structural Similarity
NAS: Non-traceable Author Statement
ND: No biological Data available
RCA: Inferred from Reviewed Computational Analysis
TAS: Traceable Author Statement
NR: Not Recorded
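Evidence codes make it possible to filter annotations by how they were obtained. The sketch below keeps the code table from the slide and drops annotations with selected codes; the annotation triples and the default choice of excluded codes (IEA, ND, NR) are assumptions made for illustration, not a recommended standard.

```python
# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IC": "Inferred by Curator",
    "IDA": "Inferred from Direct Assay",
    "IEA": "Inferred from Electronic Annotation",
    "IEP": "Inferred from Expression Pattern",
    "IGC": "Inferred from Genomic Context",
    "IGI": "Inferred from Genetic Interaction",
    "IMP": "Inferred from Mutant Phenotype",
    "IPI": "Inferred from Physical Interaction",
    "ISS": "Inferred from Sequence or Structural Similarity",
    "NAS": "Non-traceable Author Statement",
    "ND": "No biological Data available",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NR": "Not Recorded",
}

# Hypothetical (gene, GO term, evidence code) triples.
ANNOTATIONS = [
    ("geneA", "mitosis", "IDA"),
    ("geneB", "mitosis", "IEA"),
    ("geneC", "nucleus", "TAS"),
    ("geneD", "nucleus", "ND"),
]


def filter_annotations(annotations, excluded=frozenset({"IEA", "ND", "NR"})):
    """Drop annotations whose evidence code is in `excluded`."""
    unknown = {code for _, _, code in annotations} - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    return [a for a in annotations if a[2] not in excluded]


for gene, term, code in filter_annotations(ANNOTATIONS):
    print(gene, term, code, "-", EVIDENCE_CODES[code])
```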

The main six database categories

• Sequences
  • proteins
  • nucleic acids
• Mapping
  • genomes
  • chromosomes
  • …
• 3D structures
• Expression
• Function/Interaction
• Literature, Ontologies

Example sequence record (FASTA):

>sp|P56478|IL7_RAT
MFHVSFRYIFGIPPLILVLLPVTSSDCHIKDKDGKAFGSVLMISINQLDKMTGTDSDCPNNEPNFFKKHLCDDTKEAAFLNRAARKLRQFLKMNISEEFNDHLLRVSDGTQTLVNCTSKEEKTIKEQKKNDPCFLKRLLREIKTCWNKILKGSI

Slide: D. Raimondo
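The sequence record on this slide is in FASTA format: a header line starting with ">" followed by the sequence. Below is a minimal sketch of reading such a record, assuming a header of the form `>db|accession|entry_name` as in the example; the function name `parse_fasta_record` is illustrative, and real files may hold many records and longer headers.

```python
# The record from the slide, with the sequence wrapped over several lines.
RECORD = """\
>sp|P56478|IL7_RAT
MFHVSFRYIFGIPPLILVLLPVTSSDCHIKDKDGKAFGSVLMISINQLDKMTGTDSDCPN
NEPNFFKKHLCDDTKEAAFLNRAARKLRQFLKMNISEEFNDHLLRVSDGTQTLVNCTSKE
EKTIKEQKKNDPCFLKRLLREIKTCWNKILKGSI
"""


def parse_fasta_record(text):
    """Parse a single FASTA record with a 'db|accession|entry_name' header."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    fields = lines[0][1:].split("|")
    if len(fields) < 3:
        raise ValueError("header does not have three '|'-separated fields")
    db, accession = fields[0], fields[1]
    # Anything after the first space would be a free-text description.
    entry_name = fields[2].split()[0]
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "sequence": "".join(lines[1:]),
    }


record = parse_fasta_record(RECORD)
print(record["accession"], record["entry_name"], len(record["sequence"]))
```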

Primary Sequence Databases

Primary DNA: DDBJ/EMBL/GenBank

Primary protein: GenPept/TrEMBL/UniProt

Annotated protein sequences: Swiss-Prot & PIR -> UniProt

Protein structure: Protein Data Bank

GenBank (@NCBI)

71,802,595 loci; 75,742,041,056 bases


Sequence repositories - GenBank

GenBank / EMBL / DDBJ
• Highly redundant (many versions of the same gene)
• Cross-updated daily
• Version history is recorded; previous sequence records can be retrieved
• Contigs/HTGS (100-200 kb) at different stages of finishing: Draft -> Finished
• Includes genomic DNA, cDNA, ESTs, translated peptides

TrEMBL

Non-annotated protein sequences

4,300,304 sequences; 1,677,167,127 amino acids (not including SNPs, alternative splicing)

Curated database: UniProt/SwissProt

SIB - Swiss Institute of Bioinformatics Protein Knowledgebase / Sequence Database

Highly curated
Experimental evidence evaluated (e.g. modifications)
All the entries checked by Amos Bairoch himself ;-)

ExPASy - Expert Protein Analysis System

Proteomics tools: links + local servers

SwissProt

265,950 sequence entries, 97,521,944 amino acids abstracted from 154,820 references


Structure databases / Protein Data Bank (PDB)

X-ray, NMR biomolecular structures
Protein Data Bank (PDB)
http://www.rcsb.org/pdb/

39,540 structures from 16,306 different sequences


GENOMES

Sequenced genomes

             Complete   Draft/Assembly   In Progress   Total
Archaea            38                6            29      73
Bacteria          449              330           413   1,192
Eukaryota          27              138           184     349
Total             514              474           626   1,614

Source: NCBI, May 2007
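The totals in the genome table can be checked from the per-kingdom counts. A small sketch, using only the numbers on the slide (the variable names are illustrative):

```python
# Columns: Complete, Draft/Assembly, In Progress (NCBI, May 2007).
COLUMNS = ("Complete", "Draft/Assembly", "In Progress")
COUNTS = {
    "Archaea": (38, 6, 29),
    "Bacteria": (449, 330, 413),
    "Eukaryota": (27, 138, 184),
}

# Row totals: all genome projects per kingdom.
row_totals = {kingdom: sum(values) for kingdom, values in COUNTS.items()}

# Column totals: all kingdoms per sequencing status.
column_totals = tuple(sum(column) for column in zip(*COUNTS.values()))

grand_total = sum(column_totals)

for kingdom, total in row_totals.items():
    print(f"{kingdom:<10} {total:>6,}")
for name, total in zip(COLUMNS, column_totals):
    print(f"{name:<15} {total:>6,}")
print(f"{'Total':<15} {grand_total:>6,}")
```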

Genome Browsers - Portals to the Genomic World

UCSC – Univ. California – Santa Cruz (U.S.)
http://genome.ucsc.edu/

NCBI – National Center for Biotechnology Information (U.S.)
http://www.ncbi.nlm.nih.gov/Genomes/index.html

EnsEmbl – European Molecular Biology Laboratory (E.U.)
http://www.ensembl.org/

UCSC – Genome Browser


NCBI

EnsEmbl – Genome Browser
