Advanced Databases Storage and File Structure Instructor: Mr.Eyad Almassri.
![Page 1: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/1.jpg)
Structure Databases
DNA/Protein structure-function analysis and prediction
Lecture 6
Bioinformatics Section, Vrije Universiteit, Amsterdam
Some pictures were taken from http://www.umanitoba.ca/afs/plant_science/courses
![Page 2: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/2.jpg)
The dictionary definition
Main Entry: da·ta·base
Pronunciation: 'dA-t&-"bAs, 'da- also 'dä-
Origin: circa 1962
: a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
- Webster dictionary
![Page 3: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/3.jpg)
WHAT is a database?
A collection of data that needs to be:
• Structured (standardized data representation)
• Searchable
• Updated (periodically)
• Cross-referenced
Challenge: to change “meaningless” data into useful information that can be accessed and analysed in the best way possible.
![Page 4: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/4.jpg)
Organizing data into knowledge
HOW would YOU organise all biological sequences so that the biological information is optimally accessible?
You need an appropriate database management system (DBMS)
![Page 5: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/5.jpg)
DBMS
• Internal organization
  • Controls speed and flexibility
• A unity of programs that
  • Store
  • Extract
  • Modify
[Diagram: USER(S), Store / Extract / Modify, Database]
![Page 6: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/6.jpg)
DBMS organisation types
• Flat file databases (flat DBMS): simple, restrictive, table
• Hierarchical databases (hierarchical DBMS): simple, restrictive, tables
• Relational databases (RDBMS): complex, versatile, tables
• Object-oriented databases (ODBMS): complex, versatile, objects
![Page 7: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/7.jpg)
A flat file database
A collection of records, each containing several data fields.

Cell_Stock : "SK11.pEA215.3"
Species "Escherichia coli"
Plasmid "pEA215.3"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
Gridded "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pI206KS"
Species "Escherichia coli"
Plasmid "pI206KS"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
Gridded "Rack(BF7) Box(Pisum ESTs II)"

Cell_Stock : "SK11.pEA46.2"
Species "Escherichia coli"
Plasmid "pEA46.2"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
Gridded "Rack(BF7) Box(Pisum ESTs II)"

Disadvantages:
• Redundancy
• Forces a single view of the data (‘organizer’ and ‘attributes’)
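The redundancy in the Cell_Stock records can be made concrete with a short sketch. It assumes one tag/value pair per line and a blank line between records; the parser and the names `parse_records` and `redundancy` are illustrative, not part of any real flat-file DBMS, and the Gridded field is left out to keep the sample short.

```python
import re
from collections import Counter

# Three records in the tag/value layout of the Cell_Stock example
# (Gridded field omitted for brevity).
FLAT_FILE = """\
Cell_Stock : "SK11.pEA215.3"
Species "Escherichia coli"
Plasmid "pEA215.3"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"

Cell_Stock : "SK11.pI206KS"
Species "Escherichia coli"
Plasmid "pI206KS"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"

Cell_Stock : "SK11.pEA46.2"
Species "Escherichia coli"
Plasmid "pEA46.2"
Experiment "SK11"
Freezer "AG334 -80C"
Box "Pisum ESTs II"
"""

# A tag, an optional colon, then a double-quoted value.
FIELD = re.compile(r'^(\w+)\s*:?\s*"(.*)"$')


def parse_records(text):
    """Split the text into records and each record into a tag -> value dict."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            match = FIELD.match(line.strip())
            if match:
                record[match.group(1)] = match.group(2)
        records.append(record)
    return records


def redundancy(records):
    """Return the (tag, value) pairs that are stored more than once."""
    counts = Counter()
    for record in records:
        counts.update(record.items())
    return {pair: n for pair, n in counts.items() if n > 1}


records = parse_records(FLAT_FILE)
repeated = redundancy(records)
for (tag, value), n in sorted(repeated.items()):
    print(f"{tag} = {value!r} is stored {n} times")
```

Every field except the record name and the plasmid is repeated in each record, which is the redundancy a relational design removes.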
![Page 8: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/8.jpg)
Relational databases
Data is stored in multiple related tables
Data relationships across tables can be either many-to-one or many-to-many
A few rules allow the database to be viewed in many ways
Let's convert the “course details” to a relational database
![Page 9: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/9.jpg)
Our flat file database: course details (FLAT DATABASE 2)

| Name | Depart. | Course | E1 | E2 | E3 | P1 | P2 | … |
|---|---|---|---|---|---|---|---|---|
| Student 1 | Chemistry | Biology | A | B | B | A | C | ….. |
| Student 2 | Ecology | Maths | A | D | A | A | A | ….. |
| … | … | … | … | … | … | … | … | … |
| Student 2 | Ecology | Biology | A | B | A | A | A | ….. |
| Student 1 | Chemistry | English | A | A | A | A | A | ….. |
| Student 1 | Chemistry | Maths | C | C | B | A | A | ….. |
![Page 10: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/10.jpg)
Normalization 1: remove repeating records (rows)

| sID | Name | dID |
|---|---|---|
| 1 | Student1 | 1 |
| 2 | Student2 | 2 |

| cID | Course |
|---|---|
| 1 | Biology |
| 2 | Maths |
| 3 | English |

| dID | Department |
|---|---|
| 1 | Chemistry |
| 2 | Ecology |

| sID | cID | E1 | E2 | E3 | P1 | P2 | … |
|---|---|---|---|---|---|---|---|
| 1 | 1 | A | B | B | A | C | ….. |
| 2 | 2 | A | D | A | A | A | ….. |
| … | … | … | … | … | … | … | … |
| 2 | 1 | A | B | A | A | A | ….. |
| 1 | 3 | A | A | A | A | A | ….. |
| 1 | 2 | C | C | B | A | A | ….. |

Primary keys and foreign keys: each ID column is the primary key of its own table and a foreign key wherever it appears in another table.
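As a rough sketch of this first normalization step, the tables can be built in an in-memory SQLite database with Python's standard `sqlite3` module. The table names (`department`, `student`, `course`, `result`) are hypothetical; only the column names and values come from the slide. A join then rebuilds the original flat "course details" view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dID INTEGER PRIMARY KEY, Department TEXT);
CREATE TABLE student (
    sID  INTEGER PRIMARY KEY,
    Name TEXT,
    dID  INTEGER REFERENCES department(dID)   -- foreign key
);
CREATE TABLE course (cID INTEGER PRIMARY KEY, Course TEXT);
CREATE TABLE result (                          -- hypothetical table name
    sID INTEGER REFERENCES student(sID),       -- foreign key
    cID INTEGER REFERENCES course(cID),        -- foreign key
    E1 TEXT, E2 TEXT, E3 TEXT, P1 TEXT, P2 TEXT,
    PRIMARY KEY (sID, cID)
);
""")

conn.executemany("INSERT INTO department VALUES (?, ?)",
                 [(1, "Chemistry"), (2, "Ecology")])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Student1", 1), (2, "Student2", 2)])
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(1, "Biology"), (2, "Maths"), (3, "English")])
conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, 1, "A", "B", "B", "A", "C"),
    (2, 2, "A", "D", "A", "A", "A"),
    (2, 1, "A", "B", "A", "A", "A"),
    (1, 3, "A", "A", "A", "A", "A"),
    (1, 2, "C", "C", "B", "A", "A"),
])

# Rebuild the flat view by following the foreign keys.
rows = conn.execute("""
    SELECT s.Name, d.Department, c.Course, r.E1, r.E2, r.E3, r.P1, r.P2
    FROM result r
    JOIN student s    ON s.sID = r.sID
    JOIN department d ON d.dID = s.dID
    JOIN course c     ON c.cID = r.cID
    ORDER BY r.sID, r.cID
""").fetchall()

for row in rows:
    print(row)
```

Each department and course name is now stored once, yet the join still yields all five rows of the flat table.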
![Page 11: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/11.jpg)
Normalization 2: remove repeating records (columns)

| sID | Name | dID |
|---|---|---|
| 1 | Student1 | 1 |
| 2 | Student2 | 2 |

| cID | Course |
|---|---|
| 1 | Biology |
| 2 | Maths |
| 3 | English |

| gID | Grade |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |

| dID | Department |
|---|---|
| 1 | Chemistry |
| 2 | Ecology |

| wID | Project |
|---|---|
| 1 | E1 |
| 2 | E2 |
| 3 | E3 |
| 4 | P1 |
| 5 | P2 |

| sID | cID | gID | wID |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 1 | 2 | 3 |
| 1 | 1 | 1 | 4 |
| 1 | 1 | 3 | 5 |
| 2 | 1 | 1 | 1 |
| 2 | 1 | 1 | 2 |
| 2 | 1 | 2 | 3 |
| 2 | 1 | 1 | 4 |
| 2 | 1 | 1 | 5 |
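The second normalization step turns the grade columns (E1 … P2) into rows. A minimal sketch, assuming the gID and wID lookups from the slide; the function name `to_long` is made up for illustration:

```python
# Lookup tables taken from the gID/Grade and wID/Project tables on the slide.
GRADE_ID = {"A": 1, "B": 2, "C": 3}
WORK_ID = {"E1": 1, "E2": 2, "E3": 3, "P1": 4, "P2": 5}


def to_long(sid, cid, grades):
    """Turn one wide row (a grade per exam/project column) into
    one (sID, cID, gID, wID) row per exam or project."""
    return [(sid, cid, GRADE_ID[grades[work]], WORK_ID[work]) for work in WORK_ID]


# Student 1, course 1 (Biology): A B B A C in the wide table.
long_rows = to_long(1, 1, {"E1": "A", "E2": "B", "E3": "B", "P1": "A", "P2": "C"})
for row in long_rows:
    print(row)
```

With this layout, adding a new exam or project means adding a row to the wID table, not a new column to every record.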
![Page 12: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/12.jpg)
Relational Databases
What have we achieved?
• No repeating information
• Less storage space
• Better reality representation
• Easy modification/management
• Easy usage of any combination of records
Remember: the DBMS has programs to access and edit this information, so ignore the human reading limitation of the primary keys.
![Page 13: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/13.jpg)
Accessing database information
A request for data from a database is called a query
Queries can be of three forms:
• Choose from a list of parameters
• Query by example (QBE)
  • A QBE build wizard lets you choose which data to display
• Query language
![Page 14: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/14.jpg)
Query Languages
The standard is SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
• Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
• Standard interactive and programming language for getting information from and updating a database
RDBMS (SQL), ODBMS (Java, C++, OQL, etc.)
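A small sketch of "getting information from and updating a database" with SQL, run through Python's standard `sqlite3` module. The `cell_stock` table layout, the plasmid name for the SK4 stock, and the box name "Pisum ESTs III" are assumptions made for the example; the other values are borrowed from the cell stock slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cell_stock (
        name       TEXT PRIMARY KEY,
        plasmid    TEXT,
        experiment TEXT,
        box        TEXT
    )
""")
conn.executemany("INSERT INTO cell_stock VALUES (?, ?, ?, ?)", [
    ("SK11.pEA215.3", "pEA215.3", "SK11", "Pisum ESTs II"),
    ("SK11.pEA46.2", "pEA46.2", "SK11", "Pisum ESTs II"),
    ("SK4.pToc34", "pToc34", "SK4", "Pisum ESTs I"),  # plasmid name assumed
])

# Getting information: which plasmids belong to experiment SK11?
sk11_plasmids = conn.execute(
    "SELECT plasmid FROM cell_stock WHERE experiment = ? ORDER BY plasmid",
    ("SK11",),
).fetchall()

# Updating: move one stock to another (hypothetical) box.
conn.execute(
    "UPDATE cell_stock SET box = ? WHERE name = ?",
    ("Pisum ESTs III", "SK4.pToc34"),
)
new_box = conn.execute(
    "SELECT box FROM cell_stock WHERE name = ?", ("SK4.pToc34",)
).fetchone()[0]

print(sk11_plasmids)
print(new_box)
```

The `?` placeholders let the driver quote the values, which avoids assembling SQL text by hand.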
![Page 15: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/15.jpg)
Querying our biological relational database
Many views are possible …

Plasmid View

| Plasmid | Species | Cell Stock |
|---|---|---|
| pEA25 | Escherichia coli | SK10.2.pEA25 |
| pEA46.2 | Escherichia coli | SK11.pEA46.2 |
| pEA207.2 | Escherichia coli | SK11.pEA207.2 |
| pEA214.6 | Escherichia coli | MB123.pEA214.6 |
| pEA215.3 | Escherichia coli | SK11.pEA215.3 |
| pEA238.2 | Escherichia coli | MB123.3.PEA238.2 |
| pEA238.11 | Escherichia coli | MB123.3.pEA238.11 |
| pEA277.11 | Escherichia coli | SK11.pEA277.11 |
| pEA303.4 | Escherichia coli | SK11.pEA303.4 |
| pEA315.2 | Escherichia coli | MB123.3.pEA315.2 |
| peB4 | Escherichia coli | VB1.eB4 |

Experiment View

| Experiment | Cell Stock | Box | Freezer |
|---|---|---|---|
| SK4 | SK4.pPS-IAA4-5 | Pisum ESTs I | AG334 -80C |
| SK4 | SK4.pPS-IAA6 | Pisum ESTs I | AG334 -80C |
| SK4 | SK4.pTic110 | Pisum ESTs I | AG334 -80C |
| SK4 | SK4.pToc34 | Pisum ESTs I | AG334 -80C |
| SK4 | SK4.pToc86 | Pisum ESTs I | AG334 -80C |
| SK5 | SK5.pAB96.3 | Pisum ESTs I | AG334 -80C |
| SK5 | SK5.pABR17.10 | Pisum ESTs I | AG334 -80C |
| SK5 | SK5.pABR18.2 | Pisum ESTs I | AG334 -80C |
| SK5 | SK5.pI39 | Pisum ESTs I | AG334 -80C |
| SK5 | SK5.pI49KS | Pisum ESTs I | AG334 -80C |
| SK5 | SK5.pI176KS | Pisum ESTs I | AG334 -80C |
| SK5 | SK5.pI225KS | Pisum ESTs I | AG334 -80C |
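One way to obtain several views of the same data is SQL's `CREATE VIEW`. The sketch below keeps everything in one hypothetical `cell_stock` table for brevity (a fully normalized design would split it further) and defines a plasmid view and an experiment view over it; the view and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cell_stock (
    name       TEXT PRIMARY KEY,
    species    TEXT,
    plasmid    TEXT,
    experiment TEXT,
    box        TEXT,
    freezer    TEXT
);
-- Two views over the same stored rows; no data is copied.
CREATE VIEW plasmid_view AS
    SELECT plasmid, species, name AS cell_stock FROM cell_stock;
CREATE VIEW experiment_view AS
    SELECT experiment, name AS cell_stock, box, freezer FROM cell_stock;
""")
conn.executemany("INSERT INTO cell_stock VALUES (?, ?, ?, ?, ?, ?)", [
    ("SK11.pEA46.2", "Escherichia coli", "pEA46.2", "SK11",
     "Pisum ESTs II", "AG334 -80C"),
    ("SK11.pEA215.3", "Escherichia coli", "pEA215.3", "SK11",
     "Pisum ESTs II", "AG334 -80C"),
])

plasmid_rows = conn.execute(
    "SELECT * FROM plasmid_view ORDER BY plasmid").fetchall()
experiment_rows = conn.execute(
    "SELECT * FROM experiment_view ORDER BY cell_stock").fetchall()

print(plasmid_rows)
print(experiment_rows)
```

Because a view is a stored query, a change to `cell_stock` shows up in both views the next time they are read.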
![Page 16: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/16.jpg)
Distributed databases
From local to global attitude
Data appears to be in one location but is most definitely not
A definition: Two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A,B,C)
An intricate network for combining and sharing information
Administrators praise fast network technologies!!! Users praise the internet!!!
![Page 17: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/17.jpg)
Data warehouse
Periodically, one imports data from databases and stores it (locally) in the data warehouse.
Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data).
Disadvantage: expensive, intensive, needs to be updated.
Advantage: easy control of integrated data-mining pipeline.
![Page 18: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/18.jpg)
So why do biologists care?
![Page 19: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/19.jpg)
Three main reasons
• Database proliferation
  • Dozens to hundreds at the moment
• More and more scientific discoveries result from inter-database analysis and mining
• Rising complexity of required data combinations
  • E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)
![Page 20: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/20.jpg)
Biological databases
• Like any other database
  • Data organization for optimal analysis
• Data is of different types
  • Raw data (DNA, RNA, protein sequences)
  • Curated data (DNA, RNA and protein annotated sequences and structures, expression data)
![Page 21: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/21.jpg)
Raw Biological data
Nucleic Acids (DNA)
![Page 22: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/22.jpg)
Raw Biological data
Amino acid residues (proteins)
![Page 23: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/23.jpg)
Curated Biological Data
DNA, nucleotide sequences
• Gene boundaries, topology
• Gene structure
  • Introns, exons, ORFs, splicing
• Expression data
• Mass spectrometry
  • Identify unknown compounds
![Page 24: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/24.jpg)
Curated Biological Data
Proteins, residue sequences
MCTUYTCUYFSTYRCCTYFSCD
• Extended sequence information
• Secondary structure
• Hydrophobicity, motif data
• Protein-protein interaction
• Mass spectrometry (metabolomics, proteomics)
• Post-translational protein modification (PTM)
![Page 25: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/25.jpg)
Curated Biological data
3D Structures, folds
![Page 26: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/26.jpg)
Biological Databases
The NAR Database Issue: http://www.oxfordjournals.org/nar/database/c/
![Page 27: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/27.jpg)
Distributed information
Pearson’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
![Page 28: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/28.jpg)
A few biological databases
• Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
• Genome Databases: Human, Mouse, Yeast, C. elegans, FLYBASE, Parasites
• Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
• Structure Databases: PDB, MSD, FSSP, DALI
• Microarray Database: ArrayExpress
• Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives
• Alignment Databases: BAliBASE, Homstrad, FSSP
![Page 29: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/29.jpg)
Structural Databases
Protein Data Bank (PDB) http://www.rcsb.org/pdb/
Structural Classification of Proteins (SCOP)
http://scop.berkeley.edu
http://scop.mrc-lmb.cam.ac.uk/scop/
![Page 30: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/30.jpg)
PDB
• 3D macromolecular structural data
• Data originates from NMR or X-ray crystallography techniques
• Total number of structures: 48,891 (date: this morning)
• If the 3D structure of a protein is solved ... they have it
![Page 31: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/31.jpg)
PDB content
![Page 32: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/32.jpg)
PDB information
The PDB files have a standard format
Key features
Informative descriptors
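Because PDB files use fixed column positions, a record can be read by slicing each line. The sketch below reads ATOM records using the column ranges of the published PDB format (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, element 77-78). The two sample lines carry illustrative values and are not claimed to match any deposited entry; `parse_atom` is a made-up helper name.

```python
# Two ATOM lines in fixed-column PDB layout (illustrative values),
# plus a REMARK line that the parser should skip.
SAMPLE = [
    "REMARK illustrative coordinates only",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
]


def parse_atom(line):
    """Slice one ATOM line into named fields (0-based slices of 1-based columns)."""
    return {
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                  # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "element": line[76:78].strip(),     # columns 77-78
    }


atoms = [parse_atom(line) for line in SAMPLE if line.startswith("ATOM  ")]
for atom in atoms:
    print(atom["serial"], atom["name"], atom["res_name"], atom["chain"],
          atom["x"], atom["y"], atom["z"])
```

Slicing by column is preferable to splitting on whitespace, since adjacent fields in real files can touch when values are wide.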
![Page 33: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/33.jpg)
PDB-mirror on the WWW
e.g. 1AE5
![Page 34: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/34.jpg)
Example output: 1AE5
![Page 35: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/35.jpg)
Protein Structure Initiative (PSI)
Aims at determination of the 3D structure of all proteins
• Organize known protein sequences into families.
• Select family representatives as targets.
• Solve the 3D structure of targets by X-ray crystallography or NMR spectroscopy.
• Build models for other proteins by homology to solved 3D structures.
+ many structures solved; - many redundant structures (40%)
![Page 36: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/36.jpg)
SCOP
Structural Classification Of Proteins
• 3D macromolecular structural data grouped based on structural classification
• Data originates from the PDB
• Current version (v1.73): 34494 PDB entries (Feb 2008), 97178 domains
![Page 37: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/37.jpg)
SCOP levels, bottom-up

1. Family: clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.

2. Superfamily: probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

3. Fold: major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
![Page 38: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/38.jpg)
SCOP-mirror on the WWW …
![Page 39: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/39.jpg)
Enter SCOP at the top of the hierarchy
![Page 40: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/40.jpg)
Keyword search of SCOP entries
![Page 41: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/41.jpg)
CATH
Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.
Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.
The Topology level clusters structures according to their topological connections and numbers of secondary structures.
The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
![Page 42: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/42.jpg)
CATH-mirror on the WWW …
![Page 43: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/43.jpg)
DSSP
Dictionary of secondary structure of proteins
The DSSP database comprises the secondary structures of all PDB entries
DSSP is actually software that translates the PDB structural co-ordinates into secondary (standardized) structure elements
A similar example is STRIDE
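DSSP reports secondary structure as one letter per residue. As a hedged illustration of working with that output, the sketch below collapses DSSP's letter codes into three states (helix H, strand E, coil C). The mapping used here (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and an assumption of this example, not something prescribed by the lecture; the input string is invented.

```python
from collections import Counter

# Assumed reduction: helices H/G/I -> H, strands E/B -> E, all else -> C.
THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_states(dssp_string):
    """Map a per-residue DSSP string onto the three states H, E and C."""
    return "".join(THREE_STATE.get(code, "C") for code in dssp_string)


def composition(three_state):
    """Fraction of residues in each of the three states."""
    counts = Counter(three_state)
    total = len(three_state)
    return {state: counts[state] / total for state in "HEC"}


# Invented example: '-' stands for residues without an assigned element.
assigned = "-HHHHGGGTT-EEEEB-S"
reduced = reduce_states(assigned)
fractions = composition(reduced)

print(reduced)
print(fractions)
```

Other reductions exist (for example, treating G or B as coil), so the mapping should be stated whenever three-state results are compared.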
![Page 44: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/44.jpg)
WHY bother???
• Researchers create and use the data
• Use of known information for analyzing new data
• New data needs to be screened
• Structural/functional information
• Extends the knowledge and information on a higher level than DNA or protein sequences
![Page 45: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/45.jpg)
In the end ….
“Computers can figure out all kinds of problems, except the things in the world that just don't add up.” (James Magary)
We should add: for that we employ the human brain, experts and experience.
![Page 46: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/46.jpg)
Bio-databases: A short word on problems
Even today we face some key limitations:
• There is no standard format
  • Every database or program has its own format
• There is no standard nomenclature
  • Every database has its own names
• Data is not fully optimized
  • Some datasets have missing information without indications of it
• Data errors
  • Data is sometimes of poor quality, erroneous, misspelled
  • Error propagation resulting from computer annotation
![Page 47: Structure Databases](https://reader034.fdocuments.us/reader034/viewer/2022051401/568135d8550346895d9d473a/html5/thumbnails/47.jpg)
What to take home
• Databases are a collection of data
  • Need to access and maintain easily and flexibly
• Biological information is vast and sometimes very redundant
• Distributed databases bring it all together with quality controls, cross-referencing and standardization
• Computers can only create data; they do not give answers
Review-suggestion: “Integrating biological databases”, Stein, Nature 2003