Structure Databases
DNA/Protein structure-function analysis and prediction
Lecture 6
Bioinformatics Section, Vrije Universiteit, Amsterdam
The dictionary definition
  Main Entry: da·ta·base
  Pronunciation: 'dA-t&-"bAs, 'da- also 'dä-
  Function: noun
  Date: circa 1962
  : a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
  - Webster dictionary
WHAT is a database?
  A collection of data that needs to be:
    Structured
    Searchable
    Updated (periodically)
    Cross-referenced
  Challenge: To change “meaningless” data into useful information that can be accessed and analysed in the best way possible.
  For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?
  You need an appropriate database management system (DBMS)
DBMS
  Internal organization
    Controls speed and flexibility
  A set of programs that
    Store
    Extract
    Modify
  [Diagram: Database, Store / Extract / Modify, USER(S)]
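The three DBMS operations named above (store, extract, modify) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `protein`, its columns and the example values are hypothetical and do not come from the lecture.

```python
import sqlite3

def demo_store_extract_modify():
    """Run one store, one extract and one modify step on an in-memory database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, used only for this illustration.
    conn.execute(
        "CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT, length INTEGER)"
    )
    # Store: add a record.
    conn.execute(
        "INSERT INTO protein (name, length) VALUES (?, ?)", ("example_protein", 120)
    )
    # Extract: read the record back.
    before = conn.execute("SELECT name, length FROM protein").fetchall()
    # Modify: change a field of the stored record.
    conn.execute(
        "UPDATE protein SET length = ? WHERE name = ?", (125, "example_protein")
    )
    after = conn.execute("SELECT name, length FROM protein").fetchall()
    conn.close()
    return before, after
```

The user never touches the stored file directly; every step goes through the DBMS.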
DBMS organisation types
  Flat file databases (flat DBMS)
    Simple, restrictive, table
  Hierarchical databases (hierarchical DBMS)
    Simple, restrictive, tables
  Relational databases (RDBMS)
    Complex, versatile, tables
  Object-oriented databases (ODBMS)
    Complex, versatile, objects
Relational databases
  Data is stored in multiple related tables
  Data relationships across tables can be either many-to-one or many-to-many
  A few rules allow the database to be viewed in many ways
  Let's convert the “course details” to a relational database
Our flat file database
  Course details (FLAT DATABASE 2)
  Name       Depart.    Course   E1  E2  E3  P1  P2
  Student 1  Chemistry  Biology  A   B   B   A   C   …
  Student 2  Ecology    Maths    A   D   A   A   A   …
  Student 2  Ecology    Biology  A   B   A   A   A   …
  Student 1  Chemistry  English  A   A   A   A   A   …
  Student 1  Chemistry  Maths    C   C   B   A   A   …
  …
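To see why the flat file is restrictive, one can count how often the same student/department pair is stored. The sketch below copies the rows from the slide into a Python list; the function name `repeated_pairs` is made up for this illustration.

```python
from collections import Counter

# Rows as shown on the slide: name, department, course, then five grades.
FLAT_ROWS = [
    ("Student 1", "Chemistry", "Biology", "A", "B", "B", "A", "C"),
    ("Student 2", "Ecology", "Maths", "A", "D", "A", "A", "A"),
    ("Student 2", "Ecology", "Biology", "A", "B", "A", "A", "A"),
    ("Student 1", "Chemistry", "English", "A", "A", "A", "A", "A"),
    ("Student 1", "Chemistry", "Maths", "C", "C", "B", "A", "A"),
]

def repeated_pairs(rows):
    """Count how many times each (name, department) pair is stored."""
    return dict(Counter((name, dept) for name, dept, *_ in rows))
```

Every repeat is a place where the department could be misspelled or updated in one row but not another, which is what the normalization steps address.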
Normalize (1NF) …
  We remove repeating records (rows)

  sID  Name      dID
  1    Student1  1
  2    Student2  2

  cID  Course
  1    Biology
  2    Maths
  3    English

  dID  Department
  1    Chemistry
  2    Ecology

  sID  cID  E1  E2  E3  P1  P2
  1    1    A   B   B   A   C   …
  2    2    A   D   A   A   A   …
  2    1    A   B   A   A   A   …
  1    3    A   A   A   A   A   …
  1    2    C   C   B   A   A   …
  …

  [Slide labels: Primary keys, Foreign keys]
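The primary and foreign keys labelled on this slide can be declared explicitly in SQL. The following is a minimal sketch using Python's sqlite3 module; the table names (`student`, `course`, `department`, `result`) are assumptions, since the slide does not name its tables, and the choice of (sID, cID) as the key of the grade table is likewise an assumption.

```python
import sqlite3

SCHEMA = """
CREATE TABLE department (dID INTEGER PRIMARY KEY, Department TEXT);
CREATE TABLE student (
    sID INTEGER PRIMARY KEY,
    Name TEXT,
    dID INTEGER REFERENCES department(dID)   -- foreign key
);
CREATE TABLE course (cID INTEGER PRIMARY KEY, Course TEXT);
CREATE TABLE result (
    sID INTEGER REFERENCES student(sID),     -- foreign key
    cID INTEGER REFERENCES course(cID),      -- foreign key
    E1 TEXT, E2 TEXT, E3 TEXT, P1 TEXT, P2 TEXT,
    PRIMARY KEY (sID, cID)
);
"""

def build():
    """Create the tables from the slide and load a few of its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO department VALUES (?, ?)",
                     [(1, "Chemistry"), (2, "Ecology")])
    conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                     [(1, "Student1", 1), (2, "Student2", 2)])
    conn.executemany("INSERT INTO course VALUES (?, ?)",
                     [(1, "Biology"), (2, "Maths"), (3, "English")])
    conn.execute("INSERT INTO result VALUES (1, 1, 'A', 'B', 'B', 'A', 'C')")
    return conn

def rejects_unknown_student(conn):
    """Return True if a grade row for a non-existent student is refused."""
    try:
        conn.execute("INSERT INTO result VALUES (99, 1, 'A', 'A', 'A', 'A', 'A')")
    except sqlite3.IntegrityError:
        return True
    return False
```

With the keys declared, the DBMS itself refuses rows that point to a student or course that does not exist.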
Normalize (2NF) …
  We remove redundant fields (columns)

  sID  Name      dID
  1    Student1  1
  2    Student2  2

  cID  Course
  1    Biology
  2    Maths
  3    English

  gID  Grade
  1    A
  2    B
  3    C

  dID  Department
  1    Chemistry
  2    Ecology

  wID  Project
  1    E1
  2    E2
  3    E3
  4    P1
  5    P2

  sID  cID  gID  wID
  1    1    1    1
  1    1    2    2
  1    1    2    3
  1    1    1    4
  1    1    3    5
  2    1    1    1
  2    1    1    2
  2    1    2    3
  2    1    1    4
  2    1    1    5
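The table of bare IDs is unreadable for a human, but a join puts the names back. The sketch below loads the Student 1 / Biology rows from the slide into SQLite and reconstructs a readable view; the table names are assumptions, as the slide does not name them.

```python
import sqlite3

SCHEMA = """
CREATE TABLE student (sID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE course  (cID INTEGER PRIMARY KEY, Course TEXT);
CREATE TABLE grade   (gID INTEGER PRIMARY KEY, Grade TEXT);
CREATE TABLE project (wID INTEGER PRIMARY KEY, Project TEXT);
CREATE TABLE result  (sID INTEGER, cID INTEGER, gID INTEGER, wID INTEGER);
"""

def readable_results():
    """Join the ID-only result table back to names, courses, projects and grades."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO student VALUES (1, 'Student1')")
    conn.execute("INSERT INTO course VALUES (1, 'Biology')")
    conn.executemany("INSERT INTO grade VALUES (?, ?)",
                     [(1, "A"), (2, "B"), (3, "C")])
    conn.executemany("INSERT INTO project VALUES (?, ?)",
                     [(1, "E1"), (2, "E2"), (3, "E3"), (4, "P1"), (5, "P2")])
    # The five Student 1 / Biology rows from the slide (sID, cID, gID, wID).
    conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, 1), (1, 1, 2, 2), (1, 1, 2, 3),
                      (1, 1, 1, 4), (1, 1, 3, 5)])
    rows = conn.execute("""
        SELECT student.Name, course.Course, project.Project, grade.Grade
        FROM result
        JOIN student ON student.sID = result.sID
        JOIN course  ON course.cID  = result.cID
        JOIN project ON project.wID = result.wID
        JOIN grade   ON grade.gID   = result.gID
        ORDER BY result.wID
    """).fetchall()
    conn.close()
    return rows
```

The joined rows reproduce the flat-file line "Student 1, Biology, A B B A C", one grade per row.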
Relational Databases
  What have we achieved?
    No repeating information
    Less storage space
    Better reality representation
    Easy modification/management
    Easy usage of any combination of records
  Remember: the DBMS has programs to access and edit this information, so it does not matter that the primary keys are hard for a human to read
Accessing database information
  A request for data from a database is called a query
  Queries can be of three forms:
    Choose from a list of parameters
    Query by example (QBE)
    Query language
Query Languages
  The standard: SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
  Developed by IBM in 1974; introduced commercially in 1979 by Oracle Corp.
  Standard interactive and programming language for getting information from and updating a database.
  RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
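As a small illustration of a query written in SQL, the sketch below counts entries per experimental method. The table `structure`, its columns and the example rows are hypothetical placeholders, not real database content.

```python
import sqlite3

# Hypothetical example rows: (entry identifier, experimental method).
EXAMPLE_ROWS = [("entry1", "X-ray"), ("entry2", "NMR"), ("entry3", "X-ray")]

def count_by_method(rows):
    """Load rows into a temporary table and count entries per method with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE structure (entry TEXT PRIMARY KEY, method TEXT)")
    conn.executemany("INSERT INTO structure VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT method, COUNT(*) FROM structure GROUP BY method ORDER BY method"
    ).fetchall()
    conn.close()
    return result
```

The query states what is wanted (a count per method) and leaves it to the DBMS to decide how to retrieve it.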
Distributed databases
  From local to global attitude
  Data appears to be in one location but is most definitely not
  A definition: Two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent (A,B,C)
  An intricate network for combining and sharing information
  Administrators praise fast network technologies!!!
  Users praise the internet!!!
Data warehouse
  Periodically, one imports data from databases and stores it (locally) in the data warehouse.
  Now a local database can be created, containing for instance protein family data (sequence, structure, function and pathway/process data integrated with the gene expression and other experimental data).
  Disadvantage: expensive, intensive, needs to be updated.
  Advantage: easy control of integrated data-mining pipeline.
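The import step of a data warehouse can be pictured as merging records from several sources under one shared identifier. The sketch below is a deliberately simplified assumption: the two sources are plain dictionaries keyed by a protein identifier, and all names and values are hypothetical. A real warehouse would pull from remote databases on a schedule.

```python
def build_warehouse(sequence_source, structure_source):
    """Merge two sources into one local record per protein identifier."""
    warehouse = {}
    for protein_id, sequence in sequence_source.items():
        warehouse[protein_id] = {"sequence": sequence, "structure": None}
    for protein_id, structure_id in structure_source.items():
        # Proteins known only to the structure source still get a record.
        record = warehouse.setdefault(
            protein_id, {"sequence": None, "structure": None}
        )
        record["structure"] = structure_id
    return warehouse
```

Rerunning the function on fresh copies of the sources is the "needs to be updated" cost listed above; the benefit is that all later analysis reads one local, integrated structure.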
So why do biologists care?
  Three main reasons
    Database proliferation
      Dozens to hundreds at the moment
    More and more scientific discoveries result from inter-database analysis and mining
    Rising complexity of required data combinations
      E.g. translational medicine: “from bench to bedside” (genomic data vs. clinical data)
Biological databases
  Like any other database
    Data organization for optimal analysis
  Data is of different types
    Raw data (DNA, RNA, protein sequences)
    Curated data (DNA, RNA and protein annotated sequences and structures, expression data)
Raw Biological data
  Nucleic Acids (DNA)
Raw Biological data
  Amino acid residues (proteins)
Curated Biological Data
  DNA, nucleotide sequences
    Gene boundaries, topology
    Gene structure
      Introns, exons, ORFs, splicing
    Expression data
    Mass spectrometry (metabolomics, proteomics)
    Post-Translational protein Modification (PTM)
Curated Biological Data
  Proteins, residue sequences
    MCTUYTCUYFSTYRCCTYFSCD
    Extended sequence information
    Secondary structure
    Hydrophobicity, motif data
    Protein-protein interaction
Curated Biological data
  3D Structures, folds
Biological Databases
  The 2003 NAR Database Issue: http://nar.oupjournals.org/content/vol31/issue1/
Distributed information
  Pearson’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
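Pearson's Law as stated above can be written as a proportionality. The symbols are introduced here for illustration only: U is the usefulness of a column and n the number of columns it is compared to.

```latex
U(n) \propto n^{2}
\quad\Longrightarrow\quad
\frac{U(2n)}{U(n)} = \frac{(2n)^{2}}{n^{2}} = 4
```

Read this way, doubling the number of columns a dataset can be compared against makes it four times as useful, which is the argument for linking databases rather than keeping them isolated.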
A few biological databases
  Nucleotide Databases
    Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
  Genome Databases
    Human, Mouse, Yeast, C. elegans, FLYBASE, Parasites
  Protein Databases
    Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
  Structure Databases
    PDB, MSD, FSSP, DALI
  Microarray Database
    ArrayExpress
  Literature Databases
    MEDLINE, Software Biocatalog, Flybase Archives
  Alignment Databases
    BAliBASE, Homstrad, FSSP
Structural Databases
  Protein Data Bank (PDB)
    http://www.rcsb.org/pdb/
  Structural Classification of Proteins (SCOP)
    http://scop.berkeley.edu
    http://scop.mrc-lmb.cam.ac.uk/scop/
PDB
  3D Macromolecular structural data
  Data originates from NMR or X-ray crystallography techniques
  Total number of structures: 34,626 (17/01/2006)
  If the 3D structure of a protein is solved ... they have it
PDB content
PDB information
  The PDB files have a standard format
  Key features
  Informative descriptors
PDB-mirror on the WWW …
  e.g. 1AE5
Example output: 1AE5
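Because PDB files have a standard, fixed-column format, a coordinate line can be read by slicing at known column positions. The sketch below handles only ATOM/HETATM records and only a few of their fields. The example line is built by a hypothetical helper, and its coordinates are invented values for illustration; they are not taken from entry 1AE5.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a minimal ATOM line using the fixed column widths (hypothetical helper)."""
    return "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:>8.3f}{:>8.3f}{:>8.3f}".format(
        "ATOM", serial, name, res_name, chain, res_seq, x, y, z
    )

def parse_atom_line(line):
    """Parse a few fixed-width fields of one ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        return None  # not a coordinate record
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }

# Invented coordinates, for illustration only.
EXAMPLE_LINE = format_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
```

Splitting on whitespace is tempting but unreliable here, because adjacent fields may touch when values fill their columns; slicing by position avoids that.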
SCOP
  Structural Classification Of Proteins
  3D Macromolecular structural data grouped based on structural classification
  Data originates from the PDB
  Current version (v1.69)
    25973 PDB Entries (July 2005)
    70859 Domains
SCOP levels bottom-up
1. Family: Clear evolutionary relationship
   Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
2. Superfamily: Probable common evolutionary origin
   Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
3. Fold: Major structural similarity
   Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
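The 30% pairwise identity guideline for families can be made concrete with a small calculation. This sketch assumes the two sequences are already aligned (same length, "-" for gaps) and counts identity over positions where neither has a gap; other definitions of identity exist. The function names are hypothetical, and the threshold is only the rule of thumb quoted above, since SCOP also weighs structural and functional evidence.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over positions without a gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap positions
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

def suggests_same_family(aligned_a, aligned_b, threshold=30.0):
    """Apply the 30% rule of thumb; not a substitute for expert classification."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

The globin example in the text shows why this can only be a first filter: true family members may fall well below the threshold.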
SCOP-mirror on the WWW …
  Enter SCOP at the top of the hierarchy
  Keyword search of SCOP entries
CATH
  Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.
  Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.
  The Topology level clusters structures according to their topological connections and numbers of secondary structures.
  The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
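The four CATH levels form a hierarchy, so a classification can be held as four nested numbers. The sketch below assumes the dotted notation in which CATH identifiers are commonly written (for example 3.40.50.300), which is not shown on the slide; the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional

class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath_code(code: str) -> CathCode:
    """Split a dotted four-part code into its C, A, T and H levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    return CathCode(*(int(part) for part in parts))

def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name of the deepest level at which two codes still agree, or None."""
    shared = None
    for level, x, y in zip(CathCode._fields, a, b):
        if x != y:
            break
        shared = level
    return shared
```

Two domains that agree down to the topology level but differ in the last number share a fold-like arrangement without being assigned to the same homologous superfamily.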
CATH-mirror on the WWW …
DSSP
  Dictionary of secondary structure of proteins
  The DSSP database comprises the secondary structures of all PDB entries
  DSSP is actually software that translates the PDB structural co-ordinates into secondary (standardized) structure elements
  A similar example is STRIDE
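Once a program such as DSSP has assigned one secondary-structure code per residue, the codes can be grouped into elements. The sketch below does not run DSSP or read its output file format; it assumes the per-residue codes have already been extracted into a string, with H for helix, E for strand and "-" used here as a placeholder for residues without an assignment.

```python
from itertools import groupby

def segments(ss_string):
    """Group consecutive identical codes into (code, first_residue, last_residue)."""
    result = []
    position = 1  # residues are numbered from 1 in this sketch
    for code, run in groupby(ss_string):
        length = len(list(run))
        result.append((code, position, position + length - 1))
        position += length
    return result
```

For example, the string "--HHHH-EE" yields a four-residue helix at positions 3 to 6 and a two-residue strand at positions 8 to 9.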
WHY bother???
  Researchers create and use the data
  Use of known information for analyzing new data
  New data needs to be screened
  Structural/Functional information
  Extends the knowledge and information on a higher level than DNA or protein sequences
In the end …
  Computers can figure out all kinds of problems, except the things in the world that just don't add up.
  - James Magary
  We should add: For that we employ the human brain, experts and experience.
Bio-databases: A short word on problems
  Even today we face some key limitations
    There is no standard format
      Every database or program has its own format
    There is no standard nomenclature
      Every database has its own names
    Data is not fully optimized
      Some datasets have missing information without indications of it
    Data errors
      Data is sometimes of poor quality, erroneous, misspelled
      Error propagation resulting from computer annotation
What to take home
  A database is a collection of data
    Need to access and maintain easily and flexibly
  Biological information is vast and sometimes very redundant
  Distributed databases bring it all together with quality controls, cross-referencing and standardization
  Computers can only create data, they do not give answers
  Review-suggestion: “Integrating biological databases”, Stein, Nature Reviews Genetics 2003