Integration of Sequence and 3D structure Databases
-
Upload
shafira-franco -
Category
Documents
-
view
40 -
download
0
description
Transcript of Integration of Sequence and 3D structure Databases
EMBL-EBI
Integration of Sequence and 3D structure Databases
EMBL-BankDNA sequences
EnsEMBLHuman GenomeGene Annotation
UniprotProtein
Sequences
EMSDMacromolecularStructure Data
Array-ExpressMicroarray
Expression Data
EMBL-EBI
Integration With Uniprot
eFamily Project
Future Plans
EMBL-EBI
UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
Integration With UniProt
http://www.ebi.ac.uk/uniprot/index.html
EMBL-EBI
MSD / UniProt:
UniProt MSD
Agreedcommon
mechanismfor exchange of information.
ServicesServices
Two Different Database Systems
“One of the major benefits of using databases for data storage is for data sharing”
EMBL-EBI
Collaboration between MSD (Sameer Velankar, Phil McNeil) and UniProt (Virginie Mittard, Daniel Barrell) groups
Depends upon
Clean UniProt (UNP) cross references in the DBREF records for each chain (where possible)
Clean taxonomy ids for each PDB chain
Taxonomy for PDB Source and UniProt OS must be the same
MSD/Uniprot Collaboration
EMBL-EBI
Cleanup of the DBREF records in the PDB entries
Cleanup of the UniProt cross references in PDB entries
Cleanup of Source Information
NCBI Taxonomy IDs
Cleanup of the Reference information
Update UniProt entries
Source, Reference, Secondary structure information
Supply Additional Information
revision date, experimental method, resolution, R-factor
Residue-by-residue mapping between MSD and UniProt enables chimaeras to be handled correctly
EMBL-EBI
Sequence Schema
EMBL-EBI
Residue by Residue Mapping to UniProt
PDB CHAIN UNP SERIAL PDB_RES PDB_SEQ UNP_RES UNP_RES ANNOTATION
1HG1 A P06608 1 ALA 22 A NOT OBSERVED
1HG1 A P06608 2 ASP 23 D NOT OBSERVED
1HG1 A P06608 3 LYS 24 K NOT OBSERVED
1HG1 A P06608 4 4 LEU 25 L
1HG1 A P06608 5 5 PRO 26 P
1HG1 A P06608 6 6 ASN 27 N
1HG1 A P06608 7 7 ILE 28 I
1HG1 A P06608 8 8 VAL 29 V
1HG1 A P06608 9 9 ILE 30 I
1HG1 A P06608 10 10 LEU 31 L
1HG1 A P06608 11 11 ALA 32 A
EMBL-EBI
Display of Mappings
EMBL-EBI
EMBL-EBI
EMBL-EBI
IntEnz is the name for the Integrated relational Enzyme database and is the most up-to-date version of the Enzyme Nomenclature. The IntEnz relational database implemented and supported by the EBI is the master copy of the Enzyme Nomenclature data. MSD uses the UniProt accession code(s) mapped to each chain to link to the IntEnz EC number This done directly via the MSD and IntEnz Oracle
relational databases
Integration With IntEnz
http://www.ebi.ac.uk/intenz/index.html
EMBL-EBIeFamily
http://www.efamily.org.uk/
The eFamily project is designed to integrate the information contained
in five of the major protein databases.
EMBL-EBI
To integrate the information contained in the five major protein databases.
The member databases (CATH, SCOP, MSD, Interpro, and Pfam) contain information describing protein domains.
For SCOP, CATH and MSD the data is primarily concerned with 3D structures In InterPro and Pfam the focus is mainly on the sequences.
It is often difficult for biologists to navigate from protein sequence to protein structure and back again.
eFamily aims to provide the scientific community with a coherent and rich view of protein families that allow users seamlessly to navigate between the worlds of protein structure and protein sequence, by improved data resources and integration via grid technologies.
eFamily Core Activities
EMBL-EBI
UniProt
GO
InterPro
GO
PROSITE
SCOP
Pfam
CATH
GO
Curated
Curated
Curated
Common Domains definition HMM prediction
Curated
Mapping & curation
Mapping per residue
Mapping start – end
MSD mappingResidues/Sequence
DATA INTEGRATION
EMBL-EBI
InterPro-UniProt(s)
UniProt-PDBCHAIN(S)
CATH/SCOP DOMAINPDBCHAIN(S)
InterPro-CATH/SCOP
CATH/SCOP DOMAIN UniProt
Complexity of Mappings
An InterPro entry is a collection of one or more UniProt entries
Unlike PDB concept of CHAIN does not exist in UniProt
UniProt entry is always numbered from 1 to N
PDB SEQRES Residue numbering is from 1 to N
PDB CHAIN (ATOM Records) Residue numbering is not necessarily 1 to N
UniProt to PDB Mapping can be one to many
PDB CHAIN to UniProt Mapping can be one to many
EMBL-EBI
SCOP Domain
PDB Residue RangeChainsSwiss-Prot Residue Range
MSD-SCOP Mapping for 1cbw
EMBL-EBI
MSD-CATH Mapping for 1cbw
CATH DomainsPDB Residue RangeChains
Swiss-Prot Residue Range
EMBL-EBI
MSD-Pfam Mapping for 1cbw
Pfam DomainPDB Residue RangeChains
Swiss-Prot Residue Range
EMBL-EBI
Practical Applications of Database Integration
EMBL-EBI
Mappings Used in Pfam
Pfam now uses UniProt to structure mapping from MSD Search Database
Saves duplication of effort and weeks of compute
Use mapping for annotation of alignments
Pfam domains highlighted on structureof RuBisCo (8ruc)
EMBL-EBI
Mappings Used in Interpro
EMBL-EBI
Mappings Used in SCOP
EMBL-EBIComparison of SCOP, CATH and Pfam
Domains
SCOP, CATH and Pfam have developed web-services for describing their particular domain families. These services can be queried with a protein identifier, protein accession or PDB identifier.
The databases use the MSD/UniProt mapping to translate between the sequence and structure domains
EMBL-EBI
XML & Web Services
The eFamily project has developed a XML schema to describe:
Domains Annotation Sequence Alignments Structure Alignments
This will be used to provide web-services as part of the eFamily project.
More information about the XML schema is available at -http://www.efamily.org.uk/xml/efamily/documentation/efamily.shtml
We are also developing a perl based API for the eFamily XML which will be available from eFamily site as well as via bio-perl.
The MSD residue-by-residue mapping is made available in XML format based on the eFamily schema.
EMBL-EBI
Mapping Annotation
EMBL-EBI
Future Plans
EMBL-EBI
Integration of IntAct database- IntAct provides a freely available, open source database system and analysis tools for protein interaction data.
http://www.ebi.ac.uk/intact/index.jsp
EMBL-EBI
Makes use of cleaned-up cross-reference & taxonomy data, SEQRES and ATOM/HETATM records from the PDB and the sequence from the UniProt entry to align and map each residue.
Makes connected segments from the PDB ATOM/HETATM records for each chain
These are then aligned against the SEQRES records and all the alignments for the segments are merged to get the SEQRES-ATOM alignment
This enables any unobserved residues to be considered
Residue Mapping Program 1
EMBL-EBI
A similar operation is performed on the UniProt sequence and connected segments from the ATOM/HETATM records to get the UNP-ATOM alignment
The SEQRES-ATOM and UNP-ATOM alignments are then merged to get the final alignment
This is repeated for each chain in the PDB archive (with a UNP cross-reference
The mapping is loaded into the MSD relational database and validated
Residue Mapping Program 2
EMBL-EBI
Integrating data from MSD into CATH
Protocols have been developed for regular imports of a subset of MSD data warehouse into a local CATH database set up in ORACLE 9i
For example, information on the biological unit and on protein-ligand interactions will be integrated to increase functional annotations for CATH domain
families
EMBL-EBI
Two step process of data synchronisation
Data are moved from the MSD search database to the CATH-UCL site using a combination of Oracle Export/Import and SQL*Loader utilities
Subsequent updates in the MSD database are pushed to the CATH site using an incremental replication mechanism.
Data from the CATH site are pushed to the MSD site, using the same two step process
The two databases are synchronised
MSD & CATH Data Exchange
EMBL-EBI
StructureSCOPCATH
SequenceUniProt (neé Swiss-Prot
/Trembl/PIR), InterPro, Go, Pfam
FunctionIntEnz
LiteratureMedline