Pharm201 Lecture 4 2009 1
Data Representation
Pharm 201/Bioinformatics I
Philip E. Bourne
Department of Pharmacology, UCSD
Prerequisite Reading: Structural Bioinformatics, Chapter 10
Take Home Message
• Good data representation of complex data is not a trivial undertaking
• However it is prerequisite to effective use of those data
• History often precludes the above
You should have got a sense of the first item from the assignment
Global Considerations in Defining a Data Representation
• Scope - breadth and depth of data to be included
• Name space
• How to cast that data
• What will the definition be used for?
– Archiving, schema representation, methods ...
Global Considerations – More Specifically
• Simple query, browsing and retrieval
• Consistent data resulting from autonomous validation and verification
• Simple and consistent data exchange
• A unified view of disparate types of data
• Accommodation of new knowledge as it is discovered
• Inclusion of procedures (methods) to specify how a particular item of data is derived or verified
Given These Considerations – Where Does the PDB Format Fit In?
First we need to examine the format
The PDB Format
• A full description is at http://www.wwpdb.org/docs.html
• It was designed around an 80 column punched card!
• It was designed to be human readable
• It is used by every piece of software that deals with structural data
The PDB Format – An Example – The Header
The PDB Format - Records
• Every PDB file may be broken into a number of lines terminated by an end-of-line indicator. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator.
• Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, left-justified and blank-filled. This must be an exact match to one of the stated record names.
• The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines.
• Each record type is further divided into fields.
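The record rules above can be sketched in a few lines of Python. This is only an illustration: the function name `group_by_record_name` and the sample lines are assumptions made for this example, and the sketch pads short lines to 80 columns rather than rejecting them.

```python
from collections import defaultdict

# Hypothetical sample lines for illustration; real entries have many more records.
SAMPLE = "\n".join([
    "REMARK   4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH",
    "REMARK   4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK",
    "ATOM      1  N   VAL A  11      25.360  30.691  11.795  1.00 17.93",
]) + "\n"


def group_by_record_name(pdb_text):
    """Group the lines of a PDB-format text by the record name in columns 1-6."""
    records = defaultdict(list)
    for line in pdb_text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        line = line.ljust(80)[:80]        # every line is treated as 80 columns
        record_name = line[:6].rstrip()   # left-justified, blank-filled
        records[record_name].append(line)
    return dict(records)


records = group_by_record_name(SAMPLE)
for name, lines in records.items():
    print(name, len(lines))
```

Each key of the returned dictionary is a record type, and each value is the list of lines belonging to that type. Splitting those lines into fields still needs a separate column description for every record type.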
The PDB Format – An Example – The Atomic Coordinates
The Description – Atom Records
What is Wrong with this Approach?
• The description and the data are separate
• Parsing is a nightmare – the most complex piece of code we have in our research laboratory probably remains the PDB parser
• There are no relationships between items of data
• Some data just cannot be parsed ….
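To make the first point concrete, here is a minimal sketch of a fixed-column ATOM reader. The column ranges follow the published PDB format description. The table `ATOM_FIELDS`, the function `parse_atom_record` and the sample line are names and data made up for this illustration. Note that the meaning of each column lives entirely in the program, not in the file.

```python
# (field name, first column, last column, converter); columns are 1-based, inclusive.
ATOM_FIELDS = [
    ("record_name", 1, 6, str),
    ("serial", 7, 11, int),
    ("atom_name", 13, 16, str),
    ("alt_loc", 17, 17, str),
    ("res_name", 18, 20, str),
    ("chain_id", 22, 22, str),
    ("res_seq", 23, 26, int),
    ("x", 31, 38, float),
    ("y", 39, 46, float),
    ("z", 47, 54, float),
    ("occupancy", 55, 60, float),
    ("temp_factor", 61, 66, float),
    ("element", 77, 78, str),
]

# Hypothetical sample line assembled in pieces so the column positions are easy to check.
SAMPLE_LINE = (
    "ATOM      1  N   VAL A  11    "   # columns 1-30
    "  25.360  30.691  11.795"         # columns 31-54
    "  1.00 17.93"                     # columns 55-66
    + " " * 10                         # columns 67-76 (unused here)
    + " N  "                           # columns 77-80
)


def parse_atom_record(line):
    """Slice one ATOM/HETATM line into named fields; blank fields become None."""
    line = line.ljust(80)
    if line[:6].rstrip() not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM or HETATM record")
    atom = {}
    for name, first, last, convert in ATOM_FIELDS:
        raw = line[first - 1:last].strip()
        atom[name] = convert(raw) if raw else None
    return atom


atom = parse_atom_record(SAMPLE_LINE)
print(atom["res_name"], atom["x"], atom["y"], atom["z"])
```

If a value shifts by a single column, the slices silently return the wrong data, which is one reason parsers for this format tend to accumulate special cases.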
PDB Format - Important Components of the Data are Lost to All But Humans
REMARK 3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
REMARK 3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
REMARK 3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN
REMARK 3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
REMARK 3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
REMARK 3 ANGSTROMS.
REMARK 4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
REMARK 4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
REMARK 4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
REMARK 4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
REMARK 4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
REMARK 4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
REMARK 4 STRUCTURE (*2EBX*) WERE USED.
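As a rough illustration of why such remarks are hard for software, the sketch below tries to pull the R value out of the REMARK 3 text with a regular expression. The function names `join_remark` and `guess_r_value` and the pattern itself are assumptions made for this example, not an established method.

```python
import re

REMARK_LINES = [
    "REMARK 3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF",
    "REMARK 3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R",
    "REMARK 3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN",
    "REMARK 3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL",
    "REMARK 3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0",
    "REMARK 3 ANGSTROMS.",
]


def join_remark(lines, number):
    """Strip the 'REMARK n' prefix and join the wrapped lines into one sentence."""
    prefix = re.compile(rf"^REMARK\s+{number}\b")
    return " ".join(
        prefix.sub("", line).strip() for line in lines if prefix.match(line)
    )


def guess_r_value(text):
    """Look for the phrase 'R VALUE IS <number>'; return None if it is worded differently."""
    match = re.search(r"\bR\s+VALUE\s+IS\s+(\d*\.\d+)", text)
    return float(match.group(1)) if match else None


text = join_remark(REMARK_LINES, 3)
print(guess_r_value(text))
print(guess_r_value("THE R-FACTOR IS 0.168"))
```

The pattern only works because the phrase "R VALUE IS" happens to survive the line wrap after joining. Any depositor who words the sentence differently defeats it, which is the sense in which these data are lost to software.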
Enter mmCIF
• Prerequisite reading: http://www.sdsc.edu/pb/papers/methenz97.pdf
• Complete information: http://mmcif.pdb.org
The macromolecular Crystallographic Information File (mmCIF) – An Approach to Addressing Problems with the PDB Format
• Has the support of a major scientific society
• Is the backbone of the current PDB
• Provides a rich description of very complex data
• Predates any use of ontologies, Web developments, CORBA, XML etc.
• Still has some problems
mmCIF - Initial Motivator Circa Late 1980s
The temperature is 30 degrees
Given additional context, a human would know whether that was Centigrade or Fahrenheit. A computer would have more difficulty!
What would be the point of archiving such data if in 10 years the meaning was lost?
mmCIF – Scope of the Initial Effort
• All PDB data should be captured
• Describe a paper’s materials and methods section
• Describe the biologically active molecule
• Fully describe secondary structure but not tertiary or quaternary
• Describe details of chemistry (inc. 2D)
• Meaningful 3D views
mmCIF - Topology
mmCIF - STAR Encoding Rules
• Data are defined in data blocks
• A global declaration spans data blocks
• Data exist as name-value pairs
• A data name may appear only once in a data block
• Loop constructs are supported
mmCIF - Extract from a Data File
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
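Because every value in the loop is labelled by a data name, reading it needs no column table. The sketch below is a simplification under stated assumptions: the function `parse_loop` is a name invented here, and it only handles whitespace-separated tokens, so quoted strings containing spaces and semicolon-delimited text fields are out of scope.

```python
EXTRACT = """\
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
"""


def parse_loop(text):
    """Turn one simple loop_ construct into a list of {data name: value} rows."""
    tokens = text.split()
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected a loop_ construct")
    # Data names come first; every one starts with an underscore.
    names = []
    position = 1
    while position < len(tokens) and tokens[position].startswith("_"):
        names.append(tokens[position])
        position += 1
    values = tokens[position:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("number of values is not a multiple of the number of names")
    width = len(names)
    return [
        dict(zip(names, values[start:start + width]))
        for start in range(0, len(values), width)
    ]


rows = parse_loop(EXTRACT)
for row in rows:
    print(row["_atom_site.label_atom_id"], row["_atom_site.Cartn_x"])
```

The values come back as strings. Deciding that `_atom_site.Cartn_x` is a floating-point number in angstroms is the job of the dictionary, not of the data file.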
mmCIF - Extract from the Dictionary
save__atom_site.Cartn_x
_item_description.description
;
The x atom site coordinate in angstroms specified according to a set of orthogonal Cartesian axes related to the cell axes as specified by the description given in _atom_sites.Cartn_transform_axes.
;
_item.name '_atom_site.Cartn_x'
_item.category_id atom_site
_item.mandatory_code no
_item_aliases.alias_name '_atom_site_Cartn_x'
_item_aliases.dictionary cifdic.c94
_item_aliases.version 2.0
loop_
_item_dependent.dependent_name
'_atom_site.Cartn_y'
'_atom_site.Cartn_z'
_item_related.related_name '_atom_site.Cartn_x_esd'
_item_related.function_code associated_esd
_item_sub_category.id cartesian_coordinate
_item_type.code float
_item_type_conditions.code esd
_item_units.code angstroms
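One use of such a definition is automatic validation. The sketch below copies a few attributes of the `_atom_site.Cartn_x` definition into a Python dictionary and checks a value against them. The structure `DICTIONARY`, the table `CONVERTERS` and the function `check_item` are hypothetical names invented for this illustration. Real validation software reads the dictionary file itself and handles many more attributes.

```python
# Hand-copied subset of the dictionary entry above (an assumption of this sketch).
DICTIONARY = {
    "_atom_site.Cartn_x": {
        "category_id": "atom_site",
        "mandatory_code": "no",
        "type_code": "float",
        "units_code": "angstroms",
        "dependent_names": ["_atom_site.Cartn_y", "_atom_site.Cartn_z"],
    },
}

# Simplified mapping from dictionary type codes to Python converters.
CONVERTERS = {"float": float, "int": int, "text": str}

# "." means inapplicable and "?" means unknown in a CIF data file.
MISSING = (".", "?")


def check_item(name, raw_value, row, dictionary=DICTIONARY):
    """Return a list of problems found for one value; an empty list means it passed."""
    definition = dictionary[name]
    problems = []
    if raw_value in MISSING:
        if definition["mandatory_code"] == "yes":
            problems.append(f"{name} is mandatory but has no value")
        return problems
    try:
        CONVERTERS[definition["type_code"]](raw_value)
    except ValueError:
        problems.append(f"{name}: {raw_value!r} is not of type {definition['type_code']}")
    for dependent in definition.get("dependent_names", []):
        if row.get(dependent, "?") in MISSING:
            problems.append(f"{name} is present but {dependent} is missing")
    return problems


good_row = {"_atom_site.Cartn_y": "30.691", "_atom_site.Cartn_z": "11.795"}
print(check_item("_atom_site.Cartn_x", "25.360", good_row))
print(check_item("_atom_site.Cartn_x", "abc", good_row))
```

The design point is that the checks are driven by data in the dictionary, so a new data item needs a new definition rather than new code.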
mmCIF Dictionary Definition Language
The DDL category item_description holds a description for each data item. The key item for this category is _item_description.name, which is defined in the parent category item. The text of the item description is held by item _item_description.description.
A single description may be provided for each data item.
The DDL for the item_description category is given in the following section.
save_ITEM_DESCRIPTION
_category.description
;
This category holds the descriptions of each data item.
;
_category.id item_description
_category.mandatory_code yes
loop_
_category_key.name '_item_description.name'
'_item_description.description'
loop_
_category_group.id 'ddl_group'
'item_group'
save_
save__item_description.description
_item_description.description
;
Text description of the defined data item.
;
_item.name '_item_description.description'
_item.category_id item_description
_item.mandatory_code yes
_item_type.code text
save_
mmCIF – Topology Revisited
mmCIF - The Category Group Organization of any Macromolecular Structure
STRUCT_BIOL
STRUCT_BIOL_GEN
STRUCT_ASYM
ENTITY
ENTITY_POLY
ENTITY_POLY_SEQ
CHEM_COMP
ATOM_SITE
STRUCT_CONF
STRUCT_CONN
STRUCT_SITE_GEN
STRUCT_REF
mmCIF - Entity - Unique Chemical Component
Entity Information
Shows ATOM and UniProt Sequences
mmCIF - Defining Secondary Structure
mmCIF - Other Interactions
mmCIF – Defining the Biological Assembly
_pdbx_struct_assembly.id 1
_pdbx_struct_assembly.details author_and_software_defined_assembly
_pdbx_struct_assembly.method_details PISA
_pdbx_struct_assembly.oligomeric_details dimeric
_pdbx_struct_assembly.oligomeric_count 2
#
_pdbx_struct_assembly_gen.assembly_id 1
_pdbx_struct_assembly_gen.oper_expression 1,2
_pdbx_struct_assembly_gen.asym_id_list A,B,C,D,E,F,G,H
#
loop_
_pdbx_struct_oper_list.id
_pdbx_struct_oper_list.type
_pdbx_struct_oper_list.name
_pdbx_struct_oper_list.symmetry_operation
_pdbx_struct_oper_list.matrix[1][1]
_pdbx_struct_oper_list.matrix[1][2]
_pdbx_struct_oper_list.matrix[1][3]
_pdbx_struct_oper_list.vector[1]
_pdbx_struct_oper_list.matrix[2][1]
_pdbx_struct_oper_list.matrix[2][2]
_pdbx_struct_oper_list.matrix[2][3]
_pdbx_struct_oper_list.vector[2]
_pdbx_struct_oper_list.matrix[3][1]
_pdbx_struct_oper_list.matrix[3][2]
_pdbx_struct_oper_list.matrix[3][3]
_pdbx_struct_oper_list.vector[3]
1 'identity operation' 1_555 x,y,z 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 'crystal symmetry operation' 4_565 x,-y+1,-z 1.0 0.0 0.0 0.0 0.0 -1.0 0.0 106.344 0.0 0.0 -1.0 0.0
Asymmetric Unit (monomer)
Biological Assembly (dimer)
The pdbx_struct_assembly and pdbx_struct_assembly_gen categories describe the components and symmetry operators required to generate the biological assembly.
The pdbx_struct_oper_list category describes the symmetry operations. Each symmetry operation consists of a 3x3 rotation matrix and a translation vector.
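Applying an operation means computing x' = R x + t for every atom, where R is the 3x3 matrix and t the translation vector. The sketch below applies operation 2 from the loop above to a single point. The function `apply_operation` is a name chosen for this example, and the point is the first coordinate from the earlier atom_site extract, used purely for illustration rather than taken from entry 3C70.

```python
def apply_operation(matrix, vector, point):
    """Return matrix * point + vector for a 3x3 matrix and 3-element vectors."""
    return tuple(
        sum(matrix[i][j] * point[j] for j in range(3)) + vector[i]
        for i in range(3)
    )


# Operation 1: identity (1_555, x,y,z).
IDENTITY_MATRIX = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
IDENTITY_VECTOR = (0.0, 0.0, 0.0)

# Operation 2: crystal symmetry operation (4_565, x,-y+1,-z).
OPER2_MATRIX = ((1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0))
OPER2_VECTOR = (0.0, 106.344, 0.0)

point = (25.360, 30.691, 11.795)  # illustrative coordinate only

original = apply_operation(IDENTITY_MATRIX, IDENTITY_VECTOR, point)
partner = apply_operation(OPER2_MATRIX, OPER2_VECTOR, point)
print(original)
print(partner)  # approximately (25.360, 75.653, -11.795)
```

Applying operations 1 and 2 to every atom of the listed asym_ids gives the two halves of the dimer, which is what the oper_expression `1,2` asks for.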
http://www.pdb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/biounit_tutorial.html
Example: PDB 3C70
mmCIF – Defining non-standard Amino Acids
mmCIF - Problems
• Two sets of chain identifier and residue numbering schemes (cif and author defined)
– Example: _atom_site.label_asym_id, _atom_site.auth_asym_id
– PDB files use the author defined scheme
• Poor data typing
• Redundant or unnecessary data (e.g. amino acid info)
• Use of free text instead of controlled vocabulary
– IMAGE PLATE (RAXIS V)
– IMAGE PLATE (RAXIS-V)
– IMAGE PLATE RAXIS V
– IMAGE PLATE RAXIS-V
– RAXISIV
• Little software support due to complexity of mmCIF format
• Parsing issues
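The free-text problem can be illustrated with a small normalisation sketch. It assumes the slide lists five separate detector spellings, and the functions `normalise` and `group_spellings` are names made up for this example. Stripping punctuation and spaces is a crude heuristic, not a substitute for a controlled vocabulary.

```python
import re

# Detector descriptions as they appear on the slide (assumed to be five separate values).
SPELLINGS = [
    "IMAGE PLATE (RAXIS V)",
    "IMAGE PLATE (RAXIS-V)",
    "IMAGE PLATE RAXIS V",
    "IMAGE PLATE RAXIS-V",
    "RAXISIV",
]


def normalise(text):
    """Upper-case the text and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def group_spellings(spellings):
    """Group spellings that collapse to the same normalised key."""
    groups = {}
    for spelling in spellings:
        groups.setdefault(normalise(spelling), []).append(spelling)
    return groups


groups = group_spellings(SPELLINGS)
for key, members in groups.items():
    print(key, members)
```

Four spellings collapse to one key, while `RAXISIV` stays apart. Software cannot tell whether that last value is a different detector or a further variant, so a query over the free text will miss or miscount entries.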
Examples of File Parsing Issues
• Non-standard quoting rules for strings
• Items that require sub-parsing of expressions
loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 '(1-60)(61-88)' A,B,C
4 '(1,2,6,10,23,24)(61-88)' A,B,C
6 '(1,10,23)(61,62,69-88)' A,B,C
PAU '(P)(61-88)' A,B,C

TURN_P TURN_P1 'A'"' VAL A 31 ? ILE A 34 ? VAL A 31 ILE A 34 ?
;TYPE I'
;
?
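The oper_expression values show what sub-parsing means in practice: a single data value hides a small expression language. The sketch below expands such a value under two assumptions, namely that `a-b` denotes an inclusive range of integer identifiers and that adjacent parenthesised groups are combined as a Cartesian product. The function names are invented for this illustration.

```python
import re
from itertools import product


def expand_group(group):
    """Expand one comma-separated group such as '61,62,69-88' into a list of ids."""
    ids = []
    for part in group.split(","):
        part = part.strip()
        range_match = re.fullmatch(r"(\d+)-(\d+)", part)
        if range_match:
            first, last = int(range_match.group(1)), int(range_match.group(2))
            ids.extend(str(number) for number in range(first, last + 1))
        else:
            ids.append(part)  # a single id, possibly non-numeric such as 'P'
    return ids


def parse_oper_expression(expression):
    """Split an expression into its parenthesised groups and expand each one."""
    groups = re.findall(r"\(([^()]*)\)", expression)
    if not groups:
        groups = [expression]  # no parentheses, e.g. '1,2'
    return [expand_group(group) for group in groups]


def operator_combinations(expression):
    """List every combination of one operator id from each group."""
    return list(product(*parse_oper_expression(expression)))


print(parse_oper_expression("1,2"))
print(len(operator_combinations("(1-60)(61-88)")))
print(operator_combinations("(P)(61-88)")[:2])
```

None of this structure is visible to a generic mmCIF reader, which sees only a quoted string. Each application has to carry its own copy of a parser like this one.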
Summary
• mmCIF has provided the PDB with a robust data representation, which serves as the conceptual and physical schema upon which the current RCSB, PDBe and PDBj are built
• This work predated XML and XML-schema but embodies the important concepts inherent in these descriptions
• mmCIF was later converted exactly into XML; the XML version is now used more than mmCIF itself, but much less than the old PDB format
Take Home Message
• Good data representation of complex data is not a trivial undertaking
• However it is prerequisite to effective use of those data