Quality of Protein Crystal Structures in the PDB. Eric N. Brown, Lokesh Gakhar and S. Ramaswamy.

Post on 01-Apr-2015


Quality of Protein Crystal Structures in the PDB

Eric N. Brown, Lokesh Gakhar and S. Ramaswamy

Between objectivity and subjectivity

Carl-Ivar Brändén & T. Alwyn Jones

Department of Molecular Biology, Uppsala Biomedical Center, PO Box 590, S-751 24 Uppsala, Sweden.

Protein crystallography is an exacting trade, and the results may contain errors that are difficult to identify. It is the crystallographer's responsibility to make sure that incorrect protein structures do not reach the literature.

Nature 343, 687 - 689 (22 February 1990)

Amplitudes and Phases - Bias.

Animal stories - by Kevin Cowtan

Amplitudes and Phases - Bias.

More animal stories.

Stolen from Bernhard Rupp's website without permission

How much of what we think?


Stolen from --- James Holton, Berkeley, without permission.

VALIDATION Based on Geometry

• WHAT IF
• PROCHECK
• MOLPROBITY
• RAMACHANDRAN PLOT

STRUCTURE VALIDATION

Validation based on fit to DATA

• R-factor/R-free
• Real-space fit, etc.

Problem: Data to parameter ratio.

ADD Geometric Restraints - or Chemical Knowledge

COMPOSITE VALIDATION:

ASTRAL - SPACI

http://astral.Berkeley.edu/spaci.html

WHY MORE?

DON’T WE HAVE ENOUGH VALIDATION TOOLS?

WHAT IS COMMON BETWEEN ALL EXISTING VALIDATION TECHNIQUES?

THERE IS AN ABSOLUTE CORRECT ANSWER

WE KNOW THERE IS NO CORRECT ANSWER

THINK DIFFERENTLY

• All crystallographers want to deposit the correct structure.

• There is subjectivity and bias, all of which is random

AVERAGE IS BEST !!

QUALITY & AVERAGE

• How different you are from the average is a measure of quality

HOW DO YOU DESCRIBE THE AVERAGE?

Quality of Model

Independent Variables

Date submitted to PDB

Maximum resolution

X-Ray Source

Number of atoms

Similarity Index

Cross Terms

Dependent Variables

R-factor

R-free

Real-space R-value

Real-space CC

Outliers

Ramachandran Violations

Predictive Models

Example: How To determine weight for 5’7” male . . .

. . . make up an equation . . .

. . . choose a group of males . . .

. . . fit the equation to their weight . . .

. . . evaluate equation.
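The four steps above can be sketched in a few lines. This is only an illustration: the heights and weights are invented numbers, the linear form weight = a × height + c is an assumed "made up" equation, and ordinary least squares is one possible way to fit it.

```python
import numpy as np

# Hypothetical group of males: heights in inches, weights in pounds.
# These values are invented for illustration only.
heights = np.array([64.0, 66.0, 67.0, 68.0, 70.0, 72.0, 74.0])
weights = np.array([140.0, 150.0, 155.0, 160.0, 172.0, 181.0, 195.0])

# "Make up an equation": weight = a * height + c (an assumed form).
design = np.column_stack([heights, np.ones_like(heights)])

# "Fit the equation to their weight" by ordinary least squares.
(a, c), *_ = np.linalg.lstsq(design, weights, rcond=None)

# "Evaluate equation" for a 5'7" (67 inch) male.
predicted = a * 67.0 + c

# Spread of the group around the fitted line (two fitted parameters).
residual_sd = np.std(weights - design @ np.array([a, c]), ddof=2)

# How far a hypothetical 170 lb individual sits from the prediction,
# in units of the group's spread.
z_score = (170.0 - predicted) / residual_sd
print(f"predicted weight: {predicted:.1f} lb, z-score of 170 lb: {z_score:.2f}")
```

The z-score is the useful quantity here: it expresses the difference from the expected value for "people like you", which is the same idea the talk applies to structures.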

Open problems

• What independent variables?
  Quality = f(resolution)
  Quality = f(resolution, date, x-ray source)

• What equation?
  Quality = a × resolution + b × date + c
  Quality = a × res + log b2(date) + c

• How to fit it to observations?
  - Least squares vs. Maximum likelihood
  - Outliers

Choose model based on log-likelihood (LL)

• Start with Metric = a × resolution + C
• Add or remove terms iteratively to improve LL
• Use BIC to decide whether a new parameter contributes a significant improvement in LL or not

RESULT: An equation that predicts a given metric…
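A minimal sketch of this kind of term selection, under stated assumptions: it uses a Gaussian log-likelihood from a least-squares fit, it only adds terms (the talk also removes them), and the data, the term names and the helper functions `gaussian_log_likelihood`, `bic` and `forward_select` are all hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(y, X):
    """Maximized Gaussian log-likelihood of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    n = len(y)
    sigma2 = np.sum((y - X @ beta) ** 2) / n  # ML estimate of the variance
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)

def bic(y, X):
    """BIC = k ln(n) - 2 LL, counting the coefficients plus the variance."""
    n, k = X.shape
    return (k + 1) * np.log(n) - 2.0 * gaussian_log_likelihood(y, X)

def forward_select(y, candidates, start):
    """Greedily add the term that lowers BIC most; stop when none does."""
    n = len(y)

    def design(names):
        return np.column_stack([np.ones(n)] + [candidates[t] for t in names])

    chosen = list(start)
    best = bic(y, design(chosen))
    while True:
        trials = {t: bic(y, design(chosen + [t]))
                  for t in candidates if t not in chosen}
        if not trials:
            break
        term = min(trials, key=trials.get)
        if trials[term] >= best:
            break  # no remaining term justifies its extra parameter
        chosen.append(term)
        best = trials[term]
    return chosen, best

# Synthetic data (invented): the metric depends on resolution and date only.
rng = np.random.default_rng(0)
n = 500
resolution = rng.uniform(1.0, 3.0, n)
date = rng.uniform(0.0, 20.0, n)  # years since an arbitrary origin
metric = 0.12 + 0.04 * resolution - 0.002 * date + rng.normal(0.0, 0.01, n)

candidates = {
    "resolution": resolution,
    "date": date,
    "resolution*date": resolution * date,
    "junk": rng.normal(0.0, 1.0, n),  # unrelated variable
}
start_bic = bic(metric, np.column_stack([np.ones(n), resolution]))
chosen, final_bic = forward_select(metric, candidates, start=["resolution"])
print(chosen, round(start_bic, 1), round(final_bic, 1))
```

BIC penalizes each added parameter by ln(n), so a term is kept only when its gain in log-likelihood outweighs that penalty.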

Data is all structures in the PDB that have all independent and dependent variables (16,609)

PICK ALL AVAILABLE METRICS (R-factor, R-free, etc.)

and FOR EACH METRIC

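$$R_{\text{factor}} = C + r_{\text{high}} + S + N + I + r_{\text{high}} \times (S + N) + N \times I$$

$$R_{\text{real-space}} = C + r_{\text{high}} + S + D + N + r_{\text{high}} \times (S + D + I + N) + D \times (I + S) + I \times N + r_{\text{high}} \times (S \times D + I \times N)$$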

EQUATIONS FOR METRICS!
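To make the structure of the R-factor equation concrete, here is a sketch that expands it into one column per term and fits one coefficient per column. Several things are assumptions, not taken from the slides: reading r_high as maximum resolution, S as X-ray source (coded here as a 0/1 indicator), N as a standardized atom count and I as the similarity index; the function name `r_factor_design`; and all numbers, which are synthetic.

```python
import numpy as np

def r_factor_design(r_high, S, N, I):
    """One column per term of C + r_high + S + N + I + r_high*(S + N) + N*I."""
    return np.column_stack([
        np.ones_like(r_high),  # C (constant)
        r_high, S, N, I,       # main effects
        r_high * S,            # cross terms r_high x (S + N)
        r_high * N,
        N * I,                 # cross term N x I
    ])

# Synthetic, invented data for illustration.
rng = np.random.default_rng(1)
n = 400
r_high = rng.uniform(1.0, 3.5, n)
S = rng.integers(0, 2, n).astype(float)  # assumed 0/1 coding of source
N = rng.normal(0.0, 1.0, n)              # assumed standardized atom count
I = rng.uniform(0.0, 1.0, n)

true_coef = np.array([0.12, 0.03, -0.01, 0.005, 0.01, 0.004, -0.002, 0.003])
X = r_factor_design(r_high, S, N, I)
r_factor = X @ true_coef + rng.normal(0.0, 0.002, n)

# Least-squares estimate of one coefficient per term.
coef, *_ = np.linalg.lstsq(X, r_factor, rcond=None)

# Difference between each observed value and the model's expected value.
deviation = r_factor - X @ coef
```

The `deviation` array is what the later slides build on: it measures how far a structure's metric lies from the value expected for structures with the same independent variables.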

INFORMATION INHERENT IN THE MODEL

The model can tell us immediately which independent variables affect which metrics (dependent variables), and by how much.

Examples: R-factor vs. time; R-factor vs. source and resolution

UNEXPLORED QUESTIONS IN THE MODEL?

Unexplored Independent Variables:

• R-sym and redundancy
• Space group and volume of unit cell?
• Refinement protocol
• Solvent modeling and B-factor modeling
• Temperature of data collection
• Complexity - as a function of number of chains of macromolecules

Nine metrics to ONE

Principal component analysis

• We took the nine metrics and combined them into one metric, accounting for correlations and redundancy. We call this single metric the Quality-value (Q-value).

• By construction, the Q-value of the average is zero. Negative numbers mean better than average; positive numbers mean worse than average. The standard deviation is one.
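One way such a combined score could be built is sketched below. The slides do not say exactly how the principal components were used, so taking only the first component is an assumption, as are the function name `q_values`, the synthetic data, and the convention that every input metric is oriented so that larger means worse (a metric such as real-space CC would need its sign flipped first).

```python
import numpy as np

def q_values(deviations):
    """Collapse per-structure deviations (rows) over several metrics
    (columns) into one score with mean 0 and standard deviation 1."""
    # Standardize each metric so no single one dominates.
    z = (deviations - deviations.mean(axis=0)) / deviations.std(axis=0)
    # Principal axes are the right singular vectors of the standardized data.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    axis = vt[0]
    # Orient the axis so worse (larger) metric values give positive scores.
    if axis.sum() < 0:
        axis = -axis
    score = z @ axis
    return (score - score.mean()) / score.std()

# Synthetic, invented data: nine correlated metrics driven by one hidden factor.
rng = np.random.default_rng(2)
n = 1000
latent = rng.normal(0.0, 1.0, n)        # hidden "badness" of each structure
loadings = np.linspace(0.5, 1.0, 9)     # how strongly each metric tracks it
deviations = latent[:, None] * loadings + rng.normal(0.0, 0.5, (n, 9))

q = q_values(deviations)
print(round(q.mean(), 6), round(q.std(), 6))
```

With this orientation a negative score means better than average and a positive score means worse, matching the convention stated above.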

USE OF THE MODEL

• COMPARE STRUCTURES WITH THE AVERAGE - INDIVIDUALLY AND AS A GROUP.

The Q-value is now independent of all the independent variables used to make the model (resolution, number of atoms, date of data collection, novelty of structure, etc.).

Better indicator of quality than any one of the dependent variables.

STRUCTURAL GENOMICS (updated - Jan 2008)

MCSG over Time!

MORE-SG groups!

Quality Vs. Journals

[Bar chart: percentage better than global average (axis 0 to 100), one bar per journal. Journals shown:]

• Immunity
• Nature immunology
• Cell biochemistry and biophysics
• Molecular and cellular biology
• Science
• Nucleic acids research
• Journal of virology
• Biochemical and biophysical research communications
• The EMBO journal
• Nature
• Journal of immunology (Baltimore - Md)
• Nature structural biology
• Journal of structural biology
• Die Pharmazie
• Chemistry (Weinheim an der Bergstrasse - Germany)
• EMBO reports
• Plant & cell physiology
• Journal of medicinal chemistry
• Bioorganic & medicinal chemistry letters
• Biochemistry. Biokhimiia
• Bioorganic chemistry
• Structure (London - England)
• The Journal of biological chemistry
• Biological chemistry
• Journal of biological inorganic chemistry : JBIC
• Inorganic chemistry
• Biophysical journal
• OTHER
• Journal of the American Chemical Society
• Molecular microbiology
• Journal of molecular biology
• Acta crystallographica. Section D - Biological
• Chembiochem : a European journal of chemical biology
• Archives of biochemistry and biophysics
• FEBS letters
• Journal of bacteriology
• Protein science : a publication of the Protein Society
• Proteins
• Protein engineering
• Biochemical pharmacology
• Journal of inorganic biochemistry
• European journal of biochemistry / FEBS

WHAT CAN WE DO?

• Beam lines.
• Best practices.
• Protocols and methodologies.
• Countries.
• Institutions.
• Funding mechanisms.
• Investigators.

Is this the best we can do?

WE CAN DO BETTER

We can improve the quality of structures through better design of experiments and refinement protocols, if we know which independent variables affect which dependent variables, and how.

BEFORE WE DO THIS - FIX PROBLEMS THAT WE FOUND.

• Too much dependence on external databases!

• Problems with unknown atoms.

• Develop methods for missing data correction.

OTHER DATABASES - NMR

Some thoughts on independent variables.

• Spectrometers
• Samples - size, tags, buffers, etc.
• Completeness of assignments - percentage of backbone assigned, etc.
• Actual data used in structural calculations - NOE distance restraints, hydrogen bond distance restraints (experimental vs. inferred), torsion angle restraints, dipolar coupling restraints, paramagnetic restraints
• Structural statistics
• Date of structure determination
• Relaxation measurements?

OTHER DATABASES - NMR

DEPENDENT VARIABLES.

• RMS deviation of ensemble
• Packing (MolProbity score?)
• Ramachandran violations
• Recall, Precision, F-measure (Huang, Powers and Montelione)
• Agreement with high-resolution X-ray structures
• Other??

AFTER Today's LECTURES

HOW ABOUT THE MODEL DATABASE?

I am sure our modeling experts can think of the dependent and independent variables….

THANK YOU

ACKNOWLEDGEMENT

X-ray work - Eric N Brown and Lokesh Gakhar

The R-statistical package!

NMR work - Liping Yu and Andrew Fowler

Thanks to Brian Fox for inviting me - though I am not a member of any SG initiative.

Questions and Accusations.