UniProt and the PDB: Matching sequence and structure at the residue level
Paul J. Gane1 and UniProt Consortium1,2,3
1EMBL-European Bioinformatics Institute, Cambridge, UK
2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA
Email: [email protected] URL: www.uniprot.org
Funding
UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.
UniProt
ID CRYAB_HUMAN Reviewed; 175 AA.
AC P02511; B0YIX0; O43416; Q9UC37; Q9UC38; Q9UC39; Q9UC40; Q9UC41;
DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DE RecName: Full=Alpha-crystallin B chain;
DE AltName: Full=Alpha(B)-crystallin;
DE AltName: Full=Heat shock protein beta-5;
DE Short=HspB5;
DE AltName: Full=Renal carcinoma antigen NY-REN-27;
DE AltName: Full=Rosenthal fiber component;
GN Name=CRYAB; Synonyms=CRYA2;
OS Homo sapiens (Human).
OX NCBI_TaxID=9606;
DR PDB; 2KLR; NMR; -; A/B=1-175.
DR PDB; 2WJ7; X-ray; 2.63 A; A/B/C/D/E=67-157.
DR PDB; 2Y1Y; X-ray; 2.00 A; A=71-157.
DR PDB; 2Y1Z; X-ray; 2.50 A; A/B=67-157.
DR PDB; 2Y22; X-ray; 3.70 A; A/B/C/D/E/F=67-157.
DR PDB; 2YGD; EM; 9.40 A; A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/U/V/W/X=1-175.
DR PDB; 3L1G; X-ray; 3.32 A; A=68-162.
DR PDB; 3SGM; X-ray; 1.70 A; A/B/C/D=90-100.
DR PDB; 3SGN; X-ray; 2.81 A; A/B=90-100.
DR PDB; 3SGO; X-ray; 2.56 A; A=90-100.
DR PDB; 3SGP; X-ray; 1.40 A; A/B/C/D=92-100.
DR PDB; 3SGR; X-ray; 2.17 A; A/B/C/D/E/F=92-100.
DR PDB; 3SGS; X-ray; 1.70 A; A=95-100.
FT CHAIN 1 175 Alpha-crystallin B chain.
FT /FTId=PRO_0000125907.
FT METAL 104 104 Zinc.
FT METAL 111 111 Zinc.
FT METAL 119 119 Zinc.
FT SITE 48 48 Susceptible to oxidation.
FT SITE 60 60 Susceptible to oxidation.
FT SITE 68 68 Susceptible to oxidation.
FT MOD_RES 1 1 N-acetylmethionine (Probable).
FT MOD_RES 19 19 Phosphoserine.
FT MOD_RES 45 45 Phosphoserine.
FT MOD_RES 59 59 Phosphoserine.
FT MOD_RES 92 92 N6-acetyllysine; partial.
FT MOD_RES 166 166 N6-acetyllysine.
FT CARBOHYD 170 170 O-linked (GlcNAc) (By similarity).
FT VARIANT 41 41 S -> Y (in dbSNP:rs2234703).
FT /FTId=VAR_014607.
FT VARIANT 51 51 P -> L (in dbSNP:rs2234704).
FT /FTId=VAR_014608.
FT VARIANT 120 120 R -> G (in MFM2; decreased interactions
FT with wild-type CRYAA and CRYAB but
FT increased interactions with wild-type
FT CRYBB2 and CRYGC; dbSNP:rs28929489).
FT /FTId=VAR_007899.
FT CONFLICT 165 165 E -> K (in Ref. 4; AAC19161).
FT CONFLICT 175 175 K -> KKMPFLELHFLKQESFPTSE (in Ref. 4;
FT AAC19161).
SQ SEQUENCE 175 AA; 20159 MW; AE08BED46B7849CB CRC64;
MDIAIHHPWI RRPFFPFHSP SRLFDQFFGE HLLESDLFPT STSLSPFYLR PPSFLRAPSW
FDTGLSEMRL EKDRFSVNLD VKHFSPEELK VKVLGDVIEV HGKHEERQDE HGFISREFHR
KYRIPADVDP LTITSSLSSD GVLTVNGPRK QVSGPERTIP ITREEKPAVT AAPKK
HEADER CHAPERONE 22-MAY-09 2WJ7
TITLE HUMAN ALPHAB CRYSTALLIN
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ALPHA-CRYSTALLIN B CHAIN;
COMPND 3 CHAIN: A, B, C, D, E;
COMPND 4 FRAGMENT: ALPHA-CRYSTALLIN DOMAIN, RESIDUES 67-157;
COMPND 5 SYNONYM: ALPHA(B)-CRYSTALLIN, ROSENTHAL FIBER COMPONENT,
COMPND 6 HEAT SHOCK PROTEIN BETA-5, HSPB5, RENAL CARCINOMA ANTIGEN
COMPND 7 NY-REN-27, HUMAN ALPHAB;
COMPND 8 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
REMARK 999 THE SEQUENCE IS RESIDUES 67-157 PRECEDED BY A GAM TAG
DBREF 2WJ7 A 1 3 PDB 2WJ7 2WJ7 1 3
DBREF 2WJ7 A 4 94 UNP P02511 CRYAB_HUMAN 67 157
SEQRES 1 A 94 GLY ALA MET GLU MET ARG LEU GLU LYS ASP ARG PHE SER
SEQRES 2 A 94 VAL ASN LEU ASP VAL LYS HIS PHE SER PRO GLU GLU LEU
SEQRES 3 A 94 LYS VAL LYS VAL LEU GLY ASP VAL ILE GLU VAL HIS GLY
SEQRES 4 A 94 LYS HIS GLU GLU ARG GLN ASP GLU HIS GLY PHE ILE SER
SEQRES 5 A 94 ARG GLU PHE HIS ARG LYS TYR ARG ILE PRO ALA ASP VAL
SEQRES 6 A 94 ASP PRO LEU THR ILE THR SER SER LEU SER SER ASP GLY
SEQRES 7 A 94 VAL LEU THR VAL ASN GLY PRO ARG LYS GLN VAL SER GLY
SEQRES 8 A 94 PRO GLU ARG
ATOM 1 N MET A 3 23.981 -7.754 15.338 1.00 40.49 N
ATOM 2 CA MET A 3 23.218 -8.749 14.574 1.00119.83 C
ATOM 3 C MET A 3 24.149 -9.916 14.119 1.00 98.08 C
ATOM 4 O MET A 3 25.254 -10.069 14.670 1.00 80.95 O
ATOM 5 CB MET A 3 22.414 -8.092 13.403 1.00 45.41 C
ATOM 6 N GLU A 4 23.686 -10.723 13.149 1.00 63.18 N
ATOM 7 CA GLU A 4 24.418 -11.846 12.545 1.00 46.19 C
ATOM 8 C GLU A 4 25.920 -11.668 12.409 1.00 83.92 C
ATOM 9 O GLU A 4 26.425 -10.547 12.237 1.00 48.25 O
ATOM 10 CB GLU A 4 23.895 -12.164 11.147 1.00 37.69 C
ATOM 11 CG GLU A 4 24.844 -13.184 10.250 1.00 95.40 C
ATOM 12 CD GLU A 4 25.982 -12.580 9.254 1.00125.93 C
ATOM 13 OE1 GLU A 4 26.166 -11.344 9.039 1.00 77.38 O
ATOM 14 OE2 GLU A 4 26.731 -13.391 8.642 1.00 87.82 O
ATOM 15 N MET A 5 26.659 -12.773 12.451 1.00 36.27 N
ATOM 16 CA MET A 5 28.054 -12.609 12.120 1.00 59.07 C
ATOM 17 C MET A 5 28.601 -13.893 11.567 1.00 59.32 C
PDB
>sp|P02511|CRYAB_HUMAN Alpha-crystallin B chain OS=Homo sapiens GN=CRYAB
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEH
LLESDLFPTSTSLSPFYLRPPSFLRAPSW
FDTGLSEMRLEKDRFSVNLDVKHFSPEELKV
KVLGDVIEVHGKHEERQDEHGFISREFHR
KYRIPADVDPLTITSSLSSDGVLTVNGPRKQV
SGPERTIPITREEKPAVTAAPKK
>2WJ7:A|PDBID|CHAIN|SEQUENCE
GAMEMRLEKDRFSVNLDVKHFSPEE
LKVKVLGDVIEVHGKHEERQDEHGFI
SREFHRKYRIPADVDPLTITSSLSSDG
VLTVNGPRKQVSGPER
PDB
UniProtKB-PDB residue level mapping
Why Residue Level Mapping?
The UniProt Knowledgebase (UniProtKB), the worldwide protein sequence resource,
contains over 32 million sequences (as of release 2013-03). Of these, 539616 have been
manually annotated. This annotation adds value: it implies a degree of certainty about
the quality of the sequence and brings with it a large amount of extra information
collated from a wide variety of sources. One of these sources is the Protein Data Bank (PDB), which
holds 3D experimental details of protein folds, active/binding site residues, ligands, metals and
cofactor binding from which mechanisms of action can be deduced. This information is
invaluable for drug design, homology modelling, assessing the impact of SNPs, mutation studies, novel
protein design, etc.
The PDB held 87681 solved structures as of 03/2013 (fewer if redundancy is
taken into account). This represents a very small fraction of the total UniProtKB universe and
would appear to have little impact on the improvement of UniProtKB annotation. However,
this structural and functional information can be extended to the widely distributed
homologous and orthologous sequences related to these PDB entries.
The mappings are mostly generated automatically and updated weekly via a Java
application called getafix. The number of new PDB structures deposited each week varies
but is somewhere between 200 and 500, with each one requiring a ‘match’ to its specific
UniProtKB entry. Problematic matches always occur and these are manually curated.
Examples include chimaeras, N- and C-terminal tags, missing sections and domains, short
sequences and peptides, antibodies and immunoglobulin folds, and modified and non-standard
residues. Merged, demerged and deleted UniProtKB entries are often a source of error in
automated mapping and also need manual inspection.
Mapped PDB text files possess one or more DBREF line(s), which indicate which residues of
the structure correspond to which residues of a UniProtKB sequence. In cases of multiple structures or
chimaeras, one PDB entry will point to a number of UniProtKB identifiers.
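As an illustration of how a DBREF line encodes the mapping, the following is a minimal sketch that parses the two DBREF lines of 2WJ7 shown above and converts a PDB residue number to a UniProtKB position. It is not the getafix implementation: the names `DbrefSegment`, `parse_dbref` and `pdb_to_uniprot` are hypothetical, and the parser assumes whitespace-separated fields as they appear in this transcript. Real PDB files are fixed-column and may carry insertion codes, which this sketch does not handle.

```python
from typing import List, NamedTuple, Optional, Tuple


class DbrefSegment(NamedTuple):
    """One DBREF line: a run of PDB residues tied to a run in another database."""
    pdb_id: str
    chain: str
    seq_begin: int   # first PDB residue number of the segment
    seq_end: int     # last PDB residue number of the segment
    database: str    # "UNP" for UniProtKB, "PDB" for self-references such as tags
    accession: str
    db_begin: int    # first residue of the segment in the referenced sequence
    db_end: int      # last residue of the segment in the referenced sequence


def parse_dbref(line: str) -> DbrefSegment:
    # Assumption: fields are whitespace-separated, as in the excerpt above.
    f = line.split()
    if len(f) != 10 or f[0] != "DBREF":
        raise ValueError(f"not a simple DBREF line: {line!r}")
    return DbrefSegment(f[1], f[2], int(f[3]), int(f[4]),
                        f[5], f[6], int(f[8]), int(f[9]))


def pdb_to_uniprot(segments: List[DbrefSegment], chain: str,
                   resnum: int) -> Optional[Tuple[str, int]]:
    """Return (accession, UniProtKB position), or None if the residue is unmapped."""
    for s in segments:
        if s.database == "UNP" and s.chain == chain and s.seq_begin <= resnum <= s.seq_end:
            return s.accession, s.db_begin + (resnum - s.seq_begin)
    return None


segments = [
    parse_dbref("DBREF 2WJ7 A 1 3 PDB 2WJ7 2WJ7 1 3"),
    parse_dbref("DBREF 2WJ7 A 4 94 UNP P02511 CRYAB_HUMAN 67 157"),
]
print(pdb_to_uniprot(segments, "A", 4))   # first residue after the tag
print(pdb_to_uniprot(segments, "A", 3))   # tag residue, no UniProtKB counterpart
```

Residues covered only by a `PDB` self-reference, such as the expression tag here, deliberately return `None`, since they have no counterpart in the UniProtKB sequence.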
UniProtKB entries will cross-reference one or more PDB records in their DR PDB line(s).
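The reverse direction can be read from the DR PDB lines. The sketch below, under the assumption that each line has the five semicolon-separated fields seen in the CRYAB_HUMAN entry above, extracts the PDB identifier, method, resolution and per-chain residue ranges. The function name `parse_dr_pdb` and the dictionary layout are illustrative choices, not part of any UniProt tool.

```python
from typing import Dict, List, Tuple


def parse_dr_pdb(line: str) -> dict:
    """Parse a 'DR PDB; id; method; resolution; chains=range.' line."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    # Drop the line code and the trailing full stop, then split on ';'.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 5 or fields[0] != "PDB":
        raise ValueError(f"not a DR PDB line: {line!r}")
    _, pdb_id, method, resolution, coverage = fields

    chains: Dict[str, List[Tuple[int, int]]] = {}
    # Coverage may hold several comma-separated groups, e.g. "A/B=1-10, C=5-9".
    for group in coverage.split(","):
        names, sep, span = group.strip().partition("=")
        if not sep:
            continue  # no residue range given for this group
        start, _, end = span.partition("-")
        for name in names.split("/"):
            chains.setdefault(name, []).append((int(start), int(end)))

    return {"pdb_id": pdb_id, "method": method,
            "resolution": resolution, "chains": chains}


xray = parse_dr_pdb("DR PDB; 2WJ7; X-ray; 2.63 A; A/B/C/D/E=67-157.")
nmr = parse_dr_pdb("DR PDB; 2KLR; NMR; -; A/B=1-175.")
print(xray["pdb_id"], xray["chains"]["A"])
print(nmr["method"], nmr["resolution"])
```

The resolution is kept as a string because NMR entries carry a `-` placeholder rather than a number.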
A direct link from each amino acid in a UniProtKB sequence to a PDB entry may appear to be a trivial task but, as can be seen in the simple example above, the N
and C termini are not expressed in this crystallised protein (red). Note also that the SEQRES lines in the PDB file indicate that the structure has an N-terminal
three-residue tag ‘GAM’. The actual coordinates, however, start with the final methionine of the tag, a residue that is not part of the UniProtKB sequence. Similarly, the
SEQRES lines state that the sequence ends in QVSGPER, whereas in fact these residues are also missing from the 3D coordinates (purple).
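The difference between the SEQRES sequence and the residues that actually have coordinates can be checked mechanically. The following sketch collects the residue numbers present in ATOM records and lists the SEQRES positions with no coordinates. The helper names are hypothetical, and the parser assumes whitespace-separated fields as in the excerpt above; in real fixed-column PDB files the chain identifier and residue number can run together, so column slicing would be needed there.

```python
from typing import Iterable, List


def observed_residues(atom_lines: Iterable[str], chain: str) -> List[int]:
    """Residue numbers of the given chain that appear in ATOM records."""
    seen = set()
    for line in atom_lines:
        t = line.split()
        # Assumed token layout: ATOM serial atom resName chain resSeq x y z ...
        if len(t) >= 6 and t[0] == "ATOM" and t[4] == chain:
            seen.add(int(t[5]))
    return sorted(seen)


def unobserved(seqres_length: int, observed: Iterable[int]) -> List[int]:
    """SEQRES positions (1-based) for which no coordinates were found."""
    present = set(observed)
    return [n for n in range(1, seqres_length + 1) if n not in present]


# Three ATOM lines taken from the 2WJ7 excerpt above.
excerpt = [
    "ATOM 1 N MET A 3 23.981 -7.754 15.338 1.00 40.49 N",
    "ATOM 6 N GLU A 4 23.686 -10.723 13.149 1.00 63.18 N",
    "ATOM 15 N MET A 5 26.659 -12.773 12.451 1.00 36.27 N",
]
seen = observed_residues(excerpt, "A")
print(seen)
print(unobserved(94, seen)[:2])  # the first two tag residues have no coordinates
```

Because only an excerpt of the coordinates is used here, every position after residue 5 is also reported as unobserved; run against a complete file, the list would contain only the genuinely missing residues.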
Final mapped sequence:
EMRLEKDRFSVNLDVKHFSPEELKV
KVLGDVIEVHGKHEERQDEHGFISRE
FHRKYRIPADVDPLTITSSLSSDGVLT
VNGPRKQVSGPER
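The mapped sequence can be reproduced from the data shown above. This sketch slices the SEQRES sequence of chain A using the DBREF range 4-94 and compares it with UniProtKB residues 67-157 of P02511. It assumes that PDB residue numbers coincide with SEQRES positions, which holds for the DBREF lines of this entry but not for PDB entries in general; the variable and function names are illustrative.

```python
# UniProtKB P02511 sequence, copied from the SQ block above (spaces removed).
UNIPROT_P02511 = (
    "MDIAIHHPWI RRPFFPFHSP SRLFDQFFGE HLLESDLFPT STSLSPFYLR PPSFLRAPSW "
    "FDTGLSEMRL EKDRFSVNLD VKHFSPEELK VKVLGDVIEV HGKHEERQDE HGFISREFHR "
    "KYRIPADVDP LTITSSLSSD GVLTVNGPRK QVSGPERTIP ITREEKPAVT AAPKK"
).replace(" ", "")

# SEQRES sequence of 2WJ7 chain A in one-letter code, from the FASTA above.
SEQRES_2WJ7_A = (
    "GAMEMRLEKDRFSVNLDVKHFSPEE"
    "LKVKVLGDVIEVHGKHEERQDEHGFI"
    "SREFHRKYRIPADVDPLTITSSLSSDG"
    "VLTVNGPRKQVSGPER"
)


def segment(sequence: str, begin: int, end: int) -> str:
    """Return residues begin..end (1-based, inclusive) of a sequence."""
    return sequence[begin - 1:end]


# Ranges taken from: DBREF 2WJ7 A 4 94 UNP P02511 CRYAB_HUMAN 67 157
pdb_part = segment(SEQRES_2WJ7_A, 4, 94)
unp_part = segment(UNIPROT_P02511, 67, 157)

# List any positions where the two segments disagree (UniProtKB numbering).
mismatches = [(67 + i, u, p)
              for i, (u, p) in enumerate(zip(unp_part, pdb_part)) if u != p]

print(pdb_part)
print("identical" if pdb_part == unp_part else mismatches)
```

A non-empty `mismatches` list would be a prompt for manual inspection, for example in the case of engineered mutations or sequence conflicts.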
The Binding of Biological Molecules in Protein Structures
The binding of biologically important molecules in a PDB structure is captured and
automatically added to unreviewed UniProtKB/TrEMBL entries, where it is visible in the various FT
lines. Reviewed, hand-annotated UniProtKB/Swiss-Prot entries can be updated with
similar information from the PDB using an in-house curator tool, which is also part of the getafix
suite.
FT METAL 167 185 Manganese[1ATP].
FT NP_BIND 50 58 ATP.
FT NP_BIND 122 128 ATP.
FT NP_BIND 169 172 ATP.
FT ACT_SITE 167 167 Proton acceptor.
FT BINDING 73 73 ATP.
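To show how such FT annotations can be consumed at the residue level, here is a minimal sketch that parses FT lines in the layout shown in this poster (key, start, end, description) and reports which features cover a given sequence position. The regular expression and the names `parse_ft` and `features_at` are assumptions made for illustration; continuation lines and `/FTId` qualifier lines are simply skipped.

```python
import re
from typing import List, Optional, Tuple

Feature = Tuple[str, int, int, str]  # (key, start, end, description)

# Assumed layout: "FT KEY START END Description." with whitespace separators.
FT_LINE = re.compile(r"^FT\s+(\S+)\s+(\d+)\s+(\d+)\s+(.*)$")


def parse_ft(line: str) -> Optional[Feature]:
    """Parse one FT line; return None for continuation or qualifier lines."""
    m = FT_LINE.match(line.strip())
    if m is None:
        return None
    key, start, end, description = m.groups()
    return key, int(start), int(end), description.rstrip(".")


def features_at(features: List[Feature], position: int) -> List[Feature]:
    """Features whose start..end range includes the given residue position."""
    return [f for f in features if f[1] <= position <= f[2]]


lines = [
    "FT NP_BIND 50 58 ATP.",
    "FT NP_BIND 122 128 ATP.",
    "FT NP_BIND 169 172 ATP.",
    "FT BINDING 73 73 ATP.",
    "FT /FTId=VAR_014607.",
]
features = [f for f in map(parse_ft, lines) if f is not None]
print(features_at(features, 73))
print(features_at(features, 125))
```

Combined with a residue-level mapping, a lookup of this kind lets an annotated UniProtKB position be related to the corresponding residue in a structure.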
Collaborations and Applications
Maintaining up-to-date mappings relies on a close collaboration with the PDBe and good
communication with the RCSB. The mapping data is integral to the SIFTS database
(Structure Integration with Function, Taxonomy and Sequence), which provides residue level
mapping to the IntEnz, GO, Pfam, InterPro, SCOP, CATH and PubMed databases. The Enzyme
Portal is another resource that uses residue level mappings, combining enzyme
sequence and structure information with small molecule substrates/drugs and biochemical
pathways and functionality.
[Figure: getafix workflow diagram. Labels recovered from the figure:
Resources: UniProtKB/Swiss-Prot and TrEMBL; PDBe repository for all PDB entries; PDBe mappings; RCSB; SIFTS; CSA catalytic site annotations; GO annotation XML.
Scripts and jobs: RunWeeklyPdbRelease.sh (get new and modified PDB files); runPdbReleaseCheck; pdbReleaseMapping; makeUniProtPdbXrefs; buildGetafixDB (monthly); PDBe cronjob; Editor.
Files: getafix FASTA and XML files; DBREF.txt; UniProtPdbXrefs.txt; SwissProtAddLogFile.txt; SwissProtCuratedMoveLogFile.txt; SwissProtDeletedXrefLogfile.txt; obsolete etc. files.
Other labels: ftp XREF files (weekly); email SIB; errors in PDB entries, email RCSB, PDBj or PDBe; FT lines to TrEMBL.]