Integration of PRO and UniProtKB
Amherst, NY
May 16, 2013
Cathy H. Wu, Ph.D.
PRO-PO-GO Meeting
PRO Framework
PRO terms are defined/annotated using other ontologies and resources via definition of relations or mappings when appropriate
• Accessioned, species-specific protein complexes in ProComp are described using protein entities in ProForm, and are cross-referenced to species-independent complex representations in GO
• A gene product (PR:000025358) and its isoforms and modified forms (PR:000025355; PR:000025356) are represented in PRO as separate, uniquely accessioned entities, but are described in the same UniProtKB record (UniProtKB:Q9D6R2)
The representation of protein complexes in the Protein Ontology (PRO)
Bult CJ, Drabkin HJ, Evsikov A, Natale D, Arighi C, Roberts N, Ruttenberg A, D'Eustachio P, Smith B, Blake JA, Wu C. (2011) BMC Bioinformatics 12, 371 [PMID: 21929785]
Relationships Between PRO-GO-UniProtKB
ProComp-ProForm: has_part
ProComp-GO: is_a
ProForm-UniProtKB: xref
Mappings to various external databases
• promapping.txt: tab-delimited, each line indicating the PRO ID, the database ID, and the type of mapping (is_a or exact)
• promapping.obo: the same information as promapping.txt, but in OBO format
Mappings are of two types:
• exact
  • The database object is an exact match to the PRO object
  • e.g., PR:000026497 describes an isoform of 6-phosphofructokinase type C in human only, which corresponds to UniProtKB:Q01813-1
• is_a
  • The database object is more specific than the PRO object
  • e.g., PR:000026465 describes an (organism-nonspecific) isoform of 6-phosphofructokinase type C, so UniProtKB:Q01813-1 (human) and UniProtKB:Q9WUA3-1 (mouse) are mapped to this term
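As a rough illustration of how the tab-delimited mapping file could be consumed, the sketch below groups database IDs by PRO ID and mapping type. It assumes the column order stated above (PRO ID, database ID, mapping type). The sample rows are assembled from the examples on this slide, not copied from an actual promapping.txt, and `load_mappings` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the layout described above: PRO ID, database ID, mapping type.
# The three rows are built from the slide's examples, not taken from a real promapping.txt.
SAMPLE = (
    "PR:000026497\tUniProtKB:Q01813-1\texact\n"
    "PR:000026465\tUniProtKB:Q01813-1\tis_a\n"
    "PR:000026465\tUniProtKB:Q9WUA3-1\tis_a\n"
)


def load_mappings(handle):
    """Group database IDs by PRO ID and by mapping type (exact or is_a)."""
    mappings = defaultdict(lambda: {"exact": [], "is_a": []})
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        pro_id, db_id, kind = (field.strip() for field in row)
        if kind not in ("exact", "is_a"):
            raise ValueError(f"unexpected mapping type: {kind!r}")
        mappings[pro_id][kind].append(db_id)
    return dict(mappings)


mappings = load_mappings(io.StringIO(SAMPLE))
print(mappings["PR:000026465"]["is_a"])
```

With a real file, the same function could be given `open("promapping.txt", newline="")` instead of the in-memory sample.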
PRO ID Mapping
bri1/iso1/phos 5 (PR:000035786) has two parents:
• explicit one in formal definition (PR:000035785)
• implicit one only shown in the reasoned version (PR:000028355)
[Term]
id: PR:000035786
name: protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 (Arabidopsis thaliana)
def: "A protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 in Arabidopsis thaliana. UniProtKB:O22476-1, Thr-872, MOD:00047|Ser-858, MOD:00046|Ser-891, MOD:00046." [PMID:22184234, PRO:LVM]
comment: Category=organism-modification. Flag=automatic.
synonym: "Athal-BRI1/iso:1/Phos:5" EXACT PRO-short-label [PRO:DNx]
synonym: "At protein brassinosteroid insensitive 1 isoform 1 phosphorylated 4" RELATED []
is_a: PR:000028355 ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 (Arabidopsis thaliana)
is_a: PR:000035785 ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: PR:000035785 ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: only_in_taxon NCBITaxon:3702 ! Arabidopsis thaliana
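The distinction between the explicit and the implicit parent can be read directly off the stanza: a parent that also appears as the bare-ID `intersection_of` line belongs to the formal definition, while any other `is_a` parent is present only because of reasoning. The following is a minimal sketch of that reading, assuming a trimmed copy of the stanza above. `classify_parents` is a hypothetical helper, not part of any PRO tooling.

```python
# Trimmed copy of the stanza shown above (only the tags needed for the illustration).
STANZA = """\
[Term]
id: PR:000035786
is_a: PR:000028355 ! protein brassinosteroid insensitive 1 isoform 1 (Arabidopsis thaliana)
is_a: PR:000035785 ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: PR:000035785 ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: only_in_taxon NCBITaxon:3702 ! Arabidopsis thaliana
"""


def classify_parents(stanza):
    """Split is_a parents into those named in the formal definition and the rest."""
    is_a, genus = [], set()
    for line in stanza.splitlines():
        tag, _, value = line.partition(": ")
        value = value.split(" ! ")[0].strip()  # drop the trailing comment
        if tag == "is_a":
            is_a.append(value)
        elif tag == "intersection_of" and len(value.split()) == 1:
            # A bare ID is the genus; "relation ID" pairs are differentiae.
            genus.add(value)
    explicit = [parent for parent in is_a if parent in genus]
    implicit = [parent for parent in is_a if parent not in genus]
    return explicit, implicit


explicit, implicit = classify_parents(STANZA)
print("explicit:", explicit)
print("implicit:", implicit)
```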
PRO Reasoning with ID Mapping
PR:000035785
PR:000028355
pro.obo: PRO version with no implied links
pro_reasoned.obo: implied link automatically realized via is_a
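One way to see what reasoning adds is to compare the `is_a` links in the two files. The sketch below does this on two short in-memory excerpts. Both excerpts are hypothetical: in particular, the assumption that the unreasoned file carries only the `intersection_of` lines for this term is made for illustration and has not been checked against pro.obo.

```python
def is_a_links(obo_text):
    """Return the set of (child, parent) pairs stated by is_a tags."""
    links, current = set(), None
    for line in obo_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            current = None  # a new stanza starts; wait for its id tag
        elif line.startswith("id: "):
            current = line[len("id: "):].strip()
        elif line.startswith("is_a: ") and current:
            parent = line[len("is_a: "):].split("!")[0].strip()
            links.add((current, parent))
    return links


# Hypothetical excerpt standing in for pro.obo (no implied links).
ASSERTED = """\
[Term]
id: PR:000035786
intersection_of: PR:000035785
intersection_of: only_in_taxon NCBITaxon:3702
"""

# Hypothetical excerpt standing in for pro_reasoned.obo (implied links realized).
REASONED = ASSERTED + "is_a: PR:000028355\nis_a: PR:000035785\n"

realized = is_a_links(REASONED) - is_a_links(ASSERTED)
for child, parent in sorted(realized):
    print(child, "is_a", parent)
```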
Ontological Representation of UniProtKB in PRO
• PRO provides the ontological representation for UniProtKB
• Integration of UniProt records/subrecords into the PRO ontological framework
• Use UniProtKB protein records (labeled by accession numbers, isoform IDs, and potentially other stable identifiers within UniProtKB records) to represent organism-gene level and sequence level (and potentially modification-level) terms of PRO
  • Organism-Gene: canonical protein record
  • Organism-Sequence: isoform subrecord
  • Organism-Modification: chain/variant subrecord
Organism-Gene/Sequence
Ontologizing UniProtKB
• Full-scale implementation of 12 reference genomes (others as needed)
  • Organism-Gene: canonical protein record – UniProtKB:xxxxxx
  • Organism-Sequence: isoform subrecord – UniProtKB:xxxxxx-1
• Persistent URL: http://purl.obolibrary.org/obo/PR_xxxxxxxxx
• UniProtKB URL in the ontological space, proposed as:
  • PR:xxxxxx (UniProtKB at organism-gene level)
  • PR:xxxxxx-1 (UniProtKB at organism-sequence level)
• To consider
  • Organism-Modification: chain – UniProtKB:PRO_xxxxxxxxx
  • Organism-Modification: variant – UniProtKB:VAR_xxxxxx
  • Integration/coordination between ProComp and IntAct for ontological representation of protein complexes
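The identifier scheme above can be sketched as a small classifier that assigns a UniProtKB identifier to a PRO level and derives the proposed PR: form. The regular expressions are deliberately simplified and are an assumption of this sketch: they are looser than UniProtKB's own accession rules. The helper names `pro_level` and `proposed_pr_id` are hypothetical, and the PRO_ and VAR_ identifiers used in the checks are made-up placeholders.

```python
import re

# Simplified patterns, assumed for illustration only.
ACCESSION = r"[A-Z][0-9][A-Z0-9]{3}[0-9](?:[A-Z0-9]{4})?"
PATTERNS = [
    ("organism-sequence", re.compile(rf"^{ACCESSION}-[0-9]+$")),
    ("organism-gene", re.compile(rf"^{ACCESSION}$")),
    ("organism-modification (chain)", re.compile(r"^PRO_[0-9]+$")),
    ("organism-modification (variant)", re.compile(r"^VAR_[0-9]+$")),
]


def pro_level(identifier):
    """Return the PRO level a UniProtKB identifier would represent, or None."""
    local = identifier.split(":", 1)[-1]  # drop an optional "UniProtKB:" prefix
    for level, pattern in PATTERNS:
        if pattern.match(local):
            return level
    return None


def proposed_pr_id(identifier):
    """Return the proposed PR: identifier; None for levels still under consideration."""
    if pro_level(identifier) in ("organism-gene", "organism-sequence"):
        return "PR:" + identifier.split(":", 1)[-1]
    return None


for example in ("UniProtKB:Q96F24", "UniProtKB:Q96F24-2"):
    print(example, "->", pro_level(example), "->", proposed_pr_id(example))
```

Chain and variant identifiers return no PR: form here because the slide lists them only as items to consider.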
UniProtKB in PRO Ontological Framework: Rich Relations
[Figure labels: Orthologous-Gene, Ortho-Isoform, Ortho-PTM, Organism-PTM, Ortho-Complex, Organism-Complex]
Issues
• Stable identifiers
  • UniProtKB would provide stable identifiers
  • ID mapping service
• Need for sequence merging and isoform curation: when there is a Swiss-Prot (SP) entry for a given gene and corresponding unmerged TrEMBL (Tr) entries that may represent a new isoform, a new variant, or a duplicate
  • Unmerged Tr entries corresponding to additional isoforms with a sequence different from any mentioned in the SP entry
    organism-gene (SP): Q96F24
    organism-sequence (SP): Q96F24-1, Q96F24-2
    organism-sequence (Tr): B4DWS0
• Organism-gene only represented in unreviewed (Tr) section: where one or multiple Tr entries exist for a given gene
  • One entry
    organism-gene accession (Tr) = Q8VGZ9
    organism-sequence accession (Tr; implied) = Q8VGZ9-1
  • Multiple entries
    organism-gene accession ***???***
    organism-sequence accession = B9E100, Q6W3E0
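The three cases on this slide can be written down as a small decision function. This is only a sketch of the cases as listed: `assign_levels` is a hypothetical helper, it assumes any TrEMBL accessions passed alongside a Swiss-Prot entry have already been judged to be distinct isoforms (not variants or duplicates), and it returns `None` for the organism-gene accession in the multiple-entry case because the slide leaves that question open.

```python
def assign_levels(sp_accession=None, sp_isoforms=(), tr_accessions=()):
    """Pick organism-gene and organism-sequence accessions for one gene."""
    if sp_accession:
        # A Swiss-Prot entry exists: it names the gene-level term, and unmerged
        # TrEMBL isoforms sit beside the Swiss-Prot isoforms.
        return {
            "organism-gene": sp_accession,
            "organism-sequence": list(sp_isoforms) + list(tr_accessions),
        }
    if len(tr_accessions) == 1:
        # A single TrEMBL entry: the isoform identifier is implied as "-1".
        accession = tr_accessions[0]
        return {"organism-gene": accession, "organism-sequence": [accession + "-1"]}
    # Several TrEMBL entries: the gene-level accession is unresolved on the slide.
    return {"organism-gene": None, "organism-sequence": list(tr_accessions)}


print(assign_levels("Q96F24", ["Q96F24-1", "Q96F24-2"], ["B4DWS0"]))
print(assign_levels(tr_accessions=["Q8VGZ9"]))
print(assign_levels(tr_accessions=["B9E100", "Q6W3E0"]))
```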
Integrating PRO curation into UniProtKB
• Isoforms curated by PRO curators will continue to be integrated into UniProtKB as a priority
  • PRO isoform curation (mostly done at MGI) is based on experimental information from literature, and covers information such as UniProtKB AC, GO annotation, and comments on evidence on isoform and expression
  • PIR curators integrate new isoforms and associated annotations to SP entry
• Submission of annotation for a new SP entry
  • PIR curators create new reviewed SP entries when annotating protein isoforms and PTM forms with no reference SP entry
  • Example: BUB3_XENLA
• Other areas of PRO annotations, particularly on PTMs and complexes, could be integrated as appropriate
• Reciprocal links from UniProtKB to PRO
• PRO literature-based annotation of isoforms 4 and 5 of a mouse protein
• UniProt curation:
  • Merged 3 TrEMBL entries to existing UniProtKB record (Q8BIF2)
  • Added isoform-specific subcellular localization information
  • Updated information about function and added new information
New isoform curation in PRO & UniProt
CC   -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 1: Nucleus.
CC   -!- SUBCELLULAR LOCATION: Isoform 4: Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 5: Nucleus.
CC   -!- TISSUE SPECIFICITY: Widely expressed in brain, regions including …
CC   -!- DEVELOPMENTAL STAGE: In the neural tube, expressed as early as
CC       embryonic day 9.5 (E9.5) and expression is confined to the nervous …
CC   -!- INDUCTION: By retinoic acid. Expression is up-regulated in P19
CC       cells during neural differentiation upon retinoic acid treatment …
CC   -!- PTM: Phosphorylated (Probable).
CC   -!- SIMILARITY: Contains 1 RRM (RNA recognition motif) domain.
CC   -!- CAUTION: Initial characterization was derived from usage of a
CC       monoclonal antibody (A60) directed to an unknown protein called ...
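The isoform-specific localization added during curation can be pulled out of the comment (CC) lines with a single pattern. The sketch below works on a copy of the relevant lines from this slide. `isoform_locations` is a hypothetical helper, and it assumes each isoform-specific comment fits on one line, which holds for the lines shown here but not necessarily for full UniProtKB records.

```python
import re

# Copy of the relevant comment lines from the slide.
CC_LINES = """\
CC   -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 1: Nucleus.
CC   -!- SUBCELLULAR LOCATION: Isoform 4: Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 5: Nucleus.
CC   -!- PTM: Phosphorylated (Probable).
"""

# Matches "Isoform <id>: <location>." within a SUBCELLULAR LOCATION comment.
ISOFORM_LOCATION = re.compile(
    r"^CC\s+-!-\s+SUBCELLULAR LOCATION:\s+Isoform\s+(\S+):\s+(.*?)\.?\s*$"
)


def isoform_locations(text):
    """Map isoform identifiers to their annotated subcellular location."""
    result = {}
    for line in text.splitlines():
        match = ISOFORM_LOCATION.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


locations = isoform_locations(CC_LINES)
print(locations)
```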
Integrating PRO curation into UniProtKB
• Reciprocal links from UniProtKB to PRO
  • UniProtKB cross-reference (DR) lines [e.g., DR GO; GO:0006954; P:inflammatory response; IEA:Compara]
  • DR line to include PRO identifier (PURL), PRO name, and short-label
  • Link to the PRO page(s) at the exact (organism-gene) level and possibly also other PTM forms (organism-modification)
• Other areas of PRO annotations, particularly on PTMs and complexes, could be integrated as appropriate
  • Annotation of sequence features (such as PTMs not annotated in UniProtKB) and functional annotation that apply to those features
  • Barrier for direct annotation integration: curation depth needed for all aspects of annotatable information beyond PTMs
  • Possible solution: link to information in PRO as additionally annotated data, similarly to UniProt approach to include additional bibliography
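To make the proposed reciprocal link concrete, the sketch below builds a PURL from a PRO identifier, following the persistent URL pattern http://purl.obolibrary.org/obo/PR_xxxxxxxxx, and assembles a cross-reference line from the three fields named above. The line layout is purely hypothetical: the `PRO` resource abbreviation, the field order, and the helper names `purl` and `format_dr_line` are assumptions of this sketch, not an agreed UniProtKB format.

```python
def purl(pro_id):
    """Build the OBO PURL for a PRO identifier such as PR:000035786."""
    return "http://purl.obolibrary.org/obo/" + pro_id.replace(":", "_", 1)


def format_dr_line(pro_id, name, short_label):
    """Assemble a DR-style line; layout and field order are hypothetical."""
    return f"DR   PRO; {purl(pro_id)}; {name}; {short_label}."


line = format_dr_line(
    "PR:000035786",
    "protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 (Arabidopsis thaliana)",
    "Athal-BRI1/iso:1/Phos:5",
)
print(line)
```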