Integration of PRO and UniProtKB

Integration of PRO and UniProtKB. Amherst, NY, May 16, 2013. Cathy H. Wu, Ph.D. PRO-PO-GO Meeting


Transcript of Integration of PRO and UniProtKB

Page 1: Integration of PRO and UniProtKB

Integration of PRO and UniProtKB

Amherst, NY, May 16, 2013

Cathy H. Wu, Ph.D.

PRO-PO-GO Meeting

Page 2: Integration of PRO and UniProtKB


PRO Framework

PRO terms are defined/annotated using other ontologies and resources via definition of relations or mappings when appropriate

Page 3: Integration of PRO and UniProtKB

Relationships Between PRO-GO-UniProtKB

• Accessioned, species-specific protein complexes in ProComp are described using protein entities in ProForm, and are cross-referenced to species-independent complex representations in GO

• A gene product (PR:000025358) and its isoforms and modified forms (PR:000025355; PR:000025356) are represented in PRO as separate, uniquely accessioned entities, but are described in the same UniProtKB record (UniProtKB:Q9D6R2)

ProComp-ProForm: has_part
ProComp-GO: is_a
ProForm-UniProtKB: xref

The representation of protein complexes in the Protein Ontology (PRO). Bult CJ, Drabkin HJ, Evsikov A, Natale D, Arighi C, Roberts N, Ruttenberg A, D'Eustachio P, Smith B, Blake JA, Wu C. (2011) BMC Bioinformatics 12, 371 [PMID: 21929785]

Page 4: Integration of PRO and UniProtKB

PRO ID Mapping

Mappings to various external databases

• promapping.txt: tab-delimited, each line indicating the PRO ID, the database ID, and the type of mapping (is_a or exact)

• promapping.obo: the same information as promapping.txt, but in OBO format

Mappings are of two types:

• exact
  • The database object is an exact match to the PRO object
  • e.g., PR:000026497 describes an isoform of 6-phosphofructokinase type C in human only, which corresponds to UniProtKB:Q01813-1

• is_a
  • The database object is more specific than the PRO object
  • e.g., PR:000026465 describes an (organism-nonspecific) isoform of 6-phosphofructokinase type C, so UniProtKB:Q01813-1 (human) and UniProtKB:Q9WUA3-1 (mouse) are mapped to this term
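The tab-delimited mapping file described on this slide could be read along the following lines. This is a minimal sketch, not the PRO project's own tooling: the column order follows the slide's description, the absence of a header row is an assumption, and the function name load_mappings is made up for illustration. The sample rows are the two examples given on the slide.

```python
import csv
import io
from collections import defaultdict

# Sample rows built from the slide's examples. Column order (PRO ID,
# database ID, mapping type) follows the slide; the lack of a header
# row is an assumption about the real promapping.txt.
SAMPLE = (
    "PR:000026497\tUniProtKB:Q01813-1\texact\n"
    "PR:000026465\tUniProtKB:Q01813-1\tis_a\n"
    "PR:000026465\tUniProtKB:Q9WUA3-1\tis_a\n"
)


def load_mappings(handle):
    """Group database IDs by PRO ID and by mapping type."""
    mappings = defaultdict(lambda: {"exact": [], "is_a": []})
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        pro_id, db_id, kind = (field.strip() for field in row[:3])
        if kind not in ("exact", "is_a"):
            raise ValueError(f"unexpected mapping type: {kind!r}")
        mappings[pro_id][kind].append(db_id)
    return mappings


mappings = load_mappings(io.StringIO(SAMPLE))
for pro_id, by_kind in mappings.items():
    print(pro_id, by_kind)
```

To read a real file, the same function would take an open file handle instead of the in-memory sample.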

Page 5: Integration of PRO and UniProtKB

PRO Reasoning with ID Mapping

bri1/iso1/phos 5 (PR:000035786) has two parents:
• an explicit one in the formal definition (PR:000035785)
• an implicit one shown only in the reasoned version (PR:000028355)

[Term]
id: PR:000035786
name: protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 (Arabidopsis thaliana)
def: "A protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 in Arabidopsis thaliana. UniProtKB:O22476-1, Thr-872, MOD:00047|Ser-858, MOD:00046|Ser-891, MOD:00046." [PMID:22184234, PRO:LVM]
comment: Category=organism-modification. Flag=automatic.
synonym: "Athal-BRI1/iso:1/Phos:5" EXACT PRO-short-label [PRO:DNx]
synonym: "At protein brassinosteroid insensitive 1 isoform 1 phosphorylated 4" RELATED []
is_a: PR:000028355 ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 (Arabidopsis thaliana)
is_a: PR:000035785 ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: PR:000035785 ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: only_in_taxon NCBITaxon:3702 ! Arabidopsis thaliana

pro.obo: PRO version with no implied links
pro_reasoned.obo: implied link automatically realized via is_a
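One way to separate the explicit parent from the implied one in a reasoned stanza is to compare the is_a targets against the genus named in the intersection_of lines. The sketch below is hand-rolled for illustration and is not a full OBO parser; the helper names parse_stanza and classify_parents are hypothetical, and the stanza is abbreviated from the one shown on the slide.

```python
# Abbreviated copy of the reasoned stanza from the slide.
STANZA = """\
[Term]
id: PR:000035786
is_a: PR:000028355 ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 (Arabidopsis thaliana)
is_a: PR:000035785 ! implied link automatically realized ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: PR:000035785 ! protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5
intersection_of: only_in_taxon NCBITaxon:3702 ! Arabidopsis thaliana
"""


def parse_stanza(text):
    """Collect tag values from one stanza, dropping trailing '!' comments."""
    tags = {}
    for line in text.splitlines():
        if ": " not in line:
            continue  # skips the [Term] header and blank lines
        tag, value = line.split(": ", 1)
        value = value.split(" ! ", 1)[0].strip()
        tags.setdefault(tag, []).append(value)
    return tags


def classify_parents(tags):
    """Split is_a parents into (explicit, implied).

    Assumption: a parent counts as explicit when it is the genus of the
    formal definition, i.e. an intersection_of value with no relation name.
    """
    genus = {v for v in tags.get("intersection_of", []) if " " not in v}
    parents = tags.get("is_a", [])
    explicit = [p for p in parents if p in genus]
    implied = [p for p in parents if p not in genus]
    return explicit, implied


explicit, implied = classify_parents(parse_stanza(STANZA))
print("explicit:", explicit)
print("implied:", implied)
```

For production use, an established OBO parsing library would be a safer choice than this line-splitting approach.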

Page 6: Integration of PRO and UniProtKB


Ontological Representation of UniProtKB in PRO

• PRO provides the ontological representation for UniProtKB

• Integration of UniProt records/subrecords into the PRO ontological framework

• Use UniProtKB protein records (labeled by accession numbers, isoform IDs, and potentially other stable identifiers within UniProtKB records) to represent organism-gene level and sequence level (and potentially modification-level) terms of PRO
  • Organism-Gene: canonical protein record
  • Organism-Sequence: isoform subrecord
  • Organism-Modification: chain/variant subrecord

Page 7: Integration of PRO and UniProtKB


Organism-Gene/Sequence

Page 8: Integration of PRO and UniProtKB


Ontologizing UniProtKB

• Full-scale implementation of 12 reference genomes (others as needed)
  • Organism-Gene: canonical protein record – UniProtKB:xxxxxx
  • Organism-Sequence: isoform subrecord – UniProtKB:xxxxxx-1

• Persistent URL: http://purl.obolibrary.org/obo/PR_xxxxxxxxx

• UniProtKB URL in the ontological space, proposed as:
  • PR:xxxxxx (UniProtKB at organism-gene level)
  • PR:xxxxxx-1 (UniProtKB at organism-sequence level)

• To consider
  • Organism-Modification: chain – UniProtKB:PRO_xxxxxxxxx
  • Organism-Modification: variant – UniProtKB:VAR_xxxxxx
  • Integration/coordination between ProComp and IntAct for ontological representation of protein complexes
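The identifier scheme on this slide can be sketched as a small classifier plus two converters. The regular expressions are deliberately loose and only reflect the identifier shapes shown on the slide (accession, accession-N, PRO_..., VAR_...); they are not the official UniProtKB accession grammar. The function names are hypothetical, and the PR:accession form is the proposal from the slide, not a confirmed convention.

```python
import re

PURL_BASE = "http://purl.obolibrary.org/obo/"

# Loose patterns, checked in order. They mirror the shapes on the slide
# and are an assumption, not the official accession format.
LEVEL_PATTERNS = [
    ("organism-modification (chain)", re.compile(r"PRO_\d+")),
    ("organism-modification (variant)", re.compile(r"VAR_\d+")),
    ("organism-sequence", re.compile(r"[A-Z0-9]+-\d+")),
    ("organism-gene", re.compile(r"[A-Z0-9]+")),
]


def classify(uniprot_id):
    """Return the PRO level suggested by the shape of a UniProtKB identifier."""
    local = uniprot_id.split(":", 1)[-1]  # drop an optional 'UniProtKB:' prefix
    for level, pattern in LEVEL_PATTERNS:
        if pattern.fullmatch(local):
            return level
    raise ValueError(f"unrecognized identifier: {uniprot_id!r}")


def proposed_pr_id(uniprot_id):
    """Build the PR:accession form proposed on the slide."""
    if classify(uniprot_id) not in ("organism-gene", "organism-sequence"):
        raise ValueError("the slide proposes PR IDs only for gene and sequence levels")
    return "PR:" + uniprot_id.split(":", 1)[-1]


def to_purl(pr_id):
    """Turn a PR identifier into a persistent URL by replacing ':' with '_'."""
    prefix, local = pr_id.split(":", 1)
    return f"{PURL_BASE}{prefix}_{local}"


print(classify("UniProtKB:Q96F24"))
print(classify("UniProtKB:Q96F24-2"))
print(proposed_pr_id("UniProtKB:O22476-1"))
print(to_purl("PR:000035786"))
```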

Page 9: Integration of PRO and UniProtKB

UniProtKB in PRO Ontological Framework: Rich Relations

[Figure: diagram relating the levels Orthologous-Gene, Ortho-Isoform, Ortho-PTM, Organism-PTM, Ortho-Complex, and Organism-Complex]

Page 10: Integration of PRO and UniProtKB


Issues

• Stable identifiers
  • UniProtKB would provide stable identifiers
  • ID mapping service

• Need for sequence merging and isoform curation: when a Swiss-Prot (SP) entry exists for a given gene along with corresponding unmerged TrEMBL (Tr) entries that may represent a new isoform, a new variant, or a duplicate
  • Unmerged Tr entries corresponding to additional isoforms with a sequence different from any mentioned in the SP entry
    organism-gene (SP): Q96F24
    organism-sequence (SP): Q96F24-1, Q96F24-2
    organism-sequence (Tr): B4DWS0

• Organism-gene only represented in unreviewed (Tr) section: where one or multiple Tr entries exist for a given gene
  • One entry
    organism-gene accession (Tr) = Q8VGZ9
    organism-sequence accession (Tr; implied) = Q8VGZ9-1
  • Multiple entries
    organism-gene accession ***???***
    organism-sequence accession = B9E100, Q6W3E0
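The three cases listed on this slide can be written down as a small triage routine. This is an illustrative sketch only: the Entry class and the triage function are hypothetical, the rule of taking the first SP entry as the organism-gene accession is an assumption, and the multiple-Tr case returns no gene accession because the slide leaves that question open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str
    reviewed: bool        # True for Swiss-Prot (SP), False for TrEMBL (Tr)
    isoforms: tuple = ()  # isoform IDs annotated in the entry, if any


def triage(entries):
    """Suggest organism-gene and organism-sequence accessions for one gene."""
    if not entries:
        raise ValueError("need at least one entry")
    sp = [e for e in entries if e.reviewed]
    tr = [e for e in entries if not e.reviewed]
    if sp:
        gene = sp[0].accession  # assumption: one SP entry per gene
        sequences = list(sp[0].isoforms) + [e.accession for e in tr]
        status = "SP entry with unmerged Tr entries" if tr else "SP entry only"
    elif len(tr) == 1:
        gene = tr[0].accession
        sequences = [f"{gene}-1"]  # implied isoform ID, as on the slide
        status = "single Tr entry"
    else:
        gene = None  # left open on the slide
        sequences = [e.accession for e in tr]
        status = "multiple Tr entries"
    return {"gene": gene, "sequences": sequences, "status": status}


# The three examples from the slide.
case_a = triage([Entry("Q96F24", True, ("Q96F24-1", "Q96F24-2")), Entry("B4DWS0", False)])
case_b = triage([Entry("Q8VGZ9", False)])
case_c = triage([Entry("B9E100", False), Entry("Q6W3E0", False)])
for case in (case_a, case_b, case_c):
    print(case)
```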

Page 11: Integration of PRO and UniProtKB

Integrating PRO curation into UniProtKB

• Isoforms curated by PRO curators will continue to be integrated into UniProtKB as a priority
  • PRO isoform curation (mostly done at MGI) is based on experimental information from literature, and covers information such as UniProtKB AC, GO annotation, and comments on evidence on isoform and expression
  • PIR curators integrate new isoforms and associated annotations to SP entry

• Submission of annotation for a new SP entry
  • PIR curators create new reviewed SP entries when annotating protein isoforms and PTM forms with no reference SP entry
  • Example: BUB3_XENLA

• Other areas of PRO annotations, particularly on PTMs and complexes, could be integrated as appropriate

• Reciprocal links from UniProtKB to PRO

Page 12: Integration of PRO and UniProtKB

New isoform curation in PRO & UniProt

• PRO literature-based annotation of isoforms 4 and 5 of a mouse protein

• UniProt curation:
  • Merged 3 TrEMBL entries to existing UniProtKB record (Q8BIF2)
  • Added isoform-specific subcellular localization information
  • Updated information about function and added new information

CC   -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 1: Nucleus.
CC   -!- SUBCELLULAR LOCATION: Isoform 4: Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 5: Nucleus.
CC   -!- TISSUE SPECIFICITY: Widely expressed in brain, regions including …
CC   -!- DEVELOPMENTAL STAGE: In the neural tube, expressed as early as
CC       embryonic day 9.5 (E9.5) and expression is confined to the nervous …
CC   -!- INDUCTION: By retinoic acid. Expression is up-regulated in P19
CC       cells during neural differentiation upon retinoic acid treatment …
CC   -!- PTM: Phosphorylated (Probable).
CC   -!- SIMILARITY: Contains 1 RRM (RNA recognition motif) domain.
CC   -!- CAUTION: Initial characterization was derived from usage of a
CC       monoclonal antibody (A60) directed to an unknown protein called ...
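The isoform-specific localization in the comment (CC) lines above can be pulled out with a short parser. This is a sketch under stated assumptions: it handles only the single-line "Isoform N: place." form shown on the slide, ignores continuation lines, and the function name subcellular_locations is hypothetical. The flat-file spacing after "CC" is reconstructed.

```python
import re

# A few of the CC lines from the slide, one topic per line.
CC_LINES = """\
CC   -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 1: Nucleus.
CC   -!- SUBCELLULAR LOCATION: Isoform 4: Cytoplasm.
CC   -!- SUBCELLULAR LOCATION: Isoform 5: Nucleus.
CC   -!- PTM: Phosphorylated (Probable).
"""

# Optional 'Isoform N:' scope, then the location text.
PATTERN = re.compile(
    r"^CC\s+-!-\s+SUBCELLULAR LOCATION:\s*(?:(Isoform\s+\S+?):\s*)?(.+)$"
)


def subcellular_locations(text):
    """Map 'entry' or 'Isoform N' to the list of locations given for it."""
    result = {}
    for line in text.splitlines():
        match = PATTERN.match(line.rstrip())
        if not match:
            continue  # other CC topics are ignored
        scope = match.group(1) or "entry"
        places = [p.strip() for p in match.group(2).split(".") if p.strip()]
        result.setdefault(scope, []).extend(places)
    return result


locations = subcellular_locations(CC_LINES)
for scope, places in locations.items():
    print(scope, "->", ", ".join(places))
```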


Page 13: Integration of PRO and UniProtKB

Integrating PRO curation into UniProtKB

• Reciprocal links from UniProtKB to PRO
  • UniProtKB cross-reference (DR) lines [e.g., DR GO; GO:0006954; P:inflammatory response; IEA:Compara]
  • DR line to include PRO identifier (PURL), PRO name, and short-label
  • Link to the PRO page(s) at the exact (organism-gene) level and possibly also other PTM forms (organism-modification)

• Other areas of PRO annotations, particularly on PTMs and complexes, could be integrated as appropriate
  • Annotation of sequence features (such as PTMs not annotated in UniProtKB) and functional annotation that applies to those features
  • Barrier for direct annotation integration: curation depth needed for all aspects of annotatable information beyond PTMs
  • Possible solution: link to information in PRO as additionally annotated data, similar to the UniProt approach of including additional bibliography
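The proposed reciprocal link could look roughly like the DR line built below. Everything about the PRO line is an assumption made for illustration: the database abbreviation "PRO", the field order (PURL, name, short-label), and the trailing period are not taken from any UniProtKB specification. The term used is the organism-modification example from the earlier stanza, and the helper names are hypothetical.

```python
PURL_BASE = "http://purl.obolibrary.org/obo/"


def parse_dr(line):
    """Split a DR line into its semicolon-separated fields."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip().rstrip(".")
    return [field.strip() for field in body.split(";")]


def format_pro_dr(pr_id, name, short_label):
    """Build a hypothetical DR line for a PRO term.

    The 'PRO' abbreviation and the field order are assumptions based on
    the slide's wish list (PURL, PRO name, short-label).
    """
    purl = PURL_BASE + pr_id.replace(":", "_", 1)
    return f"DR   PRO; {purl}; {name}; {short_label}."


# The GO example quoted on the slide.
go_fields = parse_dr("DR GO; GO:0006954; P:inflammatory response; IEA:Compara")

# A PRO line for the term shown in the earlier stanza.
pro_line = format_pro_dr(
    "PR:000035786",
    "protein brassinosteroid insensitive 1 isoform 1 phosphorylated 5 (Arabidopsis thaliana)",
    "Athal-BRI1/iso:1/Phos:5",
)
print(go_fields)
print(pro_line)
```

Parsing the generated line with parse_dr returns the same four fields, which gives a quick round-trip check of the format.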
