Metabolic entities corpus

Annotation guideline

Version 1.01 (2016-JAN-15)

Patumcharoenpol et al. [email protected]

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Content

Introduction
Convention
Entities and events
    Entities
    Events
    Metabolic events
The task
General guidelines
Type-specific guidelines
    Gene and protein (GP) entities specific guidelines
    Metabolite entities specific guidelines
    Events specific guidelines
Appendix


Introduction

We are building a corpus to support the development of an integrated text mining framework for metabolic interaction network reconstruction. This document is a practical guide to the annotation task, intended to help create a consistent and well-formed corpus.

The guideline begins with the conventions used throughout this document, followed by a step-by-step description of the annotation task. Finally, it provides a set of general and type-specific rules for deciding what to include in the corpus.

Convention

The text throughout this document is formatted in a straightforward way. Bold font marks the types of entities and events: Gene and Protein (GP) and Metabolite as entities, and Metabolic reaction, Metabolic consumption, Metabolic production, and Positive regulation as events. All examples in this document are shown in either text format or picture format.

Text format

Description: An excerpt from an abstract in italic font, with annotations underlined. The example below shows the annotation of GP entities; for instance, E.C. 1.1.1.262 is annotated as a GP entity.

The fourth step is catalyzed by 4-hydroxythreonine-4-phosphate dehydrogenase (PdxA, E.C. 1.1.1.262), which converts 4-hydroxy-l-threonine phosphate (HTP) to 3-amino-2-oxopropyl phosphate.

Picture format

Description: An excerpt from an abstract showing two metabolic event annotations highlighted in orange and two metabolite annotations highlighted in blue, as shown in Figure 1.

Figure 1 An example of picture format.


Entities and events

In this guideline, Gene and Protein mentions and Metabolite mentions are annotated as entities. Events are processes or actions that involve entities. Entities are mostly represented by nouns or noun phrases, while events are mostly represented by verbs or nominalized verbs. Both entities and events are classified into sub-types. In Figure 2, for example, Phosphoglucosamine mutase, GlmM, glucosamine-1-phosphate, and glucosamine-6-phosphate are identified as entities: the first two are Gene and Protein entities and the latter two are Metabolite entities. The words catalyzes and formation are identified as events of sub-types Positive regulation and Metabolic reaction, respectively.

Figure 2 An example of annotation for entities and events.

We model the types of these entities and events to assist the later automatic extraction of metabolic context from text. The types are therefore modeled around the synthesis and usage of metabolites as described in text. The hierarchical relationships between annotation types are shown in Figure 3. Nodes shown in all capitals are for organizational purposes only and are not assignable types.

Figure 3 Hierarchical relationships between annotation types.


Entities

To meet the needs of our current implementation of the integrated text mining framework for metabolic interaction network reconstruction, we divide entities into two types, Gene and Protein (GP) and Metabolite, as shown in Table 1.

GP means a gene or protein that resides within an organism. This includes:
- Gene: A genetic sequence residing on DNA that codes for mRNA or protein.
- Protein: A long chain of amino acids.
- Enzyme: A subset of proteins that have catalytic function.
- mRNA: A polymer of ribonucleotides. Among RNA types, only mRNA is annotated.

Note: Other types of gene products that are not mentioned here, e.g., sRNA and snRNA, are not annotated.

Metabolite means a specific chemical substance that is an intermediate or product of metabolism. A metabolite is usually a small molecule, but it can also be an amino acid, lipid, carbohydrate, or nucleotide. In this work, only chemicals that reside in the cell (in vivo) are considered metabolites.

Table 1 Description of entity types

Entity type           Reference  Ontology ID
Gene or Protein (GP)  EcoCyc     SBO:0000246
Metabolite            ChEBI      SBO:0000247

Events

An event is an occurrence of a process or action. An event is normally represented by a verb or nominalized verb and is accompanied by entities as its arguments.

Figure 4 Metabolic reaction event with two arguments

In the example in Figure 4, the event is represented by the word transformation. The event has two arguments: acetolactate, of type Theme, and dihydroxyisovalerate, of type Product.

More formally, an event is composed of two components, a trigger word and arguments.

1. Trigger word: A sequence of words that represents the event.
2. Argument: An entity that is associated with the event through the trigger word. Arguments are further classified into two categories.
- Theme, the entity that instigates an event.
- Other argument, an optional argument that adds further biological description to the event.
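The two components can be pictured as a small data structure. The following Python sketch is illustrative only: the class and field names (Entity, Event, trigger, arguments) are assumptions made for this example and are not part of the corpus format. It encodes the Figure 4 example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """An annotated mention with its entity type (hypothetical structure)."""
    text: str
    type: str  # "GP" or "Metabolite"


@dataclass
class Event:
    """An event: one trigger word plus role-labelled arguments."""
    trigger: str
    type: str
    # Each argument is a (role, entity) pair, e.g. ("Theme", Entity(...)).
    arguments: list = field(default_factory=list)

    def roles(self):
        """Return the argument roles in the order they were added."""
        return [role for role, _ in self.arguments]


# The Figure 4 example: "transformation" with a Theme and a Product.
acetolactate = Entity("acetolactate", "Metabolite")
dihydroxyisovalerate = Entity("dihydroxyisovalerate", "Metabolite")
figure4_event = Event(
    trigger="transformation",
    type="Metabolic reaction",
    arguments=[("Theme", acetolactate), ("Product", dihydroxyisovalerate)],
)
print(figure4_event.roles())
```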


Metabolic events

We focus on a specific type of event: a mention that gives a mechanistic description of a metabolic interaction, which we call a metabolic event in this guideline. A metabolic event should explicitly describe the change of one chemical into another. In this work, we classify metabolic events into four categories, as listed in Table 2.

1. Metabolic production: Metabolic event that corresponds to the formation of a metabolite.
2. Metabolic consumption: Metabolic event that corresponds to the consumption of a metabolite.
3. Metabolic reaction: Metabolic event that corresponds to the conversion of a metabolite.
4. Positive regulation: Enzyme relation with a metabolic event.

Table 2 Event types and their arguments for this annotation task. The type of each argument follows the argument name.

Event type             Arguments / additional arguments  Description                                                  Ontology ID
Metabolic production   Theme: Metabolite, Cause: Enzyme  Metabolic event that results in formation of metabolite.     SBO:0000176
Metabolic consumption  Theme: Metabolite, Cause: Enzyme  Metabolic event that results in consumption of metabolite.   SBO:0000176
Metabolic reaction     Theme: Metabolite, Cause: Enzyme  Metabolic event that results in conversion of metabolite.    SBO:0000176
Positive regulation    Theme: Event, Cause: Enzyme       Enzyme relation with metabolic event (Metabolic production,  GO:0048518, GO:0044093
                                                         Metabolic consumption and Metabolic reaction).
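Table 2 can be read as a small schema. The sketch below encodes it as a Python dictionary and checks an event's arguments against it. The function name validate_event, the requirement that every event has a Theme, and the treatment of Cause as optional are assumptions for illustration; the table does not state which arguments are mandatory. Roles that the table does not list, such as Product in Figure 4, are ignored by this sketch, and "Enzyme" simply follows the wording of the table.

```python
METABOLIC_EVENTS = {
    "Metabolic production",
    "Metabolic consumption",
    "Metabolic reaction",
}

# Allowed filler types per event type and role, transcribed from Table 2.
SCHEMA = {
    "Metabolic production": {"Theme": {"Metabolite"}, "Cause": {"Enzyme"}},
    "Metabolic consumption": {"Theme": {"Metabolite"}, "Cause": {"Enzyme"}},
    "Metabolic reaction": {"Theme": {"Metabolite"}, "Cause": {"Enzyme"}},
    "Positive regulation": {"Theme": METABOLIC_EVENTS, "Cause": {"Enzyme"}},
}


def validate_event(event_type, arguments):
    """Return a list of problems; an empty list means the event fits Table 2.

    `arguments` is a list of (role, filler_type) pairs.
    """
    if event_type not in SCHEMA:
        return [f"unknown event type: {event_type}"]
    allowed = SCHEMA[event_type]
    problems = []
    for role, filler_type in arguments:
        # Roles missing from Table 2 are skipped rather than rejected.
        if role in allowed and filler_type not in allowed[role]:
            problems.append(f"{role} of {event_type} cannot be {filler_type}")
    # Assumption: every event needs at least one Theme.
    if not any(role == "Theme" for role, _ in arguments):
        problems.append(f"{event_type} has no Theme")
    return problems


print(validate_event("Metabolic production", [("Theme", "Metabolite")]))
print(validate_event("Positive regulation",
                     [("Theme", "Metabolite"), ("Cause", "Enzyme")]))
```

The second call reports a problem because the Theme of a Positive regulation must itself be a metabolic event, not a metabolite.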


The task

In this work, we want to discover all events that represent a metabolic process. To do this, we need to identify the following.

1. All genes and proteins as well as metabolites that are mentioned in text.
2. All metabolic reactions that can be assigned as events.
3. Roles of proteins and metabolites in the context of metabolic events.

This information can be used to infer the underlying linguistic patterns, which can in turn support further text mining tasks.


General guidelines

The instructions in this section apply to all kinds of entity and event annotations.

1. Every annotation must be a continuous stretch of words.

(a) Last, the SerC (PdxF) enzyme uses 4PHT as a substrate in the reverse transamination reaction (13).

As shown in (a), SerC (PdxF) enzyme is considered to be a single GP entity.

2. Prepositions and determiners are excluded from annotations.

(b) Substitution of pyridoxal 5'-phosphate in D-serine dehydratase from Escherichia coli by cofactor analogues provides information on cofactor binding and catalysis.

In (b), we first consider D-serine dehydratase from Escherichia coli as a potential GP entity. However, because prepositions must not be included, we shorten the candidate to D-serine dehydratase.

3. Pronouns must be excluded. A pronoun such as “which”, “it”, or “they” is not annotated, even when it can be resolved to a proper noun.

(c) Pyridoxine 5'-phosphate (PNP) synthase is the key enzyme in the pdx group. It catalyses a multistep ring closure reaction yielding PNP and inorganic phosphate (Pi).

In example (c), It refers to Pyridoxine 5'-phosphate (PNP) synthase, but It is not annotated.

4. Special characters (e.g., quote, dash, or parenthesis) should not be at the beginning or end of an annotation.

(d) L-Glutamine:D-fructose-6-phosphate amidotransferase ('glucosamine synthase', EC 5.3.1.19) from Escherichia coli MRE 600 was purified at least 75-fold.

As shown in (d), the parenthesis and single quotes are excluded from glucosamine synthase.

5. People’s names should be excluded from annotations.

(e) This enzyme is referred to as the Wood and Gunsalus L-threonine deaminase.

6. Always apply the most specific type from the hierarchy that is applicable.
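Several of the rules above are mechanical enough to check automatically. The sketch below flags candidate spans that break rules 2 to 4. The word lists, the set of edge characters, and the function name check_span are illustrative assumptions and are far from complete. Rule 1 holds by construction, because a span is given as a single pair of character offsets.

```python
import string

# Characters that rule 4 forbids at either edge of an annotation.
EDGE_CHARS = set("'\"-()[]")
# Small illustrative word lists. "a" is left out because, after lowercasing,
# it would collide with single-letter name parts.
PREPOSITIONS = {"from", "in", "of", "by", "to", "with", "on", "at"}
DETERMINERS = {"the", "an", "this", "that", "these", "those"}
PRONOUNS = {"which", "it", "they"}


def check_span(text, start, end):
    """Return the rule violations for the annotation text[start:end]."""
    span = text[start:end]
    if not span.strip():
        return ["empty span"]
    problems = []
    words = [w.strip(string.punctuation).lower() for w in span.split()]
    # Rule 2: prepositions and determiners are excluded.
    if any(w in PREPOSITIONS or w in DETERMINERS for w in words):
        problems.append("rule 2: contains a preposition or determiner")
    # Rule 3: a pronoun on its own is never annotated.
    if len(words) == 1 and words[0] in PRONOUNS:
        problems.append("rule 3: pronoun")
    # Rule 4: no special character at the beginning or end.
    if span[0] in EDGE_CHARS or span[-1] in EDGE_CHARS:
        problems.append("rule 4: starts or ends with a special character")
    return problems


sentence = ("Substitution of pyridoxal 5'-phosphate in D-serine dehydratase "
            "from Escherichia coli by cofactor analogues")
start = sentence.index("D-serine")
long_end = sentence.index(" by cofactor")
short_end = sentence.index(" from")
print(check_span(sentence, start, long_end))   # candidate that includes "from"
print(check_span(sentence, start, short_end))  # shortened candidate
```

Rules 5 and 6 depend on meaning rather than surface form, so they are left to the annotator.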


Type-specific guidelines

This section describes the annotation rules for individual types, covering entity guidelines and event guidelines.

We emphasize that it is crucial to annotate every entity (GP and Metabolite), whether or not it takes part in an event.

Where two or more entity names share the head of a phrase, annotate them as one entity.

(a) In the present work, we provide in vivo evidence that gadC is co-transcribed with gadB and that the functional glutamic acid-dependent system requires the activities of both GadA/B and GadC.

(b) The ratio of the valine- and isoleucine-alpha-ketoglutarate activities did not change significantly during purification.

In example (a), GadA/B has A and B sharing Gad, so it can be expanded into GadA and GadB. Similarly, in example (b), valine- and isoleucine-alpha-ketoglutarate can be expanded into valine-alpha-ketoglutarate and isoleucine-alpha-ketoglutarate.
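For downstream processing, a combined annotation of this kind may need to be expanded back into its individual names. The sketch below handles only the two surface patterns seen in examples (a) and (b): single-letter suffixes separated by a slash, and a hyphen-terminated prefix coordinated with "and". The function name expand_shared_head and both regular expressions are assumptions for illustration; real text will contain patterns they miss.

```python
import re

# "GadA/B": a stem followed by single uppercase suffixes separated by slashes.
SLASH = re.compile(r"^(?P<stem>.+?)(?P<first>[A-Z])(?P<rest>(?:/[A-Z])+)$")
# "valine- and isoleucine-alpha-ketoglutarate": "X- and Y-head".
COORD = re.compile(r"^(?P<a>\S+)- and (?P<b>[^-\s]+)-(?P<head>\S+)$")


def expand_shared_head(mention):
    """Expand a mention whose parts share a head; otherwise return it as is."""
    m = SLASH.match(mention)
    if m:
        suffixes = [m.group("first")] + m.group("rest").strip("/").split("/")
        return [m.group("stem") + suffix for suffix in suffixes]
    m = COORD.match(mention)
    if m:
        return [f"{m.group(part)}-{m.group('head')}" for part in ("a", "b")]
    return [mention]


print(expand_shared_head("GadA/B"))
print(expand_shared_head("valine- and isoleucine-alpha-ketoglutarate"))
```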

Gene and protein (GP) entities specific guidelines

1. An ID from a reference database (e.g., NCBI and EBI) is annotated as GP.

2. Common names, gene symbols, and EC numbers are annotated as GP.

(a) The fourth step is catalyzed by 4-hydroxythreonine-4-phosphate dehydrogenase (PdxA, E.C. 1.1.1.262), which converts 4-hydroxy-l-threonine phosphate (HTP) to 3-amino-2-oxopropyl phosphate.

In (a), 4-hydroxythreonine-4-phosphate dehydrogenase, PdxA, and E.C. 1.1.1.262 are all annotated as GP, since they are a common name, a gene symbol, and an EC number, respectively.
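EC numbers have a regular shape, so candidate mentions can be located with a pattern before manual review. The regular expression below is an assumption for illustration: it accepts the prefix written as "EC" or "E.C." followed by four dot-separated numeric fields, and it does not cover partial EC numbers. It is meant for pre-annotation only, with the annotator making the final decision.

```python
import re

# "EC" or "E.C.", an optional space, then four dot-separated numeric fields.
EC_NUMBER = re.compile(r"\bE\.?C\.?\s?\d+\.\d+\.\d+\.\d+\b")

sentence = (
    "The fourth step is catalyzed by 4-hydroxythreonine-4-phosphate "
    "dehydrogenase (PdxA, E.C. 1.1.1.262), which converts "
    "4-hydroxy-l-threonine phosphate (HTP) to 3-amino-2-oxopropyl phosphate."
)


def find_ec_numbers(text):
    """Return (start, end, mention) for every EC number candidate in text."""
    return [(m.start(), m.end(), m.group()) for m in EC_NUMBER.finditer(text)]


for start, end, mention in find_ec_numbers(sentence):
    print(start, end, mention)
```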


3. A prefix or suffix that adds biological meaning to a GP is considered part of the GP.

(b) The accumulation of 2-ketoisovalerate in ilvE leu double mutants was shown to interfere with 2-KIC amination by the tyrB-encoded transaminase and also by the aspC- and avtA-encoded transaminases.

(c) The FolB protein shows 30% identity to the paralogous dihydroneopterin-triphosphate epimerase, which is specified by the folX gene located at 2427 kilobases on the E. coli chromosome.

(d) The role of intersubunit side chain-side chain interactions in the stability of the Escherichia coli aspartate aminotransferase (eAATase) homodimer was investigated by directed mutagenesis at 10 different interface contacts.

(e) Evidence was obtained for two monocistronic gltA transcripts extending anti-clockwise, to a common terminus, from independent promoters with start points 196 bp (major) and 299 bp (minor) upstream of the gltA coding region.

(f) In all conditions tested, this regulation required a functional narL gene product.

(g) Each subunit (361 residues) of the PSAT homodimer is composed of a large pyridoxal-5'-phosphate binding domain (residues 16-268).

As shown in examples (b)-(g) above, words such as enzyme, protein, gene, cluster, family, homodimer, coding region, transcripts, and gene product are included as part of the GP. Mentions of species and compartments are also included (e.g., E. coli and cytoplasmic).

4. A GP must be resolvable on its own, without any additional or external information.

(h) A three-dimensional structural comparison to four other vitamin B6-dependent enzymes reveals that three alpha-helices of the large domain, as well as an N-terminal domain (subgroup II) or subdomain (subgroup I) are absent in PSAT.

(i) Two polytopic membrane proteins, NarK and NarU, are assumed to transport nitrite out of the Escherichia coli cytoplasm.

Examples (h) and (i) contain phrases that could potentially be GP: vitamin B6-dependent enzymes in (h) and Two polytopic membrane proteins in (i). Neither is annotated as GP, since it is impossible to identify an exact enzyme or protein without using additional context around them.

5. Words that do not carry significant biological meaning should not be included.

(j) Purified transaminase B catalyzed transamination.

In example (j), Purified does not add anything biologically significant to transaminase B, so it is omitted.


6. Amino acid residues and functional groups are not annotated.

(k) The cofactor is bound through an aldimine linkage to Lys198 in the active site.

In the example above, Lys198 is a lysine residue, so it is not annotated.

Metabolite entities specific guidelines

1. Oxygen, CO2, NADH and its variants, and ATP and its variants are considered Metabolite entities.

2. As with GP entities, prefixes and suffixes that add meaningful description are included as part of the Metabolite entity.

(a) The enzyme activity is dependent on the presence of a divalent magnesium ion…

(b) A Zn2+ ion is bound within each active site,…

(c) It is suggested that anticapsin behaves as a glutamine analogue…

As shown in (a)-(c), ion and analogue are counted as part of the Metabolite entity.

3. Cofactors are annotated as Metabolite entities.

(d) Either NADP+ or NAD+ function as cofactors, whereas the free alcohol 4-hydroxy-L-threonine is not a substrate for the reaction.

In example (d), NADP+ and NAD+ are annotated as Metabolite.

4. An amino acid that does not have catalytic activity is considered a Metabolite.

(e) It is suggested that anticapsin behaves as a glutamine analogue and that a reaction of its epoxide group with a thiol group of glucosamine synthase results in its linkage to the enzyme by a covalent bond.

In example (e), the glutamine analogue is annotated as Metabolite.

5. Mentions that are too ambiguous are not annotated.

(f) The reductoisomerase is able to catalyze the reduction of ketopantoate to produce pantoate (the intermediate in coenzyme A biosynthesis) which again requires that the reduction half-reaction produce a 2-(R)-hydroxy acid (Primerano & Burns, 1983).

In example (f), 2-(R)-hydroxy acid can refer to many different metabolites. In such a case, 2-(R)-hydroxy acid is not annotated as Metabolite.


Events specific guidelines

In this section, we explain the guidelines for event annotation.

1. An event’s trigger word should be one word long.

(a) … the amination of 2-ketoisocaproate (2-KIC) to form leucine…

(b) … divalent metal ion-dependent oxidative decarboxylation of a b-hydroxy acid substrate…

In examples (a) and (b), amination and decarboxylation are metabolic events of type Metabolic consumption.

2. Annotate the event using the most specific type first. For example, Metabolic consumption and Metabolic production are preferred to Metabolic reaction.

However, in some cases the production or consumption is expressed only by a preposition (e.g., from, in), which cannot be annotated as an event (a technical limitation). In this case, we use Metabolic reaction, as seen in Figure 5.

Figure 5 Annotation example of metabolic reaction for specific case.

A metabolic event can act on multiple entities. There are two cases: separate events and a combined event. In the separate case shown in Figure 6, two events occur, a biosynthesis of isoleucine and a biosynthesis of valine, and we assign a Metabolic production event to each.

Figure 6 Assigning two Metabolic production events to biosynthesis, with isoleucine and valine as their respective Themes

However, there are cases where the text explicitly states that both events occur in the same reaction at the same time. In that case, the events are combined, as illustrated in Figure 7, where deoxyxylulose 5-phosphate and 4-phosphohydroxy-L-threonine are used together in the same reaction.

Figure 7 Metabolic consumption event with two Themes, deoxyxylulose 5-phosphate and 4-phosphohydroxy-L-threonine


3. Only annotate events that actually occur.

(c) This strain did not form aminoacetone from threonine, but it slowly degraded threonine.

In example (c), form and degraded are not annotated as events, since the sentence indicates, through did not form and slowly, that these occurrences do not happen or happen only rarely.

Metabolic production

Typically, we want to annotate all text that states the synthesis of a metabolite. Most of the time, synthesis and biosynthesis are counted as Metabolic production events. Figure 8 shows an event of pyridoxine production.

Figure 8 Example of Metabolic production event

Metabolic consumption

1. Verbs or nominalized verbs that indicate the usage of a metabolite are considered Metabolic consumption events.

(a) for a third enzyme which can utilize only L-serine

(b) confined to the B6 vitamer salvage pathway

In example (a), utilize is annotated as Metabolic consumption with L-serine as its Theme argument. Catabolism, metabolism, utilize, and salvage are counted as Metabolic consumption events.

2. A metabolite being bound to an enzyme does not count as a Metabolic consumption event, since the text does not specify whether any catalytic activity is involved.

(c) The x-ray structure of PdxA bound to Zn2+ as well as the HTP presented here…

As shown in (c), PdxA being bound to Zn2+ is not annotated as Metabolic consumption.

Metabolic reaction

In cases where the usage or synthesis is obscured (for example, the transfer of a functional group), we recommend using Metabolic reaction as a general assignment. As seen in Figure 9, Transamination is assigned as Metabolic reaction.

Figure 9 Example of specific case of metabolic reaction event
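The trigger words named in this guideline can be collected into a small lookup table for suggesting event types during pre-annotation. The sketch below is illustrative and rests on several assumptions: the table contains only words that appear in this guideline, the lowercase exact-match lookup is a simplification that does not handle inflected forms, and the function names are hypothetical. The annotator still decides from context, for example when a reaction did not occur or when only a preposition expresses the event.

```python
# Trigger words taken from the examples in this guideline (lowercased).
TRIGGER_LEXICON = {
    "synthesis": "Metabolic production",
    "biosynthesis": "Metabolic production",
    "catabolism": "Metabolic consumption",
    "metabolism": "Metabolic consumption",
    "utilize": "Metabolic consumption",
    "salvage": "Metabolic consumption",
    "amination": "Metabolic consumption",
    "decarboxylation": "Metabolic consumption",
    "transamination": "Metabolic reaction",
    "transformation": "Metabolic reaction",
    "catalyzes": "Positive regulation",
}

STRIP_CHARS = ".,;:()"


def suggest_event_type(token):
    """Return the suggested event type for a token, or None if unknown."""
    return TRIGGER_LEXICON.get(token.strip(STRIP_CHARS).lower())


def suggest_triggers(sentence):
    """Yield (token, event type) for each whitespace token with a suggestion."""
    for token in sentence.split():
        event_type = suggest_event_type(token)
        if event_type is not None:
            yield token.strip(STRIP_CHARS), event_type


print(list(suggest_triggers("for a third enzyme which can utilize only L-serine")))
```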

Positive regulation

Positive regulation is an event that expresses the catalytic relation between an enzyme and a metabolic event.


1. If possible, use Positive regulation in place of the Cause argument. In Figure 10, catalyzes is annotated as Positive regulation. However, if there is no verb or nominalized verb to annotate as Positive regulation, use the Cause argument instead, as seen in Figure 11.

Figure 10 Example of Positive regulation event

Figure 11 Using Cause argument instead of using Positive regulation event.

2. One Positive regulation is responsible for one enzyme and one event. Separate Positive regulation events according to the number of events and enzymes, as shown in Figures 12 and 13, respectively.

Figure 12 Isocitrate dehydrogenase catalyzing two separate events.

Figure 13 Amination reactions catalyzed by two separate enzymes.
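One way to read rule 2 is that every enzyme is paired with every event it catalyzes, and each pair gets its own Positive regulation. A minimal sketch of that reading follows. The function name, the record layout, and the placeholder enzyme and event labels are hypothetical, since the content of Figures 12 and 13 is not reproduced in the text, and the full pairing of several enzymes with several events is an assumption.

```python
from itertools import product


def split_positive_regulations(trigger, enzymes, events):
    """Create one Positive regulation record per (enzyme, event) pair."""
    return [
        {"type": "Positive regulation", "trigger": trigger,
         "Cause": enzyme, "Theme": event}
        for enzyme, event in product(enzymes, events)
    ]


# One enzyme catalyzing two events gives two Positive regulation events.
one_enzyme = split_positive_regulations(
    "catalyzes", ["enzyme_1"], ["event_1", "event_2"])
# Two enzymes catalyzing one event also gives two.
two_enzymes = split_positive_regulations(
    "catalyzed", ["enzyme_1", "enzyme_2"], ["event_1"])
print(len(one_enzyme), len(two_enzymes))
```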


Appendix

In this annotation process, we look for metabolic events in text together with their associated genes, proteins, or metabolites.

To make the annotation process as consistent as possible, we recommend the following procedure.

1. Pre-process the text with BANNER to get a candidate list of GP and Metabolite entities.
2. For each sentence in the text, read carefully and correct the GP and Metabolite entities from BANNER.
3. Locate verbs and nominalized verbs and mark them with the appropriate events.
4. Determine the entities that are associated with the events within the same sentence.

We recommend using BRAT (https://github.com/nlplab/brat) for the annotation process. A configuration file for BRAT can be downloaded from http://www.sbi.kmutt.ac.th/~preecha/metrecon/.
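BRAT stores annotations in a standoff .ann file next to each text file. Text-bound annotations (entities and event triggers) get T identifiers with character offsets, and events get E identifiers that point to a trigger and to role-labelled arguments. The sketch below writes such lines for a short sentence fragment. The type labels (Metabolite, Metabolic_consumption) and the helper names are assumptions for illustration; the actual labels are defined in the configuration file and may differ.

```python
def text_bound(ann_id, ann_type, text, mention, search_from=0):
    """Build a BRAT text-bound line: ID <TAB> TYPE START END <TAB> TEXT."""
    start = text.index(mention, search_from)
    end = start + len(mention)
    return f"{ann_id}\t{ann_type} {start} {end}\t{mention}"


def event(ann_id, event_type, trigger_id, arguments):
    """Build a BRAT event line: ID <TAB> TYPE:TRIGGER ROLE:ID ..."""
    parts = [f"{event_type}:{trigger_id}"]
    parts += [f"{role}:{arg_id}" for role, arg_id in arguments]
    return f"{ann_id}\t" + " ".join(parts)


text = "a third enzyme which can utilize only L-serine"
lines = [
    text_bound("T1", "Metabolite", text, "L-serine"),
    text_bound("T2", "Metabolic_consumption", text, "utilize"),
    event("E1", "Metabolic_consumption", "T2", [("Theme", "T1")]),
]
print("\n".join(lines))
```

Offsets count characters from the start of the document text, with the end offset exclusive, so text[start:end] reproduces the annotated mention.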


Version history

1.01 – Add license.
1.0 – First draft.
