An intensive case study on kernel-based relation extraction
Sung-Pil Choi & Seungwoo Lee & Hanmin Jung & Sa-kwang Song
© Springer Science+Business Media New York 2013
Abstract Relation extraction refers to a method of efficiently detecting and identifying predefined semantic relationships within a set of entities in text documents. Numerous relation extraction techniques have been developed thus far, owing to their innate importance in the domain of information extraction and text mining. The majority of the relation extraction methods proposed to date are based on supervised learning, which requires learning collections; such methods can be classified into feature-based, semi-supervised, and kernel-based techniques. Among these, a case analysis of kernel-based relation extraction, considered the most successful of the three approaches, is carried out in this paper. Although some previous survey papers on this topic have been published, they failed to select the most essential of the currently available kernel-based relation extraction approaches or to provide an in-depth comparative analysis of them. Unlike existing case studies, the study described in this paper is based on a close analysis of the operating principles and individual characteristics of five vital representative kernel-based relation extraction methods, and we present deep comparative analysis results for these methods. In addition, to support further research on kernel-based relation extraction with an even higher performance, as well as general high-level kernel studies for linguistic processing and text mining, some additional approaches, including feature-based methods, are introduced based on various criteria.
Keywords Relation extraction . Kernel methods . Text mining . Information extraction . Machine learning
Multimed Tools Appl
DOI 10.1007/s11042-013-1380-5
S.-P. Choi : S. Lee : H. Jung : S.-k. Song (*)
Information S/W Research Center, Korea Institute of Science and Technology Information (KISTI),
335 Gwahangno, Yuseong-gu, Daejeon, South Korea
e-mail: [email protected]
S.-P. Choi
e-mail: [email protected]

S. Lee
e-mail: [email protected]

H. Jung
e-mail: [email protected]
1 Introduction
Relation extraction refers to a method for the efficient detection and identification of predefined semantic relationships within a set of entities in text documents [39, 42]. The importance of this method was first recognized at the Message Understanding Conference [30], which was held from 1987 to 1997 under the supervision of DARPA¹. From 1999 to 2008, the Automatic Content Extraction [1] Workshop, promoted as a new project by the NIST², facilitated numerous studies. The workshop is currently held each year, and is considered the best world forum for the comparison and evaluation of new technologies in the field of information extraction, such as named entity recognition, relation extraction, event extraction, and temporal information extraction. This workshop is conducted as a sub-field of the Text Analytics Conference [36], which is currently under the supervision of the NIST.
According to ACE, an entity within a text is a representation naming a real object. Exemplary entities include the names of persons, locations, facilities, and organizations. A sentence including such entities can express their semantic relationships. For example, in the sentence, “President Clinton was in Washington today,” there is a “Located” relation between “Clinton” and “Washington.” In the sentence, “Steve Balmer, CEO of Microsoft, said…” the relation of “Role (CEO, Microsoft)” can be extracted.
Several relation extraction techniques have been developed within the framework of the various technological workshops mentioned above. Most relation extraction methods developed thus far have been based on supervised learning requiring learning collections. These methods are classified into feature-based methods, semi-supervised learning methods, bootstrapping methods, and kernel-based methods [4, 9]. Feature-based methods rely on classification models for automatically specifying the category to which a relevant feature vector belongs. In these approaches, surrounding contextual features are mainly used to identify the semantic relations between two entities in a specific sentence and to represent them as a feature vector. The major drawback of supervised learning-based methods, however, is that they require learning collections. Semi-supervised learning and bootstrapping methods, on the other hand, use large corpora or Web documents based on reduced learning collections that are progressively expanded to overcome the above disadvantage. Kernel-based methods [11], in turn, devise kernel functions that are most appropriate for relation extraction and apply them for learning in the form of a kernel set optimized for syntactic analysis and part-of-speech tagging. The kernel function itself is used for measuring the similarity between two instances, which are the main objects of machine learning. General kernel-based models are discussed in detail in Section 3.
As a representative feature-based approach, the research in [23] combines various types of lexical, syntactic, and semantic features required for relation extraction using a maximum entropy model. Although based on the same type of composite features proposed by [23], the research in [44] makes use of support vector machines for relation extraction, allowing for a flexible kernel combination. The authors in [43] classified all features that were available at the time to create individual linear kernels, and attempted relation extraction using composite kernels made up of these individual linear kernels. Most feature-based methods aim at applying feature engineering algorithms for selecting the optimal features for relation extraction, and their application of syntactic structures has been very limited.
¹ Defense Advanced Research Projects Agency of the U.S.
² National Institute of Standards and Technology of the U.S.
Exemplary semi-supervised learning and bootstrapping methods are Snowball [2] and DIPRE [5]. These methods rely on a few learning collections, using bootstrapping methods similar to the Yarowsky algorithm [37] to gather various syntactic patterns that denote the relations between two entities in a large Web-based text corpus. Recent developments include the KnowItAll [15] and TextRunner [38] methods for automatically collecting lexical patterns of target relations and entity pairs based on ample Web resources. Although this approach does not require large learning collections, its disadvantages are that many incorrect patterns are detected when expanding the pattern collections, and that only one relation can be handled at a time.
Kernel-based relation extraction methods were first attempted by Zelenko et al. [39], who devised both contiguous and sparse subtree kernels for recursively measuring the similarity of two parse trees for application to binary relation extraction, which demonstrated a relatively high performance. A variety of kernel functions for relation extraction have since been suggested, e.g., dependency parse tree kernels [14], convolution parse tree kernels [40], and composite kernels [9, 41], which show even better performances.
In this paper, a case analysis is carried out for kernel-based relation extraction methods, which are considered the most successful approach thus far. While some survey papers motivated by the importance and impact of this methodology have been previously published [4, 28], they fail to fully analyze the particular functional principles or characteristics of the currently available kernel-based relation extraction models, and simply cite the contents of individual articles or describe limited analyses. Although the performance of most kernel-based relation extraction methods has been demonstrated based on ACE evaluation collections, a comparison and analysis of their overall performance has not yet been made.
Unlike existing case studies, the study described herein is based on a close analysis of the operating principles and individual characteristics of five kernel-based relation extraction methods, starting from the research in [39], which is the source of kernel-based relation extraction studies, and ending with the composite kernel, which is considered the most advanced kernel-based relation extraction method currently available [9, 41]. For a comparison of the overall performance of each method, the focus of this study is placed on the ACE collection. We hope this study will contribute to further research on kernel-based relation extraction of an even higher performance and to high-level general kernel studies for linguistic processing and text mining.
Section 2 outlines supervised learning-based relation extraction methods, and in Section 3 we discuss kernel-based machine learning. Section 4 closely analyzes five exemplary kernel-based relation extraction methods. Section 5 then compares the performances of these methods to analyze their advantages and disadvantages. Finally, Section 6 provides some concluding remarks regarding this research.
2 Supervised learning-based relation extraction
As discussed above, relation extraction methods are classified into three categories. The difference between feature-based and kernel-based methods is shown in Fig. 1. These two methods differ from semi-supervised learning methods with respect to their machine learning procedures.
On the left side of Fig. 1, individual sentences that make up a learning collection have at least two entities (black squares), whose relation is manually extracted and predefined. Since most relation extraction methods studied thus far work with binary relations, learning examples are modified for a convenient relation extraction from a pair of entities by
preprocessing the original learning collection. Such modified learning examples are referred to as relation instances. A relation instance is defined as an element of the learning collection, modified so that it can be efficiently applied to the relevant relation extraction method, based on a specific sentence containing at least two entities.
The aforementioned modification is closely related to the feature information used in relation extraction. Since most supervised learning-based methods use both the entity itself and the contextual information about the entity, it is important to collect contextual information efficiently for an improved performance. Linguistic processing (part-of-speech tagging, base phrase recognition, syntactic analysis, etc.) of the individual learning sentences in the preprocessing step serves as a basis for effective feature selection and extraction. For example, when a sentence goes through the syntactic analysis shown in Fig. 1, one relation instance is composed of a parse tree and the locations of the entities indicated in the tree [18, 42, 45]. A single sentence can also be represented as a feature vector or syntactic graph [21, 25, 40]. The type of such relation instances depends on the relation extraction method, and can also involve various preprocessing tasks [31, 40].
In general, feature-based relation extraction methods follow the procedure shown in the upper part of Fig. 1. That is, feature collections that “can best express the individual learning examples” are extracted (feature extraction), the learning examples are feature-vectorized, and inductive learning is carried out using selected machine learning models. On the contrary, the kernel-based relation extraction shown in the lower part of Fig. 1 devises a kernel function that “can most effectively calculate the similarity of any two learning examples” to replace the feature extraction process. Here, the measurement of similarity between the learning examples is not a general similarity measurement. That is, a function that properly computes the similarity between two sentences or instances expressing the same relation is the most effective kernel function in terms of relation extraction. For example, the two sentences, “Washington is in the U.S.” and “Seoul is located in Korea,” use different object entities but feature the same relation (“located”). Therefore, an efficient kernel function would detect a high similarity between these two sentences. On the other hand, since the sentences, “Washington is the capital of the United States” and “Washington is located in the United States,” express the same object entities but different relations, the similarity between them should be determined to be very low. As such, in kernel-based relation extraction methods, the selection and creation of kernel functions are the most fundamental aspects affecting the overall performance.
Fig. 1 Learning process for supervised learning-based relation extraction
As shown in Fig. 1, kernel functions (linear, polynomial, and sigmoid) can be used in feature-based methods as well. However, they are functions applied only to vector-expressed instances. On the other hand, kernel-based methods are not limited in terms of the type of instance, and can therefore employ various kernel functions.
3 Overview of kernel-based machine learning methods
Most machine learning methods are carried out on a feature basis. That is, each instance, to which an answer (label) is attached, is converted into a feature sequence or N-dimensional vector, f1, f2, …, fN, for use in the learning process. For example, important features for identifying the relation between two entities in a single sentence are the entity types; the contextual information before, between, and after the entities; the part-of-speech information of the contextual words; and the dependency relation path information for the two entities [9, 23, 25, 40]. These data are selected as single features and represented through vectors for an automatic classification of the relation between the entities.
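As a minimal sketch of the feature extraction just described, the function below builds a sparse feature set for an entity pair from the kinds of information the text lists (entity types, context words before/between/after the entities, and their parts of speech). The feature names and spans are illustrative assumptions, not the representation used in any of the surveyed papers.

```python
# Hypothetical feature extractor for a relation instance: the feature
# name scheme (E1_TYPE=..., BETWEEN=..., etc.) is an illustrative choice.
def extract_features(tokens, pos_tags, e1_span, e2_span, e1_type, e2_type):
    """Build a sparse feature set for the entity pair (e1, e2).

    e1_span/e2_span are (start, end) token index pairs, end exclusive."""
    feats = {f"E1_TYPE={e1_type}", f"E2_TYPE={e2_type}"}
    before = tokens[:e1_span[0]]                 # context before e1
    between = tokens[e1_span[1]:e2_span[0]]      # context between e1 and e2
    after = tokens[e2_span[1]:]                  # context after e2
    feats |= {f"BEFORE={w}" for w in before}
    feats |= {f"BETWEEN={w}" for w in between}
    feats |= {f"AFTER={w}" for w in after}
    # POS tags of the words between the two entities
    feats |= {f"BETWEEN_POS={p}" for p in pos_tags[e1_span[1]:e2_span[0]]}
    return feats

tokens = ["President", "Clinton", "was", "in", "Washington", "today"]
pos = ["NNP", "NNP", "VBD", "IN", "NNP", "NN"]
feats = extract_features(tokens, pos, (1, 2), (4, 5), "PERSON", "GPE")
print(sorted(feats))
```

A feature-based method would then vectorize such sets over the whole learning collection and train an ordinary classifier on the resulting vectors.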
In Section 2, we showed that the essence of feature-based methods is to create a feature vector that can best express the individual learning examples. However, in many cases, it is not possible to express a feature vector reasonably. For example, a feature space is required for expressing the syntactic information³ of a specific sentence as a feature vector, and in certain instances it is almost impossible to express it as a feature vector in a limited space [13]. Kernel-based methods instead calculate kernel functions between two examples while maintaining the original learning examples, without an additional feature representation [13]. The kernel function is defined as the mapping K : X × X → [0, ∞) from the input space X to a similarity score, K(x, y) = φ(x) · φ(y) = Σᵢ φᵢ(x)φᵢ(y). Here, φ(x) is the mapping function from a learning example in the input space X to the multidimensional feature space. The kernel function is symmetric and positive semi-definite. Using the kernel function, it is not necessary to calculate all features one by one, and machine learning can thus be carried out based only on the similarity between two learning examples. Exemplary models in which learning is carried out on the basis of similarity matrices between the learning examples include the Perceptron [35], Voted Perceptron [17], and Support Vector Machines [12, 28]. Kernel-based machine learning methods are drawing an increasingly greater amount of attention, and are widely used for pattern recognition, data and text mining, and Web mining. The performance of kernel methods, however, depends to a great extent on the kernel function selection and configuration [24].
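The identity K(x, y) = φ(x) · φ(y) can be made concrete with a minimal sketch. The feature map below (character counts) is an illustrative assumption, not a map used in the surveyed work; the point is only that the kernel value equals the inner product of the mapped feature vectors, and is symmetric.

```python
# Minimal sketch of K(x, y) = phi(x) . phi(y) = sum_i phi_i(x) * phi_i(y).
# The feature map phi (character -> frequency) is a toy choice for
# illustration only.
from collections import Counter

def phi(x: str) -> Counter:
    """Explicit feature map: character -> frequency."""
    return Counter(x)

def kernel(x: str, y: str) -> float:
    """Inner product of the two feature vectors, summing over shared features."""
    fx, fy = phi(x), phi(y)
    return float(sum(fx[f] * fy[f] for f in fx.keys() & fy.keys()))

print(kernel("located", "capital"))  # shared characters c, a, t, l -> 5.0
print(kernel("located", "capital") == kernel("capital", "located"))  # symmetric
```

In practice the benefit is that K can often be computed without ever materializing φ(x), which is exactly what makes kernels over trees and sequences tractable.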
Kernel-based learning methods are also used for natural language processing. Linear, polynomial, and Gaussian kernels are typically used in simple feature vector-based machine learning. A convolution kernel [11] is used for the efficient learning of structural data such as trees or graphs. A convolution kernel is the type of kernel function underlying sequence kernels [26], tree kernels [14, 34, 39, 40, 42], and graph kernels [19]. A convolution kernel can measure an overall similarity by defining “sub-kernels” for measuring the similarity between the components of an individual entity and by calculating the similarity convolution among the components. For example, to measure their similarity, a sequence kernel divides two relevant sequences into subsequences, over which the overall similarity is calculated using a similarity measurement. Likewise, a tree kernel divides a tree into its sub-trees, whose similarities are then calculated along with the convolution of these similarities.
³ Dependency grammar relations, parse trees, etc., between words in a sentence.
As described above, another advantage of kernel methods is that learning is possible with a single kernel function over input instance collections of different types. For example, the authors in [9, 41] demonstrated a composite kernel in which a convolution parse tree kernel is combined with an entity kernel for high-performance relation extraction.
4 Kernel-based relation extraction
In this section, five important research results regarding kernel-based relation extraction are discussed. Of course, several other important studies have also drawn a significant amount of attention owing to their high performance. Most of these approaches, however, simply modify or supplement the five methods discussed below.
4.1 Tree kernel-based method [39]
This study is known as the first application of a kernel method to relation extraction. The parse trees derived from shallow parsing are used for measuring the similarity between sentences containing certain entities. In [3], REES (Relation and Event Extraction System) is used to analyze the parts of speech and types of individual words within a sentence, as well as the syntactic structure of the sentence in question. Figure 2 below shows an example result from the use of REES.
As shown in Fig. 2, when the syntactic analysis of the input sentence is completed, the particular information regarding the words of the sentence is analyzed and extracted. Four types of attribute information are attached to all words or entities other than articles and stop words. The Type represents the part-of-speech or entity type of the current word. While “John Smith” is
Fig. 2 Exemplary shallow parsing result for relation extraction
the “Person” type, “scientist” is specified as a “PNP,” representing a personal noun. The Head represents the presence of key words of composite nouns or prepositional phrases. The Role represents the relation between the two entities. In Fig. 2, “John Smith” is a “member” of “Hardcom C.,” and in turn, “Hardcom C.” is an “affiliation” of “John Smith.”
As the figure shows, it is possible to identify and extract the relation between two entities in a specific sentence at a given level if REES is used. However, since the REES system is rule-based, it has certain limitations in terms of scalability and generality [39]. To overcome these limitations and the resulting low performance through machine learning, Zelenko et al. [39] constructed tree kernels based on the REES analysis results with a view to better relation extraction. Zelenko et al. [39] proposed two types of tree kernel functions: a sparse subtree kernel (SSK) and a continuous subtree kernel (CSK). An SSK includes node subsequences for comparison even when they are not contiguously connected. For example, “Person, PNP” at the left-middle and “Person, PNP” at the right-middle of Fig. 3 are node sequences separated in the parse tree. An SSK includes such subsequences for comparison. On the contrary, a CSK does not allow such subsequences and excludes them from comparison.
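The SSK/CSK distinction above can be illustrated on flat node sequences: the sparse kernel considers gapped subsequences as candidates for comparison, while the continuous kernel considers only contiguous runs. The node labels below are toy values echoing Fig. 3, not output of the actual system, and real subtree kernels of course operate recursively on trees rather than on flat lists.

```python
# Illustrative contrast between SSK-style (gaps allowed) and CSK-style
# (contiguous only) candidate subsequences, shown on a flat node sequence.
from itertools import combinations

def sparse_subsequences(nodes, n):
    """All length-n subsequences, gaps allowed (SSK-style candidates)."""
    return [tuple(nodes[i] for i in idx)
            for idx in combinations(range(len(nodes)), n)]

def contiguous_subsequences(nodes, n):
    """Only contiguous length-n runs (CSK-style candidates)."""
    return [tuple(nodes[i:i + n]) for i in range(len(nodes) - n + 1)]

nodes = ["Person", "PNP", "Organization", "PNP"]
print(len(sparse_subsequences(nodes, 2)))      # C(4, 2) = 6 candidates
print(len(contiguous_subsequences(nodes, 2)))  # 3 contiguous runs
# The gapped pair (nodes[0], nodes[3]) appears only in the sparse set.
```

The extra gapped candidates are why the SSK compares more node pairs, and why the CSK, which discards them, tends to be stricter.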
To measure the performance of the two proposed tree kernels, the authors in [39] used 60 % of the manually constituted data as a learning collection, and carried out a ten-fold cross validation. Only two relations, “Person-Affiliation” and “Organization-Location,” were tested. The test revealed that the kernel-based method offers a better performance than the feature-based method, and that the continuous subtree kernel excels over the sparse subtree kernel. In particular, the tree kernel proposed in [10] has inspired new research on tree kernels.
The study in [10] is generally recognized as an important contribution toward devising kernels for efficient similarity measurements between very complex tree-type structural entities to be used later for relation extraction. Since a lot of information, including syntactic information, is still required, the method depends highly on the performance of the REES system for creating parse trees. Because the quantity of data was not sufficient for their test, and a binary classification test was carried out on only two types of relations, the authors in [10] did not analyze the scalability or generality of the proposed kernel in detail.
4.2 Dependency tree kernel-based method [14]
Relation extraction has been carefully studied since the ACE collection was first constituted and distributed beginning in 2004. Culotta and Sorensen [14] proposed a kernel-based relation extraction method that uses a dependency parse tree structure based on the tree kernel proposed by Zelenko et al. [39], as described in Section 4.1. This special parse tree, called an Augmented Dependency Tree, uses MXPOST [33] to carry out parsing, and then modifies some
Fig. 3 Execution of a kernel computation
syntactic rules based on the results. Exemplary rules include “subjects are dependent on verbs” and “adjectives are dependent on the nouns they describe.” To improve the analysis results, this study uses even more complex node features than those used in [39]. The hypernyms extracted from WordNet [16] are applied to expand the matching function between nodes. A composite kernel is then constructed and applied to relation extraction for the first time.
Figure 4 shows a sample of the augmented dependency tree used in this study. The root of the tree is the verb “advanced,” and the resultant subject and preposition are the child nodes of the root. The object of the preposition, “Tikrit,” is the child node of the preposition “near.” Each node has eight types of node feature information. Table 1 outlines each node feature.
In Table 1, the first four features (words, detailed/general part-of-speech information, and phrase information) are obtained from parsing, and the rest are named entity features from the ACE collection. Among them, the WordNet hypernym is the result of extracting the highest node for the corresponding word from the WordNet database.
As discussed above, the tree kernel defined in [39] is used in this method. Since features are added to each node, the matching and similarity functions defined by Zelenko et al. [39] are modified and applied accordingly. In detail, among the eight features, those to be applied to the matching function and those to be applied to the similarity function were dynamically divided, devising the following model for application.
t_i : feature vector representing node i
t_j : feature vector representing node j
t_i^m : subset of t_i used in the matching function
t_i^s : subset of t_i used in the similarity function

m(t_i, t_j) = 1 if t_i^m = t_j^m, and 0 otherwise

s(t_i, t_j) = Σ_{v_q ∈ t_i^s} Σ_{v_r ∈ t_j^s} C(v_q, v_r)   (1)
Fig. 4 Sample augmented dependency tree (“Troops advanced near Tikrit”)
where m is the matching function, s is the similarity function, and t_i is the feature collection of node i. In addition, C(•,•) is a function used for comparing two feature values based on approximate matching rather than simple perfect matching. For example, the recognition of “NN” and “NP,” shown in the detailed part-of-speech information in Table 1, as the same part of speech is implemented by modifying the internal rules of this function. Equations 3 and 4 in Section 4.1 are applied as tree kernel functions for comparing the similarity of two augmented dependency trees based on these two basic functions.
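A minimal sketch of the matching/similarity split in Eq. 1 follows, under the assumption that each node is a dictionary of named features. Which features form the matching subset t^m versus the similarity subset t^s, and the exact-match stand-in for C(•,•), are illustrative choices, not the paper's actual configuration.

```python
# Hypothetical split of node features into the matching subset t^m and
# the similarity subset t^s; the assignment below is for illustration only.
MATCH_FEATURES = ("general_pos", "entity_type")   # t^m: must agree exactly
SIM_FEATURES = ("word", "detailed_pos")           # t^s: compared softly

def C(vq, vr):
    """Feature-value comparison; plain equality stands in here for the
    approximate matching rules (e.g., treating NN and NP alike)."""
    return 1.0 if vq == vr else 0.0

def m(ti, tj):
    """Matching function of Eq. 1: 1 iff the matching subsets are identical."""
    return 1 if all(ti[f] == tj[f] for f in MATCH_FEATURES) else 0

def s(ti, tj):
    """Similarity function of Eq. 1: sum of C over all value pairs in t^s."""
    return sum(C(ti[f1], tj[f2]) for f1 in SIM_FEATURES for f2 in SIM_FEATURES)

a = {"word": "troops", "detailed_pos": "NNS", "general_pos": "Noun", "entity_type": "person"}
b = {"word": "forces", "detailed_pos": "NNS", "general_pos": "Noun", "entity_type": "person"}
print(m(a, b), s(a, b))  # nodes match; one soft feature pair (NNS, NNS) agrees
```

In the full kernel, s contributes a graded score only for node pairs that m has already accepted, which is what lets the tree comparison recurse without exploding.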
For the evaluation, the initial ACE collection version (2002) released in 2003 was used. This collection defines five entity types and 24 types of relations. Culotta & Sorensen [14] tested relation extraction for only the five higher types of relation collections (“AT,” “NEAR,” “PART,” “ROLE,” and “SOCIAL”). The kernels tested were the sparse subtree kernel (K0), the continuous subtree kernel (K1), and the bag-of-words kernel (K2). In addition, two composite kernels in which the tree kernel was combined with the bag-of-words kernel, that is, K3=K0+K2 and K4=K1+K2, were also constituted. A test consisting of a relation detection⁴ step and a relation classification⁵ step revealed that all tree kernel methods, including the composite kernels, show a better performance than the bag-of-words kernel. Unlike the evaluation results in [39], although the performance of the continuous subtree kernel is higher, the reason for this was not clearly described; the authors also failed to demonstrate the advantage of using a dependency tree instead of a full parse tree in their experiment.
4.3 Shortest path dependency tree kernel method [6]
In Section 4.2, we discussed relation extraction using dependency parse trees with the tree kernel proposed by Zelenko et al. [39]. Bunescu & Mooney [6] studied the dependency path between two named entities in the dependency parse tree, proposing the shortest path dependency kernel for relation extraction. There is always a dependency path between two named entities in a sentence, and Bunescu & Mooney [6] argued that the relation extraction performance is improved using such syntactic paths. Figure 5 below shows the dependency graph for a sample sentence.
The red nodes in Fig. 5 are the named entities specified in the ACE collection. The separation of the entire dependency graph results in ten dependency syntax pairs. To construct a dependency path, it is possible to select the pairs that include named entities from among the syntax pairs, as shown in Fig. 6.
⁴ Binary classification for identifying the possible relation between two named entities.
⁵ Relation extraction for all instances with relations shown in the relation identification results.
Table 1 Node features in the augmented dependency tree

Feature                  Example
Words                    troops, Tikrit
Detailed POS (24)        NN, NNP
General POS (5)          Noun, Verb, Adjective
Chunking Information     NP, VP, ADJP
Entity Type              person, geo-political-entity
Entity Level             name, nominal, pronoun
WordNet hypernyms        social group, city
Relation Argument        ARG_A, ARG_B
As shown in Fig. 6, it is possible to construct dependency paths between the named entities “Protesters” and “stations,” and between “workers” and “stations.” As discussed above, the dependency path connecting two named entities in this sentence can be extended indefinitely. Bunescu & Mooney [6] estimated that the shortest path contributes the most to establishing the relation between the two entities. Therefore, it is possible to use kernel-based learning for estimating the relation between two named entities connected through a dependency path. For example, for the path “protesters → seized ← stations,” the relation is estimated as follows: the PERSON entity (“protesters”) performed a specific action (“seized”) on the FACILITY entity (“stations”), through which PERSON (“protesters”) is located at FACILITY (“stations”) (“LOCATED_AT”). For the more complex path, “workers → holding ← protesters → seized ← stations,” it is possible to estimate the relation in which PERSON (“workers”) is located at FACILITY (“stations”) (“LOCATED_AT”), given that PERSON (“protesters”) performed a particular action (“holding”) on PERSON (“workers”), and PERSON (“protesters”) performed an action (“seized”) on FACILITY (“stations”). As such, using a dependency relation path, it is possible to identify the semantic relation between two entities more intuitively.
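The shortest-path idea above can be sketched as a breadth-first search over the dependency graph, treated as undirected. The edge list below is a toy reconstruction echoing the Fig. 5 sentence, not the parser's actual output.

```python
# Sketch: recover the shortest dependency path between two entities by BFS
# over undirected dependency edges. The edges are a toy reconstruction of
# the example sentence, not real parser output.
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search; BFS returns a minimum-length path first."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path between the two nodes

edges = [("seized", "protesters"), ("seized", "stations"),
         ("protesters", "holding"), ("holding", "workers")]
print(shortest_path(edges, "protesters", "stations"))
print(shortest_path(edges, "workers", "stations"))
```

This recovers exactly the two example paths discussed above: the direct “protesters → seized ← stations” path and the longer one passing through “holding.”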
For learning, Bunescu & Mooney [6] extracted the shortest dependency paths, including the two entities, from individual training instances, as shown in Table 2 below.
Fig. 5 Dependency graph and dependency syntax pair list for a sample sentence
Fig. 6 Extracting dependency path including named entities from a dependency syntax pair collection
As shown in Table 2, each relation instance is expressed as a dependency path, both ends of which are named entities. In terms of learning, however, it is not easy to extract sufficient features from such instances. Therefore, as discussed in Section 4.2, various types of supplementary information are created, such as the part of speech, entity type, and WordNet synset. As a result, the individual nodes that make up the dependency path comprise a plurality of information elements, and a variety of new paths are finally created, as shown in Fig. 7.
As shown in Fig. 7, with more information available for the individual nodes, 48 new dependency paths can be created through the Cartesian product of the node values. Here, relation extraction is carried out by applying the dependency path kernel, which calculates the redundancy of the information included in each node rather than comparing all of the newly created paths, as shown below.
x = x_1 x_2 ⋯ x_m,  y = y_1 y_2 ⋯ y_n
x_i : the set of word classes corresponding to position i in x
y_i : the set of word classes corresponding to position i in y

K(x, y) = 0 if m ≠ n, and Π_{i=1}^{n} c(x_i, y_i) if m = n,
where c(x_i, y_i) = |x_i ∩ y_i|   (2)
In Eq. 2 above, x and y represent extended individual instances, m and n denote the lengths of the dependency paths, K(•,•) denotes the dependency path kernel, and c(•,•) is a function for calculating the level of information element redundancy between two nodes. Figure 8 below shows the process of calculating the kernel value based on Eq. 2.
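Eq. 2 can be sketched directly: instead of enumerating the Cartesian product of node values (4 × 3 × 4 = 48 paths for the Fig. 7 example), the kernel multiplies the overlap |x_i ∩ y_i| of the feature sets at each path position, and returns zero for paths of different lengths. The second path below is an invented comparison instance for illustration, not an example from the paper.

```python
# Sketch of the dependency path kernel of Eq. 2: product over path positions
# of the feature-set overlaps, zero if the paths differ in length.
def path_kernel(x, y):
    """x, y: lists of feature sets, one set per dependency path position."""
    if len(x) != len(y):
        return 0            # K(x, y) = 0 when m != n
    k = 1
    for xi, yi in zip(x, y):
        k *= len(xi & yi)   # c(x_i, y_i) = |x_i intersect y_i|
    return k

x = [{"protesters", "NNS", "Noun", "Person"},   # feature sets per Fig. 7
     {"seized", "VBD", "Verb"},
     {"stations", "NNS", "Noun", "Facility"}]
y = [{"workers", "NNS", "Noun", "Person"},      # invented comparison path
     {"occupied", "VBD", "Verb"},
     {"offices", "NNS", "Noun", "Facility"}]
print(path_kernel(x, y))  # overlaps 3 * 2 * 3 = 18
```

Note that although the surface words differ at every position, the kernel is still high because the POS and entity-type layers agree, which is precisely the generalization the extended node sets are meant to buy.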
Based on the same test environment and collection used by Culotta & Sorensen [14], two parsing systems, a CCG parser [20] and a CFG parser [10], were used to construct the shortest dependency paths. For a performance comparison, the test included K4 (bag-of-words kernel + continuous subtree kernel), which showed the best performance for Culotta & Sorensen [14]. The test revealed that the shortest dependency path kernel offers a better performance when a CFG parser is used instead of a CCG parser for the same kernel.
With regard to relation extraction, the shortest dependency path information is considered to be very useful, and is likely to be used in various fields. However, the kernel structure is
Table 2 Shortest path dependency tree-based sample relation extraction instances

Relation      Relation instance
LOCATED_AT    protesters → seized ← stations
LOCATED_AT    workers → holding ← protesters → seized ← stations
LOCATED_AT    detainees → abusing ← Jelisic → created ← at ← camp
[protesters, NNS, Noun, Person] × → × [seized, VBD, Verb] × ← × [stations, NNS, Noun, Facility]
Fig. 7 New dependency path information created in a single instance
too simple. Yet another limitation is that only paths of the same length are included when calculating the similarity of two dependency paths.
4.4 Subsequence kernel-based method [7]
The tree kernel presented by Zelenko et al. [39] compares two sibling nodes at basically the same level and uses a subsequence kernel. Bunescu & Mooney [7] introduced the subsequence kernel and attempted relation extraction using only a base phrase analysis (chunking), without applying the syntactic structure. Since the kernel input does not have a complex syntactic structure but instead utilizes base phrase sequences, the contextual information essential for relation extraction is easy to select; under this assumption, the feature space can be divided into the following three different types, with a maximum of four words for each type of feature.
In Fig. 8, [FB] represents the words positioned before and between the two entities, while [B] and [BA] indicate the words between, and the word collections between and after, the two entities, respectively. The three types of feature collections can capture individual relation expressions. Furthermore, various types of supplementary word information (part-of-speech, entity type, WordNet synset, etc.) are used to expand the features, as in the methods described above.
Zelenko et al. [39] described how to calculate the subsequence kernel, which will be described in detail again later. The kernel function K_n(s, t, λ) is defined as shown below over all subsequences of length n contained in the two sequences s and t.
K_n(s, t, \lambda) = \sum_{i:|i|=n} \sum_{j:|j|=n} \lambda^{l(i)+l(j)} \prod_{k=1}^{n} c(s_{i_k}, t_{j_k})    (3)
where i and j represent subsequences contained in s and t, respectively, c(•,•) is a function for determining the homogeneity of its two inputs, and λ is a weight given to matching subsequences. In addition, l(i) and l(j) are the lengths of the spans that the subsequences i and j cover within their entire sequences. Equation 3 thus selects only the n-length subsequences that exist in both s and t and weights each match by its span. For a simple description, the following two sentences and base phrase analysis results will be used to explain the process of calculating the kernel value.
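A brute-force rendering of Eq. 3 may make the roles of λ, l(•), and c(•,•) concrete. The sketch below is ours, assuming l(i) is the span covered by subsequence i (last index minus first index plus one), following Bunescu & Mooney's formulation; real implementations use a dynamic program rather than this exponential enumeration.

```python
from itertools import combinations

def k_n(s, t, n, lam, c):
    """Naive Eq. 3: sum over all length-n subsequences i of s and j of t
    of lam**(l(i) + l(j)) * prod_k c(s[i_k], t[j_k])."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            match = 1.0
            for a, b in zip(i, j):
                match *= c(s[a], t[b])
                if match == 0.0:
                    break
            if match:
                # l(i), l(j): spans covered by the two subsequences
                span = (i[-1] - i[0] + 1) + (j[-1] - j[0] + 1)
                total += (lam ** span) * match
    return total

eq = lambda a, b: 1.0 if a == b else 0.0  # simple homogeneity function
print(k_n(list("abc"), list("abc"), 2, 1.0, eq))  # 3 matching pairs -> 3.0
print(k_n(list("abc"), list("abc"), 2, 0.5, eq))  # 0.0625 + 0.015625 + 0.0625 = 0.140625
```

With λ < 1, the gapped match ("a…c") is downweighted relative to the contiguous ones, which is the behavior the worked example below relies on.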
As shown in Fig. 9, each of the nodes used in the analysis has eight types of lexical information (word, part-of-speech, base phrase type, entity type, etc.). The kernel value K_3(s, t, 0.5) of the two analysis sequences is calculated according to the process shown in Fig. 10, using the features of subsequences of length 3.
• [FB] Fore-Between: 'interaction of [P1] with [P2]', 'activation of [P1] by [P2]'
• [B] Between: '[P1] interacts with [P2]', '[P1] is activated by [P2]'
• [BA] Between-After: '[P1] – [P2] complex', '[P1] and [P2] interact'

Fig. 8 Contextual location information for feature extraction

There are three subsequence pairs among the subsequences in s and t for which the homogeneity decision function c(•,•) yields a nonzero value for every node pair. The score for each matching subsequence pair is derived by calculating the cumulative product of c(•,•) over the subsequence and then multiplying it by the weight. For example, a similarity of 0.84375 is obtained for "troops advanced near" and "forces moved toward." In contrast, the similarity of "troops advanced … Tikrit" and "forces moved … Baghdad" is 0.21093. This results from the lowered weight, since the two subsequences are positioned farther apart. Finally, the similarities of the subsequences are summed through Eq. 3, which gives 1.477.
Fig. 9 Two sentences and base phrase analysis results used to illustrate the process of calculating a subsequence kernel
Fig. 10 Process used in calculating K_3(s, t, 0.5)
As described above, it is possible to construct the subsequence kernel function based on subsequences of all lengths using contextual location information. Figure 11 shows the surrounding contextual locations where the named entities occur in a sentence.
In Fig. 11, x_i and y_i indicate the named entities, s_f and t_f denote the word lists before the named entities, s_a and t_a are the contextual word collections after the two entities, and s'_b and t'_b represent the contextual information including the two entities. Thus, the subsequence kernel for the two sequences s and t is defined using Eq. 4 as follows.
K(s, t) = K_{fb}(s, t) + K_{b}(s, t) + K_{ba}(s, t)

K_{b,i}(s, t) = K_i(s_b, t_b, 1) \cdot c(x_1, y_1) \cdot c(x_2, y_2) \cdot \lambda^{l(s'_b) + l(t'_b)}

K_{fb}(s, t) = \sum_{i \ge 1,\ j \ge 1,\ i+j < fb_{max}} K_{b,i}(s, t) \cdot K'_j(s_f, t_f)

K_{b}(s, t) = \sum_{1 \le i \le b_{max}} K_{b,i}(s, t)

K_{ba}(s, t) = \sum_{i \ge 1,\ j \ge 1,\ i+j < ba_{max}} K_{b,i}(s, t) \cdot K'_j(s_a, t_a)    (4)
The subsequence kernel K(s, t) consists of the sum of the contextual kernel before and between the entity pairs, K_fb; the intermediate contextual kernel of the entities, K_b; and the contextual kernel between and after the entity pairs, K_ba. The definitions of the individual contextual kernels are given in the third through fifth lines of Eq. 4. Here, K'_n is the same as K_n, with the exception that it measures the length of the relevant subsequence from the location where the subsequence starts to the end of the entire sequence, and is defined as follows.
K'_n(s, t, \lambda) = \sum_{i:|i|=n} \sum_{j:|j|=n} \lambda^{|s| + |t| - i_1 - j_1 + 2} \prod_{k=1}^{n} c(s_{i_k}, t_{j_k})    (5)
In Eq. 5, i_1 and j_1 indicate the starting positions of the subsequences i and j, respectively. Using Eq. 5, each individual contextual kernel calculates the similarity between the two sequences over the subsequences falling within the regions specified in Fig. 11, and all of the kernel values are summed to determine the final total value.
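Equation 5 differs from Eq. 3 only in the weight exponent, which counts from each subsequence's start position to the end of its sequence. A naive sketch (ours, using the 1-based start positions i_1 and j_1 as in the equation):

```python
from itertools import combinations

def k_prime_n(s, t, n, lam, c):
    """Naive Eq. 5: the weight is lam**(|s| + |t| - i1 - j1 + 2), where i1
    and j1 are the 1-based start positions of the matching subsequences."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            match = 1.0
            for a, b in zip(i, j):
                match *= c(s[a], t[b])
            if match:
                exp = len(s) + len(t) - (i[0] + 1) - (j[0] + 1) + 2
                total += (lam ** exp) * match
    return total

eq = lambda a, b: 1.0 if a == b else 0.0
# matches at positions (1,1) and (2,2): 0.5**4 + 0.5**2 = 0.3125
print(k_prime_n(["a", "b"], ["a", "b"], 1, 0.5, eq))
```

Matches that start earlier in their sequences are penalized more, since more trailing material counts toward the exponent.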
Fig. 11 Specifying the contextual information depending on the named entity location and defining variables
The performance evaluation under the same test environment described in Sections 4.2 and 4.3 shows an increase in performance, even without complicated pre-processing, such as parsing, and without any syntactic information. In conclusion, the evaluation shows that this method is very fast in terms of learning speed and has potential for additional improvements in performance.
4.5 Composite kernel-based method [41]
The release of the ACE 2003 and 2004 versions contributed to full-scale studies on relation extraction. In particular, these collections are characterized by even richer information for tagged entities. For example, ACE 2003 provides various entity features, e.g., entity headword, entity type, and entity subtype for a specific named entity, and these features have been used as important clues for determining the relation between two entities in a specific sentence. In this context, Zhang et al. [40, 41] built a composite kernel in which the convolution parse tree kernel proposed by Collins & Duffy [11] is combined with an entity feature kernel. Equation 6 provides the entity kernel definition.
K_L(R_1, R_2) = \sum_{i=1,2} K_E(R_1.E_i, R_2.E_i)

K_E(E_1, E_2) = \sum_i C(E_1.f_i, E_2.f_i)

C(f_1, f_2) = \begin{cases} 1, & \text{if } f_1 = f_2 \\ 0, & \text{otherwise} \end{cases}    (6)
In Eq. 6, R_i indicates a relation instance, and R_i.E_j is the j-th entity of R_i. Additionally, E_i.f_j is the j-th entity feature of E_i, and C(•,•) is a homogeneity function for two entity features. The entity kernel K_L is calculated through a summation based on the feature-matching kernel K_E over each pair of entities.
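As a concrete reading of Eq. 6, the sketch below (ours; the list layout and the example feature values are hypothetical) counts, entity by entity, how many of the aligned entity features agree exactly:

```python
def k_e(e1, e2):
    # K_E: number of exactly matching features between two entities
    return sum(1 for f1, f2 in zip(e1, e2) if f1 == f2)

def k_l(r1, r2):
    # K_L: sum K_E over the two entities of each relation instance
    return sum(k_e(r1[i], r2[i]) for i in range(2))

# Hypothetical instances: (entity type, subtype, headword) per entity
r1 = [("PER", "individual", "president"), ("ORG", "government", "army")]
r2 = [("PER", "individual", "minister"), ("ORG", "commercial", "army")]
print(k_l(r1, r2))  # (2 matches) + (2 matches) = 4
```

The result is an integer count of shared flat features, which is what the composite kernel later normalizes and mixes with the tree kernel.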
Second, the convolution parse tree kernel expresses a parse tree as an occurrence frequency vector over subtrees in order to measure the similarity between two parse trees, as shown below.
f(T) = (\#subtree_1(T), \ldots, \#subtree_i(T), \ldots, \#subtree_n(T))    (7)
In Eq. 7, #subtree_i(T) represents the occurrence frequency of the i-th subtree. All parse trees are expressed as vectors in this way, and the kernel function is calculated as the inner product of two such vectors, as follows.
K(T_1, T_2) = \langle f(T_1), f(T_2) \rangle    (8)
Figure 12 shows all subtrees of a specific parse tree. There are nine subtrees in the figure altogether, and each subtree is an axis of the vector that expresses the parse tree on the left side. If the number of all unique subtrees that can be extracted from N parse trees is M, each parse tree can be expressed as an M-dimensional vector.
As shown in Fig. 12, there are two constraints on a subtree of a specific parse tree. First, a subtree must have at least two nodes, and second, the subtree should comply with the grammar production rules. For example, [VP → VBD → "got"] cannot become a subtree.
It is necessary to count the frequency of every subtree in a tree T in order to build the vector for each parse tree. This process is quite inefficient, however. Since we only need to compute the similarity of two parse trees in the kernel-based method, we can use an indirect kernel function without building a subtree vector for each parse tree as in Eq. 7. Equation 9, proposed by Collins & Duffy [11], is used to efficiently calculate the similarity of two parse trees.
K(T_1, T_2) = \langle f(T_1), f(T_2) \rangle = \sum_i \#subtree_i(T_1) \cdot \#subtree_i(T_2)
            = \sum_i \Big( \sum_{n_1 \in N_1} I_{subtree_i}(n_1) \Big) \cdot \Big( \sum_{n_2 \in N_2} I_{subtree_i}(n_2) \Big)
            = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \Delta(n_1, n_2)

where N_1 and N_2 are the sets of nodes in the trees T_1 and T_2,

I_{subtree_i}(n) = \begin{cases} 1, & \text{if } ROOT(subtree_i) = n \\ 0, & \text{otherwise} \end{cases}

\Delta(n_1, n_2) = \sum_i I_{subtree_i}(n_1) \cdot I_{subtree_i}(n_2)    (9)
In Eq. 9, T_i represents a specific parse tree, N_i is the set of all nodes of that tree, and I_{subtree_i}(n) is the indicator function used for checking whether node n is the root node of the specific subtree, subtree_i. The most time-consuming part of Eq. 9 is calculating Δ(n_1, n_2). To speed this up, Collins & Duffy [11] came up with the following algorithm.

Fig. 12 Parse tree and its subtree collection

Fig. 13 Algorithm used for calculating Δ(n_1, n_2)

The function Δ(n_1, n_2) defined in step (3) of Fig. 13 compares the child nodes of the input nodes to calculate the frequency of subtrees contained in both parse trees, and the product thereof, until the end conditions defined in (1) and (2) are satisfied. In this case, the decay factor λ, a variable used for limiting large subtrees in order to address the fact that larger subtrees of a parse tree comprise other subtrees, can be applied repeatedly when calculating the inner product of the subtree vectors.
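The recursion just described follows directly from Eq. 9. The version below is our illustration (the tuple-based node representation and helper names are ours): Δ is 0 when the productions at the two nodes differ, λ when both nodes are pre-terminals with the same production, and otherwise λ times the product over aligned children of (1 + Δ).

```python
def production(node):
    # A node is (label, children); a production is the label plus the
    # sequence of child labels.
    label, children = node
    return (label, tuple(c[0] for c in children))

def delta(n1, n2, lam):
    # (1) different productions -> no common subtree rooted here
    if production(n1) != production(n2):
        return 0.0
    kids1, kids2 = n1[1], n2[1]
    # (2) same production over pre-terminals -> exactly one common subtree
    if all(not c[1] for c in kids1):
        return lam
    # (3) otherwise multiply (1 + delta) over the aligned children
    out = lam
    for c1, c2 in zip(kids1, kids2):
        out *= 1.0 + delta(c1, c2, lam)
    return out

def tree_kernel(t1, t2, lam=0.5):
    def nodes(t):
        label, kids = t
        if not kids:
            return []  # leaf words are not counted as nodes
        return [t] + [n for k in kids for n in nodes(k)]
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))

# ("NP" (DT the) (NN dog)) has 6 valid subtrees, so K(T, T) with lam=1 is 6
t = ("NP", [("DT", [("the", [])]), ("NN", [("dog", [])])])
print(tree_kernel(t, t, 1.0))  # 6.0
```

Setting λ < 1 damps the contribution of large shared subtrees, as the decay-factor discussion above requires.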
The two kernels built as described above, that is, the entity kernel and the convolution parse tree kernel, are combined in the following two ways.
K_1(R_1, R_2) = \alpha \cdot \frac{K_L(R_1, R_2)}{\sqrt{K_L(R_1, R_1) \cdot K_L(R_2, R_2)}} + (1-\alpha) \cdot \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}}    (10–1)

K_2(R_1, R_2) = \alpha \cdot \left( \frac{K_L(R_1, R_2)}{\sqrt{K_L(R_1, R_1) \cdot K_L(R_2, R_2)}} + 1 \right)^2 + (1-\alpha) \cdot \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}}    (10–2)
In Eqs. 10–1 and 10–2, K_L represents the entity kernel and K stands for the convolution parse tree kernel. Equation 10–1 shows that the composite kernel is a linear combination of the two kernels, and Eq. 10–2 defines the composite kernel constructed using a quadratic polynomial combination.
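Both combinations first normalize each kernel before mixing. An illustrative sketch (ours; the toy dot-product kernels merely stand in for K_L and the tree kernel K):

```python
import math

def normalize(k, a, b):
    # Cosine-style kernel normalization used in Eqs. 10-1 and 10-2
    return k(a, b) / math.sqrt(k(a, a) * k(b, b))

def k1(kl, kt, alpha, r1, r2, t1, t2):
    # Eq. 10-1: linear combination of the two normalized kernels
    return alpha * normalize(kl, r1, r2) + (1 - alpha) * normalize(kt, t1, t2)

def k2(kl, kt, alpha, r1, r2, t1, t2):
    # Eq. 10-2: quadratic polynomial combination on the entity-kernel side
    return alpha * (normalize(kl, r1, r2) + 1) ** 2 \
        + (1 - alpha) * normalize(kt, t1, t2)

# Toy stand-in kernels: plain dot products over feature vectors
dot = lambda x, y: sum(a * b for a, b in zip(x, y))
r = (1.0, 0.0)  # stands in for a relation instance's entity features
t = (0.0, 2.0)  # stands in for a parse tree
print(k1(dot, dot, 0.4, r, r, t, t))  # 0.4*1 + 0.6*1 = 1.0
print(k2(dot, dot, 0.4, r, r, t, t))  # 0.4*(1+1)**2 + 0.6*1 = 2.2
```

The squared term in k2 implicitly generates pairwise conjunctions of entity features, which is one reason the polynomial combination can outperform the linear one.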
For the evaluation, Zhang et al. [40, 41] used both ACE 2003 and ACE 2004. They parsed all available relation instances using Charniak's parser [8] and, based on the parsing results, carried out an instance conversion using the method described in Table 3. The convolution kernel tool developed by Moschitti [29] and SVMLight [22] were used for learning and classification.
The evaluation shows that the composite kernel achieves a better performance than a single syntactic kernel, and that the quadratic polynomial combination outperforms the linear combination of the two kernels. This means that flat (entity type) and structural (syntactic) features can be organically combined in a single kernel function. Considering that the path-enclosed tree method shows the best performance among all relation instance pruning methods, the relation between two entities in a specific sentence can evidently be estimated using only the core relation-related syntactic information.
4.6 Other recent studies
Choi et al. [9] constructed and tested a composite kernel in which various lexical and contextual features are added by extending the existing composite kernel. In addition to the syntactic feature, they extended the combination range of the flat lexical features from the entity features to the contextual features to achieve a higher performance. Mintz et al. [27] proposed a new method that uses Freebase, a semantic database covering thousands of relation types, to gather exemplary sentences for a specific relation and to perform relation extraction based on the exemplary sentences obtained. In their test, a collection of 10,000 instances corresponding to 102 relation types yielded an accuracy of 67.6 %. In addition, T.-V. T. Nguyen et al. [32] designed a new kernel that extends the existing convolution parse tree kernel, and Reichartz et al. [34] proposed a method that extends the dependency tree kernel. As described previously, most of the studies published thus far are based on the kernels described in Sections 4.1 through 4.5.

Table 3 Description of the ACE collections

Items                          ACE-2002  ACE-2003  ACE-2004
# training documents           422       674       451
# training relation instances  6,156     9,683     5,702
# test documents               97        97        N/A
# test relation instances      1,490     1,386     N/A
# entity types                 5         5         7
# major relation types         5         5         7
# relation sub-types           24        24        23
5 Comparison and analysis
In the previous section, five kernel-based relation extraction methods were analyzed in detail. Here, we discuss the results of a comparison and analysis of these methods. Section 5.1 briefly describes the criteria used for the comparison and analysis, and Section 5.2 compares the characteristics of the methods. Section 5.3 then covers the performance results in detail. Finally, Section 5.4 summarizes the advantages and disadvantages of each method.
5.1 Criteria for comparison and analysis
A large variety of criteria can generally be used for comparing kernel-based relation extraction methods. The following six criteria, however, were selected and used in this study: (1) The linguistic analysis and pre-processing method refers to the pre-processing and analysis applied to the individual instances composing the learning and evaluation collections, e.g., the type of parsing method or parsing system used. (2) The level of linguistic analysis, a criterion related to (1), refers to the depth of linguistic analysis carried out when pre-processing and analyzing instances; exemplary levels include part-of-speech tagging, base phrase analysis, and dependency or full parsing. (3) The method used for selecting the feature space determines whether the substantial input of the kernel function is an entire sentence or only a part thereof. (4) The applied lexical and supplementary feature information refers to the various supplementary features used for addressing the issue of sparse data. (5) The relation extraction method is the practical procedure for extracting relations from the previously constructed learning models: either multi-class classification in a single step, or a cascade method that first separates instances with relations from those without by means of binary classification and then carries out relation classification only for the instances with relations. (6) The manual work requirement indicates whether the entire process is carried out fully automatically or whether manual work is required at a particular step. These six criteria were used to analyze the kernel-based relation extraction methods, the results of which are shown in Table 5.
In addition, to describe the characteristics of each kernel function, the description in Section 4 is summarized, and the structure of each input of the kernel function is described. Modifications of the kernel functions for optimized speed are also included in the analysis criteria. For a performance comparison of the individual methods, the types and scale of the tested collections and tested relations are analyzed and described in detail. Table 3 describes the ACE collections generally used for the relation extraction methods developed thus far.
As shown in Table 3, the generally used ACE collection can be divided into three versions. ACE-2002, however, is not widely used owing to its consistency and quality problems. There are five to seven entity types, e.g., Person, Organization, Facility, Location, and Geo-Political Entity. The relations in all collections are structured at two levels, consisting of 23 to 24 relation sub-types corresponding to major relation types such as Role, Part, Located, Near, and Social. As the method used for constructing these collections has advanced and their quality has improved, the tendency has been toward a reduction in the scale of the training instances. Although subsequent collections have since been constructed, they have not been made public, in accordance with the non-disclosure policy of the ACE Workshop. This policy should be improved for the benefit of more active studies.
5.2 Comparison of the characteristics
Before comparing the methods against the six criteria described in Section 5.1, Table 4 summarizes the concept of each kernel function used in the kernel-based relation extraction methods.
Table 5 shows the comparison and analysis of the kernel-based relation extraction methods. The closer to the right side of the table a method is, the more recently it was introduced. A characteristic found in all of the above methods is that various types of feature and syntactic information are used. Such heterogeneous information was first combined and used within a single kernel function, but more recently has been separated into individual kernels and applied through composite kernels.
Table 4 Summary of kernel-based relation extraction methods

Tree Kernel (TK)
  ▪ Compares each pair of nodes composing the two trees to be compared.
  ▪ Based on BFS, applies a subsequence kernel to the child nodes located at the same level.

Dependency Tree Kernel (DTK)
  ▪ The decay factor is adjusted and applied to the similarity depending on the length of the subsequences at the same level, or on the level itself.

Shortest Path Dependency Kernel (SPDK)
  ▪ Compares each element of the two paths, cumulatively computing the common values for the elements, and multiplies all values to obtain the similarity.
  ▪ Similarity is 0 if the lengths of the two paths differ.

Subsequence Kernel (SK)
  ▪ As in measuring the similarity of two words, extracts only the subsequences of length n that exist in both words and expresses the two words as vectors (Φ(x)) over all such subsequences.
  ▪ Afterwards, takes the inner product of the two vectors to calculate the similarity.
  ▪ Generalizes SSK and uses it to compare the planar information (sibling nodes) of the tree kernel.

Composite Kernel (CK)
  ▪ Finds all subtrees in the typical CFG-type syntactic tree and establishes them as coordinate axes to represent the parse tree as a vector (Φ(x)).
  ▪ In this case, the following constraints hold: (1) the number of nodes must be at least 2, and (2) the subtree should comply with the CFG production rules.
  ▪ Since a subtree can occur multiple times, each coordinate value can be at least 1, and similarity is calculated by taking the inner product of the two vectors created as such.
Table 5 Comparison of the characteristics of kernel-based relation extraction

Language Processor
  TK: Shallow parser (REES); DTK: Statistical parser (MXPOST); SPDK: Statistical parser (Collins' parser); SK: Chunker (OpenNLP); CK: Statistical parser (Charniak's parser)

Level of Linguistic Analysis
  TK: PLO tagging, parsing; DTK: Parsing; SPDK: Parsing; SK: Chunking; CK: Parsing

Feature Space Selection
  TK: Extracts features from parse trees manually; DTK: Selects the small sub-tree including the two entities from the entire dependency tree; SPDK: Selects the shortest dependency path which starts with one entity and ends with the other; SK: Before, between, and after entities; CK: MCT, PT, CPT, FPT, FCPT

Features Used for Kernel Computation
  TK: Entity headword, entity role, entity text, entity type, entity level; DTK: Word, POS, chunking info., entity type, chunk headword, WordNet hypernym, relation parameter; SPDK: Word, POS, chunking info., entity type; SK: Word, POS, chunking info., entity type; CK: Entity headword, entity type, mention type, LDC mention type

Relation Extraction Method
  TK: Single phase (multiclass SVM); DTK: Cascade phase (relation detection and classification); SPDK: Single phase; SK: Single phase (multiclass SVM); CK: Single phase (multiclass SVM) / cascade phase

Manual Process
  TK: Necessary (instance extraction); DTK: Necessary (dependency tree); SPDK: Necessary (dependency path); SK: N/A; CK: N/A
With respect to selecting the feature space, most methods use portions of the sentence or of the parse tree rather than the entire tree. Manual work was initially required for extracting relation instances and building parse trees; more recently developed methods, however, offer full automation. For the relation extraction step, multi-class classification is generally used, in which the case of no relation is included as a single additional class.
5.3 Comparison of performance
Table 6 shows the parameter type of each kernel function, along with the computational complexity of calculating the similarity between two inputs. Most of the kernels show a complexity of O(N²), but SPDK clearly demonstrates a complexity on the order of O(N) and can be considered the most efficient.
It should be noted that the complexity shown in Table 6 is simply the kernel calculation complexity. The overall complexity of relation extraction can be much higher when the processing time for parsing and learning is also considered. Table 7 compares the reported performance of the individual models.
ACE-2002, the first version of the ACE collection, had issues with data consistency. These problems were continuously addressed in the subsequent versions and largely resolved in ACE-2003. On ACE-2003 with its 24 relation sub-types, performance started at the 52.8 % achieved by Kambhatla [23] and was improved up to the 59.6 % recently announced by Zhou et al. [45]. Similarly, the maximum relation extraction performance for the 23 relation sub-types of the ACE-2004 collection is currently 66 %.
Although each model performs differently on the differently sized relation sets, the composite kernels generally show better results. In particular, Zhou et al. [45] demonstrated a high performance for all collections and relation sets, based on extensions of the models initially proposed by Zhang et al. [40, 41]. As described above, the various features for relation extraction, that is, the syntactic structure and the vocabulary, can evidently be efficiently combined in a composite kernel for an even better performance.
Table 6 Parameter structure and calculation complexity of each kernel

Kernels  Parameter structure   Time complexity
TK       Shallow parse trees   CSTK: O(|N_{i,1}| · |N_{i,2}|); SSTK: O(|N_{i,1}| · |N_{i,2}|^3)
DTK      Dependency trees      (as for TK; |N_{i,1}|, |N_{i,2}|: numbers of nodes of the first and second inputs at level i)
SPDK     Dependency paths      O(|N_1|)  (|N_1|: number of nodes of the first input)
SK       Chunking results      O(n · |N_1| · |N_2|)  (n: subsequence length; |N_1|, |N_2|: numbers of nodes of the two inputs)
CK       Full parse trees      O(|N_1| · |N_2|)
5.4 Comparison and analysis of the advantages and disadvantages of each method
The advantages and disadvantages of the five kernel-based relation extraction methods are discussed and outlined in Table 8.
As shown in Table 8, each method has certain advantages and disadvantages. A great deal of effort is required for the feature selection process used in general feature-based relation extraction. The kernel-based methods do not have this disadvantage, but they do have various limitations. For example, although the shortest dependency path kernel has a variety of potential uses, it shows a low performance owing to the overly simple structure of its kernel function. Since the composite kernel builds and compares subtree features only on the basis of the part-of-speech and vocabulary information of each node, the generality of the similarity measurement is not high. A scheme to overcome this limitation is to use word classes or semantic information.
To overcome the shortcomings described above, new kernel designs may also be suggested. For example, a scheme may be used for incorporating various types of supplementary feature information (WordNet synsets, thesauri, ontologies, part-of-speech tags, thematic role information, etc.), ensuring a more general comparison between subtrees in the composite kernel. The performance of the shortest path dependency kernel could likewise be improved by replacing its current simple linear kernel with a subsequence or other composite kernel and applying several kinds of supplementary feature information.
6 Conclusion
In this paper, we analyzed kernel-based relation extraction methods, which are considered the most effective approaches currently available. Previous case studies did not fully cover the specific operation principles of kernel-based relation extraction models, and simply cited
Table 7 Performance comparison of each kernel-based relation extraction model

Articles                    Year  Methods  Test Collection                    F1
Zelenko et al. [39]         2003  TK       200 News Articles (2 relations)    85.0
Culotta & Sorensen [14]     2004  DTK      ACE-2002 (5 major relations)       45.8
Kambhatla [23]              2004  ME       ACE-2003 (24 relation sub-types)   52.8
Bunescu & Mooney [6]        2005  SPDK     ACE-2002 (5 major relations)       52.5
Zhou et al. [44]            2005  SVM      ACE-2003 (5 major relations)       68.0
                                           ACE-2003 (24 relation sub-types)   55.5
Zhao & Grishman [43]        2005  CK       ACE-2004 (7 major relations)       70.4
Bunescu & Mooney [7]        2006  SK       ACE-2002 (5 major relations)       47.7
Zhang et al. [41]           2006  CK       ACE-2003 (5 major relations)       70.9
                                           ACE-2003 (24 relation sub-types)   57.2
                                           ACE-2004 (7 major relations)       72.1
                                           ACE-2004 (23 relation sub-types)   63.6
Zhou et al. [45]            2007  CK       ACE-2003 (5 major relations)       74.1
                                           ACE-2003 (24 relation sub-types)   59.6
                                           ACE-2004 (7 major relations)       75.8
                                           ACE-2004 (23 relation sub-types)   66.0
Jiang & Zhai [21]           2007  ME/SVM   ACE-2004 (7 major relations)       72.9
the contents of individual studies or conducted a limited analysis. This paper, however, closely examined the operation principles and individual characteristics of five kernel-based relation extraction methods, starting from the original kernel-based relation extraction study [39] and ending with the composite kernel [9, 41], which is considered the most advanced kernel-based method. The overall performance of each method was compared using the ACE collections, and the particular advantages and disadvantages of each method were summarized. This study should contribute to further kernel studies on higher-performance relation extraction, and to general high-level kernel studies for linguistic processing and text mining.
Table 8 Analysis of the advantages and disadvantages of kernel-based relation extraction

Feature-based SVM/ME
  Advantages: Applies typical automatic sentence classification without modification. Performance can be further improved by applying various kinds of feature information. Relatively high speed.
  Disadvantages: A lot of effort is required for feature extraction and selection. Performance can be improved only through feature combination.

Tree Kernel (TK)
  Advantages: Calculates a particular similarity between shallow parse trees. Uses both structural (parent-child) and planar (sibling) information. Optimized for speed improvement.
  Disadvantages: Very limited use of structural information (syntactic relations). Slow similarity calculation in spite of the speed optimization.

Dependency TK (DTK)
  Advantages: Addresses the insufficient use of structural information, which is a disadvantage of TK. Uses the key words that form the core of the relation expression in a sentence as feature information, exploiting the structural characteristic of a dependency tree that the predicate node is raised to a higher position.
  Disadvantages: Predicates and key words in a dependency tree are emphasized only by means of decay factors (low emphasis capability). Slow similarity calculation.

Shortest Path DTK (SPDK)
  Advantages: Creates a path between two named entities by means of dependency relations, reducing noise unrelated to the relation expression. Very fast computation, because the kernel input is paths rather than trees. Various types of supplementary feature information can be added to improve the similarity measurement, thanks to the simple structure of paths.
  Disadvantages: Overly simple structure of the kernel function. Overly strong constraint: the similarity is 0 if the lengths of the two input paths differ.

Subsequence Kernel (SK)
  Advantages: Very efficient, because syntactic analysis information is not used. Various types of supplementary feature information can be added to improve performance.
  Disadvantages: Can include many unnecessary features.

Composite Kernel (CK)
  Advantages: Makes every constituent subtree of a parse tree a feature, fully using the structural information when calculating similarity. Optimized for improved speed.
  Disadvantages: Comparison is carried out only on the basis of the sentence component information (phrase info.) of each node; kernel calculation based on composite feature information, with reference to word classes, semantic information, etc., is still required.
References

1. ACE, "Automatic Content Extraction," 2009. [Online]. Available: http://www.itl.nist.gov/iad/mig//tests/ace/
2. Agichtein E, Gravano L (2000) "Snowball: extracting relations from large plain-text collections," in Proceedings of the fifth ACM conference on Digital libraries, pp. 85–94
3. Aone C, Ramos-Santacruz M (2000) "REES: a large-scale relation and event extraction system," in The 6th Applied Natural Language Processing Conference
4. Bach N, Badaskar S (2007) "A survey on relation extraction," Literature review for Language and Statistics II
5. Brin S (1999) Extracting patterns and relations from the World Wide Web. Lect Notes Comput Sci 1590/1999:172–183
6. Bunescu R, Mooney RJ (2005) "A shortest path dependency kernel for relation extraction," in Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 724–731
7. Bunescu R, Mooney RJ (2006) "Subsequence kernels for relation extraction," in Proceedings of the Ninth Conference on Natural Language Learning (CoNLL-2005)
8. Charniak E (2001) "Immediate-head parsing for language models," in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
9. Choi S-P, Jeong C-H, Choi Y-S, Myaeng S-H (2009) Relation extraction based on extended composite kernel using flat lexical features. J KIISE: Softw Appl 36:8
10. Collins M (1997) "Three generative, lexicalised models for statistical parsing," in Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of the EACL), Madrid
11. Collins M, Duffy N (2001) "Convolution kernels for natural language," in NIPS-2001
12. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
13. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press
14. Culotta A, Sorensen J (2004) "Dependency tree kernels for relation extraction," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
15. Etzioni O, Cafarella M, Downey D, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2005) Unsupervised named-entity extraction from the Web: an experimental study. Artif Intell 165(1):91–134
16. Fellbaum C (1998) WordNet: an electronic lexical database. MIT Press, Cambridge
17. Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296
18. Fundel K, Küffner R, Zimmer R (2007) RelEx—Relation extraction using dependency parse trees. Bioinformatics 23(3):365–371
19. Gartner T, Flach P, Wrobel S (2003) "On graph kernels: hardness results and efficient alternatives," Learning Theory and Kernel Machines, pp. 129–143
20. Hockenmaier J, Steedman M (2002) "Generative models for statistical parsing with combinatory categorial grammar," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA
21. Jiang J, Zhai C (2007) "A systematic exploration of the feature space for relation extraction," in NAACL HLT
22. Joachims T (1998) "Text categorization with support vector machines: learning with many relevant features," in ECML-1998
23. Kambhatla N (2004) "Combining lexical, syntactic and semantic features with Maximum Entropy models for extracting relations," in ACL-2004
24. Li J, Zhang Z, Li X, Chen H (2008) Kernel-based learning for biomedical relation extraction. J Am Soc Inf Sci Technol 59(5):756–769
25. Li W, Zhang P, Wei F, Hou Y, Lu Q (2008) "A novel feature-based approach to Chinese entity relation extraction," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 89–92
26. Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2:419–444
27. Mintz M, Bills S, Snow R, Jurafsky D (2009) "Distant supervision for relation extraction without labeled data," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP 2:1003–1011
28. Moncecchi G, Minel JL, Wonsever D (2010) "A survey of kernel methods for relation extraction," in Workshop on NLP and Web-based technologies (IBERAMIA 2010)
29. Moschitti A (2004) "A study on convolution kernels for shallow semantic parsing," in ACL-2004
Multimed Tools Appl
Sung-Pil Choi received his Ph.D. degree from the Department of Information and Communications Engineering at Korea Advanced Institute of Science and Technology (KAIST) in 2012. He works for Korea Institute of Science and Technology Information (KISTI) as a senior researcher. He has experience in big data processing/analytics and content management. His research interests are in the areas of text mining, natural language processing, data analytics, and information/knowledge processing.
Seungwoo Lee received his B.S. degree from Kyungpook National University, Korea, in 1997, and his M.S. and Ph.D. degrees in Computer Science and Engineering from POSTECH, Korea, in 1999 and 2005. Currently, he works as a senior researcher and leader of the platform research group at Korea Institute of Science and Technology Information (KISTI), Korea. Previously, he studied and developed an information retrieval engine and a named entity recognizer based on Natural Language Processing (NLP) technologies, and more recently he has researched and developed Semantic Web-related tools such as a semantic repository and an inference engine. His current research interests include technology intelligence and the discovery of technology opportunities based on Text Mining and Semantic Web technologies.
Hanmin Jung works as a chief researcher and the head of the Dept. of S/W Research at Korea Institute of Science and Technology Information (KISTI), Korea. He received his M.S. and Ph.D. degrees in Computer Science and Engineering from POSTECH, Korea. His current research interests include intelligent services based on the Semantic Web and Text Mining, Human-Computer Interaction, and Natural Language Processing.
Sa-kwang Song received his B.S. degree in Statistics in 1997 and his M.S. degree in Computer Science in 1999 from Chungnam National University, Korea. He received his Ph.D. degree in Computer Science from Korea Advanced Institute of Science and Technology (KAIST), Korea. He joined Korea Institute of Science and Technology Information (KISTI), Daejeon, Korea, in 2010, and is currently a senior researcher in the Dept. of SW Research. His current research interests include Text Mining, Natural Language Processing, Information Retrieval, the Semantic Web, and Health-IT.