HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL




International Journal in Foundations of Computer Science & Technology (IJFCST), Vol. 3, No.4, July 2013

DOI:10.5121/ijfcst.2013.3408


Deepti Chopra, Sudha Morwal and Dr. G.N. Purohit

Department of Computer Engineering, Banasthali Vidyapith, (Raj.), INDIA

ABSTRACT

Named Entity Recognition is the task of recognizing Named Entities or Proper Nouns in a document and then classifying them into different categories of Named Entity classes. In this paper we introduce our modified tool, which not only performs Named Entity Recognition (NER) in any natural language and performs the Corpus Development task (i.e. assists in developing Training and Testing documents), but also solves the unknown words problem in NER, handles spurious words, and automatically computes the Performance Metrics for an NER based system, i.e. Recall, Precision and F-Measure.

KEYWORDS

NER, Transliteration, Unknown words, Performance Metrics

1. INTRODUCTION

Named Entity Recognition (NER) is one of the application areas of Natural Language Processing, in which Named Entities are identified and thereafter categorised into different classes of Named Entities. The classes of Named Entities can be the name of a person, location, organization, state, sport, river, city or country, or a percentage, time, quantity etc. Applications of NER include Information Extraction, Machine Translation, Question Answering Systems, Information Retrieval, Automatic Summarization etc.

For example, consider the following Training Sentences:

Ram/PER is/OTHER a/OTHER intelligent/OTHER boy/OTHER
Deepa/PER lives/OTHER in/OTHER Nagpur/CITY
Ankit/PER is/OTHER a/OTHER football/SPORT player/OTHER
Aabhas/PER plays/OTHER cricket/SPORT

In the above tagged training text in English, ‘PER’ denotes that ‘Ram’, ‘Deepa’, ‘Ankit’ and ‘Aabhas’ are Names of Persons. ‘Nagpur’ is tagged with the ‘CITY’ tag since it is the Name of a City. Similarly, ‘football’ and ‘cricket’ are names of Sports, so they are tagged with the ‘SPORT’ tag. The words that are tagged with the ‘OTHER’ tag are not Named Entities.

The above tagged sentences are input to the HMM Train module, which computes the HMM Parameters, i.e. Start Probability, Transition Probability and Emission Probability. The HMM Parameters and the Testing sentences are input to the HMM Test module, and the Named Entities are derived using the Viterbi Algorithm. Suppose the testing sentence in NER is given as:


Aabhas lives in Nagpur

The output of the NER based system for the above testing sentence is the list of Named Entities along with their tags, i.e. Aabhas/PER and Nagpur/CITY.

We have developed NERHMM, a language independent NER tool based on the Hidden Markov Model technique [1][2]. In this paper, we discuss our modified tool.

2. PERFORMANCE METRICS OF NER BASED SYSTEM

Performance Metrics are a means of measuring the performance of an NER based system. They can be estimated in terms of three parameters: Precision, Recall and F-Measure. The result of an NER based system is referred to as the “response” and the human interpretation as the “answer key” [9]. Consider the following terms:

1. Correct - If the response is the same as the answer key.
2. Incorrect - If the response is not the same as the answer key.
3. Missing - If the answer key is found to be tagged but the response is not tagged.
4. Spurious - If the response is found to be tagged but the answer key is not tagged. [6]

Hence, we define Precision, Recall and F-Measure as follows: [5][7][8]

Precision (P) = Correct / (Correct + Incorrect + Spurious)
Recall (R) = Correct / (Correct + Incorrect + Missing)
F-Measure = (2 * P * R) / (P + R)

3. HIDDEN MARKOV MODEL

Hidden Markov Model (HMM) is a machine learning based approach that was used initially for Speech Recognition but is now also used for performing Named Entity Recognition on natural languages. An HMM can be represented using three parameters, λ = (A, B, Π): Start Probability (Π), Transition Probability (A = {a_ij}) and Emission Probability (B = {b_j(O)}) [1][3].

Start Probability (Π) is the probability that a given tag occurs first in a sentence.
Transition Probability (A = {a_ij}) is the probability that the next tag in a sentence is j, given that the present tag is i.
Emission Probability (B = {b_j(O)}) is the probability of observing the output O in a given state j.

HMM involves two steps: HMM Training and HMM Testing.
The input to HMM Train is an annotated text, and its output is the three parameters, i.e. Start Probability (Π), Transition Probability (A = {a_ij}) and Emission Probability (B = {b_j(O)}). The input to HMM Test is a testing sentence together with the three parameters obtained in the previous phase. The output of HMM Test is the sequence of states from which Named Entities can be detected.

4. OUR HMM BASED NER TOOL

We have performed NER in eight languages, namely English, Hindi, Bengali, Telugu, Punjabi, Urdu, Marathi and French. Our tool is capable of performing the Annotation task. If any of the existing tags need to be modified, this can also be done. The Annotation module is shown in Fig 1.


Figure 1: Annotation in NER Tool

Figure 2: HMM Train and HMM Parameter estimation

Similarly, we can also develop a Testing document using our tool. So, our tool is capable of performing Corpus Development for training as well as for testing. After getting the annotated corpus, we click on the ‘TRAIN HMM’ button and choose the file to be trained by clicking on the Browse button. The HMM parameters (Start Probability, Transition Probability and Emission Probability) are calculated and can be viewed by clicking on the View Parameters button. This is shown in Fig 2.
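The parameter estimation performed by the training step can be illustrated with a minimal Python sketch. It assumes plain relative-frequency (maximum likelihood) counting without smoothing, which the paper does not state explicitly, and the function names `parse_tagged` and `train_hmm` are hypothetical rather than taken from the tool. The input is the set of tagged training sentences from the Introduction.

```python
from collections import Counter, defaultdict


def parse_tagged(sentence):
    """Split 'Ram/PER is/OTHER' into [('Ram', 'PER'), ('is', 'OTHER')]."""
    return [tuple(token.rsplit("/", 1)) for token in sentence.split()]


def normalise(counter):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def train_hmm(tagged_sentences):
    """Estimate start, transition and emission probabilities by counting."""
    start_counts = Counter()             # tag of the first word of a sentence
    trans_counts = defaultdict(Counter)  # tag i -> counts of the next tag j
    emit_counts = defaultdict(Counter)   # tag j -> counts of emitted words
    for sentence in tagged_sentences:
        pairs = parse_tagged(sentence)
        if not pairs:
            continue
        start_counts[pairs[0][1]] += 1
        for word, tag in pairs:
            emit_counts[tag][word] += 1
        for (_, tag_i), (_, tag_j) in zip(pairs, pairs[1:]):
            trans_counts[tag_i][tag_j] += 1
    start = normalise(start_counts)
    trans = {tag: normalise(c) for tag, c in trans_counts.items()}
    emit = {tag: normalise(c) for tag, c in emit_counts.items()}
    return start, trans, emit


training = [
    "Ram/PER is/OTHER a/OTHER intelligent/OTHER boy/OTHER",
    "Deepa/PER lives/OTHER in/OTHER Nagpur/CITY",
    "Ankit/PER is/OTHER a/OTHER football/SPORT player/OTHER",
    "Aabhas/PER plays/OTHER cricket/SPORT",
]
start, trans, emit = train_hmm(training)
print(start)           # every training sentence starts with a PER tag
print(trans["OTHER"])  # OTHER is followed by OTHER, CITY or SPORT
```

Because every training sentence begins with a person name, this estimator assigns the whole start probability to PER; a practical system would normally smooth such estimates.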


Figure 3: HMM Testing and its Output

When we click on the ‘TEST HMM’ button, we can either click on the Browse button to select an existing file for testing, or build a new testing file by clicking on the button named ‘Develop a new testing Corpus’. Once a testing file has been selected, the Viterbi algorithm is run; it accepts all the HMM parameters computed by the tool and displays the optimal state sequence, as shown in Fig 3.

If any unknown word appears in the testing file, the transliteration module is run so that the unknown word can be handled. Our system can perform training and testing in any language while dealing with known words. When dealing with unknown words, our system can handle only those words that appear in one of the following languages: Hindi, Punjabi, Marathi, Bengali, Telugu, Urdu, English and French.

When we click on the ‘SAVE OUTPUT’ button, the output of the NER based system is saved in a file. When we click on the ‘NER EVALUATION’ button, the Performance Metrics of the NER based system are calculated automatically and displayed in a new window, as shown in Fig 4.

Our system is capable of handling Spurious words. Spurious words are those that are found to be untagged in the training file. Such words are tagged as ‘OTHER’ or Not-a-Named Entity by our system. We have tried to solve the problem of unknown words using the Transliteration approach.
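The decoding performed in the testing step can be illustrated with a short Viterbi sketch in Python. The parameter tables below are relative frequencies counted by hand from the four training sentences in the Introduction. The names `viterbi`, `log_p` and `FLOOR` are hypothetical, and the small probability floor for unseen words and transitions is an assumption made only to keep the example runnable; it is not the transliteration approach used by the tool.

```python
import math

FLOOR = 1e-6  # assumed floor for unseen events; not the tool's transliteration step

TAGS = ["PER", "OTHER", "CITY", "SPORT"]
START = {"PER": 1.0}
TRANS = {
    "PER": {"OTHER": 1.0},
    "OTHER": {"OTHER": 0.625, "CITY": 0.125, "SPORT": 0.25},
    "SPORT": {"OTHER": 1.0},
}
EMIT = {
    "PER": {"Ram": 0.25, "Deepa": 0.25, "Ankit": 0.25, "Aabhas": 0.25},
    "CITY": {"Nagpur": 1.0},
    "SPORT": {"football": 0.5, "cricket": 0.5},
    "OTHER": {"is": 0.2, "a": 0.2, "intelligent": 0.1, "boy": 0.1,
              "lives": 0.1, "in": 0.1, "player": 0.1, "plays": 0.1},
}


def log_p(table, key):
    """Log-probability of key in table, floored so unseen events stay finite."""
    return math.log(max(table.get(key, 0.0), FLOOR))


def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence for words under the given HMM."""
    if not words:
        return []
    # Each column maps tag -> (best log score of a path ending in tag, previous tag).
    columns = [{
        tag: (log_p(start, tag) + log_p(emit.get(tag, {}), words[0]), None)
        for tag in tags
    }]
    for word in words[1:]:
        previous = columns[-1]
        column = {}
        for tag in tags:
            score, best_prev = max(
                (previous[p][0] + log_p(trans.get(p, {}), tag), p) for p in tags
            )
            column[tag] = (score + log_p(emit.get(tag, {}), word), best_prev)
        columns.append(column)
    # Backtrack from the best final tag to recover the full path.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


words = "Aabhas lives in Nagpur".split()
path = viterbi(words, TAGS, START, TRANS, EMIT)
entities = [(w, t) for w, t in zip(words, path) if t != "OTHER"]
print(entities)  # [('Aabhas', 'PER'), ('Nagpur', 'CITY')]
```

Working in log space avoids numerical underflow on longer sentences. With the floor, an unseen word receives the same emission score under every tag, so its tag is decided by the transition probabilities alone.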


Figure 4: NER Evaluation
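The evaluation step can be illustrated with a small Python sketch that counts the four outcome types from Section 2 and derives the three metrics. It uses the convention, assumed here, that spurious responses lower Precision and missing responses lower Recall. The function names, the choice to skip tokens that are ‘OTHER’ in both the answer key and the response, and the example tag sequences are all hypothetical and are not taken from the tool.

```python
def count_outcomes(answer_key, response, untagged="OTHER"):
    """Count correct, incorrect, missing and spurious tags, token by token."""
    counts = {"correct": 0, "incorrect": 0, "missing": 0, "spurious": 0}
    for key_tag, resp_tag in zip(answer_key, response):
        if key_tag == untagged and resp_tag == untagged:
            continue  # assumption: untagged in both places is not scored
        if key_tag == untagged:
            counts["spurious"] += 1   # response tagged, answer key not tagged
        elif resp_tag == untagged:
            counts["missing"] += 1    # answer key tagged, response not tagged
        elif key_tag == resp_tag:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


def ner_metrics(correct, incorrect, missing, spurious):
    """Return (precision, recall, f_measure); a zero denominator yields 0.0."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(correct, correct + incorrect + spurious)
    recall = ratio(correct, correct + incorrect + missing)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return precision, recall, f_measure


# Hypothetical answer key and system response for a five-word sentence.
answer_key = ["PER", "OTHER", "OTHER", "CITY", "SPORT"]
response = ["PER", "OTHER", "CITY", "CITY", "SPORT"]
counts = count_outcomes(answer_key, response)
precision, recall, f_measure = ner_metrics(**counts)
print(counts, precision, recall, round(f_measure, 3))
```

In this example the single spurious CITY tag reduces Precision to 0.75 while Recall stays at 1.0, because every entity in the answer key was found.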

5. FEATURES OF OUR TOOL

Some of the unique features of our tool include the following:

- It performs the task of Corpus Development, i.e. it assists in developing Training as well as Testing documents.
- It is a Language Independent tool that can perform NER in any language. Unknown word handling has been performed for eight languages, i.e. English, French, Hindi, Urdu, Punjabi, Telugu, Bengali and Marathi, using the Transliteration approach.
- Spurious words, i.e. words that are found untagged in the Training Corpus, are handled.
- The words that are found in the testing file and are absent from the training file are given the Not-a-Named Entity tag and are fed back to the training file, so that the next time testing is done these words are known words.
- Automatic computation of NER Evaluation or Performance Metrics (i.e. Recall, Precision and F-Measure) can be performed by our tool.
- Our tool can perform NER on documents of any domain with high accuracy. Documents may include dynamic tag sets.
- Our tool can also perform NER on Multilingual documents.
- Our tool is user friendly in nature, since it assists in Corpus development, automatically computes HMM Parameters and also performs NER Evaluation.
- It is highly accurate. The result of NER Evaluation or Performance Metrics is close to that of Human interpretation.

6. CONCLUSION

We have performed Named Entity Recognition using Hidden Markov Model in Natural languages such as Hindi, Marathi, Punjabi, Telugu, Urdu, Bengali, English and French.


The existing tools related to Named Entity Recognition are highly language dependent and domain specific in nature. A need was therefore felt for a tool that is language independent and can work in any domain. So, we developed a tool that uses Hidden Markov Model to perform NER in Natural languages and can work in any domain. We have also tried to solve the problem of Unknown words in Named Entity Recognition using the Transliteration approach. Our system is also capable of performing NER on multilingual data. If the Named Entities in the training file are in one language and the same Named Entities in the testing file are in another language, then these Named Entities can be identified easily using the Transliteration approach.

ACKNOWLEDGEMENT

We would like to thank all those who helped us in accomplishing this task.

REFERENCES

[1] Sudha Morwal and Deepti Chopra, “NERHMM: A Tool For Named Entity Recognition based on Hidden Markov Model”, International Journal on Natural Language Computing (IJNLC), Vol.2, No.2, April 2013, DOI:10.5121/ijnlc.2013.2204, Pg 43-49. Available at: http://airccse.org/journal/ijnlc/papers/2213ijnlc04.pdf

[2] Sudha Morwal and Deepti Chopra, “Identification and Classification of Named Entities in Indian Languages”, International Journal on Natural Language Computing (IJNLC), Vol.2, No.1, February 2013, DOI:10.5121/ijnlc.2013.210, Pg 37-43. Available at: http://airccse.org/journal/ijnlc/papers/1412ijnlc02.pdf

[3] Sudha Morwal, Nusrat Jahan and Deepti Chopra, “Named Entity Recognition using Hidden Markov Model (HMM)”, International Journal on Natural Language Computing (IJNLC), Vol.1, No.4, December 2012, DOI:10.5121/ijnlc.2012.1402, Pg 15-23. Available at: http://airccse.org/journal/ijnlc/papers/1412ijnlc02.pdf

[4] Deepti Chopra, Nusrat Jahan and Sudha Morwal, “Hindi Named Entity Recognition By Using Rule Based Heuristics And Hidden Markov Model”, International Journal of Information Sciences and Techniques (IJIST), Vol.2, No.6, November 2012, DOI:10.5121/ijist.2012.2604. Available at: http://airccse.org/journal/IS/papers/2612ijist04.pdf

[5] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju and K.S.M.V. Kumar, “Named Entity Recognition for Telugu Using Maximum Entropy Model”.

[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu and Dr. A. Govardhan, “A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu”, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.

[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay, “Language Independent Named Entity Recognition in Indian Languages”, in Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33–40, Hyderabad, India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf

[8] Darvinder Kaur and Vishal Gupta, “A survey of Named Entity Recognition in English and other Indian Languages”, IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010.

[9] Shilpi Srivastava, Mukund Sanglikar and D.C. Kothari, “Named Entity Recognition System for Hindi Language: A Hybrid Approach”, International Journal of Computational Linguistics (IJCL), Volume (2): Issue (1): 2011. Available at: http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf


Authors

Deepti Chopra is working as Assistant Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She received her B.Tech degree in Computer Science and Engineering from Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011. She completed her M.Tech in Computer Science and Engineering at Banasthali University, Rajasthan in 2013. Her research interests include Artificial Intelligence, Natural Language Processing, and Information Retrieval. She has published many papers in International journals and conferences.

Sudha Morwal is an active researcher in the field of Natural Language Processing. She is currently working as Associate Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She has done M.Tech (Computer Science), NET and M.Sc (Computer Science), and her PhD is in progress at Banasthali University (Rajasthan), India. She has published many papers in International Conferences and Journals.

Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at Banasthali University (Rajasthan). Before joining Banasthali University, he was Professor and Head of the Department of Mathematics, University of Rajasthan, Jaipur. He has been Chief-editor of a research journal and a regular reviewer for many journals. His present interest is in O.R., Discrete Mathematics and Communication networks. He has published around 40 research papers in various journals.