A Word Sense Disambiguation Technique for Sinhala

Janindu Arukgoda, Vidudaya Bandara, Samiththa Bashani, Vijayindu Gamage, Daya Wimalasuriya

Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka

Overview

Problem Statement

Uses of Word Sense Disambiguation for Sinhala

Existing Solutions for Other Languages

Attempts on Sinhala

Our Approach

Sinhala WordNet

System Architecture

Evaluation

Future Work

Q & A

Problem Statement

All natural languages have words that take different senses in different contexts (polysemy)

Identifying the intended sense of a polysemous word in a given context is called Word Sense Disambiguation

An open problem in NLP

Sinhala has many polysemous words

But Sinhala has no word sense disambiguation tool

Examples

දත් දොස්තර මුව පරීක්ෂා කළේය. (The dentist checked the mouth)

මුව රංචුව කැලය තුලට වැදුණි. (The herd of deer disappeared into the woods)

ඇයගේ මුව සඳක් වැනිය. (Her face looks like the moon)

Uses of Word Sense Disambiguation for Sinhala

Accurate translation from Sinhala to other languages

Better web search for Sinhala resources

Text summarization

Better information retrieval systems for Sinhala

Word processing and spell checking

Content analysis

Information extraction

Text-to-speech conversion

Existing Solutions for Other Languages

A graph-based approach for Hindi

A rule-based approach for Hindi

Machine learning approaches for Hindi

Semantic-relatedness-based approaches for German

A WordNet-based approach for German

Rule-based approaches for English

Machine learning approaches for English

Attempts on Sinhala

Word sense discrimination vs. disambiguation

No large enough corpus

High complexity

No morphological processor

Our Approach

Based on Sinhala WordNet

Important properties of human languages

o One sense per collocation

o One sense per discourse

Lesk Algorithm

Simplified Lesk Algorithm

Lesk Algorithm

Get the glosses of the senses of the target word from the WordNet

Compare the gloss of each sense of the target word with the glosses of every other word in the given window of words

Keep a count of the overlapping words in each sense pair

The most appropriate sense is the one with the highest count of overlaps
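The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `GLOSSES` inventory, the transliterated key `muwa` (standing in for මුව), the English glosses, and the stopword list are all hypothetical stand-ins for Sinhala WordNet data.

```python
# Hypothetical toy sense inventory: word -> {sense name: gloss}.
# A real system would read Sinhala glosses from the Sinhala WordNet.
GLOSSES = {
    "muwa": {
        "mouth": "opening in the face used for eating and speaking containing teeth",
        "deer": "wild animal with antlers that lives in the forest in herds",
        "face": "front part of the head",
    },
    "dentist": {"dentist": "doctor who treats teeth and the mouth"},
    "forest": {"forest": "large area covered with trees where wild animal species live"},
}

# Assumed stopword list; function words would otherwise inflate overlaps.
STOPWORDS = {"a", "an", "the", "in", "of", "and", "that", "with",
             "who", "for", "used", "where", "is"}


def tokenize(gloss):
    """Lowercase a gloss and return its set of content words."""
    return {w for w in gloss.lower().split() if w not in STOPWORDS}


def lesk(target, window, glosses):
    """Return (best sense, overlap count) for target given the window words."""
    best_sense, best_score = None, -1
    for sense, gloss in glosses[target].items():
        sense_tokens = tokenize(gloss)
        score = 0
        for word in window:
            if word == target:
                continue
            # Compare with the gloss of every sense of every other window word.
            for other_gloss in glosses.get(word, {}).values():
                score += len(sense_tokens & tokenize(other_gloss))
        if score > best_score:  # ties keep the earlier sense
            best_sense, best_score = sense, score
    return best_sense, best_score


print(lesk("muwa", ["dentist", "muwa"], GLOSSES))  # ('mouth', 1)
print(lesk("muwa", ["muwa", "forest"], GLOSSES))   # ('deer', 2)
```

Because matching is on exact word forms, "herds" and "herd" would not overlap here, which illustrates why a morphological processor matters.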

Simplified Lesk Algorithm

Get the glosses of the senses of the target word

Compare the gloss of each sense of the target word with a selected window of words (n words around the target word in the given context, typically n/2 on the left and n/2 on the right, or chosen as appropriate)

Keep a count of overlapping words for each sense-window pair

The most appropriate sense is the one with the highest count of overlaps
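A minimal Python sketch of the simplified variant follows, under stated assumptions: the `SENSES` glosses, the transliterated placeholder `muwa` (for මුව), the English context sentence, and the stopword list are hypothetical and not taken from the Sinhala WordNet. The difference from the original algorithm is that each gloss is compared with the context words themselves, not with their glosses.

```python
# Hypothetical glosses for the senses of the target word.
SENSES = {
    "mouth": "opening in the face used for eating and speaking containing teeth",
    "deer": "wild animal with antlers that lives in the forest in herds",
    "face": "front part of the head",
}

# Assumed stopword list, removed from the context window before matching.
STOPWORDS = {"a", "an", "the", "in", "of", "and", "that", "with", "for"}


def context_window(tokens, index, n):
    """Return up to n/2 tokens on each side of tokens[index]."""
    half = n // 2
    left = tokens[max(0, index - half):index]
    right = tokens[index + 1:index + 1 + half]
    return left + right


def simplified_lesk(tokens, index, senses, n=6):
    """Score each sense by gloss/window word overlap; return (best, scores)."""
    window = set(context_window(tokens, index, n)) - STOPWORDS
    scores = {
        sense: len(set(gloss.lower().split()) & window)
        for sense, gloss in senses.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first sense listed
    return best, scores


tokens = "the dentist checked the muwa for damaged teeth".split()
print(simplified_lesk(tokens, tokens.index("muwa"), SENSES))
```

The window size `n` is a tuning choice: a larger window gives more chances for overlap but also admits more unrelated words.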

Sinhala WordNet

A lexical semantic network

Modeled after the Princeton WordNet for English

Crowdsourced

System Architecture

Evaluation Criteria

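$$\text{precision} = \frac{\text{No. of correct answers returned}}{\text{total No. of answers returned}}$$

$$\text{recall} = \frac{\text{No. of correct answers returned}}{\text{total No. of test cases}}$$

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$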
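As an illustration of these criteria, the sketch below computes precision (correct answers over answers returned), recall (correct answers over test cases), and F1 (their harmonic mean). The function name `evaluate`, the use of `None` for "no answer returned", and the sample data are assumptions made for this example, not the authors' evaluation code or results.

```python
def evaluate(predictions, gold):
    """Return (precision, recall, f1).

    predictions: test case id -> predicted sense, or None if the system
                 returned no answer for that case (assumed convention).
    gold:        test case id -> correct sense, one entry per test case.
    """
    returned = {k: v for k, v in predictions.items() if v is not None}
    correct = sum(1 for k, v in returned.items() if gold.get(k) == v)

    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up data: 4 test cases, 3 answers returned, 2 of them correct.
gold = {1: "mouth", 2: "deer", 3: "face", 4: "deer"}
predictions = {1: "mouth", 2: "deer", 3: "mouth", 4: None}
precision, recall, f1 = evaluate(predictions, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```

When the system returns an answer for every test case, precision and recall coincide, and F1 equals both.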

Evaluation

System Precision = 0.63

System F1-Score = 0.63

Future Work

Disambiguating verbs, adjectives and adverbs

Including a morphological processor

Q & A

Thank You