Vishal Vachhani, CFILT and DIL, IIT Bombay. CS 671 ICT For Development, 19th Sep 2008.


Agro Explorer

A Meaning Based Multilingual Search Engine


Web-site for Indian farmers

Farmers can submit their problems related to their crops

Queries are answered by Agricultural Experts at KVK, Baramati

Languages supported: Marathi, Hindi, English


Why Need Multilingual Search

Vast Amount of Information available on the Web

Almost 70% of the Information is in English

The Indian rural populace is not English-Literate

“A Big Language Barrier”

Information has to be made available to them in their local languages.


Why Need Meaning Based Search

Most of the current Search Engines are Keyword Based.

They do not consider the semantics of the query

The result set contains a large number of extraneous documents.

Search based on the Meaning of the query will help narrow down on the desired information quickly.


[Figure: a query in Hindi is submitted to the system, which searches English and Marathi documents and returns the result in Hindi.]


Same Keywords, Different Semantics

Moneylenders Exploit Farmers: Found 1 Result

Farmers Exploit Moneylenders: Found 0 Result
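The contrast can be sketched in a few lines of Python. This is only an illustration: the (relation, head, dependent) triples are written by hand rather than produced by a real analyzer, and the function names `keywords` and `meaning_match` are hypothetical.

```python
# Keyword view: a sentence is reduced to an unordered bag of words.
def keywords(sentence):
    return frozenset(sentence.lower().split())

# Meaning view: a sentence is a set of (relation, head, dependent) triples.
# These triples are hand-written assumptions, for illustration only.
document = {("agt", "exploit", "moneylender"), ("obj", "exploit", "farmer")}
query_same = {("agt", "exploit", "moneylender"), ("obj", "exploit", "farmer")}
query_swapped = {("agt", "exploit", "farmer"), ("obj", "exploit", "moneylender")}

def meaning_match(query, doc):
    # A document matches when it contains every relation of the query.
    return query <= doc

# Both sentences have the same keywords, so a keyword engine cannot tell them apart.
same_keywords = keywords("Moneylenders exploit farmers") == keywords(
    "Farmers exploit moneylenders"
)
print(same_keywords)                           # keyword view
print(meaning_match(query_same, document))     # meaning view, same roles
print(meaning_match(query_swapped, document))  # meaning view, roles swapped
```

Swapping the agent and the object changes the triples but not the keywords, which is why only the meaning-based comparison separates the two queries.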


Provides both:
◦ Meaning Based Search
◦ Cross-Lingual Information Access


System Architecture


Conclusion

Provides two independent features: Multi-Linguality and Meaning Based Search.

Because of UNL, both the multilingual and the meaning-based properties can be incorporated together, rather than by adding separate language translators to a search engine. The scheme lends itself to the integration of multiple languages in a seamless, scalable manner.


UNL: Universal Networking Language


[Figure: UNL at the centre, connected to English, French, Tamil, Marathi and Hindi.]


Direct translation: each language is translated directly into every other language, so N*(N-1) translators are needed for N languages.

Intermediate language: an intermediate language is used for translation, so only 2*N translators are required.
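A short sketch of the two counts, using only the formulas on this slide. The function names are hypothetical.

```python
def direct_translators(n):
    # One translator for every ordered (source, target) pair of distinct languages.
    return n * (n - 1)

def interlingua_translators(n):
    # One converter into the intermediate language and one out of it, per language.
    return 2 * n

for n in (3, 5, 10):
    print(n, direct_translators(n), interlingua_translators(n))
```

For 5 languages this gives 20 direct translators against 10 with an intermediate language. The two counts are equal at N = 3, and the intermediate-language approach needs fewer translators for every larger N.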


UNL is an acronym for “Universal Networking Language”.

UNL is a computer language that enables computers to process information and knowledge across language barriers.

UNL is a language for representing information and knowledge provided by natural languages.

Unlike natural languages, UNL expressions are unambiguous.


Although the UNL is a language for computers, it has all the components of a natural language.

It is composed of Universal Words (UWs), Relations and Attributes.

Knowledge is represented as a semantic graph:
◦ Nodes: concepts
◦ Arcs: relations between concepts


A UW represents simple or compound concepts. There are two classes of UWs:
◦ unit concepts
◦ compound structures of binary relations grouped together (indicated with Compound UW-Ids)

A UW is made up of a character string (an English-language word) followed by a list of constraints.
◦ <UW> ::= <Head Word>[<Constraint List>]
◦ example: state(icl>express), state(icl>country)
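The head word and the constraint list can be separated with a small parser. This is a minimal sketch, assuming a simple UW of the form head(constraint, constraint, ...) with no nested parentheses inside the constraints. `parse_uw` is a hypothetical helper and not part of any UNL tool.

```python
import re

# Head word, optionally followed by a parenthesised constraint list.
UW_PATTERN = re.compile(r"^(?P<head>[^()]+?)(?:\((?P<constraints>.*)\))?$")

def parse_uw(text):
    match = UW_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple UW: {text!r}")
    constraints = match.group("constraints")
    # Simplification: constraints are split on commas, so nesting is not handled.
    items = [c.strip() for c in constraints.split(",")] if constraints else []
    return match.group("head"), items

print(parse_uw("state(icl>express)"))
print(parse_uw("state(icl>country)"))
print(parse_uw("Ram"))
```

The two readings of "state" share a head word and differ only in the constraint, which is what makes each UW unambiguous.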


◦ A relation label is a string of 3 characters or fewer.

◦ The relations between UWs are binary: rel(UW1, UW2)

◦ They have different labels according to the different roles they play.

◦ At present, there are 46 relations in UNL.

◦ For example, agt (agent), ins (instrument), pur (purpose), etc.


Attribute labels express additional information about the Universal Words that appear in a sentence.

◦ They show what is said from the speaker’s point of view, that is, how the speaker views what is said (time, reference, emphasis, attitude, etc.).

◦ @entry, @present, @progressive, @topic, etc.


Example: Ram eats rice.

{unl}
agt(eat.@entry.@present, Ram)
obj(eat.@entry.@present, rice(icl>eatable))
{/unl}
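The expression above can be read into a small graph structure. This is a sketch under stated assumptions: it handles only plain relations of the form rel(UW1, UW2), ignores scope labels such as :01, and `parse_unl`, `split_args` and `strip_attributes` are hypothetical helpers.

```python
import re
from collections import defaultdict

RELATION = re.compile(r"^(\w+)\((.+)\)$")

def split_args(body):
    # Split on the first comma that is not inside parentheses.
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"expected two arguments: {body!r}")

def strip_attributes(node):
    # 'eat.@entry.@present' -> ('eat', ['@entry', '@present'])
    parts = node.split(".@")
    return parts[0], ["@" + p for p in parts[1:]]

def parse_unl(lines):
    graph = defaultdict(list)   # head UW -> list of (relation, child UW)
    attributes = {}             # UW -> list of attribute labels
    for line in lines:
        match = RELATION.match(line.strip())
        if match is None:
            raise ValueError(f"not a plain relation: {line!r}")
        label, body = match.groups()
        left, right = split_args(body)
        head, head_attrs = strip_attributes(left)
        child, child_attrs = strip_attributes(right)
        attributes[head], attributes[child] = head_attrs, child_attrs
        graph[head].append((label, child))
    return dict(graph), attributes

expression = [
    "agt(eat.@entry.@present, Ram)",
    "obj(eat.@entry.@present, rice(icl>eatable))",
]
graph, attributes = parse_unl(expression)
print(graph)
print(attributes["eat"])
```

Each relation becomes a labelled arc from the first UW to the second, and the attributes are kept separately from the UWs.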


[Figure: UNL graph for "Ram eats rice": the node eat has an agt arc to Ram and an obj arc to rice.]


Example: The boy who works here went to school.

{unl}
agt(go(icl>move).@entry.@past, :01)
plt(go(icl>move).@entry.@past, school(icl>institution))
agt:01(work(icl>do), boy(icl>person).@entry)
plc:01(work(icl>do), here)
{/unl}


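[Figure: UNL graph for "The boy who works here went to school": go has an agt arc to the scope :01 and a plt arc to school; inside :01, work has an agt arc to boy and a plc arc to here.]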


[Figure: source language, then EnConverter, then intermediate language, then DeConverter, then target language.]


The DeConverter is a language-independent generator.

It can deconvert UNL expressions into a variety of native languages, using linguistic data such as the word dictionary and the grammatical rules of each language.

The DeConverter transforms the sentence represented by a UNL expression into a natural-language sentence.


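[Figure: architecture of the Hindi DeConverter. A UNL document is processed by the UNL Parser, the Case Marking Module, the Morphology Module and the Syntax Planning Module to produce a Hindi document. Resources shown: Dictionary, Case Marking Rules, Morphology Rules, Syntax Planning Rules. The legend distinguishes language-dependent modules from language-independent modules.]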


• The UNL parser module performs the following tasks:

– Check the input format of the UNL document
– Separate attributes from UWs
– Separate attributes from dictionary entries
– Replace UWs with Hindi root words


Case: a category of morpho-syntactic properties that distinguish the various relations a noun phrase may bear to a governing head.

Hindi case markers: ने, पर, से, etc.

A rule base based on:
◦ UNL attributes
◦ lexical attributes from the dictionary


Case marking is implemented using rules.

We analyze all UNL attributes as well as dictionary attributes to decide the next and previous case markers.

We also use the node's relation with its parent to select the right case marker.


Example rule: agt:null:null:null:ने:@past#V:VINT:N:null

Structure:
◦ relName
◦ parent previous case marker
◦ parent next case marker
◦ child previous case marker
◦ child next case marker
◦ the remaining four fields are of the form attr'REL'relationname
◦ attributes are separated by #
◦ relation names are also separated by #
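The field layout can be made concrete by splitting the example rule. This is a sketch only: the slide does not name the last four fields, so they are kept as an unnamed list of conditions, the 'REL' part of the format is not handled, and `CaseRule` and `parse_rule` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

def _marker(field: str) -> Optional[str]:
    # "null" means that no case marker is inserted at this position.
    return None if field == "null" else field

def _items(field: str) -> List[str]:
    # Several attributes in one field are separated by '#'.
    return [] if field == "null" else field.split("#")

@dataclass
class CaseRule:
    rel_name: str
    parent_prev: Optional[str]
    parent_next: Optional[str]
    child_prev: Optional[str]
    child_next: Optional[str]
    conditions: List[List[str]]  # the remaining four fields, meaning assumed

def parse_rule(text: str) -> CaseRule:
    fields = text.split(":")
    if len(fields) != 9:
        raise ValueError(f"expected 9 colon-separated fields, got {len(fields)}")
    markers = [_marker(f) for f in fields[1:5]]
    return CaseRule(fields[0], *markers, [_items(f) for f in fields[5:]])

rule = parse_rule("agt:null:null:null:ने:@past#V:VINT:N:null")
print(rule)
```

Read this way, the example rule says that for an agt relation the child takes ने as its next case marker, subject to the conditions in the remaining fields.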


What is Morphology

◦ Study of morphemes
◦ Their formation into words, including inflection, derivation and composition


Noun, Verb and Adjective Morphology
◦ Depends on the phonetic properties of the Hindi word

Noun Morphology
◦ Depends on gender, number and vowel ending of the noun

Adjective Morphology
◦ अच्छा लड़का, अच्छी लड़की, अच्छे लड़के
◦ the adjective अच्छा changes; lexical attribute “AdjA”

Verb Morphology
◦ Depends upon tense, gender, number, person, etc.


Verbs are categorized by
◦ Tense (past, present, future)
◦ Gender (male, female)
◦ Person (1st, 2nd, 3rd)
◦ Number (sg, pl)

Example
◦ Ladaka khana kha raha hai. (The boy is eating food.)

It contains present continuous tense, male, sg, and 3rd person.
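How these categories select a verb form can be sketched with a lookup table. This is an illustration only, not the rule base of the actual morphology module: it covers just the present continuous tense in the 3rd person, uses the same romanisation as the example, and `present_continuous_3rd` is a hypothetical name.

```python
# (gender, number) -> (progressive marker, auxiliary), 3rd person only.
PRESENT_CONTINUOUS_3RD = {
    ("male", "sg"): ("raha", "hai"),
    ("male", "pl"): ("rahe", "hain"),
    ("female", "sg"): ("rahi", "hai"),
    ("female", "pl"): ("rahi", "hain"),
}

def present_continuous_3rd(root, gender, number):
    progressive, auxiliary = PRESENT_CONTINUOUS_3RD[(gender, number)]
    return f"{root} {progressive} {auxiliary}"

print(present_continuous_3rd("kha", "male", "sg"))
print(present_continuous_3rd("kha", "female", "pl"))
```

With the root kha, male and sg, this yields the verb group of the example sentence.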


Arranging words according to the structure of the language

Rule-based module

It is a priority-based graph traversal


Algorithm for Syntax Planning:

1) Start traversing the UNL graph from the entry node.
2) If a node has no children, add it to the final string.
3) If a node has more than one child, sort the children by the priority of their relations. The relation with the highest priority is traversed first.
4) Mark that node as visited.
5) Repeat steps 3 and 4 until all the children of that node have been visited.
6) Once all the children of a node have been visited, add that node to the final string.
7) Repeat steps 2 to 6 until all the nodes have been traversed.
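The steps amount to a depth-first traversal in which a node is emitted only after all of its children. A minimal sketch follows, using the relation priorities that the slides give for the example sentence "Also, spray 5% Neemark solution". The graph edges are a reading of the example figure and the function name `syntax_plan` is hypothetical.

```python
# Relation priorities from the example slide.
PRIORITY = {"obj": 17, "man": 9, "mod": 5, "qua": 5}

# node -> list of (relation, child); edges as read from the example figure.
GRAPH = {
    "spray": [("man", "also"), ("obj", "solution")],
    "solution": [("mod", "percent"), ("mod", "Neemark")],
    "percent": [("qua", "5")],
}

def syntax_plan(entry, graph, priority):
    output, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        # Highest-priority relation first; ties keep their original order
        # because Python's sort is stable, also with reverse=True.
        children = sorted(
            graph.get(node, []), key=lambda edge: priority[edge[0]], reverse=True
        )
        for _, child in children:
            visit(child)
        # A node joins the final string only after all its children.
        output.append(node)

    visit(entry)
    return output

print(" ".join(syntax_plan("spray", GRAPH, PRIORITY)))
```

Because the verb is added last, the traversal produces the verb-final order that Hindi requires.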


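Example: Also, spray 5% Neemark solution.

[Figure: UNL graph of the sentence. spray has an obj arc to solution and a man arc to also; solution has two mod arcs, to Neemark and to percent; percent has a qua arc to 5.]

Relation priorities: obj:17, man:9, mod:5, qua:5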

[Figure sequence: step-by-step traversal of the example graph, starting from the entry node spray.]

Children of spray, sorted by priority: obj:17 (solution), man:9 (also)

Children of solution: mod:5 (percent), mod:5 (Neemark)

Child of percent: qua:5 (5)

Output : 5
Output : 5 percent
Output : 5 percent Neemark
Output : 5 percent Neemark solution
Output : 5 percent Neemark solution also
Output : 5 percent Neemark solution also spray

Output (word order): 5 percent Neemark solution also spray

Hindi output: 5 प्रतिशत नीमअर्क घोल भी छिड़कें।


Input sentence: Its roots are affected by bacterial infection.

Output of each module:
◦ UNL parser: जड़ प्रभावित जीवाण्विक संक्रमण
◦ Case marking: जड़ प्रभावित जीवाण्विक संक्रमण से
◦ Morphology: इसकी जड़ें जीवाण्विक प्रभावित होती हैं संक्रमण से।
◦ Syntax Planning: जीवाण्विक संक्रमण से इसकी जड़ें प्रभावित होती हैं।

Output: जीवाण्विक संक्रमण से इसकी जड़ें प्रभावित होती हैं।


UNL 2005 Specifications: http://www.undl.org/unlsys/unl/unl2005/

S. Singh, M. Dalal, V. Vachhani, P. Bhattacharyya and O. Damani, “Hindi generation from interlingua”, MT Summit 2007 (www.cse.iitb.ac.in/~vishalv)

Mrugank Surve, Sarvjeet Singh, Satish Kagathara, Venkatasivaramasastry K, Sunil Dubey, Gajanan Rane, Jaya Saraswati, Salil Badodekar, Akshay Iyer, Ashish Almeida, Roopali Nikam, Carolina Gallardo Perez, Pushpak Bhattacharyya, AgroExplorer Group: AgroExplorer: a Meaning Based Multilingual Search Engine, International Conference on Digital Libraries (ICDL), New Delhi, India, Feb 2004.

Agro Explorer: http://agro.mlasia.iitb.ac.in

aAQUA: http://www.aaqua.org
