Information retrieval based on word sens 1

1

Information Retrieval Based on Word Sense

Athman Hajhamou

Computer and Modeling Laboratory – USMBA – FSDM – Fès

2

Summary

Research domain
Characteristics of classical Arabic
Morphological processing
Research problem
Semantic approaches

3

Research domain

Natural Language Processing (NLP): a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.

4

Research domain

Levels of Natural Language Processing:

Phonology
Morphology
Lexical
Syntactic
Semantic

5

Research domain

Levels of Natural Language Processing:

Phonology: this level deals with the interpretation of speech sounds within and across words. In an NLP system that accepts spoken input, the sound waves are analyzed and encoded into a digitized signal for interpretation.

6

Research domain

Levels of Natural Language Processing:

Morphology: this level deals with the componential nature of words, which are composed of morphemes, the smallest units of meaning. For example, the word المفسدون can be morphologically analyzed into three separate morphemes: the prefix الم, the root فسد, and the suffix ون. An NLP system can recognize the meaning conveyed by each morpheme in order to gain and represent meaning.

7


Lexical: at this level, words that have only one possible sense or meaning can be replaced by a semantic representation of that meaning. The nature of the representation varies according to the semantic theory used in the NLP system. The lexical level may require a lexicon, and the particular approach taken by the NLP system determines whether a lexicon is used, as well as the nature and extent of the information encoded in it.

8


Syntactic: this level focuses on analyzing the words in a sentence so as to uncover the grammatical structure of the sentence. The output of this level of processing is a representation of the sentence that reveals the structural dependency relationships between the words. Syntax conveys meaning in most languages because order and dependency contribute to meaning.

9


Semantic: this is the level at which most people think meaning is determined. However, as the definitions of the levels above show, all the levels contribute to meaning. Semantic processing determines the possible meanings of a sentence by focusing on the interactions among word-level meanings in the sentence. This level of processing can include the semantic disambiguation of words with multiple senses. Semantic disambiguation permits one and only one sense of a polysemous word to be selected.

10

Research domain

Information Retrieval (IR):

IR can be defined as the study of how to determine and retrieve, from a corpus of stored information, the portions which are relevant to a particular information need. The information may be stored in a structured or an unstructured form, depending on its application.

11

Research domain

Information Retrieval (IR):

A user of the store has to express an information need as a request for information in one form or another. Thus IR is concerned with determining and retrieving the information that is relevant to that need, as expressed by the request and translated into a query which conforms to a specific information retrieval system (IRS). An IRS normally stores surrogates of the actual documents to represent the documents and the information stored in them.

12

Characteristics of classical Arabic

The Arabic language raises several challenges for Natural Language Processing (NLP), largely due to its rich morphology. Morphological processing becomes particularly important for information retrieval (IR), because IR needs to determine an appropriate form of each word to use as an index term.

13

Characteristics of classical Arabic

Arabic is a Semitic language with a composite morphology. Arabic words are categorized as particles, nouns, or verbs. Unlike most Western languages, Arabic script is written from right to left. There are 28 letters in the Arabic alphabet. The letters are connected, and there are no capital letters. Most letters change shape depending on their position in the word and on the adjacent letters.

14

Morphological processing

Almost all information retrieval systems work in the same way and pass through several steps before retrieving the documents most relevant to the formulated queries. These steps deal with a set of documents and their text contents, and with the representations of those documents.

15


Pre-processing: document content is pre-processed before the search process. Pre-processing can be divided into the following text operations:

Lexical analysis of the text, with the objective of treating digits, hyphens, and punctuation marks.
Elimination of the stop words.
Removal of diacritics.
Normalization of the words.
Stemming.
Selection of index terms.

16


Pre-processing: Lexical analysis of the text:

the text of every text file is converted into a stream of words (the candidate words to be adopted as index terms). The following three cases have to be considered with care: non-Arabic words, punctuation marks, and digits.
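
As a rough illustration of this step, the sketch below turns raw text into a stream of Arabic words. It assumes that a "word" is any run of characters in the Unicode range U+0621 to U+0652 (Arabic letters, tatweel, and diacritics); the name `tokenize` and this choice of range are assumptions made for the example, not part of the slides.

```python
import re

# Runs of Arabic letters, tatweel, and diacritics (U+0621..U+0652).
# Digits, punctuation (including the Arabic comma U+060C), and
# Latin-script words fall outside this range and are dropped.
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")

def tokenize(text: str) -> list[str]:
    """Return the candidate index words found in `text`, in order."""
    return ARABIC_WORD.findall(text)

tokens = tokenize("قال: hello 123 العلم، نور.")
```

Here the non-Arabic word, the digits, and the punctuation marks are simply discarded; a real system might instead keep numbers or foreign words as separate token types.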

17


Pre-processing: Elimination of the stop words:

stop words are words which occur too frequently across the text files and which do not carry a particular and useful meaning for IR. Eliminating stop words reduces the size of the indexing structure.
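
A minimal sketch of stop-word elimination follows. The stop list here is a tiny hand-picked set of common Arabic particles chosen for illustration only; the names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# A tiny illustrative stop list (an assumption); a real system would load
# a much larger list tuned to its collection.
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "و"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

kept = remove_stop_words(["العلم", "في", "الكتاب"])
```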

18


Pre-processing: Removal of diacritics:

short vowels and other diacritics are removed from every text file. Short vowels include the fatha, damma, and kasra. Other diacritics include the shadda, sukun, and tanween.
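
Diacritic removal can be sketched with a single regular expression over the Unicode code points U+064B to U+0652, which cover the three tanween marks, fatha, damma, kasra, shadda, and sukun. The function name `remove_diacritics` is an assumption for this example.

```python
import re

# U+064B..U+0652: fathatan, dammatan, kasratan (tanween), fatha, damma,
# kasra, shadda, sukun.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def remove_diacritics(text: str) -> str:
    """Delete short vowels and the other listed diacritics from `text`."""
    return DIACRITICS.sub("", text)

# The fully vocalized word "ilmun" (ain+kasra, lam+sukun, meem+dammatan)
# is reduced to its three bare letters.
bare = remove_diacritics("\u0639\u0650\u0644\u0652\u0645\u064C")
```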

19


Pre-processing: Normalization of the words:

the process of unifying the different forms of the same letter.
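
The slides do not say which letter forms are unified, so the mapping below is an assumption based on choices that are common in Arabic IR: the hamza and madda variants of alef are folded to a bare alef, alef maqsura to yeh, teh marbuta to heh, and the tatweel (kashida) is dropped. The names `NORMALIZATION_TABLE` and `normalize` are hypothetical.

```python
# Assumed mapping (common in Arabic IR, not taken from the slides).
NORMALIZATION_TABLE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u0640": None,      # tatweel is purely typographic: delete it
})

def normalize(word: str) -> str:
    """Unify variant letter forms so that spellings compare equal."""
    return word.translate(NORMALIZATION_TABLE)

normalized = normalize("\u0623\u062D\u0645\u062F")  # hamza-alef becomes bare alef
```

Folding teh marbuta and alef maqsura trades some precision for recall, so a given system may reasonably leave those two rules out.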

20


Pre-processing: Stemming:

the remaining words are stemmed with the objective of removing affixes (prefixes and suffixes), which allows the retrieval of documents containing syntactic variations of the query terms. (Mountassire)
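
The affix removal described here can be sketched as a simple light stemmer. This is not the method cited on the slide: the affix lists, the minimum length of three letters, and the name `light_stem` are all assumptions for illustration, and the result is a stem rather than a true root.

```python
# Illustrative affix lists (assumptions). Longer prefixes come first so
# that "ال" does not shadow "وال", "بال", and so on.
PREFIXES = ["وال", "بال", "كال", "فال", "ال"]
SUFFIXES = ["ون", "ين", "ات", "ان", "ها", "ة"]

def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping >= min_len letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

stem = light_stem("المفسدون")
```

For المفسدون this sketch removes ال and ون and returns مفسد. Reaching the root فسد would require a root-extraction step (for example pattern matching against morphological templates), which is beyond this example.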

21


Pre-processing: Selection of index terms:

an index term (or keyword) is a pre-selected term which can be used to refer to the content of a document.
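
One common way to store selected index terms is an inverted index that maps each term to the documents containing it. The sketch below assumes documents arrive as already pre-processed lists of terms; `build_index` and the document ids are hypothetical names.

```python
from collections import defaultdict

def build_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each index term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)

index = build_index({"d1": ["علم", "كتب"], "d2": ["علم"]})
```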

22


Search method: the search is based on the root of the word. Each word of the user query goes through the same pre-processing steps as the text files. Each root from the user query is then matched against the roots in the index table, and the documents or portions of documents that contain the same root are retrieved.
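
A minimal sketch of this lookup follows, assuming the index table is a dictionary from root to document ids and that the pre-processing pipeline is passed in as a function. The name `search` is hypothetical, and `str.split` stands in for the real pipeline in the usage lines.

```python
from typing import Callable

def search(query: str,
           index: dict[str, set[str]],
           preprocess: Callable[[str], list[str]]) -> set[str]:
    """Return ids of documents that share at least one root with the query."""
    results: set[str] = set()
    for root in preprocess(query):
        # Roots that are absent from the index contribute nothing.
        results |= index.get(root, set())
    return results

# Usage with a toy index; str.split is a placeholder for the real
# pre-processing (tokenize, remove stop words, normalize, stem).
toy_index = {"فسد": {"d1", "d3"}, "علم": {"d2"}}
hits = search("فسد علم", toy_index, str.split)
```

This version returns the union of matches for all query roots; ranking or requiring every root to match are design choices the slides leave open.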

23

Research problem

Synonymy and polysemy are two important areas in linguistics that present a problem for computational linguistics. They complicate the task of natural language processing because it’s difficult to know when two names mean the same thing and it’s difficult to know the sense of a name that has multiple meanings (doing so requires word-sense disambiguation).

24

Research problem

Synonymy: the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query. For example, a search for "علم" may not return a document containing the word "معرفة", even though the words have the same meaning.

25

Research problem

Polysemy: the phenomenon where the same word has multiple meanings. A search may therefore retrieve irrelevant documents that contain the desired words used with the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.

26

Semantic approaches

Automatic discovery of similar words: the underlying goal of this approach is, in general, the automatic discovery of synonyms. Most methods provide words that are “similar” to each other, with some vague notion of semantic similarity.

27


Automatic discovery of similar words: among the existing methods we find techniques that, upon input of a word, automatically compile a list of good synonyms or near-synonyms, and techniques that generate a thesaurus (from some source, they build a complete lexicon of related words).

28


Automatic discovery of similar words: the basic assumption of most of these approaches is that words are similar if they are used in the same contexts. The methods differ in the way the contexts are defined and the way the similarity function is computed.
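
To make the "same contexts" assumption concrete, the sketch below picks one possible definition of context (the words within a fixed window in the same sentence) and one possible similarity function (cosine similarity of the context counts). Both choices, and the names `context_vectors` and `cosine`, are assumptions for illustration; the surveyed methods vary on exactly these points.

```python
import math
from collections import Counter

def context_vectors(sentences: list[list[str]],
                    window: int = 2) -> dict[str, Counter]:
    """For every word, count the words seen within `window` positions of it."""
    vectors: dict[str, Counter] = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            ctx.update(sent[lo:i] + sent[i + 1:hi])
    return vectors

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

vecs = context_vectors([["read", "the", "book"],
                        ["read", "the", "paper"],
                        ["eat", "an", "apple"]])
sim_close = cosine(vecs["book"], vecs["paper"])  # identical contexts
sim_far = cosine(vecs["book"], vecs["apple"])    # no shared context words
```

In this toy corpus "book" and "paper" occur in identical contexts and score close to 1, while "book" and "apple" share no context words and score 0.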


30

Semantic approaches

Term selection: one approach to the term selection problem is based on the co-occurrence of “similar” terms in “the same context”. We use the notion of a term profile to calculate term quality and select the best-quality index terms. The quality of a term t is based on the distribution of terms “similar” to t and co-occurring in sentences across the document collection.

31

Semantic approaches

Synonym-based search method: this search method is based on the synonyms of the words. Each word of the user query is looked up in an Arabic thesaurus to get its synonyms. Each synonym of a query word is then matched to the same word in the index table.
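
The synonym lookup can be sketched as query expansion over a thesaurus held in a dictionary. The thesaurus content, the index layout, and the name `synonym_search` are hypothetical; matching the original query word alongside its synonyms is also an assumption of this sketch.

```python
def synonym_search(query_terms: list[str],
                   thesaurus: dict[str, set[str]],
                   index: dict[str, set[str]]) -> set[str]:
    """Return ids of documents containing a query term or one of its synonyms."""
    results: set[str] = set()
    for term in query_terms:
        # Look up the term itself plus every synonym the thesaurus lists.
        for candidate in {term} | thesaurus.get(term, set()):
            results |= index.get(candidate, set())
    return results

# A query for "علم" now reaches a document indexed only under "معرفة".
toy_thesaurus = {"علم": {"معرفة"}}
toy_index = {"معرفة": {"d1"}, "كتب": {"d2"}}
hits = synonym_search(["علم"], toy_thesaurus, toy_index)
```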

32

References

P. Senellart and V. D. Blondel, ‘Automatic discovery of similar words’, in Survey of Text Mining, pp. 26-44, 2003.

A. T. Al-Taani and A. M. Al-Gharaibeh, ‘Searching Concepts and Keywords in the Holy Quran’, Yarmouk University, Jordan.

I. Dhillon, J. Kogan, and C. Nicholas, ‘Feature selection and document clustering’, in Survey of Text Mining, pp. 73-100, 2003.

E. D. Liddy, ‘Natural Language Processing: Introduction’, 2001.
