Proposal of an Advanced Retrieval System for the Noble Qur'an - Thesis Defense



Page 1: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

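بسم الله الرحمن الرحيم
(In the name of Allah, the Most Gracious, the Most Merciful)

الر كتاب احكمت اياته ثم فصلت من لدن حكيم خبير
(Alif, Lam, Ra. A Book whose verses are perfected and then detailed, from One who is Wise and All-Aware. Hud 11:1)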

Page 2: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

THESIS OF MAGISTER

Proposal of an Advanced Retrieval System for Noble Qur’an

PRESENTED BY ASSEM CHELLI

SUPERVISED BY PR. AMAR BALLA

M. TAHA ZERROUKI

Page 3: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Plan

• Introduction
• Problematic
• State of the Art
  o Search Engines
  o Arabic Language
  o Noble Quran
• Objectives
• Proposed search features
• Conception
• Implemented work
• Published papers
• Conclusion & Perspectives

Page 4: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Introduction

Qur’an, in Arabic, means "the Reading" or "the Recitation". Muslim scholars define it as:

« the words of Allah revealed to His Prophet Muhammad, written in Mus’haf and transmitted by successive generations »

The Qur’an is the sacred book of all Muslims and the first reference for Islamic law. Over 14 centuries, Muslims have been, and still are:

Studying it, teaching it, writing books about it and, more recently, developing applications for it.
Page 5: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Problematic

The Qur’an is an important source of information about all aspects of life (scientific, social, historical, political, ethical, juridical, etc.) and it holds a huge amount of information.

Regular search tools find it extremely difficult to extract key information from the Qur’an, so we need other ways to query it.

The appropriate solution is an Advanced Retrieval System.
• Why a retrieval system?
• Why advanced?
Page 6: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Indexing

Indexing consists of:
• Analyzing each document in the collection to create a set of keywords.
• Creating a representation of the documents in the system.
• Supporting other domains:
  o Auto-clustering of documents
  o Related keywords suggestion
  o Automatic analysis of documents
  o Calculating collocated terms
  o Auto-summarization
  o Etc.

Page 7: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Full-text search

A technology for finding the documents that match a set of words.

Most web search engines, such as Google and Bing, have a full-text search engine at the heart of their service.

The core of a full-text search engine is split into two main operations:
• Indexing the information into an efficient format
• Searching for the relevant information in this pre-computed index

Page 8: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Indexing :: Phases

Example: « Assem is >defending< his thesis!! »

Tokenization: Assem + is + >defending< + his + thesis!!

Normalization: assem + is + defending + his + thesis

Filtering stop words: assem + $ + defending + $ + thesis

Stemming: assem + $ + defend + $ + thesis

Resulting keywords: assem, defend, thesis
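The four phases above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list and the suffix rules are assumptions chosen to reproduce the slide's example, not the rules used by the actual system.

```python
import re

# Hypothetical stop words and suffixes, picked only to match the example above.
STOP_WORDS = {"is", "his"}
SUFFIXES = ("ing", "ed")


def tokenize(text):
    """Split on whitespace; punctuation stays attached to the tokens."""
    return text.split()


def normalize(token):
    """Lower-case the token and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def stem(token):
    """Naive suffix stripping; a real stemmer is far more careful."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_keywords(text):
    """Run tokenization, normalization, stop-word filtering and stemming."""
    tokens = [normalize(t) for t in tokenize(text)]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [stem(t) for t in tokens]


print(extract_keywords("Assem is >defending< his thesis!!"))
```

Each phase is a separate function, so any of them can be swapped out (for instance, an Arabic-aware normalizer) without touching the others.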

Page 9: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Indexing :: Index types

Document index
• Document 1 | The cow says moo
• Document 2 | The cat and the hat
• Etc.

Forward index
• Document 1 | the, cow, says, moo
• Document 2 | the, cat, and, the, hat
• Etc.

Inverted index
• “the” | doc 1, doc 2, …
• “cow” | doc 1, …
• Etc.
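The three index types can be reproduced with plain Python dictionaries. A minimal sketch using the two example documents; the function names are illustrative assumptions.

```python
from collections import defaultdict

# Document index: document id -> raw text (the two examples from the slide).
documents = {
    1: "The cow says moo",
    2: "The cat and the hat",
}


def build_forward_index(docs):
    """Forward index: document id -> list of its (lower-cased) words."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def build_inverted_index(forward):
    """Inverted index: word -> sorted list of the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


forward_index = build_forward_index(documents)
inverted_index = build_inverted_index(forward_index)
print(inverted_index["the"], inverted_index["cow"])
```

Searching for a word is then a single dictionary lookup in the inverted index, which is why full-text engines pre-compute it.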

Page 10: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Querying (Search)

Querying is the phase of interaction between the system and the user.

Search takes a user query and returns the effective list of matching results sorted by relevance.

Relevance: A degree of relationship between the document and the query

10

Page 11: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Querying process

Page 12: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Semantic Approach

Objective: improve search accuracy by understanding searcher intent and the contextual meaning of terms to generate more relevant results.

Semantic search does not just mean contextual search.

It is a smart search that considers several factors in order to provide the most relevant and useful search results.

Page 13: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Semantic Approach :: factors

• Current trend
• Location of search
• Intent of the search
• Variations of words
• Synonyms
• Generalized and specialized queries
• Concept matching
• Natural language queries
• Change of meaning based on the group of words

Page 14: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Semantic Approach :: factors

• Current trend: "Who wins the Clasico?" (the latest one, of course)
• Location of search: "Weather temperature?" (here in Algiers, preferably)
• Intent of the search: "Earthquake" (checking whether one just happened, or looking for articles)
• Variations of words: Man, Men, Man’s
• Synonyms: Biggest mountain, Highest mountain
• Generalized and specialized queries: Health vs Diabetes
• Concept matching: Half-Life, the game or the physical constant
• Natural language queries: "What time is it in Cairo?"
• Change of meaning based on the group of words: "New egg health benefits" vs "New egg health products"

Page 15: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Orthography

• A Semitic language
• The language of the Quran
• A right-to-left language

Arabic script is semi-cursive: most letters are attached to each other and change shape according to their position.

Page 16: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Lexicography

Classical Arabic grammar has only three word classes:

• Verbs
  o Verbs with a simple root (الفعل المجرد): Hamzated verb (مهموز), Assimilated verb (مثال), Hollow verb (اجوف), Weakened verb (ناقص), Geminated verb (مضعف).
  o Verbs with an augmented root (الفعل المزيد): فعّل، فاعل، افعل، تفعّل، تفاعل، افتعل، انفعل، استفعل
• Nouns
  o Primitive nouns (الأسماء الجامدة)
  o Nouns derived from verbals (الأسماء المشتقة)
  o Numbers, Demonstrative pronouns, Relative pronouns, Personal pronouns, Function words
• Particles

Page 17: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Morphology

• Arabic is a fusional language, considered an introflexive language:
  o Consonants indicate the meaning
  o Vowels mark the flexion

• The Arabic language is very rich and is based on a structure of patterns (about 500) and roots (about 7000).

• Theoretically:
  o A single Arabic root can generate hundreds of words (noun, verb, ...) by applying patterns.
  o A single Arabic word can exist in about a hundred forms by adding certain suffixes and prefixes.

Page 18: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Flexional Morphology

• For the conjugation of verbs and the declension of nouns, Arabic uses markers (generally affixes) of aspect, mood, tense, person, gender, number and case.

• These flexional marks can distinguish:
  o Mode of verbs: Perfective, Imperfective …
  o Function of nouns: Nominative, Accusative, Genitive

Page 19: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Flexion

• Flexion of verbs (conjugation)
  o Aspect
  o Mood: Doubted, Affirmed (Actual or Eventual)
  o Tense: Perfective (الماضي): فعلتُ، فعلتَ، فعلتِ; Imperfective (المضارع); Imperative (الأمر)

Page 20: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

20

Arabic :: Flexion :: Verbs

Page 21: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Flexion :: Verbs

• Perfective (الماضي):
  o 1st person: فعلت، فعلنا
  o 2nd person: فعلت، فعلت، فعلتما، فعلتم، فعلتن
  o 3rd person: فعل، فعلت، فعلا، فعلتا، فعلوا، فعلن
• Imperfective (المضارع): Nominative, Accusative, Jussive
• Imperative (الأمر)

Page 22: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Flexion :: Nouns

• Flexion of nouns (declension)
  o 3 cases: Nominative (الرفع), Accusative (النصب), Genitive (الجر)
  o Depends on:
    Number: Singular (المفرد), Dual (المثنى), Plural (الجمع)
    Form: Triptote, Diptote, etc.

Page 23: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Flexion :: Nouns

o Declension of singular nouns
  Triptotes (الأسماء المنصرفة): كتاب، كتابا، كتاب
  Diptotes (الأسماء الممنوعة من الصرف): صحراء قاحلة
  Five Nouns (الأسماء الخمسة): اخو، اخا، اخي
  Deverbals with defective roots: ماض
o Declension of dual nouns: كتابان، كتابين
o Declension of plural nouns
  External masculine plural (جمع المذكر السالم): كاتبون، كاتبين
o Declension of function words
  Invariables: منذ
  Variables: كل

Page 24: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Derivational morphology

o Deverbal noun (المصدر): ود، وداد، ودادة، مودة
o Active participle (اسم الفاعل): ضارب (hitter)
o Passive participle (اسم المفعول): مضروب (struck)
o Nouns of time and place (اسماء الزمان والمكان): مدرسة (school), مغرب (sunset)
o The Nomen Vicis (اسم المرة): ضربة (a hit)
o The Nomen Speciei (اسم الهيئة): جلست جلسة الأميرات (she sat like princesses)

Page 25: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Ambiguities :: Absence of Vocalization

If a text contains the unvocalized word (الملك), how should the search engine understand its meaning? Is it:

1. « الملك | Angel »
2. « الملك | Kingdom »
3. « الملك | King »

Page 26: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Ambiguities :: Prefixes

For the word «وعد», the letter wâw (واو) is either:

1. A part of the word: وعد (to promise)

2. Not a part of the word: و + عد (and + to count)

Page 27: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Arabic :: Ambiguities :: Suffixes

For the word «وله», the letter hâ’ (هاء) is either:

1. A part of the word: وله (admire)

2. Not a part of the word: ول + ه (crown + him), or و + له (and + he has)

Page 28: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Structure

The Qur’an consists of 114 surahs, and each surah is divided into ayahs. This is the main division, specified by the Prophet.

[Diagram: the Qur’an (القران) divided into surahs 1 to 114, each surah divided into ayahs]

Page 29: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Structure

There are several divisions:
• Primary structure: surah, ayah, word and letter;
• Special locations: first ayahs of a surah (فواتح السورة), last ayahs of a surah (خواتيم السورة), Qur’anic comma (فاصلة قرانية), Sajdah (سجدة), Waqf (وقف);
• Other structures: page, Juz’ (جزء), Hizb (حزب), Nisf (نصف), Rubu’ (ربع), Thumn (ثمن)

[Diagram: the Qur’an divided into thirty juz’, each juz’ divided in turn into hizb, nisf, rubu’ and thumn]

Page 30: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Structure :: Stops (Waqfs)

Page 31: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Uthmani Script

  standard uthmani position changes

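سأريكم | سأوريكم | Al-A'raf 145 | addition of the waw (زيادة الواو)

العالمين | العلمين | all occurrences in the Quran | deletion of the alif (حذف الألف)

الغاوون | الغاون | Ash-Shu'ara 94 and one other place | deletion of the waw (حذف الواو)

النبيين | النبين | all occurrences in the Quran | deletion of the ya (حذف الياء)

الليل | اليل | all occurrences in the Quran | deletion of the lam (حذف اللام)

ننجي | نجي | Al-Anbiya 88 | deletion of the nun (حذف النون)

وجيء | وجائ | Az-Zumar 69 and one other place | addition of the alif (زيادة الألف)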

Page 32: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Sciences

Specific to the Quran:
• Tafsīr (التفسير)
• Knowledge of Makkan and Medinan ayahs
• Knowledge of the causes of revelation
• Knowledge of the beginnings of surahs
• Science of allegorical ayahs (علم المتشابه)
• Qur’anic parables (الأمثال القرانية)

Page 33: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Sciences

Shared with other resources:
• Legislative study:
  o Fiqh (الفقه)
  o Abrogating and abrogated ayahs (الناسخ والمنسوخ)
  o General and particular (الخاص والعام)
• Linguistic study:
  o Orthography (علم مرسوم الخط)
  o Grammatical analysis of the Qur’an (اعراب الفاظ القران)
  o Morphology (الصرف)
  o Rhetoric (البلاغة)
  o Lexicology (علم المعاجم)
• Scientific study:
  o Scientific miracles in the Quran
  o Numerical study of verses (ignoring the debate about it)

Page 34: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: indexes

The indexes are categorized by purpose into 5 main categories:

• Syntactic
• Semantic
• Structural
• Statistical
• Thematic

Page 35: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Indexes :: Projects

• Midād lbayān: word morphology index
• Zerrouki’s Indexes: word morphology index, topic index, synonym index
• Qur’anic Arabic Corpus: word-by-word morphology index
• Tanzil Project: ayah index (electronic Mushaf), structural index, surah index
• Boundary-Annotated Qur’an Corpus: word-by-word Waqf index (+ Uthmani-Standard mapping)
• Qurany Concepts Tool: concept index

Page 36: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Ontologies + examples

• Qur’anic Concepts Ontology
• Henni’s Ontology

Page 37: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quran :: Indexes/Ontologies projects :: Global criticisms

• Not available or not open, except Zerrouki’s indexes, the Quranic Arabic Corpus and Tanzil.

• Development discontinued, except for the Quranic Arabic Corpus and Tanzil.

Page 38: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quranic Search Tools

• Alawfa (الأوفى)
• Al-Monaqeb-Alqurany (المنقب القراني)
• Quran complex search service
• Quranic Researcher (الباحث القراني)
• Quranologie (علم القران)
• Quranic Corpus Word-by-Word Search
• Tanzil Quran Browser (تنزيل)
• Zekr (ذكر)

Page 39: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Quranic Search Tools :: Global Criticisms

• They are not full-text search engines, except Tanzil’s and Zekr’s advanced search.
• Basic search operations
• Simple query system
• Weak or unsupported linguistic operations, except the Quranic Corpus word-by-word search
• No semantic approach
• Closed source, except Zekr
• Implemented as interfaces, not as APIs or libraries.

Page 40: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Objectives

Design a retrieval system that fits the search needs of the Qur’an as closely as possible.
• First, we should list and classify all the search features that are possible and helpful.
• Then, we need to study how to implement each feature and what its requirements are.

Page 41: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Advanced Query

• Fielded search: سورة:الفاتحة

• Logical relations: الصلاة و الزكاة

• Phrase search: "الحمد لله"

• Interval search: رقم_الآية:[1 الى 5]

• Full regular expressions: م[ان] to search for ما or من

• Wildcards (jokers): ب؟طة to match بسطة, بصطة; *نبي* to match نبي, النبيين, الأنبياء

Page 42: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Output Improvements

• Pagination

• Sorting: by relevance, Mushaf natural order, revelation order, or numerical, alphabetical or Abjad order

• Keyword highlighting: ذرني ومن <style>خلقت</style> وحيدا

Page 43: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Output Improvements (2)

• Real-time output

• Results grouping:
  o by surahs
  o by topics
  o by tafsir dependency
  o by revelation events
  o by allegorical ayahs
  o by parables

• Uthmani script with full diacritical marks

Page 44: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Suggestion System

• Spell corrections: ابراهام → ابراهيم

• Semantically related words (ontology-based): يعقوب: اسرائيل; نبي: يوسف، اسحاق، ...

• Suggestion of the different meanings of a word: رب: (meaning 1) اله, (meaning 2) سيد

• Improvements: address the limitations of N-grams for vocalized words

Page 45: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Suggestion System (2)

• Different vocalizations: الملك: المَلِك، المُلْك، المَلَك ...

• Collocated words: سميع: سميع عليم، سميع بصير; الحمد: الحمد لله

• Keyboard mapping: fsl → بسم (f → ب, s → س, l → م)

• Different significations: رب: 1st meaning (god), 2nd meaning (master)
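The keyboard-mapping suggestion (a query typed with the wrong keyboard layout) comes down to a character substitution table. A minimal sketch: the table holds only the three keys given in the example, and the function name is a hypothetical one.

```python
# Partial Latin-to-Arabic key table: only the three keys from the example.
LATIN_TO_ARABIC = {
    "f": "\u0628",  # ba'
    "s": "\u0633",  # sin
    "l": "\u0645",  # mim
}


def remap_keyboard(text, table=LATIN_TO_ARABIC):
    """Replace each mapped Latin key by its Arabic letter; keep the rest."""
    return "".join(table.get(ch, ch) for ch in text.lower())


print(remap_keyboard("fsl"))
```

A full implementation would carry the complete layout table and would probably check that the remapped word actually exists in the Quranic vocabulary before suggesting it.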

Page 46: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Linguistic aspects

Romanization .kalīfaẗ (ISO233), xalyfap (Buckwalter), _halyfaT (Arabtex) : خليفة

Syntactic Coloration

Partial vocalization search ل>كـم to locate لـكـلـك, مـم … and ignore لـكـم

Multi-level derivation (Word: اسقينا , level: lemma) to find واسقيناكم , ألسقيناهم, فاسقيناكموه.

Specific derivations: conjugation in the perfective of قال to find قال, قالت, قالوا, قلن ...

Page 47: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Linguistic aspects

Vocal Search

Word linguistic annotation

….

47

Page 48: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Linguistic aspects

• Word properties embedded query: { جذر: ملك, نوع: اسم, عدد: مفرد } (root: mlk, type: noun, number: singular)

• Numerical values search: 309 replaced by ثلاثمائة وتسعة

• Fuzzy string search: مؤصدة may replace مءصدة

• Linguistic examples search: Rhetorical deletion (الحذف البلاغي), Grammatical shift (الالتفات)

• Uthmani spelling: بصطة may replace بسطة; نعمة may replace نعمت

Page 49: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Quranic Options

• Recitation marks retrieval: سجدة: نعم (sajdah: yes)

• Structural options: صفحة: 1 (page: 1), جزء: عم (juz’: Amma)

• Divine Name highlighting

Page 50: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Quranic Options (2)

• Translation embedded query: { text: mercy, lang: english, author: shekir }

• Repetitions and allegorical ayahs (التكرار والمتشابهات)
  Repetition {55,13} == [فباي الاء ربكما تكذبان], 31 repetitions

• Abrogating and abrogated ayahs search (الناسخ والمنسوخ)

• Quranic parables (الأمثال)
  parable (سورة: البقرة)
  [مثلهم كمثل الذي استوقد نارا فلما اضاءت ما حوله ذهب الله بنورهم وتركهم في ظلمات لا يبصرون]

Page 51: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Semantic Queries

• Semantically related words (based on ontology):
  o Syn(جنة) to find جنة, نعيم, فردوس …
  o Ant(جنة) to find جحيم, سعير, جهنم, سقر …
  o Is(جنة) to find عدن, فردوس …

• Faceted thematic search

Page 52: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Semantic Queries

Natural questions: من؟ ما؟ اين؟ متى؟ لم؟ كم؟

• ما هي الحطمة؟ (What is Al-hottamat?)
  [نار الله الموقدة] - Al-Humazah 6 (It is the fire kindled by Allah)

• من هم الأنبياء؟ (Who are the prophets?)
  [إنا اوحينا إليك كما اوحينا إلى نوح والنبيين من بعده واوحينا إلى إبراهيم وإسماعيل وإسحاق ويعقوب والأسباط وعيسى وايوب ويونس وهارون وسليمان وآتينا داوود زبورا] - An-Nisa 163

• اين غلبت/هزمت الروم؟ (Where was Rome defeated?)
  [في ادنى الأرض وهم من بعد غلبهم سيغلبون] - Ar-Rum 3

• كم مكث اصحاب الكهف؟ (How long did the People of the Cave stay?)
  [ولبثوا في كهفهم ثلاث مائة سنين وازدادوا تسعا] - Al-Kahf 25

• متى يوم القيامة؟ (When is the Day of Resurrection?)
  [يسألك الناس عن الساعة قل إنما علمها عند الله وما يدريك لعل الساعة تكون قريبا] - Al-Ahzab 63

• كيف يتشكل الجنين؟ (How is the embryo formed?)
  [ثم خلقنا النطفة علقة فخلقنا العلقة مضغة فخلقنا المضغة عظاما فكسونا العظام لحما ثم انشأناه خلقا آخر فتبارك الله احسن الخالقين] - Al-Mu'minun 14

Page 53: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Semantic Queries (2)

• Auto vocalization: رسول من الله → رَسُولٌ مِنَ اللَّهِ

• Entity extraction:
  o ثلاث مائة سنين وازدادوا تسعا as (time/number, 309)
  o ببكة as (place, Mekka)
  o كلمح البصر as (time unit, ??)
  o مثقال ذرة as (size unit, ??)
  o يا ايها النبي as (person, Mohammad)

• Proper nouns search (co-reference resolution): بنيامين؟
  [إذ قالوا ليوسف واخوه احب إلى ابينا منا ونحن عصبة إن ابانا لفي ضلال مبين] - Yusuf 8

Page 54: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Proposed Search Features :: Statistical system

Frequencies of different units:
• How many times does the word «الله» occur in Surah “المجادلة”?
• What are the ten most frequent words in the whole Qur’an?
• How many times are the word Sea/بحر and its derivations mentioned in the whole Qur’an?
• How many letters are there in Surah طه?
• What is the longest ayah?
• How many Sajdah marks are there in the whole Qur’an (in the different riwayat)?
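Questions of this kind reduce to counting over an index of words per ayah. A minimal sketch with `collections.Counter`: the toy corpus below is a rough, hypothetical transliteration of three ayahs of Surah 1, used only to make the example runnable.

```python
from collections import Counter

# Hypothetical toy corpus: (surah, ayah) -> list of normalized words.
corpus = {
    (1, 1): ["bism", "allah", "alrahman", "alrahim"],
    (1, 2): ["alhamd", "lillah", "rabb", "alalamin"],
    (1, 3): ["alrahman", "alrahim"],
}


def word_frequencies(corpus, surah=None):
    """Count word occurrences, optionally restricted to one surah."""
    counts = Counter()
    for (s, _ayah), words in corpus.items():
        if surah is None or s == surah:
            counts.update(words)
    return counts


def letter_count(corpus, surah):
    """Total number of letters in the given surah."""
    return sum(
        len(word)
        for (s, _ayah), words in corpus.items()
        if s == surah
        for word in words
    )


print(word_frequencies(corpus, surah=1)["alrahman"])
print(letter_count(corpus, 1))
```

`Counter.most_common(10)` would answer the "ten most frequent words" question directly; counting derivations of a word would first need each word mapped to its root or lemma.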

Page 55: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Discussion of search features

To validate the usefulness, importance and clarity of each feature, we launched a survey to gather opinions.

We targeted a mixed audience to get high-quality feedback from:
• Regular users
• Quran scholars
• Arabic morphology experts
• Natural Language Processing / Information Retrieval researchers
• Philosophers working on the comparison of religious scriptures

Page 56: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Survey Takers

Page 57: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Survey Takers

Page 58: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Survey Results

Page 59: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception

Previous work: the Engineer degree graduation project entitled “Development of a search and indexing engine for Qur’anic documents” [Dahmani2010]

Improvements:
• Moving to a fully vocalized search engine
• Customization of the text processing phases, considering both the Uthmani and the standard scripts
• Adopting the Quranic word as a search unit

Page 60: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Full Vocalized Search Engine

Barriers:
• Comparing vocalized, partially vocalized and unvocalized texts
• Distinguishing between original vowels and declension case markers
• Lack of vocalized Arabic linguistic resources: texts, ontologies, thesauri, corpora

Advantages:
• Lifts the ambiguities caused by ignoring vocalization
• Makes search results, suggestions and statistics more accurate
• Refines the detection of meanings (a first step in the semantic approach)

Page 61: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing

We consider both the standard script and the Uthmani script to resolve difficulties such as:
• Searching with the Uthmani written form of a word.
• Calculating statistics based on the Uthmani writing.
• Matching the word-by-word structure of some Quranic linguistic resources.

Page 62: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing :: Global schema

Page 63: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing :: Substitution

A new phase. Purpose? Cases of substitution:

• Romanization. Guessing policy:
  o Nature of the characters used
  o Valid Arabic words
  o Existence of the word in the Quran
  o Predefined priorities

• Numbers as words. Rules:
  o We don’t say صفر رجل, we say لا رجل
  o One is never mentioned as واحد but as احد
  o Some numbers accept gender: اثنان، اثنتان
  o Other numbers take the form of the opposite gender to the counted noun: سبعة ابحر, سبع سماوات
  o A hundred (مئة) has a special spelling in the Quran: مائة
  o Some numbers are mentioned indirectly: الف سنة الا خمسين عاما

Page 64: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing :: Tokenization

Phases:
• Phrases to words (tokens)
• Words to their parts (sub-tokens)

Page 65: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing :: Tokenization

Page 66: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing :: Normalization

• Normalize the Uthmani text into standard text
• Strip all recitation marks
• Keep the vowels, except the declension case-ending vowel
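The last two normalization rules can be approximated with Unicode character classes. This is a rough sketch under stated assumptions: the range U+06D6 to U+06ED is treated wholesale as "recitation marks", and any trailing short vowel or tanwin is treated as the case ending. Both are simplifications (a final vowel is not always a case marker), and the Uthmani-to-standard conversion is not covered.

```python
import re

# Assumption: Quranic annotation signs U+06D6..U+06ED count as recitation marks.
RECITATION_MARKS = re.compile("[\u06D6-\u06ED]")
# Tanwin (fathatan, dammatan, kasratan), fatha, damma, kasra, sukun.
SHORT_VOWELS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0652"
SHADDA = "\u0651"


def strip_recitation_marks(word):
    """Remove the recitation (annotation) marks from a word."""
    return RECITATION_MARKS.sub("", word)


def strip_case_ending(word):
    """Drop trailing short vowels or tanwin; internal vowels are kept."""
    return word.rstrip(SHORT_VOWELS)


def strip_all_vowels(word):
    """Produce the fully unvocalized form of a word."""
    return "".join(ch for ch in word if ch not in SHORT_VOWELS and ch != SHADDA)


def normalize(word):
    """Strip recitation marks, then the (assumed) case-ending vowel."""
    return strip_case_ending(strip_recitation_marks(word))


# "kitabun" followed by a recitation mark, written with escapes.
sample = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C\u06D6"
print(normalize(sample))
```

Keeping `strip_all_vowels` separate makes it possible to index the vocalized and the unvocalized form of a word side by side, which partial-vocalization search would need.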

Page 67: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing :: Filtering stop words

Stop-word selection strategy:
• Chosen from the list of the most frequent words in the Qur’an
• Considering vocalization
• Preferring:
  o Particles such as لكن
  o Pronouns such as انت
  o Clitics such as فـ

Page 68: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Text processing :: Stemming

• We proposed stripping the affixes during tokenization.
• In stemming, we bring the word back to either:
  o ROOT: a large set of words with different meanings
  o STEM: a smaller set of words with similar meanings

Page 69: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Quranic Word as Search Unit

Purpose: obtain a quick, efficient and stable method to retrieve specific Quranic words.

Requirements: a corpus of Quranic words, enriched with linguistic annotations
• Word occurrence as a unit
• Word form as a unit

Information schema:
• Identifiers: a global identifier, and a secondary identifier based on the order in the ayah, added to the ayah identifier and the surah identifier;
• Different forms: Uthmani vocalized word (the main form), standard vocalized word, standard unvocalized word;
• Transliterations: ISO233, Buckwalter, Arabtex;
• Translations: English, other languages;
• Different levels of stemming: lemma, stem, root;
• Other properties: part of speech, type, state, case, mood, voice, number, gender, person.

Page 70: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: 2-steps search strategy

1st step: retrieve the best set of keywords for the user query by searching in:
• A word-as-a-unit index
• A Quranic words ontology

2nd step: retrieve the corresponding ayahs using the set of keywords resulting from the first step.
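The two steps can be sketched as two lookups chained together. Everything below is hypothetical toy data: the transliterated words, the `A1`..`A5` ayah identifiers and the function names are placeholders, not the system's real indexes.

```python
# Hypothetical ontology: (word, relation) -> related words.
ONTOLOGY = {("janna", "synonym"): ["firdaws", "naim"]}
# Hypothetical word-as-a-unit index: the set of known Quranic word forms.
WORD_INDEX = {"janna", "firdaws", "naim", "nar"}
# Hypothetical ayah index: word form -> identifiers of the ayahs containing it.
AYAH_INDEX = {
    "janna": {"A1", "A2"},
    "firdaws": {"A3"},
    "naim": {"A2", "A4"},
    "nar": {"A5"},
}


def select_keywords(word, relation=None):
    """Step 1: build the keyword set from the word index and the ontology."""
    keywords = [word] if word in WORD_INDEX else []
    if relation is not None:
        for related in ONTOLOGY.get((word, relation), []):
            if related in WORD_INDEX and related not in keywords:
                keywords.append(related)
    return keywords


def retrieve_ayahs(keywords):
    """Step 2: collect the ayahs that contain any of the keywords."""
    hits = set()
    for keyword in keywords:
        hits |= AYAH_INDEX.get(keyword, set())
    return sorted(hits)


def two_step_search(word, relation=None):
    """Chain both steps: keyword selection, then ayah retrieval."""
    return retrieve_ayahs(select_keywords(word, relation))


print(two_step_search("janna", "synonym"))
```

Separating the steps means every word-level feature (derivations, fuzzy matching, word properties) only has to produce a keyword set; the ayah retrieval step stays the same.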

Page 71: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

71

Conception :: 2-steps search :: applications

Page 72: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Word Search :: Word properties

Objective: allow users to locate ayahs based on the linguistic properties of words, such as POS, type, state, case, mood, voice, number, gender and person.

Method: fielded search.

A fielded search is an advanced query feature that lets the user restrict the query to chosen document fields and supply the required keywords for each of those fields.
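A fielded query over word properties can be sketched as parsing `field:value` pairs and filtering word records with them. The mini-syntax, the field names and the three transliterated records below are illustrative assumptions, not the query language of the real system.

```python
# Hypothetical annotated word records (transliterated for readability).
WORDS = [
    {"word": "malik", "root": "mlk", "pos": "noun", "number": "singular"},
    {"word": "muluk", "root": "mlk", "pos": "noun", "number": "plural"},
    {"word": "malaka", "root": "mlk", "pos": "verb", "number": "singular"},
]


def parse_fielded_query(query):
    """Parse 'field:value field:value ...' into a dict of criteria."""
    criteria = {}
    for part in query.split():
        field, sep, value = part.partition(":")
        if not sep or not field or not value:
            raise ValueError(f"expected field:value, got {part!r}")
        criteria[field] = value
    return criteria


def match_words(words, criteria):
    """Keep the records whose fields equal every requested value."""
    return [
        w for w in words
        if all(w.get(field) == value for field, value in criteria.items())
    ]


criteria = parse_fielded_query("root:mlk pos:noun number:singular")
print([w["word"] for w in match_words(WORDS, criteria)])
```

The matched word forms would then serve as the keyword set for the ayah retrieval step.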

Page 73: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Word Search :: Semantically Related Words

Objective: offer the words related to a keyword entered by the user.

Algorithm:
1. The user specifies:
   o The word
   o The semantic relation: Synonymy, Antonymy, Hypernymy, Hyponymy, Meronymy, Holonymy, Troponymy.
2. The ontology is queried for the related words.
3. Those keywords are used to retrieve the corresponding ayahs.

Page 74: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Word Search :: Multi-level Derivations

Objective: get the set of words that share the same origin, such as a stem or a root.

Algorithm:
1. The user specifies:
   o The keyword
   o The level of derivation
2. The origin of the word at the specified derivation level is recovered.
3. All the words that share this origin are retrieved.
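The algorithm above can be sketched over a table of annotated words. The records are hypothetical transliterated examples, and `derive` is an illustrative name rather than the system's API.

```python
# Hypothetical annotated word table (transliterated for readability).
WORDS = [
    {"word": "qala", "lemma": "qala", "root": "qwl"},
    {"word": "qalat", "lemma": "qala", "root": "qwl"},
    {"word": "qawl", "lemma": "qawl", "root": "qwl"},
    {"word": "kataba", "lemma": "kataba", "root": "ktb"},
]


def derive(keyword, level):
    """Return all the words sharing the keyword's origin at the given level.

    `level` names a column of the word table, e.g. "lemma" or "root".
    """
    # Recover the origin(s) of the keyword at the requested level.
    origins = {w[level] for w in WORDS if w["word"] == keyword}
    # Retrieve every word that shares one of these origins.
    return sorted(w["word"] for w in WORDS if w[level] in origins)


print(derive("qalat", "lemma"))
print(derive("qalat", "root"))
```

The higher the level (lemma, then stem, then root), the larger and the more semantically diverse the returned set becomes.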

Page 75: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Word Search :: Specific Derivations

Objective: find the words that result from applying a specific derivation operation to the word given by the user.

Algorithm:
1. The user should:
   o Enter the keyword
   o Specify which derivation to apply
2. The set of derived words is generated either by:
   o fetching from the word index, or
   o using linguistic tools such as verb conjugators; the output is then filtered, as a second step, by intersection with the set of Quranic words.
3. The resulting set is used to locate the corresponding ayahs.

Page 76: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Word Search :: Fuzzy Search

Objective: search using the set of words that are close to the input word in spelling or pronunciation.

Methods:
• Levenshtein distance (previously unknown text)
• N-grams
• Spell-checker
• Soundex (phonetic)
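Of the methods listed, the Levenshtein (edit) distance is the simplest to illustrate. The sketch below is the textbook dynamic-programming version; `fuzzy_candidates` and its `max_distance` threshold are illustrative assumptions, not the engine's actual matching rule.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_candidates(word, vocabulary, max_distance=1):
    """Words of the vocabulary within max_distance edits of the input."""
    return [w for w in vocabulary if levenshtein(word, w) <= max_distance]


print(levenshtein("kitten", "sitting"))
```

Since a transposition of two adjacent letters costs two edits under this distance, mis-ordered letters need a threshold of at least 2, or a variant of the distance that counts a swap as one edit.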

Page 77: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conception :: Word Search :: Fuzzy Search

Arabic-specific similarities:
• مؤصدة and مءصدة
• الحمد and الحمد
• عشر and عشر
• يضلله and يضله

Examples:
• Mis-ordered letters: زنبجيل for زنجبيل
• Phonetic similarity: هرم for ارم
• Spelling similarity: الضحي for الضحى

Page 78: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Open Source, but WHY?

A number of advantages led us to open source. The following points are the most important of them [Web-Oss-watch]:

• Collaborative bug-fixing and fast detection of security vulnerabilities
  "Given enough eyeballs, all bugs are shallow" (an open source slogan)
• Customization
• Translation and localization
• Protection against development discontinuation
• Being part of a community
• Low cost

Page 79: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Used Technologies :: Python

Python is a powerful, widely used dynamic programming language. Features:
• powerful and fast
• plays well with others
• runs everywhere
• friendly and easy to learn
• free and open

Page 80: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Used Technologies :: Whoosh API

Whoosh is a full-text indexing and searching library implemented in Python

Features:
• Pure Pythonic API
• Fielded indexing and search
• Fast indexing and retrieval
• Powerful query language

Useful in circumstances such as:
• Anywhere a pure-Python solution is desirable, to avoid having to build or compile native libraries
• As a research platform (Python is easier to read!)
• When the search features matter more to us than raw speed

Page 81: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Previous Code Base

• Implemented in [Chelli&Dahmani2010]
• Licensed under GPL* (server applications issue)
• Based on the Whoosh indexing library
• Offering many search operations
• Results in HTML format
  o The raw format can be used in Python
  o Requires writing wrappers for other languages
• A basic resource manager
  o Has a missing piece

Page 82: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements

The code base:
• has had 981 commits, representing 15,243 lines of code
• is mostly written in Python, with well-commented source code
• took an estimated 4 years of effort (COCOMO model)

Reference: Ohloh website.

Page 83: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements New Output System

83

A New Output System: JSON-Based ==> Simpler & more extensible Centralized ==> Changes on one & only one place Extended & Extensible Results Structure Customizable Search Request using flags Including a Statistic Calculating Unit Offering Meta-Data for request


Page 84: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements :: Multiple Search Units

Translation-as-unit

Word-as-unit

Page 85: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements :: Many new features

Fuzzy Search Feature

Retrieving the neighbors of each ayah
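The two features named on this slide can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the project's code: `fuzzy_candidates`, `neighbors`, the sample vocabulary, and the similarity cutoff are all hypothetical, and `difflib` stands in for whatever fuzzy matching the real system uses.

```python
import difflib

# Hypothetical vocabulary of transliterated index terms.
VOCABULARY = ["rahman", "rahim", "malik", "sirat", "mustaqim"]


def fuzzy_candidates(keyword, vocabulary=VOCABULARY, cutoff=0.7):
    """Return up to three vocabulary terms similar to a possibly misspelled keyword."""
    return difflib.get_close_matches(keyword, vocabulary, n=3, cutoff=cutoff)


def neighbors(ayah_number, total_ayahs, window=1):
    """Return the ayah numbers surrounding `ayah_number` within the same sura."""
    first = max(1, ayah_number - window)
    last = min(total_ayahs, ayah_number + window)
    return [n for n in range(first, last + 1) if n != ayah_number]


print(fuzzy_candidates("rahmen"))   # best match is listed first
print(neighbors(4, total_ayahs=7))  # previous and next ayah
print(neighbors(1, total_ayahs=7))  # the first ayah has no predecessor
```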


Page 86: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements :: Many new features (2)

Manipulating different Quranic Scripts

More suggestion operations

Showing the linguistic annotations

Retrieving & Showing transliterated keywords (Buckwalter)
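Buckwalter transliteration maps each Arabic character to one ASCII character, which is why it can be applied in both directions. A minimal sketch follows; the table covers only the basic letters and common diacritics (the extended symbols used for Quranic text are left out), and the function names are hypothetical.

```python
# Basic Buckwalter table: hamza forms, letters, then diacritics.
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&",
    "\u0625": "<", "\u0626": "}", "\u0627": "A", "\u0628": "b",
    "\u0629": "p", "\u062A": "t", "\u062B": "v", "\u062C": "j",
    "\u062D": "H", "\u062E": "x", "\u062F": "d", "\u0630": "*",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z",
    "\u0639": "E", "\u063A": "g", "\u0641": "f", "\u0642": "q",
    "\u0643": "k", "\u0644": "l", "\u0645": "m", "\u0646": "n",
    "\u0647": "h", "\u0648": "w", "\u0649": "Y", "\u064A": "y",
    "\u064B": "F", "\u064C": "N", "\u064D": "K", "\u064E": "a",
    "\u064F": "u", "\u0650": "i", "\u0651": "~", "\u0652": "o",
}
REVERSE = {latin: arabic for arabic, latin in BUCKWALTER.items()}


def to_buckwalter(text):
    """Transliterate Arabic text; unknown characters pass through unchanged."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Convert a Buckwalter keyword back to Arabic script."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


word = "\u0643\u062A\u0627\u0628"  # kaf, ta, alif, ba
print(to_buckwalter(word))
print(from_buckwalter("ktAb") == word)
```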


Page 87: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements :: Resources Importing Manager

Resources Importing Manager:
Downloading original resources (licensing issue)
Parsing and importing the data into our intermediate database
Indexing the database
Updating auto-generated data files
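The parse, import, and index steps listed above can be sketched end to end with the standard library. Everything specific here is an assumption made for illustration: the pipe-delimited `sura|ayah|text` layout, the table schema, the function names, and the embedded sample data (the download step is replaced by a string so the sketch runs offline).

```python
import sqlite3

# Stand-in for a downloaded resource file; layout and content are hypothetical.
RAW_RESOURCE = "1|1|first sample line\n1|2|second sample line\n2|1|third sample line\n"


def parse(raw):
    """Yield (sura, ayah, text) tuples from the raw resource."""
    for line in raw.splitlines():
        sura, ayah, text = line.split("|", 2)
        yield int(sura), int(ayah), text


def import_resource(raw):
    """Load the parsed rows into an intermediate SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ayah (sura INTEGER, ayah INTEGER, text TEXT)")
    db.executemany("INSERT INTO ayah VALUES (?, ?, ?)", parse(raw))
    db.commit()
    return db


def build_index(db):
    """Build a term -> [(sura, ayah), ...] index from the database."""
    index = {}
    rows = db.execute("SELECT sura, ayah, text FROM ayah ORDER BY sura, ayah")
    for sura, ayah, text in rows:
        for term in text.split():
            index.setdefault(term, []).append((sura, ayah))
    return index


db = import_resource(RAW_RESOURCE)
index = build_index(db)
print(index["sample"])
```

Keeping an intermediate database between the original resources and the index means the index can be rebuilt without downloading or parsing again.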


Page 88: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements :: Packaging System

Automating the API build
Packaging into:
Source tarball
Binary tarball
Python egg package
Debian deb package
Red Hat rpm package
Windows installer
Mac OS (perspective)

Page 89: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Our improvements :: More

Coding standardization:
Following Python conventions (PEP8)
Using Pylint (a source code bug and quality checker)

Documentation coverage:
Enriching the code with readme files

New console interface

Page 90: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Open Issues

Implement modularity for the Query Parser: this is important to enable extensibility and to fix the problem of mixing (combining) the different operations performed during parsing.

Restrict anonymous requests to the API: restricting requests protects the API from flooding, whether intended or not. This can be done by: limiting the maximum number of simultaneous requests, globally and per IP; implementing an identification system that works with remote clients.
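One possible way to cap simultaneous requests both globally and per IP is a small thread-safe counter, sketched below. The `RequestLimiter` class and the chosen limits are hypothetical; this shows the general technique, not code from the project.

```python
import threading
from collections import Counter


class RequestLimiter:
    """Caps the number of in-flight requests globally and per client IP."""

    def __init__(self, max_global, max_per_ip):
        self._lock = threading.Lock()
        self._max_global = max_global
        self._max_per_ip = max_per_ip
        self._active = Counter()  # ip -> number of requests being served

    def try_acquire(self, ip):
        """Reserve a slot for `ip`; return False if either limit is reached."""
        with self._lock:
            if sum(self._active.values()) >= self._max_global:
                return False
            if self._active[ip] >= self._max_per_ip:
                return False
            self._active[ip] += 1
            return True

    def release(self, ip):
        """Free a slot when the request for `ip` has finished."""
        with self._lock:
            if self._active[ip] > 0:
                self._active[ip] -= 1
            if self._active[ip] == 0:
                del self._active[ip]


limiter = RequestLimiter(max_global=3, max_per_ip=2)
if limiter.try_acquire("10.0.0.1"):
    try:
        pass  # serve the search request here
    finally:
        limiter.release("10.0.0.1")
```

A server would call `try_acquire` before handling a request and answer with an error when it returns False.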

Move to the latest version of the Whoosh library: the stable release of Whoosh is almost at version 3.X, while we are still using an older version, 0.3. Moving to the latest version is highly recommended in order to benefit from the improvements made. However, it will not be easy, since our API is intertwined with the older version, especially in the Query Parser.

Page 91: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Interfaces

Page 92: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Open Issues

Complete the features implementation
Enrich the linguistic resources
Implement modularity for the Query Parser
Restrict anonymous requests to the API
Move to the mainstream of the Whoosh library
Maintain compatibility between Python versions
Cover with documentation
Optimize code and performance

Page 93: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Open Issues

Enriching the linguistic resources: the resources currently used are poor compared to what we really need.
Integrate the Qurany project to enrich the current faceted thematic search.
Integrate the boundary annotations to enable retrieving boundaries in the Quran.
Propose a standard format for new linguistic and Quranic resources.
Textify the binary database to make it possible to log changes and to benefit from revision control systems such as Git.

Page 94: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Open Issues

Complete the features implementation (feature: status)

Fielded search: YES
Logical relations: YES
Phrase search: YES
Interval search: YES
Full Regex: NO
Wildcards: PARTIALLY
Boosting keywords: YES
Pagination: YES
Scoring: YES
Sorting: YES
Keywords Highlight: YES
Uthmani full marks: YES
Real time output: NO
Results grouping: NO
Spell correction: PARTIALLY
Related keywords: PARTIALLY
Different vocalizations: YES
Collocated words: NO
Keyboard mapping: NO
Different significations: NO
Romanization: PARTIALLY
Partial vocalization: PARTIALLY
Multi-level derivation: YES
Syntactic Coloration: NO
Vocal Search: NO
Specific-derivations: NO
Linguistic annotations: PARTIALLY
Fuzzy string: PARTIALLY
Word properties: PARTIALLY
Linguistic examples: NO
Structural options: YES
Translation search: YES
Uthmani writing way: NO
Recitation marks: PARTIALLY
Divine Names Highlight: NO
Repetitions&Allegoricals: NO
Abrogators&Abrogated: NO
Qur’anic Parables: NO
Semantically related words: PARTIALLY
Faceted Thematic Search: PARTIALLY
Entity Extraction: NO
Questions Answering (QA): NO
Automatic vocalization: NO
Co-reference resolution: NO
Vocalized word frequency: YES
Unvocalized word frequency: YES
Other Qur’anic units frequency: NO
Root/Stem/Lemma frequency: NO

Page 95: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Open Issues

Move to Python 3.X: Python 2 is disappearing and sooner or later it will be fully replaced. Many tools offer automatic scripts to convert code from Python 2 to Python 3. However, a large part of the work often has to be done manually.

Cover with documentation: documentation is important; it is expensive, but it encourages the community to get involved in the project. This can be done by: enriching the readme files; enriching the code with appropriate comments; creating a usage how-to and strengthening it with many demos; writing the man page for the console interface.

Optimize code and performance: continue fixing the warnings raised by Pylint code analysis, and use Profile to check the performance of each search feature in order to improve it.
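Profiling a single search feature can be done with Python's standard `cProfile` and `pstats` modules, as in the sketch below. The `sample_search` function and the synthetic corpus are placeholders standing in for a real search feature.

```python
import cProfile
import io
import pstats


def sample_search(keyword, corpus):
    """Placeholder for a search feature: a plain substring scan."""
    return [line for line in corpus if keyword in line]


# Synthetic corpus, used only to give the profiler something to measure.
corpus = [f"sample line {i}" for i in range(10000)]

profiler = cProfile.Profile()
hits = profiler.runcall(sample_search, "999", corpus)

# Collect the report in a string, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Running the same measurement before and after a change shows whether an optimization actually helped.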


Page 96: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Interfaces

Page 97: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Interfaces :: API

Strong points:
1. Free, libre, open
2. A Python API
3. A founded base
4. Lots of features

Page 98: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Interfaces :: API#Sample

Page 99: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Interfaces :: JSON web service

Page 100: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Implementation :: Interfaces :: Console

Page 101: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Examples of use

As a desktop application
As a web interface: www.alfanous.org
As a smartphone app: iPhone, iPad, Windows Phone

Page 102: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Examples of use :: Alfanous.org

Remarkable features: localizable

Awarded: best-in-technicality website in the Algeria Web Awards 2012

Page 103: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Examples of use :: Alfanous.org (Responsive)

Remarkable features: user experience, responsiveness, simplicity

Awarded: chosen as the best website in the religious websites category in the Algeria Web Awards 2013

Page 104: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Examples of use :: iPhone Application

Developed by: iPhone-islam (Objective-C)

Remarkable features: runs on the iPhone and iPad series

Page 105: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Examples of use :: Windows Phone App

Developed by: Moumen bou Abdellah (C#)

Remarkable features: runs on Windows Phone

Page 106: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Examples of use :: Alfanous Desktop Interface

Remarkable features: offline use

Page 107: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conferences

1. An Arabic paper at NITS 2011, KSA:
Title: An Application Programming Interface for indexing and search in Noble Quran
Authors: Assem Chelli, Merouane Dahmani, Amar Balla, Taha Zerrouki.

2. An English paper at a pre-conference workshop of LREC 2012, Turkey, "LRE-Rel: Language Resource and Evaluation for Religious Texts":
Title: Advanced Search in Quran: Classification and Proposition of All Possible Features.
Authors: Assem Chelli, Amar Balla, Taha Zerrouki.

Page 108: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

Conclusion & Perspectives

We went through the implementation of many of the search features that we previously listed.

There are still more improvements to be made and many issues to be resolved. We leave them as perspectives:
Achieving an accurate statistics gathering system;
Implementing a more adequate suggestion system;
Clearing the way toward a semantic search engine;
Completing the full conception of all search features;
Completing the implementation of all open issues.

Page 109: Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending

THANK YOU FOR YOUR ATTENTION …

Any questions?

Contacts:
Email: [email protected]: @assem-ch
Twitter: @assem_ch

Project links:
Website: www.alfanous.org
User feedback: feedback.alfanous.org
Source code: www.github.com/assem-ch/alfanous