Proposal of an Advanced Retrieval System for Noble Qur'an - Thesis defense
بسم الله الرحمن الرحيم
الر كتاب احكمت اياته ثم فصلت من *
*لدن حكيم خبير
THESIS OF MAGISTER
Proposal of an Advanced Retrieval System for Noble Qur’an
PRESENTED BY ASSEM CHELLI
SUPERVISED BY PR. AMAR BALLA
M. TAHA ZERROUKI
Plan
• Introduction
• Problematic
• State of Art: Search Engines, Arabic Language, Noble Quran
• Objectives
• Proposed search features
• Conception
• Implemented work
• Published papers
• Conclusion & Perspectives
Introduction
Qur’an, in Arabic, means the Read or the Recitation. Muslim scholars define it as:
« the words of Allah revealed to His Prophet Muhammad, written in Mus’haf and transmitted by successive generations »
The Qur’an is a sacred book for all Muslims. It is also the first reference of Islamic law.
For 14 centuries, Muslims have been:
• studying it,
• teaching it,
• writing books about it,
• and, recently, developing applications for it.
Problematic
The Qur’an is an important source of information about all aspects of life: scientific, social, historical, political, ethical, juridical, etc. It holds a huge amount of information.
Regular search tools struggle to extract key information from the Qur’an, so we should find other ways to query it.
The appropriate solution is an Advanced Retrieval System. Why a retrieval system? Why advanced?
Indexing
Indexing consists in:
• Analyzing each document in the collection to create a set of keywords.
• Creating a representation of the documents in the system.
• Supporting other domains: auto-clustering of documents, related keywords suggestion, automatic document analysis, calculating collocated terms, auto-summarization, etc.
Full-text search
A technology for finding documents matching a set of words.
Most web search engines, such as Google and Bing, use full-text search engines at the heart of their service.
The core of a full-text search engine is split into two main operations:
• Indexing the information into an efficient format
• Searching the relevant information in this pre-computed index
Indexing :: Phases
Example: « Assem is >defending< his thesis!! »
Tokenization: Assem + is + >defending< + his + thesis!!
Normalization: assem + is + defending + his + thesis
Filtering stop words: assem + $ + defending + $ + thesis
Stemming: assem + $ + defend + $ + thesis
Resulting keywords: assem, defend, thesis
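The four phases above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the single suffix rule are assumptions chosen to reproduce the slide's example, not the text processing used in the actual system.

```python
import re

# Assumed stop-word list and suffix rule, chosen only to reproduce the example.
STOP_WORDS = {"is", "his"}
SUFFIXES = ("ing",)


def tokenize(text):
    """Split on whitespace; punctuation stays attached to the tokens."""
    return text.split()


def normalize(token):
    """Lowercase the token and drop every non-word character."""
    return re.sub(r"[^\w]", "", token.lower())


def stem(token):
    """Strip the first matching suffix (a toy rule, not a real stemmer)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def extract_keywords(text):
    """Run tokenization, normalization, stop-word filtering and stemming."""
    tokens = [normalize(t) for t in tokenize(text)]
    kept = [t for t in tokens if t and t not in STOP_WORDS]
    return [stem(t) for t in kept]


print(extract_keywords("Assem is >defending< his thesis!!"))
# ['assem', 'defend', 'thesis']
```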
Indexing :: Index types
Document index:
• Document 1 | The cow says moo
• Document 2 | The cat and the hat
• Etc.
Forward index:
• Document 1 | the, cow, says, moo
• Document 2 | the, cat, and, the, hat
• Etc.
Inverted index:
• “the” | doc 1, doc 2, …
• “cow” | doc 1, …
• Etc.
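As a rough sketch of how the three index types relate, the following Python builds a forward index and an inverted index from the two sample documents. The function names are illustrative assumptions; a real engine would also apply the text processing phases before indexing.

```python
from collections import defaultdict

# Document index: the raw documents, keyed by identifier.
documents = {
    1: "The cow says moo",
    2: "The cat and the hat",
}


def build_forward_index(docs):
    """Map each document to the list of terms it contains."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def build_inverted_index(forward):
    """Map each term to the documents that contain it (no duplicates)."""
    inverted = defaultdict(list)
    for doc_id, terms in forward.items():
        for term in terms:
            if doc_id not in inverted[term]:
                inverted[term].append(doc_id)
    return dict(inverted)


forward_index = build_forward_index(documents)
inverted_index = build_inverted_index(forward_index)

print(forward_index[2])       # ['the', 'cat', 'and', 'the', 'hat']
print(inverted_index["the"])  # [1, 2]
print(inverted_index["cow"])  # [1]
```

Searching then becomes a lookup in the inverted index instead of a scan over every document.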
Querying (Search)
Querying is the phase of interaction between the system and the user.
Search takes a user query and returns the effective list of matching results sorted by relevance.
Relevance: A degree of relationship between the document and the query
Querying process
Semantic Approach
Objective: improve search accuracy by understanding searcher intent and the contextual meaning of terms to generate more relevant results.
Semantic search is not just contextual search.
It is a smart search that considers several factors to provide the most relevant and useful results.
Semantic Approach :: factors
• Current trend
• Location of search
• Intent of the search
• Variations of words
• Synonyms
• Generalized and specialized queries
• Concept matching
• Natural language queries
• Change of meaning based on the group of words
Semantic Approach :: factors (examples)
• Current trend: "Who wins the Classico?" (the last one, of course)
• Location of search: "Weather temperature?" (here in Algiers, preferably)
• Intent of the search: "Earth quake" (checking if one happened, or looking for articles)
• Variations of words: Man, Men, Man’s
• Synonyms: Biggest mountain, Highest mountain
• Generalized and specialized queries: Health vs Diabetes
• Concept matching: Half life, the game or the physical constant
• Natural language queries: "What time is it in Cairo?"
• Change of meaning based on the group of words: "New egg health benefits" vs "New egg health products"
Arabic :: Orthography
• A Semitic language
• The language of the Quran
• A right-to-left language
• A semi-cursive script: most letters are attached to each other and change shape accordingly
Arabic :: Lexicography
Classical Arabic grammar has only three word classes:
• Verbs
  • Verbs with a simple root (الفعل المجرد): Hamzated verb (مهموز), Assimilated verb (مثال), Hollow verb (اجوف), Weakened verb (ناقص), Geminated verb (مضعف).
  • Verbs with an augmented root (الفعل المزيد): فعل، فاعل، افعل، تفعل، تفاعل، افتعل، انفعل، استفعل
• Nouns
  • Primitive nouns (الأسماء الجامدة)
  • Nouns derived from verbals (الأسماء المشتقة)
  • Numbers, Demonstrative pronouns, Relative pronouns, Personal pronouns, Function words
• Particles
Arabic :: Morphology
• Arabic is a fusional language, considered an introflexive language:
  • Consonants indicate the meaning
  • Vowels mark the flexion
• Arabic is very rich and based on a structure of patterns (about 500) and roots (about 7000).
• Theoretically:
  • A single Arabic root can generate hundreds of words (noun, verb, ...) by applying patterns.
  • A single Arabic word can exist in about a hundred forms by adding certain suffixes and prefixes.
Arabic :: Flexional Morphology
• For the conjugation of verbs and the declension of nouns, Arabic uses some indications (generally affixes) of:
  • aspect, mood, time, person, gender, number, case.
• These flexional marks can distinguish:
  • Mode of verbs: Perfective, Imperfective …
  • Function of nouns: Nominative, Accusative, Genitive
Arabic :: Flexion
• Flexion of verbs (Conjugation)
  o Aspect
  o Mood: Doubted, Affirmed (Actual or Eventual)
  o Tense: Perfective (الماضي): فعلت، فعلت، فعلت; Imperfective (المضارع); Imperative (الأمر)
Arabic :: Flexion :: Verbs
• Perfective (الماضي):
  • 1st person: فعلت، فعلنا
  • 2nd person: فعلت، فعلت، فعلتما، فعلتم، فعلتن
  • 3rd person: فعل، فعلت، فعلا، فعلتا، فعلوا، فعلن
• Imperfective (المضارع):
  • Nominative,
  • Accusative,
  • Jussive,
• Imperative (الأمر)
Arabic :: Flexion :: Nouns
• Flexion of nouns (declension)
  o 3 cases: Nominative (الرفع), Accusative (النصب), Genitive (الجر)
  o Depends on:
    Number: Singular (المفرد), Dual (المثنى), Plural (الجمع)
    Form: Triptote, Diptote, etc.
Arabic :: Flexion :: Nouns (2)
o Declension of singular nouns
  Triptotes (الأسماء المنصرفة): كتاب، كتابا، كتاب
  Diptotes (الأسماء الممنوعة من الصرف): صحراء قاحلة
  Five Nouns (الأسماء الخمسة): اخو، اخا، اخي
  Deverbals with defective roots: ماض
o Declension of dual nouns: كتابان، كتابين
o Declension of plural nouns
  External masculine plural (جمع مذكر السالم): كاتبون، كاتبين
o Declension of function words
  Invariables: منذ
  Variables: كل
Arabic :: Derivational morphology
o Deverbal noun (المصدر): ود، وداد، ودادة، مودة
o Active participle (اسم فاعل): ضارب (hitter)
o Passive participle (اسم مفعول): مضروب (struck)
o Nouns of time and place (اسماء الزمان والمكان): مدرسة (school), مغرب (sunset)
o The Nomen Vicis (اسم المرة): ضربة (a hit)
o The Nomen Speciei (اسم الهيئة): جلست جلسة الأميرات (she sat like princesses)
Arabic :: Ambiguities :: Absence of Vocalization
If a text contains the unvocalized word (الملك), how should the search engine understand its meaning? Is it:
1. « Angel | المَلَك »
2. « Kingdom | المُلْك »
3. « King | المَلِك »
Arabic :: Ambiguities :: Prefixes
For the word « وعد », the letter wâw « واو » is either:
1. A part of the word: وعد (to promise)
2. Not a part of the word: و + عد (and + to count)
Arabic :: Ambiguities :: Suffixes
For the word « وله », the letter ha’ (هاء) is either:
1. A part of the word: وله (admire)
2. Not a part of the word: ول + ه (crown + him), or و + له (and + he has)
Quran :: Structure
The Qur’an consists of 114 surahs, and each surah is divided into ayahs. This is the main fragmentation, specified by the Prophet.
[Diagram: القران divided into سورة 1 … سورة 114, each made of ayahs (اية)]
Quran :: Structure
There are many fragmentations:
• Primary structure: surah, ayah, word and letter;
• Special locations: First ayahs of Surah (فواتح السورة), Last ayahs of Surah (خواتيم السورة), Qur’anic comma (فاصلة قرانية), Sajdah (سجدة), Waqf (وقف);
• Other structures: page, Juz’ (جزء), Hizb (حزب), Nisf (نصف), Rubu’ (ربع), Thumn (ثمن)
[Diagram: القران divided into thirty juz’ (جزء), each subdivided into hizb (حزب), nisf (نصف), rubu’ (ربع) and thumn (ثمن)]
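Quran :: Structure :: Stops (Waqfs)
Quran :: Uthmani Script
standard | uthmani | position | changes
سأريكم | سأوريكم | الأعراف: 145 | addition of waw (زيادة الواو)
العالمين | العلمين | all its positions in the Quran | deletion of alef (حذف الألف)
الغاوون | الغاون | الشعراء: 94 and another position | deletion of waw (حذف الواو)
النبيين | النبين | all its positions in the Quran | deletion of ya’ (حذف الياء)
الليل | اليل | all its positions in the Quran | deletion of lam (حذف اللام)
ننجي | نجي | الأنبياء: 88 | deletion of nun (حذف النون)
وجيء | وجائ | الزمر: 69 and another position | addition of alef (زيادة الألف)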
Quran :: Sciences
Specific to the Quran:
• Tafsīr (التفسير)
• Knowledge of Makkan and Medinan ayahs
• Knowledge of the causes of revelation
• Knowledge of the beginnings of surahs
• Science of allegorical ayahs (علم المتشابه)
• Qur’anic Parables (الأمثال القرانية)
Quran :: Sciences (2)
Shared with other resources:
• Legislative study:
  • Fiqh (الفقه)
  • Abrogating and Abrogated ayahs (الناسخ والمنسوخ)
  • General and Particular (العام والخاص)
• Linguistic study:
  • Orthography (علم مرسوم الخط)
  • Grammatical analysis of the Qur’an (اعراب الفاظ القران)
  • Morphology (الصرف)
  • Rhetoric (البلاغة)
  • Lexicology (علم المعاجم)
• Scientific study:
  • Scientific Miracles in the Quran
  • Numerical study of verses (ignoring the debate about it)
Quran :: indexes
The indexes are categorized by purpose into 5 main categories:
• Syntactic
• Semantic
• Structural
• Statistical
• Thematic
Quran :: Indexes :: Projects
• Midād lbayān: word morphology index
• Zerrouki’s Indexes: word morphology index, topic index, synonym index
• Qur’anic Arabic Corpus: word-by-word morphology index
• Tanzil Project: ayah index (Electronic Mushaf), structural index, surah index
• Boundary-Annotated Qur’an Corpus: word-by-word Waqf index (+ Uthmani-Standard mapping)
• Qurany Concepts Tool: concept index
Quran :: Ontologies + examples
• Qur’anic Concepts Ontology
• Henni’s Ontology
Quran :: Indexes/Ontologies projects :: Global critics
• Not available or not open, except Zerrouki’s Indexes, Quranic Arabic Corpus, and Tanzil
• Discontinued development, except Quranic Arabic Corpus and Tanzil
Quranic Search Tools
• Alawfa (الأوفى)
• Al-Monaqeb-Alqurany (المنقب القراني)
• Quran complex search service
• Quranic Researcher (الباحث القراني)
• Quranologie (علم القران)
• Quranic Corpus Word-by-Word Search
• Tanzil Quran Browser (تنزيل)
• Zekr (ذكر)
Quranic Search Tools :: Global Critics
• They are not full-text search engines, except the advanced search of Tanzil and Zekr.
• Basic search operations
• Simple query system
• Weak or unsupported linguistic operations, except the Quranic Corpus word-by-word search
• No semantic approach
• Closed source, except Zekr
• Implemented as interfaces, not as APIs or libraries.
Objectives
• Design a retrieval system that fits the Qur’an search needs.
• First, we should list and classify all the search features that are possible and helpful.
• Then, we need to study how to implement each feature and what its requirements are.
Proposed Search Features :: Advanced Query
• Fielded search: سورة:الفاتحة
• Logical relations: الصلاة و الزكاة
• Phrase search: "الحمد لله"
• Interval search: رقم_الآية:[1 الى 5]
• Full regular expressions: م[ان] to search for ما or من
• Wildcards (Jokers): ب؟طة matches بسطة and بصطة; *نبي* matches نبي, النبيين, الأنبياء, ...
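To make the wildcard feature concrete, here is a minimal sketch that matches joker patterns against a small vocabulary using Python's standard `fnmatch` module. The vocabulary and the function name are assumptions made for illustration, and translating the Arabic question mark into the ASCII `?` joker is one possible design, not necessarily how the actual query parser works.

```python
from fnmatch import fnmatchcase

ARABIC_QUESTION_MARK = "\u061f"  # the Arabic "؟" used as a single-letter joker


def wildcard_search(pattern, vocabulary):
    """Return the vocabulary words matching a pattern with ? and * jokers."""
    # fnmatch only understands the ASCII jokers, so translate the Arabic one.
    ascii_pattern = pattern.replace(ARABIC_QUESTION_MARK, "?")
    return [word for word in vocabulary if fnmatchcase(word, ascii_pattern)]


# Toy vocabulary (an assumption, not the real Quranic word list).
vocabulary = ["بسطة", "بصطة", "نبي", "النبيين", "الأنبياء", "كتاب"]

# One unknown letter between ب and طة.
print(wildcard_search("ب" + ARABIC_QUESTION_MARK + "طة", vocabulary))
# Any word containing the sequence نبي.
print(wildcard_search("*نبي*", vocabulary))
```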
Proposed Search Features :: Output Improvements
• Pagination
• Sorting: relevance, Mushaf natural order, revelation order, numerical, alphabetical, or Abjad order
• Keyword highlight: ذرني ومن <style>خلقت</style> وحيدا
Proposed Search Features :: Output Improvements (2)
• Real time output
• Results grouping: by surahs, by topics, by tafsir dependency, by revelation events, by allegorical ayahs, by parables
• Uthmani script with full diacritical marks
Proposed Search Features :: Suggestion System
• Spell corrections: ابراهام is corrected to ابراهيم
• Semantically related words (ontology-based): نبي suggests يوسف، اسحاق، يعقوب، اسرائيل ...
• Suggestion of the different meanings of a word: رب has (meaning 1: اله) and (meaning 2: سيد)
• Improvements: address the limitations of N-grams for vocalized words
Proposed Search Features :: Suggestion System (2)
• Different vocalizations: الملك suggests المَلِك، المُلْك، المَلَك ...
• Collocated words: سميع suggests سميع عليم، سميع بصير; الحمد suggests الحمد لله
• Keyboard mapping: fsl is read as بسم (f ب, s س, l م)
• Different significations: رب has a 1st meaning (god) and a 2nd meaning (master)
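The keyboard-mapping suggestion above amounts to a character-by-character substitution. The sketch below uses only the three key pairs given on the slide; a complete layout table would be needed in practice, and the function name is an assumption.

```python
# Only the three pairs given on the slide; a full keyboard layout is assumed
# to be supplied by the real system.
KEY_TO_ARABIC = {"f": "ب", "s": "س", "l": "م"}


def remap_keyboard(text, mapping=KEY_TO_ARABIC):
    """Replace each Latin key by the Arabic letter on the same key, if known."""
    return "".join(mapping.get(ch, ch) for ch in text)


print(remap_keyboard("fsl"))  # بسم
```

Characters without a known mapping are left unchanged, so a query that is already in Arabic passes through untouched.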
Proposed Search Features :: Linguistic aspects
• Romanization: خليفة is written .kalīfaẗ (ISO233), xalyfap (Buckwalter), _halyfaT (Arabtex)
• Syntactic coloration
• Partial vocalization search: a partially vocalized query such as لكم locates only the vocalized forms that agree with the marks given, and ignores the others
• Multi-level derivation: (word: اسقينا, level: lemma) to find واسقيناكم، لأسقيناهم، فاسقيناكموه
• Specific derivations: conjugation in the perfective of قال to find قالت، قال، قالوا، قلن ...
Proposed Search Features :: Linguistic aspects (2)
• Vocal search
• Word linguistic annotation
Proposed Search Features :: Linguistic aspects (3)
• Word properties embedded query: { جذر: ملك, نوع: اسم, عدد: مفرد }
• Numerical values search: 309 is replaced by ثلاثمائة وتسعة
• Fuzzy string search: مؤصدة may replace مءصدة
• Linguistic examples search: Rhetorical deletion (الحذف البلاغي), Grammatical shift (الالتفات)
• Uthmani writing way: بصطة may replace بسطة; نعمت may replace نعمة
Proposed Search Features :: Quranic Options
• Recitation marks retrieving: سجدة: نعم
• Structural options: صفحة: 1, جزء: عم
• Divine Name highlight
Proposed Search Features :: Quranic Options (2)
• Translation embedded query: { text: mercy lang: english author: shekir }
• Repetitions and allegorical ayahs (التكرار والمتشابهات): Repetition {55,13} == [فباي الاء ربكما تكذبان], 31 repetitions
• Abrogators and abrogated ayahs search (الناسخ والمنسوخ)
• Quranic parables (الأمثال): parable (سورة: البقرة)
  [مثلهم كمثل الذي استوقد نارا فلما اضاءت ما حوله ذهب الله بنورهم وتركهم في ظلمات لا يبصرون]
Proposed Search Features :: Semantic Queries
• Semantically related words (based on ontology):
  • Syn(جنة) to find جنة، نعيم، فردوس …
  • Ant(جنة) to find جحيم، سعير، جهنم، سقر …
  • Is(جنة) to find عدن، فردوس …
• Faceted thematic search
Proposed Search Features :: Semantic Queries
Natural questions: من؟ ما؟ اين؟ متى؟ لم؟ كم؟
• ما هي الحطمة؟ What is Al-hottamat?
  [نار الله الموقدة] - الهمزة 6 (It is the fire kindled by Allah)
• من هم الأنبياء؟ Who are the prophets?
  [إنا اوحينا إليك كما اوحينا إلى نوح والنبيين من بعده واوحينا إلى إبراهيم وإسماعيل وإسحاق ويعقوب والأسباط وعيسى وايوب ويونس وهارون وسليمان وآتينا داوود زبورا] - النساء 163
• اين غلبت/هزمت الروم؟ Where was Rome defeated?
  [في ادنى الأرض وهم من بعد غلبهم سيغلبون] - الروم 3
• كم مكث اصحاب الكهف؟ How long did the People of the Cave stay?
  [ولبثوا في كهفهم ثلاث مائة سنين وازدادوا تسعا] - الكهف 25
• متى يوم القيامة؟ When is the Day of Resurrection?
  [يسألك الناس عن الساعة قل إنما علمها عند الله وما يدريك لعل الساعة تكون قريبا] - الأحزاب 63
• كيف يتشكل الجنين؟ How is the embryo formed?
  [ثم خلقنا النطفة علقة فخلقنا العلقة مضغة فخلقنا المضغة عظاما فكسونا العظام لحما ثم انشأناه خلقا آخر فتبارك الله احسن الخالقين] - المؤمنون 14
Proposed Search Features :: Semantic Queries (2)
• Auto vocalization: من رسول الله is vocalized as مِنْ رَسُولِ اللَّهِ
• Entity extraction:
  • ثلاث مائة سنين وازدادوا تسعا as (time/number, 309)
  • ببكة as (place, Mekka)
  • كلمح البصر as (time unit, ??)
  • مثقال ذرة as (size unit, ??)
  • يا ايها النبي as (person, Mohammad)
• Proper nouns search (co-reference resolution): بنيامين؟
  [إذ قالوا ليوسف واخوه احب إلى ابينا منا ونحن عصبة إن ابانا لفي ضلال مبين] - يوسف 8
Proposed Search Features :: Statistical system
Frequencies of different units:
• How many occurrences of the word « الله » are there in Surah “المجادلة”?
• What are the ten most frequent words in the whole Qur’an?
• How many times are the word Sea/بحر and its derivations mentioned in the whole Qur’an?
• How many letters are there in Surah طه?
• What is the longest ayah?
• How many Sajdah marks are there in the whole Qur’an (different riwayat)?
Discussion of search features
To validate the usefulness, importance and clarity of each feature, we launched a survey to gather opinions.
We targeted a mixed audience to get high-quality feedback from: regular users, Quran scholars, Arabic morphology experts, Natural Language Processing / Information Retrieval researchers, and philosophers working on the comparison of religious scriptures.
Survey Takers
Survey Results
Conception
Previous work: the Engineer degree graduation project entitled “Development of a search and indexing engine for Qur’anic documents” [Dahmani2010]
Improvements:
• Moving to a fully vocalized search engine
• Customizing the text processing phases, considering both Uthmani and standard scripts
• Adopting the Quranic word as a search unit
Conception :: Full Vocalized Search Engine
Barriers:
• Comparing vocalized, partially vocalized, and unvocalized texts
• Distinguishing between original vowels and declension case markers
• Lack of vocalized Arabic linguistic resources: texts, ontologies, thesauri, corpora
Advantages:
• Lifting the ambiguities caused by ignoring vocalization
• Making search results, suggestions, and statistics more accurate
• Refining the detection of meanings (a first step in the semantic approach)
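One way to picture the first barrier, comparing vocalized, partially vocalized and unvocalized forms, is the following sketch. It assumes a simple rule that is not taken from the thesis: the letters must be identical, and every diacritic the user typed must agree with the indexed word, while letters left bare in the query accept any vocalization. All names are illustrative.

```python
# Arabic marks U+064B..U+0652: tanween forms, fatha, damma, kasra, shadda, sukun.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}

FATHA, DAMMA, KASRA, SUKUN = "\u064e", "\u064f", "\u0650", "\u0652"


def split_marks(word):
    """Return (letter, marks) pairs, marks being the diacritics after a letter."""
    units = []
    for ch in word:
        if ch in DIACRITICS:
            if units:  # a mark with no preceding letter is ignored
                letter, marks = units[-1]
                units[-1] = (letter, marks + ch)
        else:
            units.append((ch, ""))
    return units


def matches(query, word):
    """True if the word agrees with every letter and every mark of the query."""
    q_units, w_units = split_marks(query), split_marks(word)
    if len(q_units) != len(w_units):
        return False
    for (q_letter, q_marks), (w_letter, w_marks) in zip(q_units, w_units):
        if q_letter != w_letter:
            return False
        # An unmarked query letter matches any vocalization of that letter.
        if q_marks and set(q_marks) != set(w_marks):
            return False
    return True


malik = "م" + FATHA + "ل" + KASRA + "ك"   # king
malak = "م" + FATHA + "ل" + FATHA + "ك"   # angel
mulk = "م" + DAMMA + "ل" + SUKUN + "ك"    # kingdom

partial = "م" + FATHA + "لك"  # only the first vowel is given
print([matches(partial, w) for w in (malik, malak, mulk)])  # [True, True, False]
```

An unvocalized query such as ملك matches all three forms, which is exactly the ambiguity a fully vocalized engine tries to lift.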
Conception :: Text processing
We consider both the standard script and the Uthmani script to resolve difficulties such as:
• Searching with the Uthmani writing form of a word
• Calculating statistics based on the Uthmani writing
• Matching the word-by-word structure of some Quranic linguistic resources
Conception :: Text processing :: Global schema
Conception :: Text processing :: Substitution
A new phase. What is its purpose? Cases of substitution:
• Romanization. Guessing policy:
  • Nature of the characters used
  • Valid Arabic words
  • Word existence in the Quran
  • Predefined priorities
• Numbers as words. Rules:
  • We do not say صفر رجل, we say لا رجل
  • One is never mentioned as واحد but as احد
  • Some numbers accept gender: اثنان، اثنتان
  • Other numbers take the form of the gender opposite to that of the counted noun: سبع سماوات، سبعة ابحر
  • A hundred, مئة, has a special writing in the Quran: مائة
  • Some numbers are mentioned indirectly: الف سنة الا خمسين عاما
Conception :: Text processing :: Tokenization
Phases:
• Phrases to words (tokens)
• Words to their parts (sub-tokens)
Conception :: Text processing :: Normalization
• Normalize the Uthmani text into standard text
• Strip all recitation marks
• Keep the vowels, except the declension case-ending vowel
Conception :: Text processing :: Filtering stop words
Stop-words selection strategy:
• Chosen from the list of the most frequent words in the Qur’an
• Considering vocalization
• Preferring:
  • Particles such as لكن
  • Pronouns such as انت
  • Clitics such as ف
Conception :: Text processing :: Stemming
• We proposed stripping the affixes during tokenization.
• In stemming, we bring the word back to either:
  • ROOT: a large set of words with different meanings
  • STEM: a smaller set of words with similar meanings
Conception :: Quranic Word as Search Unit
Purpose: obtain a quick, efficient and stable method to retrieve specific Quranic words.
Requirements: a corpus of Quranic words, enriched with linguistic annotations
• Word occurrence as a unit
• Word form as a unit
Information schema:
• Identifiers: a global identifier, and a secondary identifier based on the order in the ayah, added to the ayah identifier and the surah identifier;
• Different forms: Uthmani vocalized word (the main form), standard vocalized word, standard unvocalized word;
• Transliterations: ISO233, Buckwalter, Arabtex;
• Translations: English, other languages;
• Different levels of stemming: lemma, stem, root;
• Other properties: part of speech, type, state, case, mood, voice, number, gender, person.
Conception :: 2-steps search strategy
1st step: retrieve the best set of keywords for the user query by searching in:
• A word-as-a-unit index
• A Quranic words ontology
2nd step: retrieve the corresponding ayahs using the keyword set resulting from the first step
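The two steps can be sketched as two lookups chained together. Everything below is a toy assumption: the ontology holds only the synonym example used elsewhere in this talk, and the ayah identifiers are placeholders rather than real references.

```python
# Step-1 resource: a toy ontology (synonyms of جنة, as in the slide example).
SYNONYMS = {"جنة": ["فردوس", "نعيم"]}

# Step-2 resource: a toy ayah index with placeholder identifiers.
AYAH_INDEX = {
    "جنة": ["ayah-1", "ayah-2"],
    "فردوس": ["ayah-3"],
    "نعيم": ["ayah-2", "ayah-4"],
}


def expand_keywords(word, ontology):
    """Step 1: turn the user's word into the best set of keywords."""
    return [word] + ontology.get(word, [])


def retrieve_ayahs(keywords, index):
    """Step 2: collect the ayahs of every keyword, without duplicates."""
    seen, results = set(), []
    for keyword in keywords:
        for ayah in index.get(keyword, []):
            if ayah not in seen:
                seen.add(ayah)
                results.append(ayah)
    return results


def two_step_search(word):
    keywords = expand_keywords(word, SYNONYMS)
    return keywords, retrieve_ayahs(keywords, AYAH_INDEX)


keywords, ayahs = two_step_search("جنة")
print(keywords)  # ['جنة', 'فردوس', 'نعيم']
print(ayahs)     # ['ayah-1', 'ayah-2', 'ayah-3', 'ayah-4']
```

Keeping the keyword step separate means each word-level feature (derivations, fuzzy matching, related words) only has to produce keywords, while the ayah retrieval stays the same.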
Conception :: 2-steps search :: applications
Conception :: Word Search :: Word properties
Objective: allow users to locate ayahs based on the linguistic properties of words, such as POS, type, state, case, mood, voice, number, gender, person.
Method: fielded search.
A fielded search is an advanced query feature that lets users select the document fields to which they wish to limit the query, then use the required keywords within these fields.
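As an illustration of a fielded query over word properties, the sketch below filters a tiny list of annotated words. The records and field names are hypothetical annotations made up for this example; the real system relies on its indexed corpus and query parser.

```python
# Hypothetical word annotations (not taken from the real corpus).
WORDS = [
    {"word": "ملك", "root": "ملك", "type": "noun", "number": "singular"},
    {"word": "ملوك", "root": "ملك", "type": "noun", "number": "plural"},
    {"word": "يملك", "root": "ملك", "type": "verb", "number": "singular"},
    {"word": "كتاب", "root": "كتب", "type": "noun", "number": "singular"},
]


def fielded_search(records, **criteria):
    """Return the words whose fields equal every requested value."""
    return [
        record["word"]
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Equivalent to a query such as { root: ملك, type: noun, number: singular }.
print(fielded_search(WORDS, root="ملك", type="noun", number="singular"))  # ['ملك']
print(fielded_search(WORDS, root="ملك"))  # ['ملك', 'ملوك', 'يملك']
```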
Conception :: Word Search :: Semantically Related Words
Objective: offer the words related to a keyword entered by the user.
Algorithm:
• The user specifies:
  • the word;
  • the semantic relation: Synonymy, Antonymy, Hypernymy, Hyponymy, Meronymy, Holonymy, Troponymy.
• Query the ontology for related words.
• Use those keywords to retrieve the corresponding ayahs.
Conception :: Word Search :: Multi-level Derivations
Objective: get a set of words that share the same origin, such as a stem or a root.
Algorithm:
• The user specifies:
  • the keyword;
  • the level of derivation.
• Recover the origin of the word at the specified derivation level.
• Retrieve the whole set of words that share this origin.
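The algorithm above can be sketched with a small word index that stores, for each word, its origin at every derivation level. The index content is a hypothetical sample, and only two levels (lemma and root) are shown.

```python
# Hypothetical word index: each word with its origin at each derivation level.
WORD_INDEX = {
    "قال": {"lemma": "قال", "root": "قول"},
    "قالوا": {"lemma": "قال", "root": "قول"},
    "قلن": {"lemma": "قال", "root": "قول"},
    "قول": {"lemma": "قول", "root": "قول"},
    "كتاب": {"lemma": "كتاب", "root": "كتب"},
}


def derivations(keyword, level, index=WORD_INDEX):
    """Return every indexed word sharing the keyword's origin at this level."""
    if keyword not in index:
        return []
    origin = index[keyword][level]  # recover the origin of the keyword
    return [word for word, props in index.items() if props[level] == origin]


print(derivations("قالوا", "lemma"))  # ['قال', 'قالوا', 'قلن']
print(derivations("قالوا", "root"))   # ['قال', 'قالوا', 'قلن', 'قول']
```

Moving from lemma to root widens the result set, which matches the trade-off between stem and root described in the stemming phase.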
Conception :: Word Search :: Specific Derivations
Objective: find the words that result from applying a specific derivation operation to the word given by the user.
Algorithm:
• The user should:
  • enter the keyword;
  • specify which derivation to apply.
• Generate the set of derived words either by:
  • fetching them from the word index, or
  • using linguistic tools such as verb conjugators; the output is then filtered, as a second step, by intersection with the set of Quranic words.
• The resulting set is used to locate the corresponding ayahs.
Conception :: Word Search :: Fuzzy Search
Objective: search using the set of words that are nearly similar to the input word in writing or pronunciation.
Methods:
• Levenshtein distance (previously unknown text)
• N-grams
• Spell-checker
• Soundex (phonetic)
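Of the methods listed, the edit distance is the easiest to show in full. The sketch below is the textbook dynamic-programming Levenshtein distance, followed by a hypothetical helper that keeps the vocabulary words within a chosen distance; the threshold of 2 is an assumption, not a value from the thesis.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def fuzzy_candidates(word, vocabulary, max_distance=2):
    """Keep the vocabulary words within max_distance edits of the input."""
    return [w for w in vocabulary if levenshtein(word, w) <= max_distance]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("زنبجيل", "زنجبيل"))   # 2 (two swapped letters)
print(fuzzy_candidates("زنبجيل", ["زنجبيل", "كتاب"]))
```

Two swapped letters cost two substitutions here; a variant that counts a transposition as one edit (Damerau-Levenshtein) would rank such misorderings closer.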
Conception :: Word Search :: Fuzzy Search (2)
Arabic similarity specifications:
• مؤصدة and مءصدة
• الحمد and الحمد (differing in vocalization)
• عشر and عشر (differing in vocalization)
• يضلله and يضله
Examples:
• Misordered letters: زنبجيل for زنجبيل
• Phonetic similarity: هرم for ارم
• Spelling similarity: الضحي for الضحى
Open Source but WHY?
A number of advantages led us to open source. The following points are the most important of them [Web-Oss-watch]:
• Collaborative bug-fixing and fast detection of security vulnerabilities
  "Given enough eyeballs, all bugs are shallow" (an open source slogan)
• Customization
• Translation & localization
• Protection against development discontinuation
• Being part of a community
• Low cost
Used Technologies :: Python
Python is a powerful, widely used dynamic programming language.
Features:
• powerful and fast
• plays well with others
• runs everywhere
• friendly and easy to learn
• free and open
Used Technologies :: Whoosh API
Whoosh is a full-text indexing and searching library implemented in Python
Features:
• Pure Pythonic API
• Fielded indexing and search
• Fast indexing and retrieval
• Powerful query language
Useful for circumstances such as:
• Anywhere a pure-Python solution is desirable, to avoid having to build/compile native libraries
• As a research platform (Python is easier to read!)
• When the search features are more important to us than the raw speed
Implementation :: Previous Code Base
• Implemented in [Chelli&Dahmani2010]
• Licensed under GPL* (server applications issue)
• Based on the Whoosh indexing library
• Offering many search operations
• Results in HTML format
  • Raw format
• Can be used in Python
  • Requires writing wrappers for other languages
• A basic resource manager
  • Has a missing piece
Implementation :: Our improvements
The code base:
• has had 981 commits, representing 15,243 lines of code
• is mostly written in Python, with well-commented source code
• took an estimated 4 years of effort (COCOMO model)
Reference: Ohloh website.
Implementation :: Our improvements :: New Output System
• JSON-based: simpler and more extensible
• Centralized: changes are made in one and only one place
• Extended and extensible results structure
• Customizable search requests using flags
• Including a statistics calculation unit
• Offering meta-data for each request
Implementation :: Our improvements :: Multiple Search Units
• Translation-as-unit
• Word-as-unit
Implementation :: Our improvements :: Many new features
• Fuzzy search feature
• Retrieving the neighbors of each ayah
Implementation :: Our improvements :: Many new features (2)
• Manipulating different Quranic scripts
• More suggestion operations
• Showing the linguistic annotations
• Retrieving and showing transliterated keywords (Buckwalter)
Implementation :: Our improvements :: Resources Importing Manager
• Downloading the original resources (licensing issue)
• Parsing and importing the data into our intermediate database
• Indexing the database
• Updating auto-generated data files
Implementation :: Our improvements :: Packaging System
• Automating the API building
• Packaging into: source tarball, binary tarball, Python egg package, Debian deb package, Red Hat rpm package, Windows installer, Mac OS (perspective)
Implementation :: Our improvements :: More
• Coding standardization:
  • Following Python conventions (PEP8)
  • Using Pylint (a source code bug and quality checker)
• Documentation coverage:
  • Enriching the code with Readme files
• New console interface
Implementation :: Open Issues
• Implement modularity for the Query Parser: this is important to enable extensibility and to fix the problem of mixing the different operations performed during parsing.
• Restrict anonymous requests to the API: restricting requests protects the API from flooding, whether intended or not. This can be done by:
  • limiting the maximum number of simultaneous requests, globally and per IP;
  • implementing an identification system that works with remote clients.
• Move to the latest version of the Whoosh library: Whoosh is almost at version 3.X in its stable release, while we are still using an older version, 0.3. Moving to the latest version is highly recommended to benefit from the improvements made. However, it will not be an easy operation, since our API is intertwined with the older version, especially in the Query Parser.
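For the request-restriction issue listed above, one possible starting point is a per-IP limiter. Note the swap: the sketch below limits the number of requests per IP inside a sliding time window, which is simpler than limiting truly simultaneous requests, and it does not cover the global limit or client identification. The class name and parameters are assumptions.

```python
import time
from collections import defaultdict, deque


class RequestLimiter:
    """Allow at most max_requests per IP within window_seconds (sliding window)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                # injectable, which eases testing
        self._hits = defaultdict(deque)    # ip -> timestamps of recent requests

    def allow(self, ip):
        """Record the request and return True, or return False if over the limit."""
        now = self._clock()
        hits = self._hits[ip]
        # Forget the requests that fell out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Usage with a fake clock so the behaviour is deterministic.
fake_time = [0.0]
limiter = RequestLimiter(max_requests=2, window_seconds=10, clock=lambda: fake_time[0])
print(limiter.allow("10.0.0.1"), limiter.allow("10.0.0.1"), limiter.allow("10.0.0.1"))
# True True False
fake_time[0] = 10.0
print(limiter.allow("10.0.0.1"))  # True again once the window has passed
```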
Implementation :: Interfaces
Implementation :: Open Issues
• Complete the features implementation
• Enrich the linguistic resources
• Implement modularity for the Query Parser
• Restrict anonymous requests to the API
• Move to the mainstream version of the Whoosh library
• Maintain compatibility between Python versions
• Cover with documentation
• Optimize code and performance
Implementation :: Open Issues
Enriching the linguistic resources: the resources currently used are poor compared to what we really need.
• Integrate the Qurany project to enrich the current faceted thematic search.
• Integrate the boundary annotations to enable the retrieval of boundaries in the Quran.
• Propose a standard format for new linguistic and Quranic resources.
• Convert the binary database to text, to make it possible to log changes and benefit from revision control systems such as Git.
Implementation :: Open Issues
Complete the features implementation:
Feature | Implemented
Fielded search | YES
Logical relations | YES
Phrase search | YES
Interval search | YES
Full Regex | NO
Wildcards | PARTIALLY
Boosting keywords | YES
Pagination | YES
Scoring | YES
Sorting | YES
Keywords Highlight | YES
Uthmani full marks | YES
Real time output | NO
Results grouping | NO
Spell correction | PARTIALLY
Related keywords | PARTIALLY
Different vocalizations | YES
Collocated words | NO
Keyboard mapping | NO
Different significations | NO
Romanization | PARTIALLY
Partial vocalization | PARTIALLY
Multi-level derivation | YES
Syntactic Coloration | NO
Vocal Search | NO
Specific-derivations | NO
Linguistic annotations | PARTIALLY
Fuzzy string | PARTIALLY
Word properties | PARTIALLY
Linguistic examples | NO
Structural options | YES
Translation search | YES
Uthmani writing way | NO
Recitation marks | PARTIALLY
Divine Names Highlight | NO
Repetitions & Allegoricals | NO
Abrogators & Abrogated | NO
Qur’anic Parables | NO
Semantically related words | PARTIALLY
Faceted Thematic Search | PARTIALLY
Entity Extraction | NO
Questions Answering (QA) | NO
Automatic vocalization | NO
Co-reference resolution | NO
Vocalized word frequency | YES
Unvocalized word frequency | YES
Other Qur’anic units frequency | NO
Root/Stem/Lemma frequency | NO
Implementation :: Open Issues
• Move to Python 3.X: Python 2 is disappearing, and sooner or later it will be fully replaced. Many tools offer automatic scripts to convert code from 2 to 3, though a big part of the work often has to be done manually.
• Cover with documentation: documentation is important. It is expensive, but it encourages the community to get involved in the project. This can be done by:
  • enriching the readme files;
  • enriching the code with appropriate comments;
  • creating a usage How-To, strengthened with many demos;
  • writing the man page for the console interface.
• Optimize code and performance: keep fixing the warnings of the Pylint code analysis, and use Profile to check the performance of each search feature in order to improve it.
Implementation :: Interfaces :: API
Strong points:
1. Free, Libre, Open
2. A Python API
3. An established code base
4. Lots of features
Implementation :: Interfaces :: API :: Sample
Implementation :: Interfaces :: JSON web service
Implementation :: Interfaces :: Console
Examples of use
• As a desktop application
• As a web interface: www.alfanous.org
• As a smart phone app: iPhone, iPad, Windows Phone
Examples of use :: Alfanous.org
Remarkable features: localizable
Awarded: best-in-technicality website in the Algeria Web Awards 2012
Examples of use :: Alfanous.org (Responsive)
Remarkable features: user experience, responsiveness, simplicity
Awarded: chosen as the best website in the religious websites category of the Algeria Web Awards 2013
Examples of use :: iPhone Application
Developed by: iPhone-islam (Objective-C)
Remarkable features: runs on the iPhone and iPad series
Examples of use :: Windows Phone App
Developed by: Moumen bou Abdellah (C#)
Remarkable features: runs on Windows Phone
Examples of use :: Alfanous Desktop Interface
Remarkable features: offline use
Conferences
1. An Arabic paper in NITS 2011, KSA:
   Title: An Application Programming Interface for indexing and search in Noble Quran
   Authors: Assem Chelli, Merouane Dahmani, Amar Balla, Taha Zerrouki.
2. An English paper in a pre-conference workshop of LREC 2012, Turkey, "LRE-Rel: Language Resource and Evaluation for Religious Texts":
   Title: Advanced Search in Quran: Classification and Proposition of All Possible Features.
   Authors: Assem Chelli, Amar Balla, Taha Zerrouki.
Conclusion & Perspectives
We went through the implementation of many of the search features that we previously listed.
However, there are more improvements to be made and many issues to be resolved. We leave them as perspectives:
• Achieving an accurate statistics gathering system;
• Implementing a more adequate suggestion system;
• Clearing the way toward a semantic search engine;
• Completing the conception of all search features;
• Resolving all the open implementation issues.
THANK YOU FOR YOUR ATTENTION …
Any Questions?
Contacts:
• Email: [email protected]: @assem-ch
• Twitter: @assem_ch
Project Links:
• Website: www.alfanous.org
• User feedback: feedback.alfanous.org
• Source code: www.github.com/assem-ch/alfanous