Arabic Content with Apache Solr
![Page 1: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/1.jpg)
![Page 2: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/2.jpg)
Arabic Content with Apache Solr
Ramzi Alqrainy
![Page 3: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/3.jpg)
Ramzi Alqrainy
• MSc. in Computer Science, University of Jordan, Amman, Jordan
• Senior Enterprise Search / Data Engineer @ OpenSooq.com
• Technical Reviewer for the books “Scaling Apache Solr” and “Apache Solr Search Patterns”
• Co-founder of Solr.ar group
• Built 8 search engines for different models in the last 2 years
• Active blogger and presenter on Information Retrieval
![Page 4: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/4.jpg)
Agenda
• Why is Arabic Language Important ?
• Arabic Language is Complex
• How we use Apache Solr @ OpenSooq ?
• Localization Concept with SolrCloud
• Ranking and Relevancy
• Apache Solr Implementations @ OpenSooq
![Page 5: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/5.jpg)
Why is Arabic Language Important ?
![Page 6: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/6.jpg)
Why is Arabic Language Important ?
Sample Arabic document without dots
![Page 7: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/7.jpg)
Why is Arabic Language Important ?
Sample Arabic document with dots
![Page 8: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/8.jpg)
Why is Arabic Language Important ?
• Arabic ranks as the fourth most-used language on the web
• The number of Arab Internet users grew from 65 million in 2011 to 135 million in 2013
![Page 9: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/9.jpg)
Arabic Language is Complex
• Arabic Orthography and Print
§ Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.
• Arabic Diacritics
§ Diacritics help disambiguate the meaning of words.
§ For example, the two words Alam (عَلَم, meaning “flag”) and Eilm (عِلم, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
![Page 10: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/10.jpg)
Arabic Language is Complex
• Arabic Morphology
§ Arabic words are divided into three main types: nouns, verbs, and particles.
§ Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
![Page 11: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/11.jpg)
Arabic Language is Complex
• Arabic Dialects
§ There are 6 dominant dialects, with many more variations of them, and dozens more less-spoken dialects.
§ E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.
• Arabizi (Transliteration)
§ Arabic is sometimes written using Latin characters in transliterated form.
§ Arabizi uses numerals to represent Arabic letters.
§ E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E) (which is a guttural “aa”), respectively.
![Page 12: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/12.jpg)
How we use Apache Solr @ OpenSooq ?
• A leading classified ads website in the Middle East and North Africa.
• Right now: more than 7K concurrent users on average.
• Activity-Per-Second: 240 APS.
§ Adding/Editing/Deleting Post
§ Adding Comments
§ Sending Message to Buyer/Seller, etc.
• More than 40k hits on Apache Solr per minute.
![Page 13: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/13.jpg)
How we use Apache Solr @ OpenSooq ?
• Arabic Search Engine
![Page 14: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/14.jpg)
Arabic Normalization
• Some common spelling mistakes are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study” as a command) is commonly written as أدرس.
• Arabic content is normalized according to the following steps:
§ Remove punctuation.
§ Remove diacritics (primarily weak vowels).
§ Remove non-letters.
§ Replace آ, إ, and أ with ا (A, alef) when it is the first letter of a word.
§ Replace final ى with ي (Ya).
§ Replace final ة with ه (Ha).
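The normalization steps listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used at OpenSooq: the function names `normalize_word` and `normalize` are hypothetical, the diacritic range U+064B to U+0652 is an assumption about which marks count as diacritics, and "non-letters" is interpreted here as anything for which `str.isalpha()` is false.

```python
import re

# Code points used by the normalization rules (written as escapes to keep
# the source readable in left-to-right editors).
ALEF = "\u0627"                                # ا bare alef
ALEF_VARIANTS = {"\u0622", "\u0623", "\u0625"}  # آ أ إ
ALEF_MAKSURA = "\u0649"                        # ى
YEH = "\u064A"                                 # ي
TEH_MARBUTA = "\u0629"                         # ة
HEH = "\u0647"                                 # ه

# Assumption: diacritics are the marks from fathatan to sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize_word(word: str) -> str:
    """Apply the slide's per-word normalization rules to a single token."""
    word = DIACRITICS.sub("", word)                    # remove diacritics
    word = "".join(ch for ch in word if ch.isalpha())  # remove punctuation / non-letters
    if not word:
        return word
    if word[0] in ALEF_VARIANTS:                       # initial alef variants -> bare alef
        word = ALEF + word[1:]
    if word[-1] == ALEF_MAKSURA:                       # final alef maksura -> yeh
        word = word[:-1] + YEH
    elif word[-1] == TEH_MARBUTA:                      # final teh marbuta -> heh
        word = word[:-1] + HEH
    return word


def normalize(text: str) -> str:
    """Normalize whitespace-separated text, dropping tokens that become empty."""
    words = (normalize_word(w) for w in text.split())
    return " ".join(w for w in words if w)
```

With this sketch, the two spellings of "study" from the slide collapse to the same token, so a query typed either way matches the same indexed form.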
![Page 15: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/15.jpg)
Arabic Light Stemmer
• A light stemmer is not dictionary driven.
• The algorithm is rule based: it strips common affixes (prefixes such as the definite article, and a short list of suffixes) without looking words up.
![Page 16: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/16.jpg)
Arabic Light Stemmer
• The light stemmer, light10, outperformed the other approaches. It is becoming widely used in Arabic information retrieval.
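To make the idea of rule-based light stemming concrete, here is a simplified Python sketch in the spirit of light10. It is not the reference implementation and not Solr's `ArabicStemFilterFactory`: the function name `light_stem` is hypothetical, and the affix lists and minimum-length guards below are assumptions based on how light10 is commonly described.

```python
# Assumed affix lists, loosely following common descriptions of light10.
WAW = "و"                                       # conjunction "and"
ARTICLE_PREFIXES = ["بال", "كال", "فال", "ال", "لل"]  # longest candidates first
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word: str) -> str:
    """Strip a leading conjunction, one article prefix, and listed suffixes."""
    # Remove a leading waw only if at least 3 letters remain.
    if word.startswith(WAW) and len(word) - 1 >= 3:
        word = word[1:]
    # Remove at most one definite-article prefix if at least 2 letters remain.
    for prefix in ARTICLE_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            word = word[len(prefix):]
            break
    # Walk the suffix list once, removing each match if at least 2 letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: -len(suffix)]
    return word
```

No dictionary is consulted at any point, which is what makes the stemmer "light": it is fast and predictable, at the cost of occasionally stripping letters that belong to the word itself.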
![Page 17: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/17.jpg)
Arabic Light Stemmer
• Sometimes a stemmer might not do what you want out of the box.
• A protected-words list (Solr's KeywordMarkerFilterFactory) protects words from being modified by stemmers.
Stop words and Synonyms
• Removing stop words is important to ensure high performance and improve recall.
https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt
• Matching strings of tokens and replacing them with other strings of tokens will improve precision and recall.
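The three ideas on this slide (protected words, stop words, synonyms) can be shown together in a small Python sketch. Everything here is illustrative: the `analyze` function and the word lists are hypothetical, the stop-word set is a tiny hand-picked subset rather than the linked stopwords-ar.txt file, and in Solr these steps would be configured as analysis filters rather than written by hand.

```python
# Illustrative data only; real deployments load these lists from files.
STOP_WORDS = {"في", "من", "على"}        # "in", "from", "on"
SYNONYMS = {"عربية": "سيارة"}           # dialect word for "car" -> standard form
PROTECTED = {"الأردن"}                  # "Jordan": must not lose its leading "ال"


def analyze(tokens, stem=lambda t: t):
    """Drop stop words, map synonyms, then stem every non-protected token."""
    out = []
    for token in tokens:
        if token in STOP_WORDS:
            continue                          # stop-word removal
        token = SYNONYMS.get(token, token)    # single-token synonym replacement
        if token not in PROTECTED:
            token = stem(token)               # protected words bypass the stemmer
        out.append(token)
    return out


def strip_article(token):
    """Toy stemmer for the example: remove a leading definite article."""
    return token[2:] if token.startswith("ال") and len(token) > 3 else token
```

For example, `analyze("شقة في الأردن".split(), stem=strip_article)` keeps the protected place name intact while dropping the stop word, whereas an unprotected word such as البيت would lose its article.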
![Page 18: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/18.jpg)
Apache Solr Schema.xml
• A text field that is appropriate for Arabic
![Page 19: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/19.jpg)
Localization Concept with SolrCloud
![Page 20: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/20.jpg)
Ranking and Relevancy: Boost documents by age
• Just do a descending sort by age = done?
• Boost more recent documents and penalize older documents just for being old
• Recency Boosting
bf=recip(ms(NOW,post_inserted_date),3.16e-11,0.08,0.05)^5
![Page 21: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/21.jpg)
Tune Solr Recip Function
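Solr's `recip(x,m,a,b)` function computes `a / (m*x + b)`. With `m = 3.16e-11`, which is roughly one divided by the number of milliseconds in a year, `m*x` is the document age in years. The Python sketch below reproduces that arithmetic so the effect of `a` and `b` can be explored offline; the helper name `recency_boost` and the sample ages are assumptions for illustration, and the final `^5` multiplier from the boost function is not modeled.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.156e10 milliseconds


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms: float, m: float = 3.16e-11,
                  a: float = 0.08, b: float = 0.05) -> float:
    """Boost value for a document that is age_ms milliseconds old."""
    return recip(age_ms, m, a, b)


if __name__ == "__main__":
    # Print how the boost decays as a post gets older.
    for years in (0, 0.25, 1, 5):
        boost = recency_boost(years * MS_PER_YEAR)
        print(f"age = {years:>4} years -> boost = {boost:.4f}")
```

A brand-new post gets `a / b`, so lowering `b` sharpens the advantage of fresh posts, while raising `a` scales the whole curve.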
![Page 22: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/22.jpg)
Solr Implementations @ OpenSooq ?
§ Anti Spam
§ Checking Relevancy
§ Tag Generation
§ Recommendation System
![Page 23: Arabic Content with Apache Solr](https://reader034.fdocuments.us/reader034/viewer/2022042508/558a8dced8b42a817a8b4630/html5/thumbnails/23.jpg)
Thank You
@RamziAlqrainy
https://github.com/Ramzi-Alqrainy
http://solr-enterprise-search-server.blogspot.com/