Knowledge Center for Processing Hebrew
![Page 1: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/1.jpg)
Knowledge Center for Processing Hebrew
Alon Itai – CS Technion
![Page 2: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/2.jpg)
Tools for underrepresented languages
Computer tools and especially the Internet are Anglophile.
Search engines are not tooled for morphologically rich languages.
![Page 3: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/3.jpg)
Search “dog” “dogs” “and dogs”
[Figure: search-engine result snippets for the Hebrew forms כלב ("dog"), כלבים ("dogs"), and וכלב ("and dog"), including the Hebrew Wikipedia entry on the dog, pet adoption boards, and dog-training sites.]
![Page 4: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/4.jpg)
Tools for underrepresented languages.
Computer tools and especially the Internet are Anglophile.
Search engines are not tooled for morphologically rich languages.
Email and chat do not cope well with strange alphabets, so people use (pidgin) English for communication, …
The local language is used less and less.
![Page 5: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/5.jpg)
The problem
Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools.
Even when tools are available, they are not open source.
Tools developed at universities are not fit for general use:
•not robust enough
•no standard interface
•lack of documentation
![Page 6: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/6.jpg)
Duplication of Effort
Every researcher has to redevelop her own tools before conducting original research.
For example, in Hebrew there are many morphological analyzers:
1. Choueka and Shapira 1964
2. Ornan 1987, Lavie et al. 1988
3. Bentur et al. 1992
4. Segal 1999
5. HSPELL
6. Yona and Wintner 2005
![Page 7: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/7.jpg)
The Knowledge Center
In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew.
Its aim is to develop products (software and databases) for processing Hebrew and to make them available to the public, both in academia and in industry.
Researchers from four universities are involved in the Center's activities.
![Page 8: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/8.jpg)
The researchers
Yoad Winter (Technion)
Shuly Wintner (Haifa University)
Michael Elhadad (Ben Gurion University)
Arnon Cohen (Ben Gurion University)
Yoram Singer (Hebrew University)
Eli Shamir (Hebrew University)
Alon Itai (Technion)
![Page 9: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/9.jpg)
The model
The ministry provides initial funds.
The Center should be self-sustainable: it should finance itself by selling products.
The problems:
•The market is too small; had it been large, there would have been no need for the Center.
•Selling products contradicts our philosophy of open research and open code.
![Page 10: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/10.jpg)
Licensing Policy
Available under the GPL (GNU General Public License): you get it for free if all products derived from it are also under the GPL.
Payments only for special services.
A non-exclusive license can be obtained for commercial use.
![Page 11: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/11.jpg)
XML
EXAMPLE
<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בֹּקֶר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal" transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>
All products are represented by XML.
•Readable both by machines and by humans
•Enables using off-the-shelf tools for on-screen presentation and validation
The example item above holds information for the morphological parser.
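As a rough illustration of the "readable by machines" point, the lexicon item from the slide can be read with Python's standard `xml.etree.ElementTree`. This is only a sketch: the function name `summarize_item` is hypothetical, the `dotted` attribute is left out of the sample, and reading `<replace>` as the stored plural form is an assumption about the schema, not something the slide states.

```python
import xml.etree.ElementTree as ET

# The lexicon item from the slide (without the dotted form).
ITEM_XML = """
<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal"
             transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>
"""

def summarize_item(xml_text):
    """Extract a few fields from one lexicon <item> (hypothetical helper)."""
    item = ET.fromstring(xml_text)
    noun = item.find("noun")
    # Assumption: <replace number="plural"> carries the plural form of the noun.
    plural = noun.find("replace[@number='plural']")
    return {
        "id": item.get("id"),
        "singular": item.get("transliterated"),
        "gender": noun.get("gender"),
        "plural": plural.get("transliterated") if plural is not None else None,
    }

print(summarize_item(ITEM_XML))
```

The same file could equally be checked with any generic XML validator or rendered with a stylesheet, which is the off-the-shelf benefit the slide refers to.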
![Page 12: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/12.jpg)
XML (2)
Facilitates interface between tools:
For example, the output of the morphological analyzer is the input for the morphological disambiguator.
Thus one can match different morphological analyzers with different disambiguators and compare their results
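The mix-and-match idea can be sketched as follows. Everything here is a toy stand-in: the `<token>`/`<analysis>` element names, the two "analyzers", and the two "disambiguators" are invented for illustration and are not the Center's actual schema or tools. The only point is that any component pair works together once they agree on one XML format.

```python
import xml.etree.ElementTree as ET
from itertools import product

def to_xml(surface, tags):
    """Serialize candidate analyses in a shared (hypothetical) XML format."""
    token = ET.Element("token", surface=surface)
    for tag in tags:
        ET.SubElement(token, "analysis", pos=tag)
    return ET.tostring(token, encoding="unicode")

# Two toy analyzers: each returns XML listing candidate parts of speech.
def analyzer_a(surface):
    return to_xml(surface, ["noun", "verb"])

def analyzer_b(surface):
    return to_xml(surface, ["verb", "noun", "adjective"])

# Two toy disambiguators: each reads the XML and picks one analysis.
def first_choice(xml_text):
    return ET.fromstring(xml_text).find("analysis").get("pos")

def prefer_noun(xml_text):
    tags = [a.get("pos") for a in ET.fromstring(xml_text).iter("analysis")]
    return "noun" if "noun" in tags else tags[0]

def compare(surface, analyzers, disambiguators):
    """Run every analyzer/disambiguator pairing on one word."""
    return {
        (a.__name__, d.__name__): d(a(surface))
        for a, d in product(analyzers, disambiguators)
    }

results = compare("šbr", [analyzer_a, analyzer_b], [first_choice, prefer_noun])
for pair, choice in results.items():
    print(pair, choice)
```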
![Page 13: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/13.jpg)
Products
Morphological analyzers
Morphological disambiguators
Lexicon
Corpora
Speech database
Tools for editing lexicons and tagging corpora
PR: forum, …
![Page 14: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/14.jpg)
The lexicon by part of speech
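| Part of speech | Entries | Part of speech | Entries |
|---|---|---|---|
| noun | 10,332 | preposition | 100 |
| verb | 4,485 | conjunction | 62 |
| proper name | 4,227 | pronoun | 60 |
| adjective | 1,612 | interjection | 40 |
| adverb | 352 | interrogative | 9 |
| quantifier | 132 | negation | 6 |

Total: 21,417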
![Page 15: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/15.jpg)
GUI for editing the lexicon
![Page 16: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/16.jpg)
![Page 17: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/17.jpg)
Morphological disambiguators
Roy Bar-Haim constructed an HMM-based parser which partitions each word in a corpus into morphemes; success rate 96%.
Erel Segal combined a Brill-like method with a priori occurrence probabilities.
Meni Adler used an HMM on whole words.
All three disambiguators are available at the Center.
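To make the HMM idea concrete, here is a generic textbook Viterbi decoder that picks one tag per word from the candidates an analyzer proposes. It is not the code of any of the three disambiguators above; the tag names, the two-word example, and every probability are invented for illustration.

```python
import math

def viterbi(candidates, trans, emit, start="<s>", floor=1e-6):
    """Return the most probable tag sequence.

    candidates: list of (word, [possible tags]) pairs
    trans[(prev_tag, tag)]: transition probability
    emit[(word, tag)]: probability of the word given the tag
    Unseen events get the small probability `floor`.
    """
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for word, tags in candidates:
        nxt = {}
        for tag in tags:
            e = math.log(emit.get((word, tag), floor))
            score, path = max(
                (
                    (s + math.log(trans.get((prev, tag), floor)) + e, p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda pair: pair[0],
            )
            nxt[tag] = (score, path + [tag])
        best = nxt
    return max(best.values(), key=lambda pair: pair[0])[1]

# Invented toy model: "hyld" (the boy) is a noun; "šbr" may be a verb or a noun.
trans = {("<s>", "NOUN"): 0.6, ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3}
emit = {("hyld", "NOUN"): 1.0, ("šbr", "VERB"): 0.5, ("šbr", "NOUN"): 0.5}
sentence = [("hyld", ["NOUN"]), ("šbr", ["VERB", "NOUN"])]
print(viterbi(sentence, trans, emit))
```

With these made-up numbers the verb reading of "šbr" wins because a verb is more likely than a noun after a noun; a real model would estimate such probabilities from a tagged corpus.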
![Page 18: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/18.jpg)
Corpora
| Corpus | Total tokens | Unique tokens |
|---|---|---|
| | 11,062,232 | 319,666 |
| Arutz 7 | 11,216,867 | 304,160 |
| Sha’ar la-matkhil (dotted) | 1,300,326 | 166,780 |
| Knesset | 17,732,122 | 262,338 |
![Page 19: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/19.jpg)
Corpora (2)
A manually tagged corpus of 6000 sentences (12,000 tokens).
![Page 20: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/20.jpg)
Tree bank
6000 syntactically parsed sentences. Used for automatic parsing.
![Page 21: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/21.jpg)
Conclusions
The Center is an example of cooperation between researchers in several universities.
Many users have downloaded the products.
10 companies have purchased licenses.
![Page 22: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/22.jpg)
Conclusions (2)
Money is running out, …
The model requires money, experts, and commitment.
Not suitable for languages with very few speakers, or for poor communities.
![Page 23: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/23.jpg)
Modern Hebrew
Official language of the State of Israel
Spoken by 7 M people
Related to, but linguistically distinct from, Biblical Hebrew
Morphologically rich
![Page 24: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/24.jpg)
Semitic Word Formation
root + pattern → word

| root | pattern CaCaC | pattern yiCCoC |
|---|---|---|
| ktb | katab (he wrote) | yiktob (he will write) |
| šbr | šabar (he broke) | yišbor (he will break) |
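The root-and-pattern combination can be sketched in a few lines: each `C` slot in the pattern is filled, in order, by the next consonant of the root. The function name `apply_pattern` is hypothetical, and the sketch deliberately ignores the phonological adjustments that real Hebrew morphology requires.

```python
import unicodedata

def apply_pattern(root, pattern):
    """Fill the C slots of `pattern` with the consonants of `root`, in order."""
    # Normalize so that letters such as š count as a single character.
    root = unicodedata.normalize("NFC", root)
    if pattern.count("C") != len(root):
        raise ValueError("pattern needs exactly one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

for root in ("ktb", "šbr"):
    for pattern in ("CaCaC", "yiCCoC"):
        print(root, pattern, apply_pattern(root, pattern))
```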
![Page 25: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/25.jpg)
Writing System
Most vowels are omitted.
Particles are prepended to words.
Example:
h – definite article
b – preposition (in)
w – conjunction (and)
wbbyt = w + b + ha + byt
and in the house
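A minimal sketch of undoing such prefixation: strip single-letter particles from the front of a written word until the remainder is a known stem. The prefix table and the one-word lexicon are toy assumptions, the function name `segment` is hypothetical, and the sketch neither enforces the real ordering rules for Hebrew particles nor recovers the unwritten article (the "ha" in the example above).

```python
# Toy data, for illustration only.
PREFIXES = {"w": "and", "b": "in", "h": "the"}
LEXICON = {"byt": "house"}

def segment(word, prefixes=PREFIXES, lexicon=LEXICON):
    """Return every split of `word` into prefix particles plus a lexicon stem."""
    results = []
    if word in lexicon:
        results.append([word])
    if word and word[0] in prefixes:
        # Peel off one particle and segment the rest recursively.
        for rest in segment(word[1:], prefixes, lexicon):
            results.append([word[0]] + rest)
    return results

print(segment("wbbyt"))
print(segment("hbyt"))
```

With a realistic lexicon the same word often yields several splits, so a function like this returns a list of candidates rather than a single answer.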
![Page 26: Knowledge Center for Processing Hebrew](https://reader035.fdocuments.us/reader035/viewer/2022062409/56814e41550346895dbbafdc/html5/thumbnails/26.jpg)
Morphological Ambiguity
Most words are morphologically ambiguous.
Example: šbth שבתה
1. šavta = šbt + CaCCa = stopped working
2. šavta = šbh + CaCCa = took prisoner
3. šabatah = her Saturday
4. še-b-te = that in tea
5. še-b-ha-te = that in the tea
6. še-bit-h = that her daughter
…