Programming the Arabic Treebank

Otakar Smrž
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague

Dublin City University
April 18, 2008


Outline

1 Methodology
    Functional Morphology
    Dependency Syntax
    Tectogrammatics
2 Software
    TrEd Environment
    ElixirFM
    Encode Arabic
3 References


Prague Arabic Dependency Treebank

PADT is a project of linguistic annotation of Modern Written Arabic based on the theory of Functional Generative Description.

PADT consists mainly of the morphological and analytical levels of description. The annotation of tectogrammatics and information structure is being established.

PADT 1.0 was published in 2004 and has been used by dozens of academic and commercial institutions.

PADT 2.0 is due in 2008 and will cover over one million words of text. It merges the original Prague Arabic Dependency Treebank annotations with a converted and enhanced Penn Arabic Treebank.


Expected PADT 2.0

Data Set     Corpus     Functional Morphology         Dependency Syntax        Tectogrammatics
            ‘words’     tokens   paras   docs     tokens   paras   docs    tokens  paras  docs
Prague
  AEP         99360     116717    3006    327     116717    3006    327      9690    242    29
  EAT         48371      55097    1667    207      55097    1667    207     13934    436    58
  ASB         16815      20145     663     44       6527     273     17
  NHR         21445      25329     426     34      12613     209     17
  HYT         85683     100537    1782    204      41855     796     91      5228    106    10
  XIN         61500      71548    2389    321      41716    1429    196      2042     75    13
Penn
  1v3        141515     161217    4790    628     161217    4790    628
  2v2        140821     163973    2929    476     163973    2929    476
  3v2        335250     394466   12445    589     394466   12445    589
  4v1        161665     192976    6176    397

Prague       333174     389373    9933   1137     274525    7380    855     30894    859   110
Penn         779251     912632   26340   2090     719656   20164   1693
PADT 2.0    1112425    1302005   36273   3227     994181   27544   2548     30894    859   110

The numbers may still change before the official release.
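The subtotals in the Expected PADT 2.0 table can be cross-checked with a few lines of Python. This is only a sketch: the ‘words’ and Functional Morphology token counts are copied from the table, while the dictionary and function names are made up for illustration.

```python
# Corpus 'words' and Functional Morphology token counts per data set,
# copied from the Expected PADT 2.0 table as (words, tokens) pairs.
prague = {
    "AEP": (99360, 116717),
    "EAT": (48371, 55097),
    "ASB": (16815, 20145),
    "NHR": (21445, 25329),
    "HYT": (85683, 100537),
    "XIN": (61500, 71548),
}
penn = {
    "1v3": (141515, 161217),
    "2v2": (140821, 163973),
    "3v2": (335250, 394466),
    "4v1": (161665, 192976),
}


def totals(data_sets):
    """Sum the (words, tokens) pairs of a group of data sets."""
    words = sum(w for w, _ in data_sets.values())
    tokens = sum(t for _, t in data_sets.values())
    return words, tokens


prague_total = totals(prague)
penn_total = totals(penn)
padt_total = (prague_total[0] + penn_total[0], prague_total[1] + penn_total[1])

print(prague_total)  # (333174, 389373)
print(penn_total)    # (779251, 912632)
print(padt_total)    # (1112425, 1302005)
```

The three printed pairs agree with the Prague, Penn, and PADT 2.0 rows of the table.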


Morphology Disambiguation

Arabic is a language of rich morphology, both derivational and inflectional, with highly ambiguous orthography.

Boundaries of syntactic units, the tokens, are obscure in writing: orthographical words, the strings, consist of up to four lexemes.

Disambiguation encompasses subproblems like tokenization, full morphological tagging or its simplified ‘part-of-speech’ versions, lemmatization, diacritization or restoration of the structural components of words, plus combinations thereof.


He will notify them about that through text messages . . .

سيخبرهم بذلك عن طريق الرسائل القصيرة . . .

String     Token Tag   Buckwalter Morph Tags              Token Form      Token Gloss
سيخبرهم    F---------  FUT                                sa-             will
           VIIA-3MS--  IV3MS+IV+IVSUFF_MOOD:I             yu-ḫbir-u       he-notify
           S----3MP4-  IVSUFF_DO:3MP                      -hum            them
بذلك       P---------  PREP                               bi-             about/by
           SD----MS2-  DEM_PRON_MS                        ḏālika          that
عن         P---------  PREP                               ʿan             by/about
طريق       N-----MS2R  NOUN+CASE_DEF_GEN                  ṭarīq-i         way-of
الرسائل    N-----FP2D  DET+NOUN+CASE_DEF_GEN              ar-rasāʾil-i    the-letters
القصيرة    A-----FS2D  DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN  al-qaṣīr-at-i   the-short
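The ten-character token tags are positional: each position holds one category and a hyphen marks an unfilled one. A minimal Python sketch of decoding them follows. The meaning of each position is an assumption inferred from the examples on the slides (for instance VIIA-3MS-- and N-----FP2D), and the field names, including the placeholder name for the fifth position, are illustrative rather than official.

```python
# Assumed layout of the ten tag positions, inferred from the example tags;
# position 5 is empty in every example, so it only gets a placeholder name.
FIELDS = ("pos", "subpos", "mood", "voice", "slot5",
          "person", "gender", "number", "case", "state")


def decode_tag(tag):
    """Return the filled positions of a 10-character positional tag."""
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(FIELDS, tag) if value != "-"}


print(decode_tag("VIIA-3MS--"))
print(decode_tag("N-----FP2D"))
```

Under these assumptions the verb tag yields a part of speech, mood, voice, person, gender, and number, while the noun tag yields gender, number, case, and state.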

Functional Arabic Morphology

Many computational models of Arabic morphology are lexical in nature. As they are not designed in connection with any syntax–morphology interface, their interpretation is simply incremental.

Functional Arabic Morphology endorses inferential–realizational views. It re-establishes the system of inflectional and inherent morphosyntactic properties. It distinguishes various senses in which the properties are referred to in the grammar.

Definition of lexemes includes the derivational root and pattern information if appropriate. Modeling of the written language as well as spoken dialects is expected to be methodologically identical.


MorphoTrees

Suppose you can list morphological analyses for a given input string . . .


. . . organize the analyses into a hierarchy with the string as its root and the full tokens as the leaves, grouped by their lemmas, canonical forms, and partitionings of the string into such forms:

AlY الى
    |lY آلى
        |lY آلى
            ʾālā
    |ly آلي
        |ly آلي
            ʾālīy
    |l y آل ي
        |l آل
            ʾāl
        y ي
            ʾanā
    IlY إلى
        IlY إلى
            ʾilā
    Ily y إلي ي
        Ily إلي
            ʾilā
        y ي
            ʾanā
    Oly ألي
        Oly ألي
            waliy
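The grouping into partitionings, canonical forms, and lemmas can be sketched in Python as a nested dictionary. This is an illustration of the idea only, not the MorphoTrees implementation in TrEd: the flat list of analyses is read off the figure for the string AlY, the full tokens at the leaves are omitted because the figure does not list them, and all names are hypothetical.

```python
# Each analysis of the string AlY: (partitioning into forms, form, lemma).
# Forms are in Buckwalter notation, lemmas in the transliteration of the figure.
ANALYSES = [
    (("|lY",), "|lY", "ʾālā"),
    (("|ly",), "|ly", "ʾālīy"),
    (("|l", "y"), "|l", "ʾāl"),
    (("|l", "y"), "y", "ʾanā"),
    (("IlY",), "IlY", "ʾilā"),
    (("Ily", "y"), "Ily", "ʾilā"),
    (("Ily", "y"), "y", "ʾanā"),
    (("Oly",), "Oly", "waliy"),
]


def build_morphotree(string, analyses):
    """Group analyses as string -> partitioning -> form -> list of lemmas."""
    partitions = {}
    for partition, form, lemma in analyses:
        forms = partitions.setdefault(" ".join(partition), {})
        lemmas = forms.setdefault(form, [])
        if lemma not in lemmas:
            lemmas.append(lemma)
    return {string: partitions}


def show(tree, depth=0):
    """Print the hierarchy with one indentation step per level."""
    for label, subtree in tree.items():
        print("    " * depth + label)
        if isinstance(subtree, dict):
            show(subtree, depth + 1)
        else:
            for lemma in subtree:
                print("    " * (depth + 1) + lemma)


tree = build_morphotree("AlY", ANALYSES)
show(tree)
```

Dictionaries keep insertion order, so the printed hierarchy follows the order in which the analyses are listed.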

Multi-Modal Annotation

fhm فهم
    fhm فهم
        fhm فهم
            fahim     to understand
            fahm      understanding
            fahham    to make understand
    f hm ف هم
        f ف
            fa        and, so
        hm هم
            hām       to roam, wander
            hamm      to be on one’s mind
            hamm      concern, interest
            hum       they

Tokens shown in the figure:

C---------    fa
VC---2MS--    him
VP-A-3MS--    hamm-a
VP-P-3MS--    humm-a
VC---2MS--    hamm-i
N------S1R    hamm-u
N------S4R    hamm-a
N------S2R    hamm-i
N------S1I    hamm-un
N------S2I    hamm-in
S----3MP1-    hum

Tag patterns shown in the figure: C---------, ----------, --------1-


Dependency Syntax

. . . by providing the basic necessities of life to its people, including medical care

بتوفير ضروريات الحياة الأساسية لشعبها ومن بينها الرعاية الطبية . . .

bi-tawfīri    ḍarūrīyāti      al-ḥayāti  al-ʾasāsīyati  li-šaʿbi-hā
by-giving-of  necessities-of  the-life   the-basic      to-people-of-it

wa-min    bayni-hā         ar-riʿāyatu  aṭ-ṭibbīyatu
and-from  between-of-them  the-care     the-medical

Dependency trees capture distinct types of relations: immediate dominance, linear precedence, and coordination membership.


In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

وفي ملف الأدب طرحت المجلة قضية اللغة العربية والأخطار التي تهددها.

Analytical functions in the tree, as they appear in the figure: AuxS, AuxY, AuxP, Adv, Atr, Pred, Sb, Obj, Atr, Atr, Coord, Atr, AuxY, Atr, Obj

Token Form       Token Gloss           Token Tag
wa-              and                   C---------
fī               in                    P---------
milaffi          collection/file-of    N------S2R
al-ʾadabi        the-literature        N------S2D
ṭaraḥat          it-presented          VP-A-3FS--
al-maǧallatu     the-magazine          N------S1D
qaḍīyata         issue-of              N------S4R
al-luġati        the-language          N------S2D
al-ʿarabīyati    the-Arabic            A-----FS2D
wa-              and                   C---------
al-ʾaḫṭāri       the-dangers           N------P2D
allatī           that                  SR----FS2-
tuhaddidu        they-threaten         VIIA-3FS--
-hā              it                    S----3FS4-
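The three relation types named for dependency trees (immediate dominance, linear precedence, coordination membership) can be made concrete with a small Python sketch. Everything structural here is an assumption: the fragment uses six tokens of the example sentence, the parent links and membership flags are one plausible reading of the figure rather than the annotated PADT tree, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    order: int               # linear precedence: position in the sentence
    form: str
    afun: str                # analytical function
    parent: int              # immediate dominance: order of the governor, 0 = root
    is_member: bool = False  # coordination membership


# Hypothetical fragment: the conjunction wa- coordinates two attributes of qaḍīyata.
NODES = [
    Node(5, "ṭaraḥat", "Pred", 0),
    Node(6, "al-maǧallatu", "Sb", 5),
    Node(7, "qaḍīyata", "Obj", 5),
    Node(8, "al-luġati", "Atr", 10, is_member=True),
    Node(10, "wa-", "Coord", 7),
    Node(11, "al-ʾaḫṭāri", "Atr", 10, is_member=True),
]


def effective_dependents(head_order, nodes):
    """Dependents of a node in sentence order, looking through Coord nodes."""
    result = []
    for node in sorted(nodes, key=lambda n: n.order):
        if node.parent != head_order:
            continue
        if node.afun == "Coord":
            # A coordination node stands in for its members.
            result.extend(m for m in effective_dependents(node.order, nodes)
                          if m.is_member)
        else:
            result.append(node)
    return result


print([n.form for n in effective_dependents(7, NODES)])
# ['al-luġati', 'al-ʾaḫṭāri']
```

The design point is that the conjunction governs its members in the tree, while the membership flag lets a query recover that both conjuncts depend on the noun above the conjunction.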


Tectogrammatics

Describes linguistic meaning in its semantic and pragmatic aspects. Restores deep syntax and marks information structure.

Functors in the tree, as they appear in the figure: SENT, LOC, PAT, PRED, ACT, ADDR, PAT, ID, RSTR, CONJ, ID, RSTR, ACT, PAT

milaff      collection     Masc.Sing.Def    B
ʾadab       literature     Masc.Sing.Def    C
ṭaraḥ       to present     Ind.Ant.Act      B
maǧallah    magazine       Fem.Sing.Def     B
huwa        someone        GenPronoun       B
qaḍīyah     issue          Fem.Sing.Def     N
luġah       language       Fem.Sing.Def     N
ʿarabīy     Arabic         Adjective        N
wa-         and            Coordination
ḫaṭar       danger         Masc.Plur.Def    N
haddad      to threaten    Ind.Sim.Act      N
hiya        it             PersPronoun      B
hiya        it             PersPronoun      B


TrEd Environment

The tree editor TrEd is designed and implemented by Petr Pajas. Its distributed version ntred is co-authored by Zdeněk Žabokrtský.

http://ufal.mff.cuni.cz/~pajas/tred/

Numerous annotation contexts and macro-style extensions have been contributed by many other project developers.

Examples of our work include the MorphoTrees context, the ConDep conversion templates, or miscellaneous scripts for error detection and consistency checking.


'SBAR' => [ [
    # two alternative child patterns, each a list of [tag, form regex] pairs
    [ ["P---------", ".+"], ["C---------", "Oan~a"], undef ],
    [ ["P---------", "baEoda|qabola"], ["C---------", "Oan"], undef ],
    sub { my ($root, undef, @child) = @_;
          # convert every child subtree first
          @child = map { ConDep($_) } @child;
          # hang the conjunction under the preposition
          PasteNode($child[1], $child[0]);
          # hang the remaining children under the conjunction
          PasteNode($_, $child[1]) foreach @child[2 .. @child - 1];
          $child[0]->{'afun'} = "AuxP";
          $child[1]->{'afun'} = "AuxC";
          # the preposition heads the converted subtree
          return $child[0] }
], ... ]

ElixirFM

ElixirFM is a high-level implementation of Functional Arabic Morphology. It reuses the Functional Morphology library for Haskell and extends it.

Morphology is modeled in terms of abstract patterns, paradigms, grammatical categories, lexemes, and word classes. The computation involved in analysis or generation is conceptually distinguished from the general-purpose linguistic model.

The lexicon of ElixirFM is derived from the open-source Buckwalter lexicon, which is redesigned in important respects, and from the PADT annotations.
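To give a rough feel for what modeling with abstract patterns means, here is a small Python sketch of root-and-pattern merging. It is not ElixirFM's Haskell code or API. It assumes, for illustration, that the letters F, C, and L in a pattern stand for the three consonants of a root, and it reproduces the lemmas fahim, fahm, and fahham from the Multi-Modal Annotation slide.

```python
def merge(root, pattern):
    """Interdigitate a triliteral root with a pattern.

    The letters F, C and L in the pattern are replaced by the first, second
    and third root consonant; every other character is copied unchanged.
    """
    radicals = root.split()
    if len(radicals) != 3:
        raise ValueError("this sketch handles triliteral roots only")
    slots = dict(zip("FCL", radicals))
    return "".join(slots.get(ch, ch) for ch in pattern)


for pattern in ("FaCiL", "FaCL", "FaCCaL"):
    print(pattern, merge("f h m", pattern))
# FaCiL fahim
# FaCL fahm
# FaCCaL fahham
```

Keeping the root and the pattern as separate pieces of data is what lets a lexicon state derivational information once per lexeme instead of spelling out every surface form.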


Page 48: Programming the Arabic Treebank

Lexicon’s Design

The lexicon is stored in a domain-specific embedded language. Its design aims at:

(a) representation of the linguistic data in an abstract and extensible notation that encodes both orthography and phonology, and whose interpretation is customizable

(b) organization of the lexicon so that it can be divided into separate units and interlinked with external modules, without any duplication of information

(c) a lexicon format in which editing and understanding the data is not unduly difficult, with lightweight markup whose syntax can be verified by automatic tools


|> "s l k" <| [
    FaCaL            `verb` [ "proceed", "behave" ]
                     `imperf` FCuL,
    FiCL             `noun` [ "wire", "thread" ]
                     `plural` HaFCAL,
    FiCL |< Iy       `adj`  [ "wire", "by wire" ],
    lA >| FiCL |< Iy `adj`  [ "wireless", "radio" ],
    FuCUL            `noun` [ "behavior", "conduct" ],
    FuCUL |< Iy      `adj`  [ "behavioral" ],
    MaFCaL           `noun` [ "road", "method" ]
                     `plural` MaFACiL ]

proceed, behave      I(u)   salak                 سَلَك
wire, thread                silk (ʾaslāk)         سِلْك (أَسْلَاك)
wire, by wire               silkīy                سِلْكِيّ
wireless, radio             lā-silkīy             لَاسِلْكِيّ
behavior, conduct           sulūk                 سُلُوك
behavioral                  sulūkīy               سُلُوكِيّ
road, method                maslak (masāliku)     مَسْلَك (مَسَالِكُ)
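How a root such as "s l k" combines with a template such as FaCaL can be illustrated with a minimal sketch. This is a toy model, not ElixirFM's implementation: the function name `interlock` is hypothetical, templates are plain strings rather than typed patterns, and the reading of the slot letters (F, C, L for the three root consonants, H for hamza, M for m) is an assumption based on the forms listed above. Output is in the ArabTeX-style notation used on the slides.

```haskell
-- Split a root written as "s l k" into its consonants.
rootConsonants :: String -> [String]
rootConsonants = words

-- Fill a template with the consonants of a triliteral root.
-- Assumed slot letters: F, C, L = first, second, third root consonant;
-- H = hamza ('), M = m. Every other character is copied unchanged.
interlock :: String -> String -> Maybe String
interlock root template =
  case rootConsonants root of
    [f, c, l] -> Just (concatMap (slot f c l) template)
    _         -> Nothing  -- this toy handles triliteral roots only
  where
    slot f _ _ 'F' = f
    slot _ c _ 'C' = c
    slot _ _ l 'L' = l
    slot _ _ _ 'H' = "'"
    slot _ _ _ 'M' = "m"
    slot _ _ _ x   = [x]
```

For instance, `interlock "s l k" "FuCUL"` yields `Just "sulUk"`, and `interlock "s l k" "MaFACiL"` yields `Just "masAlik"`. The sketch applies no phonological rules, so weak roots would need extra handling.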


User Interface

ElixirFM implements various utility functions for lookup in the lexicon, inflection and derivation of lexemes, resolution of strings, exporting and pretty-printing of the information, etc.

lookupEntry "lA-silkIy" ...    lookupReflex "wireless" ...

inflect (lA >| FiCL |< Iy `adj` []) "------F[SP]-D"

derive ("w .s y" <-> HaFCY `verb` ["recommend"]) "A--P"

resolve "mU.saNY"    resolveBy (omitting "[diacritic marks]") "موصى"

"s l k" `merge` al >| lA >| FiCL |< Iy |<< "u"

"al-lA-silkIyu"    al-lā-silkīyu    اللاسلكي    اللّاسِلْكِيُّ
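The idea behind lookup by reflex can be sketched over a toy lexicon. Everything below is hypothetical and simplified for illustration: the `Entry` record, the `toyLexicon` value, and the `lookupReflexIn` function are not ElixirFM's own types or signatures, and the morphs are stored as plain text rather than as typed expressions.

```haskell
-- One lexical entry, simplified: the morphs are kept as plain text.
data Entry = Entry
  { morphs    :: String
  , wordClass :: String
  , reflex    :: [String]
  } deriving (Eq, Show)

-- A toy lexicon: entries nested under their root, as in the lexicon sample.
type Lexicon = [(String, [Entry])]

toyLexicon :: Lexicon
toyLexicon =
  [ ( "s l k"
    , [ Entry "FaCaL"            "verb" ["proceed", "behave"]
      , Entry "FiCL"             "noun" ["wire", "thread"]
      , Entry "FiCL |< Iy"       "adj"  ["wire", "by wire"]
      , Entry "lA >| FiCL |< Iy" "adj"  ["wireless", "radio"]
      , Entry "FuCUL"            "noun" ["behavior", "conduct"]
      ]
    )
  ]

-- Find every (root, entry) pair whose reflexes contain the given gloss.
lookupReflexIn :: Lexicon -> String -> [(String, Entry)]
lookupReflexIn lexicon gloss =
  [ (root, entry)
  | (root, entries) <- lexicon
  , entry <- entries
  , gloss `elem` reflex entry
  ]
```

Because entries stay nested under their root, a hit returns the root together with the entry, so the result carries enough context to inflect or derive from it afterwards.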


Form    All~Asilokiy~apu    اللّاسِلْكِيَّةُ
Morph   Al + lAsilokiy~ + ap + u
Tag     DET+ADJ+NSUFF_FEM_SG+CASE_DEF_NOM
Gloss   the + wireless / radio + [fem.sg.] + [def.nom.]
Lemma   lAsilokiy~_1    لاسِلْكِيّ

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Form    al-lA-silkIyaTu    al-lā-silkīyatu    اللّاسِلْكِيَّةُ
Morph   al >| lA >| FiCL |< Iy |< aT |<< "u"
Tag     A-----FS1D
Form    lA-silkIy    lā-silkīy    لاسِلْكِيّ
Morph   lA >| FiCL |< Iy
Root    "s l k"
Reflex  wireless, radio
Class   adjective


Form    waOuxoraY    وَأُخْرَى
Morph   wa + OuxoraY
Tag     CONJ+ADJ
Gloss   and + other / another / additional
Lemma   OuxoraY_1    أُخْرَى

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Form    'u_hrY   ʾuḫrā   أُخْرَى         wa   wa   وَ
Morph   FuCLY |<< "u"                  "wa"
Tag     A-----FS1I                     C---------
Form    'A_har   ʾāḫar   آخَر           wa   wa   وَ
Morph   HACaL                          "wa"
Root    "' _h r"                       "w"
Reflex  other, another                 and
Class   adjective                      conjunction


Form    sayad~aEiy    سَيَدَّعِي
Morph   sa + ya + d~aEiy + (null)
Tag     FUT+IV3MS+IV+IVSUFF_MOOD:I
Gloss   will + he / it + allege / claim / testify + [ind.]
Lemma   Aid~aEaY_1    اِدَّعَى

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Form    yadda`I   yaddaʿī   يَدَّعِي        sa   sa   سَ
Morph   "ya" >>| FtaCI |<< "u"           "sa"
Tag     VIIA-3MS--                       F---------
Form    idda`Y   iddaʿā   اِدَّعَى          sa   sa   سَ
Morph   IFtaCY                           "sa"
Root    "d ` w"                          "s"
Reflex  allege, claim, testify           future marker
Class   verb                             particle


ElixirFM carefully designs the morphophonemic patterns of the templates, along with the phonological rules hidden in the >| or |<< operators. This greatly simplifies the morphological rules proper, inflectional or derivational. ElixirFM implements many generalizations of classical grammars, and suggests some new ones.

"ya" >>| FtaCI |<< "u"    yadda`I      yaddaʿī      يَدَّعِي
"ya" >>| FtaCI |<< "a"    yadda`iya    yaddaʿiya    يَدَّعِيَ
"ya" >>| FtaCI |<< ""     yadda`i      yaddaʿi      يَدَّعِ
"ya" >>| FCuL  |<< "u"    yaktubu      yaktubu      يَكْتُبُ
"ya" >>| FCuL  |<< "a"    yaktuba      yaktuba      يَكْتُبَ
"ya" >>| FCuL  |<< ""     yaktub       yaktub       يَكْتُبْ
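The six forms above suggest how a phonological rule can live inside the suffixing operator, so that the inflectional rule itself stays uniform. The sketch below is a toy under stated assumptions, not ElixirFM's definition of these operators: it works on plain strings in which the template has already been merged with the root (`"dda`I"` for FtaCI, `"ktub"` for FCuL), and it covers only the contractions visible in the table.

```haskell
infixr 7 >>|
infixl 6 |<<

-- Attach an inflectional prefix to a stem (plain concatenation in this toy).
(>>|) :: String -> String -> String
prefix >>| stem = prefix ++ stem

-- Attach a mood ending. A stem-final long I absorbs or reshapes the ending;
-- any other stem takes the ending unchanged.
(|<<) :: String -> String -> String
stem |<< ending =
  case (reverse stem, ending) of
    ('I' : _,    "u") -> stem                     -- -I + u  stays -I
    ('I' : rest, "a") -> reverse rest ++ "iya"    -- -I + a  becomes -iya
    ('I' : rest, "")  -> reverse rest ++ "i"      -- -I + "" shortens to -i
    _                 -> stem ++ ending
```

With these fixities, `"ya" >>| "dda`I" |<< "a"` parses as `("ya" >>| "dda`I") |<< "a"` and gives `"yadda`iya"`, while `"ya" >>| "ktub" |<< "a"` gives `"yaktuba"`. The caller writes the same expression for both verbs.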


Buckwalter Transliteration

يُولَدُ جَمِيعُ ٱلنَّاسِ أَحْرَارًا مُتَسَاوِينَ فِي ٱلْكَرَامَةِ وَٱلْحُقُوقِ. وَقَدْ وُهِبُوا عَقْلاً وَضَمِيرًا وَعَلَيْهِمْ أَنْ يُعَامِلَ بَعْضُهُمْ بَعْضًا بِرُوحِ ٱلْإِخَاءِ.

yuwladu jamiyEu {ln~aAsi OaHoraArFA mutasaAwiyna fiy {lokaraAmapi wa{loHuquwqi. waqado wuhibuwA EaqolAF waDamiyrFA waEalayohimo Oano yuEaAmila baEoDuhumo baEoDFA biruwHi {loIixaA'i.

يولد جميع الناس أحرارا متساوين في الكرامة والحقوق. وقد وهبوا عقلا وضميرا وعليهم أن يعامل بعضهم بعضا بروح الإخاء.

ywld jmyE AlnAs OHrArA mtsAwyn fy AlkrAmp wAlHqwq. wqd whbwA EqlA wDmyrA wElyhm On yEAml bEDhm bEDA brwH AlIxA'.
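Buckwalter transliteration is a one-to-one mapping between ASCII symbols and Arabic characters, which makes it easy to sketch as a lookup table. The code below is an illustration only and is not the Encode Arabic implementation; the names `buckwalterTable`, `decodeBuckwalter`, and `encodeBuckwalter` are hypothetical. It assumes the XML-friendly variant seen in the sample above, where O, I, and W stand for the hamza carriers, and it passes unknown characters through unchanged.

```haskell
-- Each Buckwalter symbol paired with its Unicode character.
buckwalterTable :: [(Char, Char)]
buckwalterTable =
  [ ('\'', '\x0621'), ('|', '\x0622'), ('O', '\x0623'), ('W', '\x0624')
  , ('I', '\x0625'), ('}', '\x0626'), ('A', '\x0627'), ('b', '\x0628')
  , ('p', '\x0629'), ('t', '\x062A'), ('v', '\x062B'), ('j', '\x062C')
  , ('H', '\x062D'), ('x', '\x062E'), ('d', '\x062F'), ('*', '\x0630')
  , ('r', '\x0631'), ('z', '\x0632'), ('s', '\x0633'), ('$', '\x0634')
  , ('S', '\x0635'), ('D', '\x0636'), ('T', '\x0637'), ('Z', '\x0638')
  , ('E', '\x0639'), ('g', '\x063A'), ('_', '\x0640'), ('f', '\x0641')
  , ('q', '\x0642'), ('k', '\x0643'), ('l', '\x0644'), ('m', '\x0645')
  , ('n', '\x0646'), ('h', '\x0647'), ('w', '\x0648'), ('Y', '\x0649')
  , ('y', '\x064A'), ('F', '\x064B'), ('N', '\x064C'), ('K', '\x064D')
  , ('a', '\x064E'), ('u', '\x064F'), ('i', '\x0650'), ('~', '\x0651')
  , ('o', '\x0652'), ('`', '\x0670'), ('{', '\x0671')
  ]

-- Map one character through an association list, keeping it if unmapped.
through :: [(Char, Char)] -> Char -> Char
through table c = maybe c id (lookup c table)

-- Buckwalter ASCII to Arabic script.
decodeBuckwalter :: String -> String
decodeBuckwalter = map (through buckwalterTable)

-- Arabic script to Buckwalter ASCII, using the inverted table.
encodeBuckwalter :: String -> String
encodeBuckwalter = map (through [ (u, b) | (b, u) <- buckwalterTable ])
```

Since the table is a bijection and other characters are left alone, `encodeBuckwalter . decodeBuckwalter` returns its input unchanged for text like the unvocalized sample above.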


Notation of ArabTeX

يُولَدُ جَمِيعُ ٱلنَّاسِ أَحْرَارًا مُتَسَاوِينَ فِي ٱلْكَرَامَةِ وَٱلْحُقُوقِ. وَقَدْ وُهِبُوا عَقْلاً وَضَمِيرًا وَعَلَيْهِمْ أَنْ يُعَامِلَ بَعْضُهُمْ بَعْضًا بِرُوحِ ٱلْإِخَاءِ.

يولد جميع الناس أحرارا متساوين في الكرامة والحقوق. وقد وهبوا عقلا وضميرا وعليهم أن يعامل بعضهم بعضا بروح الإخاء.

Yūladu ǧamīʿu ’n-nāsi ʾaḥrāran mutasāwīna fī ’l-karāmati wa-’l-ḥuqūqi. Wa-qad wuhibū ʿaqlan wa-ḍamīran wa-ʿalay-him ʾan yuʿāmila baʿḍu-hum baʿḍan bi-rūḥi ’l-ʾiḫāʾi.

\cap yUladu ^gamI`u an-nAsi 'a.hrAraN mutasAwIna fI al-karAmaTi wa-al-.huqUqi.

\cap wa-qad wuhibUW `aqlaN wa-.damIraN wa-`alay-him 'an yu`Amila ba`.du-hum ba`.daN bi-rU.hi al-'i_hA'i.


Encode Arabic

biruwHi {loIixaA'i    بِرُوحِ ٱلْإِخَاءِ

bi-rūḥi ’l-ʾiḫāʾi    bi-rU.hi al-'i_hA'i

[>] decode ArabTeX < decode.d | encode Buckwalter > encode.d

Implemented in Perl and available on CPAN as Encode-Arabic:

$encoded = encode "buckwalter", decode "arabtex", $decoded

$encoded = encode("buckwalter", decode("arabtex", $decoded))

Implemented in Haskell and available along with ElixirFM:

encoded = encode Buckwalter $ decode ArabTeX decoded

encoded = encode Buckwalter (decode ArabTeX decoded)

encoded = (encode Buckwalter . decode ArabTeX) decoded


References

ElixirFM plus lexicons, Encode Arabic, MorphoTrees, and ArabTeX extensions are open-source software licensed under GNU GPL:

http://sourceforge.net/projects/elixir-fm/

http://sourceforge.net/projects/encode-arabic/

Prague Arabic Dependency Treebank ++ is the project’s weblog:

http://ufal.mff.cuni.cz/padt/online/


Buckwalter, Tim. Buckwalter Arabic Morphological Analyzer 1.0. LDC2002L49, ISBN 1-58563-257-0. 2002

Forsberg, Markus and Aarne Ranta. Functional Morphology. Proceedings of ICFP 2004, pages 213–223. ACM Press. 2004

Lagally, Klaus. ArabTeX: Typesetting Arabic and Hebrew, User Manual Version 4.00. Technical Report 2004/03, Fakultät Informatik, Universität Stuttgart. 2004

Sgall, Petr and Eva Hajičová and Jarmila Panevová. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. Academia, Prague. 1986


Smrž, Otakar and Petr Pajas. MorphoTrees of Arabic and Their Annotation in the TrEd Environment. Proceedings of the NEMLAR Conference 2004, pages 38–41. 2004

Smrž, Otakar. Functional Arabic Morphology. Formal System and Implementation. Ph.D. thesis, Charles University in Prague. 2007

Smrž, Otakar et al. Prague Arabic Dependency Treebank: A Word on the Million Words. LREC 2008 Workshop on Arabic and Local Languages. 2008
