[DCSB] Amir Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visualizing Coptic Manuscript...

Towards Digital Coptic

Caroline T. Schroeder, University of the Pacific

Amir Zeldes, Humboldt-Universität zu Berlin

Berlin Digital Classicist Seminar, 14.1.2014

Searching and Visualizing Coptic Manuscript Data





Berlin, 14.1.2014 Schroeder & Zeldes / Towards Digital Coptic

Plan

Introduction

Coptic data

Annotations so far: normalizing, tokenizing and tagging

Search architecture

Searching through multiple segmentations: ANNIS

Dealing with corpus formats: TEI, SaltNPepper

Visualization

Dedicated visualizations

A reusable generic approach

Conclusion and outlook


Who are these people?

Prof. Caroline T. Schroeder – Religious and Classical Studies / Humanities Center Director University of the Pacific

Dr. Amir Zeldes – Korpuslinguistik / SFB 632 Information Structure (from March: eHumanities group KOMeT) Humboldt-Universität zu Berlin

Cooperation: Coptic SCRIPTORIUM, established at the 2012 NEH summer institute on "Text in a Digital Age" (Tufts): http://coptic.pacific.edu/


Why Coptic?

Last stage of Ancient Egyptian Language (starting 2nd Century)

Mediterranean in 1st millennium

Hellenistic period

Unique language

Longest continuous documentation

Contact language (with Greek)

Religious significance

Early Christianity

Rise of monasticism

Gnosticism

...

BMBF eHumanities - KOMeT / Zeldes: Coptic Dialects


The data

Lots of material (thanks to the Egyptian desert)

Relatively little online, nothing like Greek and Latin (Perseus)

Lots of things you may want are not available:

New Testament (online, not normalized/lemmatized/annotated)

Old Testament

The Rule of St. Pachomius

Works of Shenoute of Atripe

Apophthegmata patrum

...

But some have been digitized at some point!


A word about the texts in this talk

So far we've concentrated on Shenoute's sermon Abraham our Father

"As for us, brethren, let us live by the truth so that we are upstanding in all our works, and so that the prophets, apostles and all the saints might dwell among us, ..."

Apophthegmata Patrum (sayings of the desert fathers)

"They said about the blessed Sarah the virgin that she spent sixty years living at the top of the river and she never set foot outside to see the river."

New Testament, esp. Gospel of Mark

see http://coptic.pacific.edu/ for corpora and tools





Getting from raw text to annotated corpora

Making the data searchable starts with:

Encoding manuscripts (EpiDoc TEI)

Segmentation of "word forms"

Normalization

Segmentation of morphemes

Part-of-speech tagging

More annotations...

Brief recap: Detailed talk in Leipzig last month (slides on my page)


Normalization

Automatic normalization, manual correction

handling of known diacritics, abbreviations

closed, growing list of known variants


Tokenization

Identifying morphemes non-trivial (agglutinative language, different conventions; we follow Layton 2004)

ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ 'Since I became a monk' since-that-PAST-1sg-do-monk

ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ 'he who made us keep the ceremony' REL-PAST-3sgM-CAUS-1pl-do-the-observance

Word level segmentation: manual (no scriptio continua)

Morph segmentation: automatic (accuracy: 84% - 94%)

Word forms: ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ` (of-a-son of-Abraham)

Morphs: ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ (of a son of Abraham)
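The normalization and morph segmentation steps can be illustrated with a minimal Python sketch. It is not the project's actual normalizer or segmenter: stripping combining diacritics and greedy longest-match lookup in a tiny hand-made lexicon are assumptions made for illustration, and the names `normalize`, `segment` and `LEXICON` are hypothetical.

```python
import unicodedata

# Toy lexicon of morphs (hypothetical; a real lexicon would be far larger).
LEXICON = {"ⲛ", "ⲟⲩ", "ϣⲏⲣⲉ", "ⲁⲃⲣⲁϩⲁⲙ"}


def normalize(word_form):
    """Drop combining diacritics (e.g. supralinear strokes) and the ` mark."""
    decomposed = unicodedata.normalize("NFD", word_form)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn" and ch != "`"
    )


def segment(word_form, lexicon):
    """Greedy longest-match segmentation; raises if no lexicon entry fits."""
    morphs, pos = [], 0
    longest = max(len(entry) for entry in lexicon)
    while pos < len(word_form):
        for size in range(min(longest, len(word_form) - pos), 0, -1):
            candidate = word_form[pos:pos + size]
            if candidate in lexicon:
                morphs.append(candidate)
                pos += size
                break
        else:
            raise ValueError(f"no lexicon entry matches at position {pos}")
    return morphs


# The first word form from the slide, with its supralinear stroke (U+0304).
print(segment(normalize("ⲛ\u0304ⲟⲩϣⲏⲣⲉ`"), LEXICON))
```

Greedy matching picks the wrong split whenever a longer lexicon entry happens to fit, which is one reason automatic segmentation needs manual correction.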


Part-of-speech tagging

POS tagging using TreeTagger (Schmid 1994) and a lexicon from the CMCL project (courtesy of Prof. Tito Orlandi)

Two tag sets:

fine grained (45 tags) and coarse (22 tags) (see http://coptic.pacific.edu/ for documentation)

Interannotator agreement: 94.19% raw agreement, kappa = 0.9367 (kappa corrects for chance agreement, cf. Artstein & Poesio 2008)

Accuracy:

In domain, 10-fold cross-validation: 94.04% (fine)

Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)

Main difficulties: open classes (N/V), disambiguating homonyms (ⲉ can have 6 different tags!)
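To make the difference between raw agreement and kappa concrete, here is a minimal Python sketch of Cohen's kappa for two annotators. The slides do not say which chance-corrected coefficient was used, so Cohen's kappa is an assumption, and the tag sequences are invented toy data.

```python
from collections import Counter


def cohens_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators' tag sequences."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(tags_a)
    # Observed agreement: share of items with identical tags.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected agreement if each annotator tagged at random
    # according to their own tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["N", "V", "N", "N"]
annotator_2 = ["N", "V", "V", "N"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5 (raw agreement is 0.75)
```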



Further annotations

Many other layers are done manually:

Translation

Language of origin

Coreference

Entity tagging (people, places...)

Parallel alignment (with Greek)

Syntax trees (very preliminary tests)


Representing data – how to look at all this stuff?

We now have a lot of data to represent:

Diplomatic transcriptions (including character rendering!)

Normalization

Segmentation into words, morphemes, sometimes letters

Annotations

How do we encode this data for search and visualization?


The first challenge: minimal units

Minimal units, or tokens, are critical for searching:

Find all words preceding the word "God"

Give me any mentions of Saint Paphnutius, ±10 words

Search for the glosses father and son within 20 words

Two problems:

The concept of words is complex in Coptic

Annotations overlap parts of words: individual letters, line breaks... tokens are smaller than words!

ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ

he sAid "it's been e

ight years" –

The old man told him


Solution: segmentation layers in ANNIS

We use the open source ANNIS platform as a search interface (Zeldes et al. 2009)

Any annotation layer can be defined as a segmentation defining alternative views on:

Adjacency (in words, morphemes, etc.)

Proximity (in words, morphemes, etc.)

Context size (in words, morphemes, etc.)

But which segmentation layer do you want to see?

Remember, diplomatic and normalized layers don't match

Any segmentation layer is usable as "base text"


Switching segmentations in ANNIS


Different contexts

Example search: entity="person"

Hit: Abba Antonius

Some options:

±5 words, diplomatic: (less than -5 found, since start of text) Ⲁⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉⲟⲩⲛⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇ ⲙ̇ⲙⲟⲕ

±10 morphs, normalized: ⲁ ⲩ ϭⲱⲗⲡ ⲉⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁⲛⲧⲱⲛⲓⲟⲥ ϩⲓ ⲡ ϫⲁⲓⲉ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ ⲉ ϥ ⲉⲓⲛⲉ ⲙⲙⲟ ⲕ

±5 tokens: Ⲁ ⲩ ϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ ϫⲁⲓ̇ⲉ̇ · ϫⲉ

Ⲁⲩϭⲱⲗⲡ̇ 5 ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇


Searching with AQL (see http://www.sfb632.uni-potsdam.de/annis/)

Basic principle of ANNIS Query Language (AQL):

search for some annotations (#1, #2, #3...)

stipulate relationships between them (operators)

Example: verbs of Greek origin

pos="V" & source_lang="Greek" & #1 _=_ #2

Example hits (in translation): "The head bandit repented", "I have faith in God"

_=_ : identical coverage operator
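What the query above asks for can be sketched in a few lines of Python. This is a simplified model and not how ANNIS evaluates AQL: annotations are plain records over token indices, and `Annotation`, `identical_coverage` and the sample data are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    name: str
    value: str
    start: int  # index of the first covered token
    end: int    # index of the last covered token (inclusive)


def identical_coverage(annotations, first, second):
    """Return pairs (a, b) matching the two name/value constraints
    and covering exactly the same span of tokens."""
    lefts = [a for a in annotations if (a.name, a.value) == first]
    rights = [b for b in annotations if (b.name, b.value) == second]
    return [(a, b) for a in lefts for b in rights
            if (a.start, a.end) == (b.start, b.end)]


# Hypothetical annotations over a five-token text.
annotations = [
    Annotation("pos", "V", 1, 1),
    Annotation("source_lang", "Greek", 1, 1),
    Annotation("pos", "N", 3, 3),
    Annotation("source_lang", "Greek", 3, 3),
    Annotation("pos", "V", 4, 4),
]

hits = identical_coverage(annotations, ("pos", "V"), ("source_lang", "Greek"))
print(hits)
```

Only token 1 is both a verb and of Greek origin, so the Greek noun at token 3 and the native verb at token 4 are not returned.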








Referencing segmentations

There are many operators

. (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...

> (dominance), -> (pointing relation), >@l (left child)...

...

Possible to use segmentations in queries:

#1 . #2 - one followed by two

#1 .word #2 - two is the next word after one

#1 .norm,1,10 #2 - within 1 to 10 norm units

...
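The effect of choosing a segmentation for adjacency and proximity can be sketched as follows. This is a toy model under stated assumptions, not the ANNIS implementation: minimal tokens are numbered 0 to 7, each layer is a list of inclusive token spans, and `unit_index`, `distance` and `precedes` are hypothetical helpers.

```python
def unit_index(layer, token):
    """Index of the unit in `layer` that covers the given minimal token."""
    for index, (start, end) in enumerate(layer):
        if start <= token <= end:
            return index
    raise ValueError(f"token {token} is not covered by this layer")


def distance(layer, first, second):
    """Units of `layer` between the end of `first` and the start of `second`."""
    return unit_index(layer, second[0]) - unit_index(layer, first[1])


def precedes(layer, first, second, min_dist=1, max_dist=1):
    """True if `second` starts min_dist..max_dist units after `first` ends."""
    return min_dist <= distance(layer, first, second) <= max_dist


# Eight minimal tokens (0-7); each layer lists inclusive token spans.
WORD = [(0, 2), (3, 3), (4, 5), (6, 7)]
NORM = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 7)]

verb = (2, 2)    # hypothetical annotation on token 2
person = (5, 7)  # hypothetical annotation on tokens 5-7

print(distance(WORD, verb, person))         # 2
print(distance(NORM, verb, person))         # 3
print(precedes(WORD, verb, person))         # False
print(precedes(NORM, verb, person, 1, 10))  # True
```

In this model `precedes(WORD, a, b)` is roughly analogous to `#1 .word #2`, and `precedes(NORM, a, b, 1, 10)` to `#1 .norm,1,10 #2`: the same two annotations are at different distances depending on the layer used for counting.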


Adding metadata

Metadata is like any other constraint, with meta:: prefix

Can use regular expressions and negation

pos!="V" & source_lang="Greek" & #1 _=_ #2 & meta::msName=/MONB.*/

For metadata names and values we use TEI/EpiDoc as a guideline

More information on AQL: http://www.sfb632.uni-potsdam.de/annis/
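A metadata constraint with a regular expression and optional negation can be pictured with this small Python sketch. It assumes the pattern must match the whole metadata value; `meta_matches`, the document names and the `msName` values are hypothetical placeholders, not entries from the corpus.

```python
import re


def meta_matches(metadata, key, pattern, negate=False):
    """Test one metadata constraint; the pattern must match the whole value."""
    value = metadata.get(key)
    matched = value is not None and re.fullmatch(pattern, value) is not None
    return not matched if negate else matched


# Hypothetical documents; the msName values are placeholders.
documents = [
    {"name": "doc_a", "msName": "MONB.XX"},
    {"name": "doc_b", "msName": "OTHER.01"},
]

selected = [d["name"] for d in documents
            if meta_matches(d, "msName", r"MONB.*")]
excluded = [d["name"] for d in documents
            if meta_matches(d, "msName", r"MONB.*", negate=True)]
print(selected, excluded)
```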







Architecture and formats

Different formats are suitable for different parts of the data

TEI ideal for manuscript structure, metadata

Linguistic formats for computational corpus linguistics: tagging, parsing, coreference

Convert and merge data using SaltNPepper (Zipser & Romary 2010)


SaltNPepper (Zipser & Romary 2010)

Metamodel Salt for multiformat conversion

Work on extending TEI support: 2014-15

Salt as internal representation in ANNIS


How can we view the data?

Even if we can query everything at once:

people who are indirect objects of the verb "show" aligned with Greek neuters...

Can we also look at everything at once?

Excerpt from a Salt graph view of two words:


Breaking it down

Different annotations require different visualizations

Two conflicting requirements:

Ideal representation for each layer (syntax -> trees)

Stay generic and minimize the number of visualizations

How can we avoid programming new visualizations with each new annotation layer?


Generic versus dedicated

For some purposes, dedicated visualizations cannot be avoided

Special interactive functionality

Special layouting algorithms

For other purposes, we can reuse visualizations by making them flexible and configurable

Need to take segmentations into account


Some dedicated examples

Syntax trees

Coreference view (interactive)


Taking segmentations into account

Visualizations must be configurable to be aware of different base texts

Syntax tree is based on normalized "word"-internal morphs

Sometimes one syntactic unit has multiple tokens

Example (figure): "came upon a band of bandits and found them drinking. [...]", where a line break splits "bandits" into the tokens "ban" and "dits"


Reusing dedicated visualizers?

In some cases, some creative uses can be found for existing visualizations

Using the coreference visualizer for parallel alignment:

apophthegmata patrum


Generic visualizations

Two main generic visualizers:

Annotation grid:

just mark borders of annotations

good for flat information

HTML visualizer:

generates HTML elements based on annotations

defined using two simple stylesheets

can look like (almost) anything


Multiple grids

All annotations in one grid can lead to visual overload

Often better to separate groups of annotations:


The HTML visualizer

Any specific visualization is configured by two style sheets: a config file and a CSS file

norm.config:

p p
word span; style="word"
norm span; style="norm" value
trans t:title; style="trans" value

norm.css:

div.htmlvis {
font-family: Antinoou, sans-serif; width: 500px; white-space: normal !important;
}
.trans:hover{color: red}
.word:after{content: " ";}

Result:

<t class="translation" title="Abraham our father wished to have children with Sarah."> ⲁⲃⲣⲁϩⲁⲙ ⲡⲉⲛ ⲉⲓⲱⲧ </t>

... 

Abraham our Father
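How a config rule might turn an annotation into an HTML element can be sketched in Python. The reading of the rule syntax used here (annotation name, element with an optional `:attribute`, an optional `style="..."` class, and an optional `value` flag) is inferred from the slide only, not taken from the ANNIS documentation, and `parse_rule` and `render` are hypothetical names.

```python
import html
import re

# One rule per line: <annotation> <element>[:<attribute>][; style="<class>"] [value]
RULE = re.compile(
    r'^(\S+)\s+(\w+)(?::(\w+))?(?:;\s*style="([^"]*)")?(?:\s+(value))?\s*$'
)


def parse_rule(line):
    """Split one config line into its parts (assumed syntax, see lead-in)."""
    match = RULE.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse rule: {line!r}")
    anno, element, attr, css_class, value_flag = match.groups()
    return {"anno": anno, "element": element, "attr": attr,
            "class": css_class, "show_value": value_flag is not None}


def render(rule, value, inner=""):
    """Build the element: the value goes into the attribute if one is named,
    otherwise into the element text when the rule carries the value flag."""
    attrs = ""
    if rule["class"]:
        attrs += f' class="{html.escape(rule["class"])}"'
    if rule["attr"]:
        attrs += f' {rule["attr"]}="{html.escape(value)}"'
        body = inner
    elif rule["show_value"]:
        body = html.escape(value)
    else:
        body = inner
    return f'<{rule["element"]}{attrs}>{body}</{rule["element"]}>'


norm_rule = parse_rule('norm span; style="norm" value')
trans_rule = parse_rule('trans t:title; style="trans" value')
inner = render(norm_rule, "abc")
print(render(trans_rule, "A translation", inner))
```

Because the mapping lives in data rather than code, a new annotation layer only needs a new rule line and, if desired, a CSS class.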


Reusing the HTML visualizer

dipl.config

tok span value

lb div; style="line"

pb table:title; style="pb" value

pb tr

cb td; style="cb"

hi_rend hi_rend:rend value


Visualizing TEI @rend attributes

dipl.css

div.line{display: block; height: 22px; counter-increment: linecount;}

div.line:nth-of-type(5n):before{ content: counter(linecount)" "}

...

.pb{border-style:solid;}

.cb{counter-reset: linecount 0; width: 160px; min-width: 160px}

...

hi_rend[rend*=superscript] {vertical-align: super; font-size: 80%}

hi_rend[rend*=red] {color: red}

hi_rend[rend*=tall] {font-size: 120%}

hi_rend[rend*=extralarge] {font-size: 160%}


Aggregate visualizations

Latest version of ANNIS offers basic frequency analysis

Open question: How much more should we build?
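As a rough idea of what a basic frequency analysis over query results involves, here is a minimal Python sketch that compares relative frequencies of an annotation value across two texts. It does not reproduce the ANNIS feature; `relative_frequencies` and the sample values are hypothetical.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of all values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical language-of-origin values for the words of two small texts.
text_a = ["Egyptian", "Egyptian", "Greek", "Egyptian"]
text_b = ["Greek", "Egyptian", "Greek", "Greek"]

for name, values in (("text_a", text_a), ("text_b", text_b)):
    print(name, relative_frequencies(values))
```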


Aggregate visualizations

Other visualizations are currently done e.g. in R:

[Plot: 11 apophthegmata patrum vs. Gospel of Mark 1. Group labels: "Egyptian vocabulary", "Greek vocabulary". Glosses shown on the plot include: old man, said, you.SG.M, Abba, eat, wine, I/me, synagogue, impure, baptism, John, Jesus, Holy Ghost, Gospel]


Conclusion

Annotation projects should not be limited by corpus architectures:

annotate whatever you want, however often you want

link anything to anything

Why annotate all of these things in the corpus? (and not just in a separate spreadsheet)

Plots of just the verbs? Proper names? -> POS tagging

Highlight, search and link place-names? -> Entity tagging

Collapse inflected variants? -> Lemmatization

Collapse prominent referents? -> Coreference annotation

Dispersion of any of the above, alignment ... and much more


Conclusion

Anything can be made queryable with more layers:

typical constructions and objects of verbs?

Greek vs. native verbs -> add language of origin layer

Translation behavior -> add alignment layer

...

Fitting visualization facilities

should be easy to re-use

optimized to the task, display relevant portions of information

for many purposes, they must be sensitive to segmentations


Outlook

This March: BMBF funded young researcher group on eHumanities at HU Berlin

KOMeT: KOrpuslinguistische Methoden für ePhilologie mit TEI

Focus on marrying TEI resources with computational linguistics methods and formats

Developing NLP tools, search and visualization for ancient world textual resources

Pilot phase (2014, approved): Coptic

Main phase (2015-2019, pending): Other languages as well

Currently looking for a student assistant (60h/month)

Stay tuned for more!

Ⲙⲓⲱⲧⲛ ⲧⲱⲛⲟⲩ! well-being+your.PL greatly => Thanks!

References

Artstein, Ron & Massimo Poesio (2008), Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4), 556–596.

Layton, Bentley (2004), A Coptic Grammar. Second Edition, Revised and Expanded. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.

Schmid, Helmut (1994), Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of the Conference on New Methods in Language Processing. Manchester, UK, 44–49. Available at: http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf.

Zeldes, Amir, Julia Ritz, Anke Lüdeling & Christian Chiarcos (2009), ANNIS: A Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009. Liverpool, UK.

Zipser, Florian & Laurent Romary (2010), A Model Oriented Approach to the Mapping of Annotation Formats using Standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC-2010. Valletta, Malta, 7–18.








Links

Coptic SCRIPTORIUM: http://coptic.pacific.edu/

ANNIS: http://www.sfb632.uni-potsdam.de/annis/

Search engine for our corpora: https://korpling.german.hu-berlin.de/annis3/scriptorium

Papyri.info: http://papyri.info/

CMCL: http://cmcl.let.uniroma1.it/











