Wido van Peursen, VU University Amsterdam, Faculty of Theology.

Post on 27-Dec-2015


1. The corpus: Hebrew Bible
2. The WIVU Database
3. CLARIN-project: SHEBANQ
4. NWO-project: Syntactic Diversity in BH
5. Case study: Judges 4 and 5

• Ca. 400,000 words
• Probably composed over a period of ca. 1000 years (1200-200 BC)
• Complex transmission history
• Oldest complete MS: Codex Leningradensis, 1008/9 AD
• Various linguistic layers (e.g. vowel signs)
• No native speakers

WIVU database of the Hebrew Bible [WIVU = Werkgroep Informatica Vrije Universiteit]
• In development since the 1970s
• Linguistic levels:
  Morphology (encoding rather than tagging!)
  Words
  Phrases
  Clauses
  Sentences
  Text hierarchy

SHEBANQ = System for HEBrew text: ANnotations for Queries and markup

Challenges:

1. No dedicated space on the web where an authorized version of this resource is guaranteed to exist.

2. No possibility to annotate it, link to it or build (open source) tools around it.

3. Results of existing queries cannot be shown on the web.

4. EMDROS is maintained by a one-person private company.

5. Mainly used by specialists in Bible & Computer.

Mission:
• To build a bridge between the linguistically annotated Hebrew text corpus and biblical scholars.

Three steps:
(1) make text and annotations available to scholars;
(2) demonstrate how queries can function to address research questions: repository of saved queries;
(3) give textual scholarship a more empirical basis, by creating the opportunity of unique identifiers referring to saved queries.

Example: “in-his-feet”:
a. “on foot” or
b. “in his footsteps”.
Disambiguation:
1. intuitive/contextual or
2. on the basis of pattern recognition (participants/agreement)

[she-sang <Pr>] [Deborah and Barak <Su>]

Does Syntactic Variation reflect Language Change? Tracing Syntactic Diversity in Biblical Hebrew Texts

Explanations for linguistic diversity:
• Genre
• Chronology
• Language contact (Aramaic)
• Dialects
• Textual transmission
• Oral versus written layers

Limitations in current research:
• Focus on separate Bible books
• Methodological presuppositions
• Focus on lexical items or set phrases
• Failure to make use of methods for researching linguistic variation and change.
• Failure to incorporate insights into syntactic differences between independent / dependent clauses and between narration / direct speech.

Our approach
• Focus on syntax in three project components:
  Phrase level
  Clause level
  Text level
• Synthesis: integration of congruous and contradicting tendencies.
• Extra-biblical texts used as points of comparison.

Judges 4 and 5 deal with the battle
• of Deborah, Barak and Israelite tribes
• against the Canaanite king Jabin and his army-captain Sisera.

Differences, e.g.:
• 4 is prose, 5 is poetry.
• Main figures (Jabin absent in 5).
• Tribes involved (only two in 4).

• 4 depends on 5: Wellhausen 1878; Halpern 1983; Houston 1997; Neef 2002 and many others.
• 5 depends on 4: Bechmann 1989; Waltisberg 1999.
• Common source/tradition: Richter 1963; Younger 1991.
• Synchronous/sequential: Guest 1998; Reis 2005.

1. Identification of ‘similar’ text segments on the basis of ‘distance’ (synopsis impossible).

2. Identification of text features that cause high similarity scores.

3. Analysis of the distribution of these features in the larger context of Judges and the Old Testament.

Is intuition that 4 and 5 belong together supported by textual features?

If so, where in the text can they be found?

Similarity matrices: a ‘distance’ is measured between each verse of ch. 4 and each verse of ch. 5.

Rows: verses of ch. 4; columns: verses of ch. 5; cells: number of shared lexemes.

4\5  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  1  1  2  2  1  2  1  1  1  2  0  2  1  1  0  0  0  0  1  0  0  0  0  1  0  0  0  0  0  0  0  1
  2  0  1  2  1  1  0  0  0  1  1  1  0  1  0  1  1  1  0  2  1  0  0  2  0  0  2  0  1  0  1  1
  3  1  2  2  1  2  1  1  1  2  0  2  1  1  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  2
  4  1  1  1  0  1  0  2  1  1  0  1  1  0  0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
  5  2  1  1  0  2  1  2  1  1  1  2  2  0  1  1  2  1  0  0  0  0  0  1  0  0  0  1  0  0  0  0
  6  4  2  3  1  4  2  1  3  2  1  2  3  1  2  2  0  0  2  1  0  0  0  2  0  0  1  0  0  0  0  1
  7  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  1  2  0  0  0  1  2  0  2  0  1  0
  8  2  0  0  0  0  1  0  0  0  1  0  1  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
  9  3  1  1  1  1  1  2  0  1  2  1  3  1  0  2  0  0  0  0  1  0  0  2  1  0  2  0  1  0  1  1
 10  2  0  0  0  0  0  1  1  0  0  0  2  0  1  3  0  0  2  0  0  0  0  0  0  0  0  1  0  0  0  0
 11  1  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  0  0  0  0
 12  3  0  0  0  1  1  0  0  0  0  0  3  0  0  1  0  0  0  0  1  0  0  0  0  0  1  0  1  0  1  0
 13  0  1  0  0  0  0  0  0  1  0  1  0  1  1  0  0  0  1  0  1  2  0  0  0  0  1  0  2  0  1  1
 14  4  1  1  2  3  1  2  1  1  0  2  3  2  2  2  0  0  0  0  1  0  0  2  0  1  2  0  1  0  1  2
 15  1  1  1  1  2  0  0  0  1  0  2  1  2  1  2  0  0  0  0  1  0  0  1  0  0  1  1  3  0  1  2
 16  1  0  0  0  0  0  0  0  0  0  0  1  0  1  1  0  0  0  0  1  0  0  0  0  0  1  1  2  0  1  1
 17  0  0  1  0  0  1  0  0  0  0  1  0  0  0  1  1  0  0  1  1  0  0  0  5  0  1  2  1  0  1  0
 18  1  0  0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  2  0  1  0  1  0  1  1
 19  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  2  0  0  0  0  0  0
 20  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  2  1  1  0  0  1  0  0  0
 21  0  0  0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  1  2  0  0  0  1  4  0  3  0  1  0  0  1
 22  2  0  0  1  0  2  0  1  0  1  0  1  0  0  1  0  0  1  1  1  0  0  2  1  0  3  1  2  0  1  1
 23  2  1  3  0  3  2  1  2  1  0  1  1  0  0  0  0  0  0  2  0  0  0  0  0  0  0  0  0  0  0  0
 24  1  1  2  0  1  2  1  1  1  1  1  1  0  0  0  0  0  0  2  0  0  0  0  0  0  1  0  0  0  0  0

Shared lexemes: the more shared lexemes, the smaller the distance.

‘Noise’: e.g. ‘and’. Remedies:
• Stoplist: exclude frequent particles etc.
• Selection of content words on the basis of part of speech: only words with inflection (nouns, verbs, adjectives).
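The counting and filtering described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy verses, the part-of-speech labels and the function names `content_lexemes` and `shared_lexeme_matrix` are hypothetical and are not taken from the WIVU database or from the project's own software.

```python
from typing import FrozenSet, List, Set, Tuple

# A verse is modelled as a list of (lexeme, part of speech) pairs.
# Both the data and the labels below are hypothetical toy values.
Verse = List[Tuple[str, str]]

# Assumption: "words with inflection" are approximated by these labels.
CONTENT_POS = {"noun", "verb", "adjective"}


def content_lexemes(verse: Verse, stoplist: FrozenSet[str] = frozenset()) -> Set[str]:
    """Keep only content words, and drop anything on the stoplist."""
    return {lex for lex, pos in verse if pos in CONTENT_POS and lex not in stoplist}


def shared_lexeme_matrix(
    chapter_a: List[Verse],
    chapter_b: List[Verse],
    stoplist: FrozenSet[str] = frozenset(),
) -> List[List[int]]:
    """Cell [i][j] = number of lexemes shared by verse i of A and verse j of B."""
    sets_a = [content_lexemes(v, stoplist) for v in chapter_a]
    sets_b = [content_lexemes(v, stoplist) for v in chapter_b]
    return [[len(a & b) for b in sets_b] for a in sets_a]


chapter_a = [
    [("and", "conjunction"), ("Barak", "noun"), ("say", "verb")],
    [("Jael", "noun"), ("tent", "noun"), ("the", "article")],
]
chapter_b = [
    [("and", "conjunction"), ("sing", "verb"), ("Barak", "noun")],
    [("Jael", "noun"), ("tent", "noun"), ("and", "conjunction")],
]

matrix = shared_lexeme_matrix(chapter_a, chapter_b)
print(matrix)  # [[1, 0], [0, 2]]
```

Note that ‘and’ occurs in both chapters but never contributes to a score, because the part-of-speech filter already removes it; the stoplist is an additional, independent filter.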

Basic unit for text comparison: verse, but ‘verse’ based on traditional unit delimitation.

Differences in verse size may affect results.

Jaccard Index: the number of shared lexemes (the size of the intersection) divided by the total number of distinct lexemes in both texts (the size of the union).

Example 1: ‘I went home’ vs. ‘I went home yesterday’
Intersection: shared lexemes (types): 3 (I, went, home)
Union: total number of lexemes: 4 (I, went, home, yesterday)
Jaccard Index = 3/4 = 0.75

Example 2: ‘I went home’ vs. ‘After the meeting I went home yesterday’
Intersection: 3 (I, went, home)
Union: 7 (I, went, home, after, the, meeting, yesterday)
Jaccard Index = 3/7 ≈ 0.43
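The two English examples can be reproduced with a minimal Python sketch. It assumes that lower-cased, whitespace-separated word forms stand in for lexemes; for Hebrew text the lexemes would have to come from the linguistic annotation instead. The function name `jaccard_index` is chosen here for illustration.

```python
def jaccard_index(text_a: str, text_b: str) -> float:
    """Size of the intersection divided by size of the union of word types."""
    # Assumption: lower-cased word forms are used as a stand-in for lexemes.
    types_a = set(text_a.lower().split())
    types_b = set(text_b.lower().split())
    union = types_a | types_b
    if not union:
        return 0.0  # two empty texts: define the index as 0
    return len(types_a & types_b) / len(union)


print(round(jaccard_index("I went home", "I went home yesterday"), 2))  # 0.75
print(round(jaccard_index("I went home",
                          "After the meeting I went home yesterday"), 2))  # 0.43
```

The second call shows why normalising by the union matters: the number of shared types is the same (3), but the longer second text lowers the score.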

Shared lexemes: ‘feature-based’.

Also ‘blind’ methods, based on mathematical characteristics of the digital representation of the text, e.g. Normalized Compression Distance (NCD).
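As a rough illustration of such a ‘blind’ method, the sketch below computes NCD with the commonly used formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The choice of zlib as compressor and the sample strings are assumptions made for this example; the slides do not say which compressor was used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance; smaller values mean more similar texts."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the two texts compressed together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample texts, repeated so the compressor has something to work with.
text_a = "After the meeting I went home yesterday. " * 20
text_b = "The similarity matrix compares every verse pair. " * 20

print(ncd(text_a, text_a) < ncd(text_a, text_b))  # True
```

No linguistic features are selected here: the compressor only sees bytes. On very short texts such as single verses, compressor overhead can dominate the result, so the values should be read with caution.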

Example: verse pairs with the highest number of shared lexemes (4 or more)

Ch. 4   Ch. 5   Shared lexemes
4:6     5:1     Abinoam, Barak, say, son
4:6     5:5     God, Israel, the LORD, mountain
4:14    5:1     Barak, day, Deborah, say
4:17    5:24    Heber, Jael, Kenite, tent, wife
4:21    5:24    Heber, Jael, tent, wife
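Selecting such verse pairs from a matrix of shared-lexeme counts amounts to a simple threshold filter. The sketch below uses four rows copied from the similarity matrix shown earlier (Judges 4:6, 4:14, 4:17 and 4:21 against Judges 5:1-31); the names `SHARED` and `top_pairs` are hypothetical.

```python
from typing import Dict, List, Tuple

# Rows copied from the similarity matrix: key = verse in ch. 4,
# list position 1..31 = verse in ch. 5, value = number of shared lexemes.
SHARED: Dict[int, List[int]] = {
    6:  [4, 2, 3, 1, 4, 2, 1, 3, 2, 1, 2, 3, 1, 2, 2, 0,
         0, 2, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1],
    14: [4, 1, 1, 2, 3, 1, 2, 1, 1, 0, 2, 3, 2, 2, 2, 0,
         0, 0, 0, 1, 0, 0, 2, 0, 1, 2, 0, 1, 0, 1, 2],
    17: [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
         0, 0, 1, 1, 0, 0, 0, 5, 0, 1, 2, 1, 0, 1, 0],
    21: [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 2, 0, 0, 0, 1, 4, 0, 3, 0, 1, 0, 0, 1],
}


def top_pairs(rows: Dict[int, List[int]], threshold: int = 4) -> List[Tuple[int, int, int]]:
    """Return (verse in ch. 4, verse in ch. 5, count) for counts >= threshold."""
    return [
        (verse4, verse5, count)
        for verse4, counts in sorted(rows.items())
        for verse5, count in enumerate(counts, start=1)  # verses are 1-based
        if count >= threshold
    ]


print(top_pairs(SHARED))
# [(6, 1, 4), (6, 5, 4), (14, 1, 4), (17, 24, 5), (21, 24, 4)]
```

The result matches the five verse pairs listed in the example.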

• Proper nouns: ‘Barak’, ‘Israel’.
• Common nouns that are part of proper noun phrases: ‘wife’ in ‘Jael the wife of Heber’; ‘son’ in ‘Barak the son of Abinoam’.
• Other verbs and common nouns: ‘say’, ‘tent’, ‘day’.

High similarity scores in places that show high concentration of proper nouns.

Even within category of proper nouns considerable differences.

Shared common nouns and verbs: frequent words such as ‘day’, ‘say’. No significant concentration.

In case of literary dependency we would expect at least some concentration of shared lexemes.

Significant number of shared lexemes only in case of proper nouns.

But proper nouns suggest shared traditions, rather than literary dependency.