Corpora built for linguistic varieties of a pluricentric language such as German are an...



Corpora built for linguistic varieties of a pluricentric language such as German are an indispensable resource for detailed and systematic variety comparison and for dictionary development. We present desiderata and suggestions, as well as methods from computational linguistics, for systematically applying variety corpora to the enrichment (i.e. confirmation, extension and generation) of lexical entries in variant dictionaries for German. Examples are the variant dictionaries by Ammon et al. (2004) and Abfalterer (2007); our focus is on South Tyrolean German. On the one hand, we conducted a systematic frequency analysis in newspaper variety corpora for approved lists of South Tyrolean special vocabulary in order to refine, where possible, the corresponding dictionary entries with corpus evidence. On the other hand, we filtered the list of words in our South Tyrolean corpus that could not be lemmatised by a tool developed for the variety used in Germany. After

Approaches to Computational Lexicography for German Varieties *

Andrea Abel, Stefanie Anstein - {aabel|sanstein}@eurac.edu - LCT day FUB - May 15th, 2008

Related Work

Variety Corpora

* Paper to be presented at Euralex 2008

• German: DWDS-Korpus (DE), Austrian Academy Corpus (AT), Schweizer Text Korpus (CH), Korpus Südtirol (IT) ‘C4’ platform

• English: International Corpus of English (ICE), London-Lund Corpus, ICAME etc.

• French: Trésor de la Langue Française Informatisé (au Quebec) etc.

• Spanish: Corpus del Español etc.

• ...

German variant dictionaries

German variety in South Tyrol

• Studies on language contact phenomena and particularities

• on lexical and partly morpho-syntactical level (e.g. Rizzo-Bauer 1962, Riedmann 1972, Pernstich 1984, Forer/Moser 1988, Lanthaler 1995, Ammon et al. 2004, Abfalterer 2007)

• hardly on syntagmatic (e.g. collocations, idioms) or textual level (e.g. Riehl 1997) or on translated texts (e.g. Putzer 1984)

• Interpretation of language contact phenomena

• shift: from research based on criticism of contact phenomena as impairment of language (e.g. Riedmann 1972) to description of “special vocabularies” (e.g. Ammon et al. 2004, Abfalterer 2007)

• on purely lexical level: fewer particularities than assumed (see e.g. Ammon 2001)

• Methods

• manual examination and excerption of references (e.g. Riedmann 1972, Riehl 1997); consultation of informants, relevant literature and dictionaries (e.g. Abfalterer 2007)

• Internet as resource for additional evidence (e.g. Abfalterer 2007, Bickel 2000)

• now: corpus linguistics (Korpus Südtirol, ‘C4’ initiative)

Requirements

• Desiderata for corpus lexicography

• content (confirmation and enrichment of existing data, addition of new data) and data modelling (e.g. special notes, frequency labels)

• methods for data acquisition (improvement and refinement of existing tools as well as development of new specific tools)

• data presentation (e.g. online dictionaries with direct links to corpus data)

• Research requirements on South Tyrolean German

• large-scale investigations on a lexical, syntagmatic and textual level

• intralinguistic comparison with other German varieties

• use of state-of-the-art corpus linguistic methods and technologies

Methods


2. Tagger ‘unknowns’

filtering of the ‘unknowns’ in the Dolomiten corpus yielding new special vocabulary ‘candidates’
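The filtering step can be sketched roughly as follows. This is only an illustration under assumptions: the function name `candidate_vocabulary`, the case-insensitive matching, the `min_freq` cut-off and the toy tokens are all hypothetical and do not come from the project's actual tool chain.

```python
from collections import Counter

def candidate_vocabulary(unknown_tokens, known_lists, min_freq=2):
    """Return (word, frequency) pairs for tagger 'unknowns' that are not
    already covered by existing special-vocabulary lists.

    Assumptions (not from the poster): matching is case-insensitive and
    words below `min_freq` are dropped before the manual check.
    """
    # Merge all known lists (e.g. 'South Tyrolisms', legal terms, proper names).
    known = {word.lower() for words in known_lists for word in words}
    # Count only the unknowns that no list accounts for.
    counts = Counter(tok for tok in unknown_tokens if tok.lower() not in known)
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, freq) for word, freq in ranked if freq >= min_freq]

# Invented placeholder tokens, not corpus data.
unknowns = ["alpha", "beta", "alpha", "gamma", "Beta", "delta", "delta", "delta"]
known_lists = [{"beta"}, {"gamma"}]
candidates = candidate_vocabulary(unknowns, known_lists)
print(candidates)
```

The resulting list is what would then be checked manually for possible new dictionary entries.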


3. Continuous and discontinuous cooccurrences: Adj+N, Prep+N; Subj+Pred, Pred+Obj

extraction and comparison of cooccurrences in the two corpora

...

Outlook

• enhance corpora to be compared and their annotation

• develop more tools for the semi-automatic comparison of varieties on the basis of corpora

• systematize exemplary findings on the South Tyrolean variety

• investigate ‘South Tyrolisms’ and their collocators, phraseologisms

• compare synthetic and analytic constructions

• analyse ‘cause’ and ‘origin’ for certain phenomena (e.g. language contact, language variation over time)

removing special vocabulary collected for the South Tyrolean variety in other projects (e.g. legal terms), we manually checked the remaining list for possible new variant dictionary entries. As an innovative approach in variety corpus lexicography, this automatically filters a large amount of data so that only the relevant items need to be investigated in detail. In addition, we semi-automatically extracted lexical cooccurrences from our two newspaper corpora and compared their frequencies, on the assumption that cooccurrences with high frequency in the South Tyrolean corpus and very low frequency in the corpus from Germany merit closer investigation. With these three methods we were able not only to refine dictionary entries for South Tyrolean German but also to add new ones. The findings on variants can be reused for further corpus annotation, which in turn yields better resources for computational variant lexicography of the kind described; the approach is also to be extended to more complex levels of linguistic description.

• Ammon, U. et al. (2004): Variantenwörterbuch des Deutschen. Die Standardsprache in Österreich, der Schweiz und Deutschland sowie in Liechtenstein, Luxemburg, Ostbelgien und Südtirol.

• Abfalterer, H. (2007): Der Südtiroler Sonderwortschatz aus plurizentrischer Sicht. Lexikalisch-semantische Besonderheiten im Standarddeutsch Südtirols.

1. ‘South Tyrolisms’

counting ‘South Tyrolisms’ (Abfalterer 2007) in the two corpora and extracting words with ‘suspicious’ frequencies
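A minimal sketch of such a frequency check is given here. The corpus sizes are the ones stated under Resources (66 and 40 million tokens); everything else is an assumption made for illustration: the poster does not define ‘suspicious’, so the two flags and their thresholds are hypothetical, and the lemmas and counts are invented placeholders.

```python
# Corpus sizes in tokens, as stated under Resources.
DOLO_TOKENS = 66_000_000
FR_TOKENS = 40_000_000

def per_million(count, corpus_tokens):
    """Relative frequency in occurrences per million tokens."""
    return count * 1_000_000 / corpus_tokens

def suspicious_entries(word_list, dolo_counts, fr_counts, min_dolo_pmw=0.1):
    """Flag listed 'South Tyrolisms' whose corpus frequencies look odd.

    Hypothetical criteria (not taken from the poster):
      'rare_in_dolo' : hardly attested in the South Tyrolean corpus
      'common_in_fr' : at least as frequent in the corpus from Germany
    """
    flagged = {}
    for word in word_list:
        dolo = per_million(dolo_counts.get(word, 0), DOLO_TOKENS)
        fr = per_million(fr_counts.get(word, 0), FR_TOKENS)
        if dolo < min_dolo_pmw:
            flagged[word] = "rare_in_dolo"
        elif fr >= dolo:
            flagged[word] = "common_in_fr"
    return flagged

# Invented placeholder lemmas and counts, for illustration only.
word_list = ["lemma_a", "lemma_b", "lemma_c"]
dolo_counts = {"lemma_a": 660, "lemma_b": 3, "lemma_c": 330}
fr_counts = {"lemma_a": 4, "lemma_b": 0, "lemma_c": 400}
flags = suspicious_entries(word_list, dolo_counts, fr_counts)
print(flags)
```

Normalising to occurrences per million tokens matters here because the two corpora differ in size, so raw counts are not directly comparable.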

...

Resources

• Korpus Südtirol (FUB, Eurac, UIBK), subcorpus ‘Dolomiten’ (IT), abbreviated Dolo: 66 mio tokens

• Corpus ‘Frankfurter Rundschau’ (D), abbreviated FR: 40 mio tokens

• Dolo and FR: tokenised, PoS-tagged, lemmatised, chunked; queried with CQP

• data from project ‘Datenbank zum Südtiroler Deutsch’ (IBK)

• lists of special vocabulary (‘South Tyrolisms’, legal terms, proper names etc.)

• weißer Stimmzettel: Dolo 81 vs. FR 2

• allgemeine Klasse: Dolo 522 vs. FR 0

• innerhalb <Mai>: Dolo 420 vs. FR 0

• ...
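The comparison behind these examples can be sketched as below. The three pairs and their counts are the ones listed above, read here as raw frequencies, and the corpus sizes are those given under Resources. The smoothed frequency ratio and the `smoothing` value are assumptions chosen for illustration, not the measure used in the project.

```python
# Corpus sizes in tokens, as stated under Resources.
DOLO_TOKENS = 66_000_000
FR_TOKENS = 40_000_000

# (Dolo count, FR count) for the examples listed above.
COUNTS = {
    "weißer Stimmzettel": (81, 2),
    "allgemeine Klasse": (522, 0),
    "innerhalb <Mai>": (420, 0),
}

def per_million(count, corpus_tokens):
    """Relative frequency in occurrences per million tokens."""
    return count * 1_000_000 / corpus_tokens

def rank_cooccurrences(counts, smoothing=0.5):
    """Rank cooccurrences by how much more frequent they are in Dolo than in FR.

    `smoothing` is added to both counts so that pairs absent from FR do not
    cause a division by zero; this choice is an assumption of the sketch.
    """
    rows = []
    for pair, (dolo, fr) in counts.items():
        ratio = (per_million(dolo + smoothing, DOLO_TOKENS)
                 / per_million(fr + smoothing, FR_TOKENS))
        rows.append((pair, ratio))
    # Highest ratio first: most typical of the South Tyrolean corpus.
    return sorted(rows, key=lambda row: row[1], reverse=True)

ranking = rank_cooccurrences(COUNTS)
for pair, ratio in ranking:
    print(f"{pair}: {ratio:.1f}")
```

Pairs at the top of such a ranking are the ones with high frequency in the South Tyrolean corpus and very low frequency in the corpus from Germany, i.e. candidates for closer manual investigation.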
