RESEARCH PROJECTS 2013
Documentation and analysis of an endangered language: aspects of the grammar of Griko
Dr. Marika Lekakou, Assistant Professor of Linguistics, University of Ioannina
Dr. Valeria Baldissera, University Ca’ Foscari of Venice
Antonis Anastasopoulos, Electrical and Computer Engineering, National Technical University of Athens
Dr. Sjef Barbiers, Professor of Dutch Variation Linguistics, Utrecht University & Senior researcher, Meertens Instituut (KNAW)
December 2013
Project Report
Table of contents
Project summary in English
Project summary in Greek
1. Introduction – Objectives of the project
2. Team members
3. Methodology
3.1 Data collection
3.2 Data transcription and enrichment
3.3 Data storage and retrieval
4. Results
5. Extensions and avenues for future research
Acknowledgments
Selected References
APPENDIX A1 Transcription Protocol
APPENDIX A2 Part-of-Speech Tagging Protocol
APPENDIX A3 Database and Website Manual
PROJECT SUMMARY
Documentation and analysis of an endangered language: aspects of the grammar of Griko
The project aimed at collecting, digitizing and analyzing new data from Griko, the Greek dialect spoken in Puglia, Southern Italy. Griko is rare among Greek dialects in retaining the infinitive in particular syntactic contexts. In all Greek varieties spoken in Greece, infinitives have been replaced by finite embedded clauses. Taking the infinitive as our point of departure, we examined aspects of the grammar of Griko, with an emphasis on verbal morphosyntax (voice, tense, mood, modality, aspect) and the structure of embedded clauses. During our fieldtrip to the Greek-speaking villages of Puglia, we collected new data, which were digitally recorded, transcribed and morphosyntactically tagged. The enriched data, along with the corresponding sound files, have been made available in an online database, where users can perform searches according to parameters such as location, tag, lemma, gloss, and syntactic variable. Our research records aspects of a language threatened with extinction, with an emphasis on the often overlooked syntactic level of grammar. The online availability of enriched data gives the international linguistic community optimal access to our material. The theoretical analysis of such data is relevant for theoretical syntax, theories of language change through language contact, as well as the diachronic development of Greek.
SUMMARY OF THE STUDY
Documentation and analysis of a language facing extinction: the grammatical system of Griko
The aim of our study was to collect, record and analyze data from Griko, the Greek dialect of Puglia in Southern Italy. In certain contexts Griko uses the infinitive, a grammatical form that has disappeared from the Greek dialects spoken within Greece and has been replaced by finite embedded clauses. Taking this property of Griko as our starting point, we investigated aspects of its grammar, with an emphasis on the morphosyntax of the verb (voice, tense, mood, modality, aspect) and the structure of embedded clauses. Through fieldwork in the Greek-speaking villages of Puglia we collected original linguistic material, which was digitally recorded, transcribed and morphosyntactically annotated. The collected material is available in an electronic database, in which users can run searches according to different parameters, such as location, lemma, morphosyntactic category and syntactic variable. Our research contributes to the documentation of a language facing extinction, and specifically to the description of the syntactic component of grammar, which is often overlooked in dialect research. In addition, the processing and electronic availability of the data offers the international linguistic community broader possibilities for using them. The theoretical analysis of our data is of interest to, and bears on, contemporary syntactic theory, the theory of language change through language contact, and the diachronic development of the Greek language.
1. Introduction and objectives
It would not be unfair to state that scientific research on dialect systems is currently a popular and productive enterprise. This is evidenced by the large number of scientific projects, recently completed or currently underway, whose explicit focus is on dialects. Dialects have become the point at which language specialists of different persuasions (theoretical linguists, traditional dialectologists, typologists, sociolinguists, as well as historical linguists) join forces towards shared goals, such as:
- to record linguistic variation before it disappears altogether;
- to enhance the reliability of the data collected;
- to broaden the empirical basis of research;
- to subject to scrutiny theories based more often than not on standard languages;
- to illuminate the diachrony of particular languages (since dialects very often preserve phenomena that are only attested in previous stages of the (standard) language);
- to shed light on the workings of contact-induced change (since dialects are practically always spoken in bi- or multi-lingual communities).
The relevance of dialects for linguistic theory, and especially for syntactic theory, has been recognized at least since Kayne (1996) explicitly compared the investigation of dialect systems to an experimental setting: the study of closely related varieties, which differ from each other in minimal ways, is the closest theoretical linguistics can get to a controlled experiment. This has come to be known as the micro-comparative approach to linguistic variation. (For an illustration of the positive outcomes of the interplay between theoretical linguistics and dialect syntactic studies, see Koeneman & Lekakou 2006).
The adoption of the micro-comparative approach has led to a wealth of research on dialect syntax in recent years. At the European level, a number of projects, both large- and small-scale, especially dedicated to the study of the syntactic properties of dialects, have been carried out (for an indication, see: http://www.dialectsyntax.org/wiki). Within Greece, however, dialect syntactic variation has only been recorded via sporadic and individual-based research, and digital access to the data has not been ensured. In traditional dialectological studies, syntactic description is either lacking or not theoretically informed (Tzitzilis 2000). It is from this perspective that we approach Griko, the Greek dialect spoken in Puglia (province of Lecce), Southern Italy, in the area known as Grecìa Salentina. Officially, Grecìa Salentina consists of 12 villages (Calimera, Carpignano Salentino, Castrignano dei Greci, Corigliano d'Otranto, Cutrofiano, Martano, Martignano, Melpignano, Sogliano Cavour, Soleto, Sternatia and Zollino). In reality, only in a subset of these villages is Griko spoken actively today. Moreover, the speakers are mainly advanced in age. There is currently no reliable estimate available of the number of active Griko speakers. According to the UNESCO Atlas of the World’s
Languages in Danger (Moseley 2010), Griko is facing severe danger of extinction (http://www.unesco.org/culture/languages-atlas/en/atlasmap.html). Its status as a severely endangered language is not the only reason to study Griko. For centuries, Griko has been spoken alongside the local Romance dialect, Salentino, as well as, more recently, Standard Italian (or the regional version thereof), in a «complex linguistic situation of diglossia with expanding bilingualism» (Ledgeway 2013:2). Griko is thus important not only in illuminating the diachrony of Greek, but as a potential window into the workings of contact-induced change as well. Possibly the most exotic syntactic property Griko exhibits is that, almost uniquely among Modern Greek varieties, it retains the infinitive (cf. Joseph 1983; Mackridge 1987 on Romeyka of Pontus; and Katsoyiannou 1995 on Grecanico). All Modern Greek varieties spoken within Greece have lost the infinitive and replaced it with embedded finite clauses. The following examples from Baldissera (2013) illustrate the current distribution of infinitives in Griko, namely as complements to the modal verb sodzo ‘can’.
(1) a. Poa sodzi piai ta pipogna?
       when can-2SG take.INF the melons
       ‘When can you take the melons?’
    b. Ta sodzo piai simmeri.
       them can-1SG take.INF today
       ‘I can take them today.’
The data in (1) in fact show not only retention of the infinitive in Griko, but also the commonly co-occurring phenomenon of clitic climbing (Terzi 1996), i.e. the placement of the object clitic belonging to the infinitival clause next to the matrix verb. This is not possible in Standard Modern Greek (SMG), for instance, where the sentence in (1b) would involve a na-clause as the complement to the modal boro and no clitic climbing. We thus see that the existence of the infinitive in a language correlates with other morphosyntactic properties, an example of which is clitic climbing.
In this project, we undertook empirical and theoretical research on the morphosyntactic properties of Griko, with the aim of collecting syntactic data (i.e. sentences in Griko), analyzing them theoretically, and making them available electronically for the purposes of future (micro-comparative) research. We thus had two major objectives, one involving data collection and analysis, and one concerning data enrichment and storage.
Regarding the first objective, by taking the infinitive as our point of departure, the project focused on a central aspect of the grammar of Griko, namely the
morphosyntax of the verb in main and embedded clauses. The examination served the following goals:
1. to record the distribution of infinitives, na-clauses (subjunctives), and other types of embedded clauses;
2. to provide a description of the dimensions of voice, tense, aspect and modality, focusing in particular on the following phenomena:
   a. the structure of subjunctive and imperative clauses
   b. the three-way voice distinction (active, passive, reflexive)
   c. the split-auxiliary selection system (based on person) in compound tenses
   d. the encoding of futurity (periphrastic and not)
   e. the encoding of aspectual distinctions (periphrastic and not)
   f. properties of modal verbs (e.g. non-volitional ‘want’)
3. to provide an analysis of the syntactic status of the subjunctive marker na and of the dependent verbal form it embeds (see section 4).
The second objective of our project was to make the empirical results of our research available to the wider linguistic community, by digitizing the data and storing them in a searchable online database. In other words, our second objective was to contribute to dialect syntactic research in Greece in terms of infrastructure, by introducing a way of annotating and storing empirical data that is widely used in similar research endeavours outside of Greece. To this end, our team included international partners, whose input ensured that future integration of the database within the larger European family, should that be desired, would be possible. We pursued this long-term goal by aligning our methodology (in terms of data collection, enrichment, storage and retrieval) with the one used in large-scale dialect syntactic projects in Europe. This enhances the comparability of our data with the dialect syntactic data already available in corpora unified via the Edisyn search engine (http://www.dialectsyntax.org/wiki/Edisyn_search_engine).
2. Team members
The co-ordinator of the project was Dr. Marika Lekakou, Assistant Professor of Linguistics (University of Ioannina). The team included the following members: Dr. Valeria Baldissera (University Ca’ Foscari of Venice), Antonis Anastasopoulos (Electrical and Computer Engineering, National Technical University of Athens) and Prof. Dr. Sjef Barbiers, Professor of Dutch Variation Linguistics (Utrecht University) & Senior researcher (Meertens Instituut, KNAW).
3. Methodology
3.1 Data collection
We collected data via oral interviews conducted in May and August 2013. In May, members of our team visited the four villages in the Greek-speaking area that all of our contacts within and outside of Greece (Professor Ralli at the University of Patras,
Professor Katsoyiannou at the University of Cyprus and Professor Bernardini at the University of Lecce) pointed out to us as extant Griko enclaves, namely Calimera, Corigliano d’Otranto, Martano and Sternatia. In August, Valeria Baldissera also conducted a follow-up interview in Corigliano d’Otranto with the informant consulted in May. In Calimera and Sternatia, the informants belonged to a superset of those that Valeria Baldissera had interviewed for the empirical research of her PhD thesis (Baldissera 2013). The informants already acquainted with Dr. Baldissera (our main contacts in Grecìa Salentina) put us in touch with more members of the local communities. As a result, the interviews conducted in these locations (Calimera and Sternatia) involved more speakers than those conducted in Martano and Corigliano d’Otranto, where we met our informants for the first time upon arrival. There was thus always at least one informant per location and in some cases more than one. Not all speakers participated in the interviews in equal measure. We interviewed ten informants in total, 4 female and 6 male. With the exception of the informant in Corigliano, who is younger, all of our informants are aged over 60. Our elicitation method involved a translation task (from Standard Italian) and, as follow-up questions, grammaticality judgments of sentences in Griko. As Cornips & Poletto (2004) discuss, these methodological choices have been successfully employed in other dialect syntactic projects, such as SAND (Syntactic Atlas of the Dutch Dialects) and ASIS (Syntactic Atlas of Italian Dialects). We were able to construct Griko versions of all our test sentences in advance of our fieldtrip, by sending an electronic copy of our questionnaire to our contact in Sternatia. This enabled us to have a first idea of what to expect in the oral interviews, and to calibrate our follow-up questions accordingly. In total, the questionnaire we administered contained 78 test sentences.
In October and November, Valeria Baldissera conducted telephone interviews with one of the speakers to make final confirmations and to ask follow-up questions needed for our theoretical analysis.
Because the speakers interviewed in 2013 are a superset of the speakers interviewed in 2011 by Valeria Baldissera, and since the methodology of data collection was homogeneous, we decided to also include parts of the data published in Baldissera (2013) in the online corpus. Both sets of data have been transcribed and annotated in the same way (see immediately below), the difference being that the data from Baldissera (2013) systematically lack a corresponding sound file. In the corpus they bear a diacritic, so that users will be able to distinguish them and cite them accordingly.
3.2 Data transcription and enrichment
The data recorded during the fieldtrips of May and August were transcribed and enriched using the free program PRAAT (http://www.fon.hum.uva.nl/praat/). PRAAT is a program commonly used for the transcription of audio files for the purposes of phonetic research. We used PRAAT for the transcription of our material, which
involved entire sentences. In addition, Part-of-Speech (PoS) tagging, glossing (in Italian) and lemmatization (in Griko) were also provided, within separate PRAAT tiers (see «Transcription Protocol» and «Part of Speech Tagging Protocol» in the Appendix). The assignment of PoS tag, lemma and gloss was done manually. This kind of information makes the data much more accessible to the database users and enables advanced search possibilities. We have also assigned to each test sentence one or more syntactic keywords, so that searching by syntactic variable will also be an option. This kind of search is most interesting for those with little idea of the grammatical properties of Griko more closely investigated in our project.
Aspectual periphrasis
Aspectual verb
By-phrase
Causative verb
Compound tense
Conditional clause
Concessive clause
Clitic
Clitic climbing
Clitic doubling
Declarative complement
Dative argument
Factive complement
Focus
Future
Habituality
Imperative
Infinitival complement
Intensional verb
Modal periphrasis
Modal verb
Negation
Non-volitional want
Passive verb
Perception verb
Purpose clause
Raising verb
Reason clause
Reflexive verb
Subjunctive complement
Temporal clause
Wh-complement
Wh-question
Table 1: List of syntactic keywords instantiated in the Griko corpus
The transcription protocol that we developed for Griko was based on existing practices for writing the language within the community; unlike other Modern Greek dialects, Griko has some tradition of written texts. We thus decided to forego a phonetic or phonological transcription of our data, as this would seem foreign to members of the Griko community, who we hope will also be interested in the results of our fieldtrip. The conventions used for the orthographic transcription are explicated in the Transcription Protocol (Appendix A1).
For the morphosyntactic annotation, we developed a PoS tagging protocol especially for Griko (see the Part of Speech Tagging Protocol, Appendix A2). We relied on the guidelines of EAGLES (Expert Advisory Group on Language Engineering Standards) (http://www.ilc.cnr.it/EAGLES96/annotate/annotate.html), and on tools developed for the purposes of the Edisyn search engine (http://www.dialectsyntax.org/wiki/Edisyn_search_engine). Future incorporation of our data within the Edisyn family of dialect syntactic corpora relies on database interoperability. In terms of PoS tagging, a mapping will need to be provided between the tagset developed for Griko and the Edisyn tagset. This will be undertaken in the future.
3.3 Data storage and retrieval
The transcribed data have been stored in a MySQL (relational) database, hosted on the same server as the project website (http://griko.project.uoi.gr/). The audio files, wherever available, are also stored on the server, in WAV format. The transcriptions, tags, lemmas and glosses are stored in MySQL tables.
The website provides an interface for querying the database with various parameters. It also allows users to choose which parts of the results are shown. The results are automatically exported to HTML format. In addition, the audio files are accessed through a simple interface, which automatically selects the player that each browser supports, in order to avoid compatibility issues.
Since the project focuses on the syntactic aspect of the Griko language, the database is constructed accordingly. The PoS tags have different features depending on the category, so they are stored in different SQL tables, to optimize performance. Integrity is ensured using foreign keys.
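The design described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL/InnoDB, and all table and column names (sentence, token, noun_tag, gram_case, noun_type) are hypothetical rather than taken from the project database. The single data row reuses the tag example for Maria given in the tagging protocol (Appendix A2).

```python
import sqlite3

# Hypothetical schema: one table per PoS category holds that category's
# features, and foreign keys tie every tag row to an existing token.
SCHEMA = """
CREATE TABLE sentence (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE token (
    id INTEGER PRIMARY KEY,
    sentence_id INTEGER NOT NULL REFERENCES sentence(id),
    position INTEGER NOT NULL,
    form TEXT NOT NULL,
    category TEXT NOT NULL
);
CREATE TABLE noun_tag (
    token_id INTEGER PRIMARY KEY REFERENCES token(id),
    gender TEXT NOT NULL,
    number TEXT NOT NULL,
    gram_case TEXT NOT NULL,
    noun_type TEXT NOT NULL DEFAULT 'Common'
);
"""


def build_demo_db():
    """Create an in-memory database holding one toy noun token."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sentence (id, text) VALUES (1, 'Maria')")
    conn.execute(
        "INSERT INTO token (id, sentence_id, position, form, category) "
        "VALUES (1, 1, 1, 'Maria', 'N')"
    )
    conn.execute("INSERT INTO noun_tag VALUES (1, 'Fem', 'S', 'Nom', 'Prop')")
    conn.commit()
    return conn


def orphan_tag_rejected(conn):
    """Return True if a tag row pointing at a missing token is refused."""
    try:
        conn.execute("INSERT INTO noun_tag VALUES (999, 'Fem', 'S', 'Nom', 'Prop')")
    except sqlite3.IntegrityError:
        return True
    return False


def nouns(conn):
    """Join tokens with their noun features."""
    return conn.execute(
        "SELECT t.form, n.gender, n.number, n.gram_case, n.noun_type "
        "FROM token AS t JOIN noun_tag AS n ON n.token_id = t.id "
        "WHERE t.category = 'N'"
    ).fetchall()
```

Keeping the category-specific features in their own tables means that a noun row never carries empty verb columns, while the foreign keys prevent tag rows from surviving without the token they describe.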
The transcribed data (.TextGrid files, as output by Praat) were parsed, checked and stored using Python and the InnoDB storage engine for MySQL, which supports foreign key constraints.
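The parsing step can be illustrated with a minimal sketch. It is not the project's actual script: it assumes Praat's long text format, handles only single-line interval labels, and the tier names "text" and "gloss" are assumptions. The sample reuses example (1b); its English gloss stands in for the Italian gloss used in the corpus, and the sentence number 1 is hypothetical.

```python
import re

# A tiny TextGrid in Praat's long text format (tier names are assumptions).
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 2
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "text"
        xmin = 0
        xmax = 2
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 2
            text = "1 Ta sodzo piai simmeri."
    item [2]:
        class = "IntervalTier"
        name = "gloss"
        xmin = 0
        xmax = 2
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 2
            text = "them can-1SG take.INF today"
'''


def parse_textgrid(content):
    """Map each tier name to the list of its interval labels, in order."""
    tiers = {}
    current = None
    for raw_line in content.splitlines():
        line = raw_line.strip()
        name = re.match(r'name = "(.*)"$', line)
        if name:
            current = name.group(1)
            tiers[current] = []
            continue
        label = re.match(r'text = "(.*)"$', line)
        if label and current is not None:
            # Praat writes a literal double quote inside a label as "".
            tiers[current].append(label.group(1).replace('""', '"'))
    return tiers


tiers = parse_textgrid(SAMPLE)
print(tiers["text"])
print(tiers["gloss"])
```

When reading real files, the encoding should be checked first, since Praat can write TextGrid files in UTF-8 or UTF-16 depending on its settings.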
For details of this aspect of the project, see Appendix A3.
4. Results
In this project, we pursued two goals: to collect and analyze data that pertain to the level of syntax, and to make the data widely available. Regarding the first objective, we have collected new, theoretically informed data, which will guide research on
topics related to Griko verbal morphosyntax and clause structure, as well as raise new empirical questions. Regarding the second objective, the data collected have been transcribed, annotated and stored in a searchable online database. The project not only preserves significant aspects of our cultural heritage that are in danger of being lost forever; it also brings dialect research in Greece in line with dialect research carried out in the majority of European countries, where (a) syntactic variation is intensely researched and (b) available technological advances in data storage and retrieval are exploited for the benefit of the scientific community.
We have already presented some of our empirical findings along with our theoretical analysis in the following two workshops:
- Workshop on Language Contact in the Light of Modern Greek Morphological Variation, 11th International Conference of Greek Linguistics, University of the Aegean, Rhodes, 26-29 September 2013.
- Workshop on Balkan – Romance Contact, University Ca’ Foscari of Venice, 26-27 November 2013.
In these oral presentations, which will result in two peer-reviewed publications, we have provided syntactic arguments for the claim that, contrary to Standard Modern Greek (Holton et al. 1997), Griko encodes subjunctive mood in verbal morphology. We have sought to detect contact with Romance as a possible cause of this microvariation. The results of our research are thus directly relevant for issues such as the effects of contact between Italo-Greek and Italo-Romance, the diachronic development and origins of Griko (Rohlfs 1950; Profili 1983; Manolessou 2005), as well as the diachronic development of Greek more generally. In future work, we will turn to the synchronic comparison of Griko and Romeyka of Pontus, which also retains the infinitive (albeit in slightly different syntactic contexts).
5. Extensions and avenues for future research
This is the first Greek dialect study to focus exclusively on a cluster of (morpho)syntactic properties and its repercussions for the overall linguistic system, and also to provide access to a corpus of transcribed and morphosyntactically annotated sentences. A number of actions can be undertaken in the future, in order to maximize the long-term effects of this project.
Regarding the main deliverable of the project, a number of minor additions will be made in the future. We aim to provide English glosses, so as to make the data even more easily accessible to linguists who don’t speak Italian (or Griko). We are also currently compiling an updated bibliography on Griko, which will be added to the website shortly. Finally, in the interest of international collaboration and visibility, and given the standardized methodology employed in this project, we intend to explore the possibility of linking our database to the Edisyn search engine.
We also hope it will be possible to expand the corpus by importing additional data from Griko, to be collected in the near future through new rounds of data collection, by us or by others using similar methodology. Another extension of the corpus would involve incorporation of data from dialects other than Griko, collected and enriched with the use of comparable methodology. In this way, the infrastructure work undertaken for the purposes of this project will have served the purpose of a pilot study, making it easier in the future to undertake theoretically informed and technologically up-to-date dialect syntactic research.
Acknowledgments
We are extremely grateful to the John S. Latsis Foundation and to the members of the Griko communities who took part in our research; without the financial support of the former and the enthusiastic participation of the latter this research would not have been possible. For their help with informants and the data collection process, we are extremely thankful to Isabella Bernardini, Carmine Greco, Luigi Tommasi and Giuseppe De Pascalis; for her continuous help with the data, we thank Adriana Spagnolo. For their help with PRAAT, we are grateful to Cinzia Avesani and especially to Evia Kainada. Finally, for their support and/or advice in various stages of the project we thank Marianna Katsoyiannou, Jan Pieter Kunst, Josep Quer, Ioanna Sitaridou, Angeliki Ralli, and Arhonto Terzi.
Selected References
Baldissera, V. 2013. Il dialetto grico del Salento: elementi balcanici e contatto linguistico. [The Griko dialect of Salento: Balkan features and linguistic contact.] Doctoral Dissertation, University Ca’ Foscari of Venice.
Cornips, L. & C. Poletto. 2004. On standardizing syntactic elicitation techniques. Lingua 115.7: 939-957.
Holton, D., P. Mackridge & I. Philippaki-Warburton. 1997. Greek: A Comprehensive Grammar of the Modern Language. London: Routledge.
Joseph, B. 1983/2009. The synchrony and diachrony of the Balkan infinitive. A study in areal, general, and historical linguistics. Cambridge: Cambridge University Press.
Katsoyannou, M. 1995. Le Parler Greco de Galliciano (Italie): Description d’une Langue en Voie de Disparition. Doctoral Dissertation, University of Paris VII.
Kayne, R. 1996. Microparametric syntax: some introductory remarks. In J.R. Black & V. Motapanyane (eds.), Microparametric syntax and dialect variation. Amsterdam: John Benjamins. 9-18.
Koeneman, O. & M. Lekakou. 2006. The role of syntactic theory in the SAND and EDiSyn projects. Ms., Meertens Institute.
Ledgeway, A. 2013. Greek disguised as Romance? The case of Southern Italy. Ms., Cambridge University. To appear in Proceedings of 5th International Conference on Modern Greek Dialects and Linguistics.
Mackridge, P. 1987. Greek-speaking Moslems of North-East Turkey: Prolegomena to Study of the Ophitic Sub-Dialect of Pontic. Byzantine and Modern Greek Studies 11: 115-137.
Manolessou, I. 2005. The Greek dialects of southern Italy: An overview. ΚΑΜΠΟΣ 13: 103-35.
Moseley, C. (ed.). 2010. Atlas of the World’s Languages in Danger, 3rd edn. Paris: UNESCO Publishing. http://www.unesco.org/culture/en/endangeredlanguages/atlas
Profili, O. 1983. Le parler grico de Corigliano d'Otranto. Phénomènes d'interférence entre ce parler grec et les parlers romans environnants, ainsi qu'avec l'italien. Doctoral Dissertation, Université des Langues et Lettres, Grenoble.
Rohlfs, G. 1950. Historische Grammatik der unteritalienischen Gräzität. München: Bayerischen Akademie der Wissenschaften.
Terzi, A. 1996. Clitic climbing out of finite clauses and tense raising. Probus 8: 273-295.
Tzitzilis, Ch. [Τζιτζιλής, Χ.] 2000. Νεοελληνικές διάλεκτοι και νεοελληνική διαλεκτολογία. [Modern Greek dialects and Modern Greek dialectology.] Στο Χριστίδης, Α.-Φ. (επιμ.), Η ελληνική γλώσσα και οι διάλεκτοί της. 15-22. Αθήνα: ΥΠΕΠΘ & Κέντρο Ελληνικής Γλώσσας.
Appendix A1: Protocol for Transcription of Griko
1. Introduction
This document is the manual used for transcribing Griko audio files. It contains information on the conventions employed in the transcription of the sound files, as well as information on how data enrichment more generally was carried out.
Transcription was carried out manually. In addition, each word in the corpus is assigned a Part-of-Speech tag, a lemma, and a translation into Italian. This information is provided in separate tiers:
a. Text Tier: contains the transcription of the sound file.
b. Part-of-Speech (PoS) Tag Tier: contains the Part of Speech Tag of each item in the Text Tier (see “Protocol for Part-of-Speech tagging of Griko”).
c. Italian Gloss: contains a word-for-word translation of the Griko sentence in Italian.
d. Griko Lemma: contains lemmas for each word. The version of Griko lemmas we used is the one provided in the following dictionary: Greco, Carmine (2001). Lessico di Sternatia, paese della Grecia salentina: italiano-griko-neogreco, griko-italiano-neogreco, neogreco-griko. Lecce: Edizioni Del Grifo.
All four tiers were created manually using Praat (http://www.fon.hum.uva.nl/praat/). Within this program, the sound file was divided into sentences, which were separated by boundaries. Within each set of boundaries, spaces were used to indicate word boundaries in the text tier. The apostrophe is used as an alternative to the space only to mark word boundaries at which the phenomenon of raddoppiamento sintattico (phonosyntactic gemination) takes place (see Note 6 of the transcription conventions).
In order to ensure a one-to-one correspondence between items on the transcription tier and items on the other tiers, spaces are used on those tiers as well. Whenever itemization discrepancies across different tiers occurred, the following conventions were used:
1. When a phonological word contains two morphological words of different syntactic category, e.g. with prepositions fusing with the definite article (text-tag discrepancy), the tags of the two words are separated by a period (“.”).
2. If a Griko word corresponds to more than one word in Italian (text-gloss discrepancy), then an underscore (“_”) is used between the Italian words, e.g. ‘irta and sono_venuta.
Only the speech of the informants is transcribed, and not that of the interviewer. Whenever additional speakers were present, an additional tier was used to notate their speech. Intonational breaks are transcribed as “#”. Whenever transcription was not possible, this was notated on the text tier by “[…]”.
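The one-to-one correspondence between tiers lends itself to an automatic check. The following is a minimal sketch, not the project's own checking code: the function names are hypothetical, the tag "P.D" is a schematic placeholder for a fused preposition and article, and tier lines are assumed to be plain space-separated strings.

```python
def items(tier_line):
    """Split a tier line into its space-separated items."""
    return tier_line.split()


def expand_tag(tag_item):
    """Split a fused tag item into the tags of its morphological words."""
    return tag_item.split(".")


def expand_gloss(gloss_item):
    """Split a gloss item into the Italian words it stands for."""
    return gloss_item.split("_")


def misaligned_tiers(text_line, **other_tiers):
    """Return the names of tiers whose item count differs from the text tier."""
    expected = len(items(text_line))
    return [
        name
        for name, line in other_tiers.items()
        if len(items(line)) != expected
    ]


# The underscore keeps the two Italian words together as a single item.
print(misaligned_tiers("‘irta", gloss="sono_venuta"))
# Without the underscore the gloss tier has one item too many.
print(misaligned_tiers("‘irta", gloss="sono venuta"))
```

Because the period and the underscore keep fused material inside a single item, a plain count of space-separated items per tier is enough to detect itemization errors.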
Each sentence in the transcription tier is preceded by a number, followed by a space. The number corresponds to the number of the test sentence in the questionnaire. The end of a sentence is notated by a period or a question mark, as appropriate depending on sentence type. The only other punctuation mark that appears on the text tier is the apostrophe, which is used to mark phonosyntactic gemination. No other punctuation marks are used on the transcription tier. On all other tiers, only the underscore and the period are used, in the way mentioned above.
Synopsis of symbols used
[Text Tier]
.      end of test sentence
?      end of test sentence (question)
’      phonosyntactic gemination
#      intonational break
[…]    material in the sound file is not transcribed
[Gloss Tier]
_      separates two words in Italian corresponding to a single one in Griko
[PoS Tag Tier]
.      separates two PoS tags realized as a single word in Griko
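As an illustration of how the text-tier symbols can be interpreted mechanically, the sketch below splits one text-tier entry into its parts. It is a hedged example rather than the project's parser: the function name and the returned field names are hypothetical, and it assumes that “#” and “[…]” are separated from neighbouring words by spaces.

```python
import re

BREAK = "#"
UNTRANSCRIBED = ("[…]", "[...]")


def parse_text_tier(entry):
    """Split a text-tier entry into number, sentence type, words and markers."""
    match = re.match(r"(\d+) (.*)([.?])$", entry.strip())
    if match is None:
        raise ValueError(f"not a well-formed text-tier entry: {entry!r}")
    number, body, end_mark = match.groups()
    tokens = body.split()
    words = [t for t in tokens if t != BREAK and t not in UNTRANSCRIBED]
    return {
        "sentence_number": int(number),
        "is_question": end_mark == "?",
        "words": words,
        "intonational_breaks": tokens.count(BREAK),
        "has_untranscribed": any(t in UNTRANSCRIBED for t in tokens),
        # An apostrophe between two letters marks phonosyntactic gemination.
        "gemination_sites": [w for w in words if re.search(r"\w[’']\w", w)],
    }


# Sentence (1a) from the report; the sentence number 1 is hypothetical.
parsed = parse_text_tier("1 Poa sodzi piai ta pipogna?")
print(parsed["sentence_number"], parsed["is_question"], parsed["words"])
```

An entry that lacks the leading number or the final period or question mark raises an error, which makes violations of the protocol visible early.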
The transcription is orthographic. Since there is a (limited) tradition of written Griko, we decided to forego a phonetic or phonological transcription, which would be foreign to native speakers of Griko. We relied on a version of the orthographic conventions adopted in texts written in Griko, such as those employed in the magazine Spitta (digitized issues of which can be found by following the relevant link on the project website). The orthographic conventions used for Griko closely resemble those adopted for Standard Italian. Notes explicating the conventions used are provided after the table of transcription conventions.
TRANSCRIPTION CONVENTIONS
Transcription in GRIKO   Correspondence to simplified IPA   NOTES
a; à     [a]
b        [b]
c        [k] before [a], [o], [u]           e.g. Carlo [’Karlo]
         [tʃ] before [i] and [e]            e.g. ceràsi [tʃe’rasi]; cilìa [tʃi’lia]
                                            <cia> = [tʃa]; e.g. cialatèdda [tʃalat’ed:a]
                                            <cio> = [tʃo]; e.g. ciofàli [tʃo’fali]
                                            <ciu> = [tʃu]; e.g. ciumpì [tʃum’pi]
                                            <cìa> = [tʃ’ia]
                                            <cìo> = [tʃ’iu]
ch       [x]                                e.g. rùcho [r’uxo]
d        [d]
e; è     [e]
f        [f]
g        [g] before [a], [o], [u]           e.g. garrofèddo [gar:o’fed:o]
         [dʒ] before [i] and [e]            <ge> = [dʒe]
                                            <gi> = [dʒi]
                                            <gia> = [dʒa]; e.g. sangìa [san’dʒia]
                                            <gio> = [dʒo]
                                            <giu> = [dʒu]
gh       [g]                                e.g. ègghene [‘eg:ene]
i; ì     [i]
j        [γ]
k        [k]
l        [l]
m        [m]
n        [n]
o; ò     [o]
p        [p]
r        [r]
s        [s]
t        [t]
u; ù     [u]
v        [v]
z        [dz]                               e.g. ziò [z’io]
ts       [ts]                               e.g. tsìlo [‘tsilo]
sc       [ʃ]                                e.g. scìmmata [‘ʃim:ata]
gn       [ɲ]                                e.g. signurèdda [siɲu’red:a]
NOTES
1. The symbol <g> represents the plosive /g/, unless it precedes a front vowel (<i> or <e>). In this case it represents the affricate /dʒ/. When the plosive pronunciation occurs before a front vowel, <gh> is used, so that <ghe> represents [ge] and <ghi> represents [gi].
2. <c> represents the affricate /tʃ/ before the front vowels <i> and <e>. In some words of Italian origin, before non-front vowels (<a>, <o>, <u>) <c> spells the unvoiced plosive /k/, e.g. Carlo [‘Karlo].
3. Thus, the letter <i> may function as a mere indicator that the preceding <c> or <g> is an affricate, e.g. cia (/tʃa/), ciu (/tʃu/), gia (/dʒa/), giu (/dʒu/).
4. The symbols <ch> always represent [x]. E.g. cheretìmmata [xere’tim:ata]; chàri [’xari]. Plosive [k] is transcribed as <k>.
5. For every word that has two or more syllables an accent diacritic is used, to indicate the location of stress, e.g. cheretìmmata.
6. Raddoppiamento fonosintattico (or phonosyntactic doubling, PD): the phenomenon of syntactic gemination, i.e. the lengthening of word-initial consonants related to the presence of a particular set of preceding elements. PD is notated by an apostrophe <’> (used whenever a segment in the preceding word is elided) and no space between the two words, e.g. si’putèka [sip:u’teka].
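The correspondences in the table and notes above can be approximated in code. The sketch below is an illustration only, not a tool used in the project: the function name is hypothetical, stress marks are omitted from the output, <sc> is assumed to spell [ʃ] in all positions, and doubled letters that spell a long affricate are not handled. The symbol γ is the one used in the table.

```python
VOWELS = {"a": "a", "à": "a", "e": "e", "è": "e", "i": "i", "ì": "i",
          "o": "o", "ò": "o", "u": "u", "ù": "u"}
FRONT = set("eèiì")
BACK = set("aàoòuù")
DIGRAPHS = {"ch": "x", "gh": "g", "ts": "ts", "sc": "ʃ", "gn": "ɲ"}
SINGLE = {"b": "b", "d": "d", "f": "f", "j": "γ", "k": "k", "l": "l",
          "m": "m", "n": "n", "p": "p", "r": "r", "s": "s", "t": "t",
          "v": "v", "z": "dz"}
SOFT = {"c": "tʃ", "g": "dʒ"}   # before <i>, <e>
HARD = {"c": "k", "g": "g"}     # elsewhere


def to_simplified_ipa(word):
    """Convert one orthographic word to simplified IPA without stress marks."""
    w = word.lower()
    phones = []
    lengthen_next = False
    i = 0
    while i < len(w):
        ch = w[i]
        if ch in "’'":
            # Note 6: the apostrophe lengthens the following consonant.
            lengthen_next = True
            i += 1
            continue
        if w[i:i + 2] in DIGRAPHS:
            phone = DIGRAPHS[w[i:i + 2]]
            i += 2
        elif ch in SOFT:
            nxt = w[i + 1:i + 2]
            if nxt in FRONT:
                phone = SOFT[ch]
                # Note 3: unstressed <i> before <a>, <o>, <u> is only a marker.
                if nxt == "i" and w[i + 2:i + 3] in BACK:
                    i += 1
            else:
                phone = HARD[ch]
            i += 1
        elif ch in VOWELS:
            phone = VOWELS[ch]
            i += 1
        elif ch in SINGLE:
            phone = SINGLE[ch]
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} in {word!r}")
        if lengthen_next:
            phone += ":"
            lengthen_next = False
        if phones and phones[-1] == phone and phone not in VOWELS.values():
            phones[-1] = phone + ":"   # doubled consonant letter = long consonant
        else:
            phones.append(phone)
    return "".join(phones)


for example in ("ceràsi", "cialatèdda", "garrofèddo", "si’putèka"):
    print(example, to_simplified_ipa(example))
```

A rule-based converter of this kind is mainly useful as a consistency check on the orthographic transcription; it does not replace a phonetic transcription of the recordings.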
Appendix A2: Protocol for Part-of-Speech Tagging of Griko

1. Introduction
This document is the manual used for part-of-speech (PoS) tagging of Griko texts. All aspects of the data enrichment process, namely transcription, tagging, lemmatization and glossing in Italian, were carried out manually, using Praat (http://www.fon.hum.uva.nl/praat/); see also “Protocol for transcription of Griko”. The categories used for PoS tagging are the following:
1. N [Noun]
2. Adj [Adjective]
3. V [Verb]
4. Adv [Adverb]
5. P [Adposition]
6. C [Complementizer]
7. Pr [Pronoun]
8. D [Determiner]
9. Prt [Particle]
10. Num [Numeral]
The attributes and values ascribed to each category are specified in separate subsections below.

2. General remarks
1. In each tag, the category of the word appears first. Specifications for other attributes are separated with a plus (“+”) sign.
2. For each category, there exist obligatory and optional attributes. A value for the obligatory attributes is always specified. For optional attributes, when no value is provided, the value is set to the default (which is given below for the relevant categories and optional attributes).
3. The size of the internal composition of each tag is constant for each category, but not identical across categories. For instance, Griko nouns minimally need a tag of 4 components (the category plus three attributes), whereas tags for finite verbs have 9 components.
4. In case a specification cannot be given with certainty, e.g. when the gender of a particular noun is unclear, the value ‘unspecified’ (“U”) is provided.
5. In case an attribute does not apply to a given category, 0 (zero) is used.
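The general remarks above imply that a tag can be taken apart mechanically: split on “+”, read the first component as the category, and treat “U” and “0” as special values. A minimal sketch under those assumptions follows; the names `Tag`, `parse_tag` and `is_placeholder` are hypothetical and are not taken from the project's scripts.

```python
from typing import NamedTuple, Tuple

# The ten category abbreviations listed in the Introduction.
CATEGORIES = {"N", "Adj", "V", "Adv", "P", "C", "Pr", "D", "Prt", "Num"}


class Tag(NamedTuple):
    category: str
    attributes: Tuple[str, ...]


def parse_tag(tag: str) -> Tag:
    """Split a single tag on '+'; the first component is the category."""
    category, *attributes = tag.split("+")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return Tag(category, tuple(attributes))


def is_placeholder(value: str) -> bool:
    """True for 'U' (unspecified) and '0' (attribute does not apply)."""
    return value in {"U", "0"}
```

Applied to the noun tag given for Maria in Section 3.1, `parse_tag("N+Fem+S+Nom+Prop")` yields the category `N` and the attributes `Fem`, `S`, `Nom`, `Prop`.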
3. Specifications, Attributes and Values per Category

3.1 Noun
Abbreviation: N
Specification: Features
Obligatory attributes: Gender, Number, Case
Values for obligatory attributes:
Gender: Masc/Fem/Neu
Number: S/Pl
Case: Nom/Gen/Acc/Voc
Optional attribute: Type. Since most nouns in our corpus are common, we do not specify the type; common is treated as the default. Thus, only proper names come with a fifth specification, namely Prop (for Proper).
Example: the tag for Maria is N+Fem+S+Nom+Prop.
The case ascribed to a noun does not always reflect morphological distinctions, but may rely on the syntactic context. For instance, nouns realizing the syntactic role of object are tagged as accusative, even if there is no distinct morphological marking for accusative case on the noun. This was deemed necessary for several reasons, one of them being the lack of syntactic annotation of the corpus.

3.2 Adjective
Abbreviation: Adj
Specification: Features, Degree, Position
Obligatory attributes: Gender, Number, Case
Values for obligatory attributes:
Gender: Masc/Fem/Neu
Number: S/Pl
Case: Nom/Gen/Acc/Voc
Optional attributes: Position, Degree, Nominalization
Values for optional attributes:
Position: Post(nominal)
Degree: Comp(arative)/Sup(erlative)
Nominalization: NM (Nominalized)
The default value for Position is Preposed (reflecting the order Adj-N). When postposed, an adjective receives the specification Post. For example, the tag for the adjective in petìa mincià (“children young”) is the following: Adj+Neu+Pl+Nom+Post.
The default value for Degree is Positive. When the adjective is of comparative or superlative degree, the values Comp and Super are used.
The default value for Nominalized is negative, so NM (Nominalized) only appears in the marked case. When nominalized, the adjective is neither preposed nor postposed, as there is by definition no overt noun with respect to which the adjective is ordered. NM could therefore be seen as another value for Position.

3.3 Verb
Abbreviation: V.
Specification: Features, Type.
Obligatory attributes for all members of category V: Finiteness, Voice, Type.
Values for obligatory attributes:
Finiteness: Fin(ite)/N(on)Fin(ite)
Voice: Act(ive)/N(on)Act(ive)
Type: M(ain)/Aux(iliary)
Subtypes of Aux:
a. Mod = modal auxiliary verb
b. PRF = perfect auxiliary, e.g. ‘have’ and ‘be’ in compound tenses (i.e. present and past perfect).
c. PASS = passive auxiliaries, e.g. ‘be’ and ‘come’.
d. ASP = aspectual auxiliaries, e.g. steo.
3A. Attributes of finite verbs (VFin)
Tense: Past/NonPast
NonPast is the present tense form, which in Griko is also used as the future tense.
Aspect: Perf(ective)/Imperf(ective)
The aspectual distinction is morphologically realized in e.g. the past tense indicative.
Mood: Ind (Indicative)/Imp (Imperative)/Subj (Subjunctive)
Subjunctive is the value attributed to Griko finite verbs that realize perfective aspect and nonpast tense.
Number: S(ingular)/P(lural)
Person: 1/2/3
For example, a finite main verb like teli (“wants”) is tagged as follows: V+fin+M+Act+Nonpast+Imperf+Ind+S+3

3B. Attributes of nonfinite verbs (VNfin)
Subtype: Inf(initive)/Part(iciple)
We characterize all non-finite verb forms that are not infinitives as participles (subsuming gerunds too). This is meant purely as a descriptive label.
Aspect: Perf(ective)/Imperf(ective)
The characterization reflects the morphological specification of the stem.
Number: S(ingular)/P(lural)
Gender: Masc(uline)/Fem(inine)/Neu(ter)
Griko passive participles inflect for gender and number. In all other cases (active participle, infinitive), the distinctions do not apply, so 0 is used for these attributes.
For example, a VNfin such as vriskonta in pao vriskonta would be tagged in the following way: V+Nfin+M+Act+Part+Imperf+0+0.

3.4 Adverb
Abbreviation: Adv.
Specification: Type, Features.
Obligatory attributes: Type.
Values for obligatory attributes:
Type: Temp(oral), Loc(ative), Interr(ogative), Asp(ectual), Epist(emic), Quant(ificational), QuantNeg (Negative Quantificational).
Subtype: Temp(oral)/Loc(ative). An adverb specified as interrogative can be further specified as temporal or locative.
Optional attribute: Degree.
Value for Degree: Comp/Super
The default degree specification is positive, unless otherwise stated.
For example, pu is tagged as Adv+Interr+Loc, pote as Adv+Interr+Temporal.

3.5 Adposition
Abbreviation: P
Specification: Feature
Attribute: P/Pfus(ed)
P is used for simple P’s, Pfus when P is fused with the definite article (D) that follows it. In the latter case, we include the information of the D head too. This is a case where a single word corresponds to two tags, separated by a “.”.
Examples: atsè is tagged as P, s(t)i is tagged as P+Pfus.D+Det+Fem+S+Acc.

3.6 Complementizer/Conjunction
Abbreviation: C
Specification: Type and Subtype.
Attributes of Type: Sub(ordinating)/Coord(inating)
Coordinating conjunctions correspond to “and”, “or”. Subordinating conjunctions introduce embedded clauses.
Attributes of subordinating (Sub) C: Decl(arative), Inter(rogative), Rel(ative), Caus(al), Temp(oral), Cond(itional), Subj(unctive), Def(ault).
Def(ault) is used whenever the value/function of the all-purpose complementizer ka is unclear.
Examples: ce: C+Coord, na: C+Sub+Subj.
3.7 Pronoun
Abbreviation: Pr
Specification: Type, Features
Attributes for Type: Pers(onal)/Dem(onstrative)/Inter(rogative)/Quant(ificational)/Poss(essive)
Attributes for Features:
Strength: W(eak)/Str(ong)
Person: 1/2/3
Gender: Masc/Fem/Neu
Number: S/P
Case: Nom/Acc/Gen/Voc
Strength and person specifications are only applicable to personal pronouns.
Example: cìni (“those”) is tagged as Pr+Dem+0+0+Pl+Masc+Nom.
Optional attributes: Position, Clitic Doubling.
The default value for Position is proclisis (weak personal pronouns precede finite verbs in Griko as in Standard Modern Greek). Encl(isis) is specified when the pronoun follows the verb.
The default value for Clitic Doubling is no occurrence of clitic doubling. When doubling occurs, dou(bling) is additionally specified.

3.8 Determiner
Abbreviation: D
Specification: Type, Features.
Values for Type: Def(inite)/Indef(inite)
Values for Features:
Gender: Masc/Fem/Neu
Number: S/Pl
Case: Nom/Gen/Acc/Voc
Example: i (definite feminine singular) is tagged as D+Det+Fem+S+Acc.

3.9 Particle
Abbreviation: Prt
Specification: Type, Subtype
Attributes: Neg/Other
Attributes for Subtype Neg: Ind(icative)/N(on)Ind(icative)/Sent(ential)/U(nknown)
In our corpus, all particles are negative. In Griko, as in Standard Modern Greek, sentential negative markers are sensitive to the mood (indicative/nonindicative) of the verb. Negative particles that occur in clausal ellipsis contexts are characterized as Sent(ential). For example, ndè is tagged as Prt+Neg+Sent.
3.10 Numeral
Abbreviation: Num
Example: ettà (“seven”)
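Two conventions from this protocol lend themselves to a short illustration: a fused adposition carries two tags joined by “.” (Section 3.5), and a finite verb tag has nine components (Section 3A). The sketch below is hedged accordingly. The function names are hypothetical, and the field order in `FINITE_VERB_FIELDS` is an assumption read off the example tag for teli, where Type precedes Voice. It is meant for main verbs only, since auxiliary subtypes may add a component.

```python
from typing import Dict, List

# Assumed field order, inferred from V+fin+M+Act+Nonpast+Imperf+Ind+S+3.
FINITE_VERB_FIELDS = (
    "category", "finiteness", "type", "voice",
    "tense", "aspect", "mood", "number", "person",
)


def split_word_tags(word_tag: str) -> List[List[str]]:
    """Split the tag string of one word into its tags ('.') and components ('+')."""
    return [part.split("+") for part in word_tag.split(".")]


def label_finite_verb(tag: str) -> Dict[str, str]:
    """Pair each component of a finite main-verb tag with an attribute name."""
    fields = tag.split("+")
    if len(fields) != len(FINITE_VERB_FIELDS):
        raise ValueError(f"expected {len(FINITE_VERB_FIELDS)} components, got {len(fields)}")
    return dict(zip(FINITE_VERB_FIELDS, fields))
```

With the examples from the protocol, `split_word_tags("P+Pfus.D+Det+Fem+S+Acc")` returns two component lists, one for P and one for D, and `label_finite_verb("V+fin+M+Act+Nonpast+Imperf+Ind+S+3")` maps `mood` to `Ind` and `person` to `3`.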
Appendix A3: Database and website manual

1. Introduction
This manual includes all necessary information regarding the implementation of the database and of the website for the project ‘Documentation and analysis of an endangered language: aspects of the grammar of Griko’. The manual follows the flow of the project: the preprocessing of the transcribed data is presented first, the relational database is described in the next section, and the website user interface in the final section.

2. Data Preprocessing
The programme used for the transcription of the data is Praat. Using Praat, the information for each audio segment (which corresponds to a sentence) is stored in tiers. Following the transcription protocol, the tiers used were:
• transcription
• tagging
• gloss
• lemma
• metadata
The metadata tier records the speaker, and was used only for locations where multiple speakers took part in the interviews. The preprocessing of the data included two steps:
1. Parse the .TextGrid files into a more suitable format, such as plain text. This was done with the ParseTextGrid.py Python script, which is available at http://griko.project.uoi.gr/pythoscripts/ParseTextGrid.py.
2. Check the tags for possible inconsistencies with respect to the tagging protocol. This was done with the CheckData.py Python script, which is available at http://griko.project.uoi.gr/pythoscripts/CheckData.py.
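The two steps above can be sketched roughly as follows. This is not the code of ParseTextGrid.py or CheckData.py, which is not reproduced here: the function names are hypothetical, the sample uses placeholder strings rather than corpus data, and the sketch assumes Praat's long text format with no escaped quotation marks inside labels. The check shown is just one plausible consistency test, namely that each tier has as many whitespace-separated items as the transcription.

```python
import re
from collections import defaultdict

NAME_RE = re.compile(r'^\s*name = "(.*)"\s*$')
TEXT_RE = re.compile(r'^\s*text = "(.*)"\s*$')

# Placeholder TextGrid (long text format); the labels are not corpus data.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 3
item []:
    item [1]:
        class = "IntervalTier"
        name = "transcription"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "w1 w2 w3"
    item [2]:
        class = "IntervalTier"
        name = "tagging"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "T1 T2 T3"
    item [3]:
        class = "IntervalTier"
        name = "gloss"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "g1 g2"
'''


def read_tiers(textgrid_text):
    """Collect the interval labels of each tier, keyed by tier name."""
    tiers = defaultdict(list)
    current = None
    for line in textgrid_text.splitlines():
        name = NAME_RE.match(line)
        if name:
            current = name.group(1)
            continue
        text = TEXT_RE.match(line)
        if text and current is not None:
            tiers[current].append(text.group(1))
    return dict(tiers)


def misaligned(tiers, reference="transcription", others=("tagging", "gloss", "lemma")):
    """List (segment index, tier, expected, found) where item counts differ."""
    problems = []
    for index, sentence in enumerate(tiers.get(reference, [])):
        expected = len(sentence.split())
        for name in others:
            segments = tiers.get(name, [])
            found = len(segments[index].split()) if index < len(segments) else 0
            if found != expected:
                problems.append((index, name, expected, found))
    return problems
```

On the placeholder sample, `misaligned(read_tiers(SAMPLE), others=("tagging", "gloss"))` flags the gloss tier, which has two items where the transcription has three.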
3. Database
The processed data are then stored in a relational SQL database that uses the InnoDB storage engine. The database is filled automatically by a Python script, which reads the preprocessed data and stores them in the appropriate tables. The database collation is set to `utf8_general_ci` (a collation of the `utf8` character set) in order to accommodate all the characters of the Griko, Italian and Greek alphabets.
3.1 SQL Tables
The main table of the database is the table `sentences`. It stores the whole transcription, tagging, gloss, lemma and question id. In addition, it stores information on the location and on whether the segment is retrieved from other sources. It also provides the name of the .wav file that corresponds to the segment, if it is available.
The table `questions` stores the questionnaire that was used during the interviews, including the test sentence, its Italian translation, and a list of the syntactic phenomena that the question is designed to examine. The table `location` stores a list of the locations of the interviews, and the table `keyword` stores a list of all the syntactic variables/keywords that were examined with the various questions. Moreover, the table `questionkeywords` provides the M-to-N (many-to-many) mapping between questions and syntactic keywords. This is needed in order to retrieve the question ids and the sentences efficiently when performing queries based on the syntactic keywords.
In addition, the tables `tokens`, `itgloss`, `lemma` and `tags` store each individual transcription word, Italian gloss word, lemma and tag for all sentences, together with its position in the sentence (counted incrementally), as it would result from applying a split() function to the sentence. Although this may seem redundant, these tables enable faster searches on individual items and also ensure that each transcription token is matched to its tag, gloss and lemma through the position variable.
Finally, the remaining tables hold the information of the tags. Each part of speech has its own table, which stores the information of its particular features. This is needed because the different part-of-speech tags employ different features (for more information, see `Tagging protocol`).
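The role of the position variable can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the project's InnoDB database so that it runs anywhere. The column names are hypothetical, and the noun tag for petìa is constructed from the protocol for illustration only (the adjective tag is the one given in the tagging protocol).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with simplified, hypothetical columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, transcription TEXT, tagging TEXT);
        CREATE TABLE tokens (sentence_id INTEGER, position INTEGER, token TEXT);
        CREATE TABLE tags (sentence_id INTEGER, position INTEGER, tag TEXT);
        """
    )
    return conn


def add_sentence(conn, sentence_id, transcription, tagging):
    """Store the sentence and one row per token and per tag, indexed by position."""
    conn.execute("INSERT INTO sentences VALUES (?, ?, ?)", (sentence_id, transcription, tagging))
    for position, token in enumerate(transcription.split()):
        conn.execute("INSERT INTO tokens VALUES (?, ?, ?)", (sentence_id, position, token))
    for position, tag in enumerate(tagging.split()):
        conn.execute("INSERT INTO tags VALUES (?, ?, ?)", (sentence_id, position, tag))


def tags_of(conn, token):
    """Find the tags of a token by joining on sentence id and position."""
    rows = conn.execute(
        """
        SELECT tags.tag FROM tokens
        JOIN tags ON tokens.sentence_id = tags.sentence_id
                 AND tokens.position = tags.position
        WHERE tokens.token = ?
        """,
        (token,),
    ).fetchall()
    return [row[0] for row in rows]
```

After `add_sentence(conn, 1, "petìa mincià", "N+Neu+Pl+Nom Adj+Neu+Pl+Nom+Post")`, the call `tags_of(conn, "mincià")` retrieves the adjective tag without any string search over whole sentences.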
3.2 Tags' features
The features for each tag are always stored as integer (or Boolean) values, in order to ensure consistency, avoid issues with string comparisons and speed up query execution. The features are stored in the following way:
• If the feature has only two values, then it can be represented with a Boolean variable ("0" or "1"). An example of such a feature is Italian, which denotes whether the word is Italian or not.
• If the feature can have multiple values, then each value is matched with an integer (incrementally, starting from "0"). The matching of these values to the integers is presented in the relevant tables in the appendix.
Note: the feature ``case`` (for nouns, adjectives, pronouns and determiners) is referred to as casse in the database, because CASE is a reserved word in SQL syntax.
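As an illustration of this encoding, the sketch below turns a noun tag into integer-valued columns, using the Gender, Number and Case codes listed in the appendix. Only the column name `casse` comes from this manual. The other key names, the function name `encode_noun`, and the mapping of the tagging value “U” to the code labelled Unknown are assumptions made for the example.

```python
# Integer codes as listed in the appendix tables.
GENDER = {"Masc": 0, "Fem": 1, "Neu": 2, "U": 3}
NUMBER = {"S": 0, "Pl": 1, "U": 2}
CASE = {"Nom": 0, "Gen": 1, "Acc": 2, "Voc": 3, "U": 4}


def encode_noun(tag: str) -> dict:
    """Encode a noun tag such as N+Fem+S+Nom+Prop as integer features."""
    fields = tag.split("+")
    if fields[0] != "N" or len(fields) not in (4, 5):
        raise ValueError(f"not a noun tag: {tag!r}")
    _, gender, number, case = fields[:4]
    return {
        "gender": GENDER[gender],
        "number": NUMBER[number],
        "casse": CASE[case],                       # 'case' is reserved in SQL
        "proper": int(fields[4:] == ["Prop"]),     # Boolean feature, 0 by default
    }
```

For the tag given for Maria, `encode_noun("N+Fem+S+Nom+Prop")` produces gender 1, number 0, casse 0 and proper 1.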
4. Website
The website consists of simple HTML pages, using JavaScript and PHP for specific functionalities. JavaScript is used to create the adaptive forms for querying the database. This makes the form user-friendly and ensures that no meaningless queries are executed: for example, for each tag selected as a search criterion, only the corresponding features are shown. JavaScript is also used for the image functionalities; the lightbox functionality uses the lightbox-2.6.min.js package. PHP is used for connecting to and querying the database and for displaying the query results.

4.1 Database Search UI
The database is accessed through the mysqli PHP extension (its main advantage being that it supports Unicode character encoding), so the interface for querying the database also uses PHP and mysqli.

4.1.1 SEARCH FORM
The form is organised into tabs, enabling queries with the various parameters. The tabs Word, Lemma and Gloss (Italian) receive a term as input from the user and search for it in the appropriate tables. Note that in the current version the input term has to be spelled in accordance with the orthographic conventions adopted for Griko. As a result, a Griko term must include the appropriate accent diacritic (e.g. ``tròo`` instead of ``troo``), otherwise the search will not return any results. This means that an Italian (or other Unicode) keyboard is required. The next version will hopefully enable search without the accent.
The Test Sentence and Keyword tabs receive input only from check-boxes and return the corresponding sentences. The form is constructed by retrieving the test sentences and the keywords from the `questions` and `keyword` tables of the database. The form in the Location tab is constructed in the same way, except that the input is requested in the form of a drop-down list.
Special attention is given to the Tag tab, which enables search by PoS tags.
Since each part of speech has different features, with which the user must be able to search the database, the form is constructed dynamically. The user first selects the PoS tag and the needed features are then presented for selection. This is accomplished with the Pane() and init() javascript functions. The function Pane() determines the properties of each ``pane`` (every form is defined as a different pane) and implements the functionality that allows for forms
to only appear when they are needed. The syntax of the function can be interpreted as:
Pane(X,Y,Z) → Show pane Y, if X takes value Z
The function Init() defines the dependencies of the form panes on the user input, initializing the page. The values for each option of the features correspond to the values that are stored in the database and can be found in the appendix. The current version does not support queries with more than one PoS tag, but a new version with this feature enabled is already planned.

4.1.2 QUERIES
In general, the queries are SQL JOIN queries over the tables `sentences`, `location` and a third table, which depends on the user input. The third table is determined by the search tab that the user has selected (for example, if the user is querying through the Lemma tab, then the third table is the `lemma` table) and, in the case of PoS tags, by the type of PoS (e.g., if the selected PoS tag is Verb, then the third table is the `verbtags` table). The rest of the user input is used for constructing the conditions of the JOIN. The searchtags.php script constructs and executes the query by iterating over the possible user inputs.
It is important to note that input for different features is combined with AND in the conditional query. For example, when querying the database for all Adjective tags with the Masculine and Singular options selected (for the Gender and Number features), the result will contain all adjectives that are both masculine AND singular.
However, multiple selections for the same feature are combined with ``OR``. For example, when querying the database for all Adjective tags with the Masculine and Feminine options selected (for the Gender feature), the result will contain all adjectives that are either masculine OR feminine. The two mechanisms can of course be combined in order to construct more complicated queries.
If no option is selected for some feature, then this feature does not form part of the query. As a result, if no option is selected in general, the result consists of all the sentences in the database.
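The combination logic described above (OR within a feature, AND across features, and no condition for a feature without a selection) can be sketched as follows. This is an illustration in Python, not the code of searchtags.php. The function name `build_where` and the column whitelist are hypothetical, and the `?` placeholders assume a prepared-statement interface.

```python
# Hypothetical whitelist: column names are interpolated into the SQL text,
# so they must never come directly from user input.
ALLOWED_COLUMNS = {"gender", "number", "casse", "degree", "position"}


def build_where(selections):
    """Build a WHERE clause and its parameters from {column: [selected values]}.

    Values of the same feature are joined with OR; different features with AND.
    Features with no selected value are left out of the query.
    """
    clauses, params = [], []
    for column, values in selections.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unexpected column: {column!r}")
        if not values:
            continue
        alternatives = " OR ".join(f"{column} = ?" for _ in values)
        clauses.append(f"({alternatives})")
        params.extend(values)
    where = " AND ".join(clauses)
    return (f" WHERE {where}" if where else ""), params
```

Selecting Masculine and Feminine (codes 0 and 1) for Gender and Singular (code 0) for Number gives the clause ` WHERE (gender = ? OR gender = ?) AND (number = ?)` with parameters `[0, 1, 0]`. An empty selection gives an empty clause, so all sentences are returned.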
4.1.3 RESULTS
The results of the query are presented in a table, which is also constructed by the searchtags.php script. By default, the question id, transcription, location and audio file (if available) of the sentence are included in the results. Depending on the options selected by the user, the Italian gloss, tags and lemmas may also be shown. Where audio data are available, an icon is shown which, when clicked, opens a new browser tab or window (depending on user preferences) with a simple player for the wav file. The audio files are accessed through a simple interface, which automatically selects the player that each browser supports, in order to avoid compatibility issues.

5. Conclusion
For any enquiries, suggestions or more information on the implementation and the technical details of the database, the website or the whole project, please contact:
• Antonis Anastasopoulos (for technical information or for help on using the search engine) at [email protected]
• Marika Lekakou (for enquiries about the project or the corpus) at [email protected].
Appendix
The following tables present the one-to-one mapping between feature values and the integer codes used in the database and the online form.

Location
1 Calimera
2 Corigliano
3 Martano
4 Sternatia
5 other

Gender
0 Masculine
1 Feminine
2 Neutral
3 Unknown

Degree
0 Positive
1 Comparative
2 Superlative

Number
0 Singular
1 Plural
2 Unknown

Case
0 Nominative
1 Genitive
2 Accusative
3 Vocative
4 Unknown
5 Undefined

Position
0 Preposed
1 Postposed

Verb Finiteness
0 Non Finite
1 Finite

Verb Type
0 Main
1 Auxiliary

Auxiliary Verb Type
0 Modal
1 Perfect
3 Passive
4 Aspectual

Verb Voice
0 Active
1 Non Active

Verb Tense
1 Past
2 Non Past

Verb Aspect
1 Perfective
2 Imperfective

Verb Mood
1 Indicative
2 Imperative
3 Subjunctive

Person
0 First
1 Second
2 Third

Non Finite Verb Subtype
0 Not Applicable (finite verb)
1 Infinitive
2 Participle

Adverb Type
0 Temporal
1 Locative
2 Interrogative
3 Aspectual
4 Epistemic
5 Quantitative
6 Quantitative (Negative)
7 Other
8 Manner

Subordinating Complementizer Subtype
1 Temporal
2 Default
3 Declarative
4 Interrogative
5 Relative
6 Conditional
7 Subjunctive
8 Causal

Pronoun Type
0 Personal
3 Demonstrative
5 Interrogative
6 Possessive
7 Quantificational

Pronoun Strength
0 Weak
1 Strong
2 Non applicable

Complementizer Type
0 Subordinating
1 Coordinating
The rest of the features:
• Proper (nouns)
• Nominalised (adjectives)
• Fused (adpositions)
• Enclisis (pronouns)
• Participation in Clitic Doubling (pronouns)
• Italian (all words)
are modeled with Boolean values. Their default value is 0 (false) and if the specification applies to the particular word/tag, then the value is 1 (true).
Particle Subtype
0 Indicative
1 Non Indicative
2 Sentential
3 Unknown

Particle Type
0 Negative
1 Other