15th INTERNATIONAL CONFERENCE NooJ 2021

Book of Abstracts

Virtual conference, Besançon, France, June 9-11, 2021

Magali Bigey, Annabel Richeton, Max Silberztein, Izabella Thomas (Eds.)



Organization

The 15th Annual NooJ 2021 International Conference was organized as a virtual conference.

Organizing Committee

Magali Bigey, ELLIADD, Université de Franche-Comté, France
Annabel Richeton, ELLIADD, Université de Franche-Comté, France
Max Silberztein, ELLIADD, Université de Franche-Comté, France
Izabella Thomas, CRIT, Université de Franche-Comté, France

Organizing Institutions

Université de Franche-Comté, Besançon, France
ELLIADD Laboratory, Université de Franche-Comté
CRIT Laboratory, Université de Franche-Comté
Association Internationale des Utilisateurs de NooJ


Scientific Committee

Max Silberztein (Chair), Université de Franche-Comté, France
Farida Aoughlis, Mouloud Mammeri University, Algeria
Anabela Barreiro, INESC-ID, Portugal
Magali Bigey, Université de Franche-Comté, France
Xavier Blanco, Autonomous University of Barcelona, Spain
Stéfan Darmoni, Université de Rouen, France
Héla Fehri, University of Sfax, Tunisia
Zoe Gavriilidou, Democritus Univ. of Thrace, Greece
Yuras Hetsevich, National Academy of Sciences, Belarus
Kristina Kocijan, University of Zagreb, Croatia
Walter Koza, Pontifical Catholic University of Valparaiso, Chile
Philippe Lambert, Université de Lorraine, France
Danielle Leeman, Université de Nanterre, France
Peter Machonis, Florida International University, USA
Samir Mbarki, Ibn Tofail University, Morocco
Slim Mesfar, University of Manouba, Tunisia
Elisabeth Métais, Conservatoire National des Arts et Métiers, France
Mario Monteleone, University of Salerno, Italy
Johanna Monti, University of Naples "L'Orientale", Italy
Ralph Müller, Fribourg University, Switzerland
Thierry Poibeau, Laboratoire Lattice, CNRS, France
Jan Radimský, University of South Bohemia, Czech Republic
Andrea Rodrigo, University of Rosario, Argentina
Marko Tadić, University of Zagreb, Croatia
Izabella Thomas, Université de Franche-Comté, France
François Trouilleux, Université Blaise-Pascal, France
Agnès Tutin, Université de Grenoble-Alpes, France


Preface

NooJ is a linguistic development environment that provides tools for linguists to construct linguistic resources that formalize a large gamut of linguistic phenomena: typography, orthography, lexicons for simple words, multiword units and discontinuous expressions, inflectional, derivational and agglutinative morphology, local, phrase-structure and dependency grammars, as well as transformational and semantic grammars. For each linguistic phenomenon to be described, NooJ proposes a set of computational formalisms, the power of which ranges from very efficient finite-state automata (that process regular grammars) to powerful Turing machines (that process unrestricted grammars). NooJ contains a rich toolbox that allows linguists to construct, maintain, test, debug, accumulate and share linguistic resources. This makes NooJ’s approach different from most other computational linguistic tools that typically offer a unique formalism to their users and are not compatible with each other.

NooJ includes parsers that can apply any set of linguistic resources to any corpus of texts in order to extract examples or counterexamples, annotate matching sequences, perform statistical analyses, and so on. Because NooJ’s linguistic resources are neutral, they can also be used by NooJ’s generators to produce texts in natural languages. By combining NooJ’s parsers and generators, one can construct sophisticated NLP (Natural Language Processing) applications, such as MT (Machine Translation) systems, abstract and paraphrase generators, etc.

Since its first release in 2002, several private companies have used NooJ’s linguistic engine to construct business applications in several domains, from Business Intelligence to Opinion Analysis. To date, NooJ modules are available for over 30 languages, and more than 140,000 copies of NooJ have been downloaded. Since 2013, a new version of NooJ, based on Java technology, has been available to all as an open-source GPL project distributed by the European Metashare platform.

NooJ has been enhanced with new features to respond to the needs of researchers who seek to analyze their corpus in various domains of Human and Social Sciences (history, literature and political studies, psychology, sociology, etc.). The new ATISHS software (Analyseur de Textes Innovant pour les Sciences Humaines et Sociales) offers the statistical tools provided by most Digital Humanities software, with the important difference that the objects processed by ATISHS are units of meaning rather than simple graphical word forms.


Table of Contents

PREFACE
Invited speakers, Nastia Osidach & Hugues de Mazancourt

PART 1: LEXICAL AND MORPHOLOGICAL RESOURCES
Treatment of the aspectual value of Kabyle verbs, Hamid Annouz
Intensive frozen adverbs of the PECO class in Old French, Xavier Blanco, Yauheniya Yakubovich
The Western Armenian resources for NooJ, Anaid Donabedian
A NooJ module for the Romanian language, Maria-Diana Manescu
Formalization of the Italian negation system and sentiment analysis, Mario Monteleone, Ignazio Mauro Mirto
A NooJ module for the Albanian language, Odile Piton
A NooJ module for the Ukrainian language, Olena Saint-Joanis

PART 2: SYNTACTIC AND SEMANTIC RESOURCES
Syntactic analysis of sentences containing Arabic psychological verbs, Asmaa Amzali, Asmaa Kourtin, Mohammed Mourchid, Abdelaziz Mouloudi and Samir Mbarki
Recognition and automatic translation of dative verbs, Hajer Cheikhrouhou
Formalization of clause-subordinate transformations in Quechua, Maximiliano Duran
Lexicon-grammar tables for modern Arabic frozen expressions, Asmaa Kourtin, Asmaa Amzali, Mohammed Mourchid, Abdelaziz Mouloudi, Samir Mbarki
The nominal ellipsis in complex head of Spanish: Phenomenon description and computer-based model proposal, Walter Koza, Hazel Barahona
Meaning extraction from strappare causatives in Italian, Ignazio Mauro Mirto, Mario Monteleone
Using syntactic grammar to export NooJ morphological annotation: A case study of the morphological annotation of Indonesian texts, Prihantoro
Transformational analysis of auxiliary verbs, Max Silberztein

PART 3: CORPUS LINGUISTICS AND DISCOURSE ANALYSIS
The contribution of NooJ to digital surveillance: The assimilation of the major issues that characterized the U.S. presidential elections, Nana Ama Ampomah Awuah
The designation and storytelling of the culprits during the French Yellow Vests movement. A study of 2018-2019 books of grievances, Marion Bendinelli
Uses and potential of mobile devices in francophone sub-Saharan Africa, Magali Bigey, Ibrahim Maïdakouale
Sensitivity to fake-news: reception analysis with NooJ and ATISHS, Magali Bigey, Justine Simon
Creation of a legal domain corpus for the Belarusian NooJ module: texts, dictionaries, grammars, Yuras Hetsevich, Yauheniya Zianouka, Valerii Varanovich, Mikita Suprunchuk, Tsimafei Prakapenka, Dmitrii Dzenisiuk
Negation usage in Croatian Parliament, Kristina Kocijan, Krešimir Šojat
From laws and decrees to a legal ontology, Ismahane Kourtin, Aziz Mouloudi, Samir Mbarki
Festival volunteers, committed festival-goers or the legacy of cultural practice, Stéphane Laurent
How to locate traces of subjectivity in diplomatic discourse with NooJ? The example of the French Ministry of Foreign Affairs, Annabel Richeton
Terms and Appositions: What unstructured texts tell us, Giulia Speranza, Maria Pia Di Buono, Johanna Monti
The debates on the advent of the fifth generation of mobile telephony (5G), Yu Xia

PART 4: NATURAL LANGUAGE PROCESSING SOFTWARE APPLICATIONS
Paraphrasing tool using the NooJ Platform, Amine Alassir ISG, Sondes Dardour, Héla Fehri
Recognition and analysis of complex questions in standard Arabic using NooJ, Essia Bessaies, Slim Mesfar, Henda ben Ghazela Riadi
Answer validation in question answering system, Essia Bessaies, Slim Mesfar, Henda ben Ghazela Riadi
The use of NooJ’s functionalities to build an application for Arabic acquisition, Ilham Blanchete, Mohammed Mourchid
Linguistics, applied research and NLP: using NooJ in a technical-operational context. Case-study, analysis and perspectives, Nicolas Boffo, Philippe Lambert
Construction of an educational game "VocabNooJ", Hela Fehri, Lazhar Arroum, Sameh Ben Aoun
Automatic analysis of finding predicates from a Lexicon-Grammar proposal, Javiera Jacobsen, Mirian Muñoz, Walter Koza, Francisca Saiz
Arabic spelling error detection and correction using NooJ, Rafik Kassmi, Samir Mbarki, Abdelaziz Mouloudi
Automatic identification and disambiguation of abbreviations in the medical domain, Walter Koza, Ninoska Godoy, Constanza Suy, Romanet Contreras, Sofía Koza, Fernanda Aguirre, Martín Díaz
Automatic detection and generation of argument structures of the medical domain, Walter Koza, Constanza Suy Álvarez
Geoparsing with NooJ Italian toponym resolution for environmental crimes, Raffaele Manna, Annarita Magliacane, Antonio Pascucci, Wanda Punzi Zarino, Vincenzo Simoniello
Integrated NooJ environment for Arabic linguistic disambiguation improvement using MWEs, Dhekra Najar, Slim Mesfar, Henda Ben Ghezela
Approach to the automatic treatment of gerunds in Spanish and Quechua: A pedagogical application of NooJ, Andrea Rodrigo, Maximiliano Duran, María Yanina Nalli
Automatic generation of intonation marks and prosodic segmentation in Belarusian, Yauheniya Zianouka, Dzmitry Dzenisiuk, David Latyshevich, Yuras Hetsevich


Invited speakers, Nastia Osidach & Hugues de Mazancourt

Computational Linguistics at Grammarly

Nastia Osidach, Grammarly

Grammarly is one of the world's most innovative AI companies, consistently breaking new ground in natural language processing research. Grammarly develops an AI-based writing assistant that helps 30 million people every day write more clearly and effectively in English. In Kyiv, Nastia Osidach manages Grammarly's team of Computational Linguists who are developing and maintaining features for Grammarly's writing assistant.

Industrial Applications of Natural Language Generation techniques

Hugues de Mazancourt, Yseop, APIL

Hugues is VP of Innovation at Yseop and is responsible for NLU developments and the Yseop Lab. He brings extensive machine-learning and related technology expertise and experience to Yseop's portfolio, while also serving as an expert for Cap Digital and lecturing in the master's program at the Université Paris Diderot. In 2001 he started APIL, an association for professionals in Natural Language Processing. He served as its president in 2004, a role he took up again for the association's relaunch in 2019.


Part 1: Lexical and Morphological Resources


Treatment of the aspectual value of Kabyle verbs, Hamid Annouz

INALCO, France

Abstract

Kabyle verbs can mainly be divided, on the basis of their morphology, into five themes: aorist, intensive aorist, negative preterit, preterit and participle (with a simple and an intensive imperative). For certain types of verbs, however, a change of aspectual value does not systematically entail a morphological change.

The CCeC type (three consonants with an "e" before the last one), for example, has the same inflection in the aorist and the preterit, and the CCV type (two consonants and one vowel) has the same inflection in the preterit and the negative preterit. The negation of the simple aorist produces the intensive aorist theme, etc.
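The two verb types named above can be illustrated with a minimal Python sketch that reduces a stem to its consonant/vowel skeleton and looks up which themes share a form. This is not the author's NooJ grammar: the helper names, the vowel inventory (a, i, u plus the neutral vowel e) and the two sample stems are assumptions made here for illustration only.

```python
# Hypothetical helpers, not part of NooJ or of the author's resources.
FULL_VOWELS = set("aiu")  # assumption: full vowels are a, i, u
SCHWA = "e"               # the neutral vowel is kept visible in the skeleton


def skeleton(stem: str) -> str:
    """Map a verb stem to its skeleton: C for consonants, V for full vowels, e for schwa."""
    out = []
    for ch in stem.lower():
        if ch == SCHWA:
            out.append("e")
        elif ch in FULL_VOWELS:
            out.append("V")
        elif ch.isalpha():
            out.append("C")
    return "".join(out)


# Themes that the abstract reports as sharing one inflected form, per verb type.
SHARED_THEMES = {
    "CCeC": ("aorist", "preterit"),
    "CCV": ("preterit", "negative preterit"),
}


def ambiguous_themes(stem: str) -> tuple:
    """Return the pair of themes that cannot be told apart morphologically, if any."""
    return SHARED_THEMES.get(skeleton(stem), ())


# Illustrative stems chosen for this sketch, not taken from the abstract.
print(skeleton("krez"), ambiguous_themes("krez"))
print(skeleton("bnu"), ambiguous_themes("bnu"))
```

A lookup of this kind only flags where morphology is silent; deciding between the two themes still requires the contextual cues (adverbs, preverbal modalities) that the abstract assigns to syntactic grammars.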

In other particular cases, the temporal dimension is not taken into account by the verb. Its oppositions are tense-independent: the same verbal form can occur in the past, the present or the future (Kabyle is an aspectual language). Tense is distinguished by the context, mainly adverbs, and in some cases by preverbal modalities.

Using NooJ tools, especially syntactic grammars, morphological plans, dictionaries and corpora, we will try to describe and formalize certain technical cases in a way that allows us to avoid ambiguities.

The first part is based on distinguishing between aorist and preterit, the second, between negative preterit and positive preterit, whereas the third part compares aorist and negative aorist.

References

Aoughlis F., Nait-Serrad K., Annouz H., Ferroudja A. and Habet M.S. (2014). “A New Tamazight Module for NooJ”, Formalising Natural Languages with NooJ 2013, edited by S. Koeva, S. Mesfar and M. Silberztein. Cambridge Scholars Publishing, Newcastle, UK, p. 13-26.

Annouz H. (2019). Traitement morphologique des unités linguistiques du kabyle à l’aide de logiciel NooJ : Construction d’une base de données, PhD thesis, Institut national des langues et civilisations orientales (INALCO).

Chaker S. (1989). «Aspect», In: Encyclopédie Berbère VII, EDISUD, Aix-en-Provence, p. 971-977.

Naït-Zerrad K. (2001). Grammaire moderne du kabyle, tajerrumt n teqbaylit, Karthala, Paris.

Silberztein M. (2015). La formalisation des langues, l’approche de NooJ, ISTE Editions.


Intensive frozen adverbs of the PECO class in Old French, Xavier Blanco, Yauheniya Yakubovich

Universitat Autònoma de Barcelona, Spain

Universitat de València, Spain

Abstract

In his GTF-Syntaxe de l’adverbe, Maurice Gross (1986) laid the groundwork for a formal and systematic study of frozen adverbial structures. In the last 35 years, numerous works in different languages have adopted his views for the inventory and description of these phraseological units. The availability of descriptions referring to different linguistic materials, but elaborated with the same theoretical principles and the same formal means constitutes an important asset for comparative studies. In addition, formalization and computer implementation allow considering immediate applications. We are, therefore, fully in the synergy between linguistics and NLP (Silberztein, 2020).

In the framework of the COLINDANTE project (Ministerio de Ciencia e Innovación, Spain), we are inventorying, classifying and describing the intensive collocations of Medieval French and Medieval Spanish (Blanco, 2020). In this paper, we present one specific type of such collocations: the intensive adverbial comparative structures in comme and plus que applied to an adjective. This is essentially the PECO class described by Maurice Gross for contemporary French, restricted to the intensive meaning and excluding adverbs of the same class that carry other meanings, such as meliorative or vericonditional ones.

We built a lexical database by systematically extracting and annotating the intensive adverbial comparative structures from the large textual corpora available for Old French (11th-13th centuries). We completed our inventory by exploring numerous literary texts in PDF with the NooJ platform, which allowed much greater flexibility and customization than the online interfaces of textual databases such as Frantext (Laboratoire ATILF, CNRS / Université de Lorraine, <www.frantext.fr>) or the BFM (Base de Français Médiéval, Laboratoire IHRIM, ENS de Lyon, <txm.bfm-corpus.org>). NooJ has specific resources for Medieval French (Aouini, 2018), although they cover a later period than the one represented in our corpus.

By focusing our research successively on very specific types of linguistic units (from both the syntactic and the semantic points of view), by aiming at exhaustive retrieval, and by ensuring the full relevance of the results through a manual double-check, we were able to reach far more precise conclusions than would be possible by reasoning on a limited number of more or less arbitrarily selected examples. These results are relevant for lexicology, lexicography, literary studies (Yakubovich, 2020), translation studies and NLP. Furthermore, the results of this type of linguistic research are presented in a NooJ dictionary module that other researchers can reuse.

References

Aouini M. (2018). Approche multi-niveaux pour l’analyse des données textuelles non-standardisées : corpus de textes en moyen français. PhD thesis, Université Bourgogne Franche-Comté.

Blanco X. (2020). “Remarques sur la variation diachronique des collocations”, Cahiers de lexicologie 2020-1, n° 116. Variation(s) et phraséologie, p. 71-94.

Gross M. (1986). Grammaire transformationnelle du français. “3 - Syntaxe de l’adverbe”, Paris : Asstril.

Silberztein M. (2020). “Linguistique et traitement automatique des langues : une coopération nécessaire”, Langue(s) & Parole nº 5, p. 43-66.

Yakubovich Y. (2020). “Collocations dans la poésie : de la norme linguistique à son dépassement” in Mejri, S., Meneses-Lerín, L. & Buffard-Moret, B. (dirs) : La phraséologie française en questions, Paris : Hermann, p. 277-292.


The Western Armenian resources for NooJ, Anaid Donabedian

INALCO, France

Abstract

The Western Armenian NooJ module consists of three dictionaries associated with inflectional paradigms as well as a dozen morphological grammars.

In particular, because intonation signs (interrogation, exclamation, emphasis) are placed inside wordforms in Armenian (on the vowel of the stressed syllable), we had to write a grammar that removes them from wordforms before the dictionaries are accessed (described in Donabedian and Boyacioglu 2007).
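The preprocessing idea described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the NooJ grammar presented in Donabedian and Boyacioglu (2007): the function name and the returned structure are assumptions made here, and only the three Unicode code points for the Armenian emphasis, exclamation and question marks are taken as given.

```python
# Armenian intonation marks that occur inside wordforms (Unicode code points).
INTONATION_MARKS = {
    "\u055B": "emphasis",       # ARMENIAN EMPHASIS MARK
    "\u055C": "exclamation",    # ARMENIAN EXCLAMATION MARK
    "\u055E": "interrogation",  # ARMENIAN QUESTION MARK
}


def strip_intonation(wordform: str):
    """Return the wordform without intonation marks, plus the list of marks found.

    Hypothetical helper: the cleaned form is what a dictionary lookup would
    receive, while the marks stay available as separate information.
    """
    kept = []
    found = []
    for ch in wordform:
        if ch in INTONATION_MARKS:
            found.append(INTONATION_MARKS[ch])
        else:
            kept.append(ch)
    return "".join(kept), found


# Example: a three-letter interrogative word with the question mark after its vowel.
word = "\u056B\u055E\u0576\u0579"
clean, marks = strip_intonation(word)
print(clean, marks)
```

Keeping the removed marks alongside the cleaned form means the intonation information is not lost and can still be attached to the wordform as an annotation after the dictionary lookup.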

Other grammars recognize and analyze wordforms that follow productive patterns and therefore cannot be listed exhaustively. For example, some verbal categories are marked by agglutinative markers, causing a proliferation of transparent forms that are not worth inserting into the inflectional grammars: prefixal negation, derivational patterns for passive and causative, and semantic derivation of verbs through preverbs. Many derivational patterns are also active in the nominal domain, such as abstract nouns derived from adjectives (similar to -ation in French or English). Thanks to these lexical and morphological resources, I was able to obtain 100% coverage of Zabel Essayan's novel "The last cup" (1912, 17,695 wordforms).

I will also present a set of syntactic grammars used to disambiguate certain frequent grammatical words.

References

Donabédian A., Boyacioglu N. (2007). “La lemmatisation de l’arménien occidental avec Nooj”, In: Koeva S., Maurel D., Silberztein M., Formaliser les langues avec l’ordinateur, de INTEX à NooJ, Presses Universitaires de Franche-Comté, p. 55-75.

Donabedian A., Khurshudyan V., Silberztein M. (eds.) (2013). Formalizing Natural Languages with NooJ, Cambridge Scholar Press, 255 p.

Silberztein M. (2016). Formalizing Natural Languages: the NooJ approach, Wiley Ed.

Vidal-Gorène C., Khurshudyan V., and Donabédian A. (2020). “Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing”, In: Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (VARDIAL 2020), p. 90–101, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).


A NooJ module for the Romanian language, Maria-Diana Manescu

Politehnica University (Romania) in partnership with Grenoble Alpes University (France), Erasmus+ master studies


Abstract

We present the first NooJ module for the Romanian language. It contains a morphological dictionary with 10,282 entries, as well as a corpus that consists of 160 text files from various fields and registers.

To develop the lexical resource, we used three morphological dictionaries available under free license: RoMorphoDict (Barbu, 2008), wfl-ro (Erjavec, 2012) and ro_lexicon (Aufrant, Wisniewski, & Yvon, 2016). We used these resources to construct a dictionary of invariable words (conjunctions, prepositions, interjections and adverbs), which we corrected and converted to the NooJ format. Barbu (2007) constructed an inventory of Romanian verbs and their inflectional paradigms from an electronic copy of DOOM (a dictionary that prescribes the correct writing, pronunciation, and inflection of Romanian words). We developed a program in Python to convert this inventory and the inflectional paradigms to the NooJ format. To facilitate the description of inflectional paradigms in NooJ, we developed context-free grammars and organized the inflectional paradigms by conjugation groups. Inflected forms of Romanian verbs can have up to 5 different roots. After compiling the canonical file, we manually checked all the entries with vowel and consonant alternations in the root and made the necessary corrections. We thus obtained 260,945 inflected forms from a total of 7,522 canonical forms and 192 inflection paradigms, compared to the initial 135 paradigms. An example of an entry for which we had to create a new idiosyncratic paradigm is: arăta,V+pr+FLX=ARĂTA [to show].
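As a rough illustration of the conversion step, the sketch below builds a dictionary line in the shape of the sample entry quoted above (lemma, category, features, then FLX=paradigm). It is not the author's program: the function name and the layout of the input rows are hypothetical, and the feature "pr" is copied from the sample entry without interpretation. The last lines simply derive the average number of inflected forms per canonical form from the figures reported in the abstract.

```python
def nooj_entry(lemma, category, features, paradigm):
    """Build one dictionary line of the form lemma,CAT+feature...+FLX=PARADIGM."""
    parts = [category, *features, f"FLX={paradigm}"]
    return f"{lemma},{'+'.join(parts)}"


# Hypothetical inventory rows: (lemma, features, paradigm name).
inventory = [("arăta", ["pr"], "ARĂTA")]

lines = [nooj_entry(lemma, "V", feats, flx) for lemma, feats, flx in inventory]
print(lines[0])  # arăta,V+pr+FLX=ARĂTA

# Figures reported in the abstract: inflected forms and canonical forms.
forms_per_lemma = 260_945 / 7_522
print(round(forms_per_lemma, 1))  # 34.7
```

On the reported figures, each canonical verb yields about 35 inflected forms on average, which gives a sense of why the paradigms are described once per conjugation class rather than per verb.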

Since none of the available corpora of texts in contemporary Romanian language could be downloaded and exploited with NooJ, we used NooJ built-in modules (Silberztein, 2015) to create a new corpus made of 160 text files from various fields and registers. When applying our dictionary to this corpus, NooJ identified 23,563 different annotations. The quality of the annotations is promising, but there is a clear need for the formalization of analytically inflected verbal forms and the remaining grammatical categories (adjectives, nouns, and pronouns). Developing disambiguation grammars will also be an essential contribution. Nonetheless, the current module can already be used for didactic purposes, for the development of various NLP applications (for instance, a conjugator), and even for rudimentary autocorrection modules.

References

Aufrant L., Wisniewski G., & Yvon F. (2016). Cross-lingual and Supervised Models for Morphosyntactic Annotation: a Comparison on Romanian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (p. 1521-1526). Portorož, Slovenia: European Language Resources Association (ELRA).

Barbu A.-M. (2007). Conjugarea verbelor româneşti, Dictionar: 7500 de verbe româneşti grupate pe clase de conjugare (4th edition, revised). Bucharest.

Barbu A.-M. (2008). Romanian Lexical Data Bases: Inflected and Syllabic Forms Dictionaries. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (p. 1937-1941). Marrakech, Morocco: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/

Erjavec T. (2012). MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1), 131-142. https://doi.org/10.1007/s10579-011-9174-8

Silberztein M. (2015). La formalisation des langues : l’approche de NooJ, ISTE Editions.


Formalization of the Italian negation system and sentiment analysis, Mario Monteleone, Ignazio Mauro Mirto

Università di Salerno, Italy

Università di Palermo, Italy

Abstract

The main purpose of this study is a sample formalization of the Italian morphosyntactic negation system, in order to assess the relevance of this system as regards Sentiment Analysis (SA) practices and routines. The study will be achieved through the construction, application, evaluation, and debugging of specific NooJ grammars.

Italian negation can be achieved by, at least, three different means – lexical/morphological, syntactic, and semantic. Here, we will only deal with those features of Italian negation that are subject to formalization.

For Sentence Negation (SN), Italian mainly employs the form non (not). It precedes predicates inflecting in person and number, producing sentences such as Non piove (It does not rain, i.e. it is false that it is raining), which are in opposition to their affirmative counterparts, i.e. Piove (It rains, i.e. it is true that it is raining). In addition, Italian features multiple negation. The addition of a negative element before or after non is not interpreted as a double negation equivalent to an affirmation of truth, as in logic. Therefore, a sentence like Non ha parlato nessuno (Nobody spoke) is not equivalent to the sentence Qualcuno ha parlato (Someone spoke).

The presence or absence of non with an additional negative element (nessuno, mai, mica, etc.) is regulated by the following restrictions:

a. non is mandatory if it confers negation value on a predicate, as in Non sapevo che nessuno fosse venuto (I didn't know nobody had come) or Non mi dice mai niente nessuno (Nobody ever tells me anything);

b. non is also mandatory when the negation is expressed by a double pronoun negation, as in Nessuno mi dice mai niente (Nobody ever tells me anything);

c. mica has two different uses: a negative polarity item, e.g. in Non l’ho visto mica (Actually I didn’t see him), and a negation comparable to non, but emphatic, as in Mica l’ho visto (I really didn’t see him).
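The contrast between multiple negation in Italian and double negation in logic can be made concrete with a small Python sketch. It is not one of the NooJ grammars announced in the abstract: the function names are hypothetical, the word list is limited to the negative elements quoted in the examples above, and the token-based test ignores syntax entirely (it would, for instance, wrongly treat mai in a question as negative).

```python
import re

# Assumption: a small closed list built from the examples in the abstract.
NEGATIVE_ELEMENTS = {"non", "nessuno", "niente", "mai", "mica"}


def negative_elements(sentence):
    """Return the negative elements of a sentence, in order of appearance."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return [t for t in tokens if t in NEGATIVE_ELEMENTS]


def is_negated_concord(sentence):
    """Multiple negation: any number of negative elements yields one negation."""
    return len(negative_elements(sentence)) > 0


def is_negated_logical(sentence):
    """Logic-style reading, shown for contrast: two negations cancel out."""
    return len(negative_elements(sentence)) % 2 == 1


for s in ["Non ha parlato nessuno", "Qualcuno ha parlato", "Nessuno mi dice mai niente"]:
    print(s, negative_elements(s), is_negated_concord(s), is_negated_logical(s))
```

For Non ha parlato nessuno, the concord reading returns a negative sentence while the logic-style parity count does not, which is exactly the mismatch a sentiment analysis routine has to avoid when it counts negation markers.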

Other types of morphosyntactic elements with distinct functions may produce negation. For instance:

- The series of indefinites nessuno, as in Nessuno è venuto (Nobody came); niente, as in Niente paura (No fear); mai, as in Mai visto niente di simile (I have never seen anything like this);

- The conjunctions né/nemmeno/neppure, e.g. …né/nemmeno/neppure si può accettare una cosa simile… (It is impossible to accept such a thing);

- The correlative conjunctions né … né, as in né bianco, né nero (neither white, nor black);

- The form no and tutt’altro, used mainly as a holophrastic answer to polar questions as in Sei felice? No/Tutt’altro (Are you happy? No, I’m not/Not at all);

- Autonomous and non-autonomous adverbial phrases such as tutt’altro che, as in È tutt’altro che uno sprovveduto (He is anything but a fool) and nient’altro che as in È nient’altro che uno sprovveduto (He is nothing more than a fool);

- At the morphological level, the prefixes s- as in sfiorire (wither); dis- as in disassemblare (disassemble), a- as in anormale (not normal), in- as in incapace (incapable);

- non placed before single words, as in non-credente (non-believer), non-belligeranza (non-belligerence), nonviolenza (non-violence), or nonviolento (nonviolent).


It is worth noting that other morphosyntactic aspects may participate in the structuring of Italian negation. By means of NooJ tools and routines, these aspects will be methodically dealt with in our presentation. Therefore, we will use specific NooJ FSAs / FSTs and assemble them into an autonomous morphosyntactic system, based on both labeled electronic dictionaries and local grammars built ad hoc.

References

Academia della Crusca, Sulla costruzione della frase negativa in italiano. URL : https://accademiadellacrusca.it/it/consulenza/sulla-costruzione-della-frase-negativa-in-italiano/169

Bernini G. (2011). “Negazione”, Enciclopedia dell'Italiano. URL : http://www.treccani.it/enciclopedia/negazione_%28Enciclopedia-dell%27Italiano%29/

La grammatica italiana (2012). “Negazione, avverbi di”. URL : https://www.treccani.it/enciclopedia/avverbi-di-negazione_%28La-grammatica-italiana%29/

Wikipedia, “Negazione”. URL: https://it.wikipedia.org/wiki/Negazione_(linguistica)

Aula di lingue, “La doppia negazione”. URL: https://aulalingue.scuola.zanichelli.it/benvenuti/2019/11/21/la-doppia-negazione/

Aula di lingue, “Elementi di negazione”. URL: https://aulalingue.scuola.zanichelli.it/benvenuti/2014/12/04/elementi-di-negazione/

Piccini L., “Avverbi di negazione”, Lingua e Grammatica Italiana. URL: http://linguaegrammatica.com/avverbi-di-negazione-quali-sono-e-come-utilizzarli

Ballarè S. (2015). “La negazione di frase nell'italiano contemporaneo: un'analisi sociolinguistica”, Rivista Italiana di Dialettologia 39, p. 37-61.

Silberztein M. (2016). Formalizing Natural Languages. The NooJ Approach. London, Wiley-ISTE.


A NooJ module for the Albanian language, Odile Piton

SAMM EA 4543 / CNRS 2036, Université Paris 1 Panthéon-Sorbonne, Paris, France

Abstract

After discovering the rich and enchanting Albanian literature, I had the opportunity to study the language at INALCO. Then, during my studies in linguistics, I became acquainted with INTEX, then NooJ, and I wished to develop a module for processing this language.

I have already done a significant amount of work on the Albanian language, through NooJ and the tools it provides, and I have given several presentations, in collaboration with Klara Lagji, and with Professor Remzi Përnaska.

Albanian is an Indo-European language with a complex history. It remained an oral language until the end of the 19th century, with many regional or international variations. A colossal effort was undertaken during the 20th century to create a standardized language called "Gjuha letrare shqipe" (the Albanian literary language), and the work presented here focuses on this literary language. I have benefited from the work carried out by the Albanian Institute of Linguistics and Literature of the Academy of Sciences of Albania: (1) a dictionary in 1954, (2) the dictionary of Vedat Kokona, and (3) the grammar of 1989 written under the direction of Androkli Kostallari.

The purpose of this work is to present the NooJ module for Albanian. This module requires the creation of electronic dictionaries for each category of words: for uninflected words (adverbs, conjunctions, and prepositions) as well as for words that are conjugated or inflected (verbs, nouns, adjectives, pronouns, and determiners). I plan to give a detailed presentation of the dictionaries and their inflection, and to explain the features selected. We will see that morphology plays an important role.

In addition to these dictionaries, the processing of the Albanian language requires dynamic tools, since some elements of the language (agglutinated words) cannot all be listed in a dictionary. Besides the classic case of numbers, I will present some FSTs for verbs, nouns, adjectives, and adverbs. The module integrates grammars of the noun phrase, and it addresses the problem of disambiguation. Detailed examples will be covered. I hope that this module will be used by Albanian speakers and that they will evaluate its possibilities for processing Albanian texts. I welcome suggestions for developing this work further.

References

Piton O., Lagji K. (2008). “Morphological Study of Albanian Words, and Processing with NooJ”, In Proceedings of the 2007 International NooJ Conference (Barcelona, Spain). Edited by Xavier Blanco and Max Silberztein. Cambridge Scholars Publishing, Newcastle, UK: 189-205

Piton O., Përnaska R. (2013). “Variations Lexicographiques en Albanais Contemporain, à l'Epreuve du TAL”. Proceedings of 7th International Corpus Linguistics Conference, Lorient JLC 2013, Texte et corpus n°5, France: electronic version.

Piton O., Lagji K., Përnaska R. (2007). “Electronic Dictionaries and Transducers for Automatic Processing of Albanian Language”. Proceedings 12th International Conference NLDB 2007. Ed. LNCS Series, Springer Verlag, France: 407-413

Silberztein M. (2015). La Formalisation des Langues : l'approche de NooJ. ISTE Ed.: Londres (426 p.)


A NooJ module for the Ukrainian language, Olena Saint-Joanis

ELLIADD, Université Bourgogne Franche-Comté, France & CREE, INALCO, France

Abstract

This paper presents the Ukrainian module for NooJ.

1. Dictionary

Our source is the “Open Source” version 2.9.1 electronic dictionary (Polyakov & Rysin, 2016), which contains 221,650 entries. From it, we have built a dictionary using the Ukrainian grammars of (Gorpynyč, 2004; Plušč, 2010; Vyhovanets, 2004). In our Ukrainian dictionary:

a) Our goal is to transform an imperfective phrase – Іван робив вправу дві години. [Ivan has been doing exercises for two hours] – into a perfective one – Іван зробив вправу за дві години. [Ivan did the exercises in two hours] – in the framework of transformational grammars (Silberztein, 2015). We have thus entered the imperfective form of each verb in the dictionary and have connected it to its corresponding perfective form.

робити,VERB+FLX=ЛЮБИТИ+DRV=З:ЛЮБИТИ_2

b) We have entered passive participle forms in the dictionary separately.
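As a rough illustration of how the entry shown in a) can be read, the sketch below splits it into lemma, category and features. It is not part of the module: the reading of +DRV=X:Y (X as the derivational paradigm, Y as the inflectional paradigm of the derived verb) and the idea that paradigm З simply prepends the prefix з- are assumptions made for this example, and parse_entry is a hypothetical helper.

```python
def parse_entry(line: str) -> dict:
    """Split 'lemma,CATEGORY+NAME=VALUE+...' into its parts."""
    lemma, _, info = line.partition(",")
    category, *features = info.split("+")
    parsed = {"lemma": lemma, "category": category, "features": {}}
    for feature in features:
        name, _, value = feature.partition("=")
        parsed["features"][name] = value
    return parsed


entry = parse_entry("робити,VERB+FLX=ЛЮБИТИ+DRV=З:ЛЮБИТИ_2")

# Assumed reading of +DRV=X:Y: X names the derivational paradigm and
# Y the inflectional paradigm of the derived (perfective) verb.
derivation, _, derived_flx = entry["features"]["DRV"].partition(":")

# Assumption for illustration only: paradigm "З" just prepends the prefix з-.
perfective = derivation.lower() + entry["lemma"]
print(entry["lemma"], "->", perfective, "inflected like", derived_flx)
```

Under these assumptions the imperfective робити is linked to зробити, the perfective form used in the example sentence of a).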

2. Inflectional and derivational grammars

We present the paradigms used to generate all the forms of each entry. The special features of our description system are the following:

a) Participles and adjectives have the same paradigms.

b) Animate masculine nouns have a different accusative form than inanimate nouns.

c) We have connected 6,800 perfective forms to their imperfective forms.

3. Syntactic grammars

We present some disambiguation grammars: for example, a grammar that handles the analytical future of imperfective verbs, and grammars that disambiguate cases (for example, to differentiate the dative case from the prepositional case for nouns).

4. Conclusion and perspectives

We have developed a dictionary of 112,375 entries associated with their inflectional and derivational paradigms, which represents 1,018,256 forms. The dictionary is still to be completed, using Ukrainian texts to detect new forms. At the moment, a database of 100 texts is at our disposal and our first evaluations are promising (we did not find many missing words).

We will create more disambiguation grammars, and more derivation paradigms for adverbs and other categories. We are planning to enrich our dictionary by adding new verbs derived from the basic verbs, formalizing the phenomenon of secondary perfectivation.

References

Gorpynyč V. (2004). Morphologiya ukraïnskoï movy. Akademiya, Kyïv.

Plušč M. (2010). Gramatyka ukraïnskoï movy. Častyna 1. Morfemika. Slovotvir. Morfologiya. Pidručnyk dlya studentiv filologičnyh spetsialnostei vyščyh navčalnyh zakladiv. Vyšča škola, Kyïv.

Silberztein M. (2015). “Joe loves Lea: Transformational Analysis of Transitive Sentences”, In Formalizing Natural Languages with NooJ (9th International NooJ conference, Minsk, Belarus 2015), CCIS Series. Springer Verlag: Heidelberg (2016).

Vyhovanets I., Gorodenska K. (2004). Teoretyčna morfologiya ukraïnskoï movy. Pulsary, Kyïv.


Part 2: Syntactic and Semantic Resources


Syntactic analysis of sentences containing Arabic psychological verbs, Asmaa Amzali, Asmaa Kourtin, Mohammed Mourchid, Abdelaziz Mouloudi and Samir Mbarki

Computer Science Research Laboratory, Ibn Tofail University, Kenitra-Morocco

EDPAGS Laboratory, Faculty of Science, Ibn Tofail University, Kenitra-Morocco

Abstract

Natural language processing (NLP) makes our lives easier by means of question-answering systems, data extraction, machine translation, and sentiment analysis.

One of the crucial steps for NLP, however, involves the automatic recognition of Atomic Linguistic Units (ALUs). Multiword terms, grammatical units and ambiguous words must be correctly identified as ALUs to ensure that the structure of every sentence is accurately analyzed. Otherwise, it will be hard to detect that sentences with different word orders, such as "كره زيد أحمد" (Kariha Zaidun Ahmadan; Zaid hates Ahmed) and "كره أحمد زيد" (Kariha Ahmadun Zaidan; Ahmed hates Zaid), have different interpretations, or that sentences with different structures, such as "أحبّ زيد هندا" ('Ahabba Zaidun Hindan; Zaid loves Hind) and "أكنّ زيد حبّا لهند" ('Akanna Zaidun hobban li Hindin; Zaid has a love for Hind), have a similar interpretation.

Our aim is to build a syntactic analyzer for sentences containing Arabic psychological verbs using the NooJ platform. To this end, we will use a dictionary of about 400 verb entries generated from the lexicon-grammar table of Arabic psychological verbs, which contains all the lexical, syntactic, semantic, and transformational information about these verbs. Then, we will adapt to our needs the simple-sentence analyzer, which recognizes and annotates all the grammatical structures of simple Arabic sentences, so that it can analyze sentences containing Arabic psychological verbs. Finally, we will test the efficiency of this analyzer on texts and corpora.

References

Amzali A., Kourtin A., Mourchid M., Mouloudi A., Mbarki S. (2020). "Lexicon-Grammar Tables Development for Arabic Psychological Verbs" In: Fehri H., Mesfar S., Silberztein M. (eds) Formalizing Natural Languages with NooJ 2019 and Its Natural Language Processing Applications. Communications in Computer and Information Science, vol 1153. Springer, Cham.

Kourtin A., Amzali A., Mourchid M., Mouloudi A., Mbarki S. (2020). “The Automatic generation of NooJ dictionaries from lexicon-grammar tables” In: Fehri H., Mesfar S., Silberztein M. (eds) NooJ 2019. CCIS, vol. 1153, pp. 65-76. Springer, Cham.

Bourahma S., Mbarki S., Mourchid M., Mouloudi A. (2018). "Syntactic parsing of simple Arabic nominal sentences using NooJ linguistic Platform" In Proceedings of the International Conference on Arabic Language Processing, ICALP2017 (11-12 October 2017), Arabic Language Processing: From Theory to Practice, Fez, Morocco, Springer, p. 244-25.

Bourahma S., Mourchid M., Mbarki S., Mouloudi A. (2017). “The Parsing of Simple Arabic Verbal Sentences using NooJ Platform” In: Proceedings of the International NooJ’17 Conference, Kenitra-Rabat, Morocco (May 2017), Formalizing Natural Languages with NooJ and its Natural Language Processing Applications, Springer, p. 81-95.

Silberztein M. (2015). La formalisation des langues, l’approche de NooJ, ISTE Editions.


Recognition and automatic translation of dative verbs, Hajer Cheikhrouhou

University of Sfax, LLTA, Tunisia & University of Franche-Comté, ELLIADD, France

Abstract

This paper studies French dative verbs extracted from Jean Dubois and Françoise Dubois-Charlier’s LVF dictionary¹. In this database, dative verbs are stored in three semantic-syntactic categories (D1, D2, D3) and fifteen syntactic subcategories (D1, D2b, D3a...). This semantic-syntactic classification is based on the oppositions “Animate / Inanimate” and “Literal / Figurative”, and also on the verbs' syntactic and lexical paradigms. We have added another classification, based on their syntactic structures (T1100, N9a, T1308, P3000, P10a0…).

For example, the verb offrir (01) belongs to class D2, subclass D2a “Donner qc à qn ou à qc” [give something to someone or something], and enters the syntactic structure T13a0, which means that the verb is transitive, its subject is human, its direct complement is an object, and its indirect complement is human. We have therefore described it with the following NooJ dictionary lexical entry²:

offrir,V+CONS=T13a0+N0VN1PREPN2+N0Hum+V+N1Abst+N1Conc+N2Hum+PREP="à"

e.g.: Il offre un diamant à sa femme. [He offers a diamond to his wife.]
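The selectional constraints carried by this entry can be pictured with a small check of the N0 V N1 PREP N2 frame. This is only a sketch under stated assumptions: the feature sets are transcribed by hand from the entry above, while NOUN_CLASSES and matches_frame are hypothetical names introduced for illustration and are not the object classes or grammars built by the author.

```python
# Features transcribed by hand from the offrir entry above.
OFFRIR = {"N0": {"Hum"}, "N1": {"Abst", "Conc"}, "N2": {"Hum"}, "PREP": "à"}

# Hypothetical object classes for a few words (illustration only).
NOUN_CLASSES = {"il": "Hum", "femme": "Hum", "diamant": "Conc", "table": "Conc"}


def matches_frame(frame, n0, n1, prep, n2):
    """True if the three arguments and the preposition fit the verb's frame."""
    return (
        prep == frame["PREP"]
        and NOUN_CLASSES.get(n0) in frame["N0"]
        and NOUN_CLASSES.get(n1) in frame["N1"]
        and NOUN_CLASSES.get(n2) in frame["N2"]
    )


# "Il offre un diamant à sa femme": human subject, concrete object, human recipient.
print(matches_frame(OFFRIR, "il", "diamant", "à", "femme"))
```

A non-human recipient such as "table" fails the N2Hum constraint, which is the kind of evidence a disambiguation step can use to discard this reading of the verb.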

First, we have built a French-Arabic bilingual dictionary of dative verbs similar to (Cheikhrouhou 2014) and (Cheikhrouhou 2015). Then, we have constructed grammars to automatically recognize these verbs and translate them into Arabic. As most verbs have multiple meanings and structures, they have to be disambiguated before being translated. To disambiguate them, we have implemented object classes; see (Gross 2008).

References

Cheikhrouhou H. (2014). “Recognition of Communication Verbs with NooJ Platform” In: Formalizing Natural Languages with NooJ 2013, Cambridge Scholars Publishing, UK, p. 155–169.

Cheikhrouhou H. (2015). “The Formalisation of Movement Verbs for Automatic Translation using NooJ Platform” In: Formalizing Natural Languages with NooJ 2014, Cambridge Scholars Publishing, UK, p. 14–21.

Gross G. (2008). Les classes d’objets. Lalies, Presses de l'ENS, Editions rue d'Ulm, p.111-165, URL: https://halshs.archives-ouvertes.fr/halshs-00410784

Le Pesant D., François J., Leeman D. (2007). “Présentation de la classification des Verbes Français de Jean Dubois et Françoise Dubois-Charlier”, Langue Française 153, Larousse, Armand Colin.

Leeman D. (2010). “Description, taxinomie, systémique : un modèle pour les emplois des verbes français”. Langages n°179-180, Armand Colin.

Silberztein M. (2010). “La formalisation du dictionnaire LVF avec NooJ et ses applications pour l’analyse automatique de corpus”, Langages n°179-180, Armand Colin.

¹ See (Le Pesant et al. 2007), (Leeman 2010) and (Silberztein 2010).

² See (Silberztein, 2003).


Formalization of clause-subordinate transformations in Quechua, Maximiliano Duran

Université de Franche-Comté, Besançon, France & LIG, UGA, Grenoble, France

Abstract

In Quechua, subordination is defined morpho-syntactically by suffixation of the parts of speech composing the sentence, e.g. Chayta mamai niptin, Ceciliata rimarqani. Among the different classes of dependent clauses, there are internally headed relative clauses (IHR), studied by (Hastings, 2001), headless relative clauses (HRC), presented by (Cole et al. 1982), and participial relative clauses, partially studied by (Adelaar 2010). Using NooJ local grammars, we have formalized the clause-subordination transformation, which allows us to automatically produce all the paraphrases of a given phrase that contains an adverbial dependent clause.

The verbal suffixes {-pti, -spa, -stin} are used to construct adverbial dependent clauses, e.g. Asispa chayta willawarqa [He told me that laughing]; Kay qellqata tukuspai yanukuita qallarisaq [I will start cooking after finishing writing]; Asikustin sipasja llamkaq richkan [The girl goes to work happily]. These so-called relative suffixes can be agglutinated with certain verbal suffixes. To describe these combinations, we have constructed a Boolean matrix that describes the valid combinations of {-pti, -spa, -stin} with IPS suffixes (-chka, -yku, -paya, …). Then, we have manually selected those agglutinations that preserve the subordination (such as -chkapti, -ykuspa, -payachkapti, -payastin), e.g. Mikuchkaptii chayaramun [He arrived at the moment I was eating]; Mikuykuspa llamkaqman rin [He went to work after he ate]. These new combinations of suffixes can be used to automatically produce a large number of paraphrases as well as to automatically translate Quechua sentences containing subordinate clauses into French.
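The Boolean matrix described above can be pictured with a small table. This is a minimal sketch, not the author's matrix: the True cells are only the four combinations named in the abstract, every other cell is False simply because the abstract does not mention it, and person suffixes (such as the final -i of Mikuchkaptii) are left out. The names MATRIX, is_valid and subordinate_forms are hypothetical.

```python
# Columns: the three subordinating suffixes named in the abstract.
SUBORDINATORS = ("pti", "spa", "stin")

# Rows: IPS suffix sequences. True marks a combination listed in the abstract;
# False only means "not mentioned there", not "known to be invalid".
MATRIX = {
    "chka":     (True,  False, False),   # -chkapti
    "yku":      (False, True,  False),   # -ykuspa
    "paya":     (False, False, True),    # -payastin
    "payachka": (True,  False, False),   # -payachkapti
}


def is_valid(ips: str, subordinator: str) -> bool:
    """Look up one cell of the matrix."""
    return MATRIX[ips][SUBORDINATORS.index(subordinator)]


def subordinate_forms(root: str) -> list:
    """Attach every combination marked True to a verb root."""
    return [
        root + ips + sub
        for ips, row in MATRIX.items()
        for sub, ok in zip(SUBORDINATORS, row)
        if ok
    ]


print(subordinate_forms("miku"))
```

The forms are generated mechanically from the root miku- of the examples; whether each one is acceptable for a given verb is exactly what the manual selection step described above decides.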

References

Adelaar W.F.H. (2010). Participial clauses in Tarma Quechua. Subordination in native South American languages, Edited by: Van Gijn, Rik ; Haude, Katharina ; Muysken, Pieter. University of Zurich Main Library.

Cole P. et al. (1982). “Headless relative clauses in Quechua”, International Journal of American Linguistics. Vol. 48. N° 2. University of Chicago Press.

Duran M. (2013). “Formalizing Quechua verbs Inflexion”, Proceedings of the NooJ 2013 International Conference, Saarbrücken.

Duran, M. (2014). “Les verbes du quechua. Une approche matricielle”, Communication Semaine NooJ, INALCO, Paris.

Duran M. (2016). “The Annotation of Compound Suffixation Structure of Quechua Verbs”, Proceedings of the NooJ 2015 International Conference, Minsk, Belarus.

Guardia Mayorga C. (1973). Gramatica Kechwa, Ediciones Los Andes. Lima Peru.

Harris Z. (1951). Methods in Structural Linguistics, University of Chicago Press, Chicago.

Hastings R. (2001). The interpretation of Cuzco Quechua Relative Clauses, University of Massachusetts, occasional Papers in Linguistics, Vol. 27.

Parker G.J. (1969). Ayacucho Quechua Grammar and Dictionary, University of Hawaii. Mouton The Hague Paris.

Soto Ruiz C. (1976). Gramática quechua: Ayacucho-Chanca, Lima: Ministerio de Educación, Instituto de Estudios Peruanos. 184 p.

Silberztein M. (2003). NooJ Manual. Available for download at: www.nooj-association.org. Updated regularly.

Silberztein M. (2016). Formalizing natural languages: The NooJ approach. Wiley-ISTE. London.

Weber D. (1983). Relativization and nominalized clauses in Huallaga (Huánuco) Quechua. University of California Publications in Linguistics, vol. 103. Berkeley: University of California


Lexicon-grammar tables for modern Arabic frozen expressions, Asmaa Kourtin, Asmaa Amzali, Mohammed Mourchid, Abdelaziz Mouloudi, Samir Mbarki

Computer Science Research Laboratory, Ibn Tofail University, Kenitra-Morocco

EDPAGS Laboratory, Faculty of Science, Ibn Tofail University, Kenitra-Morocco

Abstract

The lexicon of a language is made up not only of single words but also of frozen expressions. Therefore, to process a language, we should not limit ourselves to studying its vocabulary and analyzing lexical meaning. Language processing must also cover syntactic meaning, including the study of frozen or idiomatic expressions.

Frozen or idiomatic expressions have drawn the attention of several researchers in the last few years, leading to many studies on different languages. Different definitions of frozen expressions have been proposed from the syntactic and semantic points of view by Dubois, Maurice Gross, Gaston Gross, etc. From these definitions, we have retained that the term "frozen" is reserved for expressions whose global meaning cannot be deduced by combining the meanings of their components; such an expression is a group of words which, taken as a whole, carries a meaning that comes from the agreement of a group of linguists.

The Arabic language is very rich in frozen expressions which it inherited from the pre-Islamic era and early Islam, and whose use has persisted to this day. These expressions are used in the daily communication language of Arabic speakers, but also in the works of writers and poets.

These expressions are scattered across Arabic books: the Quran, e.g. "أضغاث أحلام" (Adghathu ahlamin; Pipe dream); linguistic heritage and literary books, e.g. "لقي حتفه" (laqiya hatfahu; He is dead); proverb books, e.g. "مصائب قوم عند قوم فوائد" (massa'ibu qawmin 'inda qawmin fawa'idu; One man's meat is another man's poison); etc. This has led some researchers to collect, classify and explain them. Indeed, several classifications have been proposed according to need, such as continuous and discontinuous expressions, expressions that do not admit variations, expressions allowing variations, etc.

Our aim is to create, for the modern Arabic language, lexicon-grammar tables of frozen expressions that are continuous and do not admit variations.

We have started by collecting and studying, in the modern Arabic language, the frozen expressions that are continuous and do not admit variations, such as "مسك الختام" (misku al-khitam; Save the best for last). Then, we have transformed these lexicon-grammar tables into dictionaries [4] and syntactic grammars in the NooJ platform [5], allowing us to detect these expressions in texts and corpora. This recognition will help to solve many problems in automatic natural language processing, such as automatic translation.

References

Dubois J. (2002). Lexis : Larousse de la langue française.

Gross G. (1996). Les expressions figées en français : noms composés et autres locutions, Editions Ophrys.

Gross M. (1993). "Les phrases figées en français". In: L'Information Grammaticale, n°59, pp. 36-41.

Kourtin A., Amzali A., Mourchid M., Mouloudi A., Mbarki S. (2020). "The Automatic Generation of NooJ Dictionaries from Lexicon-Grammar Tables". In: Fehri H., Mesfar S., Silberztein M. (eds) Formalizing Natural Languages with NooJ 2019 and Its Natural Language Processing Applications. NooJ 2019. Communications in Computer and Information Science, vol 1153. Springer.

Silberztein M. (2015). La formalisation des langues, l’approche de NooJ. ISTE Editions.


The nominal ellipsis in complex head of Spanish: Phenomenon description and computer-based model proposal, Walter Koza, Hazel Barahona

Pontificia Universidad Católica de Valparaíso, Chile – Proyecto FONDECyT 1171033

Abstract

Ellipsis is a grammatical mechanism characterized, superficially, by the phonetic omission of some of the elements of a structure. In Spanish, nominal ellipsis does not always affect the head of the phrase alone; it can also extend to some of its modifiers. Such is the case of complex head (CH) structures (Kornfeld, 2012):

- La tasa de mortalidad es más baja que la tasa de natalidad. [Mortality rate is lower than birth rate]

- La tasa de mortalidad de Chile es más alta que la tasa de mortalidad de Argentina. [The mortality rate in Chile is higher than the mortality rate in Argentina]

The CHs are a special set of structures created by an assembly operation that can, in certain restricted domains, act as a syntactic atom (Kornfeld, 2012). They can be summed up in one word created by morphological composition (‘dishwasher’) or involve particular syntactic processes conditioned by their structural and argumentative context (‘Swiss roll’, ‘mortality rate’). This research is concerned with the latter, where semantics plays a determining role insofar as it constitutes a fundamental requirement in elliptic nominal formations. Thus, it is possible to recognize several types of CH, for example expressions (‘break a leg’), structures with marker names (‘economic problem’) (Muñoz, Ciapuscio, 2019), or structures with predicative names, either verbal (‘the construction of the city’) or non-verbal (‘birthday party’) (Resnik, 2010; Koza, 2019), which can behave in various ways with regard to ellipsis.

Nevertheless, this distribution poses a problem for automatic ellipsis resolution, since it is difficult for algorithms to determine the scope of the elision (Rello, Ilisei, 2009). It should nonetheless be possible to solve this problem by adding semantic information to electronic dictionaries and developing grammars that take such information into account. To this end, this work develops an algorithm for the automatic treatment of nominal ellipsis of CH in Spanish using NooJ (Silberztein, 2016). First, the CH components are formalized from a semantic representation, and the corresponding lexical feature is assigned. Second, the resulting information is added to the electronic dictionary. Third, grammars for recognition in natural language texts are created from Barahona & Koza (2021).

The results obtained in the automatic recognition allow us to assume that this methodology is appropriate for an automatic analysis of the nominal ellipsis.

References

Barahona H., Koza W. (2020). “Computational Modeling of a Nominal Ellipsis Grammar for Spanish” In: 14TH International conference NOOJ 2020. Zagreb: University of Zagreb.

Kornfeld, L. (2012). Núcleos complejos y formación del trabajo lingüístico. Master Thesis. Neuquén: Universidad Nacional del Comahue.

Koza W. (2019). “Análisis de la polisemia de nombres eventivos no deverbales del español a partir de la Léxico-Gramática y la Teoría del Lexicón Generativo”, Revista Signos. Estudios de Lingüística, 52(100), 502-530.

Lilian Muñoz V., Ciapuscio G. (2019). “Los nombres rotuladores: un estudio de los rótulos cohesivos en artículos de investigación en inglés y español”, Revista Signos. Estudios de Lingüística, 52(100), 688-714.

Resnik G. (2010). Los nombres eventivos no deverbales en español. Ph.D. Thesis. Barcelona: Pompeu Fabra University.

Silberztein M. (2016). Formalizing natural languages: The NooJ approach. Wiley & Sons.


Meaning extraction from strappare causatives in Italian, Ignazio Mauro Mirto, Mario Monteleone

Università di Palermo, Italy

Università di Salerno, Italy

Abstract

This paper focuses on the automatic extraction of meaning from causative constructions in Italian, e.g. that with the verb fare (make/have) or lasciare (let), as exemplified in (1):

(1) La polizia fece confessare il detenuto (The police made the detainee confess)

Italian also features a little-known causative clause type with strappare (Mirto 2014), a verb which more frequently occurs in non-causative, transitive uses, e.g. Ada strappò la camicia a Piero (Ada tore Piero’s shirt) and Ada strappò la penna a Piero (Ada snatched the pen from Piero). The following example permits a comparison between the two types:

(2) La polizia strappò una confessione al detenuto (The police made the detainee confess)

Our main interest lies in the clause type illustrated in (2), whose post-verbal noun functions predicatively (Mirto 2014), in parallel to what happens in support verb constructions (Gross 1981). What catches the eye in the translation of (2) is that the sentence may potentially be regarded as a paraphrase of (1), insofar as (1) and (2) share the semantic roles itemized below (these roles are described in Mirto 2019):

(a) >the one who confesses< (to be linked to the indirect object the detainee)

(b) >those who bring about the confession< (to be linked to the subject the police)

Importantly, (1) and (2) do not convey the same meaning. A key difference lies in the resistance that the detainee opposes. In this regard, (1) is neutral, whilst (2) expresses the detainee’s reluctance to confess, which amounts to saying that the police somehow forced her/his confession. The semantic role of the subject is thus comparable to that described by Maurice Gross (1998: 10) in relation to the subject of the French mettre ‘to put’ as a Vsup causatif. Despite this key difference, the meaning in (a) and (b) remains constant in (1) and (2) and the truth of (2) guarantees the truth of (1). Hence, (2) entails (1).

We collected 80 nouns that can replace the predicate noun confessione in the strappare causative. The automatic extraction of semantic roles such as (a) and (b) requires: (i) an inflected dictionary which associates each such noun with the verb sharing the noun’s root, e.g. confession > to confess (if no verb exists, a semantically contiguous verb is selected, e.g. ovazione ‘ovation’ > applaudire ‘to applaud’); (ii) the construction of a NooJ local grammar employing variables (Silberztein 2016: 188-191, Monteleone 2016a,b, Monteleone 2020). The latter makes it possible to automatically return units of meaning such as (a) and (b). This way, unstructured texts containing instances of the strappare causative can be automatically annotated with semantic roles which say ‘who-does-what-to-whom’ (Mirto 2007, Palmer, Gildea, and Xue 2009: 1).
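The two requirements can be pictured with a toy sketch in which a regular expression with named groups stands in for the NooJ local grammar with variables. It is not the authors' grammar: the pattern only covers clauses shaped like example (2), NOUN_TO_VERB holds just the two noun-verb pairs mentioned above, and the names PATTERN and extract_roles as well as the role labels are hypothetical.

```python
import re

# Noun -> verb pairs taken from the text; the full resource lists 80 nouns.
NOUN_TO_VERB = {"confessione": "confessare", "ovazione": "applaudire"}

# Toy pattern for "<subject> strappò un/una <noun> al/alla/... <indirect object>".
PATTERN = re.compile(
    r"^(?P<subj>.+?)\s+strappò\s+(?:una|un)\s+(?P<noun>\w+)\s+"
    r"(?:allo|alla|agli|alle|al|ai)\s+(?P<iobj>.+?)\.?$"
)


def extract_roles(sentence: str):
    """Return the roles (a) and (b) for a strappare causative, or None."""
    match = PATTERN.match(sentence)
    if match is None or match["noun"] not in NOUN_TO_VERB:
        return None
    return {
        "verb": NOUN_TO_VERB[match["noun"]],
        "who_does_it": match["iobj"],    # role (a): the one who confesses
        "who_causes_it": match["subj"],  # role (b): those who bring it about
    }


print(extract_roles("La polizia strappò una confessione al detenuto"))
```

Non-causative uses such as Ada strappò la camicia a Piero are not matched here, since camicia is not in the noun list and the clause has a different shape.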

Once a dictionary and a NooJ grammar become available for the fare causative exemplified in (1) as well, the software will permit the tracking or generation of entailment relations between the causative clause types illustrated in (1) and (2). Other entailments will also become possible with e.g. ordinary verb sentences such as Il detenuto confessò (The detainee confessed), support verb sentences such as Il detenuto fece una confessione (The detainee made a confession), and sentences such as La polizia lasciò che il detenuto confessasse (The police let the detainee confess).

References

Gross M. (1998). “La fonction sémantique des verbes supports”, Travaux de linguistique, 37, 1, p. 25-46.

Gross, M. (1981). “Les bases empiriques de la notion de prédicat sémantique”, Langages, 63, p. 7-52.


Mirto I. M. (2019). “The role of cognate semantic roles: Machine translation support for support verb constructions”. Paper presented at NooJ 2019, 7-9 July, Hammamet, Tunisia.

Mirto I. M. (2014). “Italian strappare: Unwilling vs. struggling agents”, In: Kakoyianni-Doa F., (ed.) Penser le Lexique-Grammaire, Perspectives actuelles, Honoré Champion, Paris, p. 335-348.

Mirto I. M. (2007). “Dream a little dream of me. Cognate predicates in English”, In: Actes du 26e Colloque International Lexique-Grammaire, Bonifacio, Corse, 2-6 October 2007, Camugli C., M. Constant, A. Dister (eds.), p. 121-128, URL: http://infolingu.univmlv.fr/Colloques/Bonifacio/proceedings/mirto.pdf.

Monteleone M. (2020). “Automatic Text Generation: How to Write the Plot of a Novel with NooJ”, In: Fehri H., Mesfar S., Silberztein M. (eds), Formalizing Natural Languages with NooJ 2019 and Its Natural Language Processing Applications, NooJ 2019, Communications in Computer and Information Science, vol. 1153. Springer, Cham. https://doi.org/10.1007/978-3-030-38833-1_12.

Monteleone M. (2016a). “NooJ Local Grammars for Endophora Resolution”, In: Barone L., Monteleone M., Silberztein M. (eds.), Automatic Processing of Natural-Language Electronic Texts with NooJ, NooJ 2016, Communications in Computer and Information Science, vol 667. Springer, Cham. https://doi.org/10.1007/978-3-319-55002-2_16.

Monteleone M. (2016b). “NooJ Local Grammars and Formal Semantics: Past Participles vs. Adjectives in Italian”, In: Okrut T., Hetsevich Y., Silberztein M., Stanislavenka H. (eds), Automatic Processing of Natural-Language Electronic Texts with NooJ, NooJ 2015, Communications in Computer and Information Science, vol 607. Springer, Cham. https://doi.org/10.1007/978-3-319-42471-2_8.

Palmer M., Gildea D., Xue N. (2009). Semantic Role Labeling. Morgan & Claypool.

Silberztein M. (2016). Formalizing Natural Languages. The NooJ Approach, London, Wiley.


Using syntactic grammar to export NooJ morphological annotation: A case study of the morphological annotation of Indonesian texts, Prihantoro

Lancaster University & Universitas Diponegoro

Abstract

This experiment is part of the SANTI-morf (Sistem Analisis Teks Indonesia-morfologi¹) project, which aims to build a system of text analysis for Indonesian at the morpheme level (Prihantoro, forthcoming). Indonesian is a standard variety of Malay, one of the Austronesian languages of Southeast Asia, and is widely used in Indonesia. Polymorphemic words in Indonesian can be formed by a variety of morphological processes, such as affixation, reduplication, compounding, and cliticisation (Mueller 2007: 1221-1224). These morphological processes are identifiable via either interlinear morphemic glossing or morphological annotation. Unlike glossing, which targets specific examples, morphological annotation typically targets a full text and thus offers more benefit to users. I have created a morphological annotation scheme (Prihantoro 2019) and morphological annotation resources (Prihantoro 2020) for Indonesian. The scheme is implemented using NooJ (Silberztein 2003) and successfully produces fine-grained morphological annotation. However, NooJ fails to properly export the result of the annotation into a text file. To overcome this issue without modifying NooJ’s core engine, I offer a stepwise procedure which slightly modifies the resources. The procedure simply ‘deceives’ NooJ into treating the morphological annotation as a syntactic annotation (which NooJ can still export properly at present), so that NooJ can successfully export all morphemes and their corresponding tags. Although the export succeeds, the resulting document is improperly formatted, which reduces the readability of the text. To remedy this, I use a small external program (in this case, written in PHP) to perform search-replace operations that reformat the document back into a highly readable text.
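The final search-replace step can be sketched as follows. The abstract does not show the exported format or the target format, so both are invented here for illustration (morphemes wrapped in tags such as <PFX>…</PFX>, rewritten as form_TAG joined by "+"), and Python is used in place of the PHP program mentioned above. The names TAGGED and reformat are hypothetical.

```python
import re

# Hypothetical export format: each morpheme wrapped in a tag, e.g. <PFX>meN</PFX>.
TAGGED = re.compile(r"<(?P<tag>[A-Z]+)>(?P<form>[^<]+)</(?P=tag)>")


def reformat(exported: str) -> str:
    """Rewrite every tagged word as form_TAG+form_TAG (an assumed target layout)."""
    words = []
    for word in exported.split():
        morphs = [f"{m['form']}_{m['tag']}" for m in TAGGED.finditer(word)]
        # Words without any recognised tag are passed through unchanged.
        words.append("+".join(morphs) if morphs else word)
    return " ".join(words)


print(reformat("<PFX>meN</PFX><ROOT>tulis</ROOT> <ROOT>buku</ROOT>"))
```

Because the rewriting happens outside NooJ, the rules can be adjusted to whatever the real export looks like without touching the linguistic resources.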

References

Mueller F. (2007). “Indonesian morphology”, In: A. Kaye, Morphologies of Asia and Africa (p. 1207-1230). Winona Lake: Eisenbrauns.

Prihantoro (2019). “A new morphological tagset for Indonesian”, International Corpus Linguistics Conference 2019. Cardiff.

Prihantoro (2020). Indonesian language resources for Nooj. Retrieved from http://www.nooj-association.org/resources.html

Prihantoro (Forthcoming). SANTI-morf: A new morphological annotation system for Indonesian (PhD Thesis). Lancaster: Lancaster University.

Silberztein M. (2003). NooJ Manual. Available for download at: www.nooj-association.org. Updated regularly.

¹ In English: Indonesian Text Analysis System-morphology


Transformational analysis of auxiliary verbs, Max Silberztein

Université de Franche-Comté, France [email protected]

Abstract

Many Natural Language Processing software applications need to recognize, process and extract predicates (the meaning) from the sentences they parse. Predicates are often represented by an association of the sentence’s main verb and its arguments, such as “come (Joe, here)” to represent the meaning of “Joe comes here”. Most often, recognizing the main verb of a sentence is straightforward; however, a number of verbs, even when conjugated and in the syntactic position of the sentence’s main verb, should not be treated as the origin of the sentence’s main predicate. For example, in the sentence “Joe needs to stop risking coming here”, the main predicate is carried by the verb “to come” rather than by the verbs “to need”, “to stop” or “to risk”. In this paper, we will present the problem of generalized auxiliary verbs as defined by (Dubois 2017) and (Gross 1999), how to describe them with a NooJ grammar similar to the ones described by (Silberztein 2015), and how to use the resource to annotate sentences’ main predicates automatically.

References

Dubois J., Dubois-Charlier F. (2017). Les Verbes Français. Disponible à partir du site www.modyco.fr.

Gross M. (1999). “Sur la définition d'auxiliaire du verbe”. Langages, 8-21.

Harris Z. (1968). Mathematical Structures of Language. J. Wiley: New York.

Silberztein M. (2015, June). “Joe loves Lea: transformational analysis of direct transitive sentences”. In International Conference on Automatic Processing of Natural-Language Electronic Texts with NooJ (pp. 55-65). Springer, Cham.

Silberztein M. (2016). Formalizing Natural Languages: the NooJ approach. Wiley-ISTE Ed.: Hoboken, NJ, USA.


Part 3: Corpus Linguistics and Discourse Analysis


The contribution of NooJ to digital surveillance: The assimilation of the major issues that characterized the U.S. presidential elections,

Nana Ama Ampomah Awuah

Université de Franche-Comté, France [email protected]

Abstract

In a world where technology becomes intertwined with information, linguists devise means to follow the information that becomes trending. However, it is difficult to identify and assimilate the salient issues that clearly represent an event, due to the high number of articles published daily on the internet. The current study is interested in digitally monitoring the pivotal issues that characterized the 2020 U.S. elections.

A collection of online news articles over a specific period of time was used to carry out this study. We selected 51 articles published between January and November 2020 on Google News. The selected articles covered all the developments of the U.S. presidential elections. All selected articles were in English and offered up-to-date, in-depth information on the U.S. elections.

Five grammars were created to carry out the research. Each consisted of a main word and its synonyms, combined into a grammar that could be applied to the corpus. Digital text and statistical analysis were carried out with the aid of NooJ. The standard score (z-score) was used to calculate the probability of the grammars occurring in a normal distribution and to compare grammars from different distributions.
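The standard score itself is simple to compute: each count is expressed as its distance from the mean, in units of standard deviation. The sketch below assumes hypothetical per-article match counts for one grammar; the figures are placeholders chosen for illustration, not results from the study, and the function name standard_scores is likewise an assumption.

```python
from statistics import NormalDist, mean, pstdev


def standard_scores(counts):
    """Return the z-score of each count: (x - mean) / population std deviation."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # All counts are identical, so none deviates from the mean.
        return [0.0 for _ in counts]
    return [(c - mu) / sigma for c in counts]


if __name__ == "__main__":
    # Placeholder counts of grammar matches in eight articles (not real data).
    counts = [2, 4, 4, 4, 5, 5, 7, 9]
    for count, z in zip(counts, standard_scores(counts)):
        # Probability of a value at or below z under a standard normal distribution.
        print(count, z, NormalDist().cdf(z))
```

Because z-scores are unit-free, scores obtained from grammars with very different raw frequencies can be placed on the same scale and compared.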

The research outlined feminism and racism, coronavirus and the media as key elements in last year’s U.S. elections. Statistical analysis revealed correlations and anti-correlations between some of these elements. Digital surveillance of information with a high-tech approach enhances and simplifies the recognition of important factors that categorize an event notwithstanding mass publication on the internet.

References

Gucul-Milojević S., Radulović V., Krstev C. (2008). “Usage of NooJ Graphs and Annotation for Information Extraction” In: Proceedings of the 2007 International NooJ Conference (Barcelona, Spain). Edited by X. Blanco and M. Silberztein. Cambridge Scholars Publishing, Newcastle, UK: 103-120

Karl D., Bekavac B., Raffaelli I. (2019). “A Construction Grammar Approach in the NooJ Framework: Semantic Analysis of Lexemes describing Emotions in Croatian” In: Formalising Natural Languages with NooJ 2018 and its Natural Language Processing Applications. Springer CCIS Series. I. Mauro Mirto, M. Monteleone, M. Silberztein Eds.

Moirand S. (2018). “L’apport de petits corpus à la compréhension des faits d’actualité”, Corpus [En ligne], consulté le 19 avril 2019. URL: http://journals.openedition.org/corpus/3519

NooJ: http://www.nooj-association.org/index.html


The designation and storytelling of the culprits during the French Yellow Vests movement. A study of 2018-2019 books of grievances,

Marion Bendinelli

ELLIADD – pôle DTEPS, Université Bourgogne Franche-Comté, France [email protected]

Abstract

Our paper deals with the identification, denomination and narration of culprits and enemies in the popular voice expressed during modern social movements.

French society, though recognized as a rich and democratic society, does not escape the fate of most present-day western societies: the level of mistrust in institutions stands at an all-time high, especially in times of economic and social crisis. In autumn 2018, the social situation in France was explosive: calls on social networks to block the country multiplied, and the voices of the unheard and/or invisible came to light. Anger, resentment and sadness merged and gave birth to the Yellow Vests movement. To respond to the distress and calm the situation down, the President of the French Republic, Emmanuel Macron, opened a “Great Debate” in mid-December to offer citizens different ways to express themselves.

Books of grievances were made available throughout the country; people freely came to write down their wishes and requests, their analysis and understanding of the country’s economic, social and environmental situation, as well as to describe their everyday life. Whatever the mode of enunciation, these books were an opportunity to designate those they held responsible, be they politicians, businessmen, CEOs of listed companies, intellectuals, etc.

Our study focuses on Dole’s book of grievances, which was digitized and transcribed in spring 2019. Using NooJ software, we first produce a dictionary of the actors that interest us (the ones designated as culprits), enriched with semantic annotations referring, for example, to their domain of activity and hierarchical level. Then, to model the narrative frames in which these actors appear, we build grammars that take into account their syntagmatic position(s) and grammatical function(s). Finally, we compare those frames depending on the types of accounts (personal testimonies or professional expertise) displayed in the book. This work will complete a previous study, couched in the framework of discourse analysis and textometry.

References

Bendinelli M., Rico-Perrier A. (2020). “Sens et matérialités d'un cahier citoyen. Approche argumentative et textométrique du cahier de la ville de Dole (Jura)”, Communication aux journées d’études “Quels outils d’analyse pour les Gilets jaunes ? ”, organisées par les réseaux METSEM et MATE-SHS, Sciences Po Paris, 16-17 janvier 2020.

Consortium Roland Berger, Blue Nove, Cognito (2019). “Le Grand Débat National, Analyse des contributions libres : cahiers citoyens, courriers et emails, comptes-rendus des réunions d'initiative locale. Rapport final”. URL : < https://www.grand-debat.org/wp-content/uploads/2019/04/synthese-rapport-final.pdf >, consulté le 23.08.2020.

Perrineau P. (2020). “Regard d’un garant du Grand débat”, communication aux journées d’études “Quels outils d’analyse pour les Gilets jaunes ?”, organisées par les réseaux METSEM et MATE-SHS, Sciences Po Paris, 16-17 janvier 2020.

Silberztein M. (2015). La formalisation des langues. L'approche de Nooj. Londres : ISTE Editions.


Uses and potential of mobile devices in francophone sub-Saharan Africa, Magali Bigey, Ibrahim Maïdakouale

ELLIADD-CCM Laboratory, University Bourgogne Franche-Comté, France [email protected] , [email protected]

Abstract

Nowadays, the proliferation of Information and Communication Technologies (ICT), their constant evolution and the diversity of services offered generate a craze but also raise questions about the scope of possibilities. The adoption and growing use of these mobile devices (cell phone, computer, internet) considerably transform the individual's work postures by abolishing or reducing spatio-temporal boundaries (Besseyre des Horts & Isaac, 2006).

DISTICs are now attributed an imaginary social "power" that the researcher cannot neglect (Flichy, 2001). Nevertheless, a critical approach makes it possible, with hindsight, to go beyond the accompanying discourses held by equipment manufacturers and international institutions as well as governments, to study in depth the sociology of uses, the contexts of uses as well as the stakes associated with them.

In short, this would mean that there are technological devices that are being put in place in Africa, and more particularly in Niger, but which do not correspond to the expectations and needs of users, and so users reinvest them to do different things. For many manufacturers of communication machines, it is a matter of duplicating the Western model in Africa, but this cannot work in this way in Niger (Maïdakouale & Kiyindou, 2015). This is because there is a lack of data that concretely determines the expectations and needs of users. The original use of these systems is therefore subject to "tinkering" and "detour" by African users (de Certeau, 1990; Perriault, 1989).

The data collected in this study is aimed essentially at "identifying uses" and highlighting users' expectations so that DISTICs can be successfully integrated into the socioeconomic fabric of Niger.

Corpus processing system

During this study, we conducted nearly 70 audio-visual interviews.

The analysis of the first data collected was carried out using Advene software (for the transcription of the interviews) and NooJ software (for the lexical and semantic analysis of the interviews). We worked on the most common meanings and digrams, and on concordance analysis around the representative ICT lexicon and its use by the respondents. We were also interested in ethos in discourse, that is, the positioning of people in relation to their ICT practice. The use of this software has proved indispensable insofar as it makes it possible to extract from the corpus the most fine-grained evidence possible to test our hypotheses.

References

Besseyre des Horts C.-H., Isaac H. (2006). “L'impact des TIC mobiles sur les activités des professionnels en entreprise”. Revue française de gestion, 32 (168-169), p. 243-266.

Bigey M. (2018). “Twitter et l’inscription de soi dans le discours. L’ethos pris au piège (ou pas) de la frontière sphère privée/sphère publique”, Les Cahiers du numérique, vol. 14, no. 3, p. 55-75.

De Certeau M. (1990). L'invention du quotidien. Arts de faire, Luce Giard, Vol. 1-1, Paris Gallimard.

Flichy P. (2001). L'imaginaire d'Internet. Paris Gallimard, La découverte.

Maïdakouale I., Kiyindou A. (2015). Usages des technologies numériques mobiles chez les jeunes entrepreneurs au Niger : Cas de Niamey [Mémoire de master II - Recherche Communication, réseau et société]. Université Bordeaux Montaigne. Bordeaux, France.

Perriault J. (1989). La logique de l'usage. Essai sur les machines à communiquer. Paris Flammarion.

Silberztein M. (2016). Formalizing Natural Languages: the NooJ approach. John Wiley & Sons.


Sensitivity to fake-news: reception analysis with NooJ and ATISHS, Magali Bigey, Justine Simon

ELLIADD Laboratory, University of Franche-Comté [email protected], [email protected]

Abstract

The power of the image is to touch the deepest feelings of the audience (indignation, compassion, sublimation, fear, hatred, astonishment, etc.). Internet users, who are often drowned in a permanent flow of information, are not led to analyze an image. They are simply touched, affected by the content. This appeal to emotion is a favorable ground for the deployment of post-truth. Post-truth is a situation where the reality of facts influences public opinion less than the appeal to emotions and personal beliefs. Emotion prevails; truth has become secondary.

We decided to undertake a reception study with students of the Information-Communication OTC, aiming to understand their (in-)capacity to discern fake news in a digital context, their verbalization of personal emotions and the forms of self-involvement in the discourse.

Having collected, over two consecutive years, the reactions of about 80 students to several kinds of pictures in context and out of context, we will show how NooJ allows us to bring out personal involvement in discourse, and the differences in students’ permeability once they have been made somewhat aware of this issue. We will use lexicon analysis and concordance analysis, but also semantic information analysis (on expressed feelings, for example). We have been interested in ethos in discourse, and we will see that it is often associated with emotional reactions, making a clear link with the emotional web that has been strongly emerging in recent years.

References

Amossy R. (2010). La présentation de soi. Ethos et identité verbale, Paris, Presses universitaires de France.

Bigey M. (2018). « Twitter et l’inscription de soi dans le discours. L’ethos pris au piège (ou pas) de la frontière sphère privée/sphère publique », Les Cahiers du numérique, 14, p. 55 – 75.

Bigey M., Simon J. (2021). « Désinfoxiquer les images sur les réseaux socionumériques. Vers une démarche empirique d’éducation à l’image », in Bonfils Ph., Dumas Ph., Remond É. et Stassin B. (dirs), L’éducation aux médias tout au long de la vie : des nouveaux enjeux pédagogiques à l’accompagnement du citoyen, Colloque International TICEMED 12, 7 – 9 Avril 2020, Athènes, Grèce, 978-2-492969-00-3. ⟨halshs-03206274v2⟩, pp. 198-206.

Mercier A., dir. (2018). « Fake news et post-vérité : tous une part de responsabilité ! », Fake news et post-vérité, E-book publié par le site The Conversation, p. 4 – 8.

Sassoon V. (2018). « Éduquer les jeunes aux images, un enjeu de citoyenneté », La Revue des Médias. Accès : https://larevuedesmedias.ina.fr/eduquer-les-jeunes-aux-images-un-enjeu-de-citoyennete.

Silberztein M. (2015). La formalisation des langues, l’approche NooJ, ISTE Edition.


Creation of a legal domain corpus for the Belarusian NooJ module: texts, dictionaries, grammars,

Yuras Hetsevich, Yauheniya Zianouka, Valerii Varanovich, Mikita Suprunchuk, Tsimafei Prakapenka, Dmitrii Dzenisiuk

United Institute of Informatics Problems, Minsk, Belarus [email protected], [email protected], [email protected]

Belarusian State University, Minsk, Belarus [email protected], [email protected]

Minsk State Linguistic University, Minsk, Belarus [email protected]

Abstract

The current language situation in the Republic of Belarus is characterized primarily as state bilingualism. At the legislative level, two state languages are established: Belarusian and Russian. But despite this state bilingualism, the vast majority of legislative documents are issued only in Russian. Thus, of the 26 codes of the Republic of Belarus available on the National Legal Internet Portal pravo.by, 25 are officially adopted in Russian and only one in Belarusian. One of the main factors hindering the practical support of bilingualism in the legal sphere of the Republic of Belarus is the unresolved problem of high-quality and fast linguistic processing of large texts, which testifies to the relevance of high-quality machine translation. To address the translation of legislative documents into Belarusian, the Speech Synthesis and Recognition Laboratory of UIIP NASB, in cooperation with specialists from the Faculty of Social and Cultural Communications of BSU, has translated all codes of the Republic of Belarus into the Belarusian language using the automatic services of corpus.by.

The next step is to collect all legislative codes of the Republic of Belarus in the Belarusian language in order to create a unified text corpus. This corpus is relevant for the following tasks. First of all, it is very important to perform a primary analysis of the legal domain corpus to find out the main linguistic peculiarities of this kind of corpus in comparison with the Belarusian literary corpus. Secondly, we will be able to compile different types of dictionaries (Belarusian-Russian, Belarusian-English, Belarusian-English-Russian dictionaries). This issue is highly topical for Belarusian, since there are very few translated Belarusian-foreign and foreign-Belarusian dictionaries of legal terminology: we are aware of 6 dictionaries that differ in their merits and significance for the ordering and development of Belarusian legal terminology. The last, but not least, task is to develop special morphological and syntactic grammars for further prosodic analysis of legal texts. Automatic syntagmatic delimitation is still not solved for the Belarusian language. That is why developing NooJ grammars will assist the process of creating a system of prosodic marks (including punctuation and intonation marks) and the further automatic segmentation of Belarusian texts of the legal domain.

References

National Legal Internet Portal of the Republic of Belarus (2019). List of legal acts [Electronic source]. Mode of access: https://pravo.by/document/?guid=3871&p0=H11800130. – Date of access: 18.07.2019.

Computational platform for electronic text and speech processing corpus.by (2019). [Electronic source]. Mode of access: http://corpus.by/. – Date of access: 12.07.2018.

Drahun A. (2019). “Semi-Automatic Proofreading of Belarusian and English texts”, A. Drahun, Yu Hetsevich, A. Bakunovich, Dz. Dzenisiuk, J. Shynkevich, In: International Conference NooJ 2019: Book of Abstracts. –Hammamet, Tunisia.

Hetsevich Y. (2016). “Semi-automatic Part-of-Speech Annotating for Belarusian Dictionaries Enrichment in NooJ”, Y. Hetsevich, V. Varanovich, E. Kachan, I. Reentovich, S. Lysy, In: Automatic Processing of Natural-Language Electronic Texts with NooJ: 10th International Conference, České Budějovice, Czech Republic, June 9-11, Revised Selected Papers, ed. L. Barone, M. Monteleone, M. Silberztein, Springer, 2017, p. 101-111.

Reentovich I. (2016). “The First One-Million Corpus for the Belarusian NooJ Module”, I. Reentovich, Y. Hetsevich, V. Voronovich, E. Kachan, H. Kozlovskaya, A. Tretyak, U. Koshchanka, In: Automatic Processing of Natural-Language Electronic Texts with NooJ: 9th International Conference, NooJ 2015, Minsk, Belarus, June 11-13, Revised Selected Papers, Springer, ed. T. Okrut, Y. Hetsevich, M. Silberztein, H. Stanislavenka, Springer International Publishing, p. 3-15.


Negation usage in Croatian Parliament, Kristina Kocijan, Krešimir Šojat

Faculty of Humanities and Social Sciences, University of Zagreb, Croatia {krkocijan | ksojat}@ffzg.hr

Abstract

At the center of this research is an analysis of negation usage in the Croatian Parliament. Research shows that, psychologically speaking, it is much harder to process a negative word followed by a positive adjective (e.g. he is not happy) than an adjective with a negative prefix (e.g. he is unhappy). We plan to investigate how negation is used among Croatian politicians during the Parliament Sessions and whether its usage depends on the speaker’s gender, Party preference or the time when the Session was held.

Phonograms of Croatian Parliament sessions have been available since 2003. Each phonogram records the date of a Session, the speaker and his/her Party, followed by the speech. We have selected 4 points of reference per year since 2003 (January, May, September, December) for which data exist, in order to test whether the time period of the Session (just before and after the holidays) has an impact on the usage of negation. The corpus is about 28 million tokens in size. Additionally, from this data, we were able to extract different sub-corpora according to the gender of the speaker and the Party.

For the purpose of this experiment, a syntactic grammar will be designed with the aim of annotating different types of negation on a sentence level:

a) negative verb + positive adjective [nije sretan – en. isn’t happy];

b) positive verb + negative adjective [je nesretan - en. is unhappy];

c) negative verb + negative adjective [nije nesretan – en. isn’t unhappy].
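The three-way typology above can be sketched as a small classifier. This is only an illustration of the categories, not the NooJ syntactic grammar the authors plan to design: it assumes the verb and the adjective have already been identified, treats "nije" as the only negative verb form, and naively takes any adjective starting with "ne" to carry the negative prefix (which would misfire on adjectives that merely begin with those letters). The names NEGATIVE_VERBS and negation_type are hypothetical.

```python
# Assumption for illustration: a tiny closed list of negative verb forms.
NEGATIVE_VERBS = {"nije"}


def negation_type(verb, adjective):
    """Return 'a', 'b' or 'c' following the typology above, or None if no negation."""
    negative_verb = verb.lower() in NEGATIVE_VERBS
    # Naive prefix test; a real grammar would consult a dictionary instead.
    negative_adjective = adjective.lower().startswith("ne")
    if negative_verb and negative_adjective:
        return "c"  # negative verb + negative adjective
    if negative_verb:
        return "a"  # negative verb + positive adjective
    if negative_adjective:
        return "b"  # positive verb + negative adjective
    return None


if __name__ == "__main__":
    for pair in [("nije", "sretan"), ("je", "nesretan"), ("nije", "nesretan")]:
        print(pair, negation_type(*pair))
```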

References

Sherman M.A. (1973). “Bound to be easier? The negative prefix and sentence comprehension”, Journal of Verbal Learning and Verbal Behaviour, 12, p. 76-84.

Silberztein M. (2016). Formalizing Natural Languages: The NooJ Approach, Cognitive science series, Wiley-ISTE, London, UK.

Žanpera N. (2020). Računalno prepoznavanje i označavanje negacije u hrvatskom (Recognizing and Annotating Negation in Croatian), MA Thesis, University of Zagreb, Faculty of Humanities and Social Sciences, Retrieved from [https://urn.nsk.hr/urn:nbn:hr:131:050307].

Žanper N., Kocijan K., Šojat K. (2020). “Negation of Croatian Nouns” In: Formalizing Natural Languages with NooJ 2019 and Its Natural Language Processing Applications (H. Fehri, S. Mesfar, M. Silberztein Eds.) Communications in Computer and Information Science, vol. 1153, p. 52-64.


From laws and decrees to a legal ontology, Ismahane Kourtin, Aziz Mouloudi, Samir Mbarki

ELLIADD Laboratory, Bourgogne-Franche-Comté University, Besançon, France EDPAGS Laboratory, Faculty of Science, Ibn Tofail University, Kenitra-Morocco

[email protected], [email protected], [email protected]

Abstract

The constantly increasing mass of information in the legal field has generated a pressing need to organize and structure the content of the available documents, and thus transform them into an intelligent guide capable of providing complete and immediate answers to queries in natural language, and of promoting the development of new forms of collective intelligence. Question-Answering Systems (QAS)¹ meet this need by offering different mechanisms to provide adequate and precise answers to questions expressed in natural languages. The general context of our work is the construction of a QAS in the legal field based on ontologies², allowing users to ask a question about the desired information using natural language without having to browse through the documents. In this article, we will focus on the construction of a legal ontology based on laws and decrees³. The legal ontology, which we propose to build from laws and decrees, will bring together the terminological material in order to optimize the automated management of laws and decrees, particularly when transforming users' natural-language questions into SPARQL queries on the one hand, and when looking for answers to users' questions on the other. We have adopted a methodological framework in seven steps for the construction of the legal ontology:

- We have built a sample of laws and decrees from which we manually extracted the legal entities. Then we studied the syntactic forms of the extracted legal entities, and we developed the syntactic grammars with NooJ⁴ that will be used to automatically extract legal entities from a corpus of laws and decrees.

- We have developed syntactic grammars to extract candidate legal entities for the construction of the legal ontology.

- We have eliminated the false positives from the candidate legal entities.

- We have extracted and identified the semantic relations between legal entities.

References

Hirschman L., Gaizauskas R. (2001). Natural language question answering: the view from here. Natural Language Engineering, 7(4):275–300. URL http://dl.acm.org/citation.cfm?id=973891.

Gruber T. (1993). A translation approach to portable ontology specifications. Knowledge acquisition, 5(2):199–220. http://secs.ceas.uc.edu/~mazlack/ECE.716.Sp2011/Semantic.Web.Ontology.Papers/Gruber.93a.pdf.

Nico Borst W.: Construction of engineering ontologies for knowledge sharing and reuse. Universiteit Twente, 1997. http://doc.utwente.nl/17864.

Mondary T., Després S., Nazarenko A., Szulman S. (2008). Construction d’ontologies à partir de textes : la phase de conceptualisation.

Zaidi–Ayad S. (2013). Une plateforme pour la construction d’ontologie en arabe : Extraction des termes et des relations à partir de textes (Application sur le Saint Coran).

Silberztein, M. (2003). NooJ Manual. Available at: www.nooj-association.org. Updated regularly.

Silberztein, M. (2016). Formalizing Natural Languages: the NooJ approach. Wiley-ISTE Ed.: Hoboken, NJ, USA.

¹ See (Hirschman, Gaizauskas, 2001).
² See (Gruber, 1993) and (Nico Borst, 1997).
³ See (Mondary et al. 2008) and (Zaidi-Ayad 2013).
⁴ See (Silberztein 2003) and (Silberztein 2016).


Festival volunteers, committed festival-goers or the legacy of cultural practice,

Stéphane Laurent

IUT Belfort Montbéliard – ELLIADD, University of Franche-Comté, France [email protected]

Abstract

The official definition of volunteering was submitted in 1993 by the Economic, Social and Environmental Council: «A volunteer is any person who freely commits himself or herself to carry out an unpaid action towards others, outside his or her professional and family time». Surveys differ, in particular, on the duration of this commitment. It is therefore difficult to be sure of the figure; in France, however, there are more than 11 million volunteers (INJEP, 2019).

The festival sector is no exception: it relies in large part on the commitment of its teams, of which on average 60% are volunteers, to allow these events to exist. There are nearly 300,000 volunteers across all live performance festivals in France (out of a total of more than 3,100 festivals, according to data.gouv).

As part of the SO'FEST project, a scientific team was mobilized on nearly 100 festivals across France, whatever their formats and aesthetics may be. We tried to draw the portrait of the volunteer in his singularities of commitment.

More than a hundred qualitative interviews have been conducted, right in the middle of the events to question these volunteers on the deep reasons for their commitment, their role in organizations, their cultural practices and their vision of the world through their family and social origins.

The use of NooJ has allowed us to identify, through occurrences and co-occurrences, what drives this motley crowd to commit themselves over long periods of time to events that are both ephemeral and intense. We will also cross-reference the corpus thus created with a statistical look at the recurrence of the lexicon according to the age or seniority of the volunteer. In addition to breaking down certain clichés about this social group, we will draw a parallel between the history of festival audiences and that of volunteers, in terms of what they intrinsically share and what distinguishes them.

References

INJEP (2019). Chiffres clés de la Jeunesse 2019.

Djakouane A., Negrier E. (2020) for France. Festival So’Fest Study – Dec 1st 2020.

Silberztein M. (2015). La formalisation des langues, l’approche de NooJ. ISTE Editions.


How to locate traces of subjectivity in diplomatic discourse with Nooj? The example of the French Ministry of Foreign Affairs,

Annabel Richeton

ELLIADD, University of Bourgogne Franche-Comté, France

Abstract

Our project, following studies on institutional discourse analysis (Oger et Ollivier-Yaniv 2006; Monte et Oger 2015), considers the discourse of the French Ministry of Foreign Affairs under the 5th Republic. Our corpus (nearly 11,000 words) gathers a sample of text data from the outward-facing side of diplomatic communication: it includes various monological and dialogical situations of communication addressed to the media and the general public, involving speeches, interviews, press releases, etc. by Ministers or Secretaries of State.

Following Villar (2006) or Yapparova, Ageeva and Adamka (2019), diplomatic discourse is characterized by its normalization, neutrality and standardization. As a consequence, a Ministry of Foreign Affairs’ speech shouldn’t include any subjective unit, be it affective or evaluative. Yet, subjectivity may show up with the creation of (new) words, used to express biases and opinions since prefixes and/or suffixes do change the original root meaning.

In this study, we will focus on the word “Europe” and its derivations. Using NooJ (Silberztein 2015), we will first build a grammar able to recognize all the forms derived from the French name “Europe”, including single wordforms and multi-word units: the adjectives européen, pro-européenne, anti(-)européens, europhobe, etc.; the nouns or names européanisation, alter-européanisme, eurocratie, Union européenne, etc.; the verbs européiser, déseuropéaniser, etc. Then, NooJ will help us locate each wordform in the corpus and determine its context(s) of use thanks to the concordancer, with which we will identify the types of sentences in which each word appears and their syntactic functions and roles. Finally, we will compare these data with the corpus parameters so as to look at a possible evolution of the institutional discourse driven by, or depending on, among other variables, the presidential mandates, the ministers and their political party, and the Government of the day.
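To give a rough idea of what locating such derived forms involves, here is a minimal regular-expression sketch. It is not the NooJ grammar described above: the prefix list is an assumption drawn only from the examples quoted in the abstract, the names PREFIXES, EUROPE_FORM and find_europe_forms are hypothetical, multi-word units such as Union européenne are only matched on their derived word, and unrelated words such as the currency plural "euros" would also be caught.

```python
import re

# Assumed prefix list, taken from the example wordforms quoted in the abstract.
PREFIXES = ("pro", "anti", "alter", "dés")

# An optional prefix (with or without hyphen) followed by a word built on "euro".
EUROPE_FORM = re.compile(
    r"\b(?:(?:" + "|".join(PREFIXES) + r")-?)?euro\w+",
    re.IGNORECASE,
)


def find_europe_forms(text):
    """Return every wordform in the text that looks derived from 'Europe'."""
    return EUROPE_FORM.findall(text)


if __name__ == "__main__":
    sample = "Les anti-européens et les europhobes veulent déseuropéaniser la neurologie."
    print(find_europe_forms(sample))
```

Anchoring the pattern at a word boundary keeps it from firing inside words like neurologie, which contain the letters "euro" by accident; a dictionary-based grammar avoids such issues more reliably.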

Our study helps confirm the usefulness of NLP approaches, and especially of NooJ, in the field of discourse analysis.

References

Monte M., Oger C. (2015). “La construction de l’autorité en contexte. L’effacement du dissensus dans les discours institutionnels”, Mots. Les langages du politique, vol. 107, p. 5-18. https://www-cairn-info.scd1.univ-fcomte.fr/revue-mots-2015-1-page-5.htm

Oger C., Ollivier-Yaniv C. (2006). “Conjurer le désordre discursif. Les procédés de « lissage » dans la fabrication du discours institutionnel”, Mots. Les langages du politique [En ligne], 81, mis en ligne le 01 juillet 2008, http://journals.openedition.org/mots/675

Silberztein M. (2015). La formalisation des langues : l’approche de NooJ, ISTE : Londres, 425 p.

Villar C. (2006). Le discours diplomatique, Paris, L’Harmattan, 284 p.

Nagimovna Yapparova V., Viktorovna Ageeva J., Adamka P. (2019). “Verbal Politeness as an Important Tool of Diplomacy”, Journal of Politics and Law, Vol. 12, No. 5, Canadian Center of Science and Education.


Terms and Appositions: What unstructured texts tell us, Giulia Speranza, Maria Pia Di Buono, Johanna Monti

University of Naples “L’Orientale” - UNIOR NLP Research Group, Italy {gsperanza, mpdbuono, jmonti}@unior.it

Abstract

Specialized texts contain a high percentage of terms, which are very often obscure to laypeople, making text comprehension a cumbersome task. Indeed, the final users may lack the domain knowledge needed to interpret and correctly understand specialized information. Nonetheless, some linguistic devices can help, such as appositions, whose main function is to clarify terms through a description that simplifies the specialized terminology.

Starting from the assumption that, from a linguistic perspective, information in texts can be encoded in semi-fixed linguistic structures according to the function and aim they seek to fulfil, our study focuses on investigating appositions as examples of semi-fixed elements in Italian, used to clarify terms in a specialised domain, namely archaeology.

Appositive structures or appositions – sometimes also called parentheticals – have been studied and defined in several ways according to different research perspectives. Most scholars agree on identifying appositions as constructions showing the juxtaposition of two or more noun phrases (NPs). Among the several metalinguistic proposals, we chose to follow the one proposed by Huddleston & Pullum (2002), who designate the first element of the appositive structure as anchor, e.g., frigidarium, and the second one as apposition, e.g., sala per il bagno freddo (cold bathing room).

Furthermore, as Meyer (1992) and Quirk et al. (1985) pointed out, although constructions with two or more NPs are by far the most frequent, an appositional construction may also involve other syntactic categories.

On a syntactical and graphical level, appositions are also flagged with punctuation marks which enclose them, separating them from the main sentence (Burton-Roberts, 2006). Usually, appositions are placed between commas, but it is not rare to find them between brackets or dashes, which may contain even a single word. Among the punctuation marks, brackets seem to have the strongest separation effect, with the consequence of either drawing less attention to their content, which is then considered accessory and secondary, or, alternatively, giving it more prominence.

Indeed, appositions are pragmatically and semantically used with an explicative function, aimed at providing additional information about the anchor they refer to or at reformulating previous concepts, by means of relations of synonymy, hyponymy, etc. (Cf. Meyer, 1992).

Our hypothesis is that, due to their semi-fixed, easily recognizable structure and the semantic richness they convey, appositions are suitable linguistic constructions from which to derive additional and valuable information about technical terminology.

Once several patterns identifying different types of appositional structures have been defined, we aim at automatically extracting them from unstructured texts. More precisely, our case study is based on textual data in Italian from museums and archaeological sites (i.e., leaflets, brochures, guides, webpages), which we have gathered in a corpus and processed with NooJ (Silberztein, 2015) to perform text mining operations, setting up grammars capable of automatically retrieving appositive structures.

Our study aims to create a glossary of Italian archaeological technical terms from unstructured texts. Each term entry is enriched with a simplified version of the term, extracted from the appositive structures present in the texts, in the form of a synonym or variant, e.g., armille (bracciali), armillas (bracelets), or a paraphrase, e.g., pithoi (grandi contenitori simili alle giare), pithoi (large containers similar to jars).
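As a rough illustration of the extraction step, the sketch below uses a plain Python regular expression rather than a NooJ grammar, and covers only the bracketed pattern with a single-word anchor. The function name `extract_glossary` and the sample sentence are assumptions made for this example; appositions set off by commas or dashes, and multi-word anchors, would need richer patterns.

```python
import re

# One-word anchor followed by an apposition enclosed in round brackets.
# This is a simplification: it ignores commas, dashes and multi-word anchors.
BRACKETED = re.compile(r"(\w+)\s*\(([^()]+)\)")


def extract_glossary(text):
    """Map each anchor (lower-cased) to the appositions found after it."""
    glossary = {}
    for anchor, apposition in BRACKETED.findall(text):
        glossary.setdefault(anchor.lower(), []).append(apposition.strip())
    return glossary


# Hypothetical sentence built from the two examples quoted in the abstract.
sample = (
    "Sono state trovate armille (bracciali) e "
    "pithoi (grandi contenitori simili alle giare)."
)
for term, glosses in extract_glossary(sample).items():
    print(term, "->", "; ".join(glosses))
```

Keeping a list of glosses per anchor lets the same term accumulate several simplified variants across the corpus.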

References


Burton-Roberts N. (2006). “Parentheticals”. In K. Brown (ed.-in-chief), Encyclopaedia of Language and Linguistics. Amsterdam: Elsevier, 2nd edition, 179-182.

Huddleston R., Pullum G. (2002). The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press, 1, 23.

Meyer C. (1992). Apposition in contemporary English. Cambridge: Cambridge University Press.

Quirk R. et al. (1985). A comprehensive grammar of the English language. London: Longman.

Silberztein M. (2015). La formalisation des langues : l'approche de NooJ. ISTE Group.


The debates on the advent of the fifth generation of mobile telephony (5G), Yu Xia

University of Franche-Comté, France [email protected]

Abstract

In the 21st century, innovations, inventions and new technologies are springing up like mushrooms. In order to keep abreast of current events, it is necessary to keep a technological watch. What interests us in this article is the watch on 5G (the fifth generation of mobile telephony). As a new mobile network, 5G is provoking a great debate not only in France, but all over the world. But what is 5G? Who are the key telecommunications suppliers in the world? What are the stakes of 5G? In what areas will it contribute to our lives? These are the questions we are going to work on.

Our corpus, built from the Europresse database, consists of 100 articles dating from February 2019 to November 2020, published in the written press and on French websites. Using the NooJ software, we have built five grammars around five themes: 5G suppliers, health impacts, ecological risks, security issues and the implementation of telemedicine through 5G.

First, in building the grammar on 5G suppliers, we divided them into three groups: Asian, European and American suppliers. Based on our general knowledge and our texts, it was easy to name some suppliers, and we improved the grammar by adding new operators found in the concordance. The statistical analysis shows that the score in chapter 16 is higher than plus 2. The second theme is health. We constructed the grammar using nouns such as "virus" and adjectives such as "health". After examining the concordance, we were able to add nouns in front of the adjective "health", for example "health crisis", "health impact", etc. According to the statistical analysis, the peak for the theme "health" is found in the 93rd article of our corpus. We chose "environment" as the third theme and looked for environment-related words such as "climate", "ecological", etc. The peak in the statistical analysis lies between chapter 13 and chapter 14, i.e. in the 73rd article of our corpus. The fourth theme is security. The grammar consists of nouns such as 'cyber security' and verbs such as 'spy'. Similarly, a statistical analysis was carried out to observe whether the score was above plus two or below minus two: the theme 'security' is most prominent between chapters 19 and 20. The last theme is advanced telemedicine through 5G. This theme is more concrete and precise than the "health" theme: whereas the health theme deals in a general way with the risks of 5G from a health point of view, the last theme focuses on the implementation of 5G in daily life. When we talk about 5G, we usually think first of mobile phones rather than medicine. However, with the grammar related to medicine, we found that a certain period of time, around the 53rd article of our corpus, deals extensively with medicine.
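The "plus two / minus two" threshold can be illustrated with a small sketch. It assumes that the score in question is a standard score (the deviation of a segment's match count from the mean, divided by the standard deviation); the abstract does not spell out the formula, and the counts and function names below are invented for the example.

```python
from statistics import mean, pstdev


def standard_scores(counts):
    """Standard score of each segment's match count (0.0 if all counts are equal)."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return [0.0 for _ in counts]
    return [(c - mu) / sigma for c in counts]


def peaks(counts, threshold=2.0):
    """Indices of segments whose score is above +threshold or below -threshold."""
    scores = standard_scores(counts)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical number of matches of one thematic grammar in ten text segments.
matches_per_segment = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10]
print(peaks(matches_per_segment))
```

A segment flagged this way is only a starting point: the qualitative reading of the concordance is still needed to explain why the theme peaks there.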

Through statistical and qualitative analysis, we have investigated why these words were important and why they suddenly appeared at a given moment, in other words, the reason for the peak in each of these five themes. Many factors affect the appearance of a peak. We have identified four of them in this article: the political situation, public interest, political orientation and the date of publication of the articles.

As a result, carrying out such a technological watch allows us to better understand what 5G is, to discover its disadvantages as well as its advantages. In subsequent studies, we will try, on the one hand, to expand our corpus so as to not only study French written press and websites, but also newspapers from other countries; on the other hand, we will try to explore more themes on 5G such as democracy, politics, etc.

References

Fotopoulou A. (1996). “Analyse automatique des textes de spécialités et dictionnaires électroniques des termes de télécommunications : remarques sur la morphosyntaxe des termes composés”, In: Linx, n°34-35, Lexique, syntaxe…automatique. Hommage à Jean Dubois, sous la direction de Danielle Leeman et Serge Meleuc, p. 89-95. URL: https://www.persee.fr/doc/linx_0246-8743_1996_num_34_1_1418

Poulard F. (2008). “Analyse quantitative et qualitative de citations extraites d’un corpus journalistique”, Rencontre des Etudiants-Chercheurs en Informatique et en Traitement Automatique des Langues, Avignon. URL: https://www.researchgate.net/publication/29602434_Analyse_quantitative_et_qualitative_de_citations_extraites_d%27un_corpus_journalistique

Silberztein M. (2010). “La formalisation du dictionnaire LVF avec NooJ et ses applications pour l'analyse automatique de corpus”, Langages, vol. 179-180, no. 3, 2010, p. 221-241. URL: https://www.cairn.info/revue-langages-2010-3-page-221.ht

Silberztein M. (2016). Formalizing Natural Languages: the NooJ approach. Wiley Ed.


Part 4: Natural Language Processing Software Applications


Paraphrasing tool using the NooJ Platform, Amine Alassir, Sondes Dardour, Héla Fehri

ISG, University of Gabès, Tunisia [email protected]

MIRACL Laboratory University of Sfax, Tunisia [email protected], [email protected]

Abstract

The dissemination of scientific research is gaining momentum and provides an opportunity for novice and established researchers to contribute to their field. Therefore, there is a great and growing demand for paraphrasing tools, which can effectively and efficiently aid researchers in rewriting sentences to avoid plagiarism. Moreover, paraphrasing is useful in several NLP applications such as question-answering (Lin and Pantel, 2001), summarization (Barzilay, 2003) or machine translation (Callison-Burch et al., 2006). The aim of paraphrasing is to generate different expressions with the same meaning.

Over the past decades, new paraphrasing tools such as Seomagnifier and Preposteo have appeared. Nevertheless, these tools give unsatisfactory results: for example, the generated sentence may be infelicitous and may not respect the specificities of the given language, such as its sentence structure. This issue can now be addressed thanks to new NLP tools, which use syntactic grammars to offer users several brief and concise solutions in real time.

The aim of this paper is to propose a method to paraphrase sentences in French. Our proposal is based on transducers and dictionaries. It consists in replacing some constituents of the sentence by synonyms or antonyms with negation, or in switching to the passive form. In our tool, the _dm dictionary is used with slight modifications. In this dictionary, synonyms of nouns, verbs and adjectives are added as well as antonyms of adjectives and verbs. In addition, we have divided these words into two parts – one part related to words starting with a vowel and another part for words that do not start with a vowel – in order to eliminate the ambiguity of the apostrophe form case. The linguistic resources are implemented using the NooJ platform (Silberztein, 2018). Experimentations of our paraphrasing tool show interesting results.
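The interaction between synonym substitution and the apostrophe form can be sketched as follows. This is plain Python, not the authors' NooJ transducers; the three-entry lexicon, the gender codes and the function names are assumptions made for the example, and only the singular definite articles le, la and l' are handled.

```python
import re

# Hypothetical toy lexicon: noun -> (synonym, gender of the synonym).
SYNONYMS = {
    "voiture": ("automobile", "f"),
    "ami": ("camarade", "m"),
    "livre": ("ouvrage", "m"),
}
VOWELS = set("aeiouyàâéèêëîïôùû")

# A singular definite article followed by the noun it introduces.
ARTICLE_NOUN = re.compile(r"\b(le |la |l')(\w+)", re.IGNORECASE)


def definite_article(noun, gender):
    """Elided article before a vowel, otherwise 'le ' or 'la ' by gender."""
    if noun[0].lower() in VOWELS:
        return "l'"
    return "le " if gender == "m" else "la "


def paraphrase(sentence):
    """Replace known nouns by their synonym and re-select the article."""
    def swap(match):
        article, noun = match.group(1), match.group(2)
        entry = SYNONYMS.get(noun.lower())
        if entry is None:
            return match.group(0)  # unknown noun: leave the text untouched
        synonym, gender = entry
        new_article = definite_article(synonym, gender)
        if article[0].isupper():
            new_article = new_article.capitalize()
        return new_article + synonym

    return ARTICLE_NOUN.sub(swap, sentence)


print(paraphrase("La voiture est rouge."))
print(paraphrase("Il voit l'ami."))
```

Because "l'" hides the gender of the noun it replaces, the synonym's gender has to be stored with the synonym itself, which is one way to read the vowel/non-vowel split of the dictionary described above.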

References

Lin D., Pantel P. (2001). “Discovery of inference rules for question answering” In: Natural Language Engineering, 7(4):342–360.

Barzilay R. (2003). Information Fusion for MultiDocument Summarization: Praphrasing and Generation. Ph.D. thesis, Columbia University

Callison-Burch C., Koehn P., Osborn M. (2006). “Improved statistical machine translation using paraphrases” In: Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, p. 17– 24, New York City, USA.

Silberztein M. (2018). “Using linguistic resources to evaluate the quality of annotated corpora” In: Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, 2-11.


Recognition and analysis of complex questions in standard Arabic using NooJ,

Essia Bessaies, Slim Mesfar, Henda ben Ghazela Riadi

ENSI, University of Manouba, Tunisia [email protected], [email protected], [email protected]

Abstract

Nowadays, most question-answering systems have been designed to answer factoid or binary questions (short and precise answers such as dates, locations), and only a few studies address complex questions. In this paper, we present a method for analyzing complex questions. The analysis of the question asked by the user is performed by using syntactic and morphological patterns. By applying these patterns to the question, NooJ can annotate the question and its semantic features and extract the focus and topic of the question.

We start with the implementation of the rules to identify and annotate the various named entities. Our named entity recognizer (NER) tool is able to find references to people, places and organizations, which are used as targets to find the correct answer. The NER is embedded in our question-answering system. The QA task is divided into three phases: question analysis, segmentation and passage retrieval, and answer extraction. Each phase plays a crucial role in the overall performance. We use the NooJ platform as our linguistic development environment. The first evaluations show that the current results are encouraging and that the approach could be extended to further question types.

References

Low B.T., Chan K., Choi L.L., Chin M.Y, Lay S.L. (2001). “Semantic expectation-based causation knowledge extraction: A study on Hong Kong stock movement analysis”, In: David Cheung, Graham J. Williams, and Qing Li (Eds.), Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science: Vol. 2035, Springer, Berlin.

Khoo C., Chan S., Niu Y. (2000). “Extracting causal knowledge from a medical database using graphical patterns”, In: Proceedings of 38th Annual Meeting of the ACL, p. 336–343.

Mesfar S. (2007). Named Entity Recognition for Arabic Using Syntactic Grammars, NLDB.

Mesfar S. (2008). Analyse morpho-syntaxique automatique et reconnaissance des entités nommées en arabe standard, PhD Thesis, Franche-Comté University, France.

Silberztein M. (2003). NooJ Manual. Available for download at: www.nooj-association.org. Updated regularly.

Silberztein M. (2016). Formalizing Natural Languages: the NooJ approach. Wiley-ISTE Ed: Hoboken, NJ, USA.


Answer validation in question answering system, Essia Bessaies, Slim Mesfar, Henda ben Ghazela Riadi

ENSI, University of Manouba, Tunisia [email protected], [email protected], [email protected]

Abstract

In our research on closed-domain question-answering systems, every question expects an answer of an explicit type. In this paper, we present a method for verifying that an answer given by a system corresponds to the proper type. This verification method combines criteria for checking the adequacy between a response and a type. We work with different methods dedicated to verifying this adequacy.

In this paper, we describe the similarity between two modules or phases of a question-answering system: question analysis and answer extraction. This paper reports the design and implementation of a question-answering system for the medical domain and interactive question-answer phases. The system is able to provide all types of answers such as complex ones. The integration of an answer validation step within the answer extraction phase guarantees the extraction of the correct answer.

We start with the implementation of the rules which identify and annotate the various medical entities. Our named entity recognizer (NER) tool can find references to people, places, organizations, diseases and viruses; they are used as targets to extract the correct answer from the user. The NER is embedded in our question-answering system. Our system is divided into four phases: question analysis, segmentation and passage retrieval, answer validation and finally answer extraction. Each phase plays a crucial role in the overall performance.

Finally, preliminary evaluations support optimistic conclusions about the feasibility of our question-answering system for other closed domains such as economics and politics.

References

Girju R. (2003). Automatic Detection of Causal Relations for Question Answering, ACL.

Mesfar S. (2007). Named Entity Recognition for Arabic Using Syntactic Grammars, NLDB.

Tahri A., Tibermancine O. (2013). “DBPEDIA Based Factoid Question Answering System”, International Journal of Web & Semantic Technology, vol.4, p. 1-16.


The use of NooJ’s functionalities to build an application for Arabic acquisition,

Ilham Blanchete, Mohammed Mourchid

MIC research team, Laboratory MISC, IbnTofail University Kenitra, Morocco [email protected] , [email protected]

Abstract

The linguistic platform NooJ1 allows both computational linguists and IT developers to implement linguistic phenomena, to analyze sophisticated corpora, to build NLP resources, etc. It also offers other functionalities2 to take advantage of the pre-built resources. Furthermore, developers can use NooJ’s outputs (annotations) to develop information systems. This paper shows how developers can exploit linguistic resources to build software applications.

Resources that have been used to develop the application are:

1. Full verbs model:3 this model has been applied for a certain alphabet.

2. Masdars Model: linked to verbs’ model (nominalization).

3. Nouns and adjectives Model: applied for a certain alphabet.

4. Broken plurals model4: linked to their singular forms (nouns-adjectives).

We have used the abovementioned resources, which are all based on root and pattern approach, to develop an application that provides useful functionalities. This application could be used by those who are interested in Arabic language acquisition (teachers, students/learners, computational linguists, developers).

NooJ returns the results of the linguistic analysis in files of various types, which allows developers to extract linguistic data from these files and build their applications. Beyond that, software developers can also call the NooJ platform from their applications, systems, and websites using a specific functionality that NooJ provides. We call NooJ from our application to run the linguistic analysis on our corpus. The application performs the following tasks:

• the extraction of all verbs that share the same root (user entry); the different meanings of a selected verb appear, if any. The user can then run verb conjugation (we have taken into account all moods and tenses in the standard Arabic language). The results in this section are the possible conjugated forms and the linguistic features of the selected verb.

• the extraction of Masdar forms that share the same root, linked to their verbs.

• the extraction of nouns/adjectives that share the same root. The user can also extract the possible broken plural forms of a selected noun/adjective.

• Other tasks will be added to the application.

Sample uses of the application:

• Language acquisition: a teacher asks students to:
  - extract all action verbs that have a specific root class, e.g. hollow verbs (all possible verb classes are available);
  - conjugate one of the hollow verbs;
  - extract all Masdar forms of the previous verb.

• Linguist’s needs: a linguist will be able to extract all verbs, nouns, masdars, broken plurals, and adjectives of a given root.

Steps to implement the application:

1. Excel files, manual insertion of first five alphabet letters.

1 See (Silberztein 2016). 2 See (Silberztein, 2003). 3 See (Blanchete et alii, 2018). 4 See (Blanchete et alii, 2019).


2. “mini-convertor” that converts Excel files to NooJ dictionary format using Python.

3. NooJ platform to build linguistic resources, which are all based on root and pattern approach.

4. Python programming language to develop the above-detailed application.

5. noojapply.exe to execute noojapply CMD.

6. PyCharm to design GUIs.
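The conversion step can be sketched in a few lines of Python. This is not the authors' "mini-convertor": it assumes the spreadsheet has first been exported to CSV, and the column names (`lemma`, `category`, `flx`, `features`), the paradigm name `V_FAAL` and the transliterated lemmas are all hypothetical. The output follows the usual shape of a NooJ dictionary line, i.e. an entry, a comma, a category, then `+`-separated properties such as `FLX=...`.

```python
import csv
import io


def row_to_entry(row):
    """Build one dictionary line of the form: lemma,CATEGORY+FLX=PARADIGM+Feature."""
    parts = [row["category"]]
    if row.get("flx"):
        parts.append("FLX=" + row["flx"])
    # Extra features are stored in one cell, separated by '|' (an assumption).
    parts.extend(f for f in (row.get("features") or "").split("|") if f)
    return row["lemma"] + "," + "+".join(parts)


def convert(csv_text):
    """Convert CSV text with a header row into a list of dictionary lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_entry(row) for row in reader]


# Hypothetical export of two spreadsheet rows.
sample_csv = "lemma,category,flx,features\nkataba,V,V_FAAL,Tr\nkitab,N,,\n"
for line in convert(sample_csv):
    print(line)
```

Reading from text rather than from an `.xlsx` file keeps the sketch within the standard library; a real converter would also have to escape commas inside entries.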

This paper shows how both developers and computational linguists can use the NooJ platform and its functionalities to build software applications.

References

Blanchete I., Mourchid M., Mbarki S., Mouloudi A. (2018) Formalizing Arabic Inflectional and Derivational Verbs Based on Root and Pattern Approach Using NooJ Platform. In: Mbarki S., Mourchid M., Silberztein M. (eds) Formalizing Natural Languages with NooJ and Its Natural Language Processing Applications. NooJ 2017. Communications in Computer and Information Science, vol 811. Springer, Cham. https://doi.org/10.1007/978-3-319-73420-0_5

Blanchete I., Mourchid M., Mbarki S., Mouloudi A. (2019) Arabic Broken Plural Generation Using the Extracted Linguistic Conditions Based on Root and Pattern Approach in the NooJ Platform. In: Mirto I., Monteleone M., Silberztein M. (eds) Formalizing Natural Languages with NooJ 2018 and Its Natural Language Processing Applications. NooJ 2018. Communications in Computer and Information Science, vol 987. Springer, Cham. https://doi.org/10.1007/978-3-030-10868-7_3

Silberztein M. (2003). NooJ Manual. Available for download at: www.nooj-association.org. Updated regularly.

Silberztein M. (2016). Formalizing Natural Languages: the NooJ approach. Wiley & Sons.


Linguistics, applied research and NLP: using NooJ in a technical-operational context. Case-study, analysis and perspectives,

Nicolas Boffo, Philippe Lambert

Groupe TAL, CIMI, France [email protected]

Université de Lorraine, [email protected]

Abstract

NooJ is especially used to formalize languages and linguistic usages for research and didactic purposes, but also to be integrated into information processing technologies such as monitoring systems, semantic search engines, etc.

However, NooJ can also be considered as a major system component in a very functional way or operational context. Nowadays, more and more information systems connected to economic intelligence must technically respond to three important dimensions: reactivity, adaptability and interoperability. These constraints set up the basis of our operational approach.

The main purpose of our paper is to demonstrate the potential of NooJ not only for fundamental research in linguistics but also for operational objectives. We will first present the linguistic work carried out on an under-resourced language, Vietnamese, for which we built specific linguistic resources.

Then, we will present the first results of a visual representation of the information extracted by NooJ in technical-functional uses: dataviz, spatial visualization of events with the free and open-source Geographic Information System QGIS, and timeline visualization.

Finally, we will analyze the use of NooJ in a technical-operational context, pointing out its strengths and weaknesses in order to propose innovative ways of improving the tool.

References

Lambert P., Schwer S., Boffo N. (2012). “A new model of Time Expressions Detection and Annotation in Vietnamese: The hôm case” in International Conference on Asian Language Processing, Hanoi.

Lambert P., Sidhom S. (2010). « Using NooJ in a Process of Economic Intelligence: Producing Knowledge for Information Monitoring and Re-indexing Content », in NooJ 2010 Conference in Komotini (GR).

Schwer S., Boffo N., Lambert P. (2013). "Toward Time Universals Identification in NooJ: A New Model of Time Recognition in Vietnamese", Nooj International Conference, Saarbrücken - June 2013.

Silberztein M. (2015) "La formalisation des langues, L'approche de NooJ", ISTE 2015.

Lambert P., Fournié M., Ho Dinh O. (2010). « VIET4Nooj A Vietnamese module for Nooj », in NooJ conference 2010.

Ho B. Q., Chevallet J., Bruandet M. (2003). « Mise en place d’un Système de Recherche d’Informations en Vietnamien », in TALN 2003, 12.

Nguyen T. B. et al. (2004). « Developing tools and building linguistic resources for Vietnamese morpho-syntactic processing ».

Silberztein M. (2020). NooJ V4, HAL Id: hal-02435923.

Serrano L. (2014). "Vers une capitalisation des connaissances orientée utilisateur : extraction et structuration automatiques de l’information issue de sources ouvertes", HAL Id : tel-01082975, v. 1.

Charolles M. (2012). "Les cadres de discours et leurs frontières", HAL Id : hal-00665820, v. 1.


Construction of an educational game "VocabNooJ", Hela Fehri, Lazhar Arroum, Sameh Ben Aoun

MIRACL Laboratory, University of Sfax, Tunisia [email protected]

ISG Gabès, University of Gabès, Tunisia [email protected] ; [email protected]

Abstract

The rapid evolution of computer technology has created an upsurge of interest in various activities, especially educational games. With the introduction of games, educational learning takes on a new dimension. Both the teaching of educators and the learning of students have benefited from the new world of interactive games and applied sciences within the learning environment. The principle of game-based learning (GBL), which means integrating games into the instructional medium, can be used effectively to improve strategies for both learning and teaching.

Game-based education involves training and entertainment. Thus, the objective of educational games is to immerse children in educational activities while playing. Indeed, children learn best when their learning is combined with fun play. Strategy-based games stimulate the brain, encourage kids to learn new things, enhance their talents, and create an emotional bond with the subject matter.

The aim of this paper is to propose a game developed with the NooJ platform. It is a monolingual educational game that develops the French language abilities of children, and even adults, in a simple and fun way through a set of exercises whose difficulty varies from one level to another. The objective is to help children learn better through play, strengthen acquired knowledge, achieve pedagogical objectives and master the French language. The game has two essential steps, review and evaluation, which are independent but complementary. The first step, labelled "Review", consists in recognizing the synonyms and antonyms of given words. It is so named because it helps the gamer remember the synonyms and antonyms of a few words before beginning the "Evaluation" part, which enriches his or her vocabulary and helps him or her succeed in the second step. The second step, labelled "Evaluation", is composed of three exercises that differ in their level of difficulty. A user can only move from one level to the next after succeeding at the current level.

The implementation of this game is based on the _dm dictionary and on transducers. The synonym and the antonym of each entry are added to the _dm dictionary, and the transducers extract the synonyms and antonyms of the appropriate word. Thanks to its Java interface, the game is easy to play and does not require computer skills. The obtained results are encouraging and satisfactory.

References

Fehri H., Ben Messaoud I. (2020). “Construction of Educational Games with NooJ” In: H. Fehri et al. (Eds.): NooJ 2019, Hammamet, Tunisia. CCIS 1153, p. 173–184.

Silberztein M. (2015). La formalisation des langues : l’approche NooJ. Collection Sciences Cognitive et Management des Connaissances. Edition ISTE, London.

Trouilleux F. (2012). “A new French dictionary for NooJ: le DM” In: Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the NooJ 2011 International Conference, Cambridge Scholars Publishing, p. 16-28.


Automatic analysis of finding predicates from a Lexicon-Grammar proposal,

Javiera Jacobsen, Mirian Muñoz, Walter Koza, Francisca Saiz

Pontificia Universidad Católica de Valparaíso, Chile – Proyecto FONDECyT 1171033 [email protected], [email protected], [email protected], [email protected]

Abstract

Polysemy is one of the fundamental properties of human languages, which justifies the development of theories designed to understand it, such as the Generative Lexicon Theory (TLG) (Pustejovsky, 1995) and the Lexicon-Grammar (LG) (M. Gross, 1975, 1998; G. Gross, 2014). Regarding the latter, Gross' proposal establishes that a natural language must be described from the set of simple sentences that compose it. To this end, it proposes the creation of tables that gather lexical elements of a grammatical category that share properties (Tolone, 2012). In this way, it is possible to identify classes for the units that have common meanings (‘eat’, ‘devour’, ‘swallow’, etc.). However, the same unit can belong to more than one class, as is the case with ‘encontrar’ (‘find’), which belongs to at least three:

1.a. Juan encontró una moneda. (Juan found a coin).

1.b. Juan se encontró con María. (Juan met with María).

1.c. Juan se encuentra deprimido. (Juan is depressed).

Analysing the predicates therefore makes it possible to create resources for the automatic identification of meanings. This work aims to describe predicates of the type 'find', 'discover', 'locate', etc. in Spanish, within the LG framework, with the purpose of developing an algorithm in NooJ for their automatic analysis. Such a predicate seems to project an argument structure with three arguments: an “agent” (A0) finds “something” (A1) in “a place” (A2). The formal analysis consists in determining the semantic nature of these arguments and delimiting object classes. Subsequently, the possible transformations of the argument structure are determined. This information is then encoded in LG tables and computational models in NooJ (Silberztein, 2016).

The computer work consisted of: (i) creation of an electronic dictionary with relevant information for automatic detection and generation; (ii) creation of grammars for automatic detection; (iii) creation of grammars for automatic generation.

The automatic detection was tested on 42 predicates in 1,000 concordances (20 words – predicate – 20 words) extracted randomly from a corpus of 5 million words composed of biomedical texts. The results were as follows: 96% precision, 88% coverage and 92% F-measure.
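The three reported figures are mutually consistent if the F-measure is taken to be the harmonic mean of precision and coverage, with coverage read as recall; that reading is an assumption, since the abstract does not give the formula. A minimal check:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported above, with "coverage" interpreted as recall.
score = f_measure(0.96, 0.88)
print(f"{score:.2%}")
```

The harmonic mean stays closer to the lower of the two values, so the F-measure rounds to 92%.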

Additionally, the generative grammar produced 325 sentences.

References

Gross M. (1975). Méthodes en syntaxe. París : Hermann.

Gross M. (1998). Lexique, grammaires et cumulativité. Cahiers de l’institut de linguistique de Louvain, 24, 23-41.

Pustejovsky J. (1995) The generative lexicon. Cambridge, Massachusetts: MIT Press.

Silberztein M. (2016). Formalizing natural languages. The NooJ approach. Londres: ISTE.

Tolone T. (2012). Conversión de las tablas del Léxico-Gramática del francés en el léxico LGLex. Ponencia presentada en el 2nd Argentinian Workshop on Natural Language Processing (WNLP’11), Córdoba, Argentina.


Arabic spelling error detection and correction using NooJ,

Rafik Kassmi, Samir Mbarki, Abdelaziz Mouloudi

MISC Team, EDPAGS Laboratory, Faculty of Science, Ibn Tofail University, Kenitra-Morocco [email protected], [email protected], [email protected]

Abstract

Arabic has a rich and complex morphology, which may cause confusion and lead to erroneous text. These errors are generally divided into typographic, cognitive, and phonetic types.

According to Damerau (1964), 80% of these errors are due to one or more of the following causes:

- Insertion error: an extra character is added.

/ مكتتوب → مكتوب / makttūb → maktūb, written / the letter ت t is inserted in excess.

- Deletion error: a character is missing.

/ مدسة → مدرسة / madsah → madrasah, school / the letter ر r is missing.

- Substitution error: a character is replaced by another.

/ حديفة → حديقة / ḥadifah → ḥadiqah, garden / the letter ق q is mistakenly replaced by ف f.

- Transposition (or permutation) error: two characters are exchanged.

/ برح → بحر / barḥ → baḥr, sea / the letters ح ḥ and ر r are swapped.
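The four error types can be told apart mechanically by comparing the erroneous form with the intended one. The sketch below is an illustration in plain Python, not part of the authors' NooJ resources; the function name is hypothetical, and the Arabic strings are the four example pairs given above, entered in logical character order.

```python
def classify_single_error(typed: str, intended: str) -> str:
    """Name the single edit that separates `typed` from `intended`, if any."""
    if typed == intended:
        return "none"
    n, m = len(typed), len(intended)
    # Insertion: dropping one character of the typed form yields the intended word.
    if n == m + 1 and any(typed[:i] + typed[i + 1:] == intended for i in range(n)):
        return "insertion"
    # Deletion: dropping one character of the intended word yields the typed form.
    if n + 1 == m and any(intended[:i] + intended[i + 1:] == typed for i in range(m)):
        return "deletion"
    if n == m:
        diffs = [i for i in range(n) if typed[i] != intended[i]]
        if len(diffs) == 1:
            return "substitution"
        # Transposition: two adjacent positions hold each other's character.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typed[diffs[0]] == intended[diffs[1]]
                and typed[diffs[1]] == intended[diffs[0]]):
            return "transposition"
    return "other"


# (erroneous form, intended form) for the four examples.
examples = [
    ("مكتتوب", "مكتوب"),
    ("مدسة", "مدرسة"),
    ("حديفة", "حديقة"),
    ("برح", "بحر"),
]
for typed, intended in examples:
    print(typed, intended, classify_single_error(typed, intended))
```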

A spell checker is a tool that processes words, identifies spelling errors and helps to correct them. When the spelling of a word is in doubt, it suggests possible alternatives. It is either a standalone or an embedded tool, used in many natural language applications such as machine translation systems, OCR, search engines, and word processors.

The main purpose of this research is to implement a spell checker for the Arabic language that benefits from the power of the NooJ platform through its command-line program noojapply. We intend to use morphological and local grammars in NooJ to achieve this goal. This spell checker will:

- Detect all errors using the dictionary El-DicAr (Mesfar, 2008) combined with the improved morpho-syntactic grammar dealing with agglutination.

- Apply all correction grammars on detected errors.

- Suggest and classify possible corrections for each misspelled word.
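The three steps above can be sketched outside NooJ as follows. This is a minimal illustration that swaps in a plain Python word list and brute-force single-edit candidate generation in place of El-DicAr and the NooJ correction grammars. The toy lexicon, the working alphabet and the function names are assumptions made for the example, and agglutination is not handled.

```python
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهويةءأإآىئؤ"  # hypothetical working alphabet
LEXICON = {"مكتوب", "مدرسة", "حديقة", "بحر", "برج"}  # toy stand-in for a dictionary


def single_edits(word: str) -> dict[str, str]:
    """Map every string one edit away from `word` to the error type it repairs."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    out: dict[str, str] = {}
    for left, right in splits:
        if len(right) > 1:  # swap two adjacent characters back
            out.setdefault(left + right[1] + right[0] + right[2:], "transposition")
    for left, right in splits:
        if right:  # drop a character that was inserted
            out.setdefault(left + right[1:], "insertion")
    for left, right in splits:
        for c in ALPHABET:  # restore a character that was deleted
            out.setdefault(left + c + right, "deletion")
    for left, right in splits:
        if right:
            for c in ALPHABET:  # replace a substituted character
                out.setdefault(left + c + right[1:], "substitution")
    return out


def check(word: str) -> list[tuple[str, str]]:
    """Return [] for a known word, else (suggestion, error type) pairs."""
    if word in LEXICON:
        return []
    edits = single_edits(word)
    return sorted((cand, kind) for cand, kind in edits.items() if cand in LEXICON)


for w in ["بحر", "مكتتوب", "برح"]:
    print(w, check(w))
```

With this toy lexicon a misspelling can receive several suggestions of different types, which is where a classification step such as the one announced in the abstract becomes useful.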

References

Damerau F.J. (1964). “A technique for computer detection and correction of spelling errors”, Communications of the ACM, vol. 7, n° 3, 171-176.

Olani G., Midekso D. (2014). “Design And Implementation Of Morphology Based Spell Checker”, International Journal of Scientific & Technology Research, 3, p. 1-8.

Mesfar S. (2008). Analyse morpho-syntaxique automatique et reconnaissance des entités nommées en arabe standard, PhD thesis, Université de Franche-Comté, France.

Kassmi R., Mourchid M., Mouloudi A., Mbarki S. (2018). “Processing Agglutination with a Morpho-Syntactic Graph in NooJ”, Formalizing Natural Languages with NooJ and Its Natural Language Processing Applications, NooJ 2017, vol. 811, Springer, Cham, p. 40-51.

Silberztein M. (2003). NooJ Manual. Available for download at: www.nooj-association.org. Updated regularly.

Silberztein M. (2015). La formalisation des langues, l’approche NooJ, ISTE Editions, London.


Automatic identification and disambiguation of abbreviations in the medical domain,

Walter Koza, Ninoska Godoy, Constanza Suy, Romanet Contreras, Sofía Koza, Fernanda Aguirre, Martín Díaz

Pontificia Universidad Católica de Valparaíso, Chile – Project FONDECyT 1171033 [email protected], [email protected], [email protected],

[email protected]

Emerger Servicios Médicos SA, Argentina [email protected]

Hospital Alemán, Argentina [email protected], [email protected]

Abstract

Among the interests of computational linguistics is the analysis of knowledge domains; medicine is one of the most challenging domains because of the writing peculiarities of its specialists. One of the problems is the use of abbreviations, defined as the omission of some characters of a word or set of words (Benavent, Viana, Casanova, Villalba, Barberá & Moscardo, 2006; RAE, 2010). Since there is no agreement on the use of these terms, medical professionals use them as they see fit, creating ambiguity and making it difficult to identify the expression they refer to. Additionally, automatic extraction tasks are hampered, because ordinary automatic PoS-taggers (Connexor, Freeling and TreeTagger) are not able to properly disambiguate these expressions. Therefore, the aim of this work is to develop a methodology for the automatic detection of medical-domain abbreviations in natural language texts, based on a constraint grammar (CG) (Karlsson, 1995). To this end, electronic dictionaries containing specific abbreviations and their possible meanings were first created. Secondly, CGs were developed to select the correct meaning of each abbreviation, taking the syntactic context into account. For the computational models, we used the NooJ software (Silberztein, 2016). The methodology was tested on a corpus of 5,000 medical records anonymized by the Hospital Alemán (169,670 words), in which 8,381 abbreviations were detected. The results show a recall of 70.73%, a precision of 97.27% and an F-measure of 81.90%. We conclude that the proposed methodology is promising for the automatic recognition and disambiguation of abbreviations in the medical domain.
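The dictionary-plus-constraints idea can be illustrated with a small sketch. It is not the authors' NooJ implementation: the abbreviation entries, their expansions and the context rules below are illustrative assumptions, and the rules only inspect the two adjacent tokens rather than a full syntactic analysis.

```python
from dataclasses import dataclass

# Toy dictionary: abbreviation -> candidate expansions.
# Illustrative entries only, not taken from the authors' resources.
CANDIDATES = {
    "AP": ["antecedentes personales", "auscultación pulmonar"],
    "HTA": ["hipertensión arterial"],
}


@dataclass(frozen=True)
class Constraint:
    """Discard `reading` of `abbrev` when a neighbouring token is in `context`."""
    abbrev: str
    reading: str
    context: frozenset


# Hypothetical REMOVE-style constraints over the two adjacent tokens.
CONSTRAINTS = [
    Constraint("AP", "antecedentes personales", frozenset({"normal", "sin"})),
    Constraint("AP", "auscultación pulmonar", frozenset({"de", ":"})),
]


def disambiguate(tokens: list[str]) -> list[tuple[str, list[str]]]:
    """Return (abbreviation, surviving readings) for each known abbreviation."""
    results = []
    for i, token in enumerate(tokens):
        if token not in CANDIDATES:
            continue
        neighbours = {t.lower() for t in tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]}
        readings = list(CANDIDATES[token])
        for rule in CONSTRAINTS:
            # Never remove the last remaining reading.
            if (rule.abbrev == token and rule.reading in readings
                    and len(readings) > 1 and neighbours & rule.context):
                readings.remove(rule.reading)
        results.append((token, readings))
    return results


print(disambiguate("AP normal , paciente con HTA".split()))
```

An abbreviation with no matching context keeps all of its readings, which mirrors the way a constraint grammar leaves ambiguity in place rather than guessing.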

References

Benavent A., Viana A., Casanova F., Villalba C., Barberá P., Moscardo P. (2006). “Uso y abuso de abreviaturas y siglas entre atención primaria, especializada y hospitalaria”, Papeles médicos, 15(2), 29-37.

Karlsson F. (1995). “Designing a parser for unrestricted text” In: Constraint Grammar: A Language-Independent Framework for Parsing Unrestricted Text. Mouton de Gruyter, Berlin / New York.

RAE (2010). Ortografía. Madrid: Real Academia Española.

Silberztein M. (2016). Formalizing natural languages: The NooJ approach. London: ISTE.


Automatic detection and generation of argument structures of the medical domain,

Walter Koza, Constanza Suy Álvarez

Pontificia Universidad Católica de Valparaíso, Chile – Project FONDECyT 1171033 [email protected], [email protected]

Abstract

Representing the predicate-argument structure of the medical domain (EAMED) is important for the automatic analysis of texts (Cohen & Hunter, 2006; Shen et al., 2016). However, most works focus on English, and there is no agreement as to how it should be represented. This work aims to describe the EAMED projected by verbs of the medical domain, along with its transformational possibilities, in order to create resources for its computational analysis. To this end, the Lexicon-Grammar (LG) model is used (Gross, 1975; 1998); it proposes a method to describe natural languages based on three conditions: (i) the syntax is not separated from the lexicon; (ii) the simple sentence is considered the minimal unit of analysis; and (iii) the formalization corresponds to the distributional and transformational analysis (Elia, Monteleone & Marano, 2011). The LG proposes the creation of tables, represented as matrices containing: (i) in rows, the elements of the corresponding category; (ii) in columns, the syntactic and semantic properties; and (iii) at the intersection of a row and a column, the sign + or - depending on whether the lexical entry allows the property or not. This way, it is possible to determine object classes (Gross, 2012) (‘foods’, ‘metals’) from the meaning resulting from the argument structure. Subsequently, each simple sentence is analyzed according to its transformational properties (passive voice, nominalization, etc.).
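The table layout described here (entries in rows, properties in columns, + or - in each cell) maps naturally onto a small data structure. The sketch below is only an illustration: the verbs, the property names and every cell value are invented for the example and do not come from the authors' tables.

```python
# Hypothetical LG-style table: rows are lexical entries, columns are properties,
# and each cell holds "+" or "-". All values are invented for illustration.
PROPERTIES = ["N0_human", "N1_medicine", "passive", "nominalization"]
TABLE = {
    "prescribe": ["+", "+", "+", "+"],
    "swallow":   ["+", "+", "+", "-"],
    "heal":      ["+", "-", "-", "+"],
}


def allows(entry: str, prop: str) -> bool:
    """True if the table marks `prop` with "+" for `entry`."""
    return TABLE[entry][PROPERTIES.index(prop)] == "+"


def entries_with(*props: str) -> list[str]:
    """Entries that accept every listed property, in table order."""
    return [e for e in TABLE if all(allows(e, p) for p in props)]


print(entries_with("N1_medicine", "passive"))
```

Querying the columns in this way is one simple means of grouping entries that share a set of properties.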

Therefore, based on lexicographic and frequency criteria, 100 verbs of the medical domain were selected from the corpus CCM-2009 (Burdiles, 2012). Firstly, we analysed them to determine the number and type of arguments they select, and classified these arguments into object classes (OC) (Gross, 2012). An OC is a set of nouns that, given the same predicate, acquire the same meaning. For instance, ‘pill’, ‘syrup’ and ‘drops’ form the OC ‘medicines’, and its elements are assigned to a predicate like 'prescribe' (‘the doctor prescribed the pills/syrup/drops’). Secondly, a list was built with all the accepted transformations for each simple sentence containing a predicate and its arguments. With this information, computational models were created in NooJ for the detection of EAMED in a corpus and for its automatic generation. This work involved the elaboration of electronic dictionaries, syntactic recognition grammars and generative grammars. Detection was performed on a corpus of 188,000 words made up of texts from the area of gynecology and obstetrics, achieving the following results: 72.5% coverage and 98% precision. Regarding generation, the NooJ grammars made it possible to obtain grammatical sentences for each EAMED, involving the different transformations admitted by each particular class.
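To make the generation step concrete, the following sketch produces an active and a passive sentence for each member of the object class ‘medicines’, using the English gloss given in the abstract. It is a simplified illustration in Python rather than a NooJ generative grammar; the number tags and function names are assumptions added so that the passive auxiliary agrees with the noun.

```python
# Object class from the abstract's example; number tags are added for agreement.
MEDICINES = {"pills": "pl", "syrup": "sg", "drops": "pl"}


def active(agent: str, noun: str) -> str:
    """Base sentence: agent + predicate + member of the object class."""
    return f"the {agent} prescribed the {noun}"


def passive(agent: str, noun: str) -> str:
    """Passive transformation of the base sentence."""
    aux = "were" if MEDICINES[noun] == "pl" else "was"
    return f"the {noun} {aux} prescribed by the {agent}"


def generate(agent: str = "doctor") -> list[str]:
    """One active and one passive sentence per member of the object class."""
    return [build(agent, noun) for noun in MEDICINES for build in (active, passive)]


for sentence in generate():
    print(sentence)
```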

References

Burdiles G. (2012). Descripción de la organización retórica del género Caso Clínico de la medicina a partir del corpus CCM-2009, PhD thesis. Pontificia Universidad Católica de Valparaíso.

Cohen K., Hunter L. (2006). “A critical review of PASBio's argument structures for biomedical verbs”. BMC Bioinformatics, 7(3).

Gross M. (1975). Méthodes en syntaxe. Paris: Hermann.

Gross M. (1998). “Lexique, grammaires et cumulativité”. Cahiers de l’institut de linguistique de Louvain, 24, 23-41.

Gross G. (2014). Manual de análisis lingüístico. Aproximación sintáctico-semántica al léxico. Barcelona: Editorial UOC.

Shen L., Nishimura Y., Matsuda F., Ishii J., Kondo A. (2016). “Overexpressing enzymes of the Ehrlich pathway and deleting genes of the competing pathway in Saccharomyces cerevisiae for increasing 2-phenylethanol production from glucose”, Journal of bioscience and bioengineering, 122(1), 34–39.

SNOMED CT (2016). SNOMED CT Edición en Castellano. [https://www.achisa.cl/snomed-ct/]

Silberztein M. (2016). Formalizing natural languages. The NooJ approach. ISTE Eds.


Geoparsing with NooJ: Italian toponym resolution for environmental crimes,

Raffaele Manna, Annarita Magliacane, Antonio Pascucci, Wanda Punzi Zarino, Vincenzo Simoniello

University of Naples “L’Orientale” - UNIOR NLP Research Group, Italy [rmanna, apascucci, wzarino, vsimoniello]@unior.it

University of Liverpool, England [email protected]

Abstract

Recent trends in NLP research focus on exploiting language technologies and social media in relation to environmental issues (Domala et al., 2020). By using social media as a source of information, NLP systems are supplied with constantly updated news and user-generated reporting posts related to environmental crises and crimes (Carrera-Ruvalcaba et al., 2019). In order to monitor the environmental issues and to aid timely actions during environmental crises, NLP systems have been developed to analyze the content of user-generated texts in social media. In this context, textual information derived from social media is used to model classifiers capable of filtering and discriminating textual content related to emergency circumstances during environmental crises (Maldonado et al., 2016; Tarasconi et al., 2017). Once the emergency-related texts are filtered out, geo-spatial textual information is processed in NLP monitoring systems to extract toponyms from texts (Suwaileh et al., 2020). The extraction and resolution of toponyms is a task in NLP known as Geoparsing (Gritta et al., 2019). However, extracting geographical locations is a complex task since they are not always textually represented as an individual toponym but often occur as a set of place names held together by conjunctive words and prepositions (Laparra & Bethard, 2020).

Within the framework of the project Crowd for the Environment (C4E)¹, which aims at identifying and monitoring environmental crimes and human-related disasters using NLP techniques, we collected reporting tweets in Italian to build a reference corpus (the UNIOR Eye corpus) dealing with illegal spills such as illegal landfills, micro-dumps and waste burnings that occurred from January 2013 to August 2020. Along with these environmental crimes, we also collected reporting tweets regarding water-related crimes, hazardous substances and environmental fires. During the first phase of the C4E project, we exploited the UNIOR Eye corpus by applying machine learning techniques capable of detecting alert tweets, i.e. user-generated reports of emergency situations and ongoing crimes against the environment (Manna et al., 2020).

As a step forward in the C4E project, we annotate textual mentions of crimes, locations and time periods in a subset of alert tweets from UNIOR Eye in order to apply text extraction methods. Specifically, we focus on the extraction of non-individual toponyms and location mentions. In the alert tweets, textual patterns such as dall’A1 da Caserta (on the A1 motorway from Caserta) and nel tratto di strada compreso tra via Y e via Z (in the stretch between street Y and street Z) show non-individual geographical references and different user viewpoints in the linguistic realization of geographical entities.

Since these phenomena imply linguistic and syntactic variation by means of different textual realization, our aim is to exploit NooJ's functionalities and grammars to build a location extraction system able to:

- analyze the linguistic occurrences in Italian for crime, location and period of time textual mentions;

- detect and extract non-individual toponyms and fuzzy location patterns.
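As a rough illustration of what a non-individual location pattern looks like, the sketch below uses a Python regular expression rather than a NooJ grammar. It covers only the "tra via Y e via Z" construction cited above. The pattern, the short list of street-type words, the function name and the sample tweet are assumptions made for the example, not the authors' resources or data.

```python
import re

# Hypothetical street-type words; a real system would rely on a gazetteer.
ODONYM = r"(?:via|viale|piazza|corso)\s+[A-Z][\w']*(?:\s+[A-Z][\w']*)*"

# "tra <street> e <street>": one location expressed through two toponyms.
BETWEEN = re.compile(rf"\b(?:tra|fra)\s+(?P<start>{ODONYM})\s+e\s+(?P<end>{ODONYM})")


def extract_stretches(text: str) -> list[tuple[str, str]]:
    """Return (start, end) street pairs for every 'tra X e Y' mention."""
    return [(m.group("start"), m.group("end")) for m in BETWEEN.finditer(text)]


# Invented sample sentence modelled on the pattern quoted in the abstract.
tweet = "Rifiuti bruciati nel tratto di strada compreso tra via Roma e via Appia Nuova"
print(extract_stretches(tweet))
```

Keeping both street names in one match preserves the fact that the user is describing a stretch of road, not two separate places.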

By modeling an extraction system with NooJ, we will be able to offer responders a more accurate geographical position of the locations where environmental crimes and human-related disasters occur.

1 https://sites.google.com/view/c4e-crowdfortheenvironment/home-page


References

Carrera-Ruvalcaba E., Ekedum J., Hancock A., Brock B. (2019). Leveraging Natural Language Processing Applications and Microblogging Platform for Increased Transparency in Crisis Areas, SMU Data Science Review, 2(1), 6.

Domala J., Dogra M., Masrani V., Fernandes D., D'souza K., Fernandes D., Carvalho T. (2020). “Automated Identification of Disaster News for Crisis Management using Machine Learning and Natural Language Processing”, In: 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), p. 503-508.

Gritta M., Pilehvar M. T., Collier N. (2019). “A pragmatic guide to geoparsing evaluation”, Language Resources and Evaluation, p. 1-30.

Laparra E., Bethard S. (2020). “A Dataset and Evaluation Framework for Complex Geographical Description Parsing”, In: Proceedings of the 28th International Conference on Computational Linguistics, p. 936-948.

Maldonado M., Alulema D., Morocho D., Proaño M. (2016). “System for monitoring natural disasters using natural language processing in the social network Twitter”, In: 2016 IEEE International Carnahan Conference on Security Technology (ICCST), p. 1-6.

Manna R., Pascucci A., Punzi Zarino W., Simoniello V., Monti J. (2020). “Monitoring Social Media to Identify Environmental Crimes through NLP: A Preliminary Study”, In: Proceedings of the Seventh Italian Conference on Computational Linguistics (Clic-IT).

Silberztein M. (2016). Formalizing Natural Languages: The NooJ Approach, John Wiley & Sons.

Suwaileh R., Imran M., Elsayed T., Sajjad H. (2020). “Are We Ready for this Disaster? Towards Location Mention Recognition from Crisis Tweets”, In: Proceedings of the 28th International Conference on Computational Linguistics, p. 6252-6263.

Tarasconi F., Farina M., Mazzei A., Bosca A. (2017). “The role of unstructured data in real-time disaster-related social media monitoring”, In: 2017 IEEE International Conference on Big Data (Big Data), p. 3769-3778.


Integrated NooJ environment for Arabic linguistic disambiguation improvement using MWEs,

Dhekra Najar, Slim Mesfar, Henda Ben Ghezela

RIADI, University of Manouba, Tunisia [email protected], [email protected], [email protected]

Abstract

Language resources are a necessary component of language development in NLP. They are useful for any empirical language study, including linguistic analysis, translation and disambiguation.

The linguistic development environment NooJ makes it possible to formalize complex linguistic phenomena such as the generation, processing and analysis of compound words. NooJ can be used through the dynamic library NoojEngine.dll or the command-line program noojapply.exe. In this study, we take advantage of noojapply.exe, which is freely available in the Standard edition of NooJ. It allows users to apply dictionaries and grammars automatically to texts from external environments.

In this paper, we introduce a module for the recognition of Arabic multi-word expressions (MWEs) based on rule-based grammars. The module recognizes several types of morphosyntactic variation that can affect an MWE. We then study the impact of MWE recognition on word disambiguation in Arabic texts. These linguistic resources are compiled to be used as parameters of the command-line program noojapply.exe, so that they can be integrated within an Arabic language processing environment for linguistic disambiguation.

Our work is divided into three sections. First, we review the literature on the disambiguation task in Arabic. Then, we give a detailed description of our integrated NooJ environment for Arabic linguistic disambiguation and the associated grammars. Finally, a set of tests and experiments is carried out to measure the impact of MWE recognition on word disambiguation.

References

Silberztein M. (2015). La formalisation des langues: l'approche de NooJ. ISTE Group.

Najar D., Mesfar S., Ben Ghezela H. (2015). A large terminological dictionary of Arabic compound words. In: International Conference on Automatic Processing of Natural-Language Electronic Texts with NooJ. Springer, Cham, p. 16-28.


Approach to the automatic treatment of gerunds in Spanish and Quechua: A pedagogical application of NooJ,

Andrea Rodrigo, Maximiliano Duran, María Yanina Nalli

CETEyHIPL, Universidad Nacional de Rosario, Argentina [email protected]

Université de Franche-Comté, Besançon, France [email protected]

Universidad Tecnológica Nacional, Rosario, Argentina [email protected]

Abstract

Within the framework of a pedagogical application of NooJ undertaken by the Argentine team (Centro de Estudios sobre Tecnología Educativa y Herramientas Informáticas de Procesamiento del Lenguaje, CETEyHIPL, UNR, Argentina), this work presents an automatic treatment of gerunds in Spanish and Quechua. We seek to show the contrast between SVO and SOV languages (Greenberg, 1966) and to teach Spanish as a foreign language to Quechua speakers, and vice versa, with a special focus on gerunds. We have used NooJ (Silberztein, 2016) to formalize the behavior of the gerund in both languages. Our work includes creating dictionaries and grammars to analyze Spanish gerunds in periphrastic forms or in subordinate clauses, and to translate them into Quechua. Spanish gerunds use the suffix -ndo after a thematic vowel, which varies according to the verb conjugation. Thus, the vowel -a- corresponds to the first conjugation and the diphthong -ie- (or its orthographic variant -ye-) corresponds to the second and third conjugations (RAE, 2009). The Quechua gerund uses four suffixes: -spa, -stin, -pti, -chka, and the reduplicated combination of the suffix -n in a verb root. These suffixes can be combined with other verb suffixes to create different gerund forms, depending on various time and aspect constraints.
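The Spanish formation rule stated here can be written down in a few lines. The sketch below is a plain Python illustration, not the authors' NooJ dictionaries or grammars, and the function name is hypothetical. It covers regular verbs only: stem-changing and irregular gerunds (for example durmiendo, pidiendo, yendo) are outside its scope, and the choice of -ye- after a stem-final vowel reflects standard Spanish orthography rather than a detail given in the abstract.

```python
VOWELS = "aeiou"


def spanish_gerund(infinitive: str) -> str:
    """Regular gerund: stem + thematic vowel or diphthong + -ndo."""
    stem, ending = infinitive[:-2], infinitive[-2:]
    if ending == "ar":
        return stem + "ando"  # first conjugation: -a- + -ndo
    if ending in ("er", "ir"):
        # Orthographic variant -ye- when the stem ends in a pronounced vowel
        # (leer -> leyendo); the u of gu/qu is silent, so it does not count.
        vowel_final = stem[-1:] in tuple(VOWELS) and not stem.endswith(("gu", "qu"))
        theme = "ye" if vowel_final else "ie"
        return stem + theme + "ndo"
    raise ValueError(f"not a regular Spanish infinitive: {infinitive!r}")


for verb in ["cantar", "comer", "vivir", "leer", "construir"]:
    print(verb, spanish_gerund(verb))
```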

To define a Spanish standard, we exploited a corpus from the Argentinian Ministry of Education and Sports consisting of regulations from several historical periods (available online). Using this corpus gave us a more schematic and standardized picture of the use of the gerund. We then studied language structures ranging from the simplest to the most complex. We used examples from the corpus to establish a taxonomy that distinguishes gerunds in relation to their verb of reference. We developed grammars to illustrate the different types of Spanish gerunds, i.e., as part of a periphrasis or as part of a subordinate clause (with adjectival or adverbial value). Finally, we developed grammars that translate these sentences into Quechua, highlighting Spanish syntactic richness and Quechua morphological richness. By displaying Spanish phrases or sentences translated into Quechua, we can show learners the differences between these languages: this metalinguistic approach is crucial for teaching a language. In addition, using NooJ to compare the two languages shows that there is no one-to-one correspondence between Spanish and Quechua.

References

Greenberg J. (1966). “Some Universals of Grammar with Particular Reference to the Order of Meaningful Elements” In: Universals of Language, J. Greenberg (ed.), Cambridge (MA): MIT Press, 2nd ed. p. 72-113.

Real Academia Española; Asociación de Academias de la Lengua Española (2009). Nueva gramática de la lengua española. Espasa-Calpe. Madrid, España.

Silberztein M. (2016). Formalizing Natural Languages: The NooJ Approach. ISTE Editions. London.

Ministerio de Educación y Deportes. Presidencia de la Nación Argentina. Repositorio institucional: http://repositorio.educacion.gov.ar/dspace


Automatic generation of intonation marks and prosodic segmentation in Belarusian,

Yauheniya Zianouka, Dzmitry Dzenisiuk, David Latyshevich, Yuras Hetsevich

United Institute of Informatics Problems, Minsk, Belarus [email protected], [email protected], [email protected],

[email protected]

Abstract

Automatically localizing intonation boundaries in a text is one of the main tasks of prosodic processors, considered a mandatory unit in any speech recognition system. The syntagmatic articulation of the speech flow isolates minimal semantic units and reflects the structural and semantic components of utterances. The automatic selection of syntagmas is complicated by the lack of deep parsing, which leads to the search for new approaches to the development of machine algorithms, methods and techniques by defining sequences of linguistic elements associated with certain semantic relationships.

To address automatic delimitation in NooJ, we have collected a Belarusian text corpus from the medical domain. It comprises news texts from online medical portals and consists of nearly 500 texts, 120,000 word forms and more than 8,000 sentences. This work continues previous research in which we analyzed sentence parts separated by punctuation and covered most punctuation marks for such segments (up to five words, the most frequent being three-word syntagmas). We now plan to extend our study to texts in which the number of syntagmas in a sentence can significantly exceed the number of punctuation marks.
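The punctuation-level segmentation mentioned here can be approximated by a very small sketch. It is written in Python instead of NooJ grammars, and it treats every punctuation mark in the listed set as a syntagma boundary, which is a simplifying assumption; the function names and the boundary mark are illustrative.

```python
import re

# Punctuation treated as a syntagma boundary (an assumption for this sketch).
BOUNDARY = re.compile(r"\s*[,;:.!?]+\s*")


def split_syntagmas(sentence: str) -> list[list[str]]:
    """Split a sentence at punctuation and return each segment as a word list."""
    segments = (seg.split() for seg in BOUNDARY.split(sentence))
    return [words for words in segments if words]


def mark_boundaries(sentence: str, mark: str = "|") -> str:
    """Join the segments with an explicit boundary mark."""
    return f" {mark} ".join(" ".join(words) for words in split_syntagmas(sentence))


print(mark_boundaries("one two, three four five; six."))
```

Segments longer than a few words are exactly the cases where punctuation alone is not enough and further boundaries have to be found.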

The delimitation of syntagmas depends on the sentence structure, the word order, the presence of homogeneous members, the nature of word combinations and other linguistic parameters. All these components should be taken into account and marked in separate syntagmas when developing new syntactic and morphological NooJ grammars.

Hence, we hope to improve the synthetic speech generated by Belarusian text-to-speech systems by using the algorithms and grammars prepared in NooJ from the Belarusian medical-domain corpus for the automatic generation of prosodic transcriptions of long sentences.

References

Dzenisiuk D. (2019). “Automatic Generation of Right Intonational Marks and Speech for Medical domain in Belarusian”, Dz. Dzenisiuk, Yu. Hetsevich, A. Drahun, A. Bakunovich, J. Shynkevich, In: International Conference NooJ 2019: Book of Abstracts. Hammamet, Tunisia.

Okrut T. (2015). “Resources for Identification of Cues with Author’s Text Insertions in Belarusian and Russian Electronic Texts”, T. Okrut, Y. Hetsevich, B. Lobanov, Y. Yakubovich, In: Formalising Natural Languages with NooJ 2014 / UK; ed. Johanna Monti, Max Silberztein, Mario Monteleone and Maria Pia di Buono. Newcastle: Cambridge Scholars Publishing, p. 129-139.

Hetsevich Y. (2016). “Grammars for Sentence into Phrase Segmentation: Punctuation Level”, Y. Hetsevich, T. Okrut, B. Lobanov, In: Automatic Processing of Natural-Language Electronic Texts with NooJ: 9th International Conference, Minsk, Belarus, June 11-13, ed. T. Okrut, Y. Hetsevich, M. Silberztein, H. Stanislavenka. Springer International Publishing, p. 74-82.

Hetsevich Y. (2015). “Grammars for the Sentence into Phrase Segmentation: Punctuation Level”, Y. Hetsevich, T. Okrut, B. Lobanov, In: International Scientific Conference on the Automatic Processing of Natural-Language Electronic Texts “NooJ’2015” (June 11-13, Minsk, Belarus), ed. B.M. Lobanov, Yu.S. Hetsevich, p. 25
