User-driven Language Technology Infrastructure – the Case of CLARIN- PL

36
CLARIN-PL User-driven Language Technology Infrastructure – the Case of CLARIN-PL Maciej Piasecki Wrocław University of Technology Institute of Informatics G4.19 Research Group [email protected] 2014-06-18-19

description

User-driven Language Technology Infrastructure – the Case of CLARIN- PL. Maciej Piasecki Wrocław University of Technology Institute of Informatics G4.19 Research Group maciej . piasecki @ pwr.wroc.pl 2014 - 06 - 18-19. Users make problems. Users make all software systems imperfect - PowerPoint PPT Presentation

Transcript of User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Page 1: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL

User-driven Language Technology Infrastructure – the Case of CLARIN-PL

Maciej PiaseckiWrocław University of Technology

Institute of Informatics

G4.19 Research Group

[email protected]

Page 2: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Users make problems

Users make all software systems imperfect However if a software system is not used, it does not exist Context of Use determines the perspective from which the

users perceive systems A very few users are willing to read how to use a system Users want to use a system, not to study it Interaction is an active process based on understanding

Who can use language technology?

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 3: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Basic Notions

Language Technology (LT) language resources and tools robust in terms of quality and coverage multipurpose component based

Language Technology Infrastructure a software framework (architecture or platform) for combining language tools with language resources into

processing chains (or pipelines) the defined processing chains are next applied to language

data sources interoperability, also with the external systems

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 4: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

LT in Humanities and Social Sciences: Barriers

Physical – language tools and resources are not accessible in the network

Informational – descriptions are not available or there is no means for searching

Technological – lack of commonly accepted standards for LT, lack of a common platform, varieties of technological solutions, insufficient users’ computers

Related to knowledge – the use of LT requires programming skills or knowledge from the area of natural language engineering

Legal – licences for language resources and tools (LRTs) limit their applications

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 5: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

LTI for H&SS: Lowering Barriers

CLARIN ERIC consortium of several countries member countries contribute parts of the LTI

CLARIN Mission Lowering the barriers for LT in H&SS integration of different LT components into one interoperable

system one sign on and one login into the distributed infrastructure common standards common licences and promotion of the open access installation-free, web-based user interface

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 6: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Different ways to LTI

Bottom-up a collected offer approach based on linking together the already existing Language

Resources and Tools focused on accessibility, technical interoperability and

processing chains Top-down

based on user-centred design paradigm research applications for H&SS are a starting point

Bi-directional linking of Language Resources and Tools combined with the development of research applications

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 7: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Bottom-up LTI development

Characteristic features Collecting and linking the already existing Language

Resources and Tools (LRTs) into processing chains Improved accessibility and technical interoperability of LRTs a common system of IPR licences distributed authorisation system (one login-in) federated search mechanisms: meta-data and content

Disadvantages will the users know what to look for and what to use? specific background knowledge required, e.g. complex Slavic

tagsets or processing phases not directly related to the research in H&SS

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 8: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Top-down LTI development

Process Starting point: complete research applications (or research

tools) for the final users – H&SS scientists Context of Use Analysis LT components and the underlying network infrastructures

designed and developed according to the requirements Nice idea, but hard to be implemented

many existing LT components can be re-used research tools to be designed

Innovative dependent on the new or emerging research methods research methods are stimulated by the appearing LT-based

research tools too long way process for the funders’ patience

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 9: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Bi-directional LTI development

Idea development of the necessary elements

a distributed network infrastructure basic LT processing chain

combined with user-centred approach based on the development of research applications

Characteristic features a metaphor of the Agile-like light weight software designing

method close co-operation with key users from the H&SS domain application development stimulates the construction of

technical fundaments inspirations and identification of the further user needs

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 10: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Context of Use Analysis

User Analysis & Task Analysis = Context of Use Analysis Context of Use (ISO 9241):

Users Tasks Physical environment Social and cultural environment

(„social” encompasses organisational)

Two phases: Gathering information about Context of Use Analysing

an ordered description of the Context of Use Models: User Model and Task Analysis

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 11: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Context of Use Analysis

Taskis a sequence of actions, ordered in time, whose performance is aimed towards some goal,

Goal is a state of real-world which is intended by the user.

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 12: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Context of Use Analysis: information gathering

observing, listening to, and talking with users Contextual Inquiries Role Playing and Staged Scenarios

interviewing users and others Process Analysis Ethnographic Interviews

interviews with key informants observing – including active participation

working with users away from their work site traditional market research techniques, e.g. Focus Groups traditional system development techniques, e.g.

Requirements-Gathering Sessions

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 13: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Usability (ISO 9241)

Usability — extent to which a product can be used by specified users to achieve specified goals with

effectivenessaccuracy and completeness with which users achieve specified goals,

efficiency resources expended in relation to the accuracy and completeness with which users achieve goals (primarily time),

satisfactionfreedom from discomfort, and positive attitudes towards the use of the product,

in a specified context of use.

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 14: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Usability and LTI

Web applications Access to resources Individual services Adaptable workflows for language processing Do not guarantee the LTI usability

Usability aspects based on the Context of Use Analysis results Presentation according to the unknown (sic!) user needs Adaptation to the unknown (sic!) user tasks

But, …, enabling technology New technology creates new needs Sample applications illustrate the possibilities and potentially

inspire

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 15: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL

Polish scientific consortium Wrocław University of Technology, G4.19 Research Group Institute of Computer Science, Polish Academy of Science Polish-Japanese Institute of Information Technology, Chair of

Multimedia University of Łódź, PELCRA group at Chair of English Language

and Applied Linguistics Institute of Slavic Studies, Polish Academy of Science Wrocław University

Goal: implementation of the Polish part of the CLARIN ERIC LTI

Generously financed by the Polish Ministry of Science and Higher Education (about 4 millions Euro for three years)

An example of the bi-directional approach

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 16: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL structure

Context many basic LRTs for Polish were still lacking at the start of

CLARIN-PL Deeper technological barrier

e.g. the lack of a robust dependency parser for Polish

Pillars: CLARIN-PL Language Technology Centre

www.clarin-pl.eu the Polish node of the CLARIN distributed infrastructure

Complete set of basic LRTs for Polish Research applications for H&SS – first created for key users

and selected H&SS sub-domains.

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 17: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL Langauge Technology Centre: bottom-up B-type centre, located in Wrocław University of Technology

based on modified D-Space system from Lindat (Czech CLARIN) Distributed authorisation

linked to the national identity federation one sign-on, one login

Proper repository system supporting persistent identifiers for resources and tools, CMDI meta-data format

Interface for Federated Content Search On meta-data and content of corpora

Depositing service for researchers from H&SS focused on LRTs adherence to all CLARIN specifications about standards and protocols

Web Services for LRTs: the basic processing chain of Polish Flexible composition of the specialised processing chains SOAP & REST interfaces

An active K-type centre in several areas

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 18: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

CLARIN-PL: Bottom-uphttp://nlp.pwr.edu.pl/synat

Page 19: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Bi-directional: bottom-up part

LRTs and LRT chains can be useful … if the required tools and resources exist, and, they are robust!

What is the minimal set of LRTs? What kind of language resources and tools can be called

robust? automated applications in H&SS seem to require high quality

of language tools and mostly large coverage of resource BLARK – The Basic Language Resource Kit

“the minimal set of language resources that is necessary to do any precompetitive research and education at all” (Krauwer, 2003) and also basic processing chains

possible reference point to compare LRts for different languages

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 20: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL: language resources

Good starting point, e.g. a huge National Corpus of Polish (1 billion tokens) plWordNet – a very large wordnet for Polish Korpus Politechniki Wrocławskiej – an open Polish corpus

with rich annotation Main goals

completing the construction of selected resources building bi-lingual resources and specialised corpora

facilitating the envisaged needs of H&SS Bilingual resources crucial for interoperability

Large number of language pairs vs limited funds Priority given to Polish-English resources

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 21: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL: selected resources in development

plWordNet 3.0 a comprehensive description of the Polish lexico-semantic system

(~200 000 lemmas, ~280 000 senses) mapping to enWordNet – an expanded Princeton WordNet 3.1

A large lexicon of the Multi-word Expressions described with the minimal constraints on their lexico-syntactic

structures linked to plWordNet

NELexicon 2.0 - ~2.5 million distinct PNs, semantically classified Dynamic lexicons – tools for automated expansion of the manual core A large semantic valency lexicon for Polish predicative lexical units Corpora:

a transcribed training-testing Polish speech corpus parallel corpora, historical Polish corpus of text news…

Several systems for searching text and speech corpora

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 22: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL: language tools

1. Segmentation into tokens and sentences

2. Morphological analysis

3. Morphological guessing of unknown words (both without context and context sensitive)

4. Morpho-syntactic tagging

5. Word Sense Disambiguation

6. Chunker and shallow syntactic parser

7. Named Entity Recognition and disambiguation

8. Co-reference and anaphora resolution

9. Temporal expression recognition

10. Semantic relation recognition

11. Event recognition

12. Shallow semantic parser

13. Deep syntactic parser with disambiguated output: dependency and constituent

14. Deep semantic parser (is really needed?)

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 23: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL: language tools

A generic set of morpho-syntactic tools for Polish that can be adapted to a domain specified by the user

Tools for the extraction of the semantic-pragmatic information from documents and collections of documents, e.g. keywords, semantic relations between text fragments and text summaries

Web Services will be provided for all LRTs and systems already implemented: segmentation, morphological analysis,

tagging, chunking, Named Entity Recognition, and WSD accessible via REST or SOAP and described by CMDI

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 24: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Web Service test for Named Entity Recognition

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 25: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Bi-directional: top-down part

Digital H&SS (also e-Humanities and e-Social Science) relatively new domains with not many fixed procedures a rich variety of specific solutions, often data-driven

Observational methods requires good selection of users Key users (key informants) as a first group

scientists from H&SS who have already started using digital language-based methods

or are interested in applying LT in their research Finding users

previous personal contacts as a seed resulted in further contacts participation in a couple of H&SS conferences and … the snow ball is starting …

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 26: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Top-down part: process

1. Establishing contacts with users

2. Identification of key users

3. Context of Use Analysis: users, their tasks and environments

4. Identification of the key applications corresponding to these users' tasks that can be supported by the available LT

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 27: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Top-down part: selection of applications

Criteria to cover a maximal variety of research areas but also to co-operate first with the most active users matching the available LT for Polish a few application but broadening our understanding of the

domain First applications

Spokes – a search system for the corpus of conversational data (users from inside of CLARIN-PL)

A system for collecting Polish text corpora from the Web A open textometric and stylometric system focused on Polish Semantic text classification for sociology Literary Map

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 28: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

System for collecting Polish text corpora from the Web

Requests from users revealed gaps in the available technology

Existing corpus building systems were too sensitive to text encoding errors found in the web

A system for collecting Polish text corpora from the Web had to be constructed: based on solutions developed in Masaryk University in Brno applies morphological analysis to detect texts including larger

number of errors Supports semi-automated extraction of texts from blogs

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 29: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Open textometric and stylometric system

Several textometric and stylometric tools available But not designed for languages of rich inflection like Polish Enabling the use of features defined on any level of the

linguistic structure: from the level of word forms up to the level of the semantic-pragmatic structures.

Re-use of several existing components Available as Web Application and a Web Service Stylometric techniques appear to be applicable in many

tasks of H&SS sociology (characteristic features that are for different

subgroups), political studies (similarity and differences between political parties), literary studies …

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 30: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Semantic text classification for sociology

Users: Collegium Civitas, Warsaw Initially:

Text document classification according to the manually annotated examples

Finally: Whole system from corpus gathering to tuning machine

learning methods for the semantic classification of text snippets

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 31: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Semantic text classification for sociology

1. Corpus building

2. Pre-processing Text segmentation utilising the original structure Morpho-syntactic tagging, parsing

3. Sample selection Collection distribution Clustering – different techniques

4. Manual annotation Abstract definitions of semantic classes Availability of open annotation editors

5. Training classifiers

6. Analysis of the results Error estimation

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 32: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

GeTClasS – Generalised Text Classification for Sociology

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 33: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Literary Map

Users Digital Humanities Centre of The Institute of Literary Research PAS)

Idea

to identify all geographical names in the literary text (or a corpus) and map them onto the geographical map

Technical requirements Named Entity Recognition combined with geo-location PNs recognised in text must be grouped into expression

recognised by Google Recognition of semantic relations between non-spational PNs

and locations Parallel research on the method and its applications

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 34: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Literature MapUser Involvement

Utrecht 2014-06-19

CLARIN-PL

Page 35: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

Conclusions

Quality of Service notion is very rarely used in relation to LRTs and LTI for what research tasks and what scenarios are our LRTs good enough? error level of 5% can bias a lot statistical analysis of data

extracted from a corpus goal: fully automated procedures or semi-automatic support?

Application of LT to the research in H&SS seem to be much more challenging than in commercial systems!

Error monitoring and management in LT-based applications is required

Any model of LTI we aim for, users must be the starting point and constantly focused as the target

Language TechnologiesLjubljana

2014-10-09

CLARIN-PL

Page 36: User-driven Language Technology Infrastructure – the Case of CLARIN- PL

CLARIN-PL

Thank you very much for your attention!

Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL] and the EU’s 7FP under grant agreement no 316097 [ENGINE]