Converting and Representing Social Media Corpora into TEI: Schema and Best Practices from CLARIN-D

Michael Beißwenger, Eric Ehrhardt, Axel Herold, Harald Lüngen, Angelika Storrer

Background of this talk: CMC 2 TEI

Computer-mediated communication (CMC) / ‘social media’: interaction mediated through computer networks (the internet), e.g. chats, forums, communication on social network sites, blog comments, tweets, Wikipedia talk pages, SMS and WhatsApp conversations etc.

New genres which the TEI “has not yet envisioned” (quotation from the TEI-P5 chapter about customization)

TEI-SIG “computer-mediated communication” (CMC-SIG): How can TEI-P5 be adapted for the representation of CMC genres and corpora?

☛ Options: 1) customization 2) extension of the standard

CMC: not a niche phenomenon but one with a high (still growing) impact on people’s everyday lives, interactions, language and society

Background of this talk: CMC 2 TEI

TEI-SIG “computer-mediated communication” (CMC-SIG):

Next planned step (spring 2017): feature request for a basic structure for CMC based on the models discussed and schemas developed so far

Results of the SIG work so far:

Three schemas for modeling CMC corpora in TEI

Several existing corpora (French & German) for different CMC genres (Wikipedia discussions, SMS, Twitter, chat etc.) that have been represented entirely using these schemas

Most recent schema: ‘CLARIN-D CMC-TEI’: focus especially (but not only) on written CMC genres; developed and used for remodeling an existing chat corpus for German

Schemas and resources of the CMC-SIG

http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication

Structure of the talk

1) Project background of the most recent schema (‘CLARIN-D CMC-TEI’)

2) General challenge of modeling CMC in TEI

3) Outline of the schema, illustrated by selected features

4) Summary and outlook

ChatCorpus2CLARIN: Project background

Period: May 2015 – February 2016

Task: develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). One work package: remodeling of the corpus in a TEI-compliant format.

Project team: Michael Beißwenger (U Dortmund / DUE), Angelika Storrer, Eric Ehrhardt (U Mannheim), Harald Lüngen (IDS Mannheim), Axel Herold (BBAW, Berlin) + other colleagues at the CLARIN-D hubs at IDS and BBAW.

Curation project of the CLARIN-D F-AG 1 “German Philology”

http://www.clarin-d.de/en/curation-project-1-3-german-philology

The original resource (chat corpus 1.0)

Dortmund Chat Corpus http://www.chatkorpus.tu-dortmund.de

scope: Language use and linguistic variation in German chats

corpus size: 478 logfiles with 140240 user posts / 1 million words

availability: online for download since 2005 (together with a simple query tool, addressees: linguists) + as a collection of HTML pages (for online browsing, addressees: German teachers)

XML-annotated on the basis of a homegrown XML format (‘ChatXML’) which describes:

(1) the basic structure and properties of logfiles and postings (“messages”)

(2) selected items on the micro-level of user posts (emoticons, acronyms, addressing terms, nickname mentions)

(3) selected metadata about the chat platforms and users.

Goals of the project (1)

Sustainability: Integrate the only freely available CMC corpus for German (which is used by many researchers) into the CLARIN-D corpus infrastructure

Additional linguistic annotations (part-of-speech) that will allow for more sophisticated corpus queries

Interoperability: Represent the corpus in compliance with an established standard in the Digital Humanities and thus make it interoperable with other resources in the CLARIN-D corpus infrastructure

Goals of the project (2)

Create a showcase which demonstrates what researchers can gain when

− CMC corpora are made available for the community,

− CMC corpora, as part of big, annotated corpus collections, can be analyzed in combination with other language resources (text and speech corpora at IDS and BBAW)

▶ Why TEI? This is why!

Intended model character of the solutions developed in the project: solutions should be useful not only for the modeling and integration of the chat corpus but also for the modeling and integration of other CMC corpora into CLARIN-D (future projects)

Modeling CMC in TEI: The challenge

Fundamental challenge:

Written CMC shares characteristics with both text and spoken conversation ...

o Just like spoken conversation (and unlike text), CMC is dialogic interaction in which each communicative move creates/changes the context for follow-up moves.

o Just like text documents (and unlike spoken conversation), written CMC is organized through the exchange of stretches of written text which have been composed completely before they are transmitted and read.

A basic model for the representation of user contributions to written CMC should reflect these properties.

CLARIN-D CMC-TEI @TEI CMC SIG

Schema ODD and RNG files: http://wiki.tei-c.org/index.php?title=SIG:CMC/clarindschema

CLARIN-D CMC-TEI

Customisation: introduction of specific models for CMC-specific concepts, e.g. model.divPart.cmc, <post>, @auto

Definition of best practices for the use of standard TEI models (without modification), e.g. <div>, <w>, @type, <participantList>, <timeline>

The basic element for CMC: <post>

Example from a social chat:

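A04: milkaq: wir kennten dich dann auch im Beiboot aussetzen [English: "milkaq: we could then also set you adrift in the dinghy"]

A11: OK. Ich kann ja schwimmen. [English: "OK. I can swim, after all."]

A02: KäptnMcMike holt ein Haustierchen aus der Kajüte [English: "KäptnMcMike fetches a little pet from the cabin"]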

Post: a stretch of text characterized by the following features:

(1) produced to function as a contribution to an ongoing CMC interaction;

(2) the process of verbalization is prior to (and finished before) the act of making it available for the addressees/interaction partners;

(3) it is the largest unit of “user generated content” that is handed over to the technical system at once;

(4) it is the atomic building block of CMC macrostructures (logfiles, threads, twitter timelines, …)

Example: posts in chat

model.divPart.cmc (HTML from ODD)

<post> declaration (HTML from ODD)

Extending the <post> model with attributes

@replyTo

indicates the previous post to which the current post replies or refers.

optional attribute which can be used for the annotation of thread structures

post metadata giving information that the researcher has reconstructed through interpretation and analysis.
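A minimal Python sketch of how @replyTo annotations could be turned into a thread structure. The sample posts, their xml:id values and the assumption that @replyTo holds a "#"-prefixed pointer to the xml:id of an earlier post are invented for illustration; they are not taken from the corpus. The sample also omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: ids and reply pointers are invented for illustration.
SAMPLE = """
<div>
  <post xml:id="m1" who="#A01">first post</post>
  <post xml:id="m2" who="#A02" replyTo="#m1">reply to m1</post>
  <post xml:id="m3" who="#A03" replyTo="#m1">another reply to m1</post>
  <post xml:id="m4" who="#A01" replyTo="#m3">reply to m3</post>
</div>
"""

def build_threads(root):
    """Return (thread starters, mapping from post id to ids of its replies)."""
    children = defaultdict(list)
    roots = []
    for post in root.iter("post"):
        post_id = post.get(XML_ID)
        target = post.get("replyTo")
        if target is None:
            roots.append(post_id)  # no @replyTo: the post starts a thread
        else:
            children[target.lstrip("#")].append(post_id)
    return roots, dict(children)

def depth(post_id, children):
    """Length of the longest reply chain starting at post_id."""
    return 1 + max((depth(c, children) for c in children.get(post_id, [])), default=0)

roots, children = build_threads(ET.fromstring(SAMPLE))
```

Because the attribute is optional, posts without @replyTo are treated here as thread starters; in a real corpus they may simply be unannotated.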

Extending the <post> model with attributes

@auto

new attribute in att.global (type: data.xTruthValue).

marks whether the content of the respective element was automatically generated.

Posts which have been created not by a human participant in the interaction but by the system should be marked as automatically generated.

The same holds for elements in the content of posts which have been automatically created, e.g. user signatures in posts on Wikipedia talk pages.
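A minimal Python sketch of how a corpus user might separate system-generated posts from user posts via @auto. It assumes that the extended truth value type admits the spellings "true", "false", "1", "0", "unknown" and "inapplicable"; the sample posts are invented for illustration and omit the TEI namespace.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample posts; ids and content are invented for illustration.
SAMPLE = """
<div>
  <post xml:id="m1" auto="true" type="event">A04 entered the room</post>
  <post xml:id="m2" auto="false" type="standard">hello everyone</post>
  <post xml:id="m3" type="standard">hi A04</post>
  <post xml:id="m4" auto="unknown" type="standard">...</post>
</div>
"""

def is_auto(element):
    """Interpret @auto as an extended truth value.

    Returns True or False for the boolean spellings and None for
    'unknown', 'inapplicable' or a missing attribute.
    """
    value = element.get("auto")
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    return None

root = ET.fromstring(SAMPLE)
system_posts = [p for p in root.iter("post") if is_auto(p) is True]
# Design choice of this sketch: anything not positively marked as automatic is kept.
other_posts = [p for p in root.iter("post") if is_auto(p) is not True]
```

Keeping the three-valued result separate from the filtering step leaves it to the analyst whether undetermined posts count as user content.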

Best practices: Usage of available attributes at <post>

@type, @who, @synch, @rend, …

<post auto="true" rend="color:blue" synch="#f1101004.t046" type="event"
      who="#f1101004.A01_System" xml:id="f1101004.m137">
  <time> 22:01 </time>
  <name corresp="#f1101004.A04" type="nickname">[_MALE-TEACHER-A04_]</name>
  entered the room
  <name type="roomname">[_ROOMNAME_]</name>
  at <time> 22:01:55 </time>
</post>

Best practices: POS tags

Inline annotation using <w> and @type

Tag set STTS-IBK (Beißwenger et al. 2015)

<post auto="false" rend="color:black" synch="#f1101004.t047" type="standard"
      who="#f1101004.A03" xml:id="f1101004.m138">
  <time> 22:02 </time>
  <anchor type="sentence_start"/>
  <w lemma="gut" type="ADJD" xml:id="f1101004.m138.t1">gut</w>
  <w lemma="," type="$," xml:id="f1101004.m138.t2">,</w>
  <w lemma="die" type="PDS" xml:id="f1101004.m138.t3">das</w>
  <w lemma="sollen" type="VMFIN" xml:id="f1101004.m138.t4">sollte</w>
  […]
</post>
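A minimal Python sketch of how the inline <w> annotation can be read back as (form, lemma, POS) triples. The sample reuses the four tokens from the example; the elided remainder of the post is left out, and the TEI namespace is omitted for brevity. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The four tokens of the example post; the elided rest of the post is not reconstructed.
SAMPLE = """
<post auto="false" rend="color:black" synch="#f1101004.t047" type="standard"
      who="#f1101004.A03" xml:id="f1101004.m138">
  <time> 22:02 </time>
  <anchor type="sentence_start"/>
  <w lemma="gut" type="ADJD" xml:id="f1101004.m138.t1">gut</w>
  <w lemma="," type="$," xml:id="f1101004.m138.t2">,</w>
  <w lemma="die" type="PDS" xml:id="f1101004.m138.t3">das</w>
  <w lemma="sollen" type="VMFIN" xml:id="f1101004.m138.t4">sollte</w>
</post>
"""

def tokens(post):
    """Return (form, lemma, pos) triples for every <w> in a post."""
    return [(w.text, w.get("lemma"), w.get("type")) for w in post.iter("w")]

def forms_with_pos(post, tag):
    """Return the surface forms whose @type equals the given POS tag."""
    return [form for form, _lemma, pos in tokens(post) if pos == tag]

post = ET.fromstring(SAMPLE)
triples = tokens(post)
modal_verbs = forms_with_pos(post, "VMFIN")
```

Since the POS tag sits in @type on <w>, a query for a tag reduces to a plain attribute comparison.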

Best practices: Chat metadata in <teiHeader>

We use <particDesc> within <profileDesc> for descriptions of the chat users; this is in line with standardisation for speech corpora (ISO 24624, as of 2016)

<profileDesc>
  <particDesc>
    <listPerson>
      <person role="system" xml:id="f1101006.A01_System">
        <persName type="nickname">system</persName>
        <sex evidence="estimated">system</sex>
      </person>
      <person role="expert" xml:id="f1101006.A02">
        <persName type="nickname">[_MALE-EXPERT-A02_]</persName>
        <sex evidence="estimated">male</sex>
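A minimal Python sketch of how a pointer such as a post's @who could be resolved against the participant descriptions. The two <person> entries follow the example; the closing tags that the example cuts off are supplied here so that the sample parses, and the TEI namespace is omitted. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Person entries as in the example, completed with closing tags so the sample parses.
SAMPLE = """
<listPerson>
  <person role="system" xml:id="f1101006.A01_System">
    <persName type="nickname">system</persName>
    <sex evidence="estimated">system</sex>
  </person>
  <person role="expert" xml:id="f1101006.A02">
    <persName type="nickname">[_MALE-EXPERT-A02_]</persName>
    <sex evidence="estimated">male</sex>
  </person>
</listPerson>
"""

def index_persons(list_person):
    """Index <person> elements by their xml:id."""
    return {p.get(XML_ID): p for p in list_person.iter("person")}

def resolve_who(pointer, persons):
    """Resolve a '#id' pointer (e.g. a post's @who) to (role, nickname)."""
    person = persons[pointer.lstrip("#")]
    return person.get("role"), person.findtext("persName")

persons = index_persons(ET.fromstring(SAMPLE))
expert = resolve_who("#f1101006.A02", persons)
```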

Best practices: Chat metadata in <teiHeader>

Unlike in the standard for speech corpora, we put the <timeline> under <textDesc>:<interaction>

In the timeline, absolute timepoints are recorded

<textDesc>
  […]
  <interaction>
    <timeline>
      <when absolute="11:04:00" xml:id="f1101006.t001"/>
      <when absolute="11:11:00" xml:id="f1101006.t002"/>
      <when absolute="11:18:00" xml:id="f1101006.t003"/>
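A minimal Python sketch of how @synch pointers on posts could be resolved to the absolute timepoints recorded in the timeline. The three <when> entries follow the example; the wrapping <doc> element, the two posts and their ids are hypothetical, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Timeline entries as in the example; <doc> and the two posts are hypothetical.
SAMPLE = """
<doc>
  <timeline>
    <when absolute="11:04:00" xml:id="f1101006.t001"/>
    <when absolute="11:11:00" xml:id="f1101006.t002"/>
    <when absolute="11:18:00" xml:id="f1101006.t003"/>
  </timeline>
  <post xml:id="p1" synch="#f1101006.t001">first</post>
  <post xml:id="p2" synch="#f1101006.t003">second</post>
</doc>
"""

def timepoints(root):
    """Map the xml:id of each <when> to its parsed @absolute time."""
    return {
        w.get(XML_ID): datetime.strptime(w.get("absolute"), "%H:%M:%S")
        for w in root.iter("when")
    }

def post_time(post, points):
    """Look up the timepoint a post's @synch points to."""
    return points[post.get("synch").lstrip("#")]

root = ET.fromstring(SAMPLE)
points = timepoints(root)
first, second = root.findall("post")
gap_minutes = (post_time(second, points) - post_time(first, points)).total_seconds() / 60
```

The values carry a time of day only, so differences are meaningful only within a single logfile that does not cross midnight.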

Best practices: Chat metadata in <teiHeader>

We used <recordingStmt> in <sourceDesc> for metadata about media (chat platform)

<recordingStmt>
  <recording>
    <equipment>
      <p>plattformName=<name type="OTH">[_CHATPLATFORM_]</name></p>
      <p>plattformURL=<ref type="url">[_WWWURL_]</ref></p>
    </equipment>
    […]

Result of the project: CLARIN-D-conformant resource

Dortmund Chat Corpus 2.0

# chat log files: 470

# posts: 131,033

# tokens: 1,005,166

file size (TEI-XML): 100 MB

Availability of the integrated resource

Integration in CLARIN-D repository at IDS: done

Integration in CLARIN-D repository at BBAW: this week

Downloadable only when full anonymisation is finished

Availability of the integrated resource

To be integrated in the German Reference Corpus DeReKo at IDS and searchable through COSMAS II

To be integrated in the DWDS corpus query interface at BBAW

Will also be searchable via CLARIN web services and the CLARIN Federated Content Search

Summary and outlook (1)

Using the example of a schema developed for remodeling the Dortmund Chat Corpus, we presented a rationale and selected models & best practices for the representation of CMC genres in TEI (based on models discussed and tested in the TEI-SIG “computer-mediated communication”)

Goal of this talk: Draw the attention of the TEI community to the current state of the schemas developed by members of the SIG, in order to ask for feedback, comments etc. on our ideas.

Next milestone for the SIG: Submit a feature request on GitHub based on the schemas developed so far and the comments received from the community (feature request planned for spring/summer 2017)

Summary and outlook (2): open issues

How to encode <post>-specific metadata?

In the project Chat2CLARIN, we chose to encode it using a growing number of attributes (regular and customised)

<post auto="true" rend="color:blue" synch="#f1101004.t046" type="event"
      who="#f1101004.A01_System" xml:id="f1101004.m137">
  […]

Summary and outlook (2): open issues

How to encode <post>-specific metadata?

However, you might alternatively want to encode it in a separate "<postHeader>" in the <post>, containing e.g. feature structures

<post auto="true">
  <postHeader>
    <fs>
      <f name="status">
        <symbol value="delivered"/>
      </f>
      […]

Summary and outlook (2): open issues

How to encode <post>-specific metadata?

Or define <post> as a declaring element and point to post-specific metadata contained in declarable elements in the teiHeader (cf. e.g. Stadler et al. 2016)

<post auto="true" decls="#correspDesc_01">
  […]

Summary and outlook (3): Get in touch!

Wiki page of the CMC-SIG with schemas (ODD & RNG), documentation of previous activities etc.: http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication

Google Group (for feedback, comments etc.): https://groups.google.com/forum/#!forum/tei-cmc

[email protected]

... or in the SIG meeting right after the coffee break!

SIG meeting

Converting and Representing Social Media Corpora into TEI: Schema and Best Practices from CLARIN-D

Thank you!

[email protected]

[email protected]