Introducing Czech National Corpus - Brown University, 04/09/16 · Czech National Corpus project ......

Introducing Czech National Corpus

Brown University, 04/09/16

Václav Cvrček

Introduction

Czech National Corpus project

Basic facts about the CNC

▶ est. in 1994 by Prof. František Čermák
▶ 2 departments of the Faculty of Arts, Charles University in Prague (ICNC & ITCL)
▶ in 2012 acknowledged by MEYS as a research infrastructure for social sciences and humanities
▶ 4,500+ registered users
▶ 1,900 queries per day
▶ web portal: www.korpus.cz

Data

Language data

SYN        2.3 bil.
InterCorp  1.4 bil.
ORAL       5 mil.
Diakorp    3.4 mil.

SYN series

Written (i.e. published) synchronic texts:

SYN2000      100M   representative (1990–1999)
SYN2005      100M   representative (2000–2004)
SYN2006PUB   300M   journalistic texts (1989–2004)
SYN2009PUB   700M   journalistic texts (1995–2007)
SYN2010      100M   representative (2005–2009)
SYN2013PUB   935M   journalistic texts (2005–2009)
SYN2015      100M   representative (2010–2014)
SYN (v. 3)   2.3G   union of all SYN* corpora

All corpora are lemmatized, morphologically tagged and enriched with metadata (bibliographic information + text-type/genre classification).
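
As a quick consistency check, the sizes of the individual SYN corpora listed above can be added up and compared with the size given for the SYN (v. 3) union. This is only a minimal sketch: the name `syn_sizes_m` is an illustrative choice, and the figures are the rounded values from the slide, in millions.

```python
# Sizes of the individual SYN corpora in millions, as listed on the slide.
syn_sizes_m = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
    "SYN2013PUB": 935,
    "SYN2015": 100,
}

# The union SYN (v. 3) should be roughly the sum of its parts.
total_m = sum(syn_sizes_m.values())
total_g = total_m / 1000

print(f"{total_m}M = {total_g:.1f}G")  # prints "2335M = 2.3G"
```

The sum agrees with the 2.3G reported for the union, which suggests that the individual corpora do not overlap substantially.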

ORAL series

Unprepared, dialogical, informal spoken language

One-layer transcription corpora:

ORAL2006   1.0M   Bohemian Czech only
ORAL2008   1.0M   sociolinguistically balanced, Bohemian Czech only
ORAL2013   2.8M   sociolinguistically balanced, whole CR, text-to-sound alignment

Older spoken corpora: Prague spoken corpus (0.5M), Brno spoken corpus (0.5M)

Diachronic corpus

DIAKORP

▶ diachronic part of the CNC – 2.5 mil. words (v. 5)
▶ the end of the 13th century to the beginning of the SYN section (1945)
▶ texts are transcribed, not transliterated
▶ current focus on 19th century – lemmatization

Multilingual parallel corpus InterCorp

Czech texts with translations to or from 30+ languages

InterCorp (v. 8)

▶ core (=fiction) and collections (=journalism, subtitles…)

          Core    Collections
cs        85M     90M
foreign   194M    1,229M

▶ partly lemmatized and tagged
▶ uneven amount of texts in language pairs

Language        Core     Total
bg Bulgarian    5.2M     28.1M
da Danish       3.0M     53.0M
de German       27.7M    77.1M
en English      15.5M    113.9M
es Spanish      17.5M    103.9M
fi Finnish      3.4M     45.2M
fr French       9.2M     87.0M
hr Croatian     15.5M    34.6M
hu Hungarian    5.4M     58.1M
it Italian      7.2M     65.6M
pl Polish       17.5M    79.9M
ru Russian      3.3M     13.4M
sk Slovak       7.4M     44.5M
sl Slovenian    0.9M     49.8M
sr Serbian      8.8M     29.6M
uk Ukrainian    5.1M     5.3M
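
The unevenness across language pairs can be made concrete by computing, for each language listed above, what share of its total comes from the core (fiction) part. The following is a rough sketch only: the names `intercorp_m` and `core_share` are illustrative, and the figures are the rounded values from the slide, in millions.

```python
# (core, total) sizes in millions for each language, copied from the slide.
intercorp_m = {
    "bg": (5.2, 28.1), "da": (3.0, 53.0), "de": (27.7, 77.1), "en": (15.5, 113.9),
    "es": (17.5, 103.9), "fi": (3.4, 45.2), "fr": (9.2, 87.0), "hr": (15.5, 34.6),
    "hu": (5.4, 58.1), "it": (7.2, 65.6), "pl": (17.5, 79.9), "ru": (3.3, 13.4),
    "sk": (7.4, 44.5), "sl": (0.9, 49.8), "sr": (8.8, 29.6), "uk": (5.1, 5.3),
}

# Share of the core part in each language's total, in percent.
core_share = {lang: 100 * core / total for lang, (core, total) in intercorp_m.items()}

# Languages with the largest and smallest total amount of text.
largest = max(intercorp_m, key=lambda lang: intercorp_m[lang][1])
smallest = min(intercorp_m, key=lambda lang: intercorp_m[lang][1])

for lang in sorted(core_share, key=core_share.get, reverse=True):
    print(f"{lang}: {core_share[lang]:5.1f}% core")
print(f"largest total: {largest}, smallest total: {smallest}")
```

With these figures, Ukrainian consists almost entirely of core texts, while Slovenian is dominated by the collections, so results for different language pairs are not directly comparable without taking text type into account.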

Tools

CNC Tools

▶ main concordancer
▶ analysis of variants
▶ derivational morphology
▶ discourse analysis
▶ translation equivalents

All tools are available on-line within the portal www.korpus.cz

CNC research portal www.korpus.cz

KonText – CNC concordancer

SyD – exploring variants

Treq – translation equivalents

Translation candidates for “workshop”

%      Czech                          English
39.4   dílna (‘workroom’)             workshop
30.4   seminář (‘seminar’)            workshop
8.7    workshop                       workshop
4.6    pracovní (‘working’)           workshop
2.3    kurs (‘course’)                workshop
1.7    garáž (‘garage’)               workshop
1.7    krejčovna (‘tailor’s shop’)    workshop
0.9    ateliér (‘studio’)             workshop
0.8    továrna (‘factory’)            workshop
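
To see how much of the usage of “workshop” the listed candidates account for, the percentages above can be accumulated from the top down. This is an illustrative sketch, not part of Treq itself: the variable names are hypothetical, the percentages are the ones shown on the slide, and reading the remainder as further, rarer equivalents not shown is an assumption.

```python
# Treq percentages for English "workshop", copied from the slide.
candidates = [
    ("dílna", 39.4), ("seminář", 30.4), ("workshop", 8.7), ("pracovní", 4.6),
    ("kurs", 2.3), ("garáž", 1.7), ("krejčovna", 1.7), ("ateliér", 0.9),
    ("továrna", 0.8),
]

# Running total of the share covered by the candidates seen so far.
covered = 0.0
for czech, pct in candidates:
    covered += pct
    print(f"{czech:<10} {pct:5.1f}%  cumulative {covered:5.1f}%")

top_two = round(candidates[0][1] + candidates[1][1], 1)
remainder = round(100 - covered, 1)
print(f"top two: {top_two}%, not listed: {remainder}%")
```

The two most frequent equivalents, dílna and seminář, together cover about 70% of the aligned occurrences, and the nine listed candidates cover about 90%.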

User Services

User services

▶ hosting of corpora
▶ providing data packages (NLP)
▶ analysis of user data
▶ consulting, education, training

Repository Biblio

User forum and user support

CNC Wiki

www.korpus.cz