Introducing Czech National Corpus - Brown University, 04/09/16 · Czech National Corpus project ......


Introducing Czech National Corpus

Brown University, 04/09/16

Václav Cvrček


Introduction


Czech National Corpus project

Basic facts about the CNC

▶ est. in 1994 by prof. František Čermák
▶ 2 departments of the Faculty of Arts, Charles University in Prague (ICNC & ITCL)
▶ in 2012 acknowledged by MEYS as a research infrastructure for social sciences and humanities
▶ 4,500+ registered users
▶ 1,900 queries per day
▶ web portal: www.korpus.cz


Data


Language data

SYN        2.3 bil.
InterCorp  1.4 bil.
ORAL       5 mil.
Diakorp    3.4 mil.


SYN series

Written (i.e. published) synchronic texts:

SYN2000      100M   representative (1990–1999)
SYN2005      100M   representative (2000–2004)
SYN2006PUB   300M   journalistic texts (1989–2004)
SYN2009PUB   700M   journalistic texts (1995–2007)
SYN2010      100M   representative (2005–2009)
SYN2013PUB   935M   journalistic texts (2005–2009)
SYN2015      100M   representative (2010–2014)
SYN (v. 3)   2.3G   union of all SYN* corpora

All corpora are lemmatized, morphologically tagged, and enriched with metadata (bibliographic information + text-type/genre classification).
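As a quick sanity check on the figures above, the component sizes can be added up and compared with the 2.3G reported for SYN (v. 3). This is only a sketch: the sizes are copied from the slide, and treating the "union" as a plain sum is an assumption that holds only if the component corpora do not share texts.

```python
# Sizes in millions (M), copied from the SYN series slide.
# Assumption: the "union" can be approximated by a plain sum,
# i.e. the component corpora do not overlap.
SYN_SIZES_M = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
    "SYN2013PUB": 935,
    "SYN2015": 100,
}

total_m = sum(SYN_SIZES_M.values())

# Corpora whose name ends in "PUB" are the journalistic ones on the slide.
journalistic_m = sum(
    size for name, size in SYN_SIZES_M.items() if name.endswith("PUB")
)
representative_m = total_m - journalistic_m

print(f"sum of components: {total_m}M (about {total_m / 1000:.1f}G)")
print(f"representative: {representative_m}M, journalistic: {journalistic_m}M")
```

The components add up to 2,335M, which is consistent with the rounded 2.3G given for the union.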


ORAL series

Unprepared, dialogical, informal spoken language

One-layer transcription corpora:

ORAL2006   1.0M   Bohemian Czech only
ORAL2008   1.0M   sociolinguistically balanced, Bohemian Czech only
ORAL2013   2.8M   sociolinguistically balanced, whole CR, text-to-sound alignment

Older spoken corpora: Prague spoken corpus (0.5M), Brno spoken corpus (0.5M)


Diachronic corpus

DIAKORP

▶ diachronic part of the CNC – 2.5 mil. words (v. 5)
▶ the end of the 13th century to the beginning of the SYN section (1945)
▶ texts are transcribed, not transliterated
▶ current focus on 19th century – lemmatization


Multilingual parallel corpus InterCorp

Czech texts with translations to or from 30+ languages

InterCorp (v. 8)

▶ core (=fiction) and collections (=journalism, subtitles…)

          Core    Collections
cs        85M     90M
foreign   194M    1,229M

▶ partly lemmatized and tagged
▶ uneven amount of texts in language pairs


Language        Core     Total
bg Bulgarian    5.2M     28.1M
da Danish       3.0M     53.0M
de German       27.7M    77.1M
en English      15.5M    113.9M
es Spanish      17.5M    103.9M
fi Finnish      3.4M     45.2M
fr French       9.2M     87.0M
hr Croatian     15.5M    34.6M
hu Hungarian    5.4M     58.1M
it Italian      7.2M     65.6M
pl Polish       17.5M    79.9M
ru Russian      3.3M     13.4M
sk Slovak       7.4M     44.5M
sl Slovenian    0.9M     49.8M
sr Serbian      8.8M     29.6M
uk Ukrainian    5.1M     5.3M
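The unevenness across language pairs can be made concrete with a short sketch. It assumes that each language's Total is the sum of its core and collections parts (the slide does not say so explicitly), and the helper names `collections_size` and `core_share` are hypothetical, introduced only for this illustration.

```python
# (core, total) sizes in millions, copied from the InterCorp language table.
INTERCORP_M = {
    "bg": (5.2, 28.1), "da": (3.0, 53.0), "de": (27.7, 77.1),
    "en": (15.5, 113.9), "es": (17.5, 103.9), "fi": (3.4, 45.2),
    "fr": (9.2, 87.0), "hr": (15.5, 34.6), "hu": (5.4, 58.1),
    "it": (7.2, 65.6), "pl": (17.5, 79.9), "ru": (3.3, 13.4),
    "sk": (7.4, 44.5), "sl": (0.9, 49.8), "sr": (8.8, 29.6),
    "uk": (5.1, 5.3),
}


def collections_size(lang: str) -> float:
    """Collections size in millions, ASSUMING total = core + collections."""
    core, total = INTERCORP_M[lang]
    return round(total - core, 1)


def core_share(lang: str) -> float:
    """Fraction of a language's total that belongs to the core (fiction)."""
    core, total = INTERCORP_M[lang]
    return core / total


# Languages ordered from the largest core to the smallest.
by_core = sorted(INTERCORP_M, key=lambda lang: INTERCORP_M[lang][0], reverse=True)

for lang in by_core:
    core, total = INTERCORP_M[lang]
    print(f"{lang}: core {core}M, collections {collections_size(lang)}M, "
          f"core share {core_share(lang):.0%}")
```

Under that assumption, German has the largest core and Slovenian the smallest, while Ukrainian consists almost entirely of core texts.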


Tools


CNC Tools

▶ main concordancer
▶ analysis of variants
▶ derivational morphology
▶ discourse analysis
▶ translation equivalents

All tools are available online within the portal www.korpus.cz


CNC research portal www.korpus.cz


KonText – CNC concordancer


SyD – exploring variants


Treq – translation equivalents


Translation candidates for “workshop”

%      Czech                           English
39.4   dílna (‘workroom’)              workshop
30.4   seminář (‘seminar’)             workshop
8.7    workshop                        workshop
4.6    pracovní (‘working’)            workshop
2.3    kurs (‘course’)                 workshop
1.7    garáž (‘garage’)                workshop
1.7    krejčovna (‘tailor’s shop’)     workshop
0.9    ateliér (‘studio’)              workshop
0.8    továrna (‘factory’)             workshop
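A small sketch of how such a candidate list might be read: it adds up the shares shown on the slide to see how much of the distribution the listed equivalents cover. The percentages are copied from the slide; interpreting the remainder as rarer equivalents that are not shown is an assumption, since the slide does not state what the percentages are relative to.

```python
# Shares (in %) of Czech equivalents of "workshop", copied from the slide.
CANDIDATES = [
    ("dílna", 39.4),
    ("seminář", 30.4),
    ("workshop", 8.7),
    ("pracovní", 4.6),
    ("kurs", 2.3),
    ("garáž", 1.7),
    ("krejčovna", 1.7),
    ("ateliér", 0.9),
    ("továrna", 0.8),
]

# Rounding to one decimal avoids floating-point noise in the sums.
listed = round(sum(share for _, share in CANDIDATES), 1)
top_two = round(sum(share for _, share in CANDIDATES[:2]), 1)

# Assumption: whatever is missing from 100 % belongs to candidates not shown.
unlisted = round(100 - listed, 1)

print(f"top two candidates: {top_two} %")
print(f"all listed candidates: {listed} %, not shown: {unlisted} %")
```

The two most frequent equivalents, dílna and seminář, together account for 69.8 % of the cases, and the nine listed candidates for 90.5 %.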


User Services


User services

▶ hosting of corpora
▶ providing data packages (NLP)
▶ analysis of user data
▶ consulting, education, training


Repository Biblio


User forum and user support


CNC Wiki


www.korpus.cz