Virach Sornlertlamvanich Information R&D Division (iTech) National Electronics and Computer Technology Center (NECTEC) THAILAND 19 January 2001 Symposium on Language Resources in Asia Thai Linguistic Resources

Page 1:

Virach Sornlertlamvanich

Information R&D Division (iTech)

National Electronics and Computer Technology Center (NECTEC)

THAILAND

19 January 2001

Symposium on Language Resources in Asia

Thai Linguistic Resources

Page 2:

How Important!

Language Processing

[Diagram. Top-Down path: Linguistic Knowledge, Defining Rules. Bottom-Up path: Training Resources, Linguistic Knowledge, Statistical Modeling. Both paths: Models, Evaluation against Evaluation Resources, Adjust.]

• Linguistic resources are necessary in both top-down and bottom-up design

• Exploitable in modeling and evaluation

Page 3:

What do we need?

Linguistic Resources

• Lexicon / Dictionary (30k)

• Tagged Text (2MB) / Speech Corpora

• Language Model

Fundamental Linguistic Tools

• Word Extraction (ML; p=85%; r=56%)

• Word Segmentation / POS tagger (ML; 96-97%)

• Sentence Segmentation (ML; 85-89%)

• Grapheme-to-Phoneme Conversion (PGLR; 73-90%)

• Word Sense Disambiguation

• Corpus / UNL / UW (concept) Editor

Applications

• MT (ParSit; http://come.to/parsit) / UNL

• Text Summarization

• Speech Recognition / Synthesis
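
The Word Extraction entry reports precision and recall separately (p=85%; r=56%). A common way to fold the two into one number is the F-measure; the slide does not quote one, so the sketch below is only an illustration, and the function name `f_measure` is an assumption of this example.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1.0 weights precision and recall equally (the usual F1).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Word Extraction figures quoted on the slide: p = 85%, r = 56%.
word_extraction_f1 = f_measure(0.85, 0.56)
print(f"F1 = {word_extraction_f1:.3f}")  # prints "F1 = 0.675"
```

The harmonic mean sits closer to the lower of the two values, so the low recall dominates the combined score.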

Page 4:

Our Workbench …

[Diagram of the workbench, connecting Linguistic Resources, Linguistic Tools, and Applications. Labels:]

• Raw Text, Corpus Editor, XML Tagged Corpus

• Word Extraction, Lexicon, Corpus-based Dictionary, Interlingual Concept, Language Model

• Word Segmentation, POS Tagging, Sentence Extraction, Grapheme to Phoneme, Word Disambiguation

• Prosody coverage, Phonetically balanced, Vocabulary coverage

• UNL Machine Translation, Text Summarization, Speech Recognition, Speech Synthesis

Page 5:

Open Linguistic Resources

• LEXiTRON v 1.1 (a corpus-based Thai-English (T-E) dictionary, 1994)
  - About 11,000 Thai entries; 9,000 English entries
  - http://www.links.nectec.or.th/lexit

• ORCHID POS-Tagged Corpus (supported by CRL, 1997)
  - 160 documents; 2MB text; 400K words
  - XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
  - http://www.links.nectec.or.th/orchid

• Thai Royal Institute Dictionary (Thai-Thai (T-T) dictionary)
  - Basic terms: 32,000 entries
  - Technical terms: 15,339 entries
  - http://www.royin.go.th/

• ParSit (http://come.to/parsit, 2000)
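
The ORCHID entry says the corpus is XML tagged for paragraph, sentence, word, and part-of-speech. The sketch below shows how a corpus marked up at those four levels could be read with Python's standard library. The element and attribute names (`corpus`, `paragraph`, `sentence`, `w`, `pos`) and the sample content are assumptions made for illustration; they are not the actual ORCHID schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element names, tags, and words are illustrative only.
SAMPLE = """
<corpus>
  <paragraph>
    <sentence>
      <w pos="PPRS">ฉัน</w>
      <w pos="VACT">กิน</w>
      <w pos="NCMN">ข้าว</w>
    </sentence>
    <sentence>
      <w pos="NCMN">แมว</w>
      <w pos="VACT">นอน</w>
    </sentence>
  </paragraph>
</corpus>
"""


def read_tagged_sentences(xml_text):
    """Return one list of (word, pos) pairs per <sentence> element."""
    root = ET.fromstring(xml_text)
    return [
        [(w.text, w.get("pos")) for w in sentence.iter("w")]
        for sentence in root.iter("sentence")
    ]


sentences = read_tagged_sentences(SAMPLE)
# Count how often each part-of-speech tag occurs across the sample.
pos_counts = Counter(pos for sent in sentences for _, pos in sent)
print(len(sentences), "sentences;", dict(pos_counts))
```

Keeping word boundaries explicit in the markup matters for Thai, which is written without spaces between words, so the tagged corpus doubles as segmentation training data.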

Page 6:

Ongoing: Thai Speech Corpus #1

Scope (2001)

• Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
  - Phonetically-balanced sentences
  - 5K vocabulary coverage sentences

• Corpus for Text-to-Speech Synthesis
  - 400 phonetically and prosodically balanced sentences
  - For probabilistic prosody generation

• Dialog speech corpus (collaboration with ATR)
  - 50 conversations, 2,099 sentences
  - 5,000 words, 866 phonetically-balanced sentences
  - 40 speakers (males and females)

Page 7:

Ongoing: Thai Speech Corpus #2

Procedure

[Flow diagram. Labels, in apparent order:]

• Raw Text

• Word Segmentation, Sentence Extraction, POS Tagging, Grapheme-to-Phoneme

• Corpus Editor

• XML Tagged Corpus

• Sentence Selection Process (phonetically-balanced, vocabulary coverage, prosody-balanced)

• Speech Recording and Tagging

• Tagged Speech Corpus
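
The slide names a sentence selection step driven by phonetic balance and vocabulary coverage but does not say how it is carried out. One common heuristic for this kind of task is greedy set cover: repeatedly pick the sentence that adds the most units (phonemes or vocabulary items) not yet covered. The sketch below illustrates that heuristic only; it is not presented as NECTEC's procedure, and the sentence IDs and unit sets are invented.

```python
def select_sentences(candidates, target_units):
    """Greedy set cover over candidate sentences.

    candidates: dict mapping sentence ID -> set of units it contains.
    target_units: set of units the selected sentences should cover.
    Returns (selected IDs in pick order, units left uncovered).
    """
    remaining = set(target_units)
    pool = dict(candidates)
    selected = []
    while remaining and pool:
        # sorted() makes tie-breaking deterministic (lowest ID wins).
        best = max(sorted(pool), key=lambda sid: len(pool[sid] & remaining))
        gain = pool[best] & remaining
        if not gain:
            break  # no remaining candidate covers anything new
        selected.append(best)
        remaining -= gain
        del pool[best]
    return selected, remaining


# Invented toy data: each sentence is reduced to the set of units it contains.
candidates = {
    "s1": {"k", "a", "n"},
    "s2": {"m", "a"},
    "s3": {"k", "i", "m", "t"},
    "s4": {"n", "i"},
}
targets = {"k", "a", "n", "i", "m", "t"}
chosen, uncovered = select_sentences(candidates, targets)
print(chosen, uncovered)  # prints "['s3', 's1'] set()"
```

Greedy selection does not guarantee the smallest possible sentence set, but it keeps the recording script short without an exhaustive search.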

Page 8:

Ongoing: Thai Speech Corpus #3

Tools

[Screenshot. Labels: Plain Text, Corpus Editor, XML Corpus]

Page 9:

Ongoing: Thai Speech Corpus #4

Text Sources

• Technology Promotion Association (Thailand-Japan)

• Amarin Printing Co., Ltd.

• Matichon Public Co., Ltd.

Project Collaboration

• Kasetsart University

• Thammasat University

• King Mongkut's University of Technology Thonburi

• Prince of Songkla University

Page 10:

Ongoing: Thai Speech Corpus #5

                  JNAS                       TIMIT      WSJCAM0          NECTEC (2001-2006)
Vocab size        5K, 20K                    -          20K, 64K         20K
# sent (PB)       503                        450        < 1,500          < 866
# sent (Vocab)    < 15,000                   1,890      < 14,000         < 10,000
# speaker         306                        630        140              200
# sent/speaker    150 (100 Vocab + 50 PB)    10         100 (Vocab+PB)   100 (80 Vocab + 20 PB)
Record time       60 hrs. (16 CDROM)         1 CDROM    -                1GB
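
The speaker counts and sentences-per-speaker figures in the comparison imply a rough number of recorded utterances per corpus. The short calculation below multiplies the two rows. It assumes the column order JNAS, TIMIT, WSJCAM0, NECTEC for both rows, and the products are nominal figures derived from the slide, not published corpus sizes.

```python
# (speakers, sentences per speaker), read from the comparison table.
# The assignment of values to corpora assumes the column order
# JNAS, TIMIT, WSJCAM0, NECTEC (2001-2006).
corpora = {
    "JNAS": (306, 150),
    "TIMIT": (630, 10),
    "WSJCAM0": (140, 100),
    "NECTEC (2001-2006)": (200, 100),
}

# Nominal utterance count = speakers x sentences per speaker.
totals = {
    name: speakers * per_speaker
    for name, (speakers, per_speaker) in corpora.items()
}

for name, total in totals.items():
    print(f"{name}: {total:,} utterances")
```

Under these assumptions the planned NECTEC corpus comes to 200 x 100 = 20,000 utterances, between the WSJCAM0 figure (14,000) and the JNAS figure (45,900).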

Page 11:

Ongoing: LEXiTRON v 2.0 #1

Scope (2001)

• Entries
  - 25,000 Thai - English
  - 25,000 English - Thai

• Fields
  - Translation
  - Phonetics
  - Root of vocabulary
  - Part-of-speech
  - Synonym
  - Antonym
  - Sentence sample
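
The field list above can be pictured as one record per dictionary entry. The sketch below is a hypothetical data structure, not the actual LEXiTRON database schema: the class name, attribute names, types, and the sample entry are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DictionaryEntry:
    """Hypothetical record mirroring the fields listed on the slide."""
    headword: str
    translations: List[str]
    phonetics: str = ""
    root: str = ""
    part_of_speech: str = ""
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Invented sample entry for an English - Thai direction.
entry = DictionaryEntry(headword="cat", translations=["แมว"], part_of_speech="N")
print(entry.missing_fields())
```

A completeness check like `missing_fields` is one way an editing tool could flag entries that still need phonetics, synonyms, or corpus-based sample sentences.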

Procedure

[Flow diagram. Labels, in apparent order:]

• Raw Text, Word Extraction, Existing Dictionary

• Vocabulary Selection

• Dictionary Editing, with Existing Dictionary and Corpus-based Sentence Samples

• LEXiTRON v 2.0

Page 12:

Ongoing: LEXiTRON v 2.0 #2

Tools

• Dictionary DB

• Phonetic Symbols

• Wordnet

• Corpus-based Sample Sentences

Page 13:

Discussion

• Language difficulties; 13 Tai-family languages

• Text sources

• Common tagset

• Resource center

• Institutional collaboration