A Proposed Tag Set for Exchanging Word-Segmented Text Corpora

Jing-Shin Chang
shin@bdc.com.tw
http://www.bdc.com.tw/~shin/
Behavior Design Corporation


Technical Issues in Designing a Tag Set for a Stratified WS Standard

- Why a stratified word segmentation (WS) standard?
  - “Words” are generated by various mechanisms
  - There is no common agreement on all WS criteria, because different researchers and institutions apply different processing models to these mechanisms
  - Stratification helps the exchange of corpora and their conversion to appropriate private word units
- What tags and attributes, and why?


Text Generation Mechanisms Behind Word Stratification

- Lexicon Selection
  - Basic Lexicon (“Standard Dictionary”)
  - Derivational Processes (non-enumerable)
    - simple variants (color/colour; 呆子/獃子 "fool"; 兇手/凶手 "murderer")
    - regular expressions (numbers, word patterns)
    - regular derivational processes (proper nouns, abbreviations, compounding, …)
- Text Planning
  - Writing Variants (symbols, punctuation)


What Tags/Attributes, and Why?

- Tags for carrying linguistic information
  - word boundary
  - level of stratification (in terms of a standard)
  - misc. (e.g., symbols and punctuation in text)
- Tags for carrying conformance information
  - the standard/substandard conformed to
    - so that text can be converted to and from private systems easily
    - to allow user extension of the (sub)standard(s) and to cope with changes over time


Tags for carrying linguistic information

- Tags:
  - <w0> (~信級詞): words in the standard dictionary
  - <w1> (~達級詞): morphologically derived words
  - <w2> (~雅級詞): words derived through compounding regularity
- Attributes:
  - POS (part of speech)
  - tt (token type, derived word type)
  - hwds (embedded head words)
  - rel (relationship among embedded head words)


Tags for carrying conformance information

- Tags: <wstxt>, <ws0p>, <ws1p>, <ws2p>, <p> (un-segmented paragraph)
  - paragraphs at various stratification levels, conforming to the specified standard/substandards
- Attributes: WS, Dict, MR, NUM, NAM, CMPR, DR, GR, specifying:
  - the “standard resources” conformed to
  - user extensions, e.g., Dict="CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2"
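As a rough illustration of how a consumer might read such an attribute value, the sketch below splits a comma-separated resource list and flags user extensions. It assumes that an "X-" prefix marks a user extension, which is only inferred from the example value above; `ResourceRef` and `parse_resource_attr` are hypothetical names, not part of the proposal.

```python
from typing import List, NamedTuple


class ResourceRef(NamedTuple):
    """One symbolic resource name taken from an attribute value."""
    name: str
    is_user_extension: bool


def parse_resource_attr(value: str) -> List[ResourceRef]:
    """Split a comma-separated resource list into individual references.

    Assumption: names starting with "X-" are user extensions. The slides
    show such a name but do not spell the rule out.
    """
    refs = []
    for raw in value.split(","):
        name = raw.strip()
        if not name:
            continue  # tolerate empty items such as a trailing comma
        refs.append(ResourceRef(name, name.upper().startswith("X-")))
    return refs


refs = parse_resource_attr("CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2")
for ref in refs:
    kind = "user extension" if ref.is_user_extension else "official"
    print(f"{ref.name}: {kind}")
```

Keeping the official and extension names apart lets a private system decide whether it can test conformance against every listed resource or only against the official ones.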


Attributes On Standardized Resources: Why?

- The official standard (and thus its tags) should be defined in terms of explicitly specified and unambiguously testable resources!
  - with an (optional) mechanism for user extension
  - e.g., Charset registry, RFC (Internet standards)
- Every resource is assigned a symbolic name (referenced in attributes)
  - for conformance testing
  - for conversion to, and evaluation in, private systems


Attributes On Standardized Resources (Cont.)

- WS: WS standard, a collection of substandards (such as Dict, MR, ...)
  - e.g., CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1
- Dict: standard dictionary (basic lexicon)
  - qualified basic words
  - POS: optional (referred to by other substandards)
- MR: morphological rules/standard
  - qualified affix/prefix/suffix
  - qualified combination patterns
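The naming pattern CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1 can be sketched as a simple name expansion. This is only an illustration under the assumption that a WS standard name has the shape `<registrant>-WS-<version>`; the function name is hypothetical, and the list of substandards is passed in explicitly because the slide leaves the full set open.

```python
from typing import Dict, List


def expand_ws_standard(ws_name: str, substandards: List[str]) -> Dict[str, str]:
    """Derive substandard names from a WS standard name.

    Assumes the shape "<registrant>-WS-<version>", e.g. "CNS-WS-1998-1".
    """
    registrant, sep, version = ws_name.partition("-WS-")
    if not (registrant and sep and version):
        raise ValueError(f"not a WS standard name: {ws_name!r}")
    return {sub: f"{registrant}-WS-{sub}-{version}" for sub in substandards}


# Only the two substandards named on the slide are expanded here.
expanded = expand_ws_standard("CNS-WS-1998-1", ["Dict", "MR"])
for attr, resource in expanded.items():
    print(f'{attr}="{resource}"')
```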


Attributes on Standardized Resources (Cont.) [& Arguable]

- NUM: numbering rules/patterns
- NAM: naming rules/patterns
  - qualified family names
  - length constraints, abbreviations, standard translations of foreign names, ...


Attributes on Standardized Resources (Cont.) [& Arguable]

- CMPR: compound formation rules/patterns
- DR: other derivational rules not in the above substandards
- GR: private rules/patterns/description


Example: Simplest Encoding

Example sentence: 中文分詞標準必須一步一步小心地制定. ("Chinese word segmentation standards must be established carefully, step by step.")

<!-- The whole segmented text is enclosed by the "wstxt" tag; the conforming standard is specified with
     attributes. Words are space-delimited and conform to the "w0", "w1" or "w2" standard depending
     on whether they are enclosed in "ws0p", "ws1p" or "ws2p" -->
<wstxt dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3">
  <!-- use spaces as default word boundaries w/o using word tags -->
  <ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>
    中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .
  </ws0p>
  <ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified=TRUE>
    中文 分詞 標準 必須 一步一步 小心地 制定 .
  </ws1p>
  <ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>
    中文分詞標準 必須 一步一步 小心地 制定 .
  </ws2p>
</wstxt>
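A minimal sketch of reading this simplest encoding with Python's standard `xml.etree.ElementTree`. It assumes a well-formed XML rendering of the markup (attribute values quoted, e.g. `verified="TRUE"`), which is a simplification made for this illustration; `read_space_delimited` and `PARA_LEVELS` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Well-formed rendering of the slide's example; attributes are trimmed for brevity.
SAMPLE = """\
<wstxt dict="CNS-WS-DICT-1998-1">
  <ws0p verified="TRUE">中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .</ws0p>
  <ws1p verified="TRUE">中文 分詞 標準 必須 一步一步 小心地 制定 .</ws1p>
  <ws2p verified="TRUE">中文分詞標準 必須 一步一步 小心地 制定 .</ws2p>
</wstxt>
"""

# Paragraph tag -> stratification level of the words it contains.
PARA_LEVELS = {"ws0p": 0, "ws1p": 1, "ws2p": 2}


def read_space_delimited(xml_text):
    """Return {level: [word list per paragraph]} for space-delimited paragraphs."""
    root = ET.fromstring(xml_text)
    result = {}
    for para in root:
        level = PARA_LEVELS.get(para.tag)
        if level is None:
            continue  # e.g. an un-segmented <p> paragraph
        words = (para.text or "").split()
        result.setdefault(level, []).append(words)
    return result


paragraphs = read_space_delimited(SAMPLE)
for level, paras in sorted(paragraphs.items()):
    print(level, paras[0])
```

All three paragraphs carry the same character sequence; only the word boundaries differ with the stratification level.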


Example: Using Word Tags

<!-- use word tags to identify word boundaries -->
<ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>
  <w0>中文</w0><w0>分詞</w0><w0>標準</w0><w0>必須</w0>
  <w0 pos=quan>一</w0><w0>步</w0><w0>一</w0><w0>步</w0>
  <w0>小心</w0><w0>地</w0><w0>制定</w0><w0>.</w0>
</ws0p>
<ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified=TRUE>
  <w0>中文</w0> <w0>分詞</w0> <w0>標準</w0> <w0>必須</w0>
  <w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
  <w1><w0>小心</w0><w0>地</w0></w1> <w0>制定</w0> <w0>.</w0>
</ws1p>
<ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>
  <w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>
  <w0>必須</w0>
  <w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
  <w1><w0>小心</w0><w0>地</w0></w1>
  <w0>制定</w0><w0>.</w0>
</ws2p>
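Because higher-level words nest their lower-level parts, a paragraph tagged this way can be projected onto any coarser or finer level. The sketch below shows one way a private system might do that, assuming a well-formed XML rendering of the markup; `words_at_level` and `WORD_LEVELS` are hypothetical names, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Well-formed rendering of the <ws2p> paragraph; attributes are omitted.
SAMPLE = """\
<ws2p>
  <w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>
  <w0>必須</w0>
  <w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
  <w1><w0>小心</w0><w0>地</w0></w1>
  <w0>制定</w0><w0>.</w0>
</ws2p>
"""

WORD_LEVELS = {"w0": 0, "w1": 1, "w2": 2}


def words_at_level(elem, max_level):
    """Collect word units no higher than max_level, in text order.

    A word above max_level is split into its embedded words; a word
    without embedded words is kept whole.
    """
    words = []
    for child in elem:
        level = WORD_LEVELS.get(child.tag)
        if level is None:
            continue  # not a word tag
        if level <= max_level or len(child) == 0:
            # Join the text of all embedded words, dropping layout whitespace.
            words.append("".join(part.strip() for part in child.itertext()))
        else:
            words.extend(words_at_level(child, max_level))
    return words


root = ET.fromstring(SAMPLE)
for level in (0, 1, 2):
    print(level, " ".join(words_at_level(root, level)))
```

Projecting the `<ws2p>` paragraph to levels 1 and 0 reproduces the word units of the `<ws1p>` and `<ws0p>` paragraphs in the example above.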


Example: Derived Words and Token Type (TT) Attribute

Example words: 孩子們 ("children": 孩子 "child" plus the plural suffix 們) and 一萬朵一萬朵地送 ("to send ten thousand blossoms at a time").

<ws1p MR="CNS-WS-MR-1998-2.3" DR="CNS-WS-DR-1988.1.2">
  <!-- examples of derived w1 words (from "w0" words) -->
  <w1 tt=(common_noun,suffix) MR="CNS-WS-MR-1998-2.3">
    <w0>孩子</w0><w0>們</w0>
  </w1>
  <w1>
    <w0 pos=quan>一萬</w0> <w0>朵</w0> <w0 pos=quan>一萬</w0> <w0>朵</w0>
  </w1><w0>地</w0><w0>送</w0>
</ws1p>


Example: Application of (hwds, rel) Attributes for Punctuation

Example words: 高中(職) stands for 高中 "senior high school" and 高職 "vocational high school"; 中山南、北路 and 中山南(北)路 both stand for 中山南路 and 中山北路 (Zhongshan South Road and Zhongshan North Road). The last fragment reads "... and ... have the same meaning".

<!-- examples of tagging punctuation enclosed/delimited words -->
<w1 hwds="高中,高職" rel=AND_OR>
  <w0>高中</w0><w0>(</w0><w0>職</w0><w0>)</w0>
</w1>
<!-- words with the same (hwds,rel) could be normalized to the same internal representation of a private system -->
<w1 hwds="中山南路,中山北路" rel=AND>
  <w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0>
</w1>
<w0>與</w0>
<w1 hwds="中山南路,中山北路" rel=AND>
  <w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0>
</w1>
<w0>意義</w0><w0>相同</w0><w1>...</w1><w1>...</w1>
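The normalization idea mentioned in the comment can be sketched as follows: surface forms that share the same (hwds, rel) pair are grouped under one key, which a private system could then map to a single internal representation. This is only an illustration on a well-formed XML rendering of the markup; `normalization_key` and `group_by_key` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Well-formed rendering of the two road-name variants (attribute values quoted).
SAMPLE = """\
<ws1p>
  <w1 hwds="中山南路,中山北路" rel="AND"><w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0></w1>
  <w0>與</w0>
  <w1 hwds="中山南路,中山北路" rel="AND"><w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0></w1>
</ws1p>
"""


def normalization_key(word):
    """Return (head words, rel) for a tagged word, or None if it has no hwds."""
    hwds = word.get("hwds")
    if hwds is None:
        return None
    heads = tuple(head.strip() for head in hwds.split(","))
    return heads, word.get("rel")


def group_by_key(root):
    """Group the surface forms of <w1> words by their (hwds, rel) key."""
    groups = {}
    for word in root.iter("w1"):
        key = normalization_key(word)
        if key is None:
            continue
        surface = "".join(part.strip() for part in word.itertext())
        groups.setdefault(key, []).append(surface)
    return groups


groups = group_by_key(ET.fromstring(SAMPLE))
for (heads, rel), surfaces in groups.items():
    print(heads, rel, surfaces)
```

Here the two differently punctuated spellings fall into one group, so downstream processing does not need to know which punctuation style the writer chose.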


Future Issues

- Specification of the Official WS Standard
  - standard resources and substandards to be defined in the first official version
- Construction of Basic Lexicon
  - basic vs. derivational words
  - standardization of the derivational parts
- Registration of User Extension & Evolution of the Official Standard