Post on 29-Apr-2018
Abstract. The aim of this research is the comparative evaluation of different statistical techniques applied to Automatic Text Categorization in Modern Greek. The statistical methods considered were taken from the family of multivariate data analysis methods and consist of: a) Cluster Analysis, b) Multiple Linear Regression, and c) Discriminant Function Analysis. Furthermore, a number of different predictor variables were used in order to investigate their relative contribution to the overall discriminatory performance of each statistical method.
1.
( )
( )
.
,
, .
Discriminant Function Analysis (Mikros & Carayannis 2000; Stamatatos, Fakotakis, Kokkinakis 2001), Cluster Analysis (Tambouratzis et al. 2000), Multiple Regression (Stamatatos, Fakotakis, Kokkinakis 1999; Stamatatos, Fakotakis, Kokkinakis 2000).
.
( )
.
2.
900
.
(
1):
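Table 1:
Category     Total words   Mean words/text   Std. dev.    Min    Max   Texts
[...]             33,692               225       136.3     81    969     150
[...]             24,087               161       622.5     80   5748     150
[...]             26,976               180       106.8     83    739     150
[...]             35,136               234       124.4     81   1434     150
[...]            102,906               686      1499.7     81   5960     150
Media             31,395               209       117.5     80    671     150
Total            254,192               282       692.9                   900
(Only the category name "Media" is legible in the source. The column headings are inferred from the values: the word totals sum to 254,192, and each mean equals the word total divided by the number of texts.)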
.
:
1. « » .
«Media» « »,
.
« » « »,
.
,
« »
.
2.
.
(75%) (< 250 ),
2% 1000 .
( . Mikros 2002),
. ,
500
(Baillie 1974, Ledger & Merriam 1994: 244)
.
.
3.
(register)
.
(Rudman 1998: 357)
.
3.
.
:
:
.
, WordNet (Junker & Abecker 1997, Scott &
Matwin 1999, Buenaga et al. 1997).
,
(spam mail) (Gómez et al. 2000, Sahami et al. 1998).
.
(Burrows 1992, Burrows & Craig 1994),
( . Koller & Sahami 1997, Zaiane & Antonie 2002).
:
(
, , , . .)
.
.
(Karlgren 1999: 161)
(Forsyth & Holmes 1996: 170).
:
. . .
(Hoch 1994).
. :
( ):
, .
Type/Token ratio: (types).
500 (Biber 1988)
Yule's K:
(Tweedie & Baayen 1998: 350).
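The two lexical richness measures named above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the chunk size of 500 words is an assumption taken from the "500" that survives in the text, the function names are hypothetical, and Yule's K is computed with its standard definition, K = 10^4 (sum_i i^2 V_i - N) / N^2, where V_i is the number of types occurring exactly i times and N is the number of tokens.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def standardized_ttr(tokens, chunk_size=500):
    """Mean TTR over consecutive full chunks of `chunk_size` tokens.

    Falls back to the plain TTR when the text is shorter than one chunk
    (a choice made for this sketch, not taken from the source).
    """
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2."""
    n = len(tokens)
    if n == 0:
        return 0.0
    # freq_of_freq[i] = number of types that occur exactly i times
    freq_of_freq = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freq.items())
    return 1e4 * (s2 - n) / (n * n)


tokens = "the cat sat on the mat and the dog sat too".split()
print(type_token_ratio(tokens), standardized_ttr(tokens), yules_k(tokens))
```

Averaging the ratio over fixed-size chunks makes texts of different lengths comparable, since the plain ratio falls as a text grows longer.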
:
« » ( , ,
) ( , ,
, ).
( )
( )
:
, , . . . .
15
1
15 .
: 5
: ( ), ! % -
( ):
.
( ): 50
.
( ): 20
. 6 120 (20 6)
60
(20 3).
500
.
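A word-length distribution over 1 to 15 letters and a punctuation profile, two of the variable groups listed above, might be computed as in the sketch below. The set of punctuation marks, the normalisation per 500 words, and the pooling of longer words into the last bin are all assumptions made for illustration; the function names are hypothetical.

```python
from collections import Counter

MAX_LEN = 15                      # word lengths 1..15, as in the text
PUNCTUATION = list(",.;:!?()%-")  # illustrative set (assumption)


def word_length_profile(tokens, max_len=MAX_LEN):
    """Relative frequency of words with 1..max_len letters.

    Words longer than max_len are counted in the last bin (assumption).
    """
    counts = Counter(min(len(t), max_len) for t in tokens if t)
    total = sum(counts.values())
    return [counts.get(n, 0) / total if total else 0.0
            for n in range(1, max_len + 1)]


def punctuation_profile(text, marks=PUNCTUATION, per=500):
    """Occurrences of each mark per `per` whitespace-delimited words."""
    n_words = len(text.split())
    counts = Counter(ch for ch in text if ch in marks)
    return {m: (counts.get(m, 0) * per / n_words if n_words else 0.0)
            for m in marks}


print(word_length_profile(["a", "bb", "bb", "cccc"])[:4])
print(punctuation_profile("a, b, c!", per=3)[","])
```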
4.
( . . Dixon & Mannion 1993, Matthews & Merriam
1993, Holmes & Forsyth 1995).
Forsyth & Holmes (1996: 163) .
.
.
. Stamatatos, Fakotakis,
Kokkinakis (2000)
( . . BNC ).
(Stamatatos, Fakotakis, Kokkinakis
(2001) 30 50 (
).
,
, .
.
.
(frequency profiling).
( . Hofland
& Johansson 1982, Rayson et al. 1997, Granger & Rayson 1998).
Iakovou,
Markopoulos, Mikros (2002) (2003)
.
:
1. .
2. - : . .
4
4 - ,
. . ( - ( ), -
( ), - ( ), - ( ) . . .
3. ( ) - .
4. -
. ( ) ~ ( , , )
5. k
-
3 - .
6. -
7. n ( 4 k)
n
( 5) χ² (Schutze et al. 1995), mutual information, information gain (Lewis & Ringuette 1994), Principal Component Analysis (Wiener et al. 1993), log likelihood (Dunning 1993).
,
(Kageura
1997).
, χ² (Manning & Schutze 1999: 174).
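Log likelihood is one of the selection measures listed above. As a minimal sketch of how it can rank the words that characterise one category against the rest of a corpus, the code below uses the two-term form of the statistic that is common in corpus frequency profiling. The function names, the cut-off `n`, and the rule of keeping only over-represented words are assumptions for illustration, not details recovered from the source.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of a word.

    a: frequency of the word in the target category
    b: frequency of the word in the reference corpus
    c: total number of tokens in the target category
    d: total number of tokens in the reference corpus
    """
    e1 = c * (a + b) / (c + d)   # expected frequency in the target
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def top_keywords(target_counts, reference_counts, n=20):
    """The n words most strongly over-represented in the target category."""
    c = sum(target_counts.values())
    d = sum(reference_counts.values())
    scored = []
    for word, a in target_counts.items():
        b = reference_counts.get(word, 0)
        if a / c > b / d:        # keep only over-represented words
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:n]]


sports = {"goal": 10, "the": 50, "x": 40}
rest = {"the": 50, "law": 10, "y": 40}
print(top_keywords(sports, rest, n=2))
```

A word with the same relative frequency in both corpora scores zero, so function words such as "the" in the toy example drop out of the list.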
5.
(cross-validation) U leave-one-out.
1 .
.
.
(macro-averaging).
F1
, .
:
$$ F_1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (1) $$
(P = precision, R = recall)
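The evaluation scheme named in this section (leave-one-out cross-validation scored with macro-averaged F1) can be sketched as follows. The code uses the standard definitions of precision, recall and F1. The callback `train_and_predict` is a hypothetical placeholder for any of the three classifiers and is not part of the source.

```python
def f1_per_class(y_true, y_pred):
    """F1 for every label, from per-class precision and recall."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 values (macro-averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def leave_one_out(samples, labels, train_and_predict):
    """Predict each sample from a model trained on all the others.

    train_and_predict(train_X, train_y, test_x) -> label (hypothetical).
    """
    predictions = []
    for i in range(len(samples)):
        train_X = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_X, train_y, samples[i]))
    return predictions


print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Macro-averaging gives every category the same weight regardless of how many texts it contains, which matters when category sizes differ between experimental conditions.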
. :
: 6
. 3 ( , ,
)
:
150 .
75 .
:
.
:
o
o ( M + )
o +
o +
o
o
o
2 .
.
3 6
3 6 12
6 .
6.
Multiple Linear Regression, Discriminant Function Analysis (DFA), Hierarchical Cluster Analysis.
6.1
(I )
.
,
.
.
Ward
(El-Hamdouchi & Willet 1986).
, ,
(Pearson Correlation) .
,
,
.
.
.
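Hierarchical clustering with Ward's method, cited above, can be illustrated with a naive agglomerative sketch. It merges, at each step, the two clusters whose union gives the smallest increase in within-cluster sum of squares. The sketch assumes squared Euclidean distance between feature vectors, which is what Ward's criterion is defined on; it does not reproduce the Pearson correlation measure mentioned in the text, and the function name is hypothetical.

```python
def ward_clustering(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion.

    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    centroids = [list(p) for p in points]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        na, nb = len(clusters[a]), len(clusters[b])
        d2 = sum((x - y) ** 2 for x, y in zip(centroids[a], centroids[b]))
        return na * nb / (na + nb) * d2

    while len(clusters) > n_clusters:
        pairs = ((merge_cost(a, b), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        _, a, b = min(pairs)
        na, nb = len(clusters[a]), len(clusters[b])
        centroids[a] = [(na * x + nb * y) / (na + nb)
                        for x, y in zip(centroids[a], centroids[b])]
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b], centroids[b]
    return clusters


texts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(ward_clustering(texts, 3))
```

Because clustering is unsupervised, the resulting clusters still have to be matched against the known categories before a score such as F1 can be computed.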
,
( ), ( )
(
1,
2):
1: ,
2:
[Figures 1 and 2: F1 charts; series: 6 and 3; groups: 150 and 75]
6.2
( . Biebericher et al. 1988, Fuhr et al. 1991,
Yang & Chute 1994).
( ) ( ).
, ,
. « »
. « »
. « »
2:
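$$ y = a + w_1 X_1 + w_2 X_2 + \dots + w_n X_n \qquad (2) $$
where:
y = dependent variable
a = constant
w = regression weights
X = independent variable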
Stepwise, Enter . .
Enter .
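To make the regression equation concrete, the sketch below estimates the constant a and the weights w_1 ... w_n by ordinary least squares, entering all predictors at once (no stepwise selection). It solves the normal equations with a small Gauss-Jordan routine so that it needs no external library. The toy data and function names are assumptions, and how predicted values are mapped to categories in the study is not recoverable from the source.

```python
def solve_linear_system(A, b):
    """Gauss-Jordan elimination with partial pivoting (A must be square
    and non-singular)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_linear_regression(X, y):
    """Least-squares estimates [a, w1, ..., wn] via the normal equations."""
    rows = [[1.0] + list(x) for x in X]   # leading 1 carries the constant a
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict(coefs, x):
    """y = a + w1*x1 + ... + wn*xn."""
    return coefs[0] + sum(w * v for w, v in zip(coefs[1:], x))


# Toy data generated from y = 1 + 2*x1 + 3*x2 (illustrative only).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coefs = fit_linear_regression(X, y)
print(coefs, predict(coefs, (3, 2)))
```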
,
( ) (
3):
3:
,
6.3
a priori.
« »
.
[Figure 3: F1 charts; groups: 150 and 75; series: 6 and 3]
k-1
, k .
( )
3:
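$$ D_{jk} = a + w_1 X_{1k} + w_2 X_{2k} + \dots + w_n X_{nk} \qquad (3) $$
where:
D_{jk} = the value of discriminant function j for case k
a = constant
w_i = the weight of variable i
X_{ik} = the value of variable i for case k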
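With k categories the analysis yields k-1 discriminant functions, so the two-category case has a single function of the form given in equation 3. The sketch below computes it as Fisher's linear discriminant: the weights solve S w = m1 - m0, where S is the pooled within-group scatter matrix, and the constant a places the decision boundary midway between the two group means. Equal prior probabilities, the toy data and all function names are assumptions made for illustration.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]


def pooled_scatter(groups):
    """Sum over groups of the within-group scatter matrices."""
    dim = len(groups[0][0])
    S = [[0.0] * dim for _ in range(dim)]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [x - mu for x, mu in zip(r, m)]
            for i in range(dim):
                for j in range(dim):
                    S[i][j] += d[i] * d[j]
    return S


def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_discriminant(group0, group1):
    """Return (a, w) such that D(x) = a + sum(w_i * x_i) is positive for
    cases closer to group1 and negative for cases closer to group0."""
    m0, m1 = mean_vector(group0), mean_vector(group1)
    w = solve(pooled_scatter([group0, group1]),
              [q - p for p, q in zip(m0, m1)])
    midpoint = [(p + q) / 2 for p, q in zip(m0, m1)]
    a = -sum(wi * xi for wi, xi in zip(w, midpoint))
    return a, w


def discriminant_score(a, w, x):
    return a + sum(wi * xi for wi, xi in zip(w, x))


g0 = [(1, 2), (2, 1), (2, 3), (3, 2)]
g1 = [(6, 7), (7, 6), (7, 8), (8, 7)]
a, w = fit_discriminant(g0, g1)
print(a, w, discriminant_score(a, w, (2, 2)))
```

Extending the sketch to more than two categories requires an eigen-decomposition of the between-group and within-group scatter matrices, which is omitted here.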
( . Karlgren & Cutting 1994)
.
( ), ( )
(
4):
4: , .
(
5):
5:
.
[Figure 4: F1 charts; groups: 150 and 75; series: 6 and 3]
[Figure 5: two scatter plots of the texts on discriminant Function 1 (horizontal axis) and Function 2 (vertical axis), with group centroids marked; MEDIA appears among the legend entries]
7.
6 :
6: .
.
,
92%.
.
7:
7:
[Figure: F1 chart, value axis 0 to 100]
,
.
.
[Figure: two F1 charts; groups: 150 and 75; series: 6 and 3]
Table 2:
No.   Variable   Wilks' Lambda        F   Sig.
10    %                  0.887   22.763   1.51746E-21
14    "-                 0.893   21.421   2.77415E-20
25    [...]              0.909   17.890   6.22179E-17
54    ,                  0.935   12.488   9.77125E-12
77    5                  0.946   10.256   1.4206E-09
89    [...]              0.960    7.537   6.05704E-07
93    :                  0.960    7.368   8.81204E-07
94    9                  0.961    7.259   1.12183E-06
99    10                 0.963    6.826   2.92779E-06
100   11                 0.965    6.544   5.45472E-06
106   ;                  0.966    6.228   1.09362E-05
107   6                  0.967    6.183   1.20659E-05
111   15                 0.967    6.098   1.45413E-05
112   [...]              0.967    6.069   1.55028E-05
116   Yule K             0.969    5.727   3.27666E-05
123   12                 0.972    5.200   0.00010324
127   13                 0.973    4.978   0.00016708
131   [...]              0.977    4.240   0.000815163
138   14                 0.978    4.063   0.001187014
140   4                  0.979    3.924   0.001591351
156   [...]              0.982    3.298   0.005865007
160   7                  0.983    3.120   0.008449015
163   STTR               0.983    3.076   0.009253816
165   [...]              0.983    3.072   0.00931835
170   [...]              0.985    2.811   0.015814403
175   [...]              0.986    2.533   0.027445062
180   [...]              0.987    2.362   0.038356071
186   "-                 0.988    2.083   0.065315221
191   1                  0.989    1.954   0.083157476
192   8                  0.989    1.945   0.084477584
193   3                  0.990    1.884   0.094539568
197   . .                0.991    1.548   0.172454647
202   2                  0.995    0.854   0.511480483
(Variable labels are shown as far as they are legible in the source; [...] marks a label that is not. The headings "No.", "Variable" and "Sig." are inferred.)

Baillie, D. W. 1974. Authorship attribution in Jacobean dramatic texts. Computers in the humanities, ed. by J. L. Mitchell, Edinburgh: Edinburgh University Press.
Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Biebericher, Peter, Fuhr, Norbert, Lustig, Gerhard, Schwantner, Michael, Knorz, Gerhard. 1988. The automatic indexing system AIR/PHYS - from research to application. Proceedings of the 11th International Conference on Research and Development in Information Retrieval (SIGIR 88), 333-342.
Buenaga, Manuel, Gómez, Jose, & Diaz, Belen. 1997. Using WordNet to complement training information in text categorization. Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing, ed. by R. Mitkov, N. Nicolov, and N. Nikolov, 202-207. Tzigov Chark, BL.
Burrows, John F. & Craig, Hugh. 1994. Lyrical drama and the Turbid Mountebanks: Styles of dialogue in romantic and renaissance tragedy. Computers and the Humanities 28. 63-86.
Burrows, John F. 1992. Not unless you ask nicely: the interpretive nexus between analysis and information. Literary and Linguistic Computing 7. 91-109.
Dixon, Peter & Mannion, David. 1993. Goldsmith's periodical essays: a statistical analysis of eleven doubtful cases. Literary and Linguistic Computing 8. 1-19.
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19. 61-74.
El-Hamdouchi, A. & Willet, Peter. 1986. Hierarchic document classification using Ward's clustering method. Proceedings of the 9th annual international ACM conference on research and development in information retrieval (SIGIR 86), 149-156.
Forsyth, Richard S. & Holmes, David I. 1996. Feature-finding for text classification. Literary and Linguistic Computing 11. 163-174.
Fuhr, Norbert, Hartmann, Stephan, Lustig, G., Schwantner, Michael, Tzeras, Konstadinos. 1991. AIR/X - a rule-based multistage indexing system for large subject fields. Proceedings of the RIAO'91, 606-623.
Granger, Sylviane & Rayson, Paul. 1998. Automatic profiling of learner texts. Learner English on Computer, ed. by Sylviane Granger, 119-131. Longman: London and New York.
Gómez, Jose & de Buenaga, Manuel. 1997. Integrating a lexical database and a training collection for text categorization. Proceedings of the ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP. 39-44.
Hoch, Rainer. 1994. Using IR techniques for text classification in document analysis. 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 94), 31-40.
Hofland, Knut & Johansson, Stig. 1982. Word frequencies in British and American English. The Norwegian Computing Centre for the Humanities: Bergen, Norway.
Iakovou, Maria, Markopoulos, George, Mikros, George. 2003. [...] 6 [...], 18-21 [...] 2003.
Iakovou, Maria, Markopoulos, George, Mikros, George. 2002. [...] 2 [...].
Junker, Markus & Abecker, Andreas. 1997. Exploiting thesaurus knowledge in rule induction for text classification. Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing, ed. by R. Mitkov, N. Nicolov, and N. Nikolov, 202-207. Tzigov Chark, BL.
Karlgren, Jussi. 1999. Stylistic experiments in information retrieval. Natural Language Information Retrieval, ed. by T. Strzalkowski, 147-166. Kluwer: Dordrecht.
Karlgren, Jussi & Cutting, Douglass. 1994. Recognizing text genres with simple metrics using discriminant analysis. Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), volume II, 1071-1075. Kyoto, Japan.
Koller, Daphne & Sahami, Mehran. 1997. Hierarchically classifying documents using very few words. International Conference on Machine Learning, Nashville, volume 14. 170-178. Morgan Kaufmann: San Francisco.
Ledger, Gerard & Merriam, Thomas V. N. 1994. Shakespeare, Fletcher, and the Two Noble Kinsmen. Literary and Linguistic Computing 9. 235-248.
Lewis, David D. & Ringuette, Marc. 1994. A comparison of two learning algorithms for text categorization. Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR 94), 81-93.
Manning, Christopher D. & Schutze, Hinrich. 1999. Foundations of statistical natural language processing. Cambridge, Massachusetts: MIT Press.
Mikros, George & Carayannis, George. 2000. Modern Greek Corpus Taxonomy. Proceedings of the 2nd International Conference on Language Resources and Evaluation, volume I. 129-134. Athens, Greece.
Mikros, George. 2002. Quantitative parameters in corpus design: Estimating the optimum text-size in Modern Greek language. Proceedings of the 3rd International Conference on Language Resources and Evaluation, volume III. 834-838. Gran Canaria, Spain.
Rayson, Paul, Leech, Geoffrey, & Hodges, Mary. 1997. Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics 2. 133-152.
Rudman, Joseph. 1998. The state of authorship attribution studies: some problems and solutions. Computers and the Humanities 31. 351-365.
Sahami, Mehran, Dumais, Susan, Heckerman, David, & Horvitz, Eric. 1998. A Bayesian approach to filtering junk e-mail. Learning for Text Categorization: Papers from the 1998 Workshop. AAAI Tech. Rep. WS-98-05. 55-62.
Schutze, Hinrich, Hull, David A., Pedersen, Jan O. 1995. A comparison of classifiers and document representations for the routing problem. 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 95), 229-237.
Scott, Sam & Matwin, Stan. 1999. Feature engineering for text classification. Proceedings of ICML-99, 16th International Conference on Machine Learning, ed. by I. Bratko and S. Dzeroski. 379-388. Morgan Kaufmann Publishers: San Francisco, US.
Stamatatos, Efstathios, Fakotakis, Nikolaos, Kokkinakis, George. 1999. Automatic authorship attribution. Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL 99), July 8-12, 1999. 158-164. Bergen, Norway.
Stamatatos, Efstathios, Fakotakis, Nikolaos, Kokkinakis, George. 2000. Automatic Text Categorization in Terms of Genre and Author. Computational Linguistics 26. 471-495.
Stamatatos, Efstathios, Fakotakis, Nikolaos, Kokkinakis, George. 2001. Computer-based authorship attribution without lexical measures. Computers and the Humanities 35. 193-214.
Tambouratzis, George, Markantonatou, Stella, Xairetakis, Nikolaos, Carayannis, George. 2000. Automatic style categorization of corpora in the Greek language. Proceedings of the 2nd International Conference on Language Resources and Evaluation, volume I. 135-140. Athens, Greece.
Tweedie, Fiona & Baayen, Harald R. 1998. How variable a constant can be? Measures of lexical richness in perspective. Computers and the Humanities 32. 323-352.
Wiener, Erik, Pedersen, Jan O., Weigend, Andreas S. 1993. A neural network approach to topic spotting. Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, 22-34.
Yang, Yiming & Chute, Christopher G. 1994. An example based mapping method for text categorization and retrieval. ACM Transactions on Information Systems (TOIS) 12. 252-277.
Zaiane, Osmar R. & Antonie, Maria-Luiza. 2002. Classifying text documents by associating terms with text categories. Proceedings of the Thirteenth Australasian Database Conference (ADC2002), Melbourne, Australia. Conferences in Research and Practice in Information Technology, volume 5, ed. by Zhou, X., 215-222.