STYLOMETRY IN IR SYSTEMS
Leyla BİLGE
Büşra ÇELİKKAYA
Kardelen HATUN
OUTLINE
  Stylistics and stylometry
  Applications of stylometry
  History of stylometric research
  Stylistic features
  Recent studies
  Our approach
  Conclusion
4/20/2007  Stylometry in IR Systems
STYLISTICS
The theoretical framework for stylistics combines:
  Halliday’s Language Theory
  Sander’s Theories of Stylistics
Halliday says: “A text is what is meant, selected from the total set of options that constitute what can be meant.”
Sander says: “Style is the result of choices made by an author from a range of possibilities offered by the language system.”
STYLISTICS
Stylistic variation depends on:
  Author preferences and competence
  Familiarity
  Genre
  Communicative context
  Expected characteristics of the intended audience
Modeling, representing, and utilizing this variation is the business of stylistic analysis.
STYLOMETRY
Stylometry is the application of the study of linguistic style.
Style refers to the linguistic choices of authors that persist across their works, independently of content.
The aim is to describe a text from a rather formal perspective, using measures such as:
  Number of words
  Number of repetitions
  Sentence length
APPLICATIONS OF STYLOMETRY
Authorship attribution
  Forensic author identification
  Finding the author of an anonymous text
Observation of the “characteristics” of a particular author
Organization and retrieval of documents based on their writing style
Systems for genre-based information retrieval
HISTORY OF STYLOMETRY
Stylometry grew out of analyzing texts for evidence of authenticity and authorial identity.
Modern practice in the discipline holds that distinctive patterns of language use can identify authors.
With the development of computers and the growth of their capacity:
  Large data sets can be analyzed
  New methods can be developed and easily applied
HISTORY OF STYLOMETRY, CONT’D
Current research uses techniques based on term frequency counts:
  Frequency data are collected for common terms
  These data are then analyzed using a range of fairly standard statistical techniques
However, these techniques cannot yet guarantee quality output (e.g., Ulysses).
METHODOLOGY
Use a subset of structural and stylometric features on a set of authors, without consideration of author characteristics.
Currently, authorship attribution studies are dominated by the use of lexical measures.
Commonly used statistics:
  Word length
  Syllables per word
  Sentence length
  Sentence count
  Text length in words
  Use of punctuation marks
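Several of the statistics listed above can be computed with a few lines of code. The following is a minimal sketch, not the presenters' implementation. The tokenization rules (sentences end at ".", "!" or "?"; words are runs of letters) and the function name basic_statistics are assumptions made for illustration, and syllable counting is left out because it needs language-specific rules.

```python
import re

def basic_statistics(text):
    """Surface statistics of a text under a simple, assumed tokenization."""
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters, optionally with one apostrophe.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    punctuation = re.findall(r"[.,;:!?]", text)
    n_words = len(words)
    n_sentences = len(sentences)
    return {
        "text_length_in_words": n_words,
        "sentence_count": n_sentences,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        "punctuation_per_word": len(punctuation) / n_words if n_words else 0.0,
    }

stats = basic_statistics("Style persists. Content changes, but style stays!")
print(stats)
```

Normalizing the punctuation count by the number of words keeps the value comparable between texts of different lengths.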
STYLISTIC FEATURES
Lexically-based methods:
  Vocabulary richness of the author
  Frequencies of occurrence of individual words
Vocabulary diversity:
  Type-token ratio V/N
    V: size of the vocabulary of the sample text
    N: number of tokens
  Hapax legomena: how many words occur exactly once
Frequencies of occurrence:
  Function words
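The two vocabulary-diversity measures can be illustrated on a toy sentence. This is a sketch under an assumed tokenization (lowercased runs of letters); the helper names tokenize, type_token_ratio, and hapax_legomena are illustrative, not taken from the slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Assumed tokenization: lowercase the text and keep runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(tokens):
    """V/N: number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def hapax_legomena(tokens):
    """Words that occur exactly once in the sample, in alphabetical order."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

tokens = tokenize("the cat saw the dog and the dog saw a bird")
ttr = type_token_ratio(tokens)   # V = 7 types, N = 11 tokens
hapaxes = hapax_legomena(tokens)
print(ttr, hapaxes)
```

Both measures depend on N: longer texts repeat more words, so V/N tends to fall as the sample grows.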
STYLISTIC FEATURES
Problems:
  Text-length dependent
  Unstable for short texts
  Function word set requires manual effort
  Specific to the group of authors considered
Solution:
  Use the set of most frequent words
  Both content words and function words
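One way to read the proposed solution is to take the k most frequent words of the whole corpus as the feature set, then describe each text by the relative frequency of those words. The sketch below assumes pre-tokenized texts; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def most_frequent_words(token_lists, k):
    """Return the k most frequent words over all texts in the corpus."""
    total = Counter()
    for tokens in token_lists:
        total.update(tokens)
    return [word for word, _ in total.most_common(k)]

def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[word] / n if n else 0.0 for word in vocabulary]

# Toy corpus, invented for illustration only.
corpus = [
    "the sea and the sky".split(),
    "the old man and the sea".split(),
]
vocab = most_frequent_words(corpus, 3)
profile = relative_frequencies(corpus[0], vocab)
print(vocab, profile)
```

Because the word list is derived from the corpus itself, no manual selection of function words is needed, and content words are included whenever they are frequent enough.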
RELATED STUDIES
Analysis of the text by a natural language processing tool:
  Use an existing NLP tool
  Sentence and Chunk Boundaries Detector (SCBD)
Use sub-word units such as character n-grams instead of word frequencies:
  Character sequences of length n
  The most frequent n-grams provide information about the author’s stylistic choices at the lexical, syntactic, and structural levels
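Character n-grams are simple to extract: slide a window of length n over the raw text, spaces and punctuation included. A minimal sketch follows; the function names char_ngrams and top_ngrams are illustrative.

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character sequences of length n, in text order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_ngrams(text, n, k):
    """The k most frequent character n-grams with their counts."""
    return Counter(char_ngrams(text, n)).most_common(k)

grams = char_ngrams("to be or", 3)
top = top_ngrams("ababab", 2, 1)
print(grams)
print(top)
```

No tokenizer or language-specific resource is needed, which is one reason the approach carries over easily between languages.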
WORD BASED FEATURES
Bag-of-words
  Apply stemming and a stopword list
Function words
  Content-free
POS annotation
  Feature selection
  Semantic disambiguation
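A function-word profile can be sketched as the relative frequency of each word from a fixed list. The list below is a deliberately tiny, hypothetical example; real studies use much longer lists chosen for the language and the authors at hand.

```python
import re
from collections import Counter

# Hypothetical, deliberately small function-word list (assumption).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "but", "that")

def function_word_profile(text, function_words=FUNCTION_WORDS):
    """Relative frequency of each listed function word in the text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)
    return {word: (counts[word] / n if n else 0.0) for word in function_words}

profile = function_word_profile("The style of the author, and not the topic.")
print(profile)
```

Function words say little about the topic of a text, which is why they are described as content-free.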
LINGUISTIC CONSTITUENTS
The structure of natural language sentences shows that word occurrences follow a specific order.
Words are grouped into syntactic units called “constituents”.
Use word relationships by extracting constituents for feature construction:
  Subdivide the document into sentences
  Construct a syntax tree for each sentence
SYNTAX TREE
Use syntax tree representations of different authors’ sentences as features.
OUR APPROACH
Use stylometry to analyze the following:
  Texts translated by the same translator but written by different authors
  Texts translated by different translators but written by the same author
PROPOSED STEPS
1. Feature Extraction
  Determine which features best represent the style
2. Training
  Train the classifier with a training set
  Many methods are available (SVM, Bayesian, …)
3. Recognition and Classification of texts
4. Analyzing the results of classification
1. FEATURE EXTRACTION
The stylometric features of a text can be:
  Word length
  Sentence length
  Paragraph length
  Character n-grams
  Function words
Feature choices strongly affect classification results.
The result is an n-dimensional feature vector V = {v1, v2, v3, …, vn}.
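As an illustration of how such a vector V might be assembled, the sketch below concatenates mean word length, mean sentence length, and the relative frequencies of a few function words. The choice and order of features, the tokenization rules, and the three-word list are assumptions, not the presenters' feature set.

```python
import re

# Hypothetical function-word list used as the last three dimensions (assumption).
FUNCTION_WORDS = ("the", "of", "and")

def feature_vector(text):
    """Build V = [mean word length, mean sentence length, function-word frequencies...]."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    if n_words == 0 or not sentences:
        return [0.0] * (2 + len(FUNCTION_WORDS))
    vector = [
        sum(len(w) for w in words) / n_words,  # v1: mean word length
        n_words / len(sentences),              # v2: mean sentence length in words
    ]
    # v3..vn: relative frequency of each function word
    vector += [words.count(w) / n_words for w in FUNCTION_WORDS]
    return vector

V = feature_vector("The sea is calm. The sky and the sea meet.")
print(V)
```

Every text must be mapped to a vector with the same dimensions in the same order, otherwise distances between vectors are meaningless.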
2. TRAINING
Choose training data for every class
  May be randomly selected texts
  May be manually picked
Determine the parameters corresponding to each class
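The slides do not say which parameters are estimated per class. As one simple possibility, the sketch below takes the parameters of a class to be the mean feature vector (centroid) of its training texts. This is an assumed stand-in for whichever classifier is chosen (SVM, Bayesian, or another); the labels and numbers are invented for illustration.

```python
from collections import defaultdict

def train_centroids(samples):
    """samples: list of (label, feature_vector) pairs. Returns {label: mean vector}."""
    grouped = defaultdict(list)
    for label, vector in samples:
        grouped[label].append(vector)
    return {
        # zip(*vectors) iterates over the vectors dimension by dimension
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

# Toy training set with hypothetical labels and feature values.
training_set = [
    ("author_A", [4.0, 10.0]),
    ("author_A", [5.0, 12.0]),
    ("author_B", [6.0, 20.0]),
]
centroids = train_centroids(training_set)
print(centroids)
```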
3. RECOGNITION AND CLASSIFICATION
Use the parameters obtained from the training data
Compute the distance
Label the data
Classify the data
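The steps above can be sketched as a nearest-centroid rule: compute the distance from the new text's feature vector to each class's parameters and assign the label of the closest class. Euclidean distance and the per-class mean vectors are assumptions here, since the slides name neither the distance measure nor the classifier; the values are invented for illustration.

```python
import math

def classify(vector, centroids):
    """Return the label of the nearest centroid and all computed distances."""
    distances = {
        label: math.dist(vector, centroid)  # Euclidean distance (Python 3.8+)
        for label, centroid in centroids.items()
    }
    return min(distances, key=distances.get), distances

# Hypothetical per-class parameters, as they might come out of training.
centroids = {"author_A": [4.5, 11.0], "author_B": [6.0, 20.0]}
label, distances = classify([5.0, 13.0], centroids)
print(label, distances)
```

In practice the features are usually rescaled first, because a dimension with a large numeric range would otherwise dominate the distance.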
RESULTS OF THE CLASSIFICATION
We will have two sets of results:
  The original texts, classified by author
  The translated texts, classified with no prior class information
These results will give us a clue about the two issues we stated at the beginning.
Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators.
  Check whether these translations are clustered in one class or in separate classes
OUR AIM
With the right classification we will be able to determine:
  Whether stylometric analysis can find an author across two different languages
  Whether translations carry more of their translators’ style or still retain their authors’ style
“…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”
CONCLUSION
Today there are many useful applications of stylometry:
  Authorship attribution, plagiarism detection, genre-based information retrieval
Which features are valuable for analysis is still an important open question.
We aim to find the stylistic connection between a text and its translation.
REFERENCES
Computational Stylistics in Forensic Author Identification, Carole E. Chaski
Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz
Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis
Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos
Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum